Kalbotyra ISSN 1392-1517 eISSN 2029-8315
2023 (76) 54–65 DOI: https://doi.org/10.15388/Kalbotyra.2023.76.4
Ruslan Isaev
Computer Science department
Ala-Too International University
Ankara Street 1/8, Tunguch
720048 Bishkek, Kyrgyz Republic
E-mail: ruslan.isaev@alatoo.edu.kg
ORCID iD: https://orcid.org/0000-0003-4426-8837
Gulzada Esenalieva
Computer Science department
Ala-Too International University
Ankara Street 1/8, Tunguch
720048 Bishkek, Kyrgyz Republic
E-mail: gulzada.esenalieva@alatoo.edu.kg
ORCID iD: https://orcid.org/0009-0000-9135-1671
Ermek Doszhanov
Computer Science department
Ala-Too International University
Ankara Street 1/8, Tunguch
720048 Bishkek, Kyrgyz Republic
E-mail: ermek.doszhanov@alatoo.edu.kg
ORCID iD: https://orcid.org/0009-0002-4939-5683
Abstract. The concept of stop words, introduced by H. P. Luhn in the mid-20th century, plays a major role in today’s NLP practice. Stop word lists are used to reduce noise in text data, remove uninformative words, speed up text processing, and minimize the amount of memory required to store data.
The Kyrgyz language is an agglutinative Turkic language for which no scientific study of stop words has previously been published in English. In our study, we combined frequency analysis with rule-based linguistic analysis. First, we counted word frequencies, set a frequency threshold, and removed the words below the threshold, which gave us a list of the most frequently used words. We then reduced this list by excluding all words that do not belong to the category of function words of the Kyrgyz language. The result is a list of 50 words that can be considered stop words in the Kyrgyz language. In our analysis, we used a single corpus of sentences collected and published as an open-source project by one of the local broadcasters.
Keywords: stop words, Kyrgyz language, frequency analysis, Turkic stop words, NLP
_______
Submitted: 06/09/2023. Accepted: 23/11/2023
Copyright © 2023 Ruslan Isaev, Gulzada Esenalieva, Ermek Doszhanov. Published by Vilnius University Press
This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Introduction
The concept of stop words was first introduced in 1958 by Hans Peter Luhn, a pioneer of computational linguistics. His concept of ʻstop listsʼ treated such words as noisy data that can be ignored (neither indexed nor searchable) by the machine. The concept plays a major role in Natural Language Processing (NLP). Removing stop words helps to strip noisy or irrelevant information from the data, focus on the more meaningful or informative words in the text, shorten processing time, and reduce the storage needed for the data. It can increase the accuracy and efficiency of NLP techniques such as Topic Modelling, Sentiment Analysis, Information Retrieval, and Feature Selection (Ladani et al. 2020, 466).
Stop words are words that are commonly used in a language but are generally considered to be of little value in text analysis or NLP tasks. Such words are usually removed from texts before any analysis or processing takes place. Examples of English stop words are the, and, or, a, an, of, to, in, and is (Ladani et al. 2020, 466).
The purpose of this study is to provide a list of stop words adapted for use in the Kyrgyz language. Since the concept of stop words gives a lot of freedom in interpreting and working with them, the list of stop words presented in this article is not universal.
1 Literature review
1.1 Rules and strategies for including words in stop lists
The usual practice of removing stop words before training a model is a matter of some debate. For instance, some works propose removing stop words after model inference, that is, after the model has been trained (Schofield et al. 2017, 435), or even training models without removing stop words at all (Cordeiro et al. 2004, 137).
The rules for treating words as stop words derive from the general definition of stop words: words that are commonly used but do not carry much meaning in a sentence. The definition is broad, but some general characteristics can be taken as properties of stop words (Sadeghi et al. 2014, 479):
In addition, numbers (1, 2, 3, etc.), emojis, and symbols such as @, #, &, %, and * have to be considered stop words. Some texts may contain foreign words, which can also be considered stop words (Al-Shargabi et al. 2011, 2). In this study, we automatically exclude these symbols from the stop word list.
It is worth mentioning that there is no definitive list of stop words for any language, and the decision to treat a word as a stop word may vary depending on the specific application and context. It is therefore better to follow a general strategy: identify words that do not provide much meaningful information in the context of a given corpus and can be safely removed without affecting the accuracy of the analysis.
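The handling of numbers and symbols can be illustrated with a minimal sketch. The helper name is_non_lexical and the token list are hypothetical, and the rule used here (a token with no letters is non-lexical) is an assumption made for illustration; it does not detect foreign words.

```python
def is_non_lexical(token: str) -> bool:
    """Return True if the token contains no letters at all
    (for example numbers, punctuation, symbols, or emojis)."""
    return not any(ch.isalpha() for ch in token)


# Hypothetical token list mixing Kyrgyz words with numbers and symbols.
tokens = ["менен", "1", "@", "10", "%", "жана", "#"]

# Keep only tokens that contain at least one letter.
lexical_tokens = [t for t in tokens if not is_non_lexical(t)]
```

Whether such tokens are dropped before counting or filtered out of the frequency list afterwards is a design choice; the sketch only shows the test itself.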
1.2 Stop words’ identification techniques
Researchers today propose various sophisticated methods for stop word identification: clustering algorithms in machine learning, CF-IDF, the Mutual Information (MI) method, methods based on Zipf’s Law, and TRBS (Kaur et al. 2018, 208). However, even a simple frequency analysis of word occurrences in a corpus gives good results for tasks such as Information Retrieval: removing the 10–20 most frequent stop words can reduce the number of tokens to be processed by 20–30% (Sadeghi et al. 2014, 476).
Along with the above methods, some authors propose combining different methods. For example, Zou et al. combine a statistical model (frequency analysis) with an information model that treats each token (candidate stop word) as a signal and measures how much information, or entropy, the signal carries (Zou et al. 2006, 1012–1013).
One important note is that there is no universal approach to identifying stop words in a corpus. The best approach depends on the specific characteristics of the corpus and the language being analyzed. It is often necessary to experiment with different techniques to find the most effective one for a particular corpus.
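One way to make the entropy idea concrete is sketched below. It computes the Shannon entropy of a word's distribution across documents, so a word spread evenly over many documents scores high and is a stronger stop word candidate. This is an illustrative formulation under our own assumptions, not necessarily the exact model of Zou et al.; the function name and the toy documents are hypothetical.

```python
import math


def document_entropy(word, documents):
    """Shannon entropy (in bits) of a word's occurrences across documents.

    Each document is a list of tokens. A word that occurs equally often in
    every document has maximal entropy; a word confined to one document has
    entropy 0.
    """
    counts = [doc.count(word) for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)


# Toy documents (hypothetical, not taken from any corpus).
docs = [
    ["жана", "сот"],
    ["жана", "мыйзам"],
    ["жана", "мыйзам"],
    ["жана"],
]

h_common = document_entropy("жана", docs)  # occurs once in each document
h_rare = document_entropy("сот", docs)     # occurs in a single document
```

In a combined approach, such an entropy score could be used alongside raw frequency to rank candidates.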
1.3 Stop words in Turkic languages
Turkic languages are a group of about thirty agglutinative languages spoken by more than 160 million people. The family includes Turkish (72 mln), Uzbek (24 mln), Azerbaijani (23 mln), Kazakh (13 mln), Uyghur (10.5 mln), Tatar (5.4 mln), Turkmen (5 mln), Kyrgyz (4.4 mln), and other languages. Kyrgyz language speakers live in Kyrgyzstan, Uzbekistan, and China (Xinjiang) (Turkic Languages).
Sets of stop words can vary across the dialects and regions of Turkic languages. Additionally, Turkic languages differ in how they are written: Kazakh and Kyrgyz use the Cyrillic alphabet, for example, while Turkish uses the Latin alphabet (Turkic Languages). Several tools and libraries help to work with stop words of Turkic languages (the stop words-tr library for Turkish, the stop words-uz library for Uzbek, etc.).
Most existing models and methods for identifying stop words are suited to European language families. However, they cannot fully solve the problem of stop word identification for the agglutinative Turkic language family. In Turkic languages, many stop words are ʻmaskedʼ and require different techniques, such as collocation and bigram methods, to be added to stop lists (Madatov et al. 2022, 1).
Additionally, libraries such as NLTK and SpaCy provide options for creating custom lists of stop words. It is important to note that lists of stop words are flexible and may vary depending on the specific needs of a particular application or project.
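A custom stop list can also be applied without any library, as in the following minimal sketch. The five words in custom_stop_words are taken from the top of the frequency list in the appendix purely for illustration; the set, the function name, and the token list are assumptions, not a published stop list.

```python
# Illustrative custom stop list (an assumption for this sketch).
custom_stop_words = {"менен", "жана", "бул", "үчүн", "бирок"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop list, preserving order."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical token list.
tokens = ["президент", "жана", "депутат", "бул", "мыйзам", "үчүн"]
filtered = remove_stop_words(tokens, custom_stop_words)
```

Keeping the list as a plain set makes it easy to extend or shrink for a particular application or project.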
2 Methodology
In this research, we use a hybrid approach involving two steps: 1) a frequency-based approach: identification of the most common words in a corpus of Kyrgyz sentences with a threshold of 4000 occurrences; 2) a linguistic-based approach: identification of words by their grammatical function (POS identification), followed by their division into content and function words. Finally, we present the list of the most frequent function words as a stop word list for the Kyrgyz language.
2.1 Frequency-based approach
First, we conducted a frequency analysis to determine the distribution of words in Kyrgyz texts. The Kloop media corpus of Kyrgyz texts was tokenized and lowercased. Word frequencies were then computed using the FreqDist class of the NLTK library. Having obtained the 200 most common words, we set the frequency threshold at 4000 word occurrences in the corpus. This left 165 words to be analyzed in the next stage. The list of frequencies of Kyrgyz words can be found in the appendix to this article.
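The frequency step can be sketched as follows. To keep the example self-contained it uses collections.Counter from the standard library; NLTK's FreqDist is a subclass of Counter and offers the same most_common method. The regular-expression tokenizer, the function names, the toy sentences, and the choice of "greater than or equal to" for the threshold are assumptions made for illustration, not a description of the exact pipeline used in the study.

```python
import re
from collections import Counter


def word_frequencies(sentences):
    """Lowercase and tokenize each sentence, then count word occurrences."""
    counts = Counter()
    for sentence in sentences:
        # \w+ matches runs of Unicode letters and digits,
        # so Cyrillic words are kept whole.
        counts.update(re.findall(r"\w+", sentence.lower()))
    return counts


def above_threshold(counts, threshold):
    """Return (word, frequency) pairs with frequency >= threshold,
    most frequent first."""
    return [(w, n) for w, n in counts.most_common() if n >= threshold]


# Toy sentences (hypothetical, not from the Kloop corpus).
toy_sentences = ["Ал жана мен", "ал бул үчүн", "Бул жана ал"]

counts = word_frequencies(toy_sentences)
frequent = above_threshold(counts, threshold=2)
```

On the real corpus the threshold would be 4000 rather than 2.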
2.2 Linguistic-based approach
Secondly, we divided the list of most frequent words into two categories: content and function words. Function words are the class of words that play a mainly grammatical role in the text (closed-class words): prepositions, conjunctions, determiners, qualifiers, pronouns, interrogatives, numerals, etc. Content words are the class of semantically richer words (open-class words): nouns, verbs, adjectives, and adverbs (Bell et al. 2009, 92).
The list of words after removing all content words is provided below:
Two points should be noted: 1) the main reason why we ended up with a list of only 50 words is that we set a frequency threshold of 4000 occurrences of a word in the corpus; 2) the English translations of the words above can be seen as different parts of speech depending on the context in which they are used; the same is true of the Kyrgyz words: each is assigned one part of speech, but the actual POS depends on the context in which it is used.
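The filtering step can be sketched as below. The five words and their frequencies come from the appendix, but the part-of-speech labels in pos_of are our own illustrative assumptions (each word is given a single label, regardless of context), and the names FUNCTION_POS and keep_function_words are hypothetical.

```python
# Categories treated as function words in this sketch (following the
# closed-class categories named in Section 2.2, with "adposition" covering
# prepositions and postpositions).
FUNCTION_POS = {
    "adposition", "conjunction", "determiner", "qualifier",
    "pronoun", "interrogative", "numeral",
}

# (word, frequency) pairs from the appendix.
frequent_words = [
    ("менен", 71302),
    ("жана", 59732),
    ("ал", 56863),
    ("жылы", 10410),
    ("президент", 6989),
]

# Manually assigned POS labels (assumptions for illustration only).
pos_of = {
    "менен": "adposition",
    "жана": "conjunction",
    "ал": "pronoun",
    "жылы": "noun",
    "президент": "noun",
}


def keep_function_words(frequent, pos_labels):
    """Keep only the words whose assigned POS is a function-word category."""
    return [w for w, _ in frequent if pos_labels.get(w) in FUNCTION_POS]


stop_word_candidates = keep_function_words(frequent_words, pos_of)
```

Words without a label are dropped rather than guessed, which keeps the manual linguistic judgment explicit.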
3 Results
The list of 50 words presented in Section 2.2 (Linguistic-based approach) can be considered a list of Kyrgyz stop words based on the corpus of Kyrgyz sentences from Kloop media. There are other open-source lists of Kyrgyz stop words: 1) the list of Turkish stop words translated into Kyrgyz and taken as a set of Kyrgyz equivalents of stop words (both languages are Turkic); 2) the list of Kyrgyz stop words provided by the SpaCy library. While these lists can be used, we have not been able to find any scientific basis or research that justifies the inclusion of these particular words in the lists of Kyrgyz stop words.
4 Discussion and conclusion
As mentioned above, there are many sophisticated methods that help to define sets of stop words. There are also specific strategies recommended for identifying stop words in Turkic languages. This analysis was carried out on a single corpus of Kyrgyz sentences, collected by the “Kloop” media (2020), and so cannot reflect the entire set of Kyrgyz stop words. Other corpora, such as the Kyrgyz news corpus of the Leipzig Corpora Collection (2020), can be analyzed in the future as well.
Moreover, since Turkic languages, including Kyrgyz, have a rich and diverse system of affixes, it may be reasonable to widen the concept of ʻstop listsʼ to include some affixes (Tukeyev et al. 2020, 4–5). In this method, morphological analysis should be carried out before removing ʻstop wordsʼ and ʻstop affixesʼ.
The main purpose of this study is to fill a gap in research in this area. The approach proposed in this article is not ideal, and the list of Kyrgyz stop words proposed here can subsequently be refined with more sophisticated methods and models. Finally, it is recommended to define stop words for each particular corpus.
Data sources
The Kloop Media corpus of Kyrgyz texts https://github.com/kyrgyz-nlp/kloop-corpus
Transliteration tool (ISO 9 mode) https://www.translitteration.com/transliteration/en/kyrgyz/iso-9/
Kyrgyz stop words by SpaCy. https://github.com/explosion/spaCy/blob/master/spacy/lang/ky/stop_words.py
Leipzig Corpora Collection. Kyrgyz news corpus based on material from 2020. https://corpora.uni-leipzig.de?corpusId=kir_news_2020
Turkic Languages. http://www.languagesgulper.com/eng/Turkic.html
Turkish stop words translated into Kyrgyz. https://www.kaggle.com/datasets/crocuta/kyrgyz-language-stopwords
References
Al-Shargabi, Bassam, Waseem Al-Romimah & Fekry Olayah. 2011. A comparative study for Arabic text classification algorithms based on stop words elimination. Proceedings of the 2011 International Conference on Intelligent Semantic Web-Services and Applications (ISWSA’11). Association for Computing Machinery, New York, NY, USA, Article 11, 1–5. https://doi.org/10.1145/1980822.1980833
Bell, Alan, Jason Brenier, Michelle Gregory, Cynthia Girand & Dan Jurafsky. 2009. Predictability effects on durations of content and function words in conversational English. Journal of Memory and Language 60, 92–111. https://doi.org/10.1016/j.jml.2008.06.003
Cordeiro, João & Pavel Brazdil. 2004. Learning Text Extraction Rules, without Ignoring Stop Words. Pattern Recognition in Information Systems 2004, 128–138. Retrieved from https://www.di.ubi.pt/~jpaulo/publications/PRIS2004.pdf
Kaur, Jashanjot & Preetpal Buttar. 2018. A Systematic Review on stop word Removal Algorithms. International Journal on Future Revolution in Computer Science & Communication Engineering 4 (4), 207–210. Retrieved from http://www.ijfrcsce.org/index.php/ijfrcsce/article/view/1499/1499
Ladani, Dhara & Nikita Desai. 2020. Stop word Identification and Removal Techniques on TC and IR applications: A Survey. 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 466–472. https://doi.org/10.1109/ICACCS48705.2020.9074166
Madatov, Khabibulla, Shukurla Bekchanov & Jernej Vičič. 2022. Dataset of stop words extracted from Uzbek texts. Data in Brief 43, 108351, 1–7. https://doi.org/10.1016/j.dib.2022.108351
Sadeghi, Mohammad & Jesús Vegas. 2014. Automatic identification of light stop words for Persian information retrieval systems. Journal of Information Science 40, 476–487. https://doi.org/10.1177/0165551514530655
Schofield, Alexandra, Måns Magnusson & David Mimno. 2017. Pulling Out the Stops: Rethinking stop word Removal for Topic Models. Conference of the European Chapter of the Association for Computational Linguistics, 432–436. http://dx.doi.org/10.18653/v1/E17-2069
Tukeyev, Ualsher, Aidana Karibayeva & Zhandos Zhumanov. 2020. Morphological segmentation method for Turkic language neural machine translation. Cogent Engineering 7, 1856500, 1–15. https://doi.org/10.1080/23311916.2020.1856500
Zou, Feng, Fu Lee Wang, Xiaotie Deng, Song Han & Lu Wang. 2006. Automatic construction of Chinese stop word list. Proceedings of the 5th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 16–18, 2006, 1010–1015. Retrieved from https://www.cs.cityu.edu.hk/~lwang/research/hangzhou06.pdf
Appendix
Index | Kyr | Translit | Eng | Freq
1 | менен | menen | with | 71302
2 | жана | jana | and | 59732
3 | ал | al | she/he/it | 56863
4 | деп | dep | saying | 44065
5 | боюнча | boûnča | by/according | 40321
6 | бул | bul | this | 35208
7 | бир | bir | one | 33546
8 | үчүн | ùčùn | for | 31914
9 | эле | èle | just | 26122
10 | болгон | bolgon | was/has been | 24342
11 | да | da | too/also | 24247
12 | анын | anyn | her/his | 22929
13 | бирок | birok | but/however | 22528
14 | эмес | èmes | not | 18599
15 | болуп | bolup | being | 17835
16 | кийин | kijin | after | 16500
17 | жок | žok | no | 16133
18 | тууралуу | tuuraluu | about | 14763
19 | чейин | čejin | till/up to | 14720
20 | керек | kerek | needed | 14600
21 | бар | bar | have/is | 13815
22 | алып | alyp | taking | 13786
23 | алар | alar | they | 13069
24 | деди | dedi | said/told | 12865
25 | эки | èki | two | 12328
26 | эми | èmi | now | 12269
27 | кыргызстандын | kyrgyzstandyn | of Kyrgyzstan | 11739
28 | өз | ôz | -self/own | 11409
29 | алардын | alardyn | their | 11090
30 | жылдын | žyldyn | of a/the year | 10731
31 | аны | any | him/her/it | 10669
32 | жылы | žyly | year | 10410
33 | жатат | žatat | lies | 10379
34 | айтымында | ajtymynda | according to | 10266
35 | дагы | dagy | still/again | 10186
36 | деген | degen | said/told | 10127
37 | мен | men | I/me | 10005
38 | сөз | sôz | word | 9916
39 | турган | turgan | was standing | 9858
40 | кыргыз | kyrgyz | Kyrgyz | 9808
41 | биз | biz | we | 9796
42 | билдирди | bildirdi | has reported | 9694
43 | ош | oš | Osh | 9688
44 | гана | gana | only | 9645
45 | каршы | karšy | against | 9458
46 | мамлекеттик | mamlekettik | (belonging to) state | 9311
47 | башка | baška | other | 8684
48 | алган | algan | has taken/received | 8624
49 | жаткан | žatkan | was lying/lied | 8580
50 | болот | bolot | will be | 8481
51 | ошондой | ošondoj | the same | 8449
52 | анда | anda | then | 8313
53 | жаңы | žan̦y | new | 8278
54 | берген | bergen | has given/gave | 8223
55 | басма | basma | printed | 8166
56 | учурда | uçurda | at the moment | 8129
57 | же | je | or | 8076
58 | адам | adam | human | 7921
59 | эч | eç | no-(thing) | 7817
60 | үй | üy | house | 7639
61 | башкы | başkı | main | 7588
62 | кандай | kanday | how | 7426
63 | ага | aga | to her/him | 7258
64 | андан | andan | from her/him | 7220
65 | иш | iş | work | 7172
66 | айтып | aytıp | saying | 7017
67 | бишкек | bişkek | Bishkek | 6998
68 | президент | prezident | President | 6989
69 | kloop | kloop | kloop | 6961
70 | ошол | oşol | that | 6935
71 | жол | jol | way | 6891
72 | келген | kelgen | came | 6840
73 | болсо | bolso | if (will be) so | 6832
74 | мүмкүн | mümkün | maybe/perhaps | 6806
75 | кыргызстан | kırgızstan | Kyrgyzstan | 6743
76 | көп | köp | a lot/many | 6690
77 | ар | ar | each | 6612
78 | мыйзам | mıyzam | law | 6591
79 | экенин | ekenin | that’s | 6510
80 | сүрөт | süröt | picture | 6502
81 | биринчи | birinçi | first | 6424
82 | ата | ata | father | 6323
83 | мурдагы | murdagı | former | 6191
84 | башчысы | başçısı | chief/head | 6152
85 | карата | karata | in relation to | 6132
86 | кылмыш | kılmış | crime | 6116
87 | калган | kalgan | remained/other | 6097
88 | тарабынан | tarabınan | by/from | 6079
89 | жолу | jolu | way/time | 6064
90 | укук | ukuk | the right | 6040
91 | үч | üç | three | 5985
92 | алуу | aluu | getting/receiving | 5984
93 | жогорку | jogorku | high | 5938
94 | миң | miŋ | thousand | 5926
95 | чек | çek | cheque | 5884
96 | баш | baş | head | 5863
97 | кол | kol | hand | 5842
98 | буга | buga | to this | 5807
99 | кабыл | kabıl | accept | 5797
100 | маалымат | maalymat | information | 5797
101 | бири | biri | one (of) | 5776
102 | билдирген | bildirgen | reported | 5756
103 | айтты | ajtty | told/said | 5739
104 | 1 | 1 | 1 | 5715
105 | эл | èl | people/public | 5691
106 | катары | katary | as | 5653
107 | мындай | myndaj | this way/like this | 5546
108 | күнү | kùnù | day | 5479
109 | болду | boldu | was/happened | 5467
110 | сом | som | som | 5318
111 | дейт | dejt | says | 5266
112 | каза | kaza | died/death | 5246
113 | ылайык | ylajyk | according (to) | 5203
114 | талап | talap | require/demand | 5151
115 | орун | orun | place/seat | 5145
116 | улам | ulam | little by little | 5142
117 | кеткен | ketken | gone | 5107
118 | болчу | bolču | was | 4904
119 | өткөн | ôtkôn | the past/passed | 4902
120 | млн | mln | mln | 4889
121 | жөнүндө | žônùndô | about | 4878
122 | бардык | bardyk | (in) all | 4878
123 | бери | beri | since | 4877
124 | премьер | premʹer | Prime Minister | 4865
125 | келип | kelip | come/arrived | 4831
126 | атамбаев | atambaev | Atambayev | 4829
127 | коргоо | korgoo | defend/protection | 4806
128 | чыккан | čykkan | came out/was released | 4804
129 | 5 | 5 | 5 | 4789
130 | аралык | aralyk | distant | 4784
131 | эң | èn̦ | the most | 4782
132 | кыргызстанда | kyrgyzstanda | in Kyrgyzstan | 4771
133 | эгер | èger | if | 4755
134 | 10 | 10 | 10 | 4750
135 | акча | akča | money | 4704
136 | берүү | berùù | giving | 4689
137 | жатышат | žatyšat | (are) lying | 4681
138 | жыл | žyl | year | 4656
139 | ичинде | ičinde | inside | 4636
140 | сот | sot | the court | 4608
141 | ала | ala | take | 4571
142 | азыр | azyr | now | 4566
143 | милиция | miliciâ | the police | 4566
144 | нече | neče | few/several | 4550
145 | киши | kiši | man | 4495
146 | өзү | ôzù | her/him/it-self | 4481
147 | чыгып | čygyp | going/letting/coming out | 4478
148 | атайын | atajyn | specially/on purpose | 4459
149 | анткени | antkeni | because | 4434
150 | шайлоо | šajloo | election | 4433
151 | депутат | deputat | Deputy | 4401
152 | эмне | èmne | what | 4394
153 | жардам | žardam | aid/help/assistance | 4345
154 | мындан | myndan | from this | 4344
155 | иштеп | ištep | working | 4300
156 | биздин | bizdin | our | 4220
157 | шаардык | šaardyk | municipal/urban | 4220
158 | 2 | 2 | 2 | 4193
159 | улуттук | uluttuk | national | 4178
160 | сатып | satyp | buying | 4173
161 | кызматы | kyzmaty | service | 4161
162 | жакшы | žakšy | good | 4093
163 | айткан | ajtkan | told/said | 4090
164 | кызматкерлери | kyzmatkerleri | service workers/officers | 4040
165 | аркылуу | arkyluu | through | 4033