This paper develops and tests an optimized content extraction algorithm for learning documents in a distance learning system database. The algorithm uses natural language processing (NLP) methods, TF-IDF for word weighting, the vector space model (VSM) for information search, and the cosine method for similarity calculation. The tests covered the following: 1) parsing the word structure of the documents in the distance learning system database and of Cyrillic Mongolian documents, and forming new documents with an algorithm for identifying word stems; 2) testing optimized content extraction from text material based on e-test results (key word, correct answer, base form with affix, and new form built from the word stem without affix) in the distance learning system, and searching with key words selected automatically by a word extraction algorithm; 3) testing Boolean and probabilistic retrieval methods through an extended vector space retrieval method. This chapter presents a retrieval algorithm for document content extraction and proposes querying by word stem, independent of word position, based on the distinctive features of Cyrillic Mongolian documents.
Basic training material and data distinction:
Problems related to natural language arise in any kind of research work in Mongolia. New Mongolian, or Cyrillic Mongolian, is the official language of Mongolia. All levels of academic education are conducted entirely in Mongolian.
Cyrillic Mongolian is an agglutinative language, and its words follow rules of word formation and inflection. Word formation is based on attaching suffixes and affixes to a word stem [
Root morphemes: carry the main idea of the word and cannot be parsed further [
Affix morphemes: all morphemes that can be attached to root morphemes. Affix morphemes fall into two categories: generating suffixes and attaching affixes. A generating suffix changes a word and produces a new word. For example: “Хүн + лэг”, “Ном + хон”, “Үзэг + дэл” etc.
An attaching affix indicates the relationship between two words. For example: “Хүн + ээс”, “Ном + ын”, “Үзэг + ээр” etc.
Word stem: the main part of the word, which can be inflected by any affix [
Processing text documents computationally faces several challenges that stem from the features of Cyrillic Mongolian. Following the rules for attaching affixes, a word is described as “word stem + affix 1 + affix 2 +…+ affix N”.
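The “word stem + affix 1 + … + affix N” pattern can be illustrated with a small suffix-stripping sketch. It is a minimal illustration, not the stemming algorithm used in this work: the affix list is a hypothetical sample taken from the examples above, the function name `strip_affixes` is invented here, and vowel changes at the stem boundary are ignored.

```python
# Hypothetical sample of attaching affixes, taken from the examples in the text.
AFFIXES = ("ээс", "ээр", "ын")


def strip_affixes(word, min_stem=2):
    """Peel trailing affixes one at a time: stem + affix 1 + ... + affix N."""
    stem = word
    changed = True
    while changed:
        changed = False
        # Try longer affixes first so a short affix cannot shadow a longer one.
        for affix in sorted(AFFIXES, key=len, reverse=True):
            if stem.endswith(affix) and len(stem) - len(affix) >= min_stem:
                stem = stem[: -len(affix)]
                changed = True
                break
    return stem


print(strip_affixes("Хүнээс"))  # stem of "Хүн + ээс"
print(strip_affixes("Номын"))   # stem of "Ном + ын"
```

The `min_stem` guard is an assumption added to avoid stripping a word down to nothing; a real stemmer would need a full affix inventory and orthographic rules.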
Morphological position (word order) in Cyrillic Mongolian differs from that in other languages, and this is the biggest problem for computation. A comparative example with the most widely used languages is shown in
The same meaning can be expressed in Cyrillic Mongolian in several other ways besides this example, according to
The principal members of a sentence do not have a fixed position. Searching for a full sentence is therefore difficult, and search methods based on the word sequence of a sentence or on the full sentence are not optimal.
Languages | Morphological position in the sentence: (1) | (2) | (3) | (4) | (5) | (6) |
---|---|---|---|---|---|---|
English | She (1) | gave (2) | him (3) | the (4) | book (5) | (6) |
Chinese | 她 (1) | 给 (2) | 他 (3) | 这本 (4) | 书 (5) | (6) |
Cyrillic Mongolian | Тэр (1) | өгсөн (2) | түүнд (3) | энэ (4) | номыг (5) | (6) |
Cyrillic Mongolian | Morphological position in the sentence: (1) | (2) | (3) | (4) | (5) | (6) |
---|---|---|---|---|---|---|
position 1 | Тэр (1) | Өгсөн (2) | Түүнд (3) | энэ (4) | Номыг (5) | (6) |
position 2 | Энэ (4) | Номыг (5) | тэр (1) | түүнд (3) | Өгсөн (2) | (6) |
position 3 | Тэр (1) | Түүнд (3) | энэ (4) | Номыг (5) | Өгсөн (2) | (6) |
position 4 | Түүнд (3) | энэ (4) | Номыг (5) | тэр (1) | Өгсөн (2) | (6) |
position 5 | Энэ (4) | Номыг (5) | Түүнд (3) | тэр (1) | Өгсөн (2) | (6) |
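The effect of free word order on search can be sketched as follows. The five orderings are the ones listed in the table; the comparison functions are illustrative assumptions, not part of the described system. An exact full-sentence match finds only one ordering, while an order-independent bag of words treats all five as the same content.

```python
from collections import Counter

# The five word orders of the same sentence, as listed in the table.
ORDERINGS = [
    "Тэр өгсөн түүнд энэ номыг",
    "Энэ номыг тэр түүнд өгсөн",
    "Тэр түүнд энэ номыг өгсөн",
    "Түүнд энэ номыг тэр өгсөн",
    "Энэ номыг түүнд тэр өгсөн",
]


def bag_of_words(sentence):
    """Count lower-cased words, ignoring their position in the sentence."""
    return Counter(sentence.lower().split())


bags = [bag_of_words(s) for s in ORDERINGS]

# Order-independent comparison: every ordering has the same bag of words.
same_bag = all(bag == bags[0] for bag in bags)

# Full-sentence comparison: only the first ordering matches itself.
exact_matches = sum(s.lower() == ORDERINGS[0].lower() for s in ORDERINGS)

print(same_bag, exact_matches)
```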
The main purpose of this research is not language processing of the characteristics of Cyrillic Mongolian as such. It focuses on optimized content extraction from training documents in the distance learning system database, where the document collection is written in Cyrillic Mongolian. Optimized content extraction from any kind of document can succeed once several of these characteristics are handled. We therefore suggest the following methods:
1) Preprocess the documents in the distance learning system database.
2) Parse word structure, separate affixes, and identify word stems using the NLP inflection method for Cyrillic Mongolian.
3) Build new search sentences from word stems without affixes, starting from the affixed words of the question sentence, the correct answer, and the key words of the e-test.
4) Extract key words from Cyrillic Mongolian documents.
5) Search for words with the statistics-based vector space search method.
6) Select words from the query result using the TF-IDF method.
7) Calculate similarity using the cosine method.
8) Test and refine the search algorithm for optimized content extraction from Cyrillic Mongolian documents.
Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics [
Sentence segmentation is the problem of dividing a string of written language into its component sentences [
Researchers who work with text information need to split the text into lines in order to compare quality and optimization [
In other words, the text must be split into one or several sentences to obtain detailed search results, and content must be compared across separate sections. For example, it is necessary to calculate how well a sentence or section found in the text material meets the purpose of the search.
Therefore, we need to distinguish words, sentences, and painted parts of lines in the text using a painting (marking) algorithm. Words in the text that match the search are painted, and the text is then split after the painted sentence or part. This kind of painting algorithm is an effective way to avoid further calculation.
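A minimal sketch of this idea, assuming that sentences end with ".", "!" or "?" and that "painting" means wrapping matching words in a marker. The `[[...]]` marker and the function names are hypothetical choices for illustration only.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def paint(sentence, terms):
    """Wrap every word that matches a search term in [[...]]."""
    def mark(match):
        word = match.group(0)
        return f"[[{word}]]" if word.lower() in terms else word

    return re.sub(r"\w+", mark, sentence)


def painted_sentences(text, terms):
    """Return only the sentences that contain at least one painted word."""
    terms = {t.lower() for t in terms}
    result = []
    for sentence in split_sentences(text):
        painted = paint(sentence, terms)
        if painted != sentence:
            result.append(painted)
    return result


sample = "Assessment matters. The weather is fine. Every impact needs assessment!"
print(painted_sentences(sample, {"assessment", "impact"}))
```

Only the painted sentences are passed on, so later similarity calculations run on a smaller set of candidates.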
An information search system can retrieve a given word root and its inflected forms from the database once the word root is defined.
In linguistics, a root word is the uninflected form that holds the most basic meaning of a word. The first study of root recognition algorithms was made in 1968. For example, when a word stem is given as input to a root recognition algorithm, “book, ном” finds “books, номнууд”; “person, хүн” finds “хүний”, “хүнээс”, “хүнтэй”; and “fish, загас” finds “fishing, загасчлах”, “fished, загасчилсан”, and “fisher, загасчин”.
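The lookup of inflected forms from a root can be sketched with simple prefix matching. This is an assumption made for illustration: the vocabulary is a hypothetical sample built from the examples above, and a pure prefix test will also return unrelated words that happen to begin with the same letters.

```python
# Hypothetical vocabulary built from the example words in the text.
VOCABULARY = [
    "ном", "номнууд",
    "хүн", "хүний", "хүнээс", "хүнтэй",
    "загас", "загасчлах", "загасчилсан", "загасчин",
]


def find_inflected(root, vocabulary):
    """Return the vocabulary words that start with the given root."""
    root = root.lower()
    return [w for w in vocabulary if w.lower().startswith(root) and w.lower() != root]


print(find_inflected("хүн", VOCABULARY))
print(find_inflected("загас", VOCABULARY))
```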
Advanced search is a search in which the user's request is expanded. The following techniques are widely used:
1) Searching by synonyms.
2) Searching for word roots and their inflected forms within the search request.
3) Correcting mistakes and searching with the corrected request automatically, or suggesting the corrected search.
4) Searching for each element of the original request.
The information retrieval process starts when the user enters a request into the system. A request does not identify a single object, however; information retrieval systems usually rank multiple objects according to how well they match the request. An information retrieval system can be measured with the following parameters:
1) precision: the fraction of retrieved objects that are relevant.
2) recall: the fraction of relevant objects that are retrieved.
3) error: the fraction of relevant objects that are not retrieved.
4) harmonic mean of precision and recall.
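The four measures can be computed from the set of retrieved objects and the set of relevant objects. The sketch below uses the standard definitions; the function name `evaluate` and the sample identifiers are illustrative assumptions.

```python
def evaluate(retrieved, relevant):
    """Compute precision, recall, error (missed relevant share) and harmonic mean."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant objects that were retrieved

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    error = 1.0 - recall if relevant else 0.0
    total = precision + recall
    harmonic_mean = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "error": error,
        "harmonic_mean": harmonic_mean,
    }


# Hypothetical example: 4 objects retrieved, 3 relevant, 2 in common.
scores = evaluate(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(scores)
```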
General information search models are classified as follows [
1) IR models based on set theory (set-theoretic models). These include the Boolean model, the fuzzy set model, and the extended Boolean model.
2) IR models based on algebraic theory (algebraic models). These include the vector space model, the indexing model, and the neural network model.
3) IR models based on probabilistic statistics (probabilistic models). These include the regression model, the probabilistic model, the language model for IR, and the class network model.
In addition, the list includes machine learning methods based on statistics. Boolean methods are used in all branches of science; search methods based on Boolean models have been developed and extended with other technologies.
The vector space model, or term vector model, is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. In this model, each text is a vector in a t-dimensional term space, and each component is usually a weight calculated for a specific single word [
The similarity between documents can be found by computing the Cosine similarity between their vector representations [
Matrix
The relationship between two vectors is determined by their positions in the space. There are various functions for calculating similarity; functions of the angle between two vectors are commonly used. The similarity between the query text and a retrieved text is calculated with Equation (2).
In the cosine similarity calculation, each text and each query is represented as a point with numerical coordinates in the t-dimensional space, so each term is one dimension of that space. The origin of the space and the point form a vector. The cosine similarity is then the cosine of the angle between two such vectors: the smaller the angle between the vectors, the greater the content similarity. If two texts are identical, the cosine similarity between their vectors is 1.
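For reference, the standard textbook form of cosine similarity between a document vector and a query vector in a t-dimensional term space is given below. The symbols $d_j$, $q$ and $w_{i,j}$ are notation assumed here for illustration and may differ from the notation of the original equation.

```latex
\cos\theta
  = \frac{\vec{d_j}\cdot\vec{q}}{\lVert \vec{d_j}\rVert \,\lVert \vec{q}\rVert}
  = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}
         {\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
```

Here $w_{i,j}$ is the weight of term $i$ in document $d_j$ and $w_{i,q}$ is its weight in the query. For non-negative weights the value lies between 0 and 1, and it equals 1 when the two vectors point in the same direction.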
The main content is identified by computer using word weights. TF-IDF combines term frequency and inverse document frequency [
where nij is the number of occurrences of the given word in text dj, and the denominator is the sum of the occurrences of all words in text dj.
Inverse document frequency measures how common a word is across the texts. The IDF of a given word is derived from the total number of texts and the number of texts that contain the word. It is given by Equation (4):
Here:
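The weighting can be sketched in code under the common textbook definitions, which are assumptions here: term frequency is the word count divided by the total number of words in the text, and inverse document frequency is the logarithm of the number of texts divided by the number of texts containing the word. The exact form used in this work may differ. The sample documents are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by all words in the text."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, docs):
    """Inverse document frequency: log(number of texts / texts containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0


def tf_idf(term, doc_tokens, docs):
    """Weight of a word in one text."""
    return tf(term, doc_tokens) * idf(term, docs)


# Hypothetical tokenized texts.
docs = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
print(tf_idf("a", docs[0], docs))
print(tf_idf("b", docs[0], docs))
```

A word that is frequent in one text but rare in the others receives a high weight, which is the property the key word ranking relies on.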
When optimized content is extracted from learning text documents, the search system finds an answer that matches the user's requirements and suggests it back to the user. The computer performs this dynamically: it compares the similarity between the query words and the documents, splits the text at suitable lines, and suggests the brief relevant parts.
The following algorithm was developed and tested based on several methods from search engine studies. The model consists of four main components: processing the wrong-answer form based on e-test results, recognizing word stems by parsing word structure, key word extraction, and the search process.
1) Processing the wrong-answer form prepares the search key data for each text set, which includes the answer key words, the correct answer, and the text of the question sentence.
2) In the first step of recognizing word stems by parsing word structure, the question sentence of each wrongly answered question is combined with its correct answer. The next steps are parsing the word structure of the text set, recognizing the word stems, and forming a new set from the word stems. These steps produce the following two kinds of set. a) The basic set consists of the exam question and the correct answer. For example: Байгаль орчинд үзүүлэх эерэг болон сөрөг нөлөөллийг тодорхойлох үнэлгээ аль нь вэ?: Хүрээлэн буй орчны үнэлгээ. (Which assessment determines the positive and negative impacts on the environment?: Environmental assessment.) b) The new set of word stems also consists of the exam question and the correct answer. For example: Байгаль орчин үзүүлэх эерэг сөрөг нөлөөлөл тодорхойлох үнэлгээ?: Хүрээлэн орчны үнэлгээ. (Assessment determine positive, negative impact environment?: Environmental assessment.) Both forms can be passed to the next search action.
3) In the second step, the two kinds of set are processed, and five key words are extracted from each set using the stop word method and a statistical method. In the stop word method, words and symbols that carry no meaning are deleted from the set. For example, in “Байгаль орчин үзүүлэх эерэг болон сөрөг нөлөөлөл тодорхойлох үнэлгээ аль нь вэ?” the deleted items are болон, аль, нь, вэ, ? etc. (in the English translation: or, which, can, ? etc.). The candidate key words are then selected by the statistical method. In this step, each key word is used for searching, and the result is compared with the three forms of search data prepared in part 1.
4) In the search action, the search is based on forms 1 and 3 of the text set and uses the VSM method. The search covers two forms of the database documents: a) the basic document as stored in the database, and b) the new document consisting of word stems.
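The stop word step and the statistical selection of key words can be sketched as follows. The stop word list contains only the words named in the example above, and selection by raw frequency is a simplifying assumption; the function names are illustrative and do not reflect the implementation used in the experiments.

```python
import re
from collections import Counter

# Stop words named in the example above; a real list would be much longer.
STOP_WORDS = {"болон", "аль", "нь", "вэ"}


def tokenize(text):
    """Lower-case the text and keep only word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens):
    """Delete tokens that carry no meaning for the search."""
    return [t for t in tokens if t not in STOP_WORDS]


def top_keywords(tokens, k=5):
    """Select the k most frequent remaining words as candidate key words."""
    counts = Counter(remove_stop_words(tokens))
    return [word for word, _ in counts.most_common(k)]


question = "Байгаль орчин үзүүлэх эерэг болон сөрөг нөлөөлөл тодорхойлох үнэлгээ аль нь вэ?"
content_words = remove_stop_words(tokenize(question))
print(content_words)
```

In the sketch, `top_keywords` would be applied to a whole text set rather than a single question, since word frequency is only informative over a longer text.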
The search action calculates the search results and finds suitable content. It follows these steps: performing the search, selecting content, splitting the selected text at its borders, indexing the selected parts, ranking the content from the highest rank, suggesting results in search order, and showing them to the user. The suggested search system architecture diagram is shown in
For the experiment in this chapter, search and content extraction are similar to a data monitoring method. In the e-learning system, training covers on average 4 - 8 basic and professional courses. Three basic courses and two professional courses were selected; their learning text materials were searched, and the optimization was calculated mathematically.
The text materials of the e-subjects “Tourism regional planning” (T1), “Marketing” (T2), “Management” (T3), “Psychology” (T4) and “Education science” (T5) were selected and tested. These subjects are taught in bachelor courses at the National University of Mongolia, the Mongolian University of Science and Technology, the Mongolian University of Life Science and the University of the Humanities.
These learning materials are predominantly theoretical, and their contents are similar to each other. This is the main reason for selecting these subjects.
The experimental text was preprocessed according to the requirements of statistical analysis of text documents, and basic statistical information was calculated. The numbers of symbols, words and sentences for each experimental text material are shown in
An experiment with 300 question sets from the e-text database was carried out for each text material using the VSM search method. One question form is selected from the experiment, and its experimental results and analysis on the learning material are presented in detail. For example, Question 1: Which assessment determines the positive and negative impacts on the environment? For this question set, the experimental search forms are as follows.
Question key word: Байгаль, нөлөөлөл, үнэлгээ.
Correct answer: с. Хүрээлэн буй орчны үнэлгээ.
Question (word basic form and basic document): Байгаль орчинд үзүүлэх эерэг болон сөрөг нөлөөллийг тодорхойлох үнэлгээ аль нь вэ?
Word parsing structure (by word stem and new document): Байгаль орчин үзүүлэх эерэг болон сөрөг нөлөө тодорхой үнэлгээ аль нь вэ?
Experimental results are shown in Tables 3-13.
The average word and sentence lengths of the experimental text materials were calculated statistically, and the results are shown in
Based on these results, T2 and T3 are likely the most understandable learning materials: uniform sentence lengths and few words per sentence indicate that the content is easy to understand.
Search results using key words selected by hand and automatically are shown in
Text documents | Count of characters | Count of words | Count of sentence |
---|---|---|---|
Т1 | 66,510 | 10,151 | 554 |
Т2 | 127,107 | 20,579 | 1918 |
Т3 | 107,291 | 16,581 | 930 |
Т4 | 130,302 | 20,634 | 1329 |
Т5 | 94,645 | 14,416 | 700 |
Text documents | Average length of sentence | Average length of word |
---|---|---|
Т1 | 18.32 | 6.55 |
Т2 | 10.73 | 6.17 |
Т3 | 16.09 | 6.47 |
Т4 | 15.52 | 6.31 |
Т5 | 20.59 | 6.56 |
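The averages in the table above can be derived from the counts of characters, words and sentences. The sketch assumes that average sentence length is words per sentence and average word length is characters per word, which reproduces the tabulated T1 values; the dictionary keys use Latin letters for convenience.

```python
# (characters, words, sentences) for each text material, from the counts table.
COUNTS = {
    "T1": (66510, 10151, 554),
    "T2": (127107, 20579, 1918),
    "T3": (107291, 16581, 930),
    "T4": (130302, 20634, 1329),
    "T5": (94645, 14416, 700),
}


def averages(characters, words, sentences):
    """Return (average sentence length in words, average word length in characters)."""
    return words / sentences, characters / words


for name, counts in COUNTS.items():
    sentence_len, word_len = averages(*counts)
    print(f"{name}: {sentence_len:.2f} words per sentence, {word_len:.2f} characters per word")
```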
Search words (keywords by hand) | Т1 fre | Т1 Pd | Т2 fre | Т2 Pd | Т3 fre | Т3 Pd | Т4 fre | Т4 Pd | Т5 fre | Т5 Pd |
---|---|---|---|---|---|---|---|---|---|---|
Keyword 1 | 21 | 0.00207 | 2 | 0.00010 | 0 | 0.00000 | 6 | 0.00029 | 16 | 0.00111 |
Keyword 2 | 1 | 0.00010 | 1 | 0.00005 | 3 | 0.00018 | 0 | 0.00000 | 0 | 0.00000 |
Keyword 3 | 3 | 0.00030 | 0 | 0.00000 | 2 | 0.00012 | 2 | 0.00010 | 4 | 0.00028 |
Totals | 25 | 0.00246 | 3 | 0.00015 | 5 | 0.00030 | 8 | 0.00039 | 20 | 0.00139 |
a. fre: frequency of word; b. Pd: probability distribution.
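The Pd values in this table appear to be the word frequency divided by the total number of words in the text material. This reading is an assumption, but it reproduces the tabulated figures, as the sketch below shows for Keyword 1.

```python
# Total word counts of the text materials, from the counts table.
WORD_COUNTS = {"T1": 10151, "T2": 20579, "T3": 16581, "T4": 20634, "T5": 14416}


def probability(frequency, total_words):
    """Relative frequency of a word in a text material."""
    return frequency / total_words


# Frequencies of Keyword 1 in each text material.
KEYWORD_1 = {"T1": 21, "T2": 2, "T3": 0, "T4": 6, "T5": 16}

for name, fre in KEYWORD_1.items():
    print(name, f"{probability(fre, WORD_COUNTS[name]):.5f}")
```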
Search word (correct answer word) | Т1 fre | Т1 Pd | Т2 fre | Т2 Pd | Т3 fre | Т3 Pd | Т4 fre | Т4 Pd | Т5 fre | Т5 Pd |
---|---|---|---|---|---|---|---|---|---|---|
Хүрээлэн | 10 | 0.00099 | 0 | 0 | 4 | 0.00024 | 12 | 0.00058 | 4 | 0.00028 |
Буй | 17 | 0.00167 | 13 | 0.00063 | 26 | 0.00157 | 33 | 0.00160 | 21 | 0.00146 |
Орчны | 19 | 0.00187 | 4 | 0.00019 | 18 | 0.00109 | 17 | 0.00082 | 8 | 0.00055 |
үнэлгээ | 21 | 0.00207 | 0 | 0 | 15 | 0.00090 | 6 | 0.00029 | 4 | 0.00028 |
Total | 67 | 0.00660 | 17 | 0.00083 | 63 | 0.00380 | 68 | 0.00330 | 37 | 0.00257 |
Experimental search results using correct answer words are shown in
Experimental search results using the main words of the sentence are shown in
The text set of the T1 text material gives a cosine value of 0.73, which meets the purpose of the experiment in theory. The text set of the T4 text material gives a value of 0.71. The two values, 0.71 and 0.73, are close; their difference was 0.045 percent.
Even a small mistake can easily cause difficulties in machine learning. The experiment compares the computer search result with the manual method. In this experiment, the T1 learning material matches the search content.
However, the result becomes doubtful when the query words occur too often in the experimental texts. The following experiments were made to solve this issue: word structure parsing removes the affixes, new sets are composed from the word stems, and optimized content is then extracted. The search process does not consider word position. The results are shown in
Experimental search results using the word stems of the sentence are shown in
The inner product (inner composition) and text similarity for each text document are calculated with the cosine calculation in Equation (3).
The text set of the T1 text material gives a cosine value of 0.84, which meets the purpose of the experiment in theory. For the other text materials, all values increased but did not meet the search purpose.
Search word (answer sentence main word) | Frequency of words: Т1 | Т2 | Т3 | Т4 | Т5 | q |
---|---|---|---|---|---|---|
Байгаль | 21 | 0 | 0 | 9 | 15 | 3 |
орчинд | 6 | 5 | 34 | 13 | 23 | 5 |
үзүүлэх | 4 | 2 | 7 | 5 | 10 | 5 |
эерэг | 7 | 0 | 1 | 3 | 0 | 3 |
болон | 31 | 28 | 39 | 33 | 46 | 5 |
сөрөг | 4 | 0 | 4 | 5 | 0 | 3 |
нөлөөллийг | 2 | 2 | 5 | 1 | 1 | 5 |
тодорхойлох | 8 | 2 | 3 | 5 | 6 | 5 |
үнэлгээ | 7 | 3 | 7 | 5 | 9 | 5 |
аль нь вэ | 0 | 0 | 0 | 0 | 0 | 0 |
Total words | 90 | 42 | 100 | 79 | 110 | |
Text material | Inner product | Cosine |
---|---|---|
T1 | 386 | 0.73 |
Т2 | 210 | 0.55 |
Т3 | 490 | 0.69 |
Т4 | 361 | 0.71 |
Т5 | 520 | 0.70 |
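The "inner composition" column can be read as the inner product of each text material's frequency column with the q column of the main-word frequency table. The sketch below is a check under that assumption, using the frequencies listed for the answer sentence main words and a plain Euclidean-norm cosine. With these inputs the inner products match the tabulated values, and the cosines agree with the table to within about 0.01.

```python
import math

# q column and per-material frequencies of the answer sentence main words.
Q = [3, 5, 5, 3, 5, 3, 5, 5, 5, 0]
FREQUENCIES = {
    "T1": [21, 6, 4, 7, 31, 4, 2, 8, 7, 0],
    "T2": [0, 5, 2, 0, 28, 0, 2, 2, 3, 0],
    "T3": [0, 34, 7, 1, 39, 4, 5, 3, 7, 0],
    "T4": [9, 13, 5, 3, 33, 5, 1, 5, 5, 0],
    "T5": [15, 23, 10, 0, 46, 0, 1, 6, 9, 0],
}


def inner_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return inner_product(u, v) / norm if norm else 0.0


for name, d in FREQUENCIES.items():
    print(name, inner_product(d, Q), round(cosine(d, Q), 2))
```

Ranking the text materials by the cosine value then selects the material whose word distribution is closest to the query.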
Search word (sentence word stem) | Frequency of words: Т1 | Т2 | Т3 | Т4 | Т5 | q |
---|---|---|---|---|---|---|
Байгаль | 21 | 0 | 0 | 9 | 15 | 3 |
орчин | 19 | 13 | 53 | 14 | 29 | 4 |
үзүүлэх | 23 | 2 | 25 | 23 | 19 | 4 |
эерэг | 7 | 0 | 1 | 7 | 0 | 3 |
болон | 62 | 46 | 52 | 51 | 53 | 4 |
сөрөг | 4 | 0 | 4 | 6 | 0 | 3 |
нөлөөлөл | 13 | 2 | 34 | 62 | 2 | 4 |
тодорхойлох | 63 | 5 | 3 | 41 | 45 | 4 |
үнэлгээ | 22 | 4 | 15 | 6 | 21 | 4 |
аль нь вэ | 0 | 0 | 0 | 0 | 0 | 0 |
Total words | 234 | 72 | 187 | 219 | 184 | |
Text documents | Inner product | Cosine |
---|---|---|
T1 | 1106 | 0.84 |
Т2 | 360 | 0.56 |
Т3 | 925 | 0.80 |
Т4 | 1051 | 0.83 |
Т5 | 890 | 0.81 |
Search word (word stem) | Frequency of single word | TF | IDF | Weight of word | Rank of weight |
---|---|---|---|---|---|
Байгаль | 21 | 0.03 | 14.86 | 0.457 | IV |
орчин | 19 | 0.01 | 14.10 | 0.098 | V |
эерэг | 7 | 0.03 | 43.38 | 1.298 | II |
сөрөг | 4 | 0.04 | 95.76 | 3.614 | I |
үнэлгээ | 22 | 0.03 | 21.64 | 0.646 | III |
Search word (word stem) | Frequency of words: Т1 | Т2 | Т3 | Т4 | Т5 | q |
---|---|---|---|---|---|---|
Байгаль | 21 | 0 | 0 | 9 | 15 | 3 |
Орчин | 19 | 5 | 53 | 14 | 29 | 4 |
эерэг | 7 | 0 | 1 | 7 | 0 | 3 |
сөрөг | 4 | 0 | 4 | 5 | 0 | 3 |
үнэлгээ | 22 | 3 | 15 | 6 | 21 | 4 |
Totals | 73 | 8 | 73 | 37 | 65 | |
Text documents | Inner product | Cosine |
---|---|---|
T1 | 301 | 0.93 |
Т2 | 40 | 0.78 |
Т3 | 355 | 0.73 |
Т4 | 151 | 0.92 |
Т5 | 295 | 0.87 |
Five key words were extracted from the Cyrillic Mongolian documents by a key word extraction algorithm built with natural language processing technology. The candidate key words are shown in
The search result relevance is calculated based on query words as a
In this text material, the words negative “сөрөг”, positive “эерэг”, assessment “үнэлгээ”, nature “байгаль” and environment “орчин” have the highest weights. Therefore, the search can use these five highest-weighted words. The experimental results using the extracted key words are shown in
The inner product (inner composition) and text similarity for each text document are calculated with the cosine calculation in Equation (3). These results are shown in
The cosine value for the text set of the T1 text material increased to 0.93, which meets the purpose of the experiment in theory.
In this chapter, we introduced an optimized content extraction algorithm for text documents in a distance learning system database and the results of its application.
Optimized content is extracted with NLP, TF-IDF and cosine methods. The following work was done to achieve the research goal. We also confirmed that the search can be made by word stem without considering word position.
Text material preprocessing is important for determining the statistical results accurately. The text materials had an average word length of 6.41 and an average sentence length of 16.25. These figures are similar to those of other studies on the Cyrillic Mongolian language.
In the experiment that searched with the word stems of the question sentence, the word frequencies and probability distributions became more refined, and the cosine increased to 0.84.
These indicators meet the purpose of the search. Five key words were extracted from the Cyrillic Mongolian documents by the key word extraction algorithm using natural language processing technology. When the search used the automatically extracted key words, the text material similarity reached 0.93.
The experiments showed that the optimized content extraction search algorithm we developed for Cyrillic Mongolian is useful.
Bat-Erdene Nyandag, Ru Li, G. Indruska (2016) Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database. Journal of Computer and Communications, 04, 79-89. doi: 10.4236/jcc.2016.410009