Word sense disambiguation (WSD) is used in many natural language processing applications. One approach to disambiguation is the decision list algorithm, a supervised method. Supervised methods are considered the most accurate machine learning approaches, but they suffer strongly from the knowledge acquisition bottleneck: their performance depends on the size of the tagged training set, whose preparation is difficult, time-consuming, and costly. The method proposed in this article improves the performance of this algorithm when only a small tagged training set is available. It uses a statistical method to extract collocations from a large untagged corpus, thereby identifying the more important collocations, which are the features used to build the learning hypotheses. Weighting these features improves the performance and accuracy of a decision list algorithm trained on a small corpus.
Every language contains words with multiple meanings and different uses, whose intended meaning is determined by the context in which they appear; these are ambiguous words. The context can be a sentence or a phrase. Determining the meaning of such words, known as Word Sense Disambiguation (WSD), is a research area in natural language processing with applications in Information Retrieval (IR), Machine Translation (MT), information extraction, and document classification.
Ambiguous words fall into two categories according to how distinct their senses are, a property called granularity. Words whose various senses have a low level of distinction are called fine-grained. For example, in machine translation it must be specified which of its Persian equivalents the word “discussion” should be translated to, according to its context. The senses of homographs, in contrast, are highly distinct, or coarse-grained. For example, what does the word “شیر” (shir) mean in a given sentence: shir as a dairy product (milk), shir as a tool (faucet), or shir as an animal (lion)? Most real-world applications deal with the coarse-grained level [
In the 1990s, when machine learning approaches emerged, great progress was made in word sense disambiguation. In that decade, supervised algorithms with excellent accuracy were introduced, and they still achieve the best accuracy. Since the accuracy of these algorithms generally depends on manually tagged training data, a knowledge acquisition bottleneck arises for ambiguous words that lack a large corresponding tagged dataset, and for languages with no available semantically tagged corpus. No training corpus is large enough to cover all the ambiguous words for training a supervised algorithm, even in languages such as English, which was among the first targets for building large manually labeled corpora. The feasibility of building such a corpus remains hypothetical, because preparing such training data is time-consuming and costly [
Unsupervised algorithms, on the other hand, need no semantically tagged corpora and therefore avoid the knowledge acquisition bottleneck, but their accuracy is inadequate. A WSD system is not a goal in itself; it is needed as a tool to improve the performance of practical applications such as information retrieval and machine translation, so its accuracy affects the accuracy of the whole system. The system should also be free of the knowledge acquisition bottleneck so that it can adequately cover all the ambiguous words in a language.
Much research has been done in recent years to overcome this problem, including semi-supervised methods that use tagged and untagged corpora at the same time, and methods that combine the corpus with other linguistic resources such as dictionaries, thesauri, and ontologies.
A small tagged corpus that covers all the ambiguous words of a language can be built faster and at lower cost than a large one. The proposed method tries to upgrade the decision list algorithm, a supervised algorithm, using a relatively small tagged corpus together with a large untagged one, so that the accuracy of the supervised algorithm trained on the small tagged corpus approaches that of the same algorithm trained on a larger tagged corpus.
Collocation extraction is usually based on applying Association Measures (AMs) to large corpora. An AM uses statistical data about the words in a corpus to identify collocations [
A widely used measure is pointwise mutual information,
$$\mathrm{PMI}(x,y)=\log_2\frac{N\cdot f(x,y)}{f(x)\,f(y)}$$
in which N is the number of words in the corpus and f(x), f(y), and f(x,y) are the frequencies of the two words and of their co-occurrence. This measure compares the degree of dependence of two words against their independence. Other versions of this measure have also been suggested. For example, [
They then defined another measure, Mutual Dependency with a log-frequency bias (LFMD), arguing that a small bias toward the pair's own frequency (rather than its inverse) is useful in statistical measures:
$$\mathrm{LFMD}(x,y)=\log_2\frac{P(x,y)^2}{P(x)\,P(y)}+\log_2 P(x,y)$$
in which the P values are probabilities estimated from corpus frequencies. The $\chi^2$ measure was in fact a proposal for addressing the null hypothesis, according to which the simultaneous occurrence of two words is not always evidence of their dependence: the collocation may have arisen by chance and accident (for example, combinations such as “of the” and “in the”). The $\chi^2$ measure can detect such chance collocations: if its value exceeds a threshold, the null hypothesis is rejected and the pair is taken to be a true collocation.
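The association measures discussed above can be sketched as follows. This is a minimal illustration assuming raw corpus frequencies as inputs; the function names are chosen here for clarity and are not from the original.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: co-occurrence vs. independence."""
    return math.log2((f_xy * n) / (f_x * f_y))

def mutual_dependency(f_xy, f_x, f_y, n):
    """MD(x,y) = log2(P(x,y)^2 / (P(x) P(y)))."""
    p_xy, p_x, p_y = f_xy / n, f_x / n, f_y / n
    return math.log2(p_xy ** 2 / (p_x * p_y))

def lf_md(f_xy, f_x, f_y, n):
    """MD with a log-frequency bias: MD(x,y) + log2 P(x,y)."""
    return mutual_dependency(f_xy, f_x, f_y, n) + math.log2(f_xy / n)
```

For a pair seen 10 times in a 10,000-word corpus, where each word occurs 20 times, PMI is log2(250) ≈ 7.97, while MD penalizes the pair's rarity and yields −2.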
Measure introduced in [
Also [
The decision list algorithm was proposed for homograph disambiguation by [
Each probability expresses the association between a feature and one of the possible senses of the homograph. These probabilities are sorted from largest to smallest into a decision list, and entries below a certain threshold are discarded. When a new test sample arrives, the entries of the decision list are checked one by one, from top to bottom, against the words in the corresponding window around the ambiguous word; the search stops at the first match, and the sense class of the matched feature is assigned to the test sample.
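The training and lookup steps just described can be sketched roughly as follows. This is a simplified Yarowsky-style decision list; the smoothing constant `alpha` and the data layout are illustrative assumptions, not the paper's exact implementation.

```python
import math
from collections import defaultdict

def train_decision_list(samples, alpha=0.1):
    """samples: list of (context_words, sense). Builds a decision list of
    (log-likelihood ratio, feature, sense), strongest evidence first."""
    counts = defaultdict(lambda: defaultdict(float))
    senses = set()
    for context, sense in samples:
        senses.add(sense)
        for word in context:
            counts[word][sense] += 1
    rules = []
    for word, by_sense in counts.items():
        for s in senses:
            p_s = by_sense[s] + alpha
            p_rest = sum(by_sense[t] for t in senses if t != s) + alpha
            llr = math.log2(p_s / p_rest)
            if llr > 0:                 # keep only evidence *for* sense s
                rules.append((llr, word, s))
    rules.sort(reverse=True)            # largest ratio on top of the list
    return rules

def classify(rules, context, default=None):
    """The first rule whose feature occurs in the context wins."""
    ctx = set(context)
    for _, word, sense in rules:
        if word in ctx:
            return sense
    return default
```

A test sample is tagged by scanning the sorted list until a feature matches, exactly as the search procedure above describes.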
The decision list algorithm is collocation-based: its features are local, simply the words surrounding the target word. Hence it needs no preprocessing to determine grammatical tags and is applicable to languages for which accurate grammatical tagging is unavailable.
As mentioned, the decision list algorithm extracts and ranks the desired features from a tagged corpus using the probabilities it computes for neighboring words. These features are a form of collocation with the target word. Collocation words are defined as words whose frequency of co-occurrence in text or speech is greater than would be expected by chance.
Since the decision list depends on a tagged training corpus to extract these collocations, and such a tagged corpus cannot be prepared at large scale for all the homographs of a language, some of the collocations of a homograph are not identifiable from a small training corpus.
Collocations take different forms, varying widely in the number of words involved and in how they combine. Some are rigid and some are flexible. For example, the flexible collocation of “making” and “decision” appears in many forms, such as “to make a decision”, “a decision has been taken”, and “a very important decision was taken”; on the other hand, a collocation such as “General Motors” appears in only one form and is rigid. Clearly, the notion of collocation in a task such as WSD is wider and more flexible than the definitions used in the collocation extraction literature, because WSD methods usually exploit the fact that a particular word co-occurs with one of the senses of an ambiguous word; the pair is not required to carry a separate, specific meaning as in a collocation like “General Motors”. Consider, for example, the relation between “اشراف” (nobles) and “پادشاه” (king): “king” helps determine the sense of the ambiguous word “nobles”, although by some of the definitions of collocation mentioned above these two words do not form one. [
[
Initially, lexical relations between pairs of words are retrieved using only statistical information. This phase is comparable with the work of [
The output of the first step then feeds the next step. In the second step, the word pairs produced by the first step are used to create n-gram collocations. This step analyzes all sentences containing a pair, examining the distribution of the words and POS (part-of-speech) tags in the positions surrounding the pair, and keeps any word (or POS tag) that occupies a position with probability above a given threshold. For example, the word pair “average–industrial” yields the larger collocation “the Dow Jones industrial average”, because these words almost always appear within that larger noun phrase in the corpus. The corpus must be POS-tagged before this step and the third step.
Finally, in the third step, the word pairs retrieved in the first step are filtered by combining the output of a parser with statistical methods. In this step, Xtract adds syntactic information to the collocations retrieved in the first step and filters out inappropriate ones. For example, if a pair consists of a noun and a verb, this step labels it as a subject–verb or verb–object pair; if no such relation holds, the pair is rejected.
The idea behind the proposed method is that, because the training corpus is small, some collocations do not occur frequently enough to receive an adequate probability and a place in the upper tiers of the decision list, even though they genuinely collocate with one of the senses of the homograph. If such collocations are detected and strengthened in the decision list with an added weight, they can improve performance. For example, consider the sentence “کارشناسان عقیده دارند، قدرت اقتصادی اروپای شرقی از نفس افتاده است که به سبب انجام ندادن اصلاحات اقتصادی در لهستان است” (Experts believe that Eastern Europe's economic power has collapsed because economic reforms in Poland have not been carried out). A decision list trained on a corpus that is not very large accepts the class proposed by the collocation “است”, which co-occurs most often with “نفس” among the collocations in the window and therefore sits in the upper part of the list. This collocation suggests the wrong class, because it appears in many sentences such as “یکی از معایب این کار بالا رفتن اعتماد به نفس کاذب است” (One disadvantage of this is an excessive rise in false self-confidence). The collocation “افتاده” (a form of “افتادن”, to fall), however, sits in the lower tiers of the decision list because of the small corpus, not because it is a poor collocation. If we had previously identified “افتادن” as a better collocation than “است”, we could add a weight to its probability so that it rises in the decision list and identifies the correct class. Identifying such collocations is possible with a large untagged corpus.
The first step of Xtract is used in this method to identify collocation words, whether adjacent to each other or separated by a few words (one member of each pair being the target homograph). Smadja calls the first step the extraction of significant bi-grams. According to (Smadja, 1993), there is strong evidence that most collocational lexical relations hold between words separated by at most five words. In other words, most of the lexical relations involving a word w can be retrieved by examining the neighbors of w that occur in a window of five words on each side (−5 to +5 around w).
Only statistical methods are used in this step to identify related word pairs. These methods rest on the assumption that if two words form a collocation, then:
First, the two words must co-occur with a frequency high enough that their joint appearance is beyond chance and accident.
Second, the two words must co-occur in a relatively rigid (uncompromising) way.
The distribution of words in the sentence is analyzed under these two hypotheses, and the filters used are based on them.
Initially, a list of words wi is built together with the co-occurrence frequencies of w and wi, where w is the homograph and wi is a candidate for collocating with w. For each wi, the list records its co-occurrence frequency with w, broken down by the position of occurrence relative to w (the possible distance between the two collocating words) within the (−5, +5) window. freqi is the total frequency of wi appearing with w in the corpus, and pj, where j ranges over −5 to +5 excluding zero, is the frequency of wi appearing j words away from w. The pj values form a histogram of the positional frequencies of wi around w.
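The positional bookkeeping described above might be implemented along these lines; the function name and data layout are illustrative assumptions, not from the paper.

```python
from collections import defaultdict

def positional_histograms(sentences, w, span=5):
    """For each candidate word wi, count its co-occurrences with the
    homograph w, broken down by signed offset j in [-span, +span].
    Returns (hist, freq) where hist[wi][j] = p_j and freq[wi] = freq_i."""
    hist = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != w:
                continue
            lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
            for k in range(lo, hi):
                if k != i:
                    hist[tokens[k]][k - i] += 1
    freq = {wi: sum(pj.values()) for wi, pj in hist.items()}
    return hist, freq
```

Each inner dictionary is exactly one row of the positional-frequency table: ten pj counters plus their sum freqi.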
Then the more important word pairs are extracted from the list using statistical measures describing the strength of the connection between the words and its rigidity. The first measure is strength:
$$k_i=\frac{freq_i-\bar{f}}{\sigma}$$
in which freqi is the co-occurrence frequency of wi with w in the (−5, +5) window, $\bar{f}$ is the average of this frequency over all candidate words, and $\sigma$ is its standard deviation.
The next measure is spread:
$$U_i=\frac{\sum_{j}\left(p_i^{\,j}-\bar{p}_i\right)^2}{10}$$
in which $p_i^{\,j}$ is the frequency of wi at position j relative to w and $\bar{p}_i$ is the average of $p_i^{\,j}$ over the ten window positions. A high spread means the positional histogram has pronounced peaks; a low spread means wi appears at all positions with roughly equal probability.
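A sketch of the strength and spread computations, assuming Smadja's (1993) definitions (strength as a z-score of freqi over all candidates; spread as the variance of the ten positional frequencies). Whether population or sample deviation is used in strength is an assumption, since the text does not specify. The spread of the wm/wn histograms discussed later reproduces the quoted values 20.25 and 30.29.

```python
import statistics

def strength(freq_i, all_freqs):
    """k_i = (freq_i - mean) / stdev over all candidate words' frequencies.
    Population stdev is an assumption here."""
    mean = statistics.mean(all_freqs)
    sd = statistics.pstdev(all_freqs)
    return (freq_i - mean) / sd

def spread(p):
    """U_i = sum_j (p_j - mean)^2 / 10 over the ten window positions."""
    mean = sum(p) / len(p)
    return sum((x - mean) ** 2 for x in p) / len(p)
```

A single sharp peak (15 occurrences at one position) gives spread 20.25, while a flat histogram summing to 291 gives about 30.29, which is why later processes relate the spread threshold to frequency.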
The following three filters are defined to reject inappropriate wis and to select the optimal window for the appropriate ones:
The first proviso removes word pairs that lack sufficient frequency. It requires the frequency of wi in the neighborhood of w to be at least one standard deviation above the mean; that is, its co-occurrence frequency with the target word must be high relative to the other candidate words in the corpus. In most statistical distributions this thresholding eliminates a large number of lexical relations. For example, “دفاع” (defense) is removed in the table below.
The second proviso eliminates any wi whose distribution histogram in the window around w is smoother, with fewer peaks, than a certain limit; in effect it accepts more rigid, uncompromising pairs. The assumption is that if two words frequently appear together in a particular syntactic structure, their positional histogram shows a characteristic collocation pattern with clear peaks; otherwise they are seen at all positions with roughly equal probability. For example, “این” (this) is removed in the table below.
The third proviso works differently from the previous two. The first two provisos eliminate wis entirely, whereas this one is applied to the wis that have passed them and been identified as appropriate pairs, and it eliminates the unsuitable positions of the (−5, +5) window. The first and second
| w | wi | freqi | p−5 | p−4 | p−3 | p−2 | p−1 | p+1 | p+2 | p+3 | p+4 | p+5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| اشراف | بخش | 12 | 2 | 1 | 4 | 0 | 1 | 0 | 0 | 0 | 4 | 0 |
| اشراف | شهر | 19 | 2 | 0 | 4 | 1 | 1 | 1 | 2 | 7 | 1 | 0 |
| اشراف | اروپایی | 12 | 0 | 2 | 1 | 1 | 0 | 6 | 1 | 1 | 0 | 0 |
| اشراف | داشتن | 493 | 26 | 21 | 16 | 13 | 10 | 227 | 87 | 40 | 38 | 15 |
| اشراف | این | 290 | 34 | 24 | 33 | 46 | 15 | 7 | 29 | 48 | 37 | 17 |
| اشراف | زادگان | 25 | 0 | 0 | 0 | 0 | 0 | 24 | 0 | 1 | 0 | 0 |
| اشراف | در | 478 | 67 | 64 | 68 | 56 | 3 | 45 | 46 | 48 | 62 | 19 |
| اشراف | علمی | 42 | 2 | 1 | 2 | 3 | 0 | 30 | 0 | 3 | 1 | 0 |
| اشراف | کنترل | 8 | 0 | 0 | 1 | 4 | 0 | 0 | 2 | 0 | 0 | 1 |
| اشراف | از | 419 | 47 | 28 | 55 | 56 | 80 | 13 | 27 | 45 | 59 | 9 |
| اشراف | طبقه | 25 | 3 | 1 | 5 | 1 | 13 | 1 | 0 | 1 | 0 | 0 |
| اشراف | اعیان | 32 | 0 | 2 | 1 | 27 | 0 | 0 | 0 | 1 | 0 | 1 |
| اشراف | دفاع | 3 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| اشراف | کامل | 146 | 1 | 3 | 2 | 1 | 0 | 129 | 1 | 6 | 2 | 1 |
provisos delete whole rows of the first step's output, whereas the third proviso selects columns from the remaining rows. One or more positions may be retained for each pair, corresponding to the peaks of its histogram, so the result may be selected in several positions.
The thresholds u0, k0, and k1 should be determined experimentally and depend on the intended use of the collocations. Generally, a lower threshold accepts more data, giving higher recall and lower precision. Smadja, who used Xtract to build a lexicon automatically from a corpus of more than ten million words, set k0, u0, and k1 to 1, 10, and 1 respectively. Our method also uses the three measures above to extract word pairs, adjacent or separated; the values used for each are described in the following.
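Putting the three provisos together, a rough sketch follows, assuming Smadja's published conditions (C1: strength ≥ k0; C2: spread ≥ u0; C3: keep positions where pj rises at least k1 standard deviations above the histogram mean) with his default thresholds. The function name and dictionary layout are illustrative.

```python
import statistics

def xtract_filters(candidates, k0=1.0, u0=10.0, k1=1.0):
    """candidates: {wi: [p_-5..p_-1, p_+1..p_+5]} positional histograms.
    Returns {wi: [peak positions]} for pairs surviving all three provisos."""
    freqs = {wi: sum(p) for wi, p in candidates.items()}
    mean_f = statistics.mean(freqs.values())
    sd_f = statistics.pstdev(freqs.values()) or 1.0
    kept = {}
    for wi, p in candidates.items():
        k_i = (freqs[wi] - mean_f) / sd_f
        if k_i < k0:                          # C1: frequency filter
            continue
        mean_p = sum(p) / len(p)
        u_i = sum((x - mean_p) ** 2 for x in p) / len(p)
        if u_i < u0:                          # C2: spread filter
            continue
        thresh = mean_p + k1 * (u_i ** 0.5)
        kept[wi] = [j for j, x in enumerate(p) if x >= thresh]   # C3: peaks
    return kept
```

A sharply peaked, frequent candidate survives with only its peak position retained, while flat or rare candidates are dropped entirely.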
Another point to consider is the distance between the extracted collocation and the homograph. For this purpose the decision list applies ±1, ±2, and ±K word rules, identifying words that collocate with the homograph only at a particular position and distance: one word before and after, two words before and after, and finally words anywhere in a window of radius K around the homograph. The ±1 and ±2 rules capture more rigid collocations, while collocations extracted by the ±K rule are treated as soft: every collocation obtained by this rule is searched for anywhere in the ±K window.
This raises a question: can using the position (relative to the homograph) at which the collocating word occurs most often, as the exact window size, improve system performance? We evaluated this in the table below.
The utility of this windowing is not the same for different homographs. The greatest benefit is for the word “اعمال” (impose or acts), while performance for “گرم” (hot or gram) drops. The reason may be that the collocations of “گرم” are softer, and this windowing restricts their search scope and thus reduces their effectiveness. However, for the words harmed by this method the percentage drop is smaller than the percentage gain for the words that benefit.
A decision list that uses only the collocations proposed by Smadja's method is built from the training corpus
| Windowing method | Training corpus size | اعمال | اشراف | دور | حسن | گرم | نفس |
|---|---|---|---|---|---|---|---|
| Window of size 5 | 500 | 0.8063 | 0.8197 | 0.8606 | 0.9171 | 0.8738 | 0.8512 |
| Window of size 5 | 1200 | 0.8118 | 0.8680 | 0.9072 | 0.9322 | 0.9268 | 0.8914 |
| Max-frequency window | 500 | 0.8451 | 0.8118 | 0.9075 | 0.9205 | 0.8266 | 0.8501 |
| Max-frequency window | 1200 | 0.8643 | 0.8565 | 0.9212 | 0.9311 | 0.9254 | 0.8936 |
after the collocations are extracted. Clearly this list does not need the ±1, ±2, and ±K rules, because the window surrounding the homograph for learning the class of each collocation is already determined. Such a classifier, which we call a “special decision list”, has higher precision than an ordinary decision list but lower recall, because its learning model uses fewer, more accurate collocations.
To overcome the declining recall of the special decision list, an ordinary decision list is also trained on the training corpus. At tagging time the special decision list is applied to the test sample first; if it proposes no tag, the class suggested by the ordinary decision list is used. Alternatively, the confidence of the class proposed by the special list is multiplied by a weight and compared with the confidence of the class proposed by the ordinary list, and the more confident class is chosen.
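The back-off and weighted-comparison strategies just described can be sketched as follows, under a hypothetical API in which each decision list is a score-sorted list of (score, feature, sense) rules.

```python
def combined_classify(special_rules, ordinary_rules, context,
                      weight=1.0, default=None):
    """Combine a high-precision 'special' decision list with an ordinary
    one: back off to the ordinary list when the special list is silent,
    otherwise compare their confidences with a boost for the special list."""
    ctx = set(context)

    def best(rules):
        # First matching rule in a score-sorted list wins.
        for score, feat, sense in rules:
            if feat in ctx:
                return score, sense
        return None

    sp, od = best(special_rules), best(ordinary_rules)
    if sp is None:
        return od[1] if od else default       # back-off to ordinary list
    if od is None:
        return sp[1]
    # Weighted comparison: boost the special list's confidence by `weight`.
    return sp[1] if sp[0] * weight >= od[0] else od[1]
```

With weight = 1 the two lists compete on raw confidence; raising the weight lets the special list's sharper collocations override the ordinary list's frequent but noisier ones.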
Evaluating Different Tested Processes to Improve Collocation Extraction Using Xtract
To bring Smadja's method into better alignment with the goal of word sense disambiguation and to improve the performance of the decision list, the proposed method modifies the second and third filters of Smadja's method. Since the results of these changes are evaluated in the next section, the changes are classified and introduced here. The first six processes act on the first step of
The method extracts and weights word pairs through several different processes; a comparison of their results is given in the experiments section of the article. Each process is explained in the remainder of this section:
The first process: this process makes no changes to the first step of Smadja's method. Initially we build, for each homograph, a matrix similar to the one shown above
First, pairs whose strength measure is small are harmless on their own, since they never enter the decision list at all; thus the strength threshold k0 can be lowered. In other words, reducing the strength threshold does not significantly affect precision, but it increases recall. Also, as noted in Smadja's article, collocations relevant to word-sense ambiguity can be softer. Smadja required rigid collocations, given the definition of collocation he adopted, in order to build a lexicon automatically.
Hence two words that repeatedly appear together in text merely because they belong to the same topic, such as “اشراف” (nobility) and “پادشاهان” (kings), have high strength but a low spread measure, since they can occur anywhere relative to each other in a sentence; Xtract therefore used a high spread threshold. In WSD, however, such collocations are useful, and a lower spread threshold can be considered.
The second process: the reduction of the spread threshold made in the first process is not appropriate for all candidate collocations, because they carry different morphological tags. Essentially, the words treated as collocations in a WSD system are not verbs and particles (articles, prepositions, pronouns, etc.). A verb that occurs at all positions around the homograph forms a true pair with it with lower probability than a verb whose occurrences peak one to three words after the homograph. For example, in the sentence “امروز ميبینیم اعتماد و اتكاي به نفس داشته و قدم بعدي را در خلق ابتكارهاي بزرگتر به پيش ببرند” (Today we see that they have self-confidence and can take the next step in creating larger initiatives), the verb “دیدن” (to see) is not a pair with the word “نفس”, but the verb “داشتن” (to have) is. In general, “داشتن” has three peaks, exactly at the +1, +2, and +3 positions. The same holds for particles: a particle collocates with the homograph only if it appears in a completely rigid, inflexible position relative to it. For example, in the sentence “با ورزش دانش‌آموز علاوه بر اعتماد به نفس خويشتن‌داري مثبت را نيز فرا مي‌گيرد” (Through exercise, students acquire positive self-restraint in addition to self-confidence), only the particle “به”, located exactly at position −1 relative to “نفس”, forms a pair. In the same sentence, the particles “بر”, “را”, and “نیز”, which occur frequently in any corpus and in any ratio with both senses of “نفس”, are not appropriate words. The word “به” itself should be searched for only in the −1 window, a condition enforceable by the third filter.
Thus this process applies a higher, more stringent spread threshold to the two morphological types of verbs and particles, and a lower one to the other forms. The tests showed that the filter is still required for the other morphological types, because a peak in a word's spread measure is a sign of its desirability as a pair.
The third and fourth processes: these address a weakness in the spread measure. If a word wi occurs rarely in the corpus, then even if every one of its occurrences in the window around w is at a fixed position, it does not receive a large spread. For example, consider two words wm and wn with the frequency distributions shown in the table below.
Then we have for each: Spreadm = 20.25, Spreadn = 30.29.
We see that although wm is clearly a good pair, its rarity in the corpus (and hence its low co-occurrence frequency with the homograph) gives it a lower spread than wn. If we lower the spread threshold far enough not to lose words like wm, we also fail to filter probably useless words like wn, which pass only because of their high frequency of occurrence in the corpus, not because of a good spread; this is especially true of particles, which occur very frequently.
One solution is to make the spread threshold depend on the word's co-occurrence frequency, so that a word with a high co-occurrence frequency faces more stringency in the histogram-spread filter (the second filter). Thus we changed the second filter in the third process as follows:
and in the fourth process as follows:
and compared the results. Clearly, the fourth process applies more stringency to high-frequency words than the third process does.
The fifth process: the third and fourth processes ignore the part of speech of words and apply the same filter to all of them. In the fifth process the second filter is therefore chosen according to the POS tag of the word. Testing different types and thresholds, the most appropriate filter obtained is Equation (14) for verbs and prepositions and Equation (13) for the other words. This means the spread-measure filter is applied more strictly, relative to co-occurrence frequency, to verbs and prepositions.
The sixth process: this process examines the third filter, i.e., how the window of each word pair is determined. As previously mentioned:
| wi | freqi | p−5 | p−4 | p−3 | p−2 | p−1 | p+1 | p+2 | p+3 | p+4 | p+5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| wm | 15 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 0 | 0 |
| wn | 291 | 20 | 34 | 32 | 19 | 33 | 35 | 29 | 25 | 31 | 33 |
This filter narrows the windows of the pairs produced by the previous step: it removes those positions of the (−5, +5) window where, according to its spread, the word has no peak. For example, consider the sentence “رسانهها تایید کردند که او در نشست دیروز اشراف و تسلط کافی را داشته است” (The media confirmed that he had sufficient mastery and command in yesterday's meeting): the filter has already determined that the collocation “داشتن” must be searched for in the (0, +5) window around the homograph, and an occurrence outside those positions, especially before the homograph, should not assign a class. This feature is helpful in a decision list because, as evaluated at the beginning of this section, instead of taking the single maximum-occurrence position as the window, this method can propose several appropriate positions and thereby widen the window. For example, it can assign the wider (0, +5) window according to the peaks of the spread histogram, instead of a window limited to (0, +3).
In the sixth process, in addition to testing different values of the constant threshold k1, the Maximum mode (Max), which keeps only the position with the highest occurrence frequency, as used in the ordinary decision list, was also compared. Clearly, higher values of k1 remove more positions and apply more stringency.
The seventh process: this process acts on the fourth step of the method
1) No weighting is done; instead, the class proposed by the special decision list is always used, and the ordinary decision list is used for tagging only when the special list makes no suggestion. Because the ordinary list does not prune collocations, it has higher coverage, so when the special decision list finds no collocation in the context of a test sample, the ordinary list may still find one.
2) Weighting with constant numbers. Arguably, no single weight is best for all homographs: some perform best with lower weights and some with higher ones. Words with a better spread measure tend to reach maximum performance with a higher weight on the special decision list.
3) Making the assigned weight depend on the spread calculated for the extracted pair:
$$\mathrm{nspread}_i=\frac{\mathrm{spread}_i-\overline{\mathrm{spread}}}{\sigma_{\mathrm{spread}}}$$
in which w is a constant number and spreadi is the spread of the i-th pair extracted in a process, while $\overline{\mathrm{spread}}$ and $\sigma_{\mathrm{spread}}$ are the mean and standard deviation of the spreads of all extracted pairs. This variable is computed for each extracted pair and acts as a weight during classification for that pair, added to the constant weight w: a word whose peak is higher relative to the mean and variance of the other word pairs receives a higher weight.
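Since the exact weighting equation is not recoverable from the text, the following is only one plausible reading of this scheme: a z-score of each pair's spread against the mean and standard deviation of all extracted pairs, added to the constant weight w.

```python
import statistics

def spread_weights(spreads, w=1.0):
    """Hypothetical reconstruction: normalize each pair's spread against
    the mean/stdev of all extracted pairs and add it to the constant
    weight w, so sharper-peaked pairs receive larger boosts."""
    mu = statistics.mean(spreads.values())
    sd = statistics.pstdev(spreads.values()) or 1.0
    return {pair: w + (s - mu) / sd for pair, s in spreads.items()}
```

Under this reading, a pair whose spread sits one standard deviation above the mean gets weight w + 1, while an average pair gets exactly w.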
The corpus used in the experiments is the Hamshahri corpus, a Persian collection of Hamshahri newspaper text from 1996 to 2007, with no POS tagging or lemmatization [
We chose six homographs for the experiments: “اعمال” (impose or acts), “دور” (round or far), “حسن” (Hasan or goodness), “گرم” (hot or gram), “اشراف” (nobility or aware), and “نفس” (breath or self). Initially we extracted all occurrences of these words together with the five words before and after them. For each homograph, 1200 occurrences were given a semantic tag, forming a semantically tagged corpus. Then, for each homograph, 5000 occurrences and their surrounding words were morphologically tagged and lemmatized to form a corpus without semantic tags. Lemmatization was limited to reducing verbs to their infinitive and plural nouns to singular, and grammatical tagging was limited to three classes: verbs, particles, and others (nouns, adjectives, etc.).
We used standard 5-fold cross-validation for evaluation. Feature extraction was carried out on the same 5000 semantically untagged samples. The evaluation measure is the F-measure, which combines precision and recall. The reason for this choice is that precision or recall alone cannot demonstrate performance: if we tune the thresholds to improve one of them, we are driven toward settings where the other decreases. The F-measure is therefore the best measure for determining the thresholds.
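As a reminder of the measure being optimized, the F-measure (F1) is the harmonic mean of precision and recall, so improving one at the expense of the other yields little gain.

```python
def f_measure(precision, recall):
    """F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, precision 0.8 with recall 0.6 gives about 0.686, noticeably below either individual value, which is why thresholds tuned on F-measure must balance the two.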
It is worth mentioning that the threshold-tuning tests in each process showed that better results can be achieved by finding an optimal threshold for each homograph individually. But since that would reduce the practicality of the method, we used a single general threshold.
By comparing the results for different values of k1 in Equation (12), one general threshold can be determined for all six words. We first examined k1 = 2.5. For a window of size 5 this threshold is so high that in practice no occurrence passed it, as shown by the absence of any change relative to the ordinary decision list. Although k1 = 2.5 is the best threshold for “اعمال” (impose or acts), it cannot serve as a general and common threshold for all six words.
To select an appropriate value among the four modes k1 = 1, k1 = 1.5, k1 = 2, and Max, one can count the number of times each achieves the maximum value in each process.
The results obtained from
Experiments were also carried out for the seventh process, and the improvement percentage for each form of weighting is shown in
Evaluating
| Homograph | Ordinary decision list | Thresholding | Process 1 (u0 = 3) | Process 2 (verbs u0 = 500, letters u0 = 700, others u0 = 3) | Process 3 (u0 = 0.2) | Process 4 (u0 = 0.02) | Process 5 (verbs u0 = 0.01, letters u0 = 0.02, others u0 = 0.2) |
|---|---|---|---|---|---|---|---|
| اعمال (impose or acts) | 0.8643 | k1 = 1 | 0.8411 | 0.8312 | 0.8429 | 0.8677 | 0.8640 |
| | | k1 = 1.5 | 0.8526 | 0.8432 | 0.8526 | 0.8694 | 0.8601 |
| | | k1 = 2 | 0.8525 | 0.8474 | 0.8508 | 0.8710 | 0.8660 |
| | | k1 = 2.5 | 0.8643 | 0.8627 | 0.8694 | 0.8710 | 0.8769 |
| | | k1 = 3 | 0.8643 | 0.8643 | 0.8643 | 0.8643 | 0.8643 |
| | | Max | 0.8283 | 0.8116 | 0.8333 | 0.8682 | 0.8568 |
| دور (round or far) | 0.9212 | k1 = 1 | 0.9016 | 0.9107 | 0.9041 | 0.9359 | 0.9309 |
| | | k1 = 1.5 | 0.9150 | 0.9195 | 0.9158 | 0.9338 | 0.9359 |
| | | k1 = 2 | 0.9250 | 0.9288 | 0.9259 | 0.9355 | 0.9380 |
| | | k1 = 2.5 | 0.9212 | 0.9212 | 0.9212 | 0.9212 | 0.9212 |
| | | Max | 0.9042 | 0.9124 | 0.9100 | 0.9338 | 0.9326 |
| حسن (Hasan or goodness) | 0.9311 | k1 = 1 | 0.9219 | 0.9303 | 0.9265 | 0.9397 | 0.9336 |
| | | k1 = 1.5 | 0.9285 | 0.9277 | 0.9284 | 0.9379 | 0.9339 |
| | | k1 = 2 | 0.9304 | 0.9363 | 0.9329 | 0.9379 | 0.9363 |
| | | k1 = 2.5 | 0.9311 | 0.9311 | 0.9311 | 0.9311 | 0.9311 |
| | | Max | 0.9266 | 0.9279 | 0.9261 | 0.9385 | 0.9336 |
| گرم (hot or gram) | 0.9254 | k1 = 1 | 0.9146 | 0.9235 | 0.9181 | 0.9199 | 0.9384 |
| | | k1 = 1.5 | 0.9158 | 0.9239 | 0.9166 | 0.9259 | 0.9367 |
| | | k1 = 2 | 0.9270 | 0.9378 | 0.9305 | 0.9242 | 0.9447 |
| | | k1 = 2.5 | 0.9254 | 0.9254 | 0.9254 | 0.9254 | 0.9254 |
| | | Max | 0.9155 | 0.9320 | 0.9137 | 0.9256 | 0.9418 |
| اشراف (nobility or aware) | 0.8565 | k1 = 1 | 0.8602 | 0.8474 | 0.8759 | 0.8556 | 0.8780 |
| | | k1 = 1.5 | 0.8613 | 0.8540 | 0.8682 | 0.8548 | 0.8788 |
| | | k1 = 2 | 0.8614 | 0.8457 | 0.8623 | 0.8576 | 0.8723 |
| | | k1 = 2.5 | 0.8565 | 0.8565 | 0.8565 | 0.8565 | 0.8565 |
| | | Max | 0.8571 | 0.8436 | 0.8623 | 0.8485 | 0.8709 |
| نفس (breath or self) | 0.8936 | k1 = 1 | 0.9037 | 0.9118 | 0.9045 | 0.9084 | 0.9187 |
| | | k1 = 1.5 | 0.9038 | 0.9100 | 0.9021 | 0.9085 | 0.9181 |
| | | k1 = 2 | 0.9034 | 0.9071 | 0.9026 | 0.9051 | 0.9093 |
| | | k1 = 2.5 | 0.8936 | 0.8936 | 0.8936 | 0.8936 | 0.8936 |
| | | Max | 0.8985 | 0.9026 | 0.8960 | 0.9076 | 0.9152 |
| Thresholding | Process 1 | Process 2 | Process 3 | Process 4 | Process 5 | Total |
|---|---|---|---|---|---|---|
| k1 = 1 | 0 | 1 | 2 | 2 | 1 | 6 |
| k1 = 1.5 | 2 | 1 | 1 | 2 | 1 | 7 |
| k1 = 2 | 4 | 4 | 3 | 2 | 4 | 17 |
| Max | 0 | 0 | 0 | 0 | 0 | 0 |
F-measure improvement (%) for training corpora of size 1200 and 500. The process settings are: Process 1, u0 = 3; Process 2, verbs u0 = 500, letters u0 = 700, others u0 = 3; Process 3, u0 = 0.2; Process 4, u0 = 0.02; Process 5, verbs u0 = 0.01, letters u0 = 0.02, others u0 = 0.2.

| Homograph | Weighting | Size 1200: Proc. 1 | Proc. 2 | Proc. 3 | Proc. 4 | Proc. 5 | Size 500: Proc. 1 | Proc. 2 | Proc. 3 | Proc. 4 | Proc. 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| اعمال | Form 1 | −1.18% | −1.69% | −1.35% | 0.67% | 0.17% | 1.63% | 1.83% | 1.63% | 2.45% | 2.24% |
| | Form 2 | −0.93% | −1.69% | −1.01% | 0.25% | 0.25% | 1.43% | 1.63% | 1.43% | 2.24% | 2.24% |
| | Form 3 | −1.1% | −1.6% | −1.27% | 0.84% | 0.09% | 1.63% | 1.83% | 1.63% | 2.65% | 2.04% |
| دور | Form 1 | 0.38% | 0.76% | 0.47% | 1.43% | 1.68% | −1.92% | −1.51% | −1.31% | 2.15% | 1.14% |
| | Form 2 | 0.72% | 1.01% | 0.97% | 1.26% | 1.59% | −1.51% | −1.31% | −0.7% | 1.94% | 1.13% |
| | Form 3 | 0.55% | 0.84% | 0.63% | 1.68% | 1.85% | −1.92% | −1.51% | −1.31% | 2.35% | 1.14% |
| حسن | Form 1 | −0.07% | 0.52% | 0.18% | 0.68% | 0.52% | 0.1% | 0.11% | −0.1% | 0.61% | 0.11% |
| | Form 2 | −0.16% | 0.44% | 0.01% | 0.6% | 0.44% | 0.31% | 0.11% | 0.1% | 0.61% | −0.09% |
| | Form 3 | −0.07% | 0.52% | 0.09% | 0.68% | 0.43% | 0.1% | 0.11% | 0.1% | 0.61% | 0.52% |
| گرم | Form 1 | 0.16% | 1.24% | 0.51% | −0.12% | 1.93% | 8.84% | 10.15% | 8.94% | 8.55% | 10.97% |
| | Form 2 | 0.43% | 1.33% | 0.6% | 0.31% | 1.76% | 9.05% | 9.34% | 8.94% | 8.15% | 9.35% |
| | Form 3 | 0.16% | 1.24% | 0.51% | 0.49% | 2.18% | 8.84% | 10.15% | 8.94% | 8.37% | 10.97% |
| اشراف | Form 1 | 0.49% | −1.08% | 0.58% | 0.11% | 1.58% | 0.73% | −0.46% | 1.01% | 1.84% | 2.65% |
| | Form 2 | 0.58% | 0.79% | 0.58% | 0.74% | 1.58% | 0.73% | 1.01% | 1.01% | 1.62% | 2.44% |
| | Form 3 | 0.49% | −1.05% | 0.76% | 1.07% | 1.79% | 0.73% | −0.46% | 1.01% | 2.25% | 2.86% |
| نفس | Form 1 | 0.98% | 1.35% | 0.9% | 1.15% | 1.57% | 1.15% | 1.14% | 1.15% | 2.05% | 3% |
| | Form 2 | 1.07% | 1.35% | 1.23% | 1.32% | 1.74% | 0.73% | 1.15% | 0.73% | 1.84% | 2.37% |
| | Form 3 | 0.98% | 1.18% | 1.06% | 1.15% | 1.74% | 1.15% | 1.14% | 1.15% | 2.05% | 3% |
In general, it can be said that when a big POS-tagged corpus is available, the fifth process is the more appropriate solution; otherwise, the fourth process is.
Experiments Related to Changing the Corpus Size

The next experiment tests the effect of the proposed weighting of extracted collocations for different sizes of the semantically tagged corpus. The improvement results for the corpus with 500 training samples are shown above.
In another experiment, the corpus size is made even smaller in order to evaluate the effect of the proposed method at this size.
| Weighting form | Process 1 | Process 2 | Process 3 | Process 4 | Process 5 |
|---|---|---|---|---|---|
| Form 1 | 1 | 2 | 1 | 1 | 1 |
| Form 2 with weight 2.5 | 4 | 5 | 4 | 1 | 2 |
| Form 2 with weight 6 | 2 | 2 | 2 | 1 | 1 |
| Form 3 with weight 0.5 | 1 | 1 | 1 | 5 | 4 |
F-measure improvement (%) for the training corpus with size 50:

| Homograph | Weighting | Process 1 (u0 = 3) | Process 2 (verbs u0 = 500, letters u0 = 700, others u0 = 3) | Process 3 (u0 = 0.2) | Process 4 (u0 = 0.02) | Process 5 (verbs u0 = 0.01, letters u0 = 0.02, others u0 = 0.2) |
|---|---|---|---|---|---|---|
| اعمال | Form 1 | 1% | 1.01% | 1% | 0.53% | 1% |
| | Form 2 | 1% | 1.01% | 1% | 0.53% | 1% |
| | Form 3 | 1% | 1.01% | 1% | 0.53% | 1% |
| دور | Form 1 | 2.3% | 2.55% | 2.08% | 2.17% | 3.42% |
| | Form 2 | 2.3% | 2.55% | 2.08% | 1.95% | 3.21% |
| | Form 3 | 2.3% | 2.55% | 2.08% | 2.17% | 3.42% |
| حسن | Form 1 | 2.47% | 3.56% | 2.4% | 2.23% | 3.71% |
| | Form 2 | 2.02% | 3.11% | 1.95% | 1.78% | 3.27% |
| | Form 3 | 2.47% | 3.56% | 2.4% | 2.23% | 3.71% |
| گرم | Form 1 | 1.08% | 0.99% | 1.08% | 0% | 0.61% |
| | Form 2 | 1.08% | 0.99% | 1.08% | 0.69% | 1.06% |
| | Form 3 | 1.08% | 0.99% | 1.08% | 0% | 0.61% |
| اشراف | Form 1 | −0.01% | −0.21% | −0.23% | 0% | 1.71% |
| | Form 2 | 0.62% | 0.42% | 0.41% | 0% | 1.71% |
| | Form 3 | −0.01% | 0.42% | −0.23% | 0% | 1.71% |
| نفس | Form 1 | 1.9% | 1.9% | 1.9% | 1.55% | 2.21% |
| | Form 2 | 1.68% | 1.68% | 1.68% | 0.65% | 1.32% |
| | Form 3 | 1.9% | 1.9% | 1.9% | 1.55% | 2.21% |
A training corpus of size 50 produces a smaller trained decision list, and the collocations learned from such a corpus clearly have little diversity. Many of the collocations detected by the collocation-extraction step are therefore expected to be absent from the decision list and unable to have any effect. As a result, the improvement is lower than the results obtained with the larger corpora.
Finally, it is worth recalling that the untagged corpus used in these experiments included 5000 occurrences of each ambiguous word; greater improvements are expected if it becomes bigger. However, changing the corpus size may also change the general optimal thresholds (u0, k0).
This article has focused on the adverse impact of a small semantically tagged corpus on disambiguating the meanings of homograph words with supervised methods. The amount of tagged data required by supervised word sense disambiguation methods is much greater than in other machine learning tasks. This is due to the frequency of homograph words in natural languages and the need for disambiguation methods to train a separate classifier for each homograph. The proposed method therefore tries to improve the supervised algorithm using an untagged corpus.
Since collocations of homograph words in a text are the most important features used by the classifiers, a small corpus can reduce the performance of a disambiguation method. Smadja proposed a statistical approach for extracting collocations from an untagged corpus. This article revised and assessed the Smadja method in different processes and used it to weight the collocations in the supervised decision list algorithm. The weighting is based on the fact that collocations obtained from a big untagged corpus are more reliable than collocations extracted by a decision list that depends on a small tagged corpus. The evaluation results for six different homographs show that the proposed method yields improvements of 1 to 3 percent across the different homographs and processes.
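The core idea of weighting decision-list rules by collocation strength can be sketched as follows. This is a generic Yarowsky-style decision list with smoothed log-likelihood scores; the weight values and feature names are hypothetical, and the exact scoring in the paper (Equation (12) and the u0/k1 thresholds) is not reproduced here:

```python
import math

def decision_list(counts, weights, alpha=0.1):
    """Build a decision list for a two-sense homograph.

    counts:  {feature: (count_with_sense1, count_with_sense2)}
             from the small semantically tagged corpus.
    weights: {feature: weight} boosting collocations confirmed by the
             big untagged corpus (assumed weight 1.0 otherwise).
    Returns rules sorted by weighted, smoothed log-likelihood ratio.
    """
    rules = []
    for feat, (n1, n2) in counts.items():
        llr = abs(math.log((n1 + alpha) / (n2 + alpha)))  # smoothed LLR
        sense = 1 if n1 >= n2 else 2
        rules.append((weights.get(feat, 1.0) * llr, feat, sense))
    rules.sort(reverse=True)  # strongest evidence first
    return rules

# Hypothetical counts for a "shir"-like homograph (sense 1: milk, sense 2: lion):
rules = decision_list({"milk": (8, 0), "jungle": (0, 5)}, {"milk": 2.5})
print(rules)
```

At classification time, the first rule whose feature appears in the context window decides the sense, so boosting reliable collocations moves them ahead of noisier rules learned from the small corpus.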