Journal of Software Engineering and Applications, 2012, 5, 208-212
doi:10.4236/jsea.2012.512b040 Published Online December 2012 (http://www.SciRP.org/journal/jsea)
Copyright © 2012 SciRes. JSEA

Automatic Event Trigger Word Extraction in Chinese Event

Long Tian, W. Ma, Zhou Wen
School of Computer Engineering and Science, Shanghai University, Shanghai, China.
Email: hubu1931@163.com
Received 2012

ABSTRACT
As a basic unit of knowledge representation and an important means of organizing information, the event has drawn growing attention. Event identification and extraction is an important research topic in information extraction within natural language processing, and the recognition and extraction of event trigger words plays a decisive role in it. In this paper, the authors experiment on the Chinese Event Corpus (CEC) and present a method for extracting event trigger words automatically that combines an extended trigger word table with machine learning. The experimental results show that the F-score of event trigger word extraction reaches 71.2% with this method.

Keywords: Information Extraction; Event; Trigger Word; Trigger Word Table; Machine Learning

1. Introduction
The concept of "event" is widely discussed in philosophy, cognitive psychology, linguistics and other fields; people hope to know and understand the world by studying events and the relationships between them. In natural language processing, however, the study of events is still in its infancy both at home and abroad. With the rapid development of the Internet, more and more people obtain the information they are interested in online, and this demand is quietly driving the study of events in natural language processing.
The goal of TDT (Topic Detection and Tracking) [1], which is held by DARPA (Defense Advanced Research Projects Agency), is to develop a series of event-based information organization technologies. ACE (Automatic Content Extraction) [2-6], held by NIST (National Institute of Standards and Technology), regards the recognition and extraction of events as one of its evaluation tasks.

In the field of information processing, a growing number of researchers pay attention to event annotation. To annotate events in Chinese text, we must first study the taggability of Chinese events. The taggability of events consists of two aspects: 1) extraction of the event trigger word; 2) division of the event boundary. The former plays the decisive role. On the one hand, extraction of event trigger words can be applied directly to automatic identification of event category [7], and it is also the basis for studying the language performance of event classes [8]. On the other hand, because annotation systems differ, the standard for extracting event trigger words is not uniform, which causes some confusion. For example, "离婚" (divorce) is treated as the trigger word of "离婚女连骗糊涂男" in the ACE corpus, but most people think that the main thrust of this sentence is "骗" (cheat), so "离婚" should not be treated as the trigger word. This confusion not only makes event annotation inconsistent but also makes event-related evaluation tasks very difficult.

At present, two kinds of method are used to extract event trigger words in information extraction: statistics-based methods and rule-based methods. Statistics-based methods start from concepts of statistics and computer science and work by statistical processing of large-scale corpora. For example, in [9] Fu Jianfeng draws the statistical conclusion that event trigger words are mainly nouns, verbs and gerunds.
Statistics-based methods are typical empirical methods, and it is generally believed that they yield sufficiently reliable statistical results as long as the corpus is large and typical enough. In fact, because any corpus is limited, the statistics are non-ergodic, and such a method cannot guarantee that all results are correct. Rule-based methods are theoretical methods: under ideal conditions the rules cover all linguistic phenomena, and the method is then very effective. But because rules are limited and linguistic phenomena are diverse and open-ended, such a method works only in a very strictly constrained language environment. In this paper, we combine the two kinds of method and present a method for extracting event trigger words automatically that combines an extended trigger word table with machine learning. The experimental results show that this method effectively improves the F-score of the extraction.

2. Definitions
Definition 2-1 (Event). We define an event as a thing that happens in a certain time and environment, in which some actors take part and which shows some action features. Formally, an event e can be defined as a 6-tuple:
e = (A, O, T, V, P, L)
We call the elements of the 6-tuple event factors. A is the set of actions that happen in the event. O is the objects that take part in the event, including all actors and entities involved in it. T is the period of time over which the event lasts. V is the environment of the event, including the natural environment and the social environment. P is the assertions on the procedure of action execution in the event. L is the language expressions.
Definition 2-2 (Event Recognition). We define event recognition as finding events in a sentence or text that contains an event directive [10].
Definition 2-3 (Event Trigger Word). An event trigger word is defined as the word that expresses what happens in the text [11].
Under normal circumstances, the event trigger word is the main verb of the sentence (it may also be a noun or a gerund). The event trigger word describes the event directly. For example:
Example 2-1 2008年5月12日, 四川汶川发生了地震. (On 12 May 2008, an earthquake struck Wenchuan, Sichuan.)
Example 2-2 截止目前, 该起事故已造成5人死亡, 9人受伤. (So far, the accident has left 5 people dead and 9 injured.)
Example 2-3 英国首相戈登·布朗于周五早晨抵达北京, 开始为期三天的正式访问. (British Prime Minister Gordon Brown arrived in Beijing on Friday morning to begin a three-day official visit.)
Example 2-1 contains one event, whose trigger word is "地震" (earthquake, noun). Example 2-2 contains two events, whose trigger words are "死亡" (die, verb) and "受伤" (be injured, verb). Example 2-3 contains an event, whose trigger word is "访问" (visit, gerund).
Definition 2-4 (Event Trigger Word Extraction). We define event trigger word extraction as extracting the event trigger words from a sentence or text that contains events.

3. Event Trigger Word Extraction in Chinese Event
3.1. Extract Event Trigger Word Based on Extended Trigger Word Table
The method of extracting trigger words based on a trigger word table has the following steps: first, construct an initial trigger word table from the CEC [12] corpus; second, expand the initial trigger word table; third, construct a candidate trigger word set; last, calculate the weight value of every element in the candidate set.
3.1.1. Construct initial trigger word table
We use the CEC corpus to construct the initial trigger word table. The structure of the table is as follows:
(id, denoter, characteristic, denoterType, times, weight, synIndex)
id is the serial number of the trigger word; denoter is the event trigger word; characteristic is its POS (part of speech); denoterType is the event type of the trigger word, e.g. the denoterType of 地震 is "emergency" and that of 死亡 is "stateChange"; times is the number of occurrences of the trigger word in the training corpus; weight is the weight of the trigger word, equal to times divided by the number of trigger words in the training corpus; synIndex is the set of ids of the trigger word's synonyms.
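The record layout above can be illustrated with a small sketch. It is a minimal illustration rather than the authors' implementation: the `TriggerRecord` class, the `build_initial_table` function and the toy annotations are hypothetical, and the weight is computed as described, i.e. a trigger word's count divided by the total number of trigger word occurrences.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TriggerRecord:
    """One row of the trigger word table (field names follow the paper)."""
    id: int
    denoter: str          # the trigger word itself
    characteristic: str   # part of speech
    denoterType: str      # event type of the trigger word
    times: int            # occurrences in the training corpus
    weight: float         # times / total trigger word occurrences
    synIndex: list = field(default_factory=list)  # ids of synonyms


def build_initial_table(annotations):
    """Build the table from (denoter, pos, event_type) annotations.

    `annotations` is a hypothetical flat list with one entry per annotated
    trigger word occurrence in the training corpus.
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    table = []
    for new_id, ((denoter, pos, etype), times) in enumerate(counts.items(), start=1):
        table.append(TriggerRecord(new_id, denoter, pos, etype, times, times / total))
    return table


# Toy data for illustration only, not taken from CEC.
toy = [("地震", "n", "emergency")] * 3 + [("死亡", "v", "stateChange")]
table = build_initial_table(toy)
```

The synonym ids in `synIndex` start out empty in this sketch; they would be filled in only when the table is extended with synonyms.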
In the experiment, we choose 203 articles of the CEC corpus as training data, divided into five categories: earthquakes, terrorist attacks, food poisoning, fires and traffic accidents. These data contain 3269 events and trigger words. The statistical results are as follows:

CEC category         Number of articles   Number of events   Number of trigger words
earthquakes          45                   704                704
terrorist attacks    30                   490                490
food poisoning       43                   183                183
fires                31                   531                531
traffic accidents    54                   837                837
Total                203                  3269               3269

3.1.2. Extend trigger word table
Because the corpus is limited in scale, many important trigger words are missing from the trigger word table, so the table needs to be extended. In this paper, we use 《Tongyici Cilin (Extended Edition)》 for this purpose. The specific steps are as follows: for every word in the trigger word table, find all its synonyms in 《Tongyici Cilin (Extended Edition)》; insert them into the trigger word table; update the value of its synIndex. For example:
(95, '死亡', 'v', 'stateChange', 44, 0.0127204, '3459,3460')
……
(3459, '丧生', 'v', 'stateChange', 44, 0.0127204, '95,3459')
(3460, '丧命', 'v', 'stateChange', 44, 0.0127204, '95,3460')
The expanded trigger word table contains 9807 records.

3.1.3. Extract Event Trigger Word Based on Extended Trigger Word Table
Extracting event trigger words based on the trigger word table consists of two processes: constructing the candidate trigger word set and calculating weight values. First, segment the sentences and tag POS with a segmentation tool, then filter the resulting collection of words and phrases, keeping only nouns, verbs and gerunds.
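The filtering step just described might look as follows. This is a sketch under stated assumptions: the input is taken to be already segmented and POS-tagged by some external tool, and the tag names `n`, `v` and `vn` (standing for noun, verb and gerund) are hypothetical placeholders, since the tagset is not specified here.

```python
# Hypothetical POS tags for noun, verb and gerund; the real tagset depends
# on the segmentation tool and is an assumption here.
KEPT_POS = {"n", "v", "vn"}


def filter_candidates(tagged_words, kept_pos=KEPT_POS):
    """Keep only (word, pos) pairs whose POS may carry a trigger word."""
    return [(word, pos) for word, pos in tagged_words if pos in kept_pos]


# A toy sentence with tags assigned by hand for illustration.
tagged = [("四川", "ns"), ("汶川", "ns"), ("发生", "v"), ("了", "u"), ("地震", "n")]
candidates = filter_candidates(tagged)
```

Here `candidates` keeps only "发生" and "地震", so later scoring has far fewer words to consider.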
Filtering by part of speech narrows the candidate trigger word set, which is then described in the form $W = \{(w_1, score_1), (w_2, score_2), \ldots, (w_k, score_k)\}$, where $w$ stands for a candidate trigger word and $score$ for the word's weight value. We adopt a method similar to TF * IDF to calculate the score. The calculation formula is as follows:
$score_i = TF(w_i) \times IDF(w_i)$
TF (term frequency) measures how often the given word occurs in the document. For word $w_i$ it is expressed as $TF(w_i) = n_i / N$, where $n_i$ is the number of occurrences of $w_i$ in the document and $N$ is the total number of candidate trigger words in the document. IDF (inverse document frequency) is a measure of a word's general importance; here it is expressed as $IDF(w_i) = \log_2(\text{weight value of the trigger word})$. We then set a threshold and filter out the candidate words whose score is less than the threshold. Experiments show that the method obtains a high recall rate, but its precision rate is relatively low.

3.2. Extract Event Trigger Word Based on Machine Learning
The method of extracting trigger words based on machine learning has the following steps. First, segment the sentences and tag POS with a segmentation tool, then filter the resulting collection of words and phrases, keeping only nouns, verbs and gerunds. Next, extract document features, determine the feature words that represent them, and create the training set by building feature vectors. Then obtain a machine learning model that can recognize event trigger words by using the L-BFGS algorithm. Last, classify the test data set with the SVM model and the ME model. In this paper, according to the regularities with which event trigger words occur and to the experimental results, we build the feature vector from two types of linguistic feature: those of the trigger word itself and those of its context.
The features adopted in this paper are the word feature, lexical feature, syntactic feature, semantic feature and contextual feature. They are described in the following table:

Feature name         Description
word feature         The word itself.
lexical feature      The POS of the word.
syntactic feature    The dependency relation and its direction. Direction "1" means that the trigger word is the core word of the dependency relation; direction "2" means that the dependent word is the core word.
semantic feature     The word's paraphrase in the dictionary.
contextual feature   The word, lexical and syntactic features of the x words on the left and the y words on the right.

The feature vector can be formally expressed as:
$v = \{(w_{i-x}, f_1(w_{i-x}), \ldots, f_k(w_{i-x})), \ldots, (w_{i+y}, f_1(w_{i+y}), \ldots, f_k(w_{i+y}))\}$
Here $w_i$ stands for the event indicator word itself (the word feature), $f_j(w_i)$ stands for the $j$-th further feature of $w_i$ (lexical, syntactic and semantic features), $x$ stands for the number of words that precede the trigger word and have a dependency relation with it, and $y$ stands for the number of words that follow the trigger word and have a dependency relation with it. Statistical results show that we obtain the best experimental result when $x = 3$ and $y = 2$. Expanding the scope further does not increase the amount of information significantly and costs unnecessary computation.
Example 3-1: 官兵很快赶到了20多公里外的重灾区 (Officers and soldiers quickly arrived at the worst-hit area more than 20 kilometers away.)
Its feature vector is {('NULL','NULL','NULL','NULL','NULL'), ('官兵','n','SBV','1','Ae10'), ('很快','d','ADV','1','Eb23'), ('赶到','v','HED','Hf08'), ('了','u','MT','1','Kd05'), ('重灾区','n','VOB','1','Cb08')}. Because only two words before the trigger word have a dependency relation with it, the features of the third word before the trigger word are all empty, and we mark them as "NULL".
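The "NULL" padding in Example 3-1 can be sketched as below. This is a simplified illustration, not the authors' code: the function name `build_feature_vector` and the input layout are hypothetical, and the two context lists are assumed to already contain only the words that have a dependency relation with the trigger word.

```python
NULL = ("NULL",) * 5  # placeholder tuple for a missing context word


def build_feature_vector(before, trigger, after, x=3, y=2):
    """Return x left-context tuples, the trigger tuple and y right-context tuples.

    `before` and `after` hold the feature tuples of the words that precede or
    follow the trigger word, in sentence order. Missing positions are padded
    with NULL so that the vector always has x + 1 + y entries.
    """
    left = list(before[-x:]) if x > 0 else []
    left = [NULL] * (x - len(left)) + left
    right = list(after[:y])
    right = right + [NULL] * (y - len(right))
    return left + [trigger] + right


# Feature tuples copied from Example 3-1.
before = [("官兵", "n", "SBV", "1", "Ae10"), ("很快", "d", "ADV", "1", "Eb23")]
trigger = ("赶到", "v", "HED", "Hf08")
after = [("了", "u", "MT", "1", "Kd05"), ("重灾区", "n", "VOB", "1", "Cb08")]
vector = build_feature_vector(before, trigger, after)
```

With the default window of three words before and two after, the result has six entries and starts with one all-"NULL" tuple, as in the example.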
To validate the effect of this method on extracting event trigger words in the field of emergencies, we run experiments on the CEC corpus using the Java programming language. The ME algorithm in the experiment comes from the open-source ME tool package [16], and the SVM classifier comes from the open-source tool package LibSVM [17]. All parameters are set to their default values.

3.3. Extract Event Trigger Word that Combines Extended Trigger Word Table and Machine Learning
The method of extracting event trigger words based on the extended trigger word table is a statistics-based method; it obtains a high recall rate, but its precision rate is relatively low. The method of extracting event trigger words based on machine learning is a kind of rule-based method; it obtains a high precision rate, but its recall rate is lower than that of the former method. We now combine the two methods. The combination steps are as follows:
1) Set a threshold for the score; to reduce ambiguity, the threshold is generally relatively high.
2) Construct the candidate trigger word set using the method based on the extended trigger word table.
3) If the candidate trigger word set contains words whose score is greater than or equal to the threshold, the word with the largest score is regarded as the trigger word.
4) If no word in the candidate trigger word set has a score greater than or equal to the threshold, we determine the trigger word using ME or SVM, respectively.

4. Analysis of Experiment Result
The experiment uses the common measures of precision rate P and recall rate R to evaluate the quality of the extraction results. Precision and recall reflect two different aspects, so both must be considered and neither can be neglected. Therefore, we also use an integrated evaluation indicator, the F-score.
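How precision, recall and the F-score are obtained from raw counts can be sketched as follows. This is a minimal illustration under the usual definitions, with the F-score taken as 2PR / (P + R); the function name `prf` and the toy counts are hypothetical and are not the counts behind the reported results.

```python
def prf(num_correct, num_extracted, num_gold):
    """Return (precision, recall, F-score) for trigger word extraction.

    num_correct:   extracted trigger words that match the annotation
    num_extracted: all trigger words the method extracted
    num_gold:      all annotated trigger words in the test data
    """
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy counts for illustration only.
p, r, f = prf(num_correct=6, num_extracted=8, num_gold=12)
```

Because the F-score is a harmonic mean, it is pulled toward the lower of the two rates, which is why a method with very high recall but low precision still scores poorly.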
The F-score is defined as F-score = 2PR / (P + R). The following table shows the experimental results.

Experiment method              Recall rate   Precision rate   F-score
Extended trigger word table    0.82535       0.34215          0.48626
SVM                            0.42222       0.93442          0.58115
ME                             0.62962       0.80952          0.70833

As the experimental results show, the method of extracting event trigger words based on the extended trigger word table obtains a low precision rate, but it solves the problem of low recall that occurs when extracting event trigger words based on machine learning. The method based on machine learning obtains a low recall rate, but it solves the problem of low precision that occurs when extracting event trigger words based on the extended trigger word table. When the two are combined, the F value reaches 71.2%.

5. Acknowledgements
This work was supported by the National Natural Science Foundation (60975033), the National Natural Science Foundation (61273328) and the International Network for Bamboo and Rattan (INBAR) basic scientific research project (1632009006), and was sponsored by the Shanghai University Graduate Innovation Fund (SHUCX120103).

REFERENCES
[1] S. A. Lowe, "The Beta-Binomial Mixture Model and Its Application to TDT Tracking and Detection," Proceedings of the DARPA Broadcast News Workshop, February 1999.
[2] ACE Pilot Study Task Definition [EB/OL]. [2007-09-28]. ftp://jaguar.ncsl.nist.gov/ace/phase1/edt_phase1_v2.2.pdf.
[3] ACE-2 Evaluation Plan RDC Guidelines v2.3 [EB/OL]. [2007-09-28]. ftp://jaguar.ncsl.nist.gov/ace/phase2/docs/RDC-Guidelines-v2.3.doc.
[4] ACE2003 Evaluation Plan v1 [EB/OL]. [2007-09-28]. ftp://jaguar.ncsl.nist.gov/ace/doc/ace_evalplan-2003.v1.pdf.
[5] ACE2004 Evaluation Plan v7 [EB/OL]. [2007-09-28]. http://www.nist.gov/speech/tests/ace/ace04/doc/ace04-evalplan-v7.pdf.
[6] ACE2005 Evaluation Plan v3 [EB/OL]. [2007-09-28]. http://www.nist.gov/speech/tests/ace/ace05/doc/ace05-evalplan.v3.pdf.
[7] Zhao Yanyan, et al., "Chinese Event Extraction Technology Research," Journal of Chinese Information, vol. 22, pp. 3-8, 2008.
[8] Liu Zongtian, Huang Meili, Zhou Wen, Zhong Zhaoman, Fu Jianfeng, Shan Jianfang, Zhi Huilai, "Research on Event-oriented Ontology Model," Computer Science, 2009, 36(11): 189-192.
[9] Fu Jianfeng, "Research on Event-Oriented Knowledge Processing," Shanghai University, 2010.
[10] Fu Jianfeng, et al., "Dependency Parsing Based Event Recognition," Computer Science, vol. 36, pp. 217-219, 2009.
[11] Linguistic Data Consortium, ACE (Automatic Content Extraction) English Annotation Guidelines for Events, 2005.
[12] Fu Jianfeng, "Event-based Chinese Corpus Annotation Method," Invention patent, State Intellectual Property Office of the People's Republic of China, App. No. 201010126360.8, 2010.
[13] Mei Jiaju, Zhu Yiming, Tongyici Cilin, Shanghai Dictionary Publishing House, 1983.
[14] http://www.ir-lab.org/.
[15] H. John, Automatically Acquiring a Classification of Words, Paris: University of Leeds, 1994.
[16] Zhang Le, Maximum Entropy Modeling Toolkit for Python and C++, http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html.
[17] C.-C. Chang and C.-J. Lin, A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/