Journal of Software Engineering and Applications, 2012, 5, 208-212
doi:10.4236/jsea.2012.512b040 Published Online December 2012 (http://www.SciRP.org/journal/jsea)
Copyright © 2012 SciRes. JSEA
Automatic Event Trigger Word Extraction in Chinese Event
Long Tian, W. Ma, Zhou Wen
School of Computer Engineering and Science, Shanghai University, Shanghai, China.
Email: hubu1931@163.com
Received 2012
ABSTRACT
As a basic unit of knowledge representation and an important means of organizing information,
the event has drawn the attention of a growing number of researchers. Event identification and
extraction is an important research topic in information extraction within natural language
processing, and the recognition and extraction of event trigger words plays a decisive role in it.
In this paper, the authors conduct experiments on the Chinese Event Corpus (CEC) and present a
method for automatically extracting event trigger words that combines an extended trigger word
table with machine learning. The experimental results show that the F-score of event trigger word
extraction reaches 71.2% with this method.
Keywords: Information Extraction; Event; Trigger Word; Trigger Word Table; Machine Learning
1. Introduction
The concept of an event is widely discussed in philosophy, cognitive psychology, linguistics,
and other fields; people hope to know and understand the world by studying events and the
relationships between them. In natural language processing, however, the study of events is
still in its infancy, both at home and abroad. With the rapid development of the Internet,
more and more people obtain the information they are interested in from the Internet, and this
demand is quietly driving the study of events in natural language processing. TDT (Topic
Detection and Tracking) [1], sponsored by DARPA (Defense Advanced Research Projects Agency),
aims to develop a series of event-based information organization technologies. ACE (Automatic
Content Extraction) [2-6], organized by NIST (National Institute of Standards and Technology),
includes event recognition and extraction as one of its evaluation tasks.
In the field of information processing, a growing number of researchers are paying attention
to event annotation. To annotate events in Chinese text, we must first study the taggability
of Chinese events. The taggability of an event has two aspects: 1) extraction of the event
trigger word; 2) division of the event boundary. The former plays the decisive role. On the one
hand, event trigger word extraction can be applied directly to the automatic identification of
event categories [7], and it is also the basis for studying the linguistic expression of event
classes [8]. On the other hand, because annotation systems differ, the standard for extracting
event trigger words is not uniform, which causes some confusion. For example, “离婚” (divorce)
is treated as the trigger word of “离婚女连骗糊涂男” in the ACE corpus, but most people think
that the main thrust of this sentence is “骗” (deceive), so “离婚” cannot be treated as the
trigger word. This confusion not only makes event annotation inconsistent, but also makes
evaluation tasks associated with events very difficult.
At present, two kinds of methods are used to extract event trigger words in information
extraction: statistics-based methods and rule-based methods. Statistics-based methods start
from concepts in statistics and computer science and rely on statistical processing of
large-scale corpora. For example, in [9], Fu Jianfeng draws the statistical conclusion that
event trigger words are mainly nouns, verbs, and gerunds. The statistics-based method is a
typical empirical method, and it is generally believed to yield sufficiently reliable
statistical results as long as the corpus is large enough and representative. In fact, because
of the non-ergodicity of statistics caused by the limitations of the corpus, this method cannot
guarantee that all results are correct. The rule-based method is a theoretical method: under
ideal conditions, rules can cover all linguistic phenomena, and the method can then be very
effective. But because rules are limited and linguistic phenomena are diverse and open-ended, only in
a very strictly controlled language environment can this method work.
In this paper, we combine these two methods and present a method for automatically extracting
event trigger words that combines an extended trigger word table with machine learning. The
experimental results show that this method effectively improves the F-score of trigger word
extraction.
2. Definitions
Definition 2-1 (Event) We define an event as something that happens at a certain time and in a
certain environment, in which some actors take part and which shows some action features.
Formally, an event e can be defined as a 6-tuple:
e = (A, O, T, V, P, L)
We call the elements of the 6-tuple event factors. A is the set of actions that happen in the
event. O is the set of objects that take part in the event, including all actors and entities
involved in it. T is the period of time over which the event lasts. V is the environment of the
event, including the natural environment and the social environment. P is the set of assertions
on the procedure of action execution in the event. L is the set of language expressions.
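As an illustration only, the 6-tuple can be written down as a small record type. The field names and types below are assumptions made for this sketch; the definition itself only fixes the six factors A, O, T, V, P, L.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Event:
    """One event e = (A, O, T, V, P, L). Field types are illustrative assumptions."""
    actions: Set[str] = field(default_factory=set)        # A: actions that happen in the event
    objects: Set[str] = field(default_factory=set)        # O: actors and entities involved
    time: str = ""                                        # T: period of time the event lasts
    environment: str = ""                                 # V: natural and social environment
    assertions: List[str] = field(default_factory=list)   # P: assertions on action execution
    language: List[str] = field(default_factory=list)     # L: language expressions


# A hypothetical earthquake event, filled in by hand.
quake = Event(actions={"地震"}, time="2008-05-12", environment="四川汶川",
              language=["发生了地震"])
```

Using default factories keeps every factor optional, which matches annotated text where some factors (for example T or V) are not mentioned at all.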
Definition 2-2 (Event Recognition) We define event recognition as finding events in a sentence
or text that contains an event directive [10].
Definition 2-3 (Event Trigger Word) An event trigger word is defined as the word that expresses
what happens in the text [11]. Under normal circumstances, the event trigger word is the main
verb of the sentence (it may also be a noun or a gerund). The event trigger word describes the
event directly. For example:
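Example 2-1 2008年5月12日, 四川汶川发生了地震. (On May 12, 2008, an earthquake struck Wenchuan, Sichuan.)
Example 2-2 截止目前, 该起事故已造成5人死亡, 9人受伤. (So far, the accident has left 5 people dead and 9 injured.)
Example 2-3 英国首相戈登·布朗于周五早晨抵达北京, 开始为期三天的正式访问. (British Prime Minister Gordon
Brown arrived in Beijing on Friday morning to begin a three-day official visit.)
Example 2-1 contains one event, whose trigger word is “地震” (earthquake, noun). Example 2-2
contains two events, whose trigger words are “死亡” (die, verb) and “受伤” (be injured, verb).
Example 2-3 contains one event, whose trigger word is “访问” (visit, gerund).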
Definition 2-4 (Event Trigger Word Extraction) We define event trigger word extraction as
extracting the event trigger word from a sentence or text that contains an event.
3. Event Trigger Word Extraction in Chinese Event
3.1. Extract Event Trigger Word Based on Extended Trigger Word Table
The method of extracting trigger words based on the trigger word table has the following steps:
first, construct an initial trigger word table from the CEC [12] corpus; second, expand the
initial trigger word table; third, construct a candidate trigger word set; last, calculate the
weight value of every element in the candidate set.
3.1.1. Construct initial trigger word table
We use the CEC corpus to construct the initial trigger word table. The structure of the table
is as follows:
(id, denoter, characteristic, denoterType, times, weight, synIndex)
id is the serial number of the trigger word; denoter is the event trigger word; characteristic
is its POS (part of speech); denoterType is the event type of the trigger word (for example,
the denoterType of “地震” is emergency, and the denoterType of “死亡” is stateChange); times is
the number of occurrences of the trigger word in the training corpus; weight is the weight of
the trigger word, equal to times divided by the total number of trigger words in the training
corpus; synIndex is the set of ids of the trigger word's synonyms.
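The table construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `TriggerRecord` class, the `build_initial_table` function, and the input format (one `(denoter, POS, denoterType)` triple per annotated trigger occurrence) are assumptions, and the toy annotations are not taken from CEC.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class TriggerRecord:
    id: int
    denoter: str
    characteristic: str              # POS tag
    denoter_type: str
    times: int                       # occurrences in the training corpus
    weight: float                    # times / total trigger occurrences
    syn_index: List[int] = field(default_factory=list)


def build_initial_table(annotations):
    """Count annotated trigger occurrences and derive each record's weight."""
    counts = Counter(annotations)
    total = sum(counts.values())
    table = []
    for new_id, ((denoter, pos, dtype), times) in enumerate(counts.items(), start=1):
        table.append(TriggerRecord(new_id, denoter, pos, dtype, times, times / total))
    return table


# Toy annotations (hypothetical): three occurrences of one trigger, one of another.
toy = [("地震", "n", "emergency")] * 3 + [("死亡", "v", "stateChange")]
table = build_initial_table(toy)
```

In this toy run the weights are 3/4 and 1/4, so they sum to 1 over the whole table, as the definition of weight implies.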
In the experiment, we choose 203 articles from CEC as training data. The training data is
divided into five categories: earthquakes, terrorist attacks, food poisoning, fires, and traffic
accidents. These data contain 3269 events and trigger words. The statistics are as follows:
CEC category        Number of articles   Number of events   Number of trigger words
earthquakes         45                   704                704
terrorist attacks   30                   490                490
food poisoning      43                   183                183
fires               31                   531                531
traffic accidents   54                   837                837
Total               203                  3269               3269
3.1.2. Extend trigger word table
Because the corpus is limited in scale, many important trigger words are not included in the
trigger word table, so the table needs to be extended. In this paper, we use 《Tongyici Cilin
(Extended Edition)》 to solve this problem. The specific steps are as follows: for every word in
the trigger word table, find all of its synonyms in 《Tongyici Cilin (Extended Edition)》; insert
them into the trigger word table; update the value of its synIndex. For example:
(95, ‘死亡’, ‘v’, ‘stateChange’, 44, 0.0127204, ‘3459,3460’)
……
(3459, ‘丧生’, ‘v’, ‘stateChange’, 44,
0.0127204, ‘95,3459’)
(3460, ‘丧命’, ‘v’, ‘stateChange’, 44, 0.0127204, ‘95,3460’)
The expanded trigger word table contains 9807
records.
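The expansion step can be sketched as below. The `synonyms` dictionary is a hypothetical stand-in for a lookup in the thesaurus, and the record layout (plain dicts keyed by the table's field names) is chosen only for this sketch. As in the example records above, a newly inserted synonym copies the POS, denoterType, times, and weight of its source word. What exactly goes into the synIndex of a new record is an assumption here: the sketch stores only the id of the source word.

```python
def extend_table(table, synonyms):
    """Append synonym records to `table` and link them through synIndex.

    table:    list of dict records with keys id, denoter, characteristic,
              denoterType, times, weight, synIndex
    synonyms: dict mapping a word to a list of its synonyms (hypothetical lookup)
    """
    known = {rec["denoter"] for rec in table}
    next_id = max(rec["id"] for rec in table) + 1
    for rec in list(table):                      # copy: we append while iterating
        for syn in synonyms.get(rec["denoter"], []):
            if syn in known:                     # already in the table, skip it
                continue
            new_rec = dict(rec, id=next_id, denoter=syn, synIndex=[rec["id"]])
            table.append(new_rec)
            rec["synIndex"].append(next_id)      # update the source word's synIndex
            known.add(syn)
            next_id += 1
    return table


table = [{"id": 95, "denoter": "死亡", "characteristic": "v",
          "denoterType": "stateChange", "times": 44, "weight": 0.0127204,
          "synIndex": []}]
extend_table(table, {"死亡": ["丧生", "丧命"]})
```

Ids in this toy run continue from the largest existing id, so the two synonyms receive ids 96 and 97; in the paper's table they are 3459 and 3460 because the initial table is much larger.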
3.1.3. Extract Event Trigger Word Based on Extended Trigger Word Table
Extracting event trigger words based on the trigger word table involves two processes:
constructing the candidate trigger word set and calculating weight values. First, we segment
the sentence and tag POS with a segmentation tool, then filter the resulting words and phrases,
keeping only nouns, verbs, and gerunds. This narrows the scope of the candidate trigger word
set. We then describe the set in the form
$W = \{(w_1, score_1), (w_2, score_2), \ldots, (w_k, score_k)\}$,
where $w$ stands for a candidate trigger word and $score$ stands for the word's weight value.
We adopt a method similar to TF*IDF to calculate the score. The calculation formula is as follows:
$score_i = TF(w_i) \times IDF(w_i)$
TF (term frequency) refers to the number of occurrences of the given word in the document. For
word $w_i$, its importance can be expressed as $TF(w_i) = n_i / N$, where $n_i$ is the number of
times $w_i$ appears in the document and $N$ is the total number of candidate trigger words in
the document. IDF (inverse document frequency) is a measure of a word's general importance;
here it is expressed as $IDF(w_i) = \log_2(\text{weight value of trigger word } w_i)$.
We then set a threshold and filter out candidate words whose weight value is less than the
threshold. Experiments show that this method obtains a high recall rate, but its precision rate
is relatively low.
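A literal reading of the scoring formulas above can be sketched as follows. The function name, the input format, and the decision to skip candidates that are missing from the trigger word table are assumptions of this sketch. Note also that table weights lie in (0, 1], so log2(weight) is non-positive and the scores produced by this literal reading are non-positive as well; the text does not say how the sign or scale is handled, so the threshold in the usage line is purely illustrative.

```python
import math
from collections import Counter


def score_candidates(candidates, weight_table, threshold):
    """Score POS-filtered candidate words of one document.

    candidates:   list of candidate words (nouns, verbs, gerunds), with repeats
    weight_table: dict mapping a trigger word to its weight in the trigger word table
    threshold:    candidates scoring below this value are dropped
    """
    total = len(candidates)                         # N: all candidate words
    counts = Counter(w for w in candidates if w in weight_table)
    scores = {}
    for word, n in counts.items():
        tf = n / total                              # TF(w_i) = n_i / N
        idf = math.log2(weight_table[word])         # IDF(w_i) = log2(weight), literal reading
        score = tf * idf
        if score >= threshold:
            scores[word] = score
    return scores


# Toy input with hypothetical weights chosen to keep the arithmetic exact.
scores = score_candidates(["地震", "发生", "地震", "四川"],
                          {"地震": 0.25, "发生": 0.5}, threshold=-2.0)
```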
3.2. Extract Event Trigger Word Based on Machine Learning
The method of extracting trigger words based on machine learning has the following steps.
First, segment the sentence and tag POS with a segmentation tool, then filter the resulting
words and phrases, keeping only nouns, verbs, and gerunds. Next, extract document features,
determine the feature words that represent them, and create the training set by building the
feature vector space. Then, obtain a machine learning model that can recognize event trigger
words by using the L-BFGS algorithm. Last, classify the test data set with the SVM model and
the ME model.
In this paper, based on the regularities of event trigger word occurrence and on experimental
results, we build the feature vector space from two types of linguistic features: those of the
trigger word itself and those of its context. The features adopted in this paper are the word
feature, lexical feature, syntactic feature, semantic feature, and contextual feature. These
features are described in the following table:
Feature name         Description
word feature         Regard the word itself as a feature.
lexical feature      Regard POS as a feature.
syntactic feature    Regard the dependency relation and the dependency relation direction as
                     features. Direction "1" means that the trigger word is the core word in
                     the dependency relation, and direction "2" means that the dependent word
                     is the core word in the dependency relation.
semantic feature     Regard the word's paraphrase in the dictionary as a feature.
contextual feature   Regard the word feature, lexical feature, and syntactic feature of the
                     x words on the left and the y words on the right as features.
A feature vector can be formally expressed as:
$v = \{(w_{i-x}, f_1(w_{i-x}), \ldots, f_k(w_{i-x})), \ldots, (w_{i+y}, f_1(w_{i+y}), \ldots, f_k(w_{i+y}))\}$
Here $w_i$ stands for the event indicator word (i.e., lexical features), and $f_j(w_i)$ stands
for the jth feature of $w_i$ (i.e., word feature, lexical feature, syntactic feature). x stands
for the number of words that precede the trigger word and have a dependency relation with it,
and y stands for the number of words that follow the trigger word and have a dependency
relation with it. Statistical results show that the best experimental results are obtained when
x = 3 and y = 2. Expanding the scope further does not significantly increase the amount of
information and costs unnecessary computation.
Example 3-1: 官兵很快赶到了20多公里外的重灾区 (The officers and soldiers quickly reached the
hard-hit area more than 20 kilometers away.)
Its feature vector is {(‘NULL’,‘NULL’,‘NULL’,‘NULL’,‘NULL’), (‘官兵’,‘n’,‘SBV’,‘1’,‘Ae10’),
(‘很快’,‘d’,‘ADV’,‘1’,‘Eb23’), (‘赶到’,‘v’,‘HED’,‘Hf08’), (‘了’,‘u’,‘MT’,‘1’,‘Kd05’),
(‘重灾区’,‘n’,‘VOB’,‘1’,‘Cb08’)}. Because only two words precede the trigger word and have a
dependency relation with it, the features of the third preceding word are all empty, and we
mark them as NULL.
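The windowing and NULL padding in Example 3-1 can be sketched as follows. The function name and input format are assumptions: the sketch expects the per-word feature tuples to have been produced already by the segmentation, dependency parsing, and dictionary lookup steps, and it pads missing right-hand context at the end of the vector, which the example does not show.

```python
NULL_FEATURES = ("NULL",) * 5   # placeholder for a missing context word


def build_feature_vector(before, trigger, after, x=3, y=2):
    """Assemble the context window around one candidate trigger word.

    before / after: feature tuples of the words that precede / follow the trigger
                    and have a dependency relation with it, in sentence order
    trigger:        feature tuple of the candidate trigger word itself
    """
    left = list(before)[max(0, len(before) - x):]      # nearest x preceding words
    left = [NULL_FEATURES] * (x - len(left)) + left    # pad on the far left
    right = list(after)[:y]                            # nearest y following words
    right = right + [NULL_FEATURES] * (y - len(right)) # pad on the far right
    return left + [trigger] + right


# Feature tuples copied from Example 3-1.
before = [("官兵", "n", "SBV", "1", "Ae10"), ("很快", "d", "ADV", "1", "Eb23")]
trigger = ("赶到", "v", "HED", "Hf08")
after = [("了", "u", "MT", "1", "Kd05"), ("重灾区", "n", "VOB", "1", "Cb08")]
vector = build_feature_vector(before, trigger, after)
```

With x = 3 and y = 2 the vector always has six entries, so every training instance has the same shape regardless of how many dependents the candidate word actually has.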
To validate the effect of this method on event trigger word extraction in the field of
emergencies, we conduct experiments on the CEC corpus, implemented in the Java programming
language. The ME algorithm used in the experiment comes from the open-source toolkit ME [16],
and the SVM classifier comes from the open-source toolkit LibSVM [17]. All parameters are set
to their default values.
3.3. Extract Event Trigger Word by Combining Extended Trigger Word Table and Machine Learning
The method of extracting event trigger words based on the extended trigger word table is a
statistics-based method; it obtains a high recall rate, but its precision rate is relatively
low. The method of extracting event trigger words based on machine learning is a rule-based
method; it obtains a high precision rate, but its recall rate is lower than that of the former
method. We now combine these two methods. The combination steps are as follows:
1) Set a threshold for the score. To reduce ambiguity, the threshold is generally relatively
high.
2) Construct the candidate trigger word set using the method of extracting event trigger words
based on the extended trigger word table.
3) If the candidate trigger word set contains a word whose score is greater than or equal to
the threshold, the word with the largest score is regarded as the trigger word.
4) If the candidate trigger word set contains no word whose score is greater than or equal to
the threshold, we determine the trigger word using ME or SVM, respectively.
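The four steps above amount to a threshold test with a classifier fallback, which can be sketched as follows. The function name and the `classify` callable are hypothetical: `classify` stands in for a trained ME or SVM model and is assumed to return the predicted trigger word for the given candidates.

```python
def extract_trigger(candidate_scores, threshold, classify):
    """Pick the trigger word for one sentence.

    candidate_scores: dict word -> score from the trigger-word-table method (step 2)
    threshold:        score threshold set in step 1
    classify:         fallback callable standing in for the ME/SVM model (step 4)
    """
    confident = {w: s for w, s in candidate_scores.items() if s >= threshold}
    if confident:
        # Step 3: some candidate reaches the threshold, take the highest score.
        return max(confident, key=confident.get)
    # Step 4: no candidate reaches the threshold, defer to the classifier.
    return classify(list(candidate_scores))


# Toy scores and a dummy classifier, both hypothetical.
picked_by_table = extract_trigger({"地震": 0.9, "发生": 0.4}, 0.8, lambda words: "发生")
picked_by_model = extract_trigger({"地震": 0.9, "发生": 0.4}, 0.95, lambda words: "发生")
```

A high threshold means the table decides only when it is confident, which is how the combination keeps the table's recall without inheriting all of its low precision.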
4. Analysis of Experiment Result
The experiment uses the common measures of precision rate P and recall rate R to evaluate the
quality of the extraction results. Because precision and recall reflect two different aspects,
both must be considered and neither can be neglected. Therefore, we use a further integrated
evaluation indicator, the F-score, whose formula is $F = 2PR / (P + R)$.
The following table shows the experimental results.
Experiment method             Recall rate   Precision rate   F-score
Extended trigger word table   0.82535       0.34215          0.48626
SVM                           0.42222       0.93442          0.58115
ME                            0.62962       0.80952          0.70833
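As a small worked check of the formula F = 2PR / (P + R), the sketch below recomputes the F-score of the ME row from its reported precision and recall. The function name is ours, and the guard for P + R = 0 is an added convention rather than something stated in the text.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0          # convention for the degenerate case
    return 2 * precision * recall / (precision + recall)


# ME row of the results table: P = 0.80952, R = 0.62962
me_f = f_score(0.80952, 0.62962)
```

The value obtained for the ME row is about 0.708, in line with the 0.70833 reported in the table.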
As the experimental results show, the method of extracting event trigger words based on the
extended trigger word table obtains a low precision rate, but it can solve the problem of low
recall that occurs when extracting event trigger words with machine learning. The method of
extracting event trigger words based on machine learning obtains a low recall rate, but it can
solve the problem of low precision that occurs when extracting event trigger words with the
extended trigger word table. When the two are combined, the F-score reaches 71.2%.
5. Acknowledgements
This work was supported by the National Natural Science Foundation (60975033), the National
Natural Science Foundation (61273328), and the International Network for Bamboo and Rattan
(INBAR) basic scientific research project (1632009006), and was sponsored by the Shanghai
University Graduate Innovation Fund (SHUCX120103).
REFERENCES
[1] S. A. Lowe, “The Beta-Binomial Mixture Model and Its Application to TDT Tracking and
Detection,” Proceedings of the DARPA Broadcast News Workshop, February 1999.
[2] ACE Pilot Study Task Definition [EB/OL]. 2007-09-28.
ftp://jaguar.ncsl.nist.gov/ace/phase1/edt_phase1_v2.2.pdf.
[3] ACE-2 Evaluation Plan RDC Guidelines v2.3 [EB/OL]. 2007-09-28.
ftp://jaguar.ncsl.nist.gov/ace/phase2/docs/RDC-Guidelines-v2.3.doc.
[4] ACE2003 Evaluation Plan v1 [EB/OL]. 2007-09-28.
ftp://jaguar.ncsl.nist.gov/ace/doc/ace_evalplan-2003.v1.pdf.
[5] ACE2004 Evaluation Plan v7 [EB/OL]. 2007-09-28.
http://www.nist.gov/speech/tests/ace/ace04/doc/ace04-evalplan-v7.pdf.
[6] ACE2005 Evaluation Plan v3 [EB/OL]. 2007-09-28.
http://www.nist.gov/speech/tests/ace/ace05/doc/ace05-evalplan.v3.pdf.
[7] Zhao Yanyan, et al., “Chinese event extraction technology research,” Journal of Chinese
Information, vol. 22, pp. 3-8, 2008.
[8] Liu Zongtian, Huang Meili, Zhou Wen, Zhong Zhaoman, Fu Jianfeng, Shan Jianfang, Zhi Huilai,
“Research on Event-oriented Ontology Model,” Computer Science, vol. 36, no. 11, pp. 189-192, 2009.
[9] Fu Jianfeng, “Research on Event-Oriented Knowledge Processing,” Shanghai University, 2010.
[10] Fu Jianfeng, et al., “Dependency Parsing Based Event Recognition,” Computer Science,
vol. 36, pp. 217-219, 2009.
[11] Linguistic Data Consortium, “ACE (Automatic Content Extraction) English Annotation
Guidelines for Events,” 2005.
[12] Fu Jianfeng, “Event-based Chinese corpus annotation method,” invention patent, State
Intellectual Property Office of the People's Republic of China, Application
No. 201010126360.8, 2010.
[13] Mei Jiaju, Zhu Yiming, “Tongyici Cilin,” Shanghai Dictionary Publishing House, 1983.
[14] http://www.ir-lab.org/.
[15] H. John, “Automatically acquiring a classification of words,” Paris, University of Leeds, 1994.
[16] Zhang Le, Maximum Entropy Modeling Toolkit for Python and C++,
http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html.
[17] C.-C. Chang and C.-J. Lin, A Library for Support Vector Machines,
http://www.csie.ntu.edu.tw/~cjlin/libsvm/