Automatic Event Trigger Word Extraction in Chinese Event

doi:10.4236/jsea.2012.512B040


Journal of Software Engineering and Applications, 2012, 5, 208-212

doi:10.4236/jsea.2012.512B040 Published Online December 2012 (http://www.SciRP.org/journal/jsea)


Long Tian, W. Ma, Zhou Wen

School of Computer Engineering and Science, Shanghai University, Shanghai, China.

Email: hubu1931@163.com

Received 2012

ABSTRACT

As a basic unit of knowledge representation and an important means of organizing information, the event has drawn growing attention. Event identification and extraction is an important research topic in information extraction within natural language processing, and the recognition and extraction of event trigger words plays a decisive role in it. In this paper, the authors carry out experiments on the Chinese Event Corpus (CEC) and present a method for automatically extracting event trigger words that combines an extended trigger word table with machine learning. The experimental results show that the F-score of event trigger word extraction can reach 71.2% with this method.

Keywords: Information Extraction; Event, Trigger Word; Trigger Word Table; Machine Learning

1. Introduction

The concept of “event” is widely discussed in philosophy, cognitive psychology, linguistics and other fields; people hope to know and understand the world by studying events and the relationships between them. In the field of natural language processing, however, the study of events is still in its infancy both at home and abroad. With the rapid development of the Internet, more and more people get the information they are interested in from the Internet, and this demand is quietly driving the study of events in natural language processing. TDT (Topic Detection and Tracking) [1], organized by DARPA (Defense Advanced Research Projects Agency), aims to develop a series of event-based information organization technologies. ACE (Automatic Content Extraction) [2-6], organized by NIST (National Institute of Standards and Technology), regards event recognition and extraction as one of its evaluation tasks.

In the field of information processing, a growing number of researchers pay attention to event annotation. To annotate events in Chinese text, we must first study the taggability of Chinese events. The taggability of events consists of two aspects: 1) extraction of the event trigger word; 2) division of the event boundary. The former plays the decisive role. On the one hand, extraction of event trigger words can be applied directly to the automatic identification of event categories [7], and it is also the basis for studying the linguistic realization of event classes [8]. On the other hand, because annotation systems differ, the standard for extracting event trigger words is not uniform, which causes some confusion. For example, “离婚” (divorce) is treated as the trigger word of “离婚女连骗糊涂男” in the ACE corpus, but most people think that the main thrust of this sentence is “骗” (cheat), so “离婚” should not be treated as the trigger word. This confusion not only makes event annotation inconsistent but also makes event-related evaluation tasks very difficult.

At present, two kinds of methods are used to extract event trigger words in information extraction: statistics-based methods and rule-based methods. Statistics-based methods draw on statistics and computer science and work by statistically processing large-scale corpora. For example, in [9] Fu Jianfeng draws the statistical conclusion that event trigger words are mainly nouns, verbs and gerunds. The statistics-based method is a typical empirical method, and it is generally believed to yield sufficiently reliable statistics as long as the corpus is large and representative enough. In fact, because a limited corpus makes the statistics non-ergodic, this method cannot guarantee that all results are correct. The rule-based method is a theoretical method: under ideal conditions the rules cover all linguistic phenomena, and the method is then very effective. However, because rules are limited while linguistic phenomena are diverse and open-ended, this method works only in a very strict language environment.

In this paper, we combine these two kinds of methods and present a method for automatically extracting event trigger words that combines an extended trigger word table with machine learning. The experimental results show that this method effectively improves the F-score of trigger word extraction.

2. Definitions

Definition 2-1 (Event). We define an event as something that happens at a certain time and in a certain environment, in which some actors take part and which shows some action features. Formally, an event e can be defined as a 6-tuple:

e = (A, O, T, V, P, L)

We call the elements of the 6-tuple event factors. A is the set of actions that happen in the event. O is the set of objects that take part in the event, including all actors and entities involved. T is the period of time over which the event lasts. V is the environment of the event, including the natural environment and the social environment. P is the set of assertions on the procedure of action execution in the event. L is the set of language expressions.
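The 6-tuple of Definition 2-1 can be pictured as a simple record type. The following is only a minimal sketch: the class name, the field names and the choice of lists and strings as field types are assumptions made for illustration and do not come from the paper, and the sample values are filled in by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Hypothetical container for the 6-tuple e = (A, O, T, V, P, L)."""
    actions: List[str] = field(default_factory=list)     # A: actions that happen in the event
    objects: List[str] = field(default_factory=list)     # O: actors and entities involved
    time: str = ""                                       # T: period over which the event lasts
    environment: str = ""                                # V: natural and social environment
    assertions: List[str] = field(default_factory=list)  # P: assertions on how the actions are executed
    language: List[str] = field(default_factory=list)    # L: language expressions, e.g. trigger words


# Hand-filled illustration for an earthquake report; not produced by any tool.
quake = Event(
    actions=["地震"],
    time="2008年5月12日",
    environment="四川汶川",
    language=["地震"],
)
```

Keeping every factor optional reflects that a sentence rarely mentions all six factors at once.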

Definition 2-2 (Event Recognition). We define event recognition as finding events in a sentence or text that contains event directives [10].

Definition 2-3 (Event Trigger Word). An event trigger word is defined as the word that expresses what happens in the text [11]. Under normal circumstances, the event trigger word is the main verb of the sentence (it may also be a noun or a gerund). The event trigger word describes the event directly. For example:

Example 2-1 2008年5月12日, 四川汶川发生了地震. (On May 12, 2008, an earthquake struck Wenchuan, Sichuan.)

Example 2-2 截止目前, 该起事故已造成5人死亡, 9人受伤. (So far, the accident has left 5 people dead and 9 injured.)

Example 2-3 英国首相戈登·布朗于周五早晨抵达北京, 开始为期三天的正式访问. (British Prime Minister Gordon Brown arrived in Beijing on Friday morning to begin a three-day official visit.)

Example 2-1 contains one event, whose trigger word is “地震” (earthquake, noun). Example 2-2 contains two events, whose trigger words are “死亡” (die, verb) and “受伤” (be injured, verb). Example 2-3 contains one event, whose trigger word is “访问” (visit, gerund).

Definition 2-4 (Event Trigger Word Extraction). We define event trigger word extraction as extracting the event trigger words from a sentence or text that contains events.

3. Event Trigger Word Extraction in Chinese Event

3.1. Extract Event Trigger Word Based on Extended Trigger Word Table

The method of extracting trigger words based on the trigger word table has four main steps: first, construct an initial trigger word table from the CEC [12] corpus; second, extend the initial trigger word table; third, construct a candidate trigger word set; finally, calculate the weight value of every element in the candidate set.

3.1.1. Construct initial trigger word table

The initial trigger word table is constructed from the CEC corpus. Each record in the table has the following structure:

(id, denoter, characteristic, denoterType, times, weight, synIndex)

Here id is the serial number of the trigger word; denoter is the event trigger word itself; characteristic is its POS (part of speech); denoterType is the event type of the trigger word, e.g. the denoterType of 地震 is “emergency” and that of 死亡 is “stateChange”; times is the number of occurrences of the trigger word in the training corpus; weight is the weight of the trigger word, equal to times divided by the total number of trigger words in the training corpus; and synIndex is the set of ids of the trigger word's synonyms.
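To make the record structure and the weight definition concrete, here is a minimal sketch of how such a table could be built from annotated trigger occurrences. The function name `build_initial_table`, the input format and the toy annotations are assumptions for illustration; only the field names and the rule weight = times / total come from the text.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TriggerRecord:
    id: int                     # serial number of the trigger word
    denoter: str                # the trigger word itself
    characteristic: str         # part of speech
    denoterType: str            # event type of the trigger word
    times: int                  # occurrences in the training corpus
    weight: float               # times / total number of trigger occurrences
    synIndex: List[int] = field(default_factory=list)  # ids of synonyms


def build_initial_table(annotated: List[Tuple[str, str, str]]) -> Dict[str, TriggerRecord]:
    """annotated holds one (word, POS, event type) triple per annotated trigger occurrence."""
    counts = Counter(word for word, _, _ in annotated)
    total = len(annotated)
    table: Dict[str, TriggerRecord] = {}
    for word, pos, event_type in annotated:
        if word not in table:  # first occurrence creates the record
            table[word] = TriggerRecord(
                id=len(table) + 1, denoter=word, characteristic=pos,
                denoterType=event_type, times=counts[word],
                weight=counts[word] / total,
            )
    return table


# Toy annotations (made up): one occurrence of 地震, three of 死亡.
toy = [("地震", "n", "emergency")] + [("死亡", "v", "stateChange")] * 3
table = build_initial_table(toy)
```

With this definition the weights of all records sum to 1, so each weight can be read as the relative frequency of a trigger word among all annotated triggers.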

In the experiment, we choose 203 articles of CEC as training data. The training data is divided into five categories: earthquakes, terrorist attacks, food poisoning, fires and traffic accidents. These data contain 3269 events and trigger words. The statistics are as follows:

CEC categories      Number of articles   Number of events   Number of trigger words
earthquakes         45                   704                704
terrorist attacks   30                   490                490
food poisoning      43                   183                183
fires               31                   531                531
traffic accidents   54                   837                837
Total               203                  3269               3269

3.1.2. Extend trigger word table

Because the corpus is limited in scale, many important trigger words are not included in the trigger word table, so the table needs to be extended. In this paper, we use《Tongyici Cilin (Extended Edition)》to solve this problem. The specific steps are as follows: for every word in the trigger word table, find all of its synonyms in《Tongyici Cilin (Extended Edition)》; insert them into the trigger word table; update the value of its synIndex. For example:

(95, '死亡', 'v', 'stateChange', 44, 0.0127204, '3459,3460')

……

(3459, '丧生', 'v', 'stateChange', 44, 0.0127204, '95,3459')

(3460, '丧命', 'v', 'stateChange', 44, 0.0127204, '95,3460')

The extended trigger word table contains 9807 records.
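The extension step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `extend_table` is a hypothetical helper, the `synonyms` dictionary stands in for a lookup in Tongyici Cilin, new ids are simply assigned after the largest existing id, and each synonym is linked only to the word it was derived from. As in the example records, a new synonym inherits the POS, event type, times and weight of its source word.

```python
from typing import Dict, List


def extend_table(table: Dict[int, dict], synonyms: Dict[str, List[str]]) -> Dict[int, dict]:
    """Return a copy of the table with synonym records added and synIndex updated."""
    extended = {rid: dict(rec, synIndex=list(rec["synIndex"])) for rid, rec in table.items()}
    known = {rec["denoter"]: rid for rid, rec in extended.items()}
    next_id = max(extended) + 1
    for rid in list(table):
        source = extended[rid]
        for syn in synonyms.get(source["denoter"], []):
            if syn not in known:
                # The new record copies the source word's attributes.
                extended[next_id] = dict(source, id=next_id, denoter=syn, synIndex=[])
                known[syn] = next_id
                next_id += 1
            syn_id = known[syn]
            if syn_id == rid:
                continue  # a word is not listed as its own synonym in this sketch
            if syn_id not in source["synIndex"]:
                source["synIndex"].append(syn_id)
            if rid not in extended[syn_id]["synIndex"]:
                extended[syn_id]["synIndex"].append(rid)
    return extended


initial = {95: {"id": 95, "denoter": "死亡", "characteristic": "v",
                "denoterType": "stateChange", "times": 44,
                "weight": 0.0127204, "synIndex": []}}
toy_synonyms = {"死亡": ["丧生", "丧命"]}  # stand-in for a thesaurus lookup
extended = extend_table(initial, toy_synonyms)
```

Returning a new dictionary instead of mutating the input keeps the initial table available for comparison with the extended one.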

3.1.3. Extract Event Trigger Word Based on Extended Trigger Word Table

Extracting event trigger words based on the trigger word table involves two processes: constructing the candidate trigger word set and calculating weight values. First, segment the sentence into words and tag POS with a segmentation tool, then filter the resulting words and phrases, keeping only nouns, verbs and gerunds. This narrows the scope of the candidate trigger word set. The set is then described in the form $W = \{(w_1, score_1), (w_2, score_2), \ldots, (w_k, score_k)\}$, where $w_i$ is a candidate trigger word and $score_i$ is its weight value. We adopt a method similar to TF*IDF to calculate the score:

$score_i = TF(w_i) \times IDF(w_i)$

TF (term frequency) refers to the number of occurrences of a given word in the document. For a word $w_i$, its importance can be expressed as $TF(w_i) = n_i / N$, where $n_i$ is the number of times $w_i$ appears in the document and $N$ is the total number of candidate trigger words in the document. IDF (inverse document frequency) is a measure of a word's general importance; here it is expressed as $IDF(w_i) = \log_2(weight_i)$, where $weight_i$ is the weight value of $w_i$ in the trigger word table.

We can then set a threshold and filter out the candidate words whose score is below the threshold. Experiments show that this method obtains a high recall rate, but its precision rate is relatively low.
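The scoring described above can be sketched literally as follows. Everything named here is an assumption for illustration: the POS tag names in `KEPT_POS`, the helper names, the toy sentence tags and the toy weight of 0.25. Words that are absent from the trigger word table have no weight and are therefore left unscored in this sketch. Because table weights are below 1, the logarithm as printed is negative, so the threshold in this sketch lives on that non-positive scale.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

KEPT_POS = {"n", "v", "vn"}  # assumed tag names for noun, verb and gerund


def score_candidates(tagged: List[Tuple[str, str]], weights: Dict[str, float]) -> Dict[str, float]:
    """score_i = TF(w_i) * IDF(w_i), with TF = n_i / N and IDF = log2(table weight)."""
    candidates = [w for w, pos in tagged if pos in KEPT_POS]
    counts = Counter(candidates)
    total = len(candidates)  # N: all candidate trigger words in the document
    return {w: (n / total) * math.log2(weights[w])
            for w, n in counts.items() if w in weights}


def filter_by_threshold(scores: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Keep only candidates whose score reaches the threshold."""
    return {w: s for w, s in scores.items() if s >= threshold}


tagged = [("官兵", "n"), ("很快", "d"), ("赶到", "v"), ("了", "u"), ("重灾区", "n")]
scores = score_candidates(tagged, {"赶到": 0.25})  # toy weight, not from the corpus
kept = filter_by_threshold(scores, threshold=-1.0)
```

In the toy call three words survive the POS filter, so the single table word gets TF = 1/3 and IDF = log2(0.25) = -2.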

3.2. Extract Event Trigger Word Based on Machine Learning

The method of extracting trigger words based on machine learning has the following main steps. First, segment the sentence into words and tag POS with a segmentation tool, then filter the resulting words and phrases, keeping only nouns, verbs and gerunds. Next, extract document features, determine the feature words that represent them, and create the training set by building feature vectors. Then, obtain a machine learning model that can recognize event trigger words by using the L-BFGS algorithm. Finally, classify the test data set with the SVM model and the ME model.

In this paper, according to the regularities of event trigger word occurrence as well as the experimental results, we build the feature vector from two types of linguistic features: those of the trigger word itself and those of its context. The features adopted in this paper are the word feature, lexical feature, syntactic feature, semantic feature and contextual feature. They are described in the following table:

Feature name         Description
word feature         The word itself is taken as a feature.
lexical feature      The POS is taken as a feature.
syntactic feature    The dependency relation and its direction are taken as features; direction "1" means that the trigger word is the core word in the dependency relation, and direction "2" means that the dependent word is the core word.
semantic feature     The word's paraphrase in the dictionary is taken as a feature.
contextual feature   The word feature, lexical feature and syntactic feature of the x words on the left and the y words on the right are taken as features.

The feature vector can be formally expressed as:

$\nu = \{(w_{i-x}, f_1(w_{i-x}), \ldots, f_k(w_{i-x})), \ldots, (w_{i+y}, f_1(w_{i+y}), \ldots, f_k(w_{i+y}))\}$

where $w_i$ stands for the event indicator word, $f_j(w_i)$ stands for the $j$th feature of $w_i$ (i.e., word feature, lexical feature or syntactic feature), $x$ stands for the number of words that precede the trigger word and have a dependency relation with it, and $y$ stands for the number of words that follow the trigger word and have a dependency relation with it. Statistical results show that the best experimental result is obtained when $x = 3$ and $y = 2$. Expanding the scope further does not increase the amount of information significantly and costs unnecessary computation.

Example 3-1: 官兵很快赶到了20多公里外的重灾区 (The officers and soldiers quickly reached the hardest-hit area more than 20 kilometers away.)

Its feature vector is {('NULL','NULL','NULL','NULL','NULL'), ('官兵','n','SBV','1','Ae10'), ('很快','d','ADV','1','Eb23'), ('赶到','v','HED','Hf08'), ('了','u','MT','1','Kd05'), ('重灾区','n','VOB','1','Cb08')}. Because only two words precede the trigger word and have a dependency relation with it, the features of the third word before the trigger word are all empty, and we mark them as “NULL”.
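The window construction in this example can be sketched as below. The helper `build_feature_vector` is hypothetical, and it assumes that the per-word feature tuples have already been produced by a segmenter and dependency parser. Padding missing positions on the right with NULL is also an assumption; the example in the text only shows padding on the left.

```python
from typing import List, Tuple

Features = Tuple[str, ...]
NULL: Features = ("NULL",) * 5  # placeholder for a missing context word


def build_feature_vector(before: List[Features], trigger: Features,
                         after: List[Features], x: int = 3, y: int = 2) -> List[Features]:
    """Take up to x dependent words before and y after the trigger, padding with NULL."""
    left = before[-x:] if x > 0 else []
    left = [NULL] * (x - len(left)) + left
    right = after[:y]
    right = right + [NULL] * (y - len(right))
    return left + [trigger] + right


# Feature tuples copied from the example sentence.
before = [("官兵", "n", "SBV", "1", "Ae10"), ("很快", "d", "ADV", "1", "Eb23")]
trigger = ("赶到", "v", "HED", "Hf08")
after = [("了", "u", "MT", "1", "Kd05"), ("重灾区", "n", "VOB", "1", "Cb08")]
vector = build_feature_vector(before, trigger, after)
```

With the default window of x = 3 and y = 2, the result has six entries and starts with one NULL tuple, matching the vector given for the example.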

To validate the effect of extracting event trigger words in the emergency domain with this method, we carry out experiments on the CEC corpus using the Java programming language. The ME algorithm in the experiment comes from the open-source toolkit ME [16], and the SVM classifier comes from the open-source toolkit LibSVM [17]. All parameters are set to their default values.


3.3. Extract Event Trigger Word that Combines Extended Trigger Word Table and Machine Learning

The method of extracting event trigger words based on the extended trigger word table is a statistics-based method; it obtains a high recall rate, but its precision rate is relatively low. The method of extracting event trigger words based on machine learning is regarded here as a rule-based method; it obtains a high precision rate, but its recall rate is lower than that of the former method. We now combine the two methods. The combination steps are as follows:

1) Set a threshold for the score. To reduce ambiguity, the threshold is generally set relatively high.

2) Construct the candidate trigger word set by using the method of extracting event trigger words based on the extended trigger word table.

3) If the candidate trigger word set contains a word whose score is greater than or equal to the threshold, the word with the largest score is regarded as the trigger word.

4) If the candidate trigger word set contains no word whose score is greater than or equal to the threshold, the trigger word is determined by using ME or SVM, respectively.
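The four steps above amount to a table lookup with a classifier fallback, which can be sketched as follows. The function `combined_trigger` and its `classifier` argument are hypothetical: the callable stands in for a trained ME or SVM model, and the scores in the usage lines are placeholders on an arbitrary scale.

```python
from typing import Callable, Dict, List, Optional


def combined_trigger(scores: Dict[str, float], threshold: float,
                     classifier: Callable[[List[str]], Optional[str]],
                     candidates: List[str]) -> Optional[str]:
    """Prefer the best table-scored candidate; otherwise defer to the classifier."""
    above = {w: s for w, s in scores.items() if s >= threshold}
    if above:
        # Step 3: the highest-scoring candidate at or above the threshold wins.
        return max(above, key=above.get)
    # Step 4: no candidate is confident enough, so the ME/SVM stand-in decides.
    return classifier(candidates)


def fallback(words: List[str]) -> Optional[str]:
    """Dummy classifier used only for this illustration."""
    return words[-1] if words else None


words = ["官兵", "赶到", "重灾区"]
toy_scores = {"官兵": 0.2, "赶到": 0.9}
by_table = combined_trigger(toy_scores, 0.5, fallback, words)
by_model = combined_trigger(toy_scores, 0.95, fallback, words)
```

A high threshold sends more sentences to the classifier, which matches the stated intent of trusting the table only when it is unambiguous.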

4. Analysis of Experiment Result

The experiment uses the common measures precision rate P and recall rate R to evaluate the quality of the extraction results. Because precision and recall reflect two different aspects, both must be considered and neither can be neglected. Therefore, we also use an integrated evaluation indicator, the F-score, defined as $F = 2PR / (P + R)$. The following table shows the experimental results.

Experiment method             Recall rate   Precision rate   F-score
Extended trigger word table   0.82535       0.34215          0.48626
SVM                           0.42222       0.93442          0.58115
(unlabeled)                   0.62962       0.80952          0.70833
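The F-score formula can be checked with a few lines of code. This is a minimal sketch: the function name `f_score` and the guard for a zero denominator are assumptions, and the inputs are the recall and precision values from the last row of the table.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is extracted correctly
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the last row of the result table.
value = f_score(precision=0.80952, recall=0.62962)
```

Because the F-score is a harmonic mean, it stays close to the smaller of the two rates, which is why a method with very high recall but low precision still scores poorly.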

As the experimental results show, the method of extracting event trigger words based on the extended trigger word table obtains a low precision rate, but it can solve the problem of low recall that occurs when extracting event trigger words based on machine learning. The method based on machine learning obtains a low recall rate, but it can solve the problem of low precision that occurs when extracting event trigger words based on the extended trigger word table. When the two are combined, the F-score can reach 71.2%.

5. Acknowledgements

This work was supported by the National Natural Science Foundation (60975033), the National Natural Science Foundation (61273328), the International Network for Bamboo and Rattan (INBAR) basic scientific research project (1632009006), and the Shanghai University Graduate Innovation Fund (SHUCX120103).

REFERENCES

[1] S. A. Lowe, “The Beta-Binomial Mixture Model and Its Application to TDT Tracking and Detection,” Proceedings of the DARPA Broadcast News Workshop, February 1999.

[2] ACE Pilot Study Task Definition [EB/OL]. [2007-09-28]. ftp://jaguar.ncsl.nist.gov/ace/phase1/edt_phase1_v2.2.pdf.

[3] ACE-2 Evaluation Plan RDC Guidelines v2.3 [EB/OL]. [2007-09-28]. ftp://jaguar.ncsl.nist.gov/ace/phase2/docs/RDC-Guidelines-v2.3.doc.

[4] ACE2003 Evaluation Plan v1 [EB/OL]. [2007-09-28]. ftp://jaguar.ncsl.nist.gov/ace/doc/ace_evalplan-2003.v1.pdf.

[5] ACE2004 Evaluation Plan v7 [EB/OL]. [2007-09-28]. http://www.nist.gov/speech/tests/ace/ace04/doc/ace04-evalplan-v7.pdf.

[6] ACE2005 Evaluation Plan v3 [EB/OL]. [2007-09-28]. http://www.nist.gov/speech/tests/ace/ace05/doc/ace05-evalplan.v3.pdf.

[7] Zhao Yanyan, et al., “Chinese Event Extraction Technology Research,” Journal of Chinese Information, vol. 22, pp. 3-8, 2008.

[8] Liu Zongtian, Huang Meili, Zhou Wen, Zhong Zhaoman, Fu Jianfeng, Shan Jianfang, Zhihui Lai, “Research on Event-oriented Ontology Model,” Computer Science, vol. 36, no. 11, pp. 189-192, 2009.

[9] Fu Jianfeng, “Research on Event-Oriented Knowledge Processing,” Shanghai University, 2010.

[10] Fu Jianfeng, et al., “Dependency Parsing Based Event Recognition,” Computer Science, vol. 36, pp. 217-219, 2009.

[11] Linguistic Data Consortium, ACE (Automatic Content Extraction) English Annotation Guidelines for Events, 2005.

[12] Fu Jianfeng, “Event-based Chinese Corpus Annotation Method,” invention patent, State Intellectual Property Office of the People's Republic of China, Application No. 201010126360.8, 2010.

[13] Mei Jiaju, Zhu Yiming, Tongyici Cilin, Shanghai Dictionary Publishing House, 1983.

[14] http://www.ir-lab.org/.


[15] H. John, Automatically Acquiring a Classification of Words, Paris: University of Leeds, 1994.

[16] Zhang Le, Maximum Entropy Modeling Toolkit for Python and C++, http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html.

[17] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/