An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese
Copyright © 2013 SciRes. JCC
3.2. Advanced Issues of Good-Turing Method
Good-Turing smoothing has been employed in many natural language applications. Previous works [3,11,12] discussed the related parameters, such as the cut-off k in the Good-Turing method. However, these works employed English corpora only. In this section, we focus on the Good-Turing method for Mandarin corpora and further analyze the problems of Good-Turing for Chinese corpora of various sizes and with different cut-off values k.
As shown in Equation (8), Good-Turing re-estimates the count c* of every event in terms of the original count c and the event numbers nc and nc+1. In practice, the discounted count c* is not applied to every count c, under the assumption that larger counts c are reliable as they stand. The recounts c* are instead set as suggested by Katz [5] (1987) for English data:
\[
c^{*} = \frac{(c+1)\dfrac{n_{c+1}}{n_c} - c\,\dfrac{(k+1)\,n_{k+1}}{n_1}}{1 - \dfrac{(k+1)\,n_{k+1}}{n_1}}, \qquad 1 \le c \le k,
\]
where c denotes the count of an event, c* denotes the recount of the event, ni denotes the number of bigrams with i counts, and k denotes the cut-off value; counts above the cut-off (c > k) are left unchanged.
Good-Turing was first applied as a smoothing method for n-gram models by Katz [5]. Until now, few papers have discussed the relation between the cut-off k and entropy for Mandarin corpora, or even for English. Katz suggested a cut-off of k = 5 as the threshold for English corpora. Another important parameter of Good-Turing is the best cut-off kb (never discussed in previous works) in terms of the training size N.
For the Chinese character unigram model, we first calculate the recount c* (c >= 0). According to the empirical results, some recounts c* are negative (< 0). Such a case leads to a negative probability P, which violates the basic principles of probability. For instance, with c = 8, n8 = 106, n9 = 67 and k = 10, the calculated recount c* is negative: −20.56.
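The cut-off recount and the negative-recount problem can be illustrated with a short sketch of Katz's formula. The frequency-of-frequency table below is hypothetical (only n8 and n9 follow the paper's example; n1 and n11 are invented values chosen to trigger a negative recount, not the CGW statistics):

```python
def katz_recount(c, n, k):
    """Katz (1987) cut-off form of the Good-Turing recount.

    n maps a count value to the number of events observed that many
    times (n[i] = n_i).  Counts above the cut-off k are kept as-is.
    """
    if c > k:
        return float(c)
    # Discount term (k+1) * n_{k+1} / n_1, shared by all c <= k.
    a = (k + 1) * n[k + 1] / n[1]
    raw = (c + 1) * n[c + 1] / n[c]   # plain Good-Turing estimate
    return (raw - c * a) / (1.0 - a)

# Hypothetical frequency-of-frequency table (illustrative only).
n = {1: 900, 8: 106, 9: 67, 11: 60}
k = 10

# When (c+1) n_{c+1} / n_c falls below c * (k+1) n_{k+1} / n_1,
# the numerator turns negative and so does the recount.
print(katz_recount(8, n, k))   # a negative recount
```

The negative value arises whenever the plain Good-Turing estimate for c is smaller than the shared discount term scaled by c, which is exactly the situation the paper reports for the Chinese character unigram model.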
3.3. Model Evaluation: Cross Entropy and Perplexity
Two commonly used schemes for evaluating the quality of a language model (LM) are entropy and perplexity [13,14]. Suppose that a sample T consists of several events e1, e2, ..., em of m strings. The probability P for a given testing sample T is calculated as follows:
\[
P(T) = \prod_{i=1}^{m} P(e_i) \qquad (9)
\]
where \(P(e_i)\) is the probability of the event \(e_i\), and the entropy \(E(T)\) can be regarded as the coded length for all events in the testing dataset:
\[
E(T) = -\sum_{x} P(x)\log_2 P(x) = -\sum_{i=1}^{m} P(e_i)\log_2 P(e_i) \qquad (10)
\]
\[
PP(T) = 2^{E(T)} \qquad (11)
\]
where E(T) and PP(T) denote the entropy (log model probability) and the perplexity of the testing dataset T, respectively, and Emin stands for the minimum entropy of a model.
The perplexity PP is usually interpreted as the average number of candidate choices available at each step given a known sequence. When a language model is employed to predict the next word in the current context, perplexity is adopted to compare and evaluate n-gram statistical language models.
In general, lower entropy E leads to lower PP for a language model; that is, the lower the PP, the better the performance of the language model. Therefore, perplexity is a quality measurement for LMs. When two language models, LM1 and LM2, are compared, the one with the lower perplexity is the better language representation and commonly provides higher performance.
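The entropy and perplexity of Equations (10) and (11) can be sketched directly for a known event distribution. The toy distribution below is illustrative, not drawn from the paper's data:

```python
import math

def entropy(probs):
    """Entropy in bits, E = -sum p * log2(p), as in Equation (10)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity PP = 2^E, as in Equation (11)."""
    return 2 ** entropy(probs)

# A uniform distribution over 4 events has entropy 2 bits, so the
# perplexity is 4: on average, four equally likely candidates.
uniform = [0.25, 0.25, 0.25, 0.25]
print(entropy(uniform), perplexity(uniform))  # 2.0 4.0
```

Skewing the distribution lowers the entropy and hence the perplexity, which is why a model that concentrates probability on the correct candidates scores better.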
In fact, the probability distribution of the testing language is usually unknown. The cross entropy (CE) is another measure for evaluating a language model: a model that better predicts the next occurring event always achieves lower cross entropy. In general, CE(p, M) >= E, where E denotes the entropy obtained when the same language model M is used for both training and testing. The cross entropy can be expressed as:
\[
CE(p, M) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1 w_2 w_3 \ldots w_n) \log_2 M(w_1 w_2 w_3 \ldots w_n) \qquad (12)
\]
Based on the Shannon-McMillan-Breiman theorem [7], Equation (12) can be simplified as follows:
\[
CE(p, M) = \lim_{n \to \infty} -\frac{1}{n} \log_2 M(w_1 w_2 w_3 \ldots w_n) \qquad (13)
\]
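On a finite test sample, Equation (13) is approximated by the per-word quantity −(1/n) log2 M(w1 w2 ... wn). A minimal sketch, assuming a hypothetical unigram model (the joint probability factors into per-word probabilities; the tiny training text is invented, not the paper's CGW data):

```python
import math
from collections import Counter

def cross_entropy(model, words):
    """Finite-sample form of Equation (13):
    CE ~ -(1/n) * log2 M(w1 w2 ... wn), with the joint probability
    factored into the model's per-word probabilities.
    """
    n = len(words)
    log_prob = sum(math.log2(model[w]) for w in words)
    return -log_prob / n

# Hypothetical unigram model estimated from a toy training text.
train = "a b a c a b".split()
counts = Counter(train)
total = sum(counts.values())
model = {w: c / total for w, c in counts.items()}

test = "a b a".split()
ce = cross_entropy(model, test)
print(ce, 2 ** ce)  # per-word cross entropy and the matching perplexity
```

A real evaluation would also need smoothing (such as the Good-Turing method above) so that unseen test words do not receive zero probability.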
4. Experiments and Evaluation
Chinese Giga Word (CGW) is the Chinese corpus col-
lected from several world news databases and issued by
Linguistic Data Consortium (LDC). In the paper, we
adopted the CGW 3.0 of newest version published on
September 2009. The CGW news sources are Agence
France-Presse, Central News Agency of Taiwan, Xinhua
News Agency of Beijing, aned Zaobao Newspaper of
Singapore.
In this paper, we create 10 unigram language models with Chinese words for the experiments. First, we read Chinese words at random from the CGW corpus, and a language model LM1 is created from the first 3 × 10^7 (30 M) Chinese words. Subsequently, a new model LM2 is created by adding the next 3 × 10^7 Chinese words; in other words, LM2 consists of the first 6 × 10^7 (60 M) Chinese words of CGW, the first half of which is also used to create LM1.
In this paper, the 10 language models created from different corpus sizes are evaluated sequentially for inside