In this study we apply Zipf-Alecseev’s function to the word length distributions of Chinese prose and dialogue texts. Since Chinese word length can be measured in two units, characters and components, we fit the function under both. The results show that all the word length distributions fit Zipf-Alecseev’s function, regardless of whether word length is measured in characters or in components. The parameters a and b in Zipf-Alecseev’s function $y = cx^{a + b\ln x}$ show no significant difference between text styles (prose and dialogue in our case). However, the parameters differ when word length is measured in different units (character and component, respectively). This indicates that Zipf-Alecseev’s function is sensitive to the measurement unit of word length, but not to text style.
Word length plays a crucial role in the development of quantitative linguistics, especially in Köhler’s lexical control circuit. There has been a wealth of research on word length in different languages, including Chinese [
Recently a unified model of length distribution of any unit in language was suggested ( [
In this book ( [
In this study, we explore whether text style or the measurement unit of word length influences the value of a in Zipf-Alecseev’s function. Moreover, since the parameters are part of a dynamic system displaying self-regulation, the dependence of parameter b on parameter a is also tested.
Specifically, the following questions will be explored in this study.
Question 1: Can the word length distributions of Chinese prose and dialogue texts be modeled by Zipf-Alecseev’s function $y = cx^{a + b\ln x}$?
Question 2: Do the parameters in fitting Zipf-Alecseev’s function to Chinese word length distributions display any self-regulation (the dependence of the parameter b on parameter a)?
Question 3: Are the parameters in Zipf-Alecseev’s function sensitive to different measurement units of word length (the potential measurement units of Chinese word length are the character and the component)?
Question 4: Are the parameters in Zipf-Alecseev’s function sensitive to different text styles (which are prose and dialogue texts in our case)?
This paper contains four sections. Section 2 describes the materials and methods used; Section 3 presents the results of fitting Zipf-Alecseev’s function to Chinese word length distributions, as well as the comparisons of the values of parameter a between different text styles and different measurement units of word length; Section 4 concludes this study.
To measure word length in spoken and written Chinese, we built a dialogue text collection (spoken language) and a prose text collection (written language), with 20 texts each. The number of words in each text ranges from 726 to 3792. The spoken language texts, which take the form of daily conversation, come from a TV talk show named “QiangQiang San Ren Xing” (in English, “Three People”) broadcast on Phoenix TV from June to September 2013, five texts from each month and 20 texts in total. This program mainly discusses current hot social issues. The written language texts come from the well-known Chinese prose journal Selected Prose (see footnote 1), also from June to September 2013, five texts from each month and 20 texts in total.
Word length in Chinese can be measured in characters or in components. For example, the word “汉语” (meaning “Chinese”) consists of two characters, “汉” and “语”, and five components: “氵”, “又”, “讠”, “五” and “口”. Since written Chinese has no natural boundaries between words, word segmentation is needed before word length can be measured. Word segmentation depends on the definition of the word, which is a difficult problem, especially in Chinese. That problem is not discussed here; in the present investigation we segment all texts according to a unified standard. First, we used ICTCLAS, one of the best Chinese word segmentation programs, to segment the texts automatically. Then we checked the output manually and corrected the errors.
After word segmentation, we developed a Java program to measure word length. To count the components of a word, we used a list of 20902 characters (CJK Unified Ideographs) that gives the number of strokes and components of each character.
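The two measurements can be illustrated with a minimal Python sketch. It is not the Java program used in this study: the function names are hypothetical, and the lookup table `COMPONENTS` is a two-entry stand-in for the 20902-character list, filled only with the decomposition of “汉语” given above.

```python
from collections import Counter

# Hypothetical miniature lookup table (an assumption for illustration):
# it holds only the two characters of the example word, not the full list.
COMPONENTS = {
    "汉": ["氵", "又"],
    "语": ["讠", "五", "口"],
}

def length_in_characters(word):
    """Word length measured as the number of characters."""
    return len(word)

def length_in_components(word):
    """Word length measured as the total number of components."""
    return sum(len(COMPONENTS[ch]) for ch in word)

def length_distribution(words, measure):
    """Frequency of each word length in a segmented text."""
    return Counter(measure(w) for w in words)

# A toy "segmented text" built from the example characters only.
segmented = ["汉语", "汉", "语", "汉语"]
by_characters = length_distribution(segmented, length_in_characters)
by_components = length_distribution(segmented, length_in_components)
```

The same segmented text thus yields two different length distributions, one per measurement unit, and each is fitted separately.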
We used Matlab 2012b for the fitting, and the goodness of fit is indicated by the coefficient of determination $R^2$. For the statistical comparisons, we used t-tests in SPSS 19, with the significance level set to 0.05.
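For readers without Matlab, the fitting step can be approximated as follows. This sketch swaps in a different technique from the one used in this study: taking logarithms turns $y = cx^{a + b\ln x}$ into $\ln y = \ln c + a\ln x + b(\ln x)^2$, which is linear in $\ln c$, a and b and can be solved by ordinary least squares. Because it minimizes errors in log space, its estimates will generally differ somewhat from a nonlinear fit on the raw frequencies. All function names are hypothetical, and the demonstration data are synthetic, generated from illustrative parameter values.

```python
import math

def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_zipf_alekseev(lengths, freqs):
    """Least squares on ln y = ln c + a*ln x + b*(ln x)^2.

    Length classes with zero frequency must be left out beforehand.
    Returns (a, b, c).
    """
    rows = [(1.0, math.log(x), math.log(x) ** 2) for x in lengths]
    v = [math.log(y) for y in freqs]
    # Normal equations (X^T X) beta = X^T v.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * vi for r, vi in zip(rows, v)) for i in range(3)]
    ln_c, a, b = solve_linear(A, rhs)
    return a, b, math.exp(ln_c)

def r_squared(lengths, freqs, a, b, c):
    """Coefficient of determination on the original frequency scale."""
    pred = [c * x ** (a + b * math.log(x)) for x in lengths]
    mean = sum(freqs) / len(freqs)
    ss_res = sum((y - p) ** 2 for y, p in zip(freqs, pred))
    ss_tot = sum((y - mean) ** 2 for y in freqs)
    return 1 - ss_res / ss_tot

# Synthetic check: frequencies generated from known (illustrative) parameters.
xs = [1, 2, 3, 4, 5, 6]
ys = [239 * x ** (4.829 - 6.46 * math.log(x)) for x in xs]
a_hat, b_hat, c_hat = fit_zipf_alekseev(xs, ys)
r2 = r_squared(xs, ys, a_hat, b_hat, c_hat)
```

On noise-free synthetic data the log-linear fit recovers the generating parameters; on real frequency data it is better treated as a starting point for a nonlinear fit.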
Results of fitting Zipf-Alecseev’s function to Chinese word length distributions
In this part we present the results of fitting Zipf-Alecseev’s function to the word length distributions of Chinese prose and dialogue texts, including the parameters and the coefficients of determination $R^2$. Moreover, the dependence of parameter b on parameter a is tested to see whether Chinese word length distributions display any self-regulation.
Using the data from
1. Selected Prose website: http://swsk.qikan.com.
The relationship between a and b in
The relation between the a and b in
Text | Character tokens | Word tokens | Text | Character tokens | Word tokens |
---|---|---|---|---|---|
1 | 2168 | 1589 | 11 | 5441 | 3792 |
2 | 1561 | 1068 | 12 | 5419 | 3783 |
3 | 2520 | 1763 | 13 | 5216 | 3592 |
4 | 2245 | 1526 | 14 | 5021 | 3444 |
5 | 1373 | 941 | 15 | 4959 | 3498 |
6 | 1002 | 726 | 16 | 5251 | 3609 |
7 | 2287 | 1567 | 17 | 5093 | 3571 |
8 | 1306 | 883 | 18 | 5127 | 3437 |
9 | 2047 | 1445 | 19 | 4848 | 3329 |
10 | 1822 | 1278 | 20 | 4668 | 3197 |
Text | Character tokens | Word tokens | Text | Character tokens | Word tokens |
---|---|---|---|---|---|
1 | 1920 | 1366 | 11 | 1928 | 1368 |
2 | 1309 | 952 | 12 | 2655 | 1861 |
3 | 2055 | 1490 | 13 | 1423 | 948 |
4 | 2394 | 1657 | 14 | 2318 | 1779 |
5 | 2014 | 1502 | 15 | 1471 | 962 |
6 | 1550 | 1119 | 16 | 4128 | 2876 |
7 | 1786 | 1269 | 17 | 5143 | 3654 |
8 | 1466 | 993 | 18 | 5012 | 3512 |
9 | 1830 | 1366 | 19 | 4423 | 3057 |
10 | 2693 | 1928 | 20 | 4403 | 2953 |
The relation between the a and b in
It can be concluded from the above results that Chinese word length distributions can be modeled by Zipf-Alecseev’s function, and that the dependence of parameter b on parameter a is confirmed.
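The dependence of b on a can be checked numerically from the fitted parameters. The sketch below uses the a and b values reported in this paper for the 20 prose texts with word length measured in characters. It assumes a linear relation purely as a first check, since the functional form of the dependence is not restated here, and the function name is hypothetical.

```python
import math

# Fitted parameters of the 20 prose texts (word length in characters),
# copied from the parameter table of this paper.
A_VALUES = [4.829, 3.674, 4.377, 5.924, 5.769, 4.841, 5.317, 5.601, 4.77,
            5.543, 4.519, 5.241, 5.31, 3.626, 6.602, 5.239, 5.332, 5.985,
            6.034, 5.611]
B_VALUES = [-6.46, -5.507, -5.984, -7.737, -7.967, -6.905, -6.823, -7.539,
            -6.735, -7.226, -5.919, -6.558, -6.827, -5.61, -8.21, -6.592,
            -6.777, -7.578, -7.439, -6.799]

def pearson_and_line(xs, ys):
    """Pearson correlation plus slope and intercept of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = my - slope * mx
    return r, slope, intercept

r, slope, intercept = pearson_and_line(A_VALUES, B_VALUES)
print(f"r = {r:.3f}, b = {slope:.3f} * a + {intercept:.3f}")
```

A strongly negative correlation means that texts with a larger a tend to have a more negative b, which is the kind of mutual adjustment that the self-regulation hypothesis predicts.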
1) Character as the measurement unit
Prose texts (word length in characters) | a | b | c | $R^2$ |
---|---|---|---|---|
1 | 4.829 | −6.46 | 239 | 0.9988 |
2 | 3.674 | −5.507 | 243 | 0.999 |
3 | 4.377 | −5.984 | 272 | 0.9979 |
4 | 5.924 | −7.737 | 320 | 0.9978 |
5 | 5.769 | −7.967 | 273 | 0.9993 |
6 | 4.841 | −6.905 | 257 | 0.9985 |
7 | 5.317 | −6.823 | 211 | 0.9998 |
8 | 5.601 | −7.539 | 205 | 0.9952 |
9 | 4.77 | −6.735 | 261 | 0.9992 |
10 | 5.543 | −7.226 | 272 | 0.9992 |
11 | 4.519 | −5.919 | 224 | 0.9978 |
12 | 5.241 | −6.558 | 260 | 0.9988 |
13 | 5.31 | −6.827 | 199 | 0.9974 |
14 | 3.626 | −5.61 | 409 | 0.9991 |
15 | 6.602 | −8.21 | 177 | 0.9984 |
16 | 5.239 | −6.592 | 411 | 0.994 |
17 | 5.332 | −6.777 | 465 | 0.9967 |
18 | 5.985 | −7.578 | 470 | 0.9973 |
19 | 6.034 | −7.439 | 412 | 0.9913 |
20 | 5.611 | −6.799 | 420 | 0.998 |
Prose texts (word length in components) | a | b | c | $R^2$ |
---|---|---|---|---|
1 | 2.918 | −1.479 | 30.78 | 0.9785 |
2 | 2.362 | −1.277 | 36.1 | 0.9456 |
3 | 2.709 | −1.394 | 37.42 | 0.9607 |
4 | 2.983 | −1.41 | 35.3 | 0.9605 |
5 | 3.31 | −1.657 | 27.97 | 0.9796 |
6 | 2.777 | −1.48 | 35.14 | 0.9552 |
7 | 3.025 | −1.468 | 25.26 | 0.9442 |
8 | 3.2 | −1.525 | 19.96 | 0.9548 |
9 | 3.02 | −1.531 | 29.34 | 0.9608 |
10 | 3.533 | −1.685 | 25.07 | 0.9564 |
11 | 3.45 | −1.621 | 19.67 | 0.9701 |
12 | 3.787 | −1.727 | 20.1 | 0.967 |
13 | 3.042 | −1.448 | 22.32 | 0.9504 |
14 | 3.084 | −1.608 | 43.07 | 0.9939 |
15 | 3.177 | −1.436 | 18.4 | 0.943 |
16 | 3.407 | −1.572 | 38.57 | 0.9684 |
17 | 3.495 | −1.597 | 39.33 | 0.9747 |
18 | 3.798 | −1.703 | 34.34 | 0.9753 |
19 | 4.169 | −1.782 | 23.02 | 0.9496 |
20 | 3.617 | −1.61 | 34.96 | 0.9686 |
Dialogue texts (word length in characters) | a | b | c | $R^2$ |
---|---|---|---|---|
1 | 4.706 | −6.446 | 211 | 0.9992 |
2 | 4.724 | −5.981 | 148 | 0.9995 |
3 | 5.618 | −7.159 | 219 | 0.9991 |
4 | 4.345 | −5.546 | 195 | 0.9997 |
5 | 5.425 | −6.959 | 116 | 0.9999 |
6 | 5.922 | −8.256 | 128 | 1 |
7 | 5.461 | −6.748 | 176 | 0.9991 |
8 | 4.241 | −5.569 | 139 | 0.9989 |
9 | 5.138 | −6.485 | 180 | 0.9998 |
10 | 5.083 | −6.666 | 177 | 1 |
11 | 4.597 | −5.633 | 323 | 0.9996 |
12 | 5.964 | −7.485 | 305 | 0.9996 |
13 | 5.292 | −6.288 | 268 | 0.999 |
14 | 4.932 | −5.903 | 292 | 0.9996 |
15 | 5.243 | −6.452 | 248 | 0.9996 |
16 | 5.781 | −6.997 | 289 | 0.9997 |
17 | 4.708 | −5.771 | 303 | 0.9979 |
18 | 5.685 | −6.672 | 258 | 0.9989 |
19 | 5.627 | −6.812 | 293 | 0.999 |
20 | 5.07 | −6.3 | 283 | 0.9994 |
It can be seen from
2) Component as the measurement unit
When the component is used as the measurement unit of Chinese word length, the comparison results are given in
1) Prose texts
For prose texts, i.e. written Chinese, the comparison of the values of parameter a when word length is measured in different units is displayed in
Dialogue texts (word length in components) | a | b | c | $R^2$ |
---|---|---|---|---|
1 | 2.476 | −1.329 | 34.03 | 0.976 |
2 | 3.092 | −1.494 | 17.25 | 0.9603 |
3 | 2.664 | −1.34 | 33.72 | 0.9404 |
4 | 2.86 | −1.435 | 26.95 | 0.9523 |
5 | 2.475 | −1.251 | 18.79 | 0.9053 |
6 | 2.818 | −1.534 | 19.07 | 0.9809 |
7 | 3.203 | −1.514 | 20.16 | 0.9405 |
8 | 2.797 | −1.373 | 17.46 | 0.9273 |
9 | 2.722 | −1.367 | 26.99 | 0.9467 |
10 | 2.574 | −1.316 | 26.62 | 0.9621 |
11 | 3.757 | −1.707 | 25.6 | 0.9656 |
12 | 4.168 | −1.841 | 18.25 | 0.9584 |
13 | 4.476 | −1.886 | 12.63 | 0.9432 |
14 | 4.154 | −1.796 | 17.18 | 0.9377 |
15 | 3.96 | −1.754 | 16.69 | 0.9387 |
16 | 4.507 | −1.932 | 14.12 | 0.9581 |
17 | 3.52 | −1.597 | 26.34 | 0.9703 |
18 | 4.251 | −1.819 | 15.29 | 0.9326 |
19 | 3.901 | −1.698 | 20.1 | 0.9396 |
20 | 4.35 | −1.907 | 14.9 | 0.9384 |
Parameter (unit: character) | Style | N | Mean value | StDev | SE Mean |
---|---|---|---|---|---|
a | Prose | 20 | 5.2072 | 0.76146 | 0.17027 |
a | Dialogue | 20 | 5.1781 | 0.51235 | 0.11456 |

Parameter (unit: component) | Style | N | Mean value | StDev | SE Mean |
---|---|---|---|---|---|
a | Prose | 20 | 3.2432 | 0.42575 | 0.09520 |
a | Dialogue | 20 | 3.4363 | 0.73874 | 0.16519 |

Parameter (style: prose) | Measurement unit | N | Mean value | StDev | SE Mean |
---|---|---|---|---|---|
a | character | 20 | 5.2072 | 0.76146 | 0.17027 |
a | component | 20 | 3.2432 | 0.42575 | 0.09520 |

Parameter (style: dialogue) | Measurement unit | N | Mean value | StDev | SE Mean |
---|---|---|---|---|---|
a | character | 20 | 5.1781 | 0.51235 | 0.11456 |
a | component | 20 | 3.4363 | 0.73874 | 0.16519 |
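The comparisons can be reproduced approximately from the summary statistics alone. The sketch below is not the SPSS procedure used in this study: it assumes an independent two-sample t-test with pooled variance, whereas the exact test variant is not specified here (for the measurement-unit comparisons, where the same texts are measured twice, a paired test is also conceivable). The critical value 2.024 is the two-tailed 0.05 value for 38 degrees of freedom from a standard t table, and the function name is hypothetical.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

# Two-tailed critical value for df = 38 at the 0.05 level (standard t table).
T_CRIT_DF38 = 2.024

# (mean, StDev, N) of parameter a, copied from the summary tables above.
PROSE_CHAR = (5.2072, 0.76146, 20)
DIALOGUE_CHAR = (5.1781, 0.51235, 20)
PROSE_COMP = (3.2432, 0.42575, 20)
DIALOGUE_COMP = (3.4363, 0.73874, 20)

comparisons = {
    "style, unit = character": (PROSE_CHAR, DIALOGUE_CHAR),
    "style, unit = component": (PROSE_COMP, DIALOGUE_COMP),
    "unit, style = prose": (PROSE_CHAR, PROSE_COMP),
    "unit, style = dialogue": (DIALOGUE_CHAR, DIALOGUE_COMP),
}

for label, (g1, g2) in comparisons.items():
    t, df = pooled_t(*g1, *g2)
    verdict = "significant" if abs(t) > T_CRIT_DF38 else "not significant"
    print(f"{label}: t = {t:.3f}, df = {df}, {verdict} at 0.05")
```

Under these assumptions, the two style comparisons stay below the critical value while the two unit comparisons exceed it, in line with the conclusions of this study.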
It can be seen from
2) Dialogue texts
For dialogue texts, i.e. spoken Chinese, the comparison results are illustrated in
Based on the analyses above, we conclude that:
1) The word length distributions of Chinese prose and dialogue texts can be modeled by Zipf-Alecseev’s function $y = cx^{a + b\ln x}$.
2) The dependence of parameter b on parameter a is confirmed, which means that the parameters obtained by fitting Zipf-Alecseev’s function to Chinese word length distributions display some self-regulation.
3) Different measurement units of Chinese word length lead to different values of parameter a in Zipf-Alecseev’s function.
4) The parameters in Zipf-Alecseev’s function are not sensitive to text style (prose versus dialogue in our case), which suggests that they may be sensitive only to language type.
This work is supported by the Education Department of Guangdong Province “Innovative Strong School Project” Youth Innovation Talents Project (Humanities and Social Sciences) (Project Number: 2017WQNCX046).
Chen, H. (2018) Comparison of Word Length Distributions in Spoken and Written Chinese. Open Access Library Journal, 5: e4660. https://doi.org/10.4236/oalib.1104660