
In this study, we regard written texts as time series data and investigate dynamic correlations of word occurrences by utilizing an autocorrelation function (ACF). After defining an ACF formula that is suitable for expressing the dynamic correlations of words, we use the formula to calculate ACFs for frequent words in 12 books. The ACFs obtained can be classified into two groups: One group of ACFs shows dynamic correlations, with these ACFs well described by a modified Kohlrausch-Williams-Watts (KWW) function; the other group of ACFs shows no correlations, with these ACFs fitted by a simple stepdown function. A word having the former ACF is called a Type-I word and a word with the latter ACF is called a Type-II word. It is also shown that the ACFs of Type-II words can be derived theoretically by assuming that the stochastic process governing word occurrence is a homogeneous Poisson point process. Based on the fitting of the ACFs by KWW and stepdown functions, we propose a measure of word importance which expresses the extent to which a word is important in a particular text. The validity of the measure is confirmed by using Kleinberg’s burst detection algorithm.

We use language to convey our ideas. Since our physical function is limited to speaking or writing only one word at a time, we must transform our complex ideas into linear strings of words. In this transformation, it is essential to use memory, because our thought processes are far more complex than a linear object, and this one-dimensionality is the origin of various types of correlations observed in written texts or speeches. In this regard, the questions that arise are how to characterize various types of correlations in linguistic data and how to relate them to our thought processes. These questions motivated us to initiate the study of dynamic correlations in written texts.

One major way to capture the correlations is to analyze word co-occurrence statistics, which is a traditional quantitative method in linguistics. This approach has been successfully applied to the extraction of semantic representations [

The goal of this study is to find a modification of the word-level mapping that is suitable for defining and calculating appropriate ACFs in the mapping scheme. With that modification, we then calculate ACFs for words in written texts and investigate word-level dynamic correlations in terms of the functional forms of the ACFs. In particular, we focus on dynamic correlations ranging from a few sentences to several tens of sentences because complex ideas require such a length to be conveyed. Through the analysis of ACFs, we will find that the functional forms of ACFs for words with dynamic correlations are completely different from those for words without dynamic correlations. Using this result as a base, a measure that quantifies the strength of dynamic correlations will be presented, and the validity of the measure will be discussed. The measure expresses, in a sense, how important the corresponding word is in a text and thus has a wide range of real applications in which the importance of each word is required.

The rest of the paper is organized as follows. In the next section, we outline related studies with special emphasis on how the models used in the related studies can be interpreted in terms of stochastic processes. Then, we devote a section to explaining the modification of the word-level mapping, the definition of an appropriate ACF for word occurrences, and how to calculate the ACF from real written texts. Section 4 describes the 12 texts whose frequent words are investigated using ACFs. These 12 texts represent a wide variety of written linguistic data. Section 5 shows our systematic analysis of ACFs calculated for words in the 12 texts. A measure representing word importance in terms of dynamic correlations is also presented. In the final section, we give our conclusions and suggest directions for future research.

A homogeneous Poisson point process [

Sarkar et al. [

A further extension has been achieved by use of an inhomogeneous Poisson process which is defined as a Poisson point process having a time-varying occurrence rate [

Obviously, the two models mentioned above have more expressive power than a homogeneous Poisson process. However, these models do not serve to clarify dynamic correlations of word occurrences because the key property of “complete independence” is common to both of them. In other words, since the “complete independence” property is inherited by these two models, an occurrence of a considered word in a text does not affect the probability of occurrences of the word at different times. This memoryless property makes these models unsuitable for clarifying dynamic correlations of word occurrences.

Another unsatisfactory point which is common to the two related studies is that the gap distribution function has been used to characterize stochastic properties of a considered word. Note that when the word-numbering time is employed, the “gap” is merely the number of other words encountered between occurrences of a considered word in the text. Therefore, that distribution function does not express the dynamic correlation explicitly, although it is suitable for presenting characteristics of stochastic processes such as the homogeneous Poisson process, a mixture of two homogeneous Poisson processes, and the inhomogeneous Poisson process, for which the complete independence property holds.

To avoid the inappropriate use of the gap distribution function for representing dynamic correlations, we will discard the gap distribution function and in the next section, we will introduce an ACF that is more suitable for analyzing dynamic correlations of words.

There are other works in which linguistic data are treated as time series, as they are in this work and in which some methods of time series analysis are used to achieve the researchers’ purposes. Examples of classical works that use ACFs can be seen in [

We propose to use ACFs instead of the gap distribution functions to describe and analyze dynamic correlations in written texts. In standard signal processing theory, the definition of an ACF for a stationary system, C ( t ) [

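$$C(t) = \lim_{T \to \infty} \frac{1}{T} \int_0^T A(\tau)\, A(\tau + t)\, d\tau \tag{1}$$

$$\Phi(t) = \frac{C(t)}{C(0)} = \frac{\lim_{T \to \infty} \frac{1}{T} \int_0^T A(\tau)\, A(\tau + t)\, d\tau}{\lim_{T \to \infty} \frac{1}{T} \int_0^T A(\tau)\, A(\tau)\, d\tau}, \tag{2}$$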

where A ( τ ) is a time-varying signal of interest. As seen in the equations, the ACF measures the correlation of a signal A ( τ ) with a copy of itself shifted by some time delay t. A slightly different definition of an ACF for a random process is used in the area of time-series analysis [

$$R(t) = \frac{E\left[(A(\tau) - \mu)(A(\tau + t) - \mu)\right]}{\sigma^2}, \tag{3}$$

where $\mu = E[A(t)]$ and $\sigma^2 = E[(A(t) - \mu)^2]$ are the mean (the expectation value) and variance, respectively, of the stochastic signal $A(t)$. Assuming an ergodic system, in which the expectation can be replaced by the limit of a time average [

In order to calculate an ACF for a word based on Equation (2), we must define both the meaning of A ( t ) for a word and the meaning of time t for a written text. Since we intend to clarify the dynamic properties of words through ACFs, it is natural to have A ( t ) indicating whether or not the considered word occurs at time t. Therefore, we define A ( t ) as a stochastic binary variable that takes value one if the word occurs at time t and otherwise takes value zero. Next, we consider an appropriate definition of the time unit such that the ACF calculated by Equation (2) will have properties that are preferable for the analysis of the dynamic characteristics of word occurrences. As mentioned before, if we use the word-numbering time, then the ACF shows a curious behavior that greatly impairs the use of ACFs. The problem with using word-numbering time is that Φ ( t ) with word-numbering time invariably takes the value zero at t = 1 because the probability of contiguous occurrences of the same word in a written text is extremely low.

Since the curious behavior seen in

With the sentence-numbering time, we can define the signal of word occurrence, A ( t ) , as a stochastic binary variable:

$$A(t) = \begin{cases} 1 & (\text{the word occurs in the } t\text{-th sentence}) \\ 0 & (\text{the word does not occur in the } t\text{-th sentence}) \end{cases} \tag{4}$$

where t is a non-negative integer. From Equation (1), we can define the discrete time analog of the continuous time ACF as

$$C(t) = \frac{1}{N - t} \sum_{i=1}^{N-t} A(i)\, A(i + t), \tag{5}$$

where $N$ is the number of sentences in a considered text. A further simplification can be achieved by noting that $A(i)$ is a binary variable. Let $p_j$ be the ordinal sentence number at which the considered word occurs: that is, $p_1$ is the sentence number of the first occurrence of a considered word, $p_2$ is that of the second occurrence, and so on. If $A(i)$ is zero in Equation (5), then the contribution of $A(i) A(i + t)$ to the sum vanishes. Thus, it is sufficient to consider only the terms with $i = p_j$, for which $A(p_j) = 1$. Equation (5) then simplifies to

$$C(t) = \frac{1}{N - t} \sum_{i \in \{p_j\}}^{N-t} A(i)\, A(i + t) = \frac{1}{N - t} \sum_{j=1}^{m} A(p_j)\, A(p_j + t) = \frac{1}{N - t} \sum_{j=1}^{m} A(p_j + t), \tag{6}$$

where we have assumed that the total number of occurrences of the word in a text is m. The third equality holds because A ( p j ) = 1 by the definition of p j . Substituting t = 0 into the above equation yields C ( 0 ) = m / N , and this leads us to the normalized expression of the ACF:

$$\Phi(t) = \frac{C(t)}{C(0)} = \frac{N}{m (N - t)} \sum_{j=1}^{m} A(p_j + t). \tag{7}$$

Throughout this work, we use Equation (7) to calculate the normalized ACF of a word.
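As an illustration of how Equation (7) can be evaluated, the following is a minimal Python sketch (the study itself used R). The function name `normalized_acf` and the toy occurrence positions are assumptions made for this example only; the sketch also assumes that $A$ is zero beyond the last sentence.

```python
def normalized_acf(positions, n_sentences, max_lag):
    """Normalized ACF of Equation (7).

    positions   : 1-based sentence numbers p_j in which the word occurs
    n_sentences : N, the number of sentences in the text
    max_lag     : largest lag t to evaluate (must be smaller than N)
    """
    occupied = set(positions)   # A(i) = 1 exactly for the sentences in this set
    m = len(occupied)           # number of sentences containing the word
    acf = []
    for t in range(max_lag + 1):
        # sum over j of A(p_j + t); A is taken as 0 beyond the last sentence
        hits = sum(1 for p in occupied if (p + t) in occupied)
        acf.append(n_sentences * hits / (m * (n_sentences - t)))
    return acf


# Toy example: a word occurring in sentences 2, 3 and 7 of a 10-sentence text
phi = normalized_acf([2, 3, 7], n_sentences=10, max_lag=5)
```

Storing the occurrence positions in a set means each lag costs only $m$ membership tests, which mirrors the simplification from Equation (5) to Equation (6).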

We used the English version of 12 books as written texts for this work. They are listed in

Short name | Title | Author | Download URL |
---|---|---|---|
Carroll | Alice’s Adventures in Wonderland | Lewis Carroll | https://www.gutenberg.org/ebooks/11 |
Twain | The Adventures of Tom Sawyer | Mark Twain | https://www.gutenberg.org/ebooks/74 |
Austen | Pride and Prejudice | Jane Austen | https://www.gutenberg.org/ebooks/1342 |
Tolstoy | War and Peace | Leo Tolstoy | https://www.gutenberg.org/ebooks/2600 |
Melville | Moby Dick; or, The Whale | Herman Melville | https://www.gutenberg.org/ebooks/2701 |
Darwin | On the Origin of Species | Charles Darwin | https://www.gutenberg.org/ebooks/1228 |
Einstein | Relativity: The Special and General Theory | Albert Einstein | https://www.gutenberg.org/ebooks/5001 |
Lavoisier | Elements of Chemistry | Antoine Lavoisier | https://www.gutenberg.org/ebooks/30775 |
Freud | Dream Psychology | Sigmund Freud | https://www.gutenberg.org/ebooks/15489 |
Smith | An Inquiry into the Nature and Causes of the Wealth of Nations | Adam Smith | https://www.gutenberg.org/ebooks/3300 |
Kant | The Critique of Pure Reason | Immanuel Kant | https://www.gutenberg.org/ebooks/4280 |
Plato | The Republic | Plato | https://www.gutenberg.org/ebooks/1497 |

obtained through Project Gutenberg (http://www.gutenberg.org). Five of them are popular novels (Carroll, Twain, Austen, Tolstoy, and Melville) and the rest are chosen from the categories of natural science (Darwin, Einstein, and Lavoisier), psychology (Freud), political economy (Smith), and philosophy (Kant and Plato), so as to represent a wide range of written texts. The preface, contents and index pages were deleted before starting text pre-processing because they may act as noise and may affect the final results.

Before calculating the normalized ACF with Equation (7), we applied the following pre-processing procedures to each of the texts.

1) Blank lines were removed and multiple adjacent blank characters were replaced with a single blank character.

2) Each of the texts was split into sentences using a sentence segmentation tool. The software is available from https://cogcomp.org/page/tools/.

3) Each uppercase letter was converted to lowercase.

4) Comparative and superlative forms of adjectives and adverbs were converted into positive forms. Plural forms of nouns were converted into singular ones, and all verb forms other than the base form were converted into the base form. For these conversions, we used TreeTagger, a language-independent part-of-speech tagger available from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.

5) Strings containing numbers were deleted. All punctuation characters were replaced with a single blank character.

6) Stop-word removal was performed by use of the stop-word list built for the experimental SMART [
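The pre-processing steps listed above can be sketched roughly as follows. This is a simplified illustration rather than the pipeline used in the study: a naive regular-expression splitter stands in for the sentence segmentation tool of step 2, the lemmatization of step 4 is omitted, and `STOP_WORDS` is a tiny hypothetical list rather than the SMART stop-word list.

```python
import re

# Hypothetical miniature stop-word list, for illustration only
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "it"}


def preprocess(raw_text):
    """Return a list of sentences, each a list of lowercase content tokens."""
    # Step 1: collapse blank lines and repeated whitespace into single blanks
    text = re.sub(r"\s+", " ", raw_text).strip()
    # Step 2 (naive stand-in): split after '.', '!' or '?' followed by a blank
    sentences = re.split(r"(?<=[.!?])\s+", text)
    result = []
    for sentence in sentences:
        # Step 3: convert to lowercase
        sentence = sentence.lower()
        # Step 4 (lemmatization) is omitted in this sketch
        # Step 5: delete strings containing digits, blank out punctuation
        tokens = [tok for tok in sentence.split() if not re.search(r"\d", tok)]
        sentence = re.sub(r"[^\w\s]", " ", " ".join(tokens))
        # Step 6: remove stop words
        result.append([w for w in sentence.split() if w not in STOP_WORDS])
    return result


sample = "Alice was  beginning to get tired.\n\nIt was 1865! The Rabbit ran."
sentences = preprocess(sample)
```

Sentences that become empty after filtering are kept as empty lists in this sketch, so that the sentence-numbering time is not shifted by the removal of words.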

Some basic statistics of the used texts, evaluated after the pre-processing procedures, are listed in

Text | Vocabulary size | Length in words | Length in sentences | Number of frequent words |
---|---|---|---|---|
Carroll | 1848 | 8191 | 1098 | 10 |
Twain | 5981 | 25,682 | 4288 | 46 |
Austen | 4643 | 39,590 | 5523 | 136 |
Tolstoy | 14,555 | 212,483 | 28,432 | 797 |
Melville | 14,413 | 85,557 | 8556 | 237 |
Darwin | 5316 | 58,611 | 3991 | 212 |
Einstein | 1893 | 11,642 | 963 | 24 |
Lavoisier | 3841 | 42,417 | 3029 | 155 |
Freud | 4006 | 19,533 | 1828 | 30 |
Smith | 6817 | 140,905 | 11,318 | 537 |
Kant | 5792 | 75,285 | 5715 | 289 |
Plato | 5400 | 35,491 | 4468 | 103 |

in the relevant text. Note that the set of these frequent words for each text contains not only content words, some of which play central roles in the explanation of important and specific ideas in the text, but also words that occur frequently merely due to their functionality. The former are context-specific but the latter are not. In other words, the former are important for describing an idea and thus they are expected to be highly correlated over the span of, typically, several tens of sentences in which the idea is described. On the other hand, the latter are not expected to show any correlations because their occurrences are not context-specific but are governed by chance. As will be described in the next section, we calculate the normalized ACF with Equation (7) for the frequent words and find how these two kinds of frequent words behave differently in terms of their ACFs. For the calculation, we mainly employed the R software environment for statistical computing (version 3.1.2) [

In contrast with these, each of the ACFs in

To analyze the characteristic behaviors of ACFs described in the previous subsection, we introduced two model functions to express ACFs and attempted to fit these two parametrized functions to the calculated ACFs. One of the model functions is $\Phi_{\mathrm{KWW}}(t)$, which is used for ACFs showing dynamic correlations, as in

$$\Phi_{\mathrm{KWW}}(t) = \alpha \exp\left\{ -\left( \frac{t}{\tau} \right)^{\beta} \right\} + (1 - \alpha), \tag{8}$$

where α , β and τ are fitting parameters satisfying the inequality conditions

$$0 < \alpha \le 1, \tag{9}$$

$$0 < \beta \le 1, \tag{10}$$

$$0 < \tau. \tag{11}$$

Setting α = 1 in the above equation yields

$$\Phi_{\mathrm{KWW}}(t; \alpha = 1) = \exp\left\{ -\left( \frac{t}{\tau} \right)^{\beta} \right\}, \tag{12}$$

which is well known as the “Kohlrausch-Williams-Watts (KWW) function” or “stretched exponential function” and is widely used in material, social and economic sciences as a phenomenological description of relaxation for complex systems [

Another model function is $\Phi_{\mathrm{Poisson}}(t)$, which is suitable for ACFs exhibiting no dynamic correlations, as in

$$\Phi_{\mathrm{Poisson}}(t) = \begin{cases} 1 & (t = 0) \\ \gamma & (t > 0) \end{cases} \tag{13}$$

where γ is a fitting parameter satisfying

$$0 < \gamma < 1. \tag{14}$$

For ACFs exhibiting no correlations, as in

In the fitting procedures using the two model functions, we found that the set of $\Phi_{\mathrm{KWW}}(t)$ and $\Phi_{\mathrm{Poisson}}(t)$, Equations (8) and (13), offers full descriptive ability for all the calculated ACFs: for example, when fitting using $\Phi_{\mathrm{Poisson}}(t)$ gives a poor result, $\Phi_{\mathrm{KWW}}(t)$ provides a satisfactory fitting. For the fitting, we used the R package “minpack.lm”, which provides an R interface to non-linear least-squares fitting.
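To make the two model functions concrete, the sketch below fits both to an ACF in plain Python. It is not the procedure used in the study: a coarse grid search stands in for the non-linear least-squares fit of minpack.lm, the function names and grid ranges are assumptions chosen for illustration, and the ACF is synthetic, generated from Equation (8) itself. Only lags $t \ge 1$ are fitted, since both models equal 1 at $t = 0$.

```python
import math


def phi_kww(t, alpha, beta, tau):
    """Modified KWW function of Equation (8)."""
    return alpha * math.exp(-((t / tau) ** beta)) + (1.0 - alpha)


def fit_poisson(acf):
    """Least-squares gamma of Equation (13): the mean of the ACF over t >= 1."""
    tail = acf[1:]
    gamma = sum(tail) / len(tail)
    rss = sum((x - gamma) ** 2 for x in tail)
    return gamma, rss


def fit_kww_grid(acf):
    """Coarse grid search over (alpha, beta, tau); returns (alpha, beta, tau, rss)."""
    alphas = [i / 20 for i in range(1, 21)]        # 0.05 ... 1.00
    betas = [i / 10 for i in range(1, 11)]         # 0.1 ... 1.0
    taus = [10 ** (k / 4 - 3) for k in range(25)]  # 1e-3 ... 1e3, log-spaced
    best = None
    for alpha in alphas:
        for beta in betas:
            for tau in taus:
                rss = sum((acf[t] - phi_kww(t, alpha, beta, tau)) ** 2
                          for t in range(1, len(acf)))
                if best is None or rss < best[3]:
                    best = (alpha, beta, tau, rss)
    return best


# Synthetic ACF generated from Equation (8), not data from the study
acf = [phi_kww(t, 0.8, 0.5, 10.0) for t in range(0, 41)]
alpha, beta, tau, rss_kww = fit_kww_grid(acf)
gamma, rss_poisson = fit_poisson(acf)
```

Because the synthetic parameters lie on the grid, the search recovers them exactly; for real ACFs a proper optimizer with the bounds of Equations (9)-(11) would be the more reasonable choice.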

Another important point to note is that these two expressions for ACFs, Φ KWW ( t ) and Φ Poisson ( t ) , are not mutually exclusive. Rather, they are seamlessly connected in the following sense. Substituting a very small value of τ such that τ ≪ 1 into Equation (8) yields Φ KWW ( t ) ≅ 1 − α = constant for t ≥ 1 . Combining this fact with Φ KWW ( 0 ) = 1 leads us to an understanding of the nested relationship between Φ KWW ( t ) and Φ Poisson ( t ) : Φ Poisson ( t ) is formally included in the expression of Φ KWW ( t ) as the special case τ → 0 . This means that if Φ Poisson ( t ) gives a satisfactory fitting, then Φ KWW ( t ) with a small value of τ is also suitable to describe the ACF. An example of such a situation is shown in

candidate models of similar explanatory power, the simplest model is most likely to be the best choice [

Based on the two principles of model selection described above, we set three criteria for model selection through which the best model is determined from the two candidates, Φ KWW ( t ) and Φ Poisson ( t ) . If the ACF of a considered word is best described by Φ KWW ( t ) in terms of the criteria, then the word is called a “Type-I” word. If the best description is given by Φ Poisson ( t ) , then the word is called a “Type-II” word. Type-I words are those words that have dynamic correlations, as in

The following criteria classify a word as Type-I or Type-II without any ambiguity and are applied throughout the rest of this work.

(C1) After fitting procedures using both functions, Φ KWW ( t ) and Φ Poisson ( t ) , we evaluate the Bayesian information criterion (BIC) [

(C2) If BIC(KWW) is smaller than BIC(Poisson) and the best fitted value of τ in Φ KWW ( t ) is smaller than 0.01, then we judge that Φ Poisson ( t ) is better and we classify the considered word as a Type-II word. This judgment is a realization of the second principle, that is, we treat values of τ smaller than 0.01 as making no sense.

(C3) If BIC(KWW) is smaller than BIC(Poisson) and τ is greater than or equal to 0.01, then we judge that Φ KWW ( t ) is better and we classify the word as a Type-I word.

The reason for selecting the threshold value of τ as 0.01 in criteria (C2) and (C3) is as follows. It is natural to consider the minimum unit of the sentence-numbering time to be one sentence because the time is restricted to positive integers. Thus the “effective relaxation time” or the “effective duration” of dynamic correlations should also take values greater than or equal to one. The “effective relaxation time” of the ACFs described by Φ KWW ( t ) is approximately given by [

$$\tau_e = \frac{\Gamma(1/\beta)\, \tau}{\beta}, \tag{15}$$

where β and τ are the parameters in Φ KWW ( t ) and Γ denotes the gamma function. Substituting β = 0.2 into the above equation, where 0.2 is a typical value of β for Type-I words as can be seen in

We classified all frequent words into one of the two types according to the criteria (C1)-(C3).
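As a rough numerical check of the threshold used in criteria (C2) and (C3), the effective relaxation time of Equation (15), read here as $\tau_e = \Gamma(1/\beta)\,\tau/\beta$, can be evaluated at $\beta = 0.2$ and $\tau = 0.01$. The helper name below is an assumption for this example.

```python
import math


def effective_relaxation_time(beta, tau):
    """Equation (15): tau_e = Gamma(1 / beta) * tau / beta."""
    return math.gamma(1.0 / beta) * tau / beta


# Gamma(5) = 24, so tau_e = 24 * 0.01 / 0.2
tau_e = effective_relaxation_time(beta=0.2, tau=0.01)
```

Under these assumptions $\tau_e$ is about 1.2, that is, roughly one sentence, which is consistent with the requirement that the effective duration of a dynamic correlation be at least one unit of sentence-numbering time.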

As stated above, we used both of the two model functions, Φ KWW ( t ) and Φ Poisson ( t ) , to describe each of the calculated ACFs and then determined which model function to use by checking the criteria (C1)-(C3) for a considered ACF. In the determination, we used the Bayesian information criterion (BIC), which has been widely used as a criterion for model selection from among a finite set of models [

$$\mathrm{BIC}(M) = -2 \ln \hat{L}(M) + k \ln(n), \tag{16}$$

where $\hat{L}$ is the maximized value of the likelihood function of the model $M$, $k$ is the number of fitting parameters to be estimated, and $n$ is the number of data points. In a comparison of models, the model with the lowest BIC is chosen as the best one. Under the assumption that model errors are independent and identically distributed according to a normal distribution, the BIC can be rewritten as

$$\mathrm{BIC}(M) = n \ln \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i\big(\hat{\theta}(M)\big) \right)^2 \right\} + k \ln(n), \tag{17}$$

where $x_i$ is the i-th data point, $\hat{x}_i$ is the predicted value of $x_i$ by model $M$, and $\hat{\theta}(M)$ is the vector of parameter values of model $M$ optimized by the curve-fitting procedures. In the above equation, we have omitted an additive constant that depends on only $n$ and not on the model $M$.

Text | Type-I | (%) | Type-II | (%) | Total |
---|---|---|---|---|---|
Carroll | 5 | (50.0) | 5 | (50.0) | 10 |
Twain | 11 | (23.9) | 35 | (76.1) | 46 |
Austen | 13 | (9.6) | 123 | (90.4) | 136 |
Tolstoy | 273 | (34.3) | 524 | (65.7) | 797 |
Melville | 56 | (23.6) | 181 | (76.4) | 237 |
Darwin | 109 | (51.4) | 103 | (48.6) | 212 |
Einstein | 17 | (70.8) | 7 | (29.2) | 24 |
Lavoisier | 99 | (63.9) | 56 | (36.1) | 155 |
Freud | 14 | (46.7) | 16 | (53.3) | 30 |
Smith | 384 | (71.5) | 153 | (28.5) | 537 |
Kant | 143 | (49.5) | 146 | (50.5) | 289 |
Plato | 40 | (38.8) | 63 | (61.2) | 103 |

For our application, $M$ is KWW or Poisson, $x_i$ is the ACF of a considered word calculated with Equation (7) at the i-th lag step, $\hat{x}_i$ is the predicted value of the ACF given by $\Phi_{\mathrm{KWW}}(t)$ or $\Phi_{\mathrm{Poisson}}(t)$ at $t = i$, the parameter vector is $\hat{\theta}(\mathrm{KWW}) = (\alpha, \beta, \tau)$ or $\hat{\theta}(\mathrm{Poisson}) = \gamma$, the number of parameters is $k(\mathrm{KWW}) = 3$ or $k(\mathrm{Poisson}) = 1$, and $n = 100$, which is the maximum lag step used in the ACF calculation. We evaluated BIC(KWW) and BIC(Poisson) by use of Equation (17) and classified a considered word as Type-I or Type-II according to the criteria (C1)-(C3) described above. That is, if BIC(KWW) < BIC(Poisson) and $\tau \ge 0.01$, then we judge that $\Phi_{\mathrm{KWW}}(t)$ is the better model and we classify the word as a Type-I word; otherwise, we judge that $\Phi_{\mathrm{Poisson}}(t)$ is the better model and we classify the word as a Type-II word.
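The decision rule summarized above can be written down compactly. The following Python sketch assumes that the residual sums of squares of the two fits and the fitted $\tau$ are already available; the function names are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
import math


def bic_from_rss(rss, n, k):
    """Equation (17), without the model-independent additive constant.

    rss must be strictly positive; n is the number of lag steps and
    k the number of fitting parameters.
    """
    return n * math.log(rss / n) + k * math.log(n)


def classify(rss_kww, rss_poisson, tau, n=100, tau_min=0.01):
    """Return "Type-I" or "Type-II" following criteria (C1)-(C3)."""
    bic_kww = bic_from_rss(rss_kww, n, k=3)          # alpha, beta, tau
    bic_poisson = bic_from_rss(rss_poisson, n, k=1)  # gamma
    if bic_kww < bic_poisson and tau >= tau_min:
        return "Type-I"
    return "Type-II"


# Made-up fit results, for illustration only
label_a = classify(rss_kww=0.01, rss_poisson=0.5, tau=5.0)    # KWW clearly better
label_b = classify(rss_kww=0.01, rss_poisson=0.5, tau=0.001)  # tau below threshold
label_c = classify(rss_kww=0.5, rss_poisson=0.5, tau=5.0)     # equal fit, KWW penalized
```

The third case shows the role of the penalty term $k \ln(n)$: when both models fit equally well, the model with fewer parameters has the lower BIC.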

We consider here a stochastic model for Type-II words and attempt to derive Φ Poisson ( t ) , which is the model equation used for ACFs of Type-II words. We first assume that the observation count X t of a considered Type-II word in the first t sentences of a text obeys a homogeneous Poisson point process. This is because the process is the simplest one having the property that disjoint time intervals are completely independent of each other, and this property makes the process suitable for the Type-II case which does not show any dynamical correlations. Then, the probability of k observations of the word in t sentences is given by

$$P(X_t = k) = \frac{(\lambda t)^k}{k!} \exp(-\lambda t), \tag{18}$$

where λ is the rate of word occurrences (occurrence probability per sentence) and the mean of X t is given by E [ X t ] = λ t [

$$A(t) = X_t - X_{t-1}, \tag{19}$$

the mean of $A(t)$ turns out to be

$$E[A(t)] = E[X_t] - E[X_{t-1}] = \lambda t - \lambda (t - 1) = \lambda. \tag{20}$$

We then consider the ACF of A ( t ) which is defined by

$$\Phi(s) = \frac{E[A(t)\, A(t + s)]}{E[(A(t))^2]}. \tag{21}$$

The above definition is essentially the same as Equation (2) for ergodic systems in which expectation values can be replaced by time averages [

$$E[A(t)\, A(t + s)] = E[A(t)]\, E[A(t + s)] = \lambda^2, \tag{22}$$

where we have used Equation (20) and the stationary property, $E[A(t)] = E[A(t + s)]$. For the denominator of Equation (21), we obtain

$$E[(A(t))^2] = P(A(t) = 1) \times 1^2 + P(A(t) = 0) \times 0^2 = P(A(t) = 1) = \lambda. \tag{23}$$

The first equality holds because A ( t ) is either 0 or 1, and the last equality holds because we assume that the occurrence rate (occurrence probability per unit time) is λ. Substituting Equations (22) and (23) into Equation (21) yields an expression for Φ ( s ) ,

$$\Phi(s) = \begin{cases} 1 & (s = 0) \\ \lambda & (s > 0), \end{cases} \tag{24}$$

which is equivalent to Φ Poisson ( t ) given by Equation (13). Since λ is the rate constant of the homogeneous Poisson point process, it can be simply evaluated from real written text by

$$\hat{\lambda} = \frac{\text{number of sentences containing a considered word}}{\text{number of all sentences in text}}, \tag{25}$$

and the evaluated λ ^ can be directly compared with the fitting parameter γ in Equation (13) to confirm the validity of the discussion above.
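The comparison between $\hat{\lambda}$ and the plateau of the ACF can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not part of the study: each sentence contains the word independently with probability `rate`, which is the discrete-time analog of a homogeneous Poisson point process with a binary $A(t)$, and the text length and rate are arbitrary.

```python
import random


def simulate_positions(rate, n_sentences, seed=0):
    """Sentence numbers (1-based) containing the word, drawn independently."""
    rng = random.Random(seed)
    return [i for i in range(1, n_sentences + 1) if rng.random() < rate]


def normalized_acf(positions, n_sentences, max_lag):
    """Normalized ACF of Equation (7)."""
    occupied = set(positions)
    m = len(occupied)
    return [n_sentences * sum(1 for p in occupied if p + t in occupied)
            / (m * (n_sentences - t)) for t in range(max_lag + 1)]


N, rate = 20000, 0.05                      # arbitrary toy values
positions = simulate_positions(rate, N)
lambda_hat = len(positions) / N            # Equation (25)
phi = normalized_acf(positions, N, max_lag=100)
plateau = sum(phi[1:]) / 100               # least-squares gamma of Equation (13)
```

For such a memoryless process the value of `plateau` should lie close to `lambda_hat`, in line with Equation (24); real texts need not agree this closely.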

The influence of the short window size of lag steps mentioned above is evident in

Through the discussion on Type-II words described above, we can recognize that the value of the fitting parameter γ in Equation (13) carries important information: γ is the estimator for the rate constant of the homogeneous Poisson point process. This is the reason for employing Equation (2) as the starting point of the normalized ACF. If we employ Equation (3) instead of Equation (2), then all the ACFs of Type-II words become $\Phi(0) = 1$ and $\Phi(t > 0) = 0$, without exception, and thus they become useless for getting information about the underlying homogeneous Poisson point process.

Text | Text length in sentences | 1/(text length) | Average of $\gamma$ | Average of $\hat{\lambda}$ | Average of $\gamma / \hat{\lambda}$ |
---|---|---|---|---|---|
Carroll | 1098 | 9.11 × 10^{−4} | 0.1260 | 0.1106 | 1.1397 |
Twain | 4288 | 2.33 × 10^{−4} | 0.0352 | 0.0247 | 1.4227 |
Austen | 5523 | 1.81 × 10^{−4} | 0.0274 | 0.0200 | 1.3680 |
Tolstoy | 28,432 | 3.52 × 10^{−5} | 0.0129 | 0.0055 | 2.3592 |
Melville | 8556 | 1.17 × 10^{−4} | 0.0197 | 0.0128 | 1.5315 |
Darwin | 3991 | 2.51 × 10^{−4} | 0.0416 | 0.0274 | 1.5223 |
Einstein | 963 | 1.04 × 10^{−3} | 0.0774 | 0.0622 | 1.2455 |
Lavoisier | 3029 | 3.30 × 10^{−4} | 0.0403 | 0.0275 | 1.4646 |
Freud | 1828 | 5.47 × 10^{−4} | 0.0527 | 0.0363 | 1.4510 |
Smith | 11,318 | 8.84 × 10^{−5} | 0.0227 | 0.0127 | 1.7858 |
Kant | 5715 | 1.75 × 10^{−4} | 0.0326 | 0.0212 | 1.5394 |
Plato | 4468 | 2.24 × 10^{−4} | 0.0346 | 0.0233 | 1.4858 |

We have seen that frequent words can be classified as Type-I or Type-II words. Obviously, Type-I words, having dynamic correlations, are more important for a text because each of them appears multiple times in a bursty manner to describe a certain idea or a topic, which can be important for the text. In contrast, each of the Type-II words without dynamic correlations appears at an approximately constant rate in accordance with the homogeneous Poisson point process and therefore they cannot be related to any context in the text. The natural question arising from the discussion above is how we measure the importance of each word in terms of dynamic correlations.

As described earlier, we judged whether a word is Type-I or Type-II by using criteria (C1), (C2), and (C3), in which the comparison of BIC(KWW) and BIC(Poisson) plays a central role. We introduce here a new quantity, ΔBIC, for Type-I words with the hope of quantifying the importance of each word. ΔBIC is defined as the difference between BIC(KWW) and BIC(Poisson) for each Type-I word:

$$\Delta \mathrm{BIC} = \mathrm{BIC}(\mathrm{Poisson}) - \mathrm{BIC}(\mathrm{KWW}). \tag{26}$$

This value expresses the extent to which the best fitted Φ KWW ( t ) is different from the best fitted Φ Poisson ( t ) in terms of their overall functional behaviors. Since we have already seen that Φ Poisson ( t ) is the ACF of the homogeneous Poisson point process, which does not have any dynamic correlations, the difference between Φ KWW ( t ) and Φ Poisson ( t ) given by ΔBIC is considered to be an intuitive measure expressing the degree of dynamic correlation for Type-I words. In other words, ΔBIC describes the extent to which the stochastic process that governs the occurrences of the considered word deviates from a homogeneous Poisson point process. Note that ΔBIC always takes positive values because we define it only for Type-I words. Thus, a larger ΔBIC indicates that a word has a stronger dynamical correlation. The authors have already developed a measure of deviation from a Poisson distribution for static word-frequency distributions in written texts and have used that measure for text-classification tasks [

| Carroll | ΔBIC | Twain | ΔBIC | Austen | ΔBIC | Tolstoy | ΔBIC |
|---|---|---|---|---|---|---|---|
| hatter | (103.63) | sid | (95.57) | sir | (132.02) | army | (298.40) |
| turtle | (96.41) | aunt | (86.33) | Letter | (111.74) | prince | (251.69) |
| queen | (92.95) | polly | (53.35) | kitty | (90.41) | moscow | (240.09) |
| mock | (86.75) | heart | (35.84) | dance | (71.31) | french | (236.77) |
| gryphon | (56.34) | great | (24.22) | write | (62.84) | horse | (235.26) |
| | | good | (18.09) | charlotte | (49.88) | pierre | (234.66) |
| | | time | (11.25) | stay | (22.46) | emperor | (223.26) |
| | | hand | (8.61) | carriage | (16.57) | princess | (208.62) |
| | | reckon | (7.39) | morning | (8.55) | battle | (205.19) |
| | | make | (5.56) | speak | (8.49) | pray | (204.97) |
| | | give | (5.52) | uncle | (4.19) | remember | (203.83) |
| | | | | Great | (2.58) | russian | (201.50) |
| | | | | hour | (0.02) | doctor | (201.05) |
| | | | | | | letter | (198.53) |
| | | | | | | napoleon | (198.00) |
| | | | | | | officer | (191.16) |
| | | | | | | soldier | (189.32) |
| | | | | | | event | (186.64) |
| | | | | | | dolokhov | (185.77) |
| | | | | | | king | (184.92) |

| Melville | ΔBIC | Darwin | ΔBIC | Einstein | ΔBIC | Lavoisier | ΔBIC |
|---|---|---|---|---|---|---|---|
| whale | (238.84) | intermediate | (237.47) | theory | (107.63) | acid | (265.54) |
| boat | (176.84) | variety | (197.75) | gravitational | (89.66) | ord | (248.30) |
| captain | (167.21) | specie | (189.99) | velocity | (85.01) | caloric | (245.81) |
| thou | (149.44) | plant | (185.22) | field | (82.42) | metal | (217.58) |
| ahab | (148.56) | seed | (184.80) | motion | (79.29) | mercury | (205.74) |
| pip | (142.50) | area | (179.06) | point | (73.41) | gas | (193.47) |
| spout | (139.54) | organ | (174.80) | coordinate | (72.91) | water | (187.70) |
| line | (124.37) | bird | (174.07) | principle | (71.27) | combustion | (183.23) |
| jonah | (113.52) | flower | (167.41) | body | (67.05) | body | (176.02) |
| masthead | (112.86) | form | (162.20) | law | (61.30) | sulphur | (168.42) |
| sperm | (111.84) | instinct | (160.27) | time | (59.83) | tube | (165.41) |
| bildad | (110.36) | character | (158.87) | relativity | (56.62) | air | (164.38) |
| flask | (110.09) | nest | (158.22) | system | (54.48) | temperature | (152.40) |
| oil | (108.19) | rudimentary | (154.39) | light | (45.90) | muriatic | (147.60) |
| tail | (101.13) | bee | (147.35) | general | (33.84) | ice | (144.32) |
| queequeg | (98.87) | variability | (146.35) | relative | (27.94) | pound | (141.17) |
| harpooner | (96.33) | tree | (142.90) | space | (14.51) | oxygen | (138.28) |
| fish | (93.78) | island | (142.13) | | | distillation | (126.75) |
| carpenter | (90.69) | rank | (136.06) | | | charcoal | (125.45) |
| dick | (90.48) | selection | (132.83) | | | nitric | (124.40) |

| Freud | ΔBIC | Smith | ΔBIC | Kant | ΔBIC | Plato | ΔBIC |
|---|---|---|---|---|---|---|---|
| dream | (243.18) | price | (350.17) | judgement | (299.96) | opinion | (138.90) |
| thought | (130.91) | labour | (296.21) | conception | (259.94) | knowledge | (127.24) |
| sleep | (128.58) | profit | (286.68) | reason | (253.09) | evil | (119.59) |
| sexual | (112.75) | trade | (270.55) | object | (241.06) | state | (112.76) |
| unconscious | (83.93) | country | (269.82) | experience | (229.76) | justice | (107.34) |
| system | (76.47) | revenue | (265.82) | time | (225.28) | class | (95.16) |
| child | (69.36) | expense | (264.33) | intuition | (217.05) | god | (79.46) |
| idea | (58.90) | produce | (262.84) | internal | (208.99) | ruler | (78.21) |
| psychic | (51.56) | silver | (258.60) | proposition | (203.65) | love | (77.44) |
| process | (44.90) | town | (257.47) | quantity | (202.91) | soul | (76.82) |
| life | (13.51) | society | (257.46) | sensation | (201.85) | son | (74.23) |
| work | (3.76) | manufacture | (247.11) | cognition | (199.72) | end | (73.81) |
| find | (2.59) | industry | (246.25) | space | (198.54) | understand | (72.28) |
| place | (0.93) | capital | (240.01) | question | (196.14) | injustice | (71.82) |
| | | money | (238.05) | rule | (194.18) | enemy | (67.21) |
| | | stock | (235.11) | unity | (193.55) | pleasure | (66.70) |
| | | slave | (234.57) | deduction | (192.32) | answer | (66.43) |
| | | coin | (229.77) | condition | (185.50) | unjust | (63.41) |
| | | pound | (229.64) | principle | (183.47) | art | (62.49) |
| | | corn | (228.44) | change | (180.06) | great | (49.51) |

Freud, Smith, Kant, and Plato) than in novels (Carroll, Twain, Austen, Tolstoy, and Melville). This is probably because the words that characterize a certain topic are more context-specific in academic books than in novels.

To confirm the validity of using ΔBIC to measure the deviation from a homogeneous Poisson point process, we have attempted to apply another measure of the deviation to our text set, and have examined whether the relation between ΔBIC and this other measure can be interpreted in a uniform and consistent manner. We chose Kleinberg’s burst detection algorithm [

Kleinberg’s algorithm analyzes the rate of increase of word frequencies and identifies rapidly growing words by using a probabilistic automaton. That is, it assumes an infinite number of hidden states (various degrees of burstiness), each of which corresponds to a homogeneous Poisson point process with its own rate parameter, and a change of occurrence rate in a unit time interval is modeled as a transition between these hidden states. The trajectory of state transitions is determined by minimizing a cost function in which it is expensive (costly) to go up a level and cheap (zero-cost) to go down a level.
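The cost minimization described above can be sketched as a Viterbi-style dynamic program. This is a minimal illustration, not the implementation used in this study. It assumes the gap-based form of Kleinberg's automaton, in which state i emits gaps between successive occurrences from an exponential density with rate (n/T)·s^i and moving up from state i to state j costs (j − i)·γ·ln n. The function name `kleinberg_states`, the defaults `s=2.0` and `gamma=1.0`, and the cap `n_states` on the nominally infinite state set are assumptions made for the example.

```python
import math

def kleinberg_states(gaps, s=2.0, gamma=1.0, n_states=5):
    """Return the minimum-cost state sequence for a list of positive gaps.

    Assumed setup (not taken from the paper): a capped number of states,
    exponential gap densities, and a linear cost for moving up.
    """
    n = len(gaps)
    base_rate = n / sum(gaps)                      # rate of the base state
    rates = [base_rate * s ** i for i in range(n_states)]

    def emit_cost(i, x):
        # Negative log of the exponential density with rate rates[i] at gap x.
        return rates[i] * x - math.log(rates[i])

    def move_cost(i, j):
        # Going up is costly, going down (or staying) is free.
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    # cost[j]: cheapest cost of any state sequence that ends in state j.
    cost = [0.0] + [math.inf] * (n_states - 1)     # the automaton starts in state 0
    back = []                                      # back[t][j]: best predecessor of j
    for x in gaps:
        new_cost, pointers = [], []
        for j in range(n_states):
            best_i = min(range(n_states), key=lambda i: cost[i] + move_cost(i, j))
            new_cost.append(cost[best_i] + move_cost(best_i, j) + emit_cost(j, x))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)

    # Trace the optimal sequence backwards from the cheapest final state.
    state = min(range(n_states), key=lambda j: cost[j])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical gaps (in sentences) between occurrences of one word:
# sparse occurrences, then a dense run, then sparse occurrences again.
example_path = kleinberg_states([10] * 10 + [1] * 10 + [10] * 10)
```

In this toy input the dense run in the middle is assigned to a higher state than the sparse stretches on either side, which is how the automaton marks a burst.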

Typical results of Kleinberg’s algorithm are shown in

were detected by Kleinberg’s algorithm. For

Text | Correlation coefficient | p-value |
---|---|---|

Carroll | 0.619 | 2.65 × 10^{−1} |

Twain | 0.966 | 1.46 × 10^{−6} |

Austen | 0.804 | 9.14 × 10^{−4} |

Tolstoy | 0.746 | 2.20 × 10^{−16} |

Melville | 0.810 | 4.09 × 10^{−13} |

Darwin | 0.864 | 2.20 × 10^{−16} |

Einstein | 0.727 | 9.49 × 10^{−4} |

Lavoisier | 0.874 | 2.20 × 10^{−16} |

Freud | 0.625 | 1.69 × 10^{−2} |

Smith | 0.781 | 2.20 × 10^{−16} |

Kant | 0.831 | 2.20 × 10^{−16} |

Plato | 0.735 | 6.85 × 10^{−8} |

these 9 words by use of ΔBIC, as seen in the scatter plot for the Einstein text in
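As a rough illustration of how a correlation coefficient like those in the table above can be computed for one text, the sketch below evaluates Pearson's r between two lists of per-word scores, together with the associated t statistic. It assumes the tabulated coefficient is Pearson's r, which this section does not state. The function names and the numbers in the example are hypothetical and are not data from this study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Made-up scores for five words: one list stands in for the burst-based
# measure, the other for delta-BIC.
burst_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
delta_bic_scores = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(burst_scores, delta_bic_scores)
t = t_statistic(r, len(burst_scores))
```

The p-value follows from the Student t distribution with n − 2 degrees of freedom. `scipy.stats.pearsonr` returns both the coefficient and the p-value if SciPy is available.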

Furthermore, we consider that ΔBIC can be used to measure the importance of a word in a given text, because it expresses the extent to which the occurrences of the word are correlated with each other across successive sentences; a large ΔBIC means that the word occurs repeatedly in a bursty and context-specific manner. Of course, word importance can be judged from various viewpoints, but ΔBIC at least offers a well-defined calculation procedure with a clear meaning in terms of the stochastic properties of word occurrence. In this sense, ΔBIC is applicable to a wide range of practical tasks that require the degree of importance of each word.

In this study, we have regarded real written texts as time-series data and have tried to clarify the dynamic correlations of words by using ACFs. Serial sentence numbers, assigned from the first to the last sentence of a text, are used as discretized time in order to define appropriate ACFs. Starting from the standard definition of an ACF in signal processing, we derived a normalized expression for an ACF that is suitable for expressing the dynamic correlations of word occurrences. We calculated the ACFs of all frequent words (words occurring in at least 50 sentences of a given text) in 12 books chosen from various areas. The ACFs obtained can be classified into two groups: one for words showing dynamic correlations and the other for words showing no correlations. Words showing dynamic correlations are called Type-I words, and their ACFs turn out to be well described by a modified KWW function. Words showing no correlations are called Type-II words, and their ACFs are modeled by a simple stepdown function. For Type-II words, we have shown that the functional form of the simple stepdown function can be derived theoretically from the assumption that the stochastic process governing word occurrence is a homogeneous Poisson point process. To select the appropriate type for a word, we have used the Bayesian information criterion (BIC).
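The stepdown behavior of uncorrelated words can be illustrated numerically. The sketch below is not the normalized ACF derived in this paper. It assumes a simpler form, the standard signal-processing autocorrelation of a 0/1 occurrence signal divided by its lag-0 value, and it uses independent per-sentence occurrences with a fixed probability `p` as a discrete stand-in for a homogeneous Poisson point process. The names `occurrence_acf`, `p`, and `n_sentences` are illustrative.

```python
import random

def occurrence_acf(signal, max_lag):
    """Autocorrelation of a 0/1 occurrence signal, normalised by its lag-0 value.

    signal[i] is 1 if the word occurs in sentence i and 0 otherwise.
    This normalisation is an assumption made for the illustration.
    """
    n = len(signal)
    norm = sum(a * a for a in signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag)) / norm
        for lag in range(max_lag + 1)
    ]

# Simulated text: the word appears in each sentence independently with
# probability p, so occurrences carry no memory of earlier sentences.
rng = random.Random(0)
p = 0.1
n_sentences = 20000
signal = [1 if rng.random() < p else 0 for _ in range(n_sentences)]

acf = occurrence_acf(signal, max_lag=20)
```

Under these assumptions the ACF equals 1 at lag 0 and then drops at once to a flat level close to `p` for every positive lag, which is the shape a stepdown function describes. A word with dynamic correlations would instead decay gradually toward that level.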

We further proposed a measure of word importance, ΔBIC, defined as the difference between the BIC obtained with the KWW function and that obtained with the stepdown function. If ΔBIC takes a large value, the stochastic process governing word occurrence is considered to deviate greatly from a homogeneous Poisson point process (which produces no correlations between any two separated time intervals). This indicates that a word with a large ΔBIC has strong dynamic correlations persisting over some range of the text and is, therefore, important for that text. We picked the top 20 Type-I words in terms of ΔBIC for each of the 12 texts and found that the resulting word lists seem plausible, especially for academic books. The validity of using ΔBIC to measure word importance was confirmed by comparing ΔBIC with another measure of word importance, the CCT, which was obtained by applying Kleinberg’s burst detection algorithm. We found that CCT and ΔBIC show a strong positive correlation. Since CCT and ΔBIC rest on completely different foundations, the strong positive correlation between them suggests that both are useful ways to measure the importance of a word.
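The comparison behind ΔBIC can be sketched as follows, under assumptions that are not taken from this section. The BIC is written in its common least-squares form, n·ln(RSS/n) + k·ln n, which presumes Gaussian residuals. The sign is chosen as BIC(stepdown) − BIC(KWW), so that a large positive value favors the KWW fit. The parameter counts and residual sums of squares in the example are hypothetical.

```python
import math

def bic_least_squares(rss, n_points, n_params):
    """BIC of a least-squares fit, assuming Gaussian residuals."""
    return n_points * math.log(rss / n_points) + n_params * math.log(n_points)

def delta_bic(rss_stepdown, k_stepdown, rss_kww, k_kww, n_points):
    """BIC(stepdown) - BIC(KWW); positive when the KWW fit is preferred."""
    return (bic_least_squares(rss_stepdown, n_points, k_stepdown)
            - bic_least_squares(rss_kww, n_points, k_kww))

def word_type(d_bic):
    """Assumed decision rule: the model with the lower BIC determines the type."""
    return "Type-I" if d_bic > 0 else "Type-II"

# Hypothetical fit results for one word's ACF sampled at 100 lags.
d = delta_bic(rss_stepdown=2.0, k_stepdown=1, rss_kww=0.5, k_kww=3, n_points=100)
label = word_type(d)
```

With this sign convention, the extra parameters of the KWW function are penalized by the k·ln n term, so a word is labeled Type-I only when the KWW fit reduces the residuals by enough to outweigh that penalty.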

At present, the stochastic process that governs dynamic correlations of Type-I words with long-range duration time is not clear. A detailed study along this line, through which we will try to identify the process suitable to describe word occurrences in real texts, is reserved for future work.

We thank Dr. Yusuke Higuchi for useful discussions and illuminating suggestions. This work was supported in part by JSPS Grants-in-Aid (Grant Nos. 25589003 and 16K00160).

The authors declare no conflicts of interest regarding the publication of this paper.

Ogura, H., Amano, H. and Kondo, M. (2019) Measuring Dynamic Correlations of Words in Written Texts with an Autocorrelation Function. Journal of Data Analysis and Information Processing, 7, 46-73. https://doi.org/10.4236/jdaip.2019.72004