
Run count statistics serve a central role in tests of non-randomness of stochastic processes of interest to a wide range of disciplines within the physical sciences, social sciences, business and finance, and other endeavors involving intrinsic uncertainty. To carry out such tests, it is often necessary to calculate two kinds of run count probabilities: 1) the probability that a certain number of trials results in a specified multiple occurrence of an event, or 2) the probability that a specified number of occurrences of an event take place within a fixed number of trials. The use of appropriate generating functions provides a systematic procedure for obtaining the distribution functions of these probabilities. This paper examines relationships among the generating functions applicable to recurrent runs and discusses methods, employing symbolic mathematical software, for implementing numerical extraction of probabilities. In addition, the asymptotic form of the cumulative distribution function is derived, which allows accurate runs statistics to be obtained for sequences of trials so large that computation times for extraction of this information from the generating functions could be impractically long.

A stochastic process generates random outcomes in time or space. Such processes occur widely in the physical and social sciences, as well as in purely practical human activities such as finance, manufacturing, and commerce. Despite their random occurrence―indeed, precisely because of it―the outcomes of a stochastic process will display ordered patterns which a statistically naïve observer may mistakenly interpret as predictively useful information. In recent years, controversial issues over the information content of time series have arisen in a variety of disciplines such as nuclear physics [

An exclusive run is an unbroken sequence of similar events, ordinarily of a binary nature. For example, a sequence of symbols aabbbaa comprises 2 runs of a’s of length 2 and 1 run of b’s of length 3. If the events a and b occur with fixed probabilities throughout the sequence, the stochastic process is of the Bernoulli kind, and the distribution theory of binary runs [

It is not necessary for a stochastic process to generate binary events in order to be analyzed for runs. For example, a sequence of n different observations

Although developed initially for testing quality control in manufacturing, exclusive runs and up-down runs have been employed in the analysis of a variety of experiments to test the fundamental prediction of quantum mechanics that transitions between quantum states occur randomly [6,7]. A problematic issue in the counting of exclusive or up-down runs is that the length of a run can be changed by future events. Thus, in the succession aabbbaa, the second run of two a’s could become a run of three or four a’s if the next two trials resulted in ab or aa, respectively.
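The dependence of exclusive run lengths on future events can be illustrated with a minimal Python sketch. The function name `exclusive_runs` is a hypothetical helper introduced here for illustration only; it simply groups consecutive identical symbols.

```python
from itertools import groupby


def exclusive_runs(sequence):
    """Return (symbol, run_length) pairs for the maximal unbroken runs."""
    return [(symbol, sum(1 for _ in group)) for symbol, group in groupby(sequence)]


# The succession discussed in the text: two runs of a's and one run of b's.
print(exclusive_runs("aabbbaa"))

# Appending two further trials changes the length of the last run of a's.
print(exclusive_runs("aabbbaa" + "ab"))  # last a-run now has length 3
print(exclusive_runs("aabbbaa" + "aa"))  # last a-run now has length 4
```

The length of the final run is not settled until a different symbol appears, which is the ambiguity that the recurrent-run definition avoids.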

A third kind of runs analysis, based on Feller’s theory of recurrent events [^{th} trial). In a sequence of Bernoulli trials, a recurrent run of length t occurs at the

The advantage of Feller’s definition is that runs of a fixed length become recurrent events, and the statistical theory of recurrent events can then be applied to testing empirical data sequences for permutational invariance over a much wider variety of patterns than just those of unbroken sequences of identical binary elements. For example, one may be interested in testing the recurrence of a pattern abab, which, in a quantum optics experiment, might correspond to a sequence of alternating detections of left and right circularly polarized photons, or, in a series of stock price variations, might correspond to a sequence of alternating observations of rising and falling closing prices. Besides the application to runs, the same theoretical foundation may be applied to recurrent events in other forms such as return-to-origin problems, ladder-point problems (instances where a sum of random variables exceeds all preceding sums), and waiting-time problems.
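Under Feller’s convention, counting starts afresh each time a run of the specified length is completed, so the number of runs of length t in a sequence is the number of non-overlapping blocks of t consecutive successes. The following Python sketch illustrates that counting rule; the function name `recurrent_run_trials` and the symbols "S" and "F" for success and failure are assumptions made here for illustration.

```python
def recurrent_run_trials(outcomes, t, success="S"):
    """Return the 1-based trial numbers at which a recurrent run of length t is completed."""
    completed = []
    streak = 0
    for trial, outcome in enumerate(outcomes, start=1):
        if outcome == success:
            streak += 1
            if streak == t:
                completed.append(trial)
                streak = 0  # counting starts afresh after each completed run
        else:
            streak = 0  # a failure interrupts the current streak
    return completed


sequence = "SSSSFSSSSSSF"
print(recurrent_run_trials(sequence, 3))  # [3, 8, 11]
print(recurrent_run_trials(sequence, 4))  # [4, 9]
print(recurrent_run_trials(sequence, 5))  # [10]
```

Because a completed run is never altered by later trials, the same sequence yields a fixed count for each run length, in contrast to exclusive runs.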

The theory of recurrent runs, the relevant parts of which are examined in the following section, leads to generating functions from which the probability of a run of defined events of specified length can in principle be calculated exactly. As a practical matter, the extraction of these probabilities requires geometrically longer computation times with increasing sequence length. The availability of fast laptop computers with large random access memory, and of symbolic mathematical software with an unprecedented ability to execute series expansions and perform differentiation and integration, provides the analyst with computational power unimaginable to the creators of the statistical theory of runs. I report here mathematical strategies for significantly reducing the computation time for the probability of the widely applicable case of k occurrences of runs of length t in a Bernoulli sequence of length n.

Following Feller, I define the recurrent event E to be a run of successes of length t in a sequence of Bernoulli trials with p the probability of a single successful outcome and

The distribution of the variable T is defined by

where

The number of trials to the

where

is the probability that the

I leave to the cited literature the proof that the generating function (5) for runs of length t with individual probability of success p is given by

from which the mean and standard deviation of the recurrence times follow by differentiation

For economy of expression, the parameters p, t will be suppressed in the arguments of

For many applications the analyst’s interest is not necessarily in the recurrence time (i.e. number of trials) to the

The probability

and serves in the construction of two pgf’s

and

Note that the summation in (13) is over the number of occurrences, whereas the summation in (14) is over the number of trials. The second equality in (14) follows directly from Equation (11). Multiplying both sides of (13) by

from which the probabilities

A sense of the structure of the formalism can be obtained by considering the case of recurrent runs of length

the following rational expression for the right side of Equation (15) and its corresponding Taylor-series expansion

to order

It is not necessary to know the individual

which, for many applications in the physical sciences and elsewhere, is the experimentally observed quantity of interest. Multiplying both sides of Equation (17) by

Starting from the relation

and following the same procedure that led to (18) yields the generating function for the distribution

From the generating functions (18) and (20) one can deduce the asymptotic relations for mean and variance

and

The statistics (probabilities and expectation values) for any physically meaningful choice of probability of success p, run length t, and number of trials n are deducible exactly from expressions (15) and (18) in the manner previously illustrated. For many applications, however, particularly where it is possible to accumulate long sequences of data as is often the case in atomic, nuclear and elementary particle physics experiments or investigations of stock market time series, the tests for evidence of non-random behavior are best made by examining long runs. Suppose, for example, one wanted the probability of obtaining the number of occurrences of runs of length 50 in a sequence of 100 trials. One approach, leading directly to all non-vanishing probabilities, would be to extract the 100^{th} term

from the series expansion of the bivariate generating function (15). Powerful symbolic mathematical software such as Maple or Mathematica permits one to do this up to a certain order limited by the speed and memory of one’s computer, but these calculational tools may become insufficient when one is seeking exact probabilities of runs in data sequences of thousands to millions of bits.
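For sequences of modest length, the exact distribution of the number of recurrent runs can also be cross-checked numerically without a symbolic series expansion. The sketch below is not the generating-function method of this paper; it swaps in a direct dynamic-programming recursion over the state (number of completed runs, current streak of successes). The function name `run_count_distribution` is hypothetical.

```python
def run_count_distribution(n, t, p):
    """Return [P(N = 0), P(N = 1), ...] for the number N of recurrent success
    runs of length t in n Bernoulli trials with success probability p."""
    q = 1.0 - p
    max_k = n // t  # at most n // t non-overlapping runs fit into n trials
    # state[k][j]: probability of k completed runs and a current streak of j successes
    state = [[0.0] * t for _ in range(max_k + 1)]
    state[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * t for _ in range(max_k + 1)]
        for k in range(max_k + 1):
            for j in range(t):
                prob = state[k][j]
                if prob == 0.0:
                    continue
                new[k][0] += prob * q  # failure resets the streak
                if j + 1 == t:
                    new[k + 1][0] += prob * p  # run completed; counting restarts
                else:
                    new[k][j + 1] += prob * p  # streak grows by one
        state = new
    return [sum(row) for row in state]


dist = run_count_distribution(n=100, t=2, p=0.5)
mean = sum(k * pk for k, pk in enumerate(dist))
print(round(mean, 3))
```

For n = 100 and p = 0.5 this recursion gives a mean of about 16.556 for t = 2 and 7.041 for t = 3, which appears to agree with the exact means tabulated later in the text. The cost grows roughly as n²: a check for moderate n, not a substitute for the asymptotic results.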

Using complex variable theory, one can extract expressions for

and similarly

where C is the unit circle and the generating function

Because the generator

1) For given p and t, express

2) Convert

3) Extract the single desired term

Alternatively, one can convert the series generated to order

As an illustration, consider the calculation (by means of Maple) of the exact mean number of runs of length

1)

2)

3)

where

The procedure described above for converting the rational expression of s into a formal power series in s did not work with the bivariate generator

1) For given p and t, express

2) Generate a series expansion of

3) Convert the series expansions into polynomials

4) Subtract one polynomial from the other to obtain an expression of the form

where

5) Evaluate the set

As an example, the procedure led in under 10 seconds to the full set

One final procedure, particularly suitable when only selected probabilities of the full set

In Maple, method (1) proceeds as follows:

1)

2)

where the ellipsis is to be replaced by the

Note that the series must be expanded to

For many applications in the physical sciences and elsewhere, the full set of probabilities

introduced in Equation (11). Experimental situations calling for preferential usage of a cumulative probability distribution over a probability function abound in the physical sciences, as, for example, in the analysis of fragmentation [

The generating function for the ccp is derivable from

(It is understood that

One can show by application of the Central Limit Theorem (CLT) to relation (11) that for sufficiently large number of trials n and number of occurrences k, the number

The Gaussian approximation, however, does not work well in estimating the cumulative probability

in which the constant

Recall that

which is a sum of k independent exponential random variables

Run Length t | Mean Number of Runs (Exact) | Mean Number of Runs (Gaussian Approximation) |
---|---|---|
1 | 50 | 50 |
2 | 16.556 | 16.667 |
3 | 7.041 | 7.143 |
4 | 3.258 | 3.333 |
5 | 1.562 | 1.613 |
6 | 7.611 (−1) | 7.937 (−1) |
7 | 3.738 (−1) | 3.937 (−1) |
8 | 1.842 (−1) | 1.961 (−1) |
9 | 9.098 (−2) | 9.785 (−2) |
10 | 4.496 (−2) | 4.888 (−2) |
11 | 2.222 (−2) | 2.443 (−2) |
12 | 1.099 (−2) | 1.221 (−2) |
13 | 5.433 (−3) | 6.104 (−3) |
14 | 2.686 (−3) | 3.052 (−3) |
15 | 1.328 (−3) | 1.526 (−3) |
16 | 6.561 (−4) | 7.630 (−4) |
17 | 3.243 (−4) | 3.815 (−4) |
18 | 1.602 (−4) | 1.907 (−4) |
19 | 7.916 (−5) | 9.537 (−5) |
20 | 3.910 (−5) | 4.768 (−5) |
40 | 2.819 (−11) | 4.547 (−11) |
60 | 1.820 (−17) | 4.337 (−17) |

Run Length t | Mean Recurrence Time | Standard Deviation of Recurrence Time |
---|---|---|
1 | 2.00 | 1.41 |
5 | 62.00 | 58.22 |
10 | 2046.00 | 2037.47 |
20 | | |
30 | | |
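The numerical entries tabulated above for run lengths 1, 5, and 10 are consistent with the classical closed-form expressions for the mean and variance of the recurrence time of a success run, evaluated at p = 0.5. The sketch below assumes those textbook expressions, mean (1 − p^t)/(q p^t) and variance 1/(q p^t)² − (2t + 1)/(q p^t) − p/q² with q = 1 − p; the function names are hypothetical, and the notation may differ from that of the equations in this paper.

```python
import math


def recurrence_mean(p, t):
    """Mean number of trials between recurrent success runs of length t."""
    q = 1.0 - p
    return (1.0 - p**t) / (q * p**t)


def recurrence_std(p, t):
    """Standard deviation of the recurrence time for success runs of length t."""
    q = 1.0 - p
    a = 1.0 / (q * p**t)
    variance = a * a - (2 * t + 1) * a - p / q**2
    return math.sqrt(variance)


for t in (1, 5, 10):
    print(t, round(recurrence_mean(0.5, t), 2), round(recurrence_std(0.5, t), 2))
```

For long runs the standard deviation approaches the mean, as the tabulated values already suggest at t = 10.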

where

By virtue of equivalence (11), Equation (28) also yields a closed-form asymptotic relation for the sought-for cumulative probability distribution

Consider, for example, the probability of 1 or more runs of length 20 in 2000 trials with individual probability of success 0.5. A comparison of the results of (a) Equation (28), (b) the cumulative Gaussian distribution with mean and variance given by relations (21) and (22), and (c) the exact calculation obtained from the generating function (26)

supports the distribution (28). If the number of trials is increased to 10^{6}, the Gamma and Gaussian asymptotic distributions lead to comparable results

but the former is still superior to the latter when compared to the probability calculated from the exact generating function.
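Two of the three quantities in this comparison, the exact probability and the Gaussian estimate, can be sketched numerically as follows. Several assumptions are made here. The exact value is obtained by a direct recursion on the probability that no run has yet been completed, rather than from the generating function. The Gaussian estimate assumes the standard renewal-theory asymptotics, mean n/μ and variance nσ²/μ³, with μ and σ² the mean and variance of the recurrence time, and applies no continuity correction. Equation (28) itself is not reproduced. All function names are hypothetical.

```python
import math


def prob_at_least_one_run(n, t, p):
    """Exact probability of one or more success runs of length t in n trials."""
    q = 1.0 - p
    # alive[j]: probability that no run is complete and the current streak is j
    alive = [0.0] * t
    alive[0] = 1.0
    for _ in range(n):
        new = [0.0] * t
        new[0] = q * sum(alive)  # a failure resets the streak
        for j in range(t - 1):
            new[j + 1] = p * alive[j]  # a success extends the streak
        alive = new  # a success from streak t - 1 completes a run and is absorbed
    return 1.0 - sum(alive)


def gaussian_prob_at_least(k, n, t, p):
    """Gaussian estimate of P(number of runs >= k), assumed renewal asymptotics."""
    q = 1.0 - p
    a = 1.0 / (q * p**t)
    mu = (1.0 - p**t) * a                      # mean recurrence time
    var = a * a - (2 * t + 1) * a - p / q**2   # variance of recurrence time
    mean_n = n / mu
    sd_n = math.sqrt(n * var / mu**3)
    z = (k - mean_n) / sd_n
    return 0.5 * math.erfc(z / math.sqrt(2.0))


print(prob_at_least_one_run(2000, 20, 0.5))
print(gaussian_prob_at_least(1, 2000, 20, 0.5))
```

With these assumptions the exact probability for n = 2000, t = 20, p = 0.5 is a little under 10⁻³, while the Gaussian estimate is vanishingly small. This matches the observation that the Gaussian approximation performs poorly for such cumulative probabilities.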

The relation (28), by which one can calculate the cumulative probability

The theory of recurrent runs provides a statistical basis for rejecting the hypothesis that a series of observations (in time or space) is random. This question often arises in experimental investigations in atomic, optical, nuclear, and elementary particle physics, as well as in other sciences, finance, and commerce, which may entail a very large number of trials or observations, perhaps in the thousands to millions.

In this paper theoretical and numerical methods based on different generating functions were derived and investigated to determine (a) the probability

The methods reported here can be implemented on modern laptop computers running commercially available symbolic mathematical software, such as Maple (which was the application used by the author). Computation times for application of these methods to data sequences up to millions of trials could range from seconds to minutes.

To compute runs statistics for sequences of intermediate to very long trial numbers, the asymptotic distribution for the number of trials up to and including the