 Counting Runs of Ones and Ones in Runs of Ones in Binary Strings Counting runs in binary strings Frosso S. Makri, Zaharias M. Psillakis, Nikolaos Kollas Departments of Mathematics and Physics University of Patras Patras, Greece makri@math.upatras.gr; psillaki@physics.upatras.gr Abstract—Consider a binary string (a symmetric Bernoulli sequence) of length . For a positive integer , we exactly enumerate, in all  possible binary strings of length , the number of all runs of 1s of length (equal, at least)  and the number of 1s in all runs of 1s of length at least . To solve these counting problems, we use probability theory and we obtain simple and easy to compute explicit formulae as well as recursive schemes, for these potential useful in engineering numbers. Keywords-runs; symmetric Bernoulli trials; probability theory; combinatorial problems 1. Introduction and Preliminaries Nowadays, the increasing use of the computer science in diverse applications including encoding, compression and transmission of digital information calls for understanding the distribution of runs of 1s or 0s. For instance, such knowledge would help in analyzing, and comparing also, several techniques used in communication networks (wired or wireless). In such networks binary data, ranging from a few bytes (e.g. e-mails) to many gigabytes of greedy multimedia applications (e.g. video on demand), are highly processed. For details, see [1-2] and the references therein. Another area where the study of the distribution of runs of 1s and 0s has become increasingly useful is the field of bioinformatics or computational biology. In particular, molecular biologists examine tandem repeats among DNA (Deoxyribonucleic acid) segments trying to specify how probable are runs of matches, denoted as 1s, in adjacent segments of a DNA sequence. See, e.g. [3-5]. In such applications, as the indicative ones mentioned above, a key point is the understanding how 1s and 0s are distributed and combined among the elements of a binary sequence (finite or infinite, memoryless or not) and eventually forming runs of 1s and 0s according to certain enumeration rules (counting schemes). Each enumeration rule defines how runs of same symbols (i.e. 1s or 0s) are formed and consequently counted. A rule may depend on, among other considerations, whether overlapping counting is allowed or not as well as if the counting starts or not from scratch when a run of a certain size has been so far enumerated. For extensive reviews of the runs literature we refer to [6-8]. The topic is still active and attractive too, because of the wide range of its application in many areas of applied probability and engineering including hypothesis testing, quality control, system reliability and financial engineering. Some recent contributions on the subject, among others, are the works of  – . Let be a sequence of binary (two-state) random variables (RVs) taking on the values zero (0) or one (1) ordered on a line. According to Mood’s  enumeration scheme a run of 1s (1-run) is defined to be a sequence of consecutive 1s preceded and succeeded by 0s or by nothing. The number of 1s in a 1-run is referred to as its length (or size). For a positive integer  let  denote the number of 1-runs of length exactly  in the first  , binary trials. Following Makri and Psillakis , we use the indicator functions  ending at   with the convention . Consequently, the statistic  can be expressed as   . The RV , which is a fundamental one in the run literature, besides its independent merit, may be used for the representation of other interesting statistics, too. Among them, the following two have been frequently discussed in the literature and in particular in financial engineering and bioinformatics [4, 17-18]. They refer to 1-runs of length exceeding a positive integer , , in  binary trials, and they are the number  of 1-runs of length at least ;  , and the number  of 1s in all 1-runs of length at least ;    An alternative interpretation of  is that it denotes the sum of the lengths of the 1-runs of length greater than or equal to  The statistics  have been studied on binary sequences of several internal structures by many researches who used various methods. See, e.g. [1, 4, 9, 11-14, 17-19, 23-30]. In this brief note we show how someone can easily enumerate explicitly the (total) number of occurrences of all 1-runs associated with the first two mentioned statistics, as well as, the (total) number of 1s according to the third one, in all possible  binary strings of length . Our approach is relied on simple and efficient probabilistic arguments. It provides an Open Journal of Applied Sciences Supplement：2012 world Congress on Engineering and Technology44Copyright © 2012 SciRes. alternative way to recapture explicit formulae for numbers associated with 1-runs and it also establishes a new explicit expression for the number of 1s in certain 1-runs. A unified recursive scheme for these numbers is provided, too. 2. Main Results Let  stand for the RVs ,,  for  , respectively. The support (range set) of  is     . (1) Next, we consider a sequence  of length of independent (i.e. derived by a memoryless source) and identically distributed 0-1 RVs with a common probability of 1s  i.e.  ,   Such a sequence, called a finite Bernoulli sequence, is of particular importance in studies of applied probability because of its simplicity, and also since it may be considered as a special case of a sequence with dependent elements; e.g. a Markovian or an exchangeable one. For a Bernoulli sequence of length , let    denote the probability mass function (PMF) of the RV ; i.e.   ,   ,  . (2) Then, the expected value of ,       (3) is given by (see Makri et al. )             In the sequel, we consider a symmetric () finite Bernoulli sequence (i.e. a finite binary string) of length  for which we obtain our main results. Since the cardinality of a proper sample space is  (i.e. there are  binary strings that are equally likely to occur) the classical definition of probability implies that     . (5) The numbers  ,  , for  admit the following interpretation: (i)  ,   is the number of all binary strings of length  with exactly   , 1-runs of length [exactly (), at least ()] , among all the  possible binary strings of length , . (ii)  is the number of all binary strings of length  with exactly  n, 1s contained in all 1-runs of length at least , among all the  possible binary strings of length , . Simple explicit expressions of  (not repeated here) are given in Makri and Psillakis . The authors provided an explicit formula of  , for  in terms of binomial coefficients. Their method is based on the solution of a combinatorial problem; specifically, the allocation of balls into cells under certain constrains (see Lemma 2.2 of ). An explicit expression of  , in terms of binomial coefficients too, is given by Sinha and Sinha  who used a generating function approach. The latter expression contains an additional sum; therefore it may be evaluated slower computationally than that provided in . The numbers  allow us to establish (we do not actually need their specific expressions according to the new proposed approach) respective numbers referring to all possible  binary strings of length . They are defined as , (6) i.e.  is the total number of occurrences of all 1-runs of length [exactly (), at least ()] , and the total number of 1s in all 1-runs of length at least  [], in all possible  binary strings of length , . Since   , hence by (5) and (6),  . Therefore (4) implies            (7) Readily, by symmetry,  provides the respective numbers associated with 0-runs and 0s in 0-runs. By (7) it is noted that for a fixed , ,   decreases exponentially as  increases, and for a fixed , as , it holds       (8) Furthermore,   , since   , . Table I presents the three numbers  in binary strings of length  bits,  , and for . The entries of the table confirm the previously noted behavior of the depicted numbers. Sinha  was the first who addressed the usefulness of the number  and also provided its formula. Then, Sinha and Sinha  obtained an explicit expression of  whereas Copyright © 2012 SciRes.45 Makri and Psillakis  derived the same formula for it and they also established an explicit expression of . In both papers ( and ) the approach was relied on the definition of  via  , . Recently, Sinha and Sinha  reestablished  solving explicitly a recursive generation scheme for it. The proposed, in the present note, approach is a new one and treats under the same frame all the numbers  in a simple, unified and systematic way. Accordingly, by (7) we effortless get a recursive scheme for ,  . It gives a way to generate  from  and it offers further insight in understanding the interdependencies among the studied numbers. Specifically, it holds           (9) with    . We note that for the particular case  we capture by the relevant entries of (9) and (7), Theorems 2 and 3 of , respectively. TABLE I. NUMBERS OF OCCURRENCES OF 1-RUNS,  AND NUMBER OF 1S, , IN BINARY STRINGS OF LENGTH       4 1 12 20 32 2 5 8 20 3 2 3 10 4 1 1 4 16 1 147456 278528 524288 2 69632 131072 376832 3 32768 61440 237568 4 15360 28672 139264 5 7168 13312 77824 6 3328 6144 41984 7 1536 2816 22016 8 704 1280 11264 9 320 576 5632 10 144 256 2752 11 64 112 1312 12 28 48 608 13 12 20 272 14 5 8 116 15 2 3 46 16 1 1 16 3. Conlusions In this note we stated three run statistics which are important in many areas of applied probability. We defined them on a binary (0-1) sequence, and we then provided explicitly their mean values for a Bernoulli sequence. After that, we considered binary strings (symmetric Bernoulli sequences) and we showed how the analytic expressions of the means of these RVs provide eventually the respective explicit expressions of three numbers studied recently by different methods. Finally, as a byproduct of our approach, we proposed a unified recursive scheme which clarifies further the interdependencies among these numbers. The examined numbers are potential useful in many engineering applications like the ones mentioned briefly in the Introduction. Early results are encouraging in this direction. REFERENCES  K. Sinha and B. P. Sinha, “On the distribution of runs of ones in binary strings,” Comput. Math. Appl., vol. 58, pp. 1816-1829, 2009.  K. Sinha and B. P. Sinha, “Energy-efficient communication: understanding the distribution of runs in binary strings,” 1st International Conference on Recent Advances in Information Technology (RAIT-2012), pp. 177-181, 2012.  G. Benson, “Tandem repeats finder: a program to analyze DNA sequences,” Nucleic Acids Res., vol. 27, pp. 573-580, 1999.  W.Y. W. Lou, “The exact distribution of the “-tuple statistic for sequence homoloy,” Statist. Probab. Lett., vol. 61, pp. 51-59, 2003.  G. Nuel, L. Regad, J. Martin and A.C. Camproux, “Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data,” Alg. Mol. Biol., vol. 5, pp. 1-18, 2010.  N. Balakrishnan and M. V. Koutras, Runs and Scans with Applications, New York: Wiley, 2002.  J. C. Fu and W. Y. W. Lou, Distribution Theory of Runs and Patterns and its Applications: A Finite Markov Imbedding Approach, New Jersey: World Scientific, 2003.  M. V. Koutras, “Applications of Markov chains to the distribution theory of runs and patterns,” in Handbook of Statistics, vol. 21, D. N. Shanbhag, C. R. Rao, Eds. North Holland: Elsevier, 2003, pp. 431-472.  S. Eryilmaz, “Success runs in a sequence of exchangeable binary trials,” J. Statist. Plann. Inference, vol. 137, pp. 2954-2963, 2007.  F. S. Makri, A.N. Philippou and Z. M. Psillakis, “Polya, inverse Polya, and circular Polya distributions of order  for -overlapping success runs,” Commun. Statist. Theory Methods, vol. 36, pp. 657-668, 2007.  F. S. Makri, A. N. Philippou and Z. M. Psillakis, “Success run statistics defined on an urn model,” Adv. Appl. Probab., vol. 39, pp. 991-1019, 2007.  S. Eryilmaz, “Run statistics defined on the multicolor urn model,” J. Appl. Probab., vol. 45, pp. 1007-1023, 2008.  S. Demir and S. Eryilmaz, “Run statistics in a sequence of arbitrarily dependent binary trials,” Stat. Papers, vol. 51, pp. 959-973, 2010.  K. Inoue and S. Aki, “On the conditional and unconditional distributions of the number of success runs on a circle with applications,” Statist. Probab. Lett., vol. 80, pp. 874-885, 2010. 46Copyright © 2012 SciRes. F. S. Makri, “On occurences of  strings in linearly and circularly ordered binary sequences,” J. Appl. Probab., vol. 47, pp. 157-178, 2010.  F. S. Makri, “Minimum and maximum distances between failures in binary sequences,” Statist. Probab. Lett., vol. 81, pp. 402-410, 2011.  F. S. Makri and Z. M. Psillakis, “On runs of length exceeding a threshold: normal approximation,” Stat. Papers, vol. 52, pp. 531-551, 2011.  F. S. Makri and Z. M. Psillakis, “On success runs of length exceeded a threshold,” Methodol. Comput. Appl. Probab., vol. 13, pp. 269-305, 2011.  F. S. Makri and Z. M. Psillakis, “On success runs of a fixed length in Bernoull sequences: exact and aymptotic results,” Comput. Math. Appl., vol. 61, pp. 761-772, 2011.  S. D. Dafnis, A. N. Philippou and D. L. Anztoulakos, “Distributions of patterns of two successes separeted by a string of  failures,” Stat. Papers, vol. 53, pp. 323-344, 2012.  M. V. Koutras and F. S. Milienos, “Exact and asymptotic results for pattern waiting times,” J. Statist. Plann. Inference, vol. 142, pp. 1464-1479, 2012.  F. S. Makri and Z. M. Psillakis, “Counting certain binary strings,” J. Statist. Plann. Inference, vol. 142, pp. 908-924, 2012.  A. M. Mood, “The distribution theory of runs,” Ann. Math. Stat., vol. 11, pp. 367-392, 1940.  J. C. Fu and M. V. Koutras, “Distribution theory of runs: a Markov chain approach,” J. Amer. Statist. Assoc., vol. 89, pp. 1050-1058, 1994.  D. L. Antzoulakos, “On waiting time problems associated with runs in Markov dependent trials,” Ann. Inst. Statist. Math., vol. 51, pp. 323-330, 1999.  Q. Han and S. Aki, “Joint distributions of runs in a sequence of multi-trials,” Ann. Inst. Statist. Math., vol. 51, pp. 419-447, 1999.  J. C. Fu, W. Y. W. Lou, Z. Bai and G. Li, “The exact and limiting distributions for the number of successes in success runs within a sequence of Markov-dependent two-state trials,” Ann. Inst. Statist. Math., vol. 54, pp. 719-730, 2002.  K. Sen, M. L. Agarwal and S. Chakraborty, “Lenths of runs and waiting time distributions by using Polya-Eggenberger sampling scheme,” Studia Sci. Math. Hungar., vol. 2, pp. 309-332, 2002.  D. L. Antzoulakos, S. Bersimis and M. V. Koutras, “On the distribution of the total number of run lengths,” Ann. Inst. Statist. Math., vol. 55, pp. 865-884, 2003.  D. E. Martin, “Distribution of the number of successes in success runs of length at least  in higher-order Markovian sequences,” Methodol. Comput. Appl. Probab., vol. 7, pp. 543-554, 2005.  K. Sinha, Location and communication issues in mobile networks. Ph. D. Dissertaion, Department of Computer Science and Engineering, Jadavpur University, Calcutta, India , 2007. Copyright © 2012 SciRes.47