Open Journal of Statistics, 2012, 2, 408-414
http://dx.doi.org/10.4236/ojs.2012.24049 Published Online October 2012 (http://www.SciRP.org/journal/ojs)
Regularities in Sequences of Observations
Mahkame Megan Khoshyaran
Economics Traffic Clinic (ETC), Paris, France
Email: megan.khoshyaran@wanadoo.fr
Received July 18, 2012; revised August 20, 2012; accepted September 2, 2012
ABSTRACT
The objective of this paper is to propose an adjustment to three methods of calculating the probability that regularities in sample data represent a systemic influence in the population data. The proposed method is called data profiling. It consists of calculating vertical and horizontal correlation coefficients in the sample data. The two correlation coefficients indicate the internal dynamic, or interdependency, among observation points, and thus add new information. This information is incorporated into the established methods; the consequence of this integration is that one can conclude with certainty that the calculated probability is indeed a valid indication of systemic influence in the population data.
Keywords: Systematic Influence; Theory of Large Samples; Analysis of the Variance Principle; Multiple Regression;
Data Profiling; Vertical Correlation Coefficient; Horizontal Correlation Coefficient
1. Introduction
Suppose that in a sequence of observations one observes
a striking regularity; for example suppose that the values
arrange themselves in an increasing or decreasing order
of magnitude, or a maximum or a minimum is indicated.
Many questions arise. Is the observed regularity a general phenomenon, or is it true only of the particular sequence sampled? Is it due to the particular sequence sampled, or to sampling from a random sequence? In other words, in recurrent sampling, is it reasonable to believe that approximately the same general results will occur? Is it the manner of sampling that creates artificial regularities? The occurrence of regularity in a data set that results from random sampling is highly improbable; thus regularity in sample data is a justification for regarding the regularity as truly representative of the population data. The assumption is that unless the probability of random occurrence is small, there is no objective proof that an actual regularity exists in the population data.
To explore regularities in random sample data sets
many researchers have made significant contributions,
[1-6]. For example, assuming that the sequence of indi-
vidual numerical values is available, they have applied
various tests based on characteristics of a random se-
quence. For example, they concluded that the number of
maxima in a sequence of unrelated numbers is one-third
of the number of data points. The deviation of any se-
quence of data in any characteristic from what is as-
sumed for a random sample of sequences implies that
there is a systematic influence, the extent of which de-
pends on the magnitude of deviation and the number of
data points in a sample. In general, random sampling of
data is not a sufficient criterion for proving systematic
influence. It is shown that unless there are a large number
of data points, the proof of the existence of systematic
influence remains unresolved.
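The one-third rule quoted above can be checked numerically. The following is a minimal sketch, not taken from the cited works: it assumes independent uniform draws from Python's `random` module, counts strict interior local maxima, and compares a random sequence with a strictly increasing one. The function name `fraction_of_maxima` and the sample size are illustrative choices.

```python
import random


def fraction_of_maxima(seq):
    """Fraction of interior points that are strict local maxima."""
    interior = range(1, len(seq) - 1)
    count = sum(1 for i in interior if seq[i - 1] < seq[i] > seq[i + 1])
    return count / len(interior)


rng = random.Random(0)  # fixed seed so the run is repeatable
random_seq = [rng.random() for _ in range(100_000)]  # unrelated numbers
trend_seq = list(range(100_000))  # a sequence with an obvious regularity

print(round(fraction_of_maxima(random_seq), 3))
print(fraction_of_maxima(trend_seq))
```

For the random sequence the fraction should be close to one-third, while the monotone sequence has no interior maxima at all; the size of the deviation from one-third is what the tests described above use as evidence of a systematic influence.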
Up to now, the attempts to determine the probability of
getting a short sequence of terms having a strict appear-
ance of regularity have proven to be rather misleading.
Given the uncertainties researchers have modified the
analysis of regularities in random samples. In the new
approach a sequence of averages of groups of individual
observations is obtained in a systematic way. For exam-
ple, random samples are drawn any number of times. The
averages of each sequence of data sample are calculated.
These averages form a composite sequence that can be
used in testing the systematic influence in samples. The
statistical significance of such a sequence of averages can be determined by comparing the variance of the individual observations in a random sample computed directly with that calculated from the variances of the averages, [7-14]. This analysis of variance principle can be applied to a general case where the values of the dependent variable are related to each of a number of correlated independent variables. This is a problem of multiple curvilinear correlation, where a sequence of averages of the dependent variable is computed with respect to each independent variable and corrected to constant values of the other independent variables, [15-17].
A method of testing the statistical significance of each sequence of averages, as well as the composite significance of all of the sequences, is derived.
Copyright © 2012 SciRes. OJS
Although the use of a sequence of averages is a logical approach, this method is highly uncertain and in some cases inapplicable. The analysis of variance principle and multiple curvilinear correlation rest on solid logic, but they provide only approximate indications of any systematic influence. The main shortcoming of these models is that they do not detect the source of variability in a data set. The focus of the three models is on the variability within, and the correlation among, averages in a sample. To address this shortcoming of the three approaches, a modification to these models is proposed. The modification consists of detecting the correlation among individual observations both within and across groups in a data sample, or in other words, data profiling. This aim is achieved by calculating vertical ($\rho_v$) and horizontal ($\rho_h$) correlation coefficients and incorporating them in the calculations. The precise definitions of these variables are given, and the manner in which they are integrated into the three models is demonstrated, in the following sections.
2. The Approach Based on the Theory of
Large Samples
Commonly, the values of sequences in a data sample are
averages of measurements or numbers grouped in some
systematic fashion. There are many readily available
methods that calculate the probability of such systematic
grouping of data. These methods are based on calculating
the variability between the averages and within the
groups. These methods are extended to special cases
where the regularities of a sequence are periodic, [7-14].
The method based on the theory of large samples developed by [9] consists of computing the standard deviation of the group means multiplied by the square root of the number of observations per group, $\sigma_n\sqrt{n}$, and the standard deviation of the entire series in a data sample, $\sigma_o$. Let $m$ = number of columns or groups, $n$ = number of entries per column, $\bar{y}_s$ = a group mean, and $\bar{y}$ = the grand mean; then the group standard deviation is given by the following:

$$\sigma_n = \sqrt{\frac{\sum (\bar{y}_s - \bar{y})^2}{m}} \tag{1}$$

The sample standard deviation is calculated using the following equation:

$$\sigma_o = \sqrt{\frac{\sum (y_s - \bar{y})^2}{mn}} \tag{2}$$

where $y_s$ denotes an individual observation in column $s$ and the sum runs over all $mn$ observations.
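It is assumed that the difference in value between $\sigma_n\sqrt{n}$ and $\sigma_o$ should be less than the sampling error. If this principle holds, then Cox employed the criterion of significance. If the error of the standard deviation $\sigma_o$ is small, then the standard error of the ratio $\left(\frac{\sigma_n\sqrt{n}}{\sigma_o}\right)$ should be proportional to $\sigma_n\sqrt{n}$. Cox assumes that if there are systematic influences, then the expression $\left(\frac{\sigma_n\sqrt{n}}{\sigma_o} - 1\right)$ should on the average equal zero given the theory of large samples, with a standard error that depends on $\sigma_n$, $\sigma_o$, $n$ and $m$.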
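The quantities in Equations (1) and (2) and the ratio examined by Cox can be computed as in the following minimal sketch. It assumes equally sized groups and population-style divisors ($m$ and $mn$) as in the equations above; the function name `cox_ratio` and the toy data are illustrative and do not come from [9].

```python
import math


def cox_ratio(columns):
    """Return sigma_n * sqrt(n) / sigma_o for m groups of n observations each."""
    m = len(columns)
    n = len(columns[0])
    group_means = [sum(col) / n for col in columns]
    grand_mean = sum(sum(col) for col in columns) / (m * n)
    # Equation (1): standard deviation of the group means about the grand mean
    sigma_n = math.sqrt(sum((g - grand_mean) ** 2 for g in group_means) / m)
    # Equation (2): standard deviation of all individual observations
    sigma_o = math.sqrt(
        sum((y - grand_mean) ** 2 for col in columns for y in col) / (m * n)
    )
    return sigma_n * math.sqrt(n) / sigma_o


strong_group_effect = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
no_group_effect = [[1, 3], [3, 1], [2, 2]]
print(cox_ratio(strong_group_effect))
print(cox_ratio(no_group_effect))
```

In the second data set all group means coincide, so the ratio is zero; in the first the group means differ far more than the observations within a group, and the ratio exceeds one.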
Practice has shown that this method, which is based on the theory of large samples, is often inapplicable. One way to circumvent this problem is to introduce vertical and horizontal correlation coefficients. Correlation coefficients show the variations between observations and across groups. A minor change of notation is introduced. The $\bar{y}_s$ that represented the group mean is modified to $\bar{y}_j$, $j = 1, \dots, m$, to reflect the mean by group of the observations, or vertical means. Horizontal means $\bar{y}_i$, $i = 1, \dots, n$, reflect observation means. Each observation is represented by $y_{ij}$, $i = 1, \dots, n$; $j = 1, \dots, m$. The
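vertical ($\rho_v$) and horizontal ($\rho_h$) correlation coefficients are calculated using the following formulas:

$$\rho_v = \frac{\sum_{i=1}^{n-1} (y_{ij} - \bar{y}_j)(y_{(i+1)j} - \bar{y}_j)}{\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2}, \quad \text{for } j = 1, \dots, m \tag{3}$$

and

$$\rho_h = \frac{\sum_{j=1}^{m-1} (y_{ij} - \bar{y}_i)(y_{i(j+1)} - \bar{y}_i)}{\sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2}, \quad \text{for } i = 1, \dots, n \tag{4}$$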
If the ratio $\frac{\max \rho_v}{\max \rho_h}$ is equal to one, then the indication is that each observation is related to the other, both within each column and across columns; in other words, there is evidence of systematic influence or systematic regularity. On the other hand, if the ratio $\left(\frac{\max \rho_v}{\max \rho_h}\right)$ is either less than one or greater than one, then the evidence points to the contrary, which translates into the lack of any systematic influence. Thus, in general, if the ratio $\left(\frac{\sigma_n\sqrt{n}}{\sigma_o}\right)$ is equal to $\left(\frac{\max \rho_v}{\max \rho_h}\right)$ and both are equal to one, then there is absolute certainty that the population data exhibits systematic influence. In the reverse case, where the ratio $\frac{\sigma_n\sqrt{n}}{\sigma_o}$ is equal to $\left(\frac{\max \rho_v}{\max \rho_h}\right)$ but is either less than or greater than one, there is no systematic influence in the population data.
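The ratio test described above can be sketched in code. This is a minimal sketch under a stated assumption: Equations (3) and (4) are read as lag-one serial correlation coefficients within each column (vertical) and within each row (horizontal). That reading is an interpretation of the formulas and is not claimed to reproduce the coefficients reported in the example of Section 5. The names `lag1_coefficient`, `profile` and `profiling_ratio` are hypothetical.

```python
def lag1_coefficient(values):
    """Lag-one serial correlation of a sequence about its own mean."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0  # constant sequence: coefficient undefined, reported as 0 here
    num = sum(dev[i] * dev[i + 1] for i in range(len(dev) - 1))
    return num / denom


def profile(table):
    """table[i][j] is observation i (row) in group j (column)."""
    n, m = len(table), len(table[0])
    rho_v = [lag1_coefficient([table[i][j] for i in range(n)]) for j in range(m)]
    rho_h = [lag1_coefficient(table[i]) for i in range(n)]
    return rho_v, rho_h


def profiling_ratio(table):
    """max(rho_v) / max(rho_h); raises ZeroDivisionError if max(rho_h) is zero."""
    rho_v, rho_h = profile(table)
    return max(rho_v) / max(rho_h)


# Toy table with the same linear pattern down every column and along every row.
table = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]]
print(profile(table))
print(profiling_ratio(table))
```

In this toy table the vertical and horizontal coefficients coincide, so the ratio equals one, which is the case the text interprets as evidence of systematic influence.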
The approach based on the theory of large samples looks at the sample data from the macroscopic level, meaning sample averages and sample standard deviations. Data profiling explores the data set from the microscopic level, meaning the vertical and the horizontal correlation coefficients. The data profiling method adds new information, which allows for an efficient and accurate detection of systemic influence. To state this formally, let $(L, A, P, R)$ be the space of almost surely random sets, where $(L)$ is the set of all random sets, and $(A)$ is a subset with σ-algebra. It can be stated that the sample data $y_{ij}$, $i = 1, \dots, n$; $j = 1, \dots, m$, with $y_{ij} \in L^0$, exhibits systemic influence if and only if the probability

$$P\left(\frac{\sigma_n\sqrt{n}}{\sigma_o} = \frac{\max \rho_v}{\max \rho_h};\ \frac{\sigma_n\sqrt{n}}{\sigma_o} \in L^0 \text{ and } \frac{\max \rho_v}{\max \rho_h} \in L^0\right)$$

exists and is equal to 1, or

$$P\left(\left|\frac{\max \rho_v}{\max \rho_h} - \frac{\sigma_n\sqrt{n}}{\sigma_o}\right| > t\right) \to 0, \quad t > 0,$$

where $(t)$ is some constant. Data profiling assigns to the space $(L^0)$ a metric $(d)$ which is associated with the probability of convergence. Let

$$d\left(\frac{\max \rho_v}{\max \rho_h}, \frac{\sigma_n\sqrt{n}}{\sigma_o}\right) = E\left[\frac{\left|\frac{\max \rho_v}{\max \rho_h} - \frac{\sigma_n\sqrt{n}}{\sigma_o}\right|}{1 + \left|\frac{\max \rho_v}{\max \rho_h} - \frac{\sigma_n\sqrt{n}}{\sigma_o}\right|}\right];$$

then it is easy to notice that $(d)$ represents a distance in $L^0$, and is invariant under any transformation (no matter which subset of random sets is used). If $\left(\frac{\sigma_n\sqrt{n}}{\sigma_o}\right)$ and $\frac{\max \rho_v}{\max \rho_h}$ are true representations of data at the two levels (macroscopic and microscopic respectively), then one would expect

$$d\left(\frac{\max \rho_v}{\max \rho_h}, \frac{\sigma_n\sqrt{n}}{\sigma_o}\right) \to 0 \quad \text{as } n \to \infty$$

if and only if

$$P\left(\left|\frac{\max \rho_v}{\max \rho_h} - \frac{\sigma_n\sqrt{n}}{\sigma_o}\right| > t\right) \to 0 \quad \text{for all } t > 0.$$

In fact the convergence of $(d)$ to zero causes the convergence of the probability. This is due to the Bienaymé-Tchebychev or Markov inequality and the fact that the function $t \mapsto \frac{t}{1+t}$ is increasing.
Inversely, write $\Delta = \left|\frac{\max \rho_v}{\max \rho_h} - \frac{\sigma_n\sqrt{n}}{\sigma_o}\right|$ and suppose that

$$P(\Delta > t) \to 0$$

holds. Then for any $(\varepsilon > 0)$ let $t > 0$ be such that $\frac{t}{1+t} \le \frac{\varepsilon}{2}$, and let $n_0$ be such that

$$P(\Delta > t) \le \frac{\varepsilon}{2} \quad \text{for } n \ge n_0;$$

then for $n \ge n_0$,

$$d\left(\frac{\max \rho_v}{\max \rho_h}, \frac{\sigma_n\sqrt{n}}{\sigma_o}\right) = \int_{\Delta \le t} \frac{\Delta}{1 + \Delta}\, dP + \int_{\Delta > t} \frac{\Delta}{1 + \Delta}\, dP \le \frac{t}{1+t} + P(\Delta > t) \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.$$

The conclusion is that $\frac{\max \rho_v}{\max \rho_h}$ assures almost surely the detection of systemic influence in a data set.
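The bound used in the argument above, $d \le \frac{t}{1+t} + P(\Delta > t)$, can be illustrated on simulated values. The sketch below is illustrative only: it replaces the two ratios by paired samples, estimates the expectation and the tail probability by sample averages, and uses Gaussian noise of shrinking scale as a stand-in for convergence. None of the names or the simulated data come from the paper.

```python
import random


def gaps(xs, ys):
    """Absolute differences between paired values."""
    return [abs(x - y) for x, y in zip(xs, ys)]


def d_metric(xs, ys):
    """Sample estimate of E[|X - Y| / (1 + |X - Y|)]."""
    g = gaps(xs, ys)
    return sum(v / (1 + v) for v in g) / len(g)


def tail_probability(xs, ys, t):
    """Sample estimate of P(|X - Y| > t)."""
    g = gaps(xs, ys)
    return sum(1 for v in g if v > t) / len(g)


def upper_bound(xs, ys, t):
    """Right-hand side of the bound: t / (1 + t) + P(|X - Y| > t)."""
    return t / (1 + t) + tail_probability(xs, ys, t)


rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
target = [1.0] * len(noise)  # stands in for a ratio equal to one


def perturbed(scale):
    """Values that approach the target as the noise scale shrinks."""
    return [1.0 + scale * e for e in noise]


for scale in (1.0, 0.1, 0.01):
    xs = perturbed(scale)
    print(scale, d_metric(xs, target), upper_bound(xs, target, t=0.05))
```

As the noise scale shrinks, both the metric and the tail probability fall, and the metric never exceeds the bound, mirroring the two directions of the equivalence stated above.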
3. The Approach Based on the Method of
Analysis of Variance
This method finds the probability that any variation between averages is purely random, [18]. An outline of the procedure follows:
$n_s$ = number of entries in column $(s)$
$N = \sum n_s$ is the total number of entries
$a$ = a reasonable estimate of $\bar{y}$
$h = \bar{y} - a$
The mean variance between columns is calculated:

$$V_s = \frac{\sum_{s=1}^{m} n_s(\bar{y}_s - a)^2 - Nh^2}{m - 1}, \quad n_1 = m - 1 \tag{5}$$

The residual variance is calculated using the formula:

$$V_r = \frac{\sum (y - a)^2 - \sum_{s=1}^{m} n_s(\bar{y}_s - a)^2}{N - m}, \quad n_2 = N - m \tag{6}$$
Let $Z = \log_e\sqrt{\frac{V_s}{V_r}}$; then the probability of no systematic influence is found from tables, [18-20], given $(Z)$ and the degrees of freedom $n_1$ and $n_2$. The method of analysis of variance looks into the variability between column means and the variability of individual observations from the corresponding mean within each column. This method has a shortcoming in that it does not look at the corresponding correlations between individual observations in each column and across groups. Data profiling allows for a better analysis and detection of internal or systematic variability. To account for data profiling, the formula for $(Z)$ should be modified in the following way:

$$Z = \log_e\sqrt{\frac{V_s}{V_r}} + \log_e\left(\frac{\max \rho_v}{\max \rho_h}\right).$$

The addition of the log of the ratio of vertical and horizontal correlation coefficients has one major effect: it either augments the value of $(Z)$, which lowers the probability that the variation is purely random and so raises the probability of systematic influence, or lowers the value of $(Z)$, which has the opposite effect.
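Equations (5) and (6) can be sketched as follows. This is a minimal illustration under two assumptions: the statistic is taken as Fisher's $z = \frac{1}{2}\log_e(V_s/V_r)$, and the working value $a$ is an arbitrary estimate of the grand mean, which should not affect the result. The function name `anova_z` and the toy data are hypothetical.

```python
import math


def anova_z(columns, a=0.0):
    """Return (V_s, V_r, Z) using a working mean `a` as in Equations (5) and (6)."""
    m = len(columns)
    N = sum(len(col) for col in columns)
    grand_mean = sum(sum(col) for col in columns) / N
    h = grand_mean - a
    # Sum over columns of n_s * (column mean - a)^2
    between_about_a = sum(
        len(col) * (sum(col) / len(col) - a) ** 2 for col in columns
    )
    # Sum over all observations of (y - a)^2
    total_about_a = sum((y - a) ** 2 for col in columns for y in col)
    v_s = (between_about_a - N * h * h) / (m - 1)  # Equation (5), n_1 = m - 1
    v_r = (total_about_a - between_about_a) / (N - m)  # Equation (6), n_2 = N - m
    z = 0.5 * math.log(v_s / v_r)  # assumed form of the Z statistic
    return v_s, v_r, z


columns = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(anova_z(columns))
print(anova_z(columns, a=5.0))
```

Both calls should return the same values, since the correction term $Nh^2$ removes the dependence on the working mean.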
4. The Approach Based on Multiple
Regression
Up to this point, we have been dealing with one independent variable only. McEwen [17] generalizes the method of analysis of variance to many independent variables which may be mutually correlated. In other words, the group averages are given as a multiple regression of $(K)$ independent variables. He thus modifies the mean variance between columns $V_s$, and the residual variance $V_r$, using the multiple regression method. The outline of the procedure is as follows:
$M$ = total number of columns (groups) to be averaged with respect to all the independent variables
$K$ = number of independent variables
${}_k y'$ = value of an observation corrected with respect to all except the $k$th independent variable
${}_1\bar{y}_s, \dots, {}_K\bar{y}_s$ = corrected group averages of the dependent variable with respect to independent variables $1, \dots, K$
${}_1\bar{y}, \dots, {}_K\bar{y}$ = weighted averages of ${}_1\bar{y}_s, \dots, {}_K\bar{y}_s$
The overall variance between columns is calculated:

$$V_s = \frac{\sum_{s=1}^{m_1} n_s({}_1\bar{y}_s - {}_1\bar{y})^2 + \sum_{s=1}^{m_2} n_s({}_2\bar{y}_s - {}_2\bar{y})^2 + \dots + \sum_{s=1}^{m_K} n_s({}_K\bar{y}_s - {}_K\bar{y})^2}{M - K}, \quad n_1 = M - K \tag{7}$$

The residual variance is calculated using the formula:

$$V_r = \frac{\sum ({}_k y' - {}_k\bar{y}_s)^2}{N + K - M - 1}, \quad n_2 = N + K - M - 1 \tag{8}$$
The probability that the data are random is obtained as before from $Z = \log_e\sqrt{\frac{V_s}{V_r}}$ and the degrees of freedom $n_1$ and $n_2$; the probability of systematic influence is thus $1 - P\left(Z = \log_e\sqrt{\frac{V_s}{V_r}}\right)$. The shortcoming of the generalized method of the analysis of variance is that although it tries to look more closely at individual data sets, it does not look at the strength of the relationship between individual data points. Data profiling in this case allows for adjusting for this shortcoming. The vertical and horizontal correlation coefficients, $\rho_v$, $\rho_h$, are modified to adjust to the $(K)$ independent variables. Let $(\rho_v^1, \dots, \rho_v^K)$ be the vertical correlation coefficients calculated for the $K$ independent variables, and $(\rho_h^1, \dots, \rho_h^K)$ be the horizontal correlation coefficients calculated for the $K$ independent variables. New correlation coefficients are introduced: $(r_v^1, \dots, r_v^K)$ represent residual vertical correlation coefficients of $({}_k y')$ adjusted observations, and $(r_h^1, \dots, r_h^K)$ represent residual horizontal correlation coefficients of $({}_k y')$ adjusted observations. For each independent variable $(k)$, the vertical and horizontal correlation coefficients, $\rho_v^k$, $\rho_h^k$, are calculated as before:

$$\rho_v^k = \frac{\sum_{i=1}^{n-1} (y_{ij}^k - \bar{y}_j^k)(y_{(i+1)j}^k - \bar{y}_j^k)}{\sum_{i=1}^{n} (y_{ij}^k - \bar{y}_j^k)^2}, \quad \text{for } j = 1, \dots, m; \text{ and } k = 1, \dots, K \tag{9}$$

and

$$\rho_h^k = \frac{\sum_{j=1}^{m-1} (y_{ij}^k - \bar{y}_i^k)(y_{i(j+1)}^k - \bar{y}_i^k)}{\sum_{j=1}^{m} (y_{ij}^k - \bar{y}_i^k)^2}, \quad \text{for } i = 1, \dots, n \text{ and } k = 1, \dots, K \tag{10}$$
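The overall variance between columns is then modified as follows:

$$V_s = \frac{\sum_{s=1}^{m_1} n_s({}_1\bar{y}_s - {}_1\bar{y})^2 \left(\frac{\max \rho_v^1}{\max \rho_h^1}\right) + \sum_{s=1}^{m_2} n_s({}_2\bar{y}_s - {}_2\bar{y})^2 \left(\frac{\max \rho_v^2}{\max \rho_h^2}\right) + \dots + \sum_{s=1}^{m_K} n_s({}_K\bar{y}_s - {}_K\bar{y})^2 \left(\frac{\max \rho_v^K}{\max \rho_h^K}\right)}{M - K}, \quad n_1 = M - K \tag{11}$$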
In order to modify the residual variance, residual vertical and horizontal correlation coefficients are calculated using $({}_k y')$, the value of an observation which is corrected to constant values of all the rest except the $k$th independent variable given in McEwen's generalized method of the analysis of variance, [17]. The residual vertical correlation coefficients are calculated:

$$r_v^k = \frac{\sum_{i=1}^{n-1} ({}_k y'_{ij} - \bar{y}_j^k)({}_k y'_{(i+1)j} - \bar{y}_j^k)}{\sum_{i=1}^{n} ({}_k y'_{ij} - \bar{y}_j^k)^2}, \quad \text{for } j = 1, \dots, m; \text{ and } k = 1, \dots, K \tag{12}$$

and the residual horizontal correlation coefficients are given by:

$$r_h^k = \frac{\sum_{j=1}^{m-1} ({}_k y'_{ij} - \bar{y}_i^k)({}_k y'_{i(j+1)} - \bar{y}_i^k)}{\sum_{j=1}^{m} ({}_k y'_{ij} - \bar{y}_i^k)^2}, \quad \text{for } i = 1, \dots, n \text{ and } k = 1, \dots, K \tag{13}$$

The residual variance is modified as:

$$V_r = \frac{\sum ({}_k y' - {}_k\bar{y}_s)^2 \left(\frac{\max r_v^k}{\max r_h^k}\right)}{N + K - M - 1} \tag{14}$$

The probability that the data exhibit systematic influence is obtained using $Z = \log_e\sqrt{\frac{V_s}{V_r}}$ and the degrees of freedom $(n_1)$ and $(n_2)$, as already explained.
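The weighting in Equation (11) can be sketched as below. This is a minimal sketch under assumptions: the corrected group averages for each independent variable are supplied as inputs (the successive-approximation procedure of [17] is not reproduced), the weighted average uses the group sizes $n_s$ as weights, and the coefficient lists are given. The function name and the toy inputs are hypothetical.

```python
def weighted_between_variance(groups_by_variable, rho_v, rho_h):
    """Between-column variance with each variable's sum of squares scaled by
    max(rho_v[k]) / max(rho_h[k]).

    groups_by_variable[k]: list of (n_s, corrected_group_average) for variable k
    rho_v[k], rho_h[k]:    coefficient lists for variable k
    """
    K = len(groups_by_variable)
    M = sum(len(groups) for groups in groups_by_variable)
    total = 0.0
    for groups, rv, rh in zip(groups_by_variable, rho_v, rho_h):
        weight_sum = sum(n for n, _ in groups)
        # Weighted average of the corrected group averages (weights n_s assumed)
        mean_k = sum(n * y for n, y in groups) / weight_sum
        ss_k = sum(n * (y - mean_k) ** 2 for n, y in groups)
        total += ss_k * (max(rv) / max(rh))
    return total / (M - K)  # degrees of freedom n_1 = M - K


groups = [
    [(2, 1.0), (2, 3.0)],            # variable 1: two groups
    [(1, 0.0), (1, 2.0), (2, 4.0)],  # variable 2: three groups
]
unit = [[1.0], [1.0]]  # ratios equal to one leave Equation (7) unchanged
print(weighted_between_variance(groups, unit, unit))
print(weighted_between_variance(groups, [[0.2, 0.5], [0.8]], [[1.0, 0.4], [0.4]]))
```

With all ratios equal to one the function reduces to the unweighted form of Equation (7), which makes the effect of the correlation ratios easy to isolate.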
5. An Example: Sunspot Numbers
In this section the validity of the improvement in the form of data profiling is tested. For this purpose the data set used in [17] is revisited and the probability of the existence of systematic influences in the data is calculated, once with the analysis of variance method as demonstrated in [17], and once with the modified version. Consider the data corresponding to sunspot numbers arranged with respect to a trial cycle of length 11 years, for the period from 1749 to 1826. The sunspot numbers exceeding 99 are excluded. The data is shown in matrix form in Table 1.

Table 1. Sunspot numbers arranged with respect to a trial cycle of 11 years, 1749-1826.

Cycle year   1749-1759   1794-1804   1805-1815   1816-1826
 1           81          41          42          46
 2           83          21          28          41
 3           48          16          10          30
 4           48           6           8          24
 5           31           4           3          16
 6           12           7           0           7
 7           10          15           1           4
 8           10          34           5           2
 9           32          45          12           9
10           48          43          14          17
11           54          48          35          36

Each of the 11 cycle years (the rows of Table 1) is treated as a column in the sense of Section 3. The averages $\bar{y}_s$ are given:
$\bar{y}_s$ = (52.5, 43.2, 26.0, 21.5, 13.5, 6.5, 7.5, 12.7, 24.5, 30.5, 43.2)
The number of columns is $(m = 11)$. The number of observations in each column is $(n_s = 4)$. The number of observations of the dependent variable is $(N = 44)$. The overall average is $\bar{y} = 25.59$. The degrees of freedom $n_1$ and $n_2$ are respectively $n_1 = 11 - 1 = 10$ and $n_2 = 44 - 11 = 33$. The averages $\bar{y}_s$ decrease up to the 6th column, and increase from then on. To calculate the probability that the sample data is indicative of the population data, and thus that there are cyclic effects, the $(Z)$ statistic is calculated using the mean variance between the columns $(V_s)$ and the residual variance $(V_r)$: $V_s = 956.32$ and $V_r = 265.0$. The statistic is

$$Z = \log_e\sqrt{\frac{V_s}{V_r}} = \log_e\sqrt{\frac{956.32}{265.0}} = 0.6408.$$
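The figures above can be recomputed directly from Table 1. The sketch below assumes that each cycle year forms one group of four observations and that $Z$ is Fisher's $z = \frac{1}{2}\log_e(V_s/V_r)$; the function name `sunspot_anova` is illustrative. Because it works from unrounded means, small differences from the reported values are to be expected.

```python
import math

# Table 1: one row per cycle year (1..11), one entry per 11-year period.
TABLE_1 = [
    [81, 41, 42, 46],
    [83, 21, 28, 41],
    [48, 16, 10, 30],
    [48, 6, 8, 24],
    [31, 4, 3, 16],
    [12, 7, 0, 7],
    [10, 15, 1, 4],
    [10, 34, 5, 2],
    [32, 45, 12, 9],
    [48, 43, 14, 17],
    [54, 48, 35, 36],
]


def sunspot_anova(groups):
    """Return (group means, V_s, V_r, Z) for a one-way layout."""
    m = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    within = sum((y - mu) ** 2 for g, mu in zip(groups, means) for y in g)
    v_s = between / (m - 1)  # n_1 = m - 1 = 10
    v_r = within / (N - m)   # n_2 = N - m = 33
    return means, v_s, v_r, 0.5 * math.log(v_s / v_r)


means, v_s, v_r, z = sunspot_anova(TABLE_1)
print([round(mu, 1) for mu in means])
print(round(v_s, 2), round(v_r, 2), round(z, 4))
```

The recomputed variances and $Z$ should land close to the reported $V_s = 956.32$, $V_r = 265.0$ and $Z = 0.6408$.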
The values of $(Z)$ corresponding to the 20, 5, 1, and 0.1 percent points are respectively 0.19, 0.38, 0.54, and 0.71. Since $(Z = 0.64)$ is greater than 0.54, the probability of random effects is less than 0.01, which makes the probability of systematic influence greater than 0.99. Though the results seem to point in favor of systematic influence or the existence of cycles, the evidence is not conclusive. To find out if the sample obtained implies cyclic appearance of sunspots, the data profiling method is tested. The vertical and horizontal correlation coefficients are calculated given Equations (3) and (4). The vertical averages $\bar{y}_j$, $j = 1, 2, 3, 4$, are calculated as $\bar{y}_j$ = (41.5, 25.4, 14.3, 21.0). The two statistics $(\rho_v)$ and $(\rho_h)$ are calculated:
$\rho_v$ = (13.82, 5.40, 4.06, 1.41, 1.12, 1.85, 1.13, 3.85, 6.00, 12.22, 15.99)
$\rho_h$ = (3280.54, 54.19, 1214.77, 1322.41)
The maxima of $(\rho_v)$ and $(\rho_h)$ are calculated as well:
$\max \rho_v = 15.996053$
$\max \rho_h = 54.188689$
(The value 54.188689 corresponds to the second listed entry of $\rho_h$; the largest listed entry is 3280.54.)
The ratio $\left(\frac{\max \rho_v}{\max \rho_h}\right)$ is calculated as: $\frac{\max \rho_v}{\max \rho_h} = 0.2951917$.
The value $\log_e\left(\frac{\max \rho_v}{\max \rho_h}\right)$ is reported as 0.2586587. (This figure equals $\log_e(1 + 0.2951917)$; the natural logarithm of 0.2951917 itself is approximately $-1.2201$.) The modified value of the statistic $(Z)$ is then obtained by adding the two values,

$$Z = \log_e\sqrt{\frac{V_s}{V_r}} + \log_e\left(\frac{\max \rho_v}{\max \rho_h}\right),$$

which gives (0.6408 + 0.2587) = 0.8995. Since the value (0.8995) is higher than (0.71), it indicates that the probability that the population data is random is less than 0.001 (the 0.1 percent point), indicating with certainty that the number of sunspots is cyclic. The existence of systemic influence is indisputable. Applying the approach based on the method of large samples, the statistic $\frac{\sigma_n\sqrt{n}}{\sigma_o} = 8$ is obtained. There is a large discrepancy between this statistic and the adjustment proposed in Section 2, $\left(1 - \frac{\max \rho_v}{\max \rho_h}\right) = 0.71$. The statistic $\frac{\sigma_n\sqrt{n}}{\sigma_o}$ is thus inapplicable. The statistic $Z = \log_e\sqrt{\frac{V_s}{V_r}} = 0.6669$ calculated using the approach based on multiple regression is a slight improvement over the statistic obtained using the method of analysis of variance $(Z = 0.6408)$.
Using the data profiling method, the statistic $Z$ is corrected to $(Z = 1.0)$. As in the case of the analysis of variance method, it can be stated with absolute certainty that there is indeed a systemic influence in the sample data.
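The arithmetic of the adjusted statistic can be retraced with the reported figures. The sketch below is a check on the numbers only: it takes the reported maxima and the reported $Z = 0.6408$ as inputs, computes both $\log_e(r)$ and $\log_e(1 + r)$ for the ratio $r$, and looks up the quoted percent points. The helper name `smallest_percent_point_exceeded` is hypothetical.

```python
import math

# Percent point -> critical Z, as quoted in the text for n_1 = 10, n_2 = 33.
CRITICAL_Z = {20.0: 0.19, 5.0: 0.38, 1.0: 0.54, 0.1: 0.71}


def smallest_percent_point_exceeded(z):
    """Smallest tabulated percent point whose critical Z lies below z, or None."""
    exceeded = [pct for pct, crit in CRITICAL_Z.items() if z > crit]
    return min(exceeded) if exceeded else None


max_rho_v, max_rho_h = 15.996053, 54.188689  # reported maxima
ratio = max_rho_v / max_rho_h
log_of_ratio = math.log(ratio)              # natural log of the ratio itself
log_of_one_plus_ratio = math.log1p(ratio)   # natural log of (1 + ratio)

z_anova = 0.6408                            # reported analysis of variance value
z_modified = z_anova + log_of_one_plus_ratio

print(ratio, log_of_ratio, log_of_one_plus_ratio)
print(z_modified, smallest_percent_point_exceeded(z_modified))
```

The reported adjustment of 0.2587 is reproduced by `log_of_one_plus_ratio`, whereas `log_of_ratio` is negative; with the former, the adjusted statistic exceeds the 0.1 percent point as stated.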
6. Conclusion
The objective is to derive conclusions about the random-
ness of observations in a population given that the sam-
ple data set exhibits strict regularities. Three methods are
analyzed and their shortcomings are indicated. An im-
provement to the three methods is suggested and formu-
lated. The improvement comes in the form of data pro-
filing which in essence is the integration of vertical and
horizontal correlation coefficients in the equations. Through
a simple example, it is shown that data profiling is indeed a complement of the original formulation.
REFERENCES
[1] L. Besson, “On the Comparison of Meteorological Data with Results of Chance,” Journal of Monthly Weather Review, Vol. 48, 1920, pp. 89-94.
[2] H. W. Clough, “A Statistical Comparison of Meteorological Data with Data of Random Occurrence,” Journal of Monthly Weather Review, Vol. 49, No. 3, 1921, pp. 124-132. doi:10.1175/1520-0493(1921)49<124:ASCOMD>2.0.CO;2
[3] W. L. Crum, “A Measure of Dispersion for Ordered Series,” Journal of American Statistical Association Quarterly Publication, Vol. 17, 1921, pp. 969-975.
[4] E. W. Woolard, “On the Mean Variability in Random Series,” Journal of Monthly Weather Review, Vol. 53, No. 3, 1925, pp. 107-111. doi:10.1175/1520-0493(1925)53<107:OTMVIR>2.0.CO;2
[5] H. Working, “A Random Difference Series for Use in the
Analysis of Time Series,” Journal of American Statistical
Association Quarterly Publication, Vol. 24, 1934, pp. 11-
24. doi:10.1080/01621459.1934.10502683
[6] W. O. Kermack and A. G. McKendrick, “A Measure of
Dispersion for Ordered Series,” Journal of the Proceed-
ings of the Royal Society Edinburgh, Vol. 57, 1937, pp.
228-240.
[7] D. Alter, “A Group or Correlation Periodogram with Application to the Rainfall of the British Isles,” Journal of Monthly Weather Review, Vol. 55, No. 210, 1927, pp. 263-266. doi:10.1175/1520-0493(1927)55<263:AGOCPW>2.0.CO;2
[8] C. Chree, “Periodicities Solar and Meteorological,” Journal of the Royal Meteorological Society, Vol. 85, 1924, pp. 87-97.
[9] J. B. Cox, “Periodic Fluctuations of Rainfall in Hawaii,” Proceedings of the American Society of Civil Engineers, Vol. 87, 1924, pp. 461-491.
[10] E. L. Dodd, “The Probability Law for the Intensity of a Trial Period with Data Subject to the Gaussian Law,” Bulletin of the American Mathematical Society, Vol. 33, 1927, pp. 681-684. doi:10.1090/S0002-9904-1927-04451-2
[11] S. Kuznets, “Random Events and Cyclical Oscillations,”
Journal of the American Statistical Association, Vol. 24,
1929, pp. 258-275.
doi:10.1080/01621459.1929.10503048
[12] R. W. Powell, “Successive Integration as a Method of
Finding Long Period Cycles,” Annals of the Mathemati-
cal Statistics, Vol. 1, No. 2, 1930, pp. 123-136.
doi:10.1214/aoms/1177733127
[13] K. Stumpff, “Grundlagen und Methoden der Periodenforschung,” Springer, Berlin, 1925.
[14] G. T. Walker, “On Periodicity—Criteria for Reality,” Memorandum of the Royal Meteorological Society, Vol. 3, No. 25, 1930, pp. 97-101.
[15] C. F. McEwen and E. L. Michel, “The Functional Rela-
tion of One Variable to Each of a Number of Correlated
Variables Determined by a Method of Successive Ap-
proximations to Group Averages,” Proceedings of the
American Academy of Arts and Sciences, Vol. 55, No. 8,
1919, pp. 89-133.
[16] C. F. McEwen, “The Minimum Temperature, a Function
of the Dew Point and Humidity, at 5 p.m. of the Preced-
ing Day; Method of Determining This Function by Suc-
cessive Approximations to Group Averages,” Monthly
Weather Review Supplement, No. 16, 1920, pp. 64-69.
[17] C. F. McEwen, “The Reality of Regularities Indicated in
Sequences of Observations,” Proceedings of the Berkeley
Symposium on Mathematical Statistics and Probability,
San Francisco, 13-18 August 1945, pp. 229-238.
[18] R. A. Fisher, “Statistical Methods for Research Workers,”
4th Edition, Biological Monographs and Manuals, Lon-
don, 1932.
[19] G. U. Yule and M. G. Kendall, “An Introduction to the
Theory of Statistics” 11th Edition, Charles Griffin and
Company Ltd., London, 1937.
[20] R. A. Fisher and F. Yates, “Statistical Tables for Biologi-
cal, Agricultural, and Medical Research,” Oliver and Boyd,
London, 1938.