Journal of Software Engineering and Applications, 2013, 6, 34-41
doi:10.4236/jsea.2013.67B007 Published Online July 2013 (http://www.scirp.org/journal/jsea)
Copyright © 2013 SciRes. JSEA
Using Optimized Distributional Parameters as Inputs in a
Sequential Unsupervised and Supervised Modeling of
Sunspots Data
K. Mwitondi1, J. Bugrien2, K. Wang3
1Sheffield Hallam University, Department of Computing; 2Statistics Department, Benghazi University; 3Radio and Space Services,
Australian Bureau of Meteorology.
Email: k.mwitondi@shu.ac.uk, jamal.bugrien@ar.benghazi.edu.ly, K.Wang@bom.gov.au
Received June, 2013
ABSTRACT
Detecting naturally arising structures in data is central to knowledge extraction from data. In most applications, the main
challenge is in the choice of the appropriate model for exploring the data features. The choice is generally poorly
understood and any tentative choice may be too restrictive. Growing volumes of data, disparate data sources and modelling
techniques entail the need for model optimization via adaptability rather than comparability. We propose a novel
two-stage algorithm for modelling continuous data, consisting of an unsupervised stage whereby the algorithm searches
through the data for optimal parameter values and a supervised stage that adapts the parameters for predictive modelling.
The method is implemented on the sunspots data with inherently Gaussian distributional properties and assumed
bi-modality. Optimal values separating high from low cycles are obtained via multiple simulations. Early patterns for
each recorded cycle reveal that the first 3 years provide a sufficient basis for predicting the peak. Multiple Support
Vector Machine runs using repeatedly improved data parameters show that the approach yields greater accuracy and
reliability than conventional approaches and provides a good basis for model selection. Model reliability is established
via multiple simulations of this type.
Keywords: Clustering; Data Mining; Density Estimation; EM Algorithm; Sunspots; Supervised Modelling; Support
Vector Machines; Unsupervised Modelling
1. Introduction
Many real-life problems are tackled via knowledge extraction
from data, a process typically associated with
detecting naturally arising structures in the data. A typical
example is the sunspots dataset [11], an average
oscillating sequence of the beginning and ending periods
of solar cycles with an approximate periodicity of 11
years [7]. Recorded sunspots span from the first cycle
(March 1755 to June 1766) to the first few months of the
current (24th) cycle. Clustered in non-random positions
above and below the equator, the spots are generated by
interactions between the sun's surface plasma and its
magnetic field [19 and 22]. Solar magnetic activity
cycles have attracted the attention of scientists for many
years. Solar flares, for instance, affect our planet in different
ways, including ejecting plasma and energetic
particles and potentially causing geomagnetic storms and
damaging satellites [16]. The paper is motivated by the
documented effects of sunspots on terrestrial conditions.
Correlations between space and terrestrial weather have
been indicated in solar studies dating back many years
[13, 18 and 20]. Climatic variations in Lapland via complex
variations in the atmosphere, lunar gravitation and
solar activity have also been explained [12].
This paper subjects sunspots data to a sequential
analysis involving unsupervised and supervised
modeling. The two concepts represent the typical data
mining problems: data clustering and classification. The
primary goal of the former is to partition a given dataset
with a known or unknown distribution into subgroups in
such a way that data points in each group are as homogeneous
as possible while those in different groups are as
heterogeneous as possible. The method is typically applied
in problems in which there is no clear mathematical
formulation for describing the underlying structures.
Various approaches to data clustering have been studied
and are well-documented in the literature [21, 17 and 6].
However, determining the number of naturally arising
structures in data remains a daunting challenge for
the data science community. Many clustering tools in the
literature are based on the conventional mechanics of
minimizing the distances between data points, a
feature which inherently constitutes the same challenge
the methods are designed to address, that is, determining
the optimal number of clusters. The primary goal of
the latter is to allocate new cases to known classes, and
one of its main challenges is balancing model accuracy
and reliability.
Let a dataset of independent, identically distributed
random vectors $\{x_1, x_2, \ldots, x_{n-1}, x_n\} \in \mathbb{R}^d$
represent features of an underlying density function. The main features
of interest may include modes (local maxima), anti-modes
(local minima) and bumps, regions where the
second derivative is negative. In an exploratory setting,
the number and locations of these features are not known
a priori. Many real-life data take this form and, with large
volumes of data generated from different sources and
fed into different models, we are constantly faced
with the challenge of determining optimal stationary points.
The challenge is to address model complexity via adaptability
rather than comparability. In other words, we
seek to minimise inherent randomness in training and test
data via novel adaptive methods of data analysis [10 and 1].
This paper proposes a novel approach to detecting naturally
arising structures in data that searches for generalising
parameter levels and adapts them to supervised
modeling. Its main research problem is to develop an
algorithm for predicting future cycles given historical
solar activity data. We try to address this problem via the
following objectives.
1) To determine naturally arising structures in the data.
For simplicity, we shall be seeking to identify and separate
high from low solar activity cycles. This objective
constitutes the unsupervised stage of the algorithm.
2) To predict future cycles based on information in
previous cycles. This is the supervised stage.
3) To search for an optimal solution based on repeated
simulations at the unsupervised and supervised stages.
The paper is organised as follows. Section 1 provides
the introduction followed by methods in Section 2. Data
analyses and discussions are in Section 3 and concluding
remarks and potential new directions in Section 4.
2. Methods
Choosing a parametric form of the density to explore
features is generally poorly understood and any tentative
choice may be too restrictive. Under such circumstances,
non-parametric density estimation, e.g. the Kernel
Density Estimation (KDE) technique [21], which allows for
practical solutions to the classical problem of choosing
the level of smoothing (bandwidth), can be used
efficiently. For example, given the data points
$\{x_1, x_2, \ldots, x_{n-1}, x_n\} \in \mathbb{R}^d$, the KDE approach to clustering
defines clusters as regions of high density separated by
regions of no or low density. Its main idea is to first
compute a kernel density estimate, $\hat{f}_t(x)$, say, from the
data, with a Gaussian kernel and isotropic bandwidth
$t > 0$ controlling the amount of smoothing. In its simplest
form, KDE can be thought of as an alternative to
the histogram as it typically provides a smoother representation
of the data and, unlike the histogram, its appearance
does not depend on a choice of starting point.
The scenario represents a problem amenable to the multivariate
kernel function in Equation 1, where T is a
symmetric positive d by d bandwidth matrix defined as
the diagonal $T = \mathrm{diag}[t_1, t_2, \ldots, t_{n-1}, t_n]$ with a direct
effect on model complexity.
$$\hat{f}_T(x) = n^{-1} \sum_{i=1}^{n} K_t(x - x_i) \tag{1}$$
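The effect of the bandwidth on the number of detected modes can be illustrated with a minimal sketch of Equation 1 in one dimension. It assumes a Gaussian kernel in which t plays the role of the standard deviation, and it uses a small made-up sample rather than the sunspots data; the helper names `gaussian_kde` and `count_modes` are hypothetical.

```python
import math


def gaussian_kde(data, t):
    """Return x -> f_hat_t(x) for 1-D data, Gaussian kernel, bandwidth t > 0."""
    if t <= 0:
        raise ValueError("bandwidth t must be positive")
    n = len(data)
    # Assumption: K_t(u) = exp(-u^2 / (2 t^2)) / (t * sqrt(2 * pi))
    norm = 1.0 / (n * t * math.sqrt(2.0 * math.pi))

    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / t) ** 2) for xi in data)

    return f_hat


def count_modes(f_hat, lo, hi, steps=2000):
    """Count strict local maxima of f_hat on a regular grid over [lo, hi]."""
    grid = [lo + (hi - lo) * j / steps for j in range(steps + 1)]
    vals = [f_hat(x) for x in grid]
    return sum(1 for j in range(1, steps) if vals[j - 1] < vals[j] > vals[j + 1])


# Made-up sample with two well-separated groups (illustration only).
sample = [50.0, 52.0, 55.0, 58.0, 60.0, 115.0, 118.0, 120.0, 123.0, 125.0]

narrow = gaussian_kde(sample, t=8.0)   # light smoothing
wide = gaussian_kde(sample, t=60.0)    # heavy smoothing

modes_narrow = count_modes(narrow, 0.0, 200.0)
modes_wide = count_modes(wide, 0.0, 200.0)
print(modes_narrow, modes_wide)
```

On this toy sample the small bandwidth keeps the two groups apart as separate modes, while the large bandwidth merges them into a single mode, which is the sense in which the bandwidth controls the structures that are detected.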
Without loss of generality, consider a phenomenon
with a binary structure of, say, “highs” and “lows”. Depending
on the context, a number of models can be applied.
For instance, if we assume a Gaussian kernel, we
can define a parametric pattern of “lows” and “highs” in
the form of a normal mixture model and use the parameter
estimates $\Theta = \{\mu, \Sigma\}$ to track the dynamics of the
cycles. Further, if we assume that the probability of a
“high” followed by another “high” structure is Phh and
that of a “low” followed by a “low” structure is P11, we
can define a Hidden Markov Model (HMM) as in Table 1. In this
case, an HMM provides a formal foundation for linear
sequence labeling of data. Balancing accuracy and reliability
amounts to defining an appropriate way of labeling
data using the probabilities and interpreting the results
probabilistically. We could also define associations, the
corresponding scores and the underlying confidence.
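As a rough illustration of how the transition probabilities Phh and P11 might be estimated once a sequence has been labeled, the sketch below counts consecutive label pairs. The label sequence is made up for illustration, and the function name `transition_matrix` and the variables `P_hh` and `P_ll` (standing for Phh and P11) are hypothetical.

```python
def transition_matrix(labels):
    """Estimate P(next state | current state) from a sequence of 'H'/'L' labels."""
    counts = {"H": {"H": 0, "L": 0}, "L": {"H": 0, "L": 0}}
    # Count each consecutive (current, next) pair.
    for current, nxt in zip(labels, labels[1:]):
        counts[current][nxt] += 1
    matrix = {}
    for state, row in counts.items():
        total = sum(row.values())
        # Rows with no observed transitions are left at zero.
        matrix[state] = {k: (v / total if total else 0.0) for k, v in row.items()}
    return matrix


# Made-up label sequence (not derived from the sunspots data).
sequence = ["L", "H", "H", "H", "L", "L", "H", "L", "L", "L"]

P = transition_matrix(sequence)
P_hh = P["H"]["H"]  # probability of a high followed by a high
P_ll = P["L"]["L"]  # probability of a low followed by a low
print(P_hh, P_ll)
```

Each row of the estimated matrix sums to one, so the off-diagonal entries are 1 - Phh and 1 - P11, matching the layout of a two-state transition matrix.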
2.1. Data Description, Research Problem and
Objectives
We adopt the sunspots data [11], an average oscillating
sequence of the beginning and ending periods of solar
cycles forming the densities in Figure 1.
Table 1. A state transition matrix for a binary structure.
        HIGH       LOW
HIGH    Phh        1 - Phh
LOW     1 - P11    P11
Figure 1. Density representation of the cycles in Figure 1.
The densities in Figure 1 exhibit different numbers of
modes, a feature typically determined by the adopted
level of smoothing. By controlling the level of smoothing
via a kernel function of the form in Equation 1 or otherwise,
we are able to identify different structures in data.
Figure 2 presents a 2-D plot of the sunspots means and
standard deviations. The numbers in the plot represent
the indices for each of the last 23 cycles and the current
cycle (24th). Using a rule of thumb, we can identify the
high, moderate and low solar activity cycles, say. Following
[10 and 1], we can treat each cycle as a separate
density and then use their distributional behavior to explore
the underlying structures of the cycles.
2.2. Modelling Strategy
Conventional approaches to modeling sunspots include
data assimilation [8] and rotational solar dynamo-based
predictive models for short-term predictions [2 and
14]. The densities in Figure 1 exhibit typically bivariate
patterns and so we shall assume that the cycles form a
parametric pattern of “lows” and “highs” and define
$$\prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k f_k(S_i \mid \mu_k, \Sigma_k) = \prod_{i=1}^{N} f_{S_i \in k}(S_i \mid \mu_{S_i \in k}, \Sigma_{S_i \in k}) \tag{2}$$
where $S_i$ denotes the sunspots numbers, K is the number
of components, $f_{**}(\cdot)$ is a normal distribution, $\pi_k$ is
the prior probability of class membership and $S_i \in k$
are class allocations. Statistically, the high-peaked (more
than normal) and low-peaked (less than normal) cycles
imply high and low solar activities respectively, while
those skewed to the right imply few increases and frequent
decreases in solar activity and vice versa. Our
strategy involves two main levels: unsupervised and
supervised. At the former level, we examine the initial
and subsequent patterns of the cycles in order to separate
the “lows” from the “highs”. The maximum likelihood
estimates (MLEs) of the random finite mixture densities
are obtained and passed on to a predictive model at the
supervised level as outlined below.
Figure 2. Sunspot means and standard deviations.
The above algorithm adapts the EM converging features
described in [5 and 9]. Its form suits any supervised
modelling technique. In this paper it is implemented in
Support Vector Machines (SVM).
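For readers unfamiliar with the EM iterations of [5], the following is a generic sketch of EM for a two-component, one-dimensional normal mixture. It is not the authors' algorithm; the starting values, the stopping rule, the function names and the toy data are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, iters=200, tol=1e-9):
    """Fit a two-component 1-D normal mixture by EM; returns (pi, mu, sigma)."""
    lo, hi = min(data), max(data)
    spread = (hi - lo) / 4.0 or 1.0  # crude starting scale (assumption)
    pi = [0.5, 0.5]
    mu = [lo, hi]                    # start the components at the extremes
    sigma = [spread, spread]
    prev = -math.inf
    for _ in range(iters):
        # E-step: posterior membership probabilities z[i][k].
        z = []
        loglik = 0.0
        for x in data:
            w = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            s = sum(w)
            loglik += math.log(s)
            z.append([wk / s for wk in w])
        # M-step: weighted updates of priors, means and standard deviations.
        for k in range(2):
            nk = sum(zi[k] for zi in z)
            pi[k] = nk / len(data)
            mu[k] = sum(zi[k] * x for zi, x in zip(z, data)) / nk
            var = sum(zi[k] * (x - mu[k]) ** 2 for zi, x in zip(z, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)  # guard against collapse
        # Stop when the log-likelihood no longer improves noticeably.
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    return pi, mu, sigma


# Made-up sample with a "low" and a "high" group (illustration only).
sample = [50.0, 52.0, 55.0, 58.0, 60.0, 115.0, 118.0, 120.0, 123.0, 125.0]
pi_hat, mu_hat, sigma_hat = em_two_gaussians(sample)
print(pi_hat, mu_hat, sigma_hat)
```

The estimated priors, means and standard deviations returned here are the kind of distributional parameters that could then be passed on as inputs to a supervised model.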
2.3. Supervised Modeling of Labeled Data
We adopt Support Vector Machines (SVM), a kernel-based
discriminant function the mechanics of which
rely on supervised learning of the underlying discriminating
rules from the training data [5]. To put it in context,
let the “high” and “low” cycles in our modified set
$\{S_i, y_i : i = 1, \ldots, N\}$, $y_i \in \{-1, 1\}$ and $S_i \in \mathbb{R}^2$ be
separable by the hyper-plane H. Then the points lying on H
satisfy the equation $w \cdot S + a = 0$,
where w is normal to the hyper-plane, $|a| / \|w\|$ is the
perpendicular distance from H to the origin and $\|w\|$ is the
Euclidean norm of w. The points on the hyper-planes
above and below, $\{H_1, H_2\} \parallel H$, will satisfy the equations
$w \cdot S + a = \pm 1$ (both with normal w and distance to the
origin $|\pm 1 - a| / \|w\|$), which means that the gap
between $\{H_1, H_2\}$ is $2 / \|w\|$. We need to find hyper-planes maximising the
gap (minimising $\|w\|^2$) subject to $y_i (S_i \cdot w + a) - 1 \geq 0$. The
points on $\{H_1, H_2\}$ are the support vectors (supporters)
of the optimal location of the decision surface and
the hardest to classify. Intuitively, the allocation rule is
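$$\begin{cases} S_i \cdot w + a \geq +1 & \text{for } y_i = +1 \\ S_i \cdot w + a \leq -1 & \text{for } y_i = -1 \end{cases} \;\leftrightarrow\; y_i (S_i \cdot w + a) - 1 \geq 0 \quad \forall i \tag{3}$$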
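The allocation rule and the margin constraint can be checked numerically with a small sketch. The points, labels and the hyper-plane (w, a) below are hand-picked for illustration and are not claimed to be the optimal separating hyper-plane; all function names are hypothetical.

```python
import math


def decision_value(S, w, a):
    """Signed value S.w + a for a point S."""
    return sum(si * wi for si, wi in zip(S, w)) + a


def allocate(S, w, a):
    """Allocate S to class +1 or -1 according to the sign of S.w + a."""
    return 1 if decision_value(S, w, a) >= 0 else -1


def satisfies_margin(points, labels, w, a):
    """Check y_i (S_i.w + a) - 1 >= 0 for every training point."""
    return all(y * decision_value(S, w, a) - 1 >= 0 for S, y in zip(points, labels))


def gap(w):
    """Width 2 / ||w|| of the band between H1 and H2."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hand-picked 2-D points and hyper-plane (illustration only).
points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]
w, a = (0.5, 0.5), -2.5

margin_ok = satisfies_margin(points, labels, w, a)
# Points with y_i (S_i.w + a) - 1 == 0 lie exactly on H1 or H2.
on_margin = [S for S, y in zip(points, labels) if y * decision_value(S, w, a) - 1 == 0]
print(margin_ok, on_margin, gap(w), allocate((0.0, 0.0), w, a))
```

In this toy configuration the first three points sit exactly on the two bounding hyper-planes, so they are the candidates for support vectors, while the fourth lies beyond its hyper-plane.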
The kernel-based SVM discriminant function [4] is generally defined as
$$F(S) = \sum_{i=1}^{V} \alpha_i \Phi(S_i, S) + a \tag{4}$$
in which $\alpha_i$ represents the Lagrange multiplier, summed
over the values for which $\alpha_i > 0$. The upper index V
denotes the number of support vectors as described
above. The SVM solution relies on the Lagrangian formulation
of the problem, an optimisation method requiring
$V \leq N$ positive multipliers ($\alpha_i$, $i = 1, 2, \ldots, V$) for each of the
inequalities on the RHS of Equation 3. The general formulation
of the Lagrangian is
$$L = \frac{\|w\|^2}{2} - \sum_{i=1}^{V} \alpha_i y_i (S_i \cdot w + a) + \sum_{i=1}^{V} \alpha_i \tag{5}$$
The SVM solution is obtained by minimising Equation 5
with respect to w and a while simultaneously requiring that
$\frac{dL}{d\alpha_i} = 0 \;\forall i$
or, equivalently, by maximising L and requiring that
its derivatives with respect to both w and a vanish. The latter implies that
$w = \sum_i \alpha_i y_i S_i$ and $\sum_i \alpha_i y_i = 0$, transforming Equation 5
into its dual equivalent
$$L_d = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, S_i \cdot S_j.$$
The SVM model weights are calculated as the product of
the support vector coefficients and their values and used
in forming the allocation rule. Other than the support
vectors ($\alpha_i > 0$), the remaining data points have $\alpha_i = 0$;
these are those lying on the two hyper-planes
$\{H_1, H_2\}$, where $y_i (S_i \cdot w + a) - 1 = 0$,
or beyond them if $y_i (S_i \cdot w + a) - 1 > 0$.
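The step from the Lagrangian to its dual can be written out explicitly. The derivation below is a sketch that assumes the Lagrangian takes the standard form $L = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i y_i (S_i \cdot w + a) + \sum_i \alpha_i$ and uses the two stationarity conditions stated above.

```latex
\begin{align*}
\|w\|^2 &= \Big(\sum_i \alpha_i y_i S_i\Big) \cdot \Big(\sum_j \alpha_j y_j S_j\Big)
         = \sum_{i,j} \alpha_i \alpha_j y_i y_j \, S_i \cdot S_j, \\
\sum_i \alpha_i y_i (S_i \cdot w + a)
        &= \sum_{i,j} \alpha_i \alpha_j y_i y_j \, S_i \cdot S_j
           + a \sum_i \alpha_i y_i
         = \sum_{i,j} \alpha_i \alpha_j y_i y_j \, S_i \cdot S_j, \\
L_d &= \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, S_i \cdot S_j
       - \sum_{i,j} \alpha_i \alpha_j y_i y_j \, S_i \cdot S_j
       + \sum_i \alpha_i \\
    &= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, S_i \cdot S_j .
\end{align*}
```

The term in a drops out because $\sum_i \alpha_i y_i = 0$, so the dual depends on the data only through the inner products $S_i \cdot S_j$.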
3. Analyses and Discussions
We now present the two-level analyses described above
in order to establish whether sunspots follow identifiable
patterns which can be used as inputs in a predictive mod-
el.
3.1. Unsupervised: Initial Patterns and
Maximisation
Figure 3 exhibits the low and high cycles separation
based on the cut-off points above alongside their corresponding
overall bi-modal densities. It is based on the
maximum number of sunspots reached by the full cycles
and the number reached in the first 30 and 40 months.
The cut-off point in the LHS panel is set to the mean of
the averaged maximum early sunspots which, in this
case, is 109, separating the low cycles 1, 5, 6, 7, 9, 10,
12, 13, 14 and 16 from the highs 2, 3, 4, 8, 11, 15, 17, 18,
19, 20, 21, 22 and 23. The densities in the RHS panel
exhibit the emerging bi-modality as a function of time.
Figure 3 suggests that the pattern of each solar activity
Figure 3. Omega cut-off (LHS) and the corresponding bi-modal density (RHS).
cycle is defined by its early patterns. In particular, the
maximum values reached by each cycle appear to provide
an insight into the overall activity of the cycle before
it starts to subside. The foregoing structural detection
of patterns in the sunspots data amounts to unsupervised
modelling. Adopting these patterns as a guide to
the data labelling rule yields the two class priors as
$$\hat{\pi}_l = \frac{\sum_{i=1}^{N} z_{il}}{N} = 0.46 \leftrightarrow \hat{\pi}_h = 0.54$$
with $\theta_{l,h}$ computed as
above. As the average early patterns for cycle 24 fall
below the cut-off point, it is reasonable to suggest that it
will be a low activity cycle. Implementation of SVM
modelling follows below.
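The class priors can be reproduced from the cycle lists given in the text under one assumption that is not stated explicitly in the paper: that cycle 24 is counted among the lows, as the text suggests it will be. The sketch below computes the indicator sum under that assumption; the variable names are hypothetical.

```python
# Cycle allocations as listed in the text.
low_cycles = [1, 5, 6, 7, 9, 10, 12, 13, 14, 16]
high_cycles = [2, 3, 4, 8, 11, 15, 17, 18, 19, 20, 21, 22, 23]

# Assumption: cycle 24 is labelled low, following the suggestion in the text.
labels = {c: "low" for c in low_cycles + [24]}
labels.update({c: "high" for c in high_cycles})

N = len(labels)
# z_il = 1 if cycle i is allocated to the low class, 0 otherwise.
z_low = [1 if labels[c] == "low" else 0 for c in sorted(labels)]

pi_low = sum(z_low) / N
pi_high = 1 - pi_low
print(round(pi_low, 2), round(pi_high, 2))
```

Leaving cycle 24 out would give 10/23, roughly 0.43, for the low class, so the inclusion of cycle 24 is only one possible reading of how the reported values of 0.46 and 0.54 arise.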
3.2. Supervised Level: SVM Supervised
Modelling
Results from SVM modelling based on the initial class
patterns with prior probabilities $\hat{\pi}_l$ and $\hat{\pi}_h$
gave an averaged accuracy of 58% on a cost range of 0.005 to 5 and a
training sample of 500. Posterior class probabilities conditioned
on maximised averages of the early low and
high group means reached an average accuracy of 98%
on the same cost range and training sample size. The
support vectors are shown in Figure 4 with the horizontal
and vertical axes corresponding to the support vectors
and indices respectively.
Figure 5 shows the best discriminating SVM decision
values at two different bandwidths. The bandwidths and
hence decision values are chosen from multiple simulations
as determined by the binary cut-off point demarcating
low from high cycles. Notice how each of the
modes also exhibits sub-modes. To avoid spurious modes
(over-fitting) or masking effects (under-fitting), it is
recommended to use a significance test for changes or, for
clear patterns, graphical visualization.
Typically, SVM model weights for each of the support
vectors are obtained as a cross product of the model
coefficients and support vectors [15]. Weights from multiple
SVM runs can be recorded and their graphical patterns
used to guide model selection. Other SVM outputs
include the individual probabilities and decision
values as in Figure 6. The figure contrasts the lower
accuracy case in the top panel, which highlights the random
nature of class allocation, with the higher accuracy model
in the bottom panel, which shows clear concentrations of
$\hat{\pi}_l$ and $\hat{\pi}_h$
on either side of the class boundary. The observations
corresponding to the vectors in Figure 4, the
decision values in Figure 5 and the corresponding
probabilities in Figure 6 can be identified by indexing.
4. Concluding Remarks and Potential Future
Directions
Predicting solar activity cycles remains one of the major
challenges the scientific community faces, with its intricacy
being compared to predicting, say, the severity of next
year’s winter. In this weather analogue, if all that is
available is a long vector of temperature readings over
many years, the only sensible approach is to search for
Figure 4. Support vectors for the initial patterns (LHS) and maximised parameters (RHS).
Figure 5. SVM decision values at two different bandwidths showing well-separated structures.
Figure 6. Sun cycles class probabilities.
naturally arising structures in the data with the hope that,
if uncovered, they may provide potentially useful information.
This paper adopted the foregoing philosophy and
sought to develop a predictive framework for modelling
sunspots data using inherent distributional properties in
the data. The paper relied on a continuous flow of data
for prediction but, rather than assessing model accuracy
on the NOAA benchmark, an SVM model was trained
and tested on a notionally infinite dataset of cycles.
By examining multiple sets of observations from the
onset of each cycle via graphical visualisation, early patterns
of sun cycles and their binary nature were determined.
Comparing multiple early patterns for each recorded
cycle extracted at different time periods to the corresponding
full cycles revealed that the first 3 years provide
a sufficient basis for predicting the cycle’s peak.
The patterns were then adapted as inputs into an integrated
unsupervised and supervised modelling algorithm.
The novel method’s mechanics are geared towards
simultaneously tracing anomalies via an adaptive approach.
Repeated SVM runs using repeatedly improved parameters
showed that the approach yields greater accuracy and
reliability than conventional approaches. Multiple simulations
of this type can be generated based on the algorithm
above to assist in selecting the most consistent
model. The paper’s main substance can be described as
an enhancement of algorithmic methods for learning
underlying rules from data.
Finally, it is worth noting that while the study was
confined to the conventional periodicity of 11.11 years
[22] with a binary pattern of cycles, periodicities can
differ depending on how a cycle is defined.
Further, while we assumed a binary scenario of the
cycles in Figure 1, different bandwidths are likely to
yield different patterns. To address this limitation, the
paper’s findings highlight potential investigation paths
into such variations. Further, the current study, based on
a single application and a single method, could not confirm
the algorithm’s robustness. Although we adopted
SVM for implementation, the approach is amenable to
any domain-partitioning method. Thus, for model enhancement
purposes, it will be useful to provide a comparative
study using other learning algorithms such as
neural networks and decision trees.
REFERENCES
[1] Bugrien, J., Mwitondi, K. and Shuweihdi, F. (2013).
A Kernel Density Smoothing Method for Determining
an Optimal Number of Clusters in Continuous
Data; The 16th International Conference on
Computational Methods and Experimental Measurements;
2-4 July, 2013, A Coruña, Spain.
[2] Choudhuri, A. R., Chatterjee, P. and Jiang, J. (2007).
Predicting Solar Cycle 24 with a Solar Dynamo
Model; Physical Review Letters, Vol. 98, Issue 13,
American Phys. Society.
[3] Cuevas, A., Febrero, M. and Fraiman, R. (2000).
Estimating the number of clusters; The Canadian
Journal of Statistics, Vol. 28, No. 2, pp 367-382.
[4] Cortes, C. and Vapnik, V. (1995). Support-vector networks;
Machine Learning, Vol. 20, No. 3, pp 273-297,
Kluwer Academic Publishers.
[5] Dempster, A. P., Laird, N. M. and Rubin, D. B.
(1977). Maximum Likelihood from Incomplete Data
via the EM Algorithm; Journal of the Royal Statistical
Society, Vol. 39, pp 1-38.
[6] Hand, D., Mannila, H. and Smyth, P. (2001). Principles
of Data Mining (Adaptive Computation and
Machine Learning); A Bradford Book; ISBN-13:
978-0262082907.
[7] Kane, R. P. (2007). Solar Cycle Predictions Based on
Extrapolation of Spectral Components: An Update;
A Journal for Solar and Solar-Stellar Research and
the Study of Solar Terrestrial Physics; Vol. 246,
Issue 2, pp 487-493, ISSN: 0038-0938.
[8] Kitiashvili, I. and Kosovichev, A. (2009). Prediction
of solar magnetic cycles by a data assimilation
method; Cosmic Magnetic Fields: From Planets, to
Stars and Galaxies; Proceedings IAU Symposium,
No. 259, Edited by Strassmeier, K., Kosovichev, A.
and Beckman, J. (2009) - International Astronomical
Union.
[9] McLachlan, G. and Krishnan, T. (1996). The EM
Algorithm and Extensions; John Wiley.
[10] Mwitondi, K., Said, R. and Yousif, A. (2012). A sequential
data mining method for modelling solar magnetic
cycles; Neural Information Processing, LNCS, Vol.
7663, pp 296-304, Springer.
[11] NOAA (2012). http://www.ngdc.noaa.gov
[12] Pohtila, E. (1980). Climatic fluctuations and forestry
in Lapland; Ecography, Vol. 3, Issue 2, pp 65-136,
ISSN: 1600-0587.
[13] Pielke, R., Avissar, R., Raupach, M., Dolman, A.,
Zeng, X. and Denning, A. (1998). Interactions between
the atmosphere and terrestrial ecosystems:
Influence on weather and climate; Global Change
Biology, Vol. 4, Issue 5, pp 461-475.
[14] Qahwaji, R. and Colak, T. (2007). Automatic
Short-Term Solar Flare Prediction Using Machine
Learning and Sunspot Associations; Solar
Physics, Vol. 241, No. 1, pp 195-211, ISBN
11207-006-0272-5.
[15] R (2011). R Version 2.13.0 for Windows; R Foundation
for Statistical Computing.
[16] Reames, D. (2002). Magnetic topology of impulsive
and gradual solar energetic particle events; The
Astrophysical Journal, Vol. 571, pp 63-66.
[17] Roberts, S. J. (1997). Parametric and Non-parametric
Unsupervised Cluster Analysis; Pattern Recognition,
Vol. 30, No. 5, pp 261-272.
[18] Rycroft, M. J., Israelsson, S. and Price, C. (2000).
The global atmospheric electric circuit, solar activity
and climate change; Journal of Atmospheric and
Solar-Terrestrial Physics, Vol. 62, Issues 17-18, pp
1563-1576.
[19] Schwabe, S. H. (1843). Astronomische Nachrichten,
20, No. 495, pp 234-235.
[20] Siscoe, G. L. (1978). Solar-terrestrial influences on
weather and climate; Climatology Supplement,
Nature, Vol. 276, pp 348-352.
[21] Silverman, B. W. (1981). Using Kernel Density
Estimates to Investigate Multimodality; Journal of
the Royal Statistical Society, B, 43, pp 97-99.
[22] Wolf, J. R. (1852). New studies of the period of
Sunspots and their meanings; Communications of
Natural History; Society in Bern, 255, pp 249-270.