Int. J. Communications, Network and System Sciences, 2012, 5, 603-608
http://dx.doi.org/10.4236/ijcns.2012.529070 Published Online September 2012 (http://www.SciRP.org/journal/ijcns)
Stochastic Binary Neural Networks for Qualitatively
Robust Predictive Model Mapping
A. T. Burrell1, P. Papantoni-Kazakos2
1Computer Science Department, Oklahoma State University, Stillwater, USA
2Electrical Engineering Department, University of Colorado Denver, Denver, USA
Email: tburrell@okstate.edu, Titsa.Papantoni@ucdenver.edu
Received May 15, 2012; revised June 25, 2012; accepted July 10, 2012
ABSTRACT
We consider qualitatively robust predictive mappings of stochastic environmental models, where protection against
outlier data is incorporated. We utilize digital representations of the models and deploy stochastic binary neural net-
works that are pre-trained to produce such mappings. The pre-training is implemented by a back propagating supervised
learning algorithm which converges almost surely to the probabilities induced by the environment, under general er-
godicity conditions.
Keywords: Qualitative Robustness; Predictive Model Mapping; Stochastic Approximation; Stochastic Binary Neural Networks; Real-Time Supervised Learning; Ergodicity
1. Introduction
We consider the case where the statistical behavior of
environmental models must be learned in real time. In
particular, we focus on learning such behavior predic-
tively, as may be applicable in data compression, hy-
pothesis testing or model identification, while statistical
qualitative robustness for protection against outlier data
is sought as well. In this paper, we promote the deploy-
ment of stochastic binary neural networks which imple-
ment predictive model mappings in real time, in interac-
tion with the environment; i.e. supervised learning, while
they also offer sound protection against data outliers. Our
approach uses results from stochastic approximation and
statistical qualitative robustness [1-9]. While powerful
such results have been in existence for a long time, they
have not been given attention synergistically, in the light
of neural network implementations. In this paper, our
objective is to stimulate interest in the application of the
existing theories in such implementations, especially
those addressing environmental models.
When neural networks operate in stochastically de-
scribed environments, supervised learning corresponds to
a statistical sequential estimation problem dealt with by
stochastic approximation methods. There is rich literature
in such methods represented by the works of Abdelhamid
[1], Beran [10], Blum [11], Fabian [2], Fisher [12], Ger-
encser [13], Kashyab et al. [14,15], Kiefer et al. [4],
Kushner [5], Kushner et al. [6], Ljung [7], Ljung et al.
[8], Robbins et al. [16] and Young et al. [9].
In the neural networks literature, supervised learning
has been basically limited to techniques arising from the
Robbins/Monro [16] method and its extensions, with
performance criterion the least squares error. The repre-
sentative works on the subject are those by Barron et al.
[17], Elman et al. [18], Gorman et al. [19], Minsky et al.
[20], Rosenblatt [21], Werbos [22], White [23], Widrow
[24], and Widrow et al. [25]. Literature in the area, when
the performance criterion is, instead, the Kullback-Lei-
bler distance (see Blahut [26] and Kazakos et al. [3]) and
the techniques used do not necessarily arise from the
Robbins/Monro method, is represented by the works of
Ackley et al. [27], Amari et al. [28], Pados et al. [29-33]
and Kogiantis et al. [34].
In the domain of stochastic neural networks, some
more recent results address time-delay issues (Liu et al.
[35] and Wang et al. [36]), while the book by Ling [37]
discusses some general aspects in this area.
The organization of this paper is as follows. In Section
2, we introduce digital finite memory qualitatively robust
predictive mappings, as well as the neural network layers
needed for their implementation. In Section 3, we de-
scribe the operations performed at the predictive neural
network layer. In Section 4, we present the supervised
learning algorithm used at the predictive layer. In Section
5, we draw some conclusions.
2. Digital Finite Memory Qualitatively
Robust Mapping
We consider digital environmental representations. We start by letting $x_1, \ldots, x_n$ denote a sequence of discrete-time observations that represent the environment. Then, given $x_1, \ldots, x_n$, the objective of the digital mapping is to predict which one of $M$ distinct regions the observation $x_{n+1}$ is going to lie in. Denoting these regions $A_j$; $j = 1, \ldots, M$, let us define the probabilities

$$p_j(x_1, \ldots, x_n) = P(x_{n+1} \in A_j \mid x_1, \ldots, x_n);\ 1 \le j \le M$$

which are used to map stochastically an observed sequence $x_1, \ldots, x_n$ onto each of the regions $A_j$, with corresponding probabilities $p_j(x_1, \ldots, x_n)$. Two problems arise immediately:
1) Exploding computational load, due to the increasing memory represented by the sequences $x_1, \ldots, x_n$.
2) Statistical information on the sequences $x_1, \ldots, x_n$ needed for the computation of the probabilities $p_j(x_1, \ldots, x_n)$.
The first problem is resolved if the increasing memory is approximated by finite, say size-$m$, memory. That is, the increasing computational load is, instead, bounded if the process that generates the observations is approximated by an $m$-order Markov process. Then, the information loss is minimized when the process is Gaussian (see Blahut [26]).

Copyright © 2012 SciRes. IJCNS
Thus, to reduce the exploding computational load due
to increasing data memory, we may initially model the
process that generates the environmental data or observa-
tions by an m-order Gaussian Markov process, whose
auto-covariance m × m matrix Q has components identi-
cal to those of the original process. We name this initial
(Gaussian and Markov) process the nominal process.
Starting with our nominal process, but incorporating
then statistical uncertainties in the form of unknown data
outliers, we are led to a powerful qualitatively robust
formalization, which results in a stochastic mapping (see
Papantoni-Kazakos et al. [38]), as follows:
Given observations $x_1, \ldots, x_n$, use the $m$ most recent observations for the prediction of the next datum $x_{n+1}$ and, defining $y_m^T = (x_{n-m+1}, \ldots, x_n)$, decide that $x_{n+1} \in A_j$ with probability $q_j(y_m)$, defined as follows,

$$q_j(y_m) = r(y_m)\,\frac{1}{M} + \left[1 - r(y_m)\right] p_j(y_m) \qquad (1)$$

where $p_j(y_m)$ is the conditional probability of $x_{n+1} \in A_j$, given $y_m^T = (x_{n-m+1}, \ldots, x_n)$, as induced by the Gaussian and Markov nominal process, and where, for some positive finite constant $\beta$,

$$r(y_m) = \min\left(1,\ \beta^{-1}\, y_m^T Q^{-1} y_m\right) \qquad (2)$$
The value of the constant $\beta$ in (2) represents the level of confidence in the “purity” of the data vector $y_m$, in terms of it being generated by the nominal Gaussian process: the higher the value of $\beta$, the higher the level of confidence, whereas, as $\beta$ decreases, increased weight on purely random mappings (represented by the probability $1/M$ per region) is induced.
Robust estimation of the auto-covariance matrix Q may also be required. The components of the auto-covariance matrix Q should emerge from the statistics of the nominal Gaussian process. A scheme for the robust estimation of the matrix Q may arise from robust parameter estimation techniques (see Kazakos et al. [3]).
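As a concrete illustration of (1) and (2), a minimal sketch follows, assuming the inverse matrix $Q^{-1}$ and the nominal conditional probabilities $p_j(y_m)$ have already been obtained (e.g., by the robust pre-estimation just mentioned); the function and variable names are ours, for illustration only:

```python
import random

def r_value(y, Q_inv, beta):
    """Eq. (2): r(y_m) = min(1, beta^{-1} * y_m^T Q^{-1} y_m)."""
    m = len(y)
    quad = sum(y[i] * Q_inv[i][j] * y[j] for i in range(m) for j in range(m))
    return min(1.0, quad / beta)

def robust_map_probs(y, p_nominal, Q_inv, beta):
    """Eq. (1): q_j(y_m) = r(y_m)/M + [1 - r(y_m)] p_j(y_m), per region j."""
    r = r_value(y, Q_inv, beta)
    M = len(p_nominal)
    return [r / M + (1.0 - r) * p_j for p_j in p_nominal]

def draw_region(y, p_nominal, Q_inv, beta, rng=random):
    """Stochastically map y_m onto a region index, drawn with the q_j."""
    q = robust_map_probs(y, p_nominal, Q_inv, beta)
    return rng.choices(range(len(q)), weights=q)[0]
```

Note the qualitatively robust behavior: a large quadratic form (an outlier-looking $y_m$) drives $r(y_m)$ to 1, so the mapping degenerates to the purely random $1/M$ rule instead of trusting the nominal model.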
The robust prediction expression in (1) is based on a
Gaussian assumption regarding the nominal process
which generates the data in the environment, where the
latter assumption is the result of an information-theoretic
approach to the reduction of the computational load
caused by increased past memory. The important robust effects induced by the mapping in (1) remain unaltered, however, when, instead, the probability $p_j(y_m)$ in (1) arises from an arbitrary non-Gaussian process, and when its conditioning on $y_m$ is substituted by conditioning on quantized values of the scalar quantity $y_m^T Q^{-1} y_m$. When quantized values are involved, the implementation of the mapping in (1) requires the following stages:
1) Preprocessing. This stage corresponds to long-term memory and involves the robust pre-estimation (see Kazakos et al. [3]) and storage of the matrix $Q^{-1}$.
2) Processing. This stage corresponds to short-term memory. It uses the matrix $Q^{-1}$ from the preprocessing step and the observation vector $y_m$ to: a) first compute the quadratic expression $y_m^T Q^{-1} y_m$, b) subsequently represent $y_m^T Q^{-1} y_m$ in a quantized form comprised of $N$ distinct values and c) finally, use the quantized values in b) to compute the corresponding value of the function $r(y_m)$ in (2).
3) Predictive Mapping. This stage involves the estimation of the probabilities $p_j(y_m)$ and the computation of the probabilities $q_j(y_m)$ in (1) using inputs from the processing stage, and the subsequent implementation of the prediction mappings.
The three different stages above are performed sequen-
tially by separate but connected neural structures, named
preprocessing layer, processing layer, and predictive
mapping layer, respectively. Our focus in this paper is on
the latter layer: its structure and its operations. Towards
that direction, we first note that, due to the quantization
operations at the processing layer, the expression in (1)
takes the following form:
$$q_j^{\rho} = r^{\rho}\,\frac{1}{M} + \left[1 - r^{\rho}\right] p_j^{\rho};\quad \text{for } y_m^T Q^{-1} y_m \in R_\rho;\ \rho = 1, \ldots, N \qquad (3)$$

where $q_j^{\rho}$, $r^{\rho}$, and $p_j^{\rho}$ denote, respectively, the probabilities $q_j(y_m)$, $r(y_m)$ and $p_j(y_m)$ when the quantized value of $y_m^T Q^{-1} y_m$ equals $R_\rho$.
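The processing-stage lookup implied by (3) can be sketched as follows; the threshold-based quantizer and the table layout are our illustrative assumptions, since the paper does not fix a particular quantizer design:

```python
import bisect

def quantize(quad, thresholds):
    """Map the scalar y_m^T Q^{-1} y_m onto a quantization index rho
    (0..N-1); thresholds holds the N-1 interior bin edges (an assumed,
    not paper-specified, quantizer design)."""
    return bisect.bisect_right(thresholds, quad)

def q_table(r_levels, p_table):
    """Precompute Eq. (3): q_j^rho = r^rho / M + (1 - r^rho) p_j^rho,
    for every quantization level rho and every region j."""
    M = len(p_table[0])
    return [[r / M + (1.0 - r) * p for p in p_row]
            for r, p_row in zip(r_levels, p_table)]
```

At run time the predictive layer then needs only the index `rho` and a table lookup, which is what makes the bounded-memory implementation cheap.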
3. The Neural Predictive Layer
Consider the integer $M$ in (3), and let $s$ be the unique positive integer such that $2^{s-1} < M \le 2^s$. Then, in modulo-2 arithmetic, each state $j$; $j = 1, \ldots, M$, can be represented by an $s$-length 0 - 1 binary sequence $x_1 \cdots x_s$. The state $R$ is provided as an input to the prediction layer by the processing layer, and the former produces a binary sequence $x_1 \cdots x_s$ as a prediction mapping. Given the state $R$, the operations of the prediction layer must be such that a given prediction sequence $x_1 \cdots x_s$ is produced stochastically with probability

$$q(x_1 \cdots x_s \mid R) = r\,\frac{1}{M} + (1 - r)\, p(x_1 \cdots x_s \mid R) \qquad (4)$$

where expression (4) is the same as expression (3) when the binary representation of the positive integer $j$ in the latter is $x_1 \cdots x_s$, and where $p(x_1 \cdots x_s \mid R)$ is the prediction mapping generated by the nominal process that represents the actual data generating environment. Due to the stochastic nature of the rule in (4), such is also the nature of the predictive mapping layer, whose neural representation corresponds then to a stochastic neural network, first developed by Kogiantis et al. [34], when the response of each neuron is limited to binary. We proceed with the description of the latter representation.
Let us temporarily assume that the probabilities $p(x_1 \cdots x_s \mid R)$ have been “learned” and are known. Without loss of generality, let us also assume that $M = 2^s$. The original constraint of binary firing per neuron in the prediction layer leads us to the digital representation of the future states $x_1 \cdots x_s$. The design can be accommodated easily in a binary tree structure. In detail, given the observed state $R$ and the resulting $r$ value, the mapping $x_1 \cdots x_s$ can be obtained via a stochastic binary tree search, on the $2^s$-leaves tree, as follows: 1) With probability $r$ a fair tree-search is activated, where the tree-node $x_1$; $x_1 = 0, 1$, is visited with probability 0.5, and each of the two tree-nodes branching off a visited tree-node $x_1 \cdots x_k$; $1 \le k \le s - 1$, is also visited with probability 0.5; 2) With probability $1 - r$ a generally biased tree-search is activated, where the tree-node $x_1$ is visited with probability $p(x_1 \mid R)$, while from a visited tree-node $x_1 \cdots x_k$; $1 \le k \le s - 1$, the tree-node $x_1 \cdots x_k x_{k+1}$ is visited with probability

$$p(x_{k+1} \mid x_1 \cdots x_k, R) = \frac{p(x_1 \cdots x_k x_{k+1} \mid R)}{p(x_1 \cdots x_k \mid R)},$$

where

$$p(x_1 \cdots x_k \mid R) = \sum_{x_{k+1}, \ldots, x_s} p(x_1 \cdots x_k x_{k+1} \cdots x_s \mid R) \qquad (5)$$
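The two-mode tree search just described can be sketched as follows, where `cond_prob` stands in for the conditional probabilities in (5), a hypothetical lookup supplied by the caller:

```python
import random

def tree_search_predict(R, r, cond_prob, s, rng=random):
    """Generate a prediction sequence x_1...x_s per the stochastic rule
    in (4): with probability r, a fair search (each branch taken with
    probability 0.5); with probability 1 - r, a biased search driven by
    the conditionals p(x_{k+1} | x_1...x_k, R) of (5).

    cond_prob(prefix, R) -> P(next bit = 1 | prefix, R); an assumed
    interface (e.g. a learned table), not specified in the paper."""
    biased = rng.random() >= r            # which tree is activated
    prefix = ()
    for _ in range(s):                    # descend the 2^s-leaves tree
        p_one = cond_prob(prefix, R) if biased else 0.5
        bit = 1 if rng.random() < p_one else 0
        prefix += (bit,)                  # visit the chosen child node
    return prefix
```

The search visits one node per level, so generating a mapping costs $s$ Bernoulli draws rather than an explicit enumeration of all $2^s$ leaf probabilities.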
Thus, the predictive mapping layer may be viewed as being comprised of a fair-search binary tree and a number of biased-search binary trees, each of the latter corresponding to a specific observation state $R$. Given $R$, the common fair-search binary tree is activated with probability $r$, while, with probability $1 - r$, the biased-search binary tree that corresponds to the state $R$ is activated, instead; we name the latter tree the $R$-tree. The nodes of each of the above binary trees are neurons that “fire” if the corresponding tree-nodes are “visited.” Given $R$, a specific mapping $x_1 \cdots x_s$ is generated either equiprobably from the fair-search binary tree with probability $r$, or from the $R$-tree via the sequential stochastic representation in (5) with probability $1 - r$. It is thus in the $R$-tree that the probabilities which generate the data of the environment must be “learned” and then used to generate prediction mappings.
Given the observation state $R$, consider the $R$-tree in conjunction with the sequential stochastic representation in (5) of the corresponding prediction mappings, as generated by the process representing the actual environmental data. Let $u_{x_1 \cdots x_k}$ represent the binary random output of the neuron that corresponds to the node $x_1 \cdots x_k$; $1 \le k \le s$, of the $R$-tree. Then, $u_{x_1 \cdots x_k} = 1$ if and only if $u_{x_1 \cdots x_i} = 1$; $i \le k$. Thus, the output $u_{x_1 \cdots x_k}$ may be viewed as generated by a product, $W_{x_1} W_{x_1 x_2} \cdots W_{x_1 \cdots x_k}$, of mutually independent binary random variables $\{W_{x_1 \cdots x_i};\ 1 \le i \le k\}$, whose distributions at the operational stage of the $R$-tree must be as follows (in view of (5)):

$$\begin{aligned} P(u_{x_1 \cdots x_k} = 1) &= P(W_{x_1} W_{x_1 x_2} \cdots W_{x_1 \cdots x_k} = 1) \\ &= P(W_{x_1} = 1)\, P(W_{x_1 x_2} = 1) \cdots P(W_{x_1 \cdots x_k} = 1) \\ &= p(x_1 \mid R)\, p(x_2 \mid x_1, R) \cdots p(x_k \mid x_1 \cdots x_{k-1}, R) = p(x_1 \cdots x_k \mid R);\ 2 \le k \le s \\ P(u_{x_1} = 1) &= P(W_{x_1} = 1) = p(x_1 \mid R) \end{aligned} \qquad (6)$$

where

$$P(W_{x_1 \cdots x_k} = 1) = p(x_k \mid x_1 \cdots x_{k-1}, R);\ 2 \le k \le s \qquad (7)$$
The above logical arguments and expressions lead to the following neural structure of the $R$-tree: 1) The neuron corresponding to the tree-node $x_1$; $x_1 = 0, 1$, has a binary random variable $W_{x_1}$ built in, where $W_0 = 1 - W_1$. At the operational stage, the neuron must be activated with probability $p(x_1 \mid R)$; thus, $P(W_{x_1} = 1) = p(x_1 \mid R)$ and $P(W_0 = 1) = 1 - P(W_1 = 1)$ then; 2) For $k \ge 2$, the neuron corresponding to the tree-node $x_1 \cdots x_k$ has a binary random variable $W_{x_1 \cdots x_k}$ built in and fires if and only if the latter variable takes the value 1 and simultaneously the neuron corresponding to the tree-node $x_1 \cdots x_{k-1}$ fires as well. Thus, the binary neural output $u_{x_1 \cdots x_k}$ is formed as the product $u_{x_1 \cdots x_k} = u_{x_1 \cdots x_{k-1}} W_{x_1 \cdots x_k}$, where

$$P(u_{x_1 \cdots x_k} = 1) = P(u_{x_1 \cdots x_{k-1}} W_{x_1 \cdots x_k} = 1) = P(u_{x_1 \cdots x_{k-1}} = 1)\, P(W_{x_1 \cdots x_k} = 1) \qquad (8)$$

and where, at the operational stage of the $R$-tree, the probability $P(W_{x_1 \cdots x_k} = 1)$ must be as in (7). We note that

$$W_{x_1 \cdots x_{k-1} 0} = 1 - W_{x_1 \cdots x_{k-1} 1};\ 2 \le k \le s \qquad (9)$$

and thus

$$P(W_{x_1 \cdots x_{k-1} 0} = 1) = 1 - P(W_{x_1 \cdots x_{k-1} 1} = 1);\ 2 \le k \le s$$
As is clear from the derivations and arguments in this section, the parameters of interest in the $R$-tree neural network consist of the independent binary random variables $W_1$ and $\{W_{x_1 \cdots x_{k-1} 1};\ x_i = 0, 1;\ 1 \le i \le k - 1;\ 2 \le k \le s\}$, whose distributions $p(1 \mid R)$ and $\{p(1 \mid x_1 \cdots x_{k-1}, R);\ x_i = 0, 1;\ 1 \le i \le k - 1;\ 2 \le k \le s\}$ must be “learned” in advance, via interaction with the environment.
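As a numerical check of (5)-(7), the following sketch derives the $W$-variable probabilities from a given joint mapping $p(x_1 \cdots x_s \mid R)$ (represented here, purely for illustration, as a dictionary from bit tuples to probabilities) and confirms that the product form in (6) recovers the joint mapping:

```python
from itertools import product

def w_probabilities(joint, s):
    """Derive P(W_{x_1...x_k} = 1) = p(x_k | x_1...x_{k-1}, R), as in (7),
    from a joint mapping p(x_1...x_s | R), via the marginalization in (5)."""
    def marginal(prefix):
        return sum(p for x, p in joint.items() if x[:len(prefix)] == prefix)
    w = {}
    for k in range(1, s + 1):
        for x in product((0, 1), repeat=k):
            denom = marginal(x[:-1])
            w[x] = marginal(x) / denom if denom > 0 else 0.0
    return w

def fire_probability(w, x):
    """Eq. (6): P(u_{x_1...x_k} = 1) is the product of the W probabilities
    along the root-to-node path x_1, x_1 x_2, ..., x_1...x_k."""
    prob = 1.0
    for k in range(1, len(x) + 1):
        prob *= w[x[:k]]
    return prob
```

Since $W_{x_1 \cdots x_{k-1} 0} = 1 - W_{x_1 \cdots x_{k-1} 1}$ by (9), only the probabilities of the “1” siblings actually need to be stored in a real implementation.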
4. Learning at the Predictive Layer
Given the $R$-tree, we observe that, due to (6), any adaptations of the probability $P(u_{x_1 \cdots x_s} = 1)$ back-propagate to adaptations of each of the other involved probabilities. It thus suffices to focus on the learning of the probabilities $P(u_{x_1 \cdots x_s} = 1)$ for the various binary sequences $x_1 \cdots x_s$, which correspond to the responses of the output or “visible” neurons in the $R$-tree network. For ease of presentation, let us now consider a fixed sequence $x_1 \cdots x_s$ (in conjunction with the fixed observed state $R$ that represents the $R$-tree). Let then $p$ denote the value of the probability $p(x_1 \cdots x_s \mid R)$, as induced by the environment, and let $q$ denote the value of the probability $P(u_{x_1 \cdots x_s} = 1)$. Let the natural number $n$ denote discrete observation time from the beginning of the learning stage, and let $\hat{p}_n$ and $\hat{q}_n$ denote estimates at time $n$ of the probability values $p$ and $q$, respectively. Finally, let the random variable $V_n$ be defined as equal to 1 if the environmental event $\{x_1 \cdots x_s; R\}$ occurs at time $n$, and as equal to 0, otherwise, and let

$$Z_n = V_n W_n;\quad W_n = \begin{cases} 0, & \text{w.p. } r \\ 1, & \text{w.p. } 1 - r \end{cases}$$

In Kogiantis et al. [34], a Kullback-Leibler matching criterion between $p$ and $q$ was used, in conjunction with Newton’s iterative numerical method, to develop the supervised learning algorithm stated below.
Algorithm
Initial Values: Select an initial value $\hat{q}_1 \ne 0$, while $\hat{p}_1 = V_1$.
Computational Steps:
1) Given the computed value $\hat{p}_n$ and given $Z_{n+1}$, compute $\hat{p}_{n+1}$ as follows:

$$\hat{p}_{n+1} = \hat{p}_n + \frac{1}{n+1}\left[\frac{Z_{n+1}}{1-r} - \hat{p}_n\right] \qquad (10)$$

For some small positive value $\varepsilon$, the value $\hat{p}_{n+1}$ is corrected to $\varepsilon$ if $\hat{p}_{n+1} \le 0$, and is corrected to $1 - \varepsilon$ if $\hat{p}_{n+1} \ge 1$.
2) Given the computed values $\{\hat{q}_n, \hat{p}_n\}$ and given $Z_{n+1}$, compute $\hat{q}_{n+1}$ as follows:

$$\hat{q}_{n+1} = \hat{q}_n + \frac{\hat{q}_n\left(1 - \hat{q}_n\right)}{\hat{p}_n\left(1 - \hat{p}_n\right)}\left(\hat{p}_n - \hat{q}_n\right) + \frac{\hat{q}_n\left(1 - \hat{q}_n\right)}{\hat{p}_n\left(1 - \hat{p}_n\right)}\left(\hat{p}_{n+1} - \hat{p}_n\right) \qquad (11)$$

where $\hat{p}_{n+1} = \hat{p}_n + \frac{1}{n+1}\left[\frac{Z_{n+1}}{1-r} - \hat{p}_n\right]$ from (10).
For some small positive value $\delta$, the value $\hat{q}_{n+1}$ is corrected to $\delta$ if $\hat{q}_{n+1} \le 0$, and is corrected to $1 - \delta$ if $\hat{q}_{n+1} \ge 1$.
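Putting steps 1) and 2) together, one learning iteration can be sketched as below; the $\hat{q}$ update is our simplified reading of the Newton-type correction in (11), consistent with the Remarks that follow (second term vanishing as $\hat{q} \to \hat{p}$, last term vanishing with the increment of (10)), rather than a verbatim transcription:

```python
def learning_step(n, p_hat, q_hat, z_next, r, eps=1e-3, delta=1e-3):
    """One iteration of the supervised learning recursions (10)-(11):
    n is the iteration index, z_next the gated indicator Z_{n+1}, and
    r the fair-search activation probability."""
    # Eq. (10): recursive running-average estimate of the environmental p.
    p_next = p_hat + (z_next / (1.0 - r) - p_hat) / (n + 1)
    p_next = min(max(p_next, eps), 1.0 - eps)         # epsilon correction
    # Eq. (11), simplified: Newton-type step driving q toward p, plus a
    # tracking term proportional to the Eq. (10) increment.
    gain = q_hat * (1.0 - q_hat) / (p_hat * (1.0 - p_hat))
    q_next = q_hat + gain * (p_hat - q_hat) + gain * (p_next - p_hat)
    q_next = min(max(q_next, delta), 1.0 - delta)     # delta correction
    return p_next, q_next
```

Driving the step with a stream of gated indicators makes the running average in (10) converge to the environmental probability, with $\hat{q}$ tracking it.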
Remarks: 1) Expression (10) is the computationally efficient recursive estimation format of the probabilities that represent the environment. 2) The expression in (11) includes correction terms induced by Newton’s iterative numerical method, when the latter is applied on the Kullback-Leibler matching criterion between environmental probabilities and the probabilistic adaptations used in the supervised learning process. The last term in (11) converges to zero as the estimate in (10) converges to its true value. The second term in (11) converges to zero as the estimate of $q$ converges to the estimate of $p$. 3) The small $\varepsilon$ and $\delta$ correction biases are used to prevent the corresponding probability estimates from diverging to the 0 or 1 degenerate values.
We now proceed with the statement of a theorem first
proved in Kogiantis et al. [34].
Theorem
Let the process which generates the observed data in the environment be ergodic. Let then $s$ denote the probability of the event $\{x_1 \cdots x_s; R\}$, as induced by the latter process. Then, the supervised learning algorithm converges to the probability $s$, almost surely, with rate inversely proportional to the sample/iteration size $n$.
Proof Outline
Here, we present an outline of the Theorem’s proof. 1) If the observed data are generated by an ergodic process, then the recursive sequential estimate in (10) will converge to the probability $s$. 2) The pair sequence $\{\hat{q}_n, \hat{p}_n\}$ is a two-dimensional Markov process. The expected value of the drift $\hat{q}_{n+1} - \hat{q}_n$, conditioned on $\hat{q}_n = \hat{p}_n = s$, equals zero, as deduced from expression (11). 3) In view of the result in 2), it is then shown that the supremum of the conditional expected drift in 2), multiplied by $n$, converges to negative values for all values of the absolute difference $|\hat{q}_n - s|$ that are larger than some given positive small value. 4) Using, finally, Blum’s condition [11], the results in 2) and 3) above guarantee almost sure convergence.
We note that, in the Theorem, if the process that generates the observed data in the environment is ergodic, and if $\{s(x_1 \cdots x_s \mid R)\}$ denote the prediction mappings induced by the latter process, then, via the learning algorithm and with almost sure convergence, the prediction mappings produced by the predictive mapping layer are governed by the probabilities

$$q(x_1 \cdots x_s \mid R) = r\,\frac{1}{M} + (1 - r)\, s(x_1 \cdots x_s \mid R)$$
In Kogiantis et al. [34], it was found that the learning algorithm converges rapidly to predictive probability mappings that are close to those induced by the environment, even under mismatched network conditions. Specifically, when past dependence decays fast with distance, then, even when the network order is less than the order of the Markovian environmental model, convergence to almost the true process is attained in fewer than fifty iterations in most cases.
5. Conclusion
We presented a neural network implementation for a
digital qualitatively robust predictive mapping of envi-
ronmental models. The mapping uses synergistically re-
sults from statistical qualitative robustness and stochastic
binary neural representation to realize digital real-time
predictive operations which identify the environmental
model, while they simultaneously protect the operations
against data outliers. The supervised learning algorithm
recommended for the training of the neural network is
based on stochastic approximation principles applied to
the Kullback-Leibler matching criterion, in conjunction
with Newton’s iterative numerical method, and con-
verges almost surely for models generated by ergodic
processes. The considered predictive mappings have numerous applications, ranging from data compression to model identification to sequential model hypothesis testing.
REFERENCES
[1] F. Abdelhamid, “Transformation of Observations in Sto-
chastic Approximation,” Annals of Statistics, Vol. 1, No.
6, 1973, pp. 1158-1174. doi:10.1214/aos/1176342564
[2] V. Fabian, “On Asymptotic Normality in Stochastic Ap-
proximation,” Annals of Mathematical Statistics, Vol. 39,
No. 4, 1968, pp. 1327-1332.
doi:10.1214/aoms/1177698258
[3] D. Kazakos and P. Papantoni-Kazakos, “Detection and
Estimation,” Computer Science Press, New York, 1989.
[4] J. Kiefer and J. Wolfowitz, “Stochastic Estimation of the
Maximum of a Regression Function,” Annals of Mathe-
matical Statistics, Vol. 23, No. 3, 1952, pp. 462-466.
doi:10.1214/aoms/1177729392
[5] H. Kushner, “Asymptotic Global Behavior for Stochastic
Approximations and Diffusions with Slowly Decreasing
Noise Effects: Global Minimization via Monte Carlo,”
SIAM Journal of Applied Mathematics, Vol. 47, No. 1,
1987, pp. 169-185. doi:10.1137/0147010
[6] H. Kushner and D. Clark, “Stochastic Approximation
Methods for Constrained and Unconstrained Systems,”
Springer-Verlag, Berlin, 1978.
doi:10.1007/978-1-4684-9352-8
[7] L. Ljung, “Analysis of Recursive Stochastic Algorithms,”
IEEE Transactions on Automatic Control, Vol. 22, No. 4,
1977, pp. 551-575. doi:10.1109/TAC.1977.1101561
[8] L. Ljung and T. Söderström, “Theory and Practice of
Recursive Identification,” MIT Press, Cambridge, 1983.
[9] T. Y. Young and R. A. Westerberg, “Stochastic Appro-
ximation with a Nonstationary Regression Function,”
IEEE Transactions on Information Theory, Vol. IT-18,
No. 4, 1972, pp. 518-519. doi:10.1109/TIT.1972.1054851
[10] R. Beran, “Adaptive Autoregressive Process,” Annals of
the Institute of Statistical Mathematics, Vol. 28, No. 1,
1976, pp. 77-89. doi:10.1007/BF02504731
[11] J. R. Blum, “Multidimensional Stochastic Approximation Methods,” Annals of Mathematical Statistics, Vol. 25, No. 4, 1954, pp. 737-744. doi:10.1214/aoms/1177728659
[12] R. A. Fisher, “The Goodness of Fit of Regression Formu-
lae and the Distribution of Regression Coefficients,”
Journal of the Royal Statistical Society, Vol. 85, No. 4,
1922, pp. 597-612. doi:10.2307/2341124
[13] L. Gerencser, “Parameter Tracking of Time-Varying Con-
tinuous-Time Linear Stochastic Systems,” In: C. I.
Byrnes and A. Lindquist, Eds., Modeling, Identification
and Robust Controls, North-Holland, Amsterdam, 1986,
pp. 581-594.
[14] R. L. Kashyap and C. C. Blaydon, “Recovery of Func-
tions from Noisy Measurements Taken at Randomly Se-
lected Points and Its Application to Pattern Classifica-
tion,” Proceedings of the IEEE, Vol. 54, No. 8, 1966, pp.
1127-1129. doi:10.1109/PROC.1966.5051
[15] R. L. Kashyap, C. Blaydon and K. S. Fu, “Stochastic
Approximation,” In: J. M. Mendel and K. S. Fu, Eds.,
Adaptive Learning and Pattern Recognition Systems, Aca-
demic Press, New York, 1970, pp. 329-355.
doi:10.1016/S0076-5392(08)60499-3
[16] H. Robbins and S. Monro, “A Stochastic Approximation
Method,” Annals of Mathematical Statistics, Vol. 22, No.
3, 1951, pp. 400-407. doi:10.1214/aoms/1177729586
[17] A. R. Barron, F. W. van Straten and R. L. Barron, “Adap-
tive Learning Network Approach to Weather Forecasting:
A Summary,” Proceedings of the International Confer-
ence on Cybernetics and Society, 1977, pp. 724-727.
[18] J. Elman and D. Zipser, “Learning the Hidden Structure
of Speech,” Journal of the Acoustical Society of America,
Vol. 83, No. 4, 1988, pp. 1615-1626.
doi:10.1121/1.395916
[19] P. Gorman and T. Sejnowski, “Analysis of Hidden Units
in a Layered Network Trained to Classify Sonar Targets,”
Neural Networks, Vol. 1, No. 1, 1988, pp. 75-90.
doi:10.1016/0893-6080(88)90023-8
[20] M. Minsky and S. Papert, “Perceptrons,” MIT Press, Cambridge, 1969.
[21] F. Rosenblatt, “The Perceptron: A Perceiving and Recog-
nizing Automaton,” Report 85-60-1, Cornell Aeronautical
Laboratory, Buffalo, New York, 1957.
[22] P. Werbos, “Beyond Regression: New Tools for Predic-
tion and Analysis in the Behavioral Sciences,” Ph.D.
Dissertation, Harvard University, Cambridge, 1974.
[23] H. White, “Some Asymptotic Results for Learning in
Single Hidden-Layer Feedforward Network Models,”
American Statistical Association, Vol. 84, No. 408, 1989,
pp. 1003-1013. doi:10.1080/01621459.1989.10478865
[24] B. Widrow, “Generalization and Information Storage in
Networks of Adaline Neurons,” In: M. D. Yovits, G. T.
Jacobi and G. D. Goldstein, Eds., Self-Organizing Sys-
tems, Spartan Books, Washington DC, 1962, pp. 435-461.
[25] B. Widrow and M. E. Hoff, “Adaptive Switching Cir-
cuits,” 1960 IRE WESCON Convention Record, 1960, pp.
96-104.
[26] R. E. Blahut, “Hypothesis Testing and Information Theory,” IEEE Transactions on Information Theory, Vol. IT-20, No. 4, 1974, pp. 405-417.
[27] D. H. Ackley, G. E. Hinton and T. J. Sejnowski, “A Learning Algorithm for Boltzmann Machines,” Cognitive Science, Vol. 9, No. 1, 1985, pp. 147-169.
doi:10.1207/s15516709cog0901_7
[28] S. Amari, K. Kurata and H. Nagaoka, “Information Geometry of Boltzmann Machines,” IEEE Transactions on Neural Networks, Vol. 3, No. 2, 1992, pp. 260-271.
doi:10.1109/72.125867
[29] D. Pados and P. Papantoni-Kazakos, “New Non-Least-
Squares Neural Network Learning Algorithms for Hy-
pothesis Testing,” IEEE Transactions on Neural Net-
works, Vol. 6, No. 3, 1995, pp. 596-609.
doi:10.1109/72.377966
[30] D. Pados, K. W. Halford, D. Kazakos and P. Papantoni-
Kazakos, “Distributed Binary Hypothesis Testing with
Feedback,” IEEE Transactions on Systems, Man, and
Cybernetics, Vol. 25, No. 1, 1995, pp. 21-42.
[31] D. Pados, P. Papantoni-Kazakos, D. Kazakos and A. Ko-
giantis, “On-Line Threshold Learning for Neyman-Pear-
son Distributed Detection,” IEEE Transactions on Sys-
tems, Man, and Cybernetics, Vol. 24, No. 10, 1994, pp.
1519- 1531. doi:10.1109/21.310534
[32] D. Pados and P. Papantoni-Kazakos, “A Note on the Es-
timation of the Generalization Error and the Prevention of
Overfitting,” IEEE International Conference on Neural
Networks, Orlando, 1994.
[33] D. Pados and P. Papantoni-Kazakos, “A Class of Ney-
man-Pearson and Bayes Learning Algorithms for Neural
Classification,” IEEE International Symposium on Infor-
mation Theory, Trondheim, 1-27 July 1994.
[34] A. G. Kogiantis and P. Papantoni-Kazakos, “Operations
and Learning in Neural Networks for Robust Prediction,”
IEEE Transactions on Systems, Man, and Cybernetics,
Vol. 27, No. 3, 1997, pp. 402-411.
[35] Y. Liu, Z. Wang and X. Liu, “Robust Stability of Dis-
crete-Time Stochastic Neural Networks with Time-
Varying Delays,” 4th International Symposium on Neural
Networks, Vol. 71, No. 4-6, 2008, pp. 823-833.
[36] Z. Wang, Y. Liu, M. Li and X. Liu, “Stability for Sto-
chastic Cohen-Grossberg Neural Networks with Mixed
Time Delays,” IEEE Transactions on Neural Networks,
Vol. 17, No. 3, 2006, pp. 814-820.
doi:10.1109/TNN.2006.872355
[37] H. Ling, “Stochastic Neural Networks,” LAP Lambert
Academic Publishing, 2010.
[38] P. Papantoni-Kazakos, D. Kazakos and K. Birmiwal,
“Predictive Analog-to-Digital Conversion for Resistance
to Data Outliers,” Information and Computation, Vol. 98,
No. 1, 1992, pp. 56-98.
doi:10.1016/0890-5401(92)90042-E