Int. J. Communications, Network and System Sciences, 2012, 5, 603-608
http://dx.doi.org/10.4236/ijcns.2012.529070 Published Online September 2012 (http://www.SciRP.org/journal/ijcns)

Stochastic Binary Neural Networks for Qualitatively Robust Predictive Model Mapping

A. T. Burrell (1), P. Papantoni-Kazakos (2)
(1) Computer Science Department, Oklahoma State University, Stillwater, USA
(2) Electrical Engineering Department, University of Colorado Denver, Denver, USA
Email: tburrell@okstate.edu, Titsa.Papantoni@ucdenver.edu

Received May 15, 2012; revised June 25, 2012; accepted July 10, 2012

ABSTRACT

We consider qualitatively robust predictive mappings of stochastic environmental models, where protection against outlier data is incorporated. We utilize digital representations of the models and deploy stochastic binary neural networks that are pre-trained to produce such mappings. The pre-training is implemented by a back-propagating supervised learning algorithm which converges almost surely to the probabilities induced by the environment, under general ergodicity conditions.

Keywords: Qualitative Robustness; Predictive Model Mapping; Stochastic Approximation; Stochastic Binary Neural Networks; Real-Time Supervised Learning; Ergodicity

1. Introduction

We consider the case where the statistical behavior of environmental models must be learned in real time. In particular, we focus on learning such behavior predictively, as may be applicable in data compression, hypothesis testing, or model identification, while statistical qualitative robustness for protection against outlier data is sought as well. In this paper, we promote the deployment of stochastic binary neural networks which implement predictive model mappings in real time, in interaction with the environment, i.e., via supervised learning, while they also offer sound protection against data outliers. Our approach uses results from stochastic approximation and statistical qualitative robustness [1-9].
While such powerful results have been in existence for a long time, they have not been given attention synergistically, in the light of neural network implementations. In this paper, our objective is to stimulate interest in the application of the existing theories in such implementations, especially those addressing environmental models.

When neural networks operate in stochastically described environments, supervised learning corresponds to a statistical sequential estimation problem dealt with by stochastic approximation methods. There is rich literature in such methods, represented by the works of Abdelhamid [1], Beran [10], Blum [11], Fabian [2], Fisher [12], Gerencser [13], Kashyap et al. [14,15], Kiefer et al. [4], Kushner [5], Kushner et al. [6], Ljung [7], Ljung et al. [8], Robbins et al. [16] and Young et al. [9]. In the neural networks literature, supervised learning has been basically limited to techniques arising from the Robbins/Monro [16] method and its extensions, with the least squares error as the performance criterion. The representative works on the subject are those by Barron et al. [17], Elman et al. [18], Gorman et al. [19], Minsky et al. [20], Rosenblatt [21], Werbos [22], White [23], Widrow [24], and Widrow et al. [25]. Literature in the area, when the performance criterion is, instead, the Kullback-Leibler distance (see Blahut [26] and Kazakos et al. [3]) and the techniques used do not necessarily arise from the Robbins/Monro method, is represented by the works of Ackley et al. [27], Amari et al. [28], Pados et al. [29-33] and Kogiantis et al. [34]. In the domain of stochastic neural networks, some more recent results address time-delay issues (Liu et al. [35] and Wang et al. [36]), while the book by Ling [37] discusses some general aspects in this area.

The organization of this paper is as follows.
In Section 2, we introduce digital finite memory qualitatively robust predictive mappings, as well as the neural network layers needed for their implementation. In Section 3, we describe the operations performed at the predictive neural network layer. In Section 4, we present the supervised learning algorithm used at the predictive layer. In Section 5, we draw some conclusions.

Copyright © 2012 SciRes. IJCNS

2. Digital Finite Memory Qualitatively Robust Mapping

We consider digital environmental representations. We start by letting x_1, ..., x_n denote a sequence of discrete-time observations that represent the environment. Then, given x_1, ..., x_n, the objective of the digital mapping is to predict which one of M distinct regions the observation x_{n+1} is going to lie in. Denoting these regions A_j; j = 1, ..., M, let us define the probabilities

p_j(x_1, ..., x_n) = P(x_{n+1} ∈ A_j | x_1, ..., x_n); j = 1, ..., M

which are used to map stochastically an observed sequence x_1, ..., x_n onto each of the regions A_j, with corresponding probabilities p_j(x_1, ..., x_n). Two problems arise immediately:

1) Exploding computational load, due to the increasing memory represented by the sequences x_1, ..., x_n.

2) Statistical information on the sequences x_1, ..., x_n is needed for the computation of the probabilities p_j(x_1, ..., x_n).

The first problem is resolved if the increasing memory is approximated by finite, say size-m, memory. That is, the increasing computational load is, instead, bounded if the process that generates the observations is approximated by an m-order Markov process. Then, the information loss is minimized when the process is Gaussian (see Blahut [26]).

Thus, to reduce the exploding computational load due to increasing data memory, we may initially model the process that generates the environmental data or observations by an m-order Gaussian Markov process, whose m × m autocovariance matrix Q has components identical to those of the original process. We name this initial (Gaussian and Markov) process the nominal process. Starting with our nominal process, but then incorporating statistical uncertainties in the form of unknown data outliers, we are led to a powerful qualitatively robust formalization, which results in a stochastic mapping (see Papantoni-Kazakos et al.
[38]), as follows: Given observations x_1, ..., x_n, use the m most recent observations for the prediction of the next datum x_{n+1} and, defining y_m = [x_{n-m+1}, ..., x_n]^T, decide that x_{n+1} ∈ A_j with probability q_j(y_m), defined as follows:

q_j(y_m) = r(y_m) M^{-1} + [1 - r(y_m)] p_j(y_m); j = 1, ..., M    (1)

where p_j(y_m) is the conditional probability of x_{n+1} ∈ A_j, given y_m = [x_{n-m+1}, ..., x_n]^T, as induced by the Gaussian and Markov nominal process, and where, for some positive finite constant γ,

r(y_m) = min(1, γ^{-1} y_m^T Q^{-1} y_m)    (2)

The value of the constant γ in (2) represents the level of confidence on the "purity" of the data vector y_m, in terms of it being generated by the nominal Gaussian process: the higher the value of γ, the higher the level of confidence, whereas, as γ decreases, increased weight on purely random mappings (represented by the probability M^{-1} per region) is induced.

Robust estimation of the autocovariance matrix Q may also be required. The components of the autocovariance matrix Q should emerge from the statistics of the nominal Gaussian process. A scheme for the robust estimation of the matrix Q may arise from robust parameter estimation techniques (see Kazakos et al. [3]).

The robust prediction expression in (1) is based on a Gaussian assumption regarding the nominal process which generates the data in the environment, where the latter assumption is the result of an information-theoretic approach to the reduction of the computational load caused by increased past memory. The important robust effects induced by the mapping in (1) remain unaltered, however, when, instead, the probability p_j(y_m) in (1) arises from an arbitrary non-Gaussian process, and when its conditioning on y_m is substituted by conditioning on quantized values of the scalar quantity y_m^T Q^{-1} y_m. When quantized values are involved, the implementation of the mapping in (1) requires the following stages:

1) Preprocessing. This stage corresponds to long-term memory and involves the robust pre-estimation (see Kazakos et al. [3]) and storage of the matrix Q^{-1}.
2) Processing. This stage corresponds to short-term memory. It uses the matrix Q^{-1} from the preprocessing step and the observation vector y_m to: a) first compute the quadratic expression y_m^T Q^{-1} y_m; b) subsequently represent y_m^T Q^{-1} y_m in a quantized form comprised of N distinct values; and c) finally, use the quantized values to compute the corresponding value of the function r(y_m) in (2).

3) Predictive Mapping. This stage involves the estimation of the probabilities p_j(y_m) and the computation of the probabilities q_j(y_m) in (1) using inputs from the processing stage, and the subsequent implementation of the prediction mappings.

The three different stages above are performed sequentially by separate but connected neural structures, named the preprocessing layer, the processing layer, and the predictive mapping layer, respectively. Our focus in this paper is on the latter layer: its structure and its operations. Towards that direction, we first note that, due to the quantization operations at the processing layer, the expression in (1) takes the following form:

q_j = r M^{-1} + (1 - r) p_j; j = 1, ..., M; for y_m^T Q^{-1} y_m quantized to R_ρ; ρ = 1, ..., N    (3)

where r, q_j, and p_j denote, respectively, the probabilities r(y_m), q_j(y_m), and p_j(y_m)
A. T. BURRELL, P. PAPANTONIKAZAKOS 605 when the quantized value of 1 m Tm Qy equals R . 3. The Neural Predictive Layer Con sider th e integ er M in (3), and let s be a unique posi tive integer, such that 1 22 s 1,jM ,. Then, in modulo2 arithmetic, each state j, can be represented by an slength 0  1 binary sequence 1 x R. The state is provided as an input to the prediction layer by the processing layer, and the former produces a binary se quence 1 x as a prediction mapping. Given the state R , the operations of the prediction layer must be such that, a given prediction sequence 1 x is produced stochastically with probability. 1 1 ss qxxpxxR M 1 1 Rr r (4) where expression (4) is the same as expression (3) when the binary representation of the positive integer j in the latter is 1 x, and where x R 1s px is the pre diction mapping generated by the nominal process that represents the actual data generating environment. Due to the stochastic nature of the rule in (4), such is also the nature of the predictive mapping layer, whose neural representation corresponds then to a stochastic neural network, first developed by Kogiantis et al. [17], when the response of each neuron is limited to binary. We proceed with the descriptio n of the latter representation. Let us temporarily assume that the probabilities pxxR 1s ,, have been “learned” and are known. Without lack in generality, let us also assume that M = 2s. The original constraint of binary firing per neuron in the prediction layer leads us to the digital representation of the future states 1 x R . The design can be accom modated easily in a binary tree structure. 
In detail, given the observed state R and the resulting r value, the mapping x_1, ..., x_s can be obtained via a stochastic binary tree search on the 2^s-leaves tree, as follows:

1) With probability r, a fair tree-search is activated, where the tree-node x_1; x_1 = 0, 1, is visited with probability 0.5, and each of the two tree-nodes branching off a visited tree-node x_1, ..., x_{k-1}; 1 < k ≤ s, is also visited with probability 0.5;

2) With probability 1 - r, a generally biased tree-search is activated, where the tree-node x_1 is visited with probability p(x_1 | R), while from a visited tree-node x_1, ..., x_{k-1}; 1 < k ≤ s, the tree-node x_1, ..., x_{k-1}, x_k is visited with probability

p(x_k | x_1, ..., x_{k-1}, R) = p(x_1, ..., x_k | R) / p(x_1, ..., x_{k-1} | R)

where

p(x_1, ..., x_k | R) = Σ_{x_{k+1}, ..., x_s} p(x_1, ..., x_s | R)    (5)

Thus, the predictive mapping layer may be viewed as comprised of a fair-search binary tree and a number of biased-search binary trees, each of the latter corresponding to a specific observation state R. Given R, the common fair-search binary tree is activated with probability r, while, with probability 1 - r, the biased-search binary tree that corresponds to the state R is activated instead; we name the latter tree the R-tree. The nodes of each of the above binary trees are neurons that "fire" if the corresponding tree-nodes are "visited." Given R, a specific mapping x_1, ..., x_s is generated either equiprobably from the fair-search binary tree with probability r, or from the R-tree via the sequential stochastic representation in (5) with probability 1 - r. It is thus in the R-tree that the probabilities which generate the data of the environment must be "learned" and then used to generate prediction mappings.

Given the observation state R, consider the R-tree in conjunction with the sequential stochastic representation in (5) of the corresponding prediction mappings, as generated by the process representing the actual environmental data. Let u_{x_1...x_k} represent the binary random output of the neuron that corresponds to the node x_1, ..., x_k; 1 ≤ k ≤ s, of the R-tree.
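Before turning to the neuron-level product representation, the two-tree search itself can be sketched as a small simulation. The values s = 3, r = 0.1, and an i.i.d. environment with p(x_k = 1 | R) = 0.8 are hypothetical stand-ins for the conditionals of (5), used only to check that the empirical output frequencies match the mixture in (4):

```python
import random

def predict(s, r, p_cond, rng):
    """One prediction mapping x_1..x_s per Eq. (4): with probability r, a fair
    tree search (each bit equiprobable); with probability 1 - r, a biased search
    whose branch probabilities are the conditionals p(x_k | x_1..x_{k-1}, R) of Eq. (5)."""
    bits = []
    if rng.random() < r:                       # fair-search binary tree
        for _ in range(s):
            bits.append(rng.random() < 0.5)
    else:                                      # biased-search R-tree
        for _ in range(s):
            p1 = p_cond(tuple(bits))           # P(next bit = 1 | prefix, R)
            bits.append(rng.random() < p1)
    return tuple(int(b) for b in bits)

# Hypothetical environment: 3 i.i.d. bits with P(1) = 0.8, given R.
p_cond = lambda prefix: 0.8
r = 0.1                                        # small weight on the fair tree
rng = random.Random(0)
counts = {}
for _ in range(20000):
    x = predict(3, r, p_cond, rng)
    counts[x] = counts.get(x, 0) + 1
# Empirical frequency of (1,1,1) ≈ r/8 + (1 - r) * 0.8^3 ≈ 0.473, as Eq. (4) requires.
```

The simulation draws each sequence leaf-by-leaf exactly as the tree-node visits do; the neural layer realizes the same draws with binary firing neurons.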
Then, u_{x_1...x_k} = 1 if and only if u_{x_1...x_i} = 1 for all i ≤ k. Thus, the output u_{x_1...x_k} may be viewed as generated by a product, W_{x_1} W_{x_1 x_2} ... W_{x_1...x_k}, of mutually independent binary random variables {W_{x_1...x_i}; 1 ≤ i ≤ k}, whose distributions at the operational stage of the R-tree must be as follows (in view of (5)):

P(u_{x_1} = 1) = P(W_{x_1} = 1) = p(x_1 | R)

P(u_{x_1...x_k} = 1) = P(W_{x_1} W_{x_1 x_2} ... W_{x_1...x_k} = 1)
                    = P(W_{x_1} = 1) P(W_{x_1 x_2} = 1) ... P(W_{x_1...x_k} = 1)
                    = p(x_1, ..., x_k | R); 2 ≤ k ≤ s    (6)

where

P(W_{x_1...x_k} = 1) = p(x_k | x_1, ..., x_{k-1}, R); 2 ≤ k ≤ s    (7)

The above logical arguments and expressions lead to the following neural structure of the R-tree:

1) The neuron corresponding to the tree-node x_1; x_1 = 0, 1, has a binary random variable W_{x_1} built in, where W_0 + W_1 = 1. At the operational stage, the neuron must be activated with probability p(x_1 | R); thus, P(W_{x_1} = 1) = p(x_1 | R), where P(W_1 = 1) = 1 - P(W_0 = 1);

2) For k ≥ 2, the neuron corresponding to the tree-node x_1, ..., x_k has a binary random variable W_{x_1...x_k} built in and fires if and only if the latter variable takes the value 1 and, simultaneously, the neuron corresponding to the tree-node x_1, ..., x_{k-1} fires as well. Thus, the binary neural output
u_{x_1...x_k} is formed as the product u_{x_1...x_k} = W_{x_1...x_k} u_{x_1...x_{k-1}}, where

P(u_{x_1...x_k} = 1) = P(u_{x_1...x_{k-1}} = 1) P(W_{x_1...x_k} = 1)    (8)

and where, at the operational stage of the R-tree, the probability P(W_{x_1...x_k} = 1) must be as in (7). We note that

W_{x_1...x_{k-1} 0} + W_{x_1...x_{k-1} 1} = 1; 2 ≤ k ≤ s    (9)

and thus P(W_{x_1...x_{k-1} 1} = 1) = 1 - P(W_{x_1...x_{k-1} 0} = 1); 2 ≤ k ≤ s.

As is clear from the derivations and arguments in this section, the parameters of interest in the R-tree neural network consist of the independent binary random variables W_1 and {W_{x_1...x_{k-1} 1}; x_i = 0, 1; 1 ≤ i ≤ k - 1; 2 ≤ k ≤ s}, whose distributions p(1 | R) and {p(1 | x_1, ..., x_{k-1}, R); x_i = 0, 1; 1 ≤ i ≤ k - 1; 2 ≤ k ≤ s} must be "learned" in advance, via interaction with the environment.

4. Learning at the Predictive Layer

Given the R-tree, we observe that, due to (6), any adaptations of the probability P(u_{x_1...x_s} = 1) back-propagate to adaptations of each of the other involved probabilities. It thus suffices to focus on the learning of the probabilities P(u_{x_1...x_s} = 1) for the various binary sequences x_1, ..., x_s, which correspond to the responses of the output or "visible" neurons in the R-tree network.

For ease of presentation, let us now consider a fixed sequence x_1, ..., x_s (in conjunction with the fixed observed state R that represents the R-tree). Let then p denote the value of the probability p(x_1, ..., x_s | R), as induced by the environment, and let q denote the value of the probability P(u_{x_1...x_s} = 1). Let the natural number n denote discrete observation time from the beginning of the learning stage, and let p̂_n and q̂_n denote estimates at time n of the probability values p and q, respectively. Finally, let the random variable V_n be defined as equal to 1 if the environmental event {x_1, ..., x_s | R} occurs at time n, and as equal to 0 otherwise, and let Z_n = V_n W_n, where W_n is a binary variable, independent of V_n, that equals 0 with probability r and 1 with probability 1 - r. In Kogiantis et al.
[34], a Kullback-Leibler matching criterion between p and q was used, in conjunction with Newton's iterative numerical method, to develop the supervised learning algorithm stated below.

Algorithm

Initial Values: Select an initial value p̂_1 = v; 0 < v < 1, while q̂_1 = p̂_1.

Computational Steps:

1) Given the computed value p̂_n and the observation z_{n+1}, compute p̂_{n+1} as follows:

p̂_{n+1} = p̂_n + (n + 1)^{-1} [z_{n+1} - (1 - r) p̂_n]    (10)

For some small positive value ε, the value p̂_{n+1} is corrected to ε if p̂_{n+1} < ε, and is corrected to 1 - ε if p̂_{n+1} > 1 - ε.

2) Given the computed values q̂_n, p̂_n and given z_{n+1}, compute q̂_{n+1} as follows:

q̂_{n+1} = q̂_n - (q̂_n - p̂_n) q̂_n (1 - q̂_n) / [p̂_n (1 - q̂_n)^2 + (1 - p̂_n) q̂_n^2]
        + [q̂_n (1 - q̂_n) / (p̂_n (1 - p̂_n))] (n + 1)^{-1} [z_{n+1} - (1 - r) p̂_n]    (11)

where p̂_{n+1} = p̂_n + (n + 1)^{-1} [z_{n+1} - (1 - r) p̂_n], from (10). For some small positive value ε, the value q̂_{n+1} is corrected to ε if q̂_{n+1} < ε, and is corrected to 1 - ε if q̂_{n+1} > 1 - ε.

Remarks: 1) Expression (10) is the computationally efficient recursive estimation format of the probabilities that represent the environment. 2) The expression in (11) includes correction terms induced by Newton's iterative numerical method, when the latter is applied on the Kullback-Leibler matching criterion between environmental probabilities and the probabilistic adaptations used in the supervised learning process. The last term in (11) converges to zero as the estimate in (10) converges to its true value. The second term in (11) converges to zero as the estimate of q converges to the estimate of p. 3) The small ε and 1 - ε correction biases are used to prevent the corresponding probability estimates from diverging to the 0 or 1 degenerate values.

We now proceed with the statement of a theorem first proved in Kogiantis et al. [34].

Theorem

Let the process which generates the observed data in the environment be ergodic. Let then s = s(x_1, ..., x_s | R) denote the probability of the event {x_1, ..., x_s | R}, as induced by the latter process.
Then, the supervised learning algorithm converges to the probability s, almost surely, with rate inversely proportional to the sample/iteration size n.
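The learning recursions can be exercised in a short simulation. The sketch below assumes the thinned observations Z_n = V_n W_n with P(W_n = 1) = 1 - r, as defined for the algorithm; the environment probability p = 0.3, the robustness weight r = 0.2, the initial value v = 0.5, and the bias ε = 10^-4 are hypothetical, chosen only to illustrate the convergence claimed by the Theorem:

```python
import random

def learn(p, r, n_steps, v=0.5, eps=1e-4, seed=0):
    """Sketch of the supervised learning recursions: p_hat tracks the environment
    probability p via (10); q_hat follows p_hat via the Newton correction applied
    to the Kullback-Leibler matching criterion, as in (11)."""
    rng = random.Random(seed)
    p_hat = q_hat = v
    for n in range(1, n_steps + 1):
        v_n = rng.random() < p               # V_n: environment event indicator
        w_n = rng.random() < (1.0 - r)       # W_n: thinning, active w.p. 1 - r
        z = float(v_n and w_n)               # Z_n = V_n * W_n
        innov = (z - (1.0 - r) * p_hat) / (n + 1)
        # Newton step on the KL criterion; vanishes as q_hat approaches p_hat:
        newton = ((q_hat - p_hat) * q_hat * (1.0 - q_hat)
                  / (p_hat * (1.0 - q_hat) ** 2 + (1.0 - p_hat) * q_hat ** 2))
        # Tracking term driven by the innovation of (10); vanishes as n grows:
        track = q_hat * (1.0 - q_hat) / (p_hat * (1.0 - p_hat)) * innov
        q_hat = min(1.0 - eps, max(eps, q_hat - newton + track))
        p_hat = min(1.0 - eps, max(eps, p_hat + innov))
    return p_hat, q_hat

p_hat, q_hat = learn(p=0.3, r=0.2, n_steps=50000)
# Both estimates settle near the environment probability p = 0.3.
```

With the fixed seed the run is deterministic; the ε-clamping keeps both estimates away from the degenerate values 0 and 1, mirroring the correction biases of the algorithm.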
Proof Outline

Here, we present an outline of the Theorem's proof. 1) If the observed data are generated by an ergodic process, then the recursive sequential estimate in (10) will converge to the probability s. 2) The pair sequence (q̂_n, p̂_n) is a two-dimensional Markov process. The expected value of the drift q̂_{n+1} - q̂_n, conditioned on (q̂_n, p̂_n), equals zero at q̂_n = s, as deduced from expression (11). 3) In view of the result in 2), it is then shown that the supremum of the conditional expected drift in 2), multiplied by n, converges to negative values for all values of the absolute difference |q̂_n - s| that are larger than some given small positive value. 4) Finally, using Blum's condition [11], the results in 2) and 3) above guarantee almost sure convergence.

We note that, in the Theorem, if the process that generates the observed data in the environment is ergodic, and if s(x_1, ..., x_s | R) denote the prediction mappings induced by the latter process, then, via the learning algorithm and with almost sure convergence, the prediction mappings produced by the predictive mapping layer are governed by the probabilities

q(x_1, ..., x_s | R) = r(R) M^{-1} + [1 - r(R)] s(x_1, ..., x_s | R)

In Kogiantis et al. [34], it was found that the learning algorithm converges rapidly to predictive probability mappings that are close to those induced by the environment, even under mismatched network conditions. Specifically, when past dependence decays fast with distance, then, even when the network order is less than the order of the Markovian environmental model, convergence to almost the true process is attained in fewer than fifty iterations in most cases.

5. Conclusion

We presented a neural network implementation for a digital qualitatively robust predictive mapping of environmental models.
The mapping synergistically uses results from statistical qualitative robustness and stochastic binary neural representation to realize digital real-time predictive operations which identify the environmental model, while they simultaneously protect the operations against data outliers. The supervised learning algorithm recommended for the training of the neural network is based on stochastic approximation principles applied to the Kullback-Leibler matching criterion, in conjunction with Newton's iterative numerical method, and converges almost surely for models generated by ergodic processes. The considered predictive mappings have numerous applications, ranging from data compression, to model identification, to sequential model hypothesis testing.

REFERENCES

[1] F. Abdelhamid, "Transformation of Observations in Stochastic Approximation," Annals of Statistics, Vol. 1, No. 6, 1973, pp. 1158-1174. doi:10.1214/aos/1176342564

[2] V. Fabian, "On Asymptotic Normality in Stochastic Approximation," Annals of Mathematical Statistics, Vol. 39, No. 4, 1968, pp. 1327-1332. doi:10.1214/aoms/1177698258

[3] D. Kazakos and P. Papantoni-Kazakos, "Detection and Estimation," Computer Science Press, New York, 1989.

[4] J. Kiefer and J. Wolfowitz, "Stochastic Estimation of the Maximum of a Regression Function," Annals of Mathematical Statistics, Vol. 23, No. 3, 1952, pp. 462-466. doi:10.1214/aoms/1177729392

[5] H. Kushner, "Asymptotic Global Behavior for Stochastic Approximations and Diffusions with Slowly Decreasing Noise Effects: Global Minimization via Monte Carlo," SIAM Journal of Applied Mathematics, Vol. 47, No. 1, 1987, pp. 169-185. doi:10.1137/0147010

[6] H. Kushner and D. Clark, "Stochastic Approximation Methods for Constrained and Unconstrained Systems," Springer-Verlag, Berlin, 1978. doi:10.1007/978-1-4684-9352-8

[7] L. Ljung, "Analysis of Recursive Stochastic Algorithms," IEEE Transactions on Automatic Control, Vol. 22, No. 4, 1977, pp. 551-575.
doi:10.1109/TAC.1977.1101561

[8] L. Ljung and T. Söderström, "Theory and Practice of Recursive Identification," MIT Press, Cambridge, 1983.

[9] T. Y. Young and R. A. Westerberg, "Stochastic Approximation with a Nonstationary Regression Function," IEEE Transactions on Information Theory, Vol. IT-18, No. 4, 1972, pp. 518-519. doi:10.1109/TIT.1972.1054851

[10] R. Beran, "Adaptive Autoregressive Process," Annals of the Institute of Statistical Mathematics, Vol. 28, No. 1, 1976, pp. 77-89. doi:10.1007/BF02504731

[11] J. R. Blum, "Multidimensional Stochastic Approximation Procedure," Annals of Mathematical Statistics, Vol. 25, No. 4, 1954, pp. 737-744. doi:10.1214/aoms/1177728659

[12] R. A. Fisher, "The Goodness of Fit of Regression Formulae and the Distribution of Regression Coefficients," Journal of the Royal Statistical Society, Vol. 85, No. 4, 1922, pp. 597-612. doi:10.2307/2341124

[13] L. Gerencser, "Parameter Tracking of Time-Varying Continuous-Time Linear Stochastic Systems," In: C. I. Byrnes and A. Lindquist, Eds., Modeling, Identification and Robust Control, North-Holland, Amsterdam, 1986, pp. 581-594.

[14] R. L. Kashyap and C. C. Blaydon, "Recovery of Functions from Noisy Measurements Taken at Randomly Selected Points and Its Application to Pattern Classification," Proceedings of the IEEE, Vol. 54, No. 8, 1966, pp. 1127-1129. doi:10.1109/PROC.1966.5051

[15] R. L. Kashyap, C. Blaydon and K. S. Fu, "Stochastic Approximation," In: J. M. Mendel and K. S. Fu, Eds., Adaptive Learning and Pattern Recognition Systems, Academic Press, New York, 1970, pp. 329-355.
doi:10.1016/S0076-5392(08)60499-3

[16] H. Robbins and S. Monro, "A Stochastic Approximation Method," Annals of Mathematical Statistics, Vol. 22, No. 3, 1951, pp. 400-407. doi:10.1214/aoms/1177729586

[17] A. R. Barron, F. W. van Straten and R. L. Barron, "Adaptive Learning Network Approach to Weather Forecasting: A Summary," Proceedings of the International Conference on Cybernetics and Society, 1977, pp. 724-727.

[18] J. Elman and D. Zipser, "Learning the Hidden Structure of Speech," Journal of the Acoustical Society of America, Vol. 83, No. 4, 1988, pp. 1615-1626. doi:10.1121/1.395916

[19] P. Gorman and T. Sejnowski, "Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets," Neural Networks, Vol. 1, No. 1, 1988, pp. 75-90. doi:10.1016/0893-6080(88)90023-8

[20] M. Minsky and S. Papert, "Perceptrons," MIT Press, Cambridge, 1969.

[21] F. Rosenblatt, "The Perceptron: A Perceiving and Recognizing Automaton," Report 85-460-1, Cornell Aeronautical Laboratory, Buffalo, New York, 1957.

[22] P. Werbos, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," Ph.D. Dissertation, Harvard University, Cambridge, 1974.

[23] H. White, "Some Asymptotic Results for Learning in Single Hidden-Layer Feedforward Network Models," Journal of the American Statistical Association, Vol. 84, No. 408, 1989, pp. 1003-1013. doi:10.1080/01621459.1989.10478865

[24] B. Widrow, "Generalization and Information Storage in Networks of Adaline Neurons," In: M. D. Yovits, G. T. Jacobi and G. D. Goldstein, Eds., Self-Organizing Systems, Spartan Books, Washington DC, 1962, pp. 435-461.

[25] B. Widrow and M. E. Hoff, "Adaptive Switching Circuits," 1960 IRE WESCON Convention Record, 1960, pp. 96-104.

[26] R. E. Blahut, "Hypothesis Testing and Information Theory," IEEE Transactions on Information Theory, Vol. IT-20, No. 4, 1974, pp. 405-417.

[27] D. H. Ackley, G. E. Hinton and T. J.
Sejnowski, "A Learning Algorithm for Boltzmann Machines," Cognitive Science, Vol. 9, No. 1, 1985, pp. 147-169. doi:10.1207/s15516709cog0901_7

[28] S. Amari, K. Kurata and H. Nagaoka, "Information Geometry of Boltzmann Machines," IEEE Transactions on Neural Networks, Vol. 3, No. 2, 1992, pp. 260-271. doi:10.1109/72.125867

[29] D. Pados and P. Papantoni-Kazakos, "New Non-Least-Squares Neural Network Learning Algorithms for Hypothesis Testing," IEEE Transactions on Neural Networks, Vol. 6, No. 3, 1995, pp. 596-609. doi:10.1109/72.377966

[30] D. Pados, K. W. Halford, D. Kazakos and P. Papantoni-Kazakos, "Distributed Binary Hypothesis Testing with Feedback," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 25, No. 1, 1995, pp. 21-42.

[31] D. Pados, P. Papantoni-Kazakos, D. Kazakos and A. Kogiantis, "On-Line Threshold Learning for Neyman-Pearson Distributed Detection," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 24, No. 10, 1994, pp. 1519-1531. doi:10.1109/21.310534

[32] D. Pados and P. Papantoni-Kazakos, "A Note on the Estimation of the Generalization Error and the Prevention of Overfitting," IEEE International Conference on Neural Networks, Orlando, 1994.

[33] D. Pados and P. Papantoni-Kazakos, "A Class of Neyman-Pearson and Bayes Learning Algorithms for Neural Classification," IEEE International Symposium on Information Theory, Trondheim, 27 June-1 July 1994.

[34] A. G. Kogiantis and P. Papantoni-Kazakos, "Operations and Learning in Neural Networks for Robust Prediction," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 27, No. 3, 1997, pp. 402-411.

[35] Y. Liu, Z. Wang and X. Liu, "Robust Stability of Discrete-Time Stochastic Neural Networks with Time-Varying Delays," 4th International Symposium on Neural Networks, Vol. 71, No. 4-6, 2008, pp. 823-833.

[36] Z. Wang, Y. Liu, M. Li and X. Liu, "Stability for Stochastic Cohen-Grossberg Neural Networks with Mixed Time Delays," IEEE Transactions on Neural Networks, Vol. 17, No. 3, 2006, pp. 814-820.
doi:10.1109/TNN.2006.872355

[37] H. Ling, "Stochastic Neural Networks," LAP Lambert Academic Publishing, 2010.

[38] P. Papantoni-Kazakos, D. Kazakos and K. Birmiwal, "Predictive Analog-to-Digital Conversion for Resistance to Data Outliers," Information and Computation, Vol. 98, No. 1, 1992, pp. 56-98. doi:10.1016/0890-5401(92)90042-E
