This paper derives properties of an updated neural network model that is used to identify an unknown nonlinear system via the standard gradient learning algorithm. The convergence of this algorithm for online training of three-layer neural networks in a stochastic environment is studied. A special case is considered in which the unknown nonlinearity can be exactly approximated by some neural network with a nonlinear activation function in its output layer. To analyze the asymptotic behavior of the learning process, the so-called Lyapunov-like approach is used. The expected value of the squared approximation error, which depends on the network parameters, is chosen as the Lyapunov function. Within this approach, sufficient conditions guaranteeing convergence of the learning algorithm with probability 1 are derived. Simulation results are presented to support the theoretical analysis.
The design of mathematical models for technical, economic, social and other systems with uncertainties is an important problem from both theoretical and practical points of view. This problem attracts the close attention of many researchers. Significant progress in this scientific area has been achieved recently. Within this area, new methods and modern intelligent algorithms dealing with uncertain systems have recently been proposed in [
Over the past decades, interest has been increasing in the use of multilayer neural networks, applied among other things as models for the adaptive identification of nonlinearly parameterized dynamic systems [
Different learning methods for updating the weights of neural networks have been reported in the literature. Most of these methods rely on the gradient concept [
The convergence of the online gradient training procedure for input signals of a deterministic (non-stochastic) nature was studied by many authors [
A probabilistic asymptotic analysis of the convergence of online gradient training algorithms has been conducted in [
A popular approach to analyzing the asymptotic behavior of online gradient algorithms in the stochastic case is based on martingale convergence theory [
The difficulty associated with the convergence properties of online gradient learning algorithms is how to guarantee the boundedness of the network weights and biases, assuming the learning process to be theoretically infinite. To overcome this difficulty, a penalty term added to the error function has been introduced in [
This work was motivated by the fact that the standard gradient algorithm is widely used for online updating of neural network weights in accordance with the gradient-descent principle, whereas the following important question about its ultimate properties has so far remained partly open: when does the sequential procedure based on this algorithm converge if the learning rate is constant? As pointed out in [
Novelty of this paper which extends the basic ideas of [
Consider the typical three-layer feedforward neural network containing a hidden layer and p inputs, q hidden neurons, and one output neuron. Denote by
$$W = (w_{ij})_{q \times p} = [w_1, \dots, w_q]^T$$
with
$$w_i = [w_{i1}, \dots, w_{ip}]^T \in \mathbb{R}^p, \quad i = 1, \dots, q,$$
the weight matrix connecting the input and hidden layers, and define the so-called bias vector $w_0$ as
$$w_0 = [w_{01}, \dots, w_{0q}]^T \in \mathbb{R}^q,$$
which is the threshold in the hidden-layer output. Further, let
$$\omega = [\omega_1, \dots, \omega_q]^T \in \mathbb{R}^q$$
be the weight vector between the hidden and output layers, and $\omega_0$ be the bias in the output layer. As in [
Now, denoting by
$$G(z) = [g(z_1), \dots, g(z_q)]^T$$
the vector-valued function which depends on the vector $z = [z_1, \dots, z_q]^T \in \mathbb{R}^q$, introduce the extended matrix $\tilde{W} = [W \,\vdots\, w_0] \in \mathbb{R}^{q \times (p+1)}$ obtained by adding the column $w_0$ to $W$, the extended vector $\tilde{\omega} = [\omega^T, \omega_0]^T \in \mathbb{R}^{q+1}$, and also the function $\tilde{G}(z) = [g(z_1), \dots, g(z_q), 1]^T$ of $z$. Then, for an input vector
$$x = [x_1, \dots, x_p]^T \in \mathbb{R}^p,$$
the output vector of the hidden layer can be written as $\tilde{G}(\tilde{W}\tilde{x})$, where $\tilde{x} = [x^T, 1]^T \in \mathbb{R}^{p+1}$ denotes the extended input vector, and the final output $y_{\mathrm{NN}} \in \mathbb{R}$ of the neural network can be expressed as follows:
$$y_{\mathrm{NN}} = f(\tilde{\omega}^T \tilde{G}(\tilde{W}\tilde{x})). \tag{1}$$
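The forward pass in (1) can be sketched in a few lines of NumPy. This is only an illustration: the logistic function is an assumed choice for both $g$ and $f$, and the names `sigmoid` and `forward` are hypothetical, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation; an assumed choice for both g and f."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(W_ext, omega_ext, x, g=sigmoid, f=sigmoid):
    """Evaluate y_NN = f(omega_ext^T G_ext(W_ext x_ext)) as in (1).

    W_ext     : (q, p+1) array, the extended matrix [W | w_0]
    omega_ext : (q+1,) array, the extended vector [omega, omega_0]
    x         : (p,) input vector
    """
    x_ext = np.append(x, 1.0)        # x_ext = [x^T, 1]^T
    z = W_ext @ x_ext                # hidden pre-activations, shape (q,)
    G_ext = np.append(g(z), 1.0)     # [g(z_1), ..., g(z_q), 1]^T
    return f(omega_ext @ G_ext)      # scalar network output
```

Appending the constant 1 to both the input and the hidden output is what lets the biases $w_0$ and $\omega_0$ be treated as ordinary weights.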
Let
$$y = \varphi(x) \tag{2}$$
with $\varphi: \mathbb{R}^p \to \mathbb{R}$ be an unknown and bounded nonlinearity given over the bounded, either finite or infinite, sets $X \subset \mathbb{R}^p$ which are depicted in
$$e(\tilde{\omega}, \tilde{W}, x, y) = y - f(\tilde{\omega}^T \tilde{G}(\tilde{W}\tilde{x})) \tag{3}$$
depends on $x$ for any fixed $(\tilde{\omega}, \tilde{W})$.
Now, suppose that some complex system to be identified is described at each $n$th time instant by the equation
$$y_n = \varphi(x_n) \quad (n = 0, 1, 2, \dots) \tag{4}$$
in which $x_n \in X$ and $y_n \in \mathbb{R}$ are its input and output signals, respectively, available for measurement.
Based on the infinite sequence of training examples $\{x_n, y_n\}_{n=0}^{\infty}$ generated by (4), the online learning algorithm for updating the weights and biases in (1) is defined as the standard gradient-descent iteration procedure
$$\tilde{\omega}_{n+1} = \tilde{\omega}_n - \eta_n \nabla_{\tilde{\omega}} e^2(\tilde{\omega}_n, \tilde{W}_n; y_n, \tilde{x}_n), \tag{5}$$
$$w_i^{n+1} = w_i^n - \eta_n \nabla_{\tilde{w}_i} e^2(\tilde{\omega}_n, \tilde{W}_n; y_n, \tilde{x}_n), \tag{6}$$
$$i = 1, \dots, q, \quad n = 0, 1, \dots.$$
In these equations, $\nabla_{\tilde{\omega}} e^2(\cdot, \cdot; \cdot, \cdot)$ and $\nabla_{\tilde{w}_i} e^2(\cdot, \cdot; \cdot, \cdot)$ denote the current gradients of the error function $e^2(\tilde{\omega}, \tilde{W}; y, \tilde{x})$ with respect to $\tilde{\omega}$ and $w_i$, respectively, obtained after substituting $\tilde{\omega} = \tilde{\omega}_n$, $\tilde{W} = \tilde{W}_n$, $y = y_n$, and $\tilde{x} = \tilde{x}_n$ into (3), and $\eta_n > 0$ represents the step size (the learning rate). Note that the expressions of $\nabla_{\tilde{\omega}} e^2(\tilde{\omega}_n, \tilde{W}_n; y_n, \tilde{x}_n)$ and $\nabla_{\tilde{w}_i} e^2(\tilde{\omega}_n, \tilde{W}_n; y_n, \tilde{x}_n)$ may be written in detail similar to that in [
Introducing the notation
$$\theta = [\tilde{\omega}^T, w_1^T, \dots, w_q^T, w_0^T]^T$$
for the extended weight and bias vector $\theta \in \mathbb{R}^{q(p+2)+1}$, and considering Equations (5) and (6) in conjunction, rewrite the online gradient learning algorithm for updating $\theta_n$ in a general form (as in [
$$\theta_{n+1} = \theta_n - \eta_n \nabla_{\theta} e^2(\theta_n; y_n, \tilde{x}_n), \tag{7}$$
where $\nabla_{\theta} e^2(\cdot; \cdot, \cdot)$ represents the gradient of $e^2(\theta_n; y_n, \tilde{x}_n)$ with respect to $\theta$ calculated at the $n$th time instant.
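One iteration of (7) can be written out explicitly once the activations are fixed. The sketch below assumes logistic $g$ and $f$, so that the derivative of each activation is its output times one minus its output, and applies the chain rule to $e^2$. The names `squared_error` and `gradient_step` are hypothetical, and the gradient formulas are a standard backpropagation derivation rather than expressions quoted from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error(W_ext, omega_ext, x, y):
    """e^2 for one training pair, with logistic g and f (an assumption)."""
    x_ext = np.append(x, 1.0)
    G_ext = np.append(sigmoid(W_ext @ x_ext), 1.0)
    return (y - sigmoid(omega_ext @ G_ext)) ** 2

def gradient_step(W_ext, omega_ext, x, y, eta):
    """One iteration of (7): theta <- theta - eta * grad e^2."""
    x_ext = np.append(x, 1.0)
    h = sigmoid(W_ext @ x_ext)                 # hidden outputs g(z)
    G_ext = np.append(h, 1.0)
    y_nn = sigmoid(omega_ext @ G_ext)
    e = y - y_nn
    # d(e^2)/ds at the output pre-activation s, using f'(s) = f(s)(1 - f(s)).
    common = -2.0 * e * y_nn * (1.0 - y_nn)
    grad_omega = common * G_ext                # shape (q+1,)
    # Back through the hidden layer: omega_i * g'(z_i) * x_ext_j.
    grad_W = np.outer(common * omega_ext[:-1] * h * (1.0 - h), x_ext)
    return W_ext - eta * grad_W, omega_ext - eta * grad_omega
```

Both gradients are computed from the current parameters before either is updated, which matches the simultaneous update of all components of $\theta_n$ in (7).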
Thus, Equation (7) together with the expression
$$e(\theta_n; y_n, x_n) = y_n - y_{\mathrm{NN}}^{n}$$
in which $y_n$ is given by (4), and
$$y_{\mathrm{NN}}^{n} = \psi(\theta_n, x_n)$$
describe the learning neural network system needed to identify the nonlinearity (2). For a better understanding of the performance of this system, its structure is depicted in
The problem formulated in this paper consists in analyzing the asymptotic properties of the learning neural network system presented above. More precisely, it is required to derive conditions under which the learning procedure is convergent, meaning the existence of a limit
$$\lim_{n \to \infty} \theta_n = \theta_{\infty} \tag{8}$$
in some sense [
Suppose that there is a multilayer neural network described by
$$y_{\mathrm{NN}} \equiv \psi(\theta, x),$$
where $\theta$ is some fixed parameter vector. According to [
$$\max_{x \in X} |\varphi(x) - \psi(\theta, x)| \le \varepsilon$$
evaluating the accuracy of the approximation of $\varphi(x)$ by $\psi(\theta, x)$ can be satisfied for any $\varepsilon > 0$ via a suitable choice of $\theta$ and of the number of neurons in its layers. On the other hand, the performance index of a neural network model with a fixed number of these neurons, which defines its approximation capability, might naturally be expressed as follows:
$$J_0(\theta) = \max_{x \in X} |\varphi(x) - \psi(\theta, x)|. \tag{9}$$
In fact, the desired (optimal) vector $\theta = \theta_0^*$ will then be specified from (9) as the value of $\theta$ minimizing $J_0(\theta)$:
$$\theta_0^* = \arg\min_{\theta} \max_{x \in X} |\varphi(x) - \psi(\theta, x)|. \tag{10}$$
Nevertheless, all researchers who employ online learning procedures in a stochastic environment “silently” replace $J_0(\theta)$ by
$$J(\theta) = E_x\{e^2(\theta; y, \tilde{x})\},$$
where $E_x\{e^2(\theta; y, \tilde{x})\}$ denotes the expected value of $e^2(\theta; y, \tilde{x})$.
Indeed, the learning algorithm (7) does not minimize (9): namely, it minimizes $J(\theta)$ (instead of $J_0(\theta)$) [
$$\theta^* := \arg\min_{\theta} J(\theta),$$
but not $\theta_0^*$ given by (10) as $n \to \infty$.
Now, consider a special case in which the unknown function (2) can be exactly approximated by the neural network $\psi(\theta, \tilde{x})$, implying
$$\varphi(x) \equiv \psi(\theta^*, \tilde{x}) \quad \forall x \in X. \tag{11}$$
In this case called in ( [
If the condition given in identity (11) is satisfied, then the learning rate $\eta_n$ in (7) may be constant:
$$\eta_n \equiv \eta = \mathrm{const} > 0;$$
see ( [
Note that property (11) may hold, in particular, when
$X = \{x^{(1)}, \dots, x^{(K)}\}$ contains a certain number $K = \operatorname{card} X$ of training examples
provided that their number does not exceed the dimension of $\theta$. To understand this fact, according to (11) write the set of $K$ equations
$$\left.\begin{aligned} \psi(\theta, \tilde{x}^{(1)}) &= y^{(1)} \\ &\;\;\vdots \\ \psi(\theta, \tilde{x}^{(K)}) &= y^{(K)} \end{aligned}\right\}$$
with respect to the unknown $\theta$. They are compatible if $K \le q(p+2)+1$. Due to (2) together with the definition of $\theta^*$, it can be concluded that their solution is just $\theta = \theta^*$, yielding $J(\theta^*) = 0$, because in this special case $\psi(\theta^*, \tilde{x}^{(k)}) = y^{(k)}$ for all $k = 1, \dots, K$.
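The bound $q(p+2)+1$ is simply the number of unknowns, which can be counted from the definition of $\theta$ as the stack of $\tilde{\omega}$, the rows $w_1, \dots, w_q$ and the bias vector $w_0$. The worked count below is added as an illustration; the two numerical cases correspond to the network sizes used in the examples of this paper.

```latex
\dim\theta
  = \underbrace{(q+1)}_{\tilde{\omega}}
  + \underbrace{qp}_{w_1,\dots,w_q}
  + \underbrace{q}_{w_0}
  = q(p+2)+1,
\qquad
\begin{cases}
  p=1,\ q=1: & \dim\theta = 4,\\
  p=2,\ q=2: & \dim\theta = 9.
\end{cases}
```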
It turns out that if the activation functions $g$ of the hidden layer are nonlinear, then for an arbitrary fixed vector $\theta'$ there is at least one vector $\theta''$ such that the network outputs for these different vectors are the same, even if the output activation function $f$ is linear, i.e., $f(\zeta) = \zeta$:
$$\psi(\theta', \tilde{x}) \equiv \psi(\theta'', \tilde{x}) \quad \forall x \in X. \tag{12}$$
Property (12) implies that in the presence of a nonlinear $g$ there exist at least two different vectors $\theta^*$. For example, let $p = 1$, $q = 1$ and
$$g(z_1) = \frac{1}{1 + \exp(-z_1)}$$
in which $z_1 = w_{11} x_1 + w_{01}$, and $f(\zeta) = \zeta$ with $\zeta = \omega_1 g(z_1) + \omega_0$. Fix a $\theta' = [w'_{11}, w'_{01}, \omega'_1, \omega'_0]^T$. Then $\theta'' = [-w'_{11}, -w'_{01}, -\omega'_1, \omega'_1 + \omega'_0]^T$ will also satisfy (12); see [
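The example above relies on the identity $g(-z) = 1 - g(z)$ for the logistic function, which gives $-\omega'_1 g(-z_1) + \omega'_1 + \omega'_0 = \omega'_1 g(z_1) + \omega'_0$. A short numerical check is sketched below; the particular values in `theta_a` and the grid `xs` are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import numpy as np

def psi(theta, x):
    """Scalar network with p = q = 1, logistic g and linear f.

    theta = [w_11, w_01, omega_1, omega_0].
    """
    w11, w01, om1, om0 = theta
    return om1 / (1.0 + np.exp(-(w11 * x + w01))) + om0

def mirrored(theta):
    """Map theta' to theta'' = [-w_11, -w_01, -omega_1, omega_1 + omega_0]."""
    w11, w01, om1, om0 = theta
    return np.array([-w11, -w01, -om1, om1 + om0])

theta_a = np.array([0.8, -0.3, 1.5, 0.2])   # arbitrary illustrative theta'
theta_b = mirrored(theta_a)                 # the corresponding theta''
xs = np.linspace(-2.0, 2.0, 401)
# Largest output difference over the grid; zero up to rounding error.
max_gap = np.max(np.abs(psi(theta_a, xs) - psi(theta_b, xs)))
```

The two parameter vectors differ in every component, yet `max_gap` is at the level of floating-point rounding, which is the content of (12) for this small network.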
To study some asymptotic properties of the sequence $\{\theta_n\}$ generated by the learning algorithm (7) in the non-stochastic case, simulation experiments with the scalar nonlinear system (2) having the nonlinearity
$$\varphi(x) = \frac{3.75 + 0.05 \exp(-7.15x)}{1 + 0.19 \exp(-7.15x)}$$
were conducted. This nonlinearity can explicitly be approximated by the two-layer neural network model described by $\psi(\theta^*, \tilde{x})$ as in Subsection 4.1 with $\theta^{*(1)} = [7.15, 1.65, 3.45, 0.3]^T$ and $\theta^{*(2)} = [-7.15, -1.65, -3.45, 3.75]^T$.
The following basic assumption is made concerning $\{x_n\}_{n=0}^{\infty}$, which is a bounded stochastic sequence (since $X$ is bounded):
(A1) the $x_n$ arise randomly in accordance with a probability distribution $P(x)$ if $X$ is finite, and with a probability density $p(x)$ if $X$ is infinite.
Under assumption (A1), the expected value (mean) of $e^2(\theta; y, \tilde{x}) = (y - \psi(\theta, \tilde{x}))^2$ is given by
$$E_x\{e^2(\theta; y, \tilde{x})\} = \begin{cases} \sum_{x \in X} e^2(\theta; y, \tilde{x}) P(x) & \text{if } X \text{ is a finite set}, \\ \int_X e^2(\theta; y, \tilde{x})\, p(x)\, dx & \text{if } X \text{ is an infinite set}. \end{cases}$$
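For a finite set $X$, the expectation above is a weighted sum and can be evaluated directly. The helper below is a minimal sketch; the name `expected_squared_error` and its signature are hypothetical, and `psi` stands for any callable that returns the network output for a given parameter vector and input.

```python
import numpy as np

def expected_squared_error(psi, theta, xs, ys, probs):
    """J(theta) = sum over x in X of e^2(theta; y, x) * P(x), for finite X."""
    probs = np.asarray(probs, dtype=float)
    if not np.isclose(probs.sum(), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Approximation error e = y - psi(theta, x) at each point of X.
    errors = np.array([y - psi(theta, x) for x, y in zip(xs, ys)])
    return float(np.sum(errors ** 2 * probs))
```

For an infinite $X$ the sum would be replaced by an integral against $p(x)$, which in practice has to be approximated, for instance by averaging over sampled inputs.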
To derive the main theoretical result we need Assumption (A1) and the following additional assumptions:
(A2) the identity (11) holds;
(A3) the activation functions used in the hidden neurons and the output neuron are the same ($f(\cdot) = g(\cdot)$), twice continuously differentiable on $\mathbb{R}$, and uniformly bounded on $\mathbb{R}$.
Further, we introduce a scalar function $V(\theta)$ playing the role of the Lyapunov function [
(a) $V(\theta)$ is nonnegative, i.e.,
$$V(\theta) \ge 0; \tag{13}$$
(b) the gradient of $V(\theta)$ is Lipschitz continuous in the sense that
$$\|\nabla V(\theta') - \nabla V(\theta'')\| \le L \|\theta' - \theta''\| \tag{14}$$
for any $\theta', \theta''$ from $\mathbb{R}^{q(p+2)+1}$, where $\nabla V(\theta)$ denotes the gradient of $V$, and $L > 0$ represents the Lipschitz constant.
Now, the global stochastic convergence analysis of the gradient learning algorithm (7) is based on the fundamental convergence conditions established in the following Key Technical Lemma, which is a slightly reformulated Theorem 3 of [
Key Technical Lemma. Let $V(\theta)$ be a function satisfying (13) and (14). Define the scalar variable
$$H(\theta) = \nabla_{\theta} V(\theta)^T \nabla_{\theta} E\{Q(x, \theta)\} \tag{16}$$
with some $Q(x, \theta) \ge 0$, and denote
$$H_n(\theta) := \nabla_{\theta} V(\theta_n)^T \nabla_{\theta} E\{Q(x, \theta_n)\}.$$
Suppose:
1) $H_n(\theta) \ge \Theta_n V(\theta_{n-1})$, $\Theta_n > 0$,
2) $E\{\|\nabla_{\theta} Q(x, \theta_n)\|^2\} \le \tau_n V(\theta_n)$, $\tau_n \ge 0$.
Introduce the additional variable
$$\nu_n = \eta_n (\Theta_n - L \eta_n \tau_n / 2). \tag{17}$$
Then the algorithm (7) yields
$$\lim_{n \to \infty} V_n = 0 \quad \text{a.s.},$$
where $V_n := V(\theta_n)$, provided that $E\{\theta_0\} < \infty$ and
$$0 \le \nu_n \le 1, \tag{18}$$
$$\sum_{n=0}^{\infty} \nu_n = \infty. \tag{19}$$
Related results follow from Theorem 3′ of [
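Corollary. Under the conditions of the Key Technical Lemma, if
$\Theta_n \equiv \Theta = \mathrm{const}$, $\tau_n \equiv \tau = \mathrm{const}$, and $\eta_n \equiv \eta = \mathrm{const}$, then $V_n \to 0$ as $n \to \infty$
with probability 1 provided that
$$0 < \eta \le \frac{2(\Theta - \varepsilon)}{L\tau} \quad (0 < \varepsilon < \Theta) \tag{20}$$
is satisfied. ∎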
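The link between condition (20) and the requirements of the Key Technical Lemma can be made explicit by substituting the constant values into (17). The short derivation below is added as a reading aid and covers only the lower bound in (18) and the divergence condition (19).

```latex
\nu = \eta\Bigl(\Theta - \tfrac{1}{2} L \eta \tau\Bigr),
\qquad
\eta \le \frac{2(\Theta - \varepsilon)}{L\tau}
\;\Longrightarrow\;
\tfrac{1}{2} L \eta \tau \le \Theta - \varepsilon
\;\Longrightarrow\;
\nu \ge \eta\,\varepsilon > 0
\;\Longrightarrow\;
\sum_{n=0}^{\infty} \nu = \infty .
```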
Next, we present the convergence result summarized in the theorem below.
Theorem. Suppose Assumption (A2) holds. Then the gradient algorithm (7) with a constant learning rate, $\eta_n \equiv \eta$, will converge with probability 1 (in the sense that $V_n \to 0$ a.s. as $n \to \infty$) and
$$\lim_{n \to \infty} e(\theta_n; y_n, \tilde{x}_n) = 0 \quad \text{a.s.} \tag{21}$$
for any initial $\theta_0$ chosen randomly so that $E\{Q(x, \theta_0)\} < \infty$, if $\eta$ satisfies the conditions (20) with $\Theta$ and $\tau$ determined by
$$\Theta := \inf_{\theta} \frac{\|\nabla_{\theta} E\{Q(x, \theta)\}\|^2}{E\{Q(x, \theta)\}} > 0, \tag{22}$$
$$\tau := \sup_{\theta} \frac{E\{\|\nabla_{\theta} Q(x, \theta)\|^2\}}{E\{Q(x, \theta)\}} < \infty. \tag{23}$$
Proof. Set $V(\theta) = E\{Q(x, \theta)\}$. Then conditions (13) and (14) can be shown to be valid. This indicates that this function may be taken as the Lyapunov function. By virtue of (16), such a choice of $V(\theta)$ gives $H(\theta) = \|\nabla_{\theta} E\{Q(x, \theta)\}\|^2$. Putting $\Theta_n \equiv \Theta$ and $\tau_n \equiv \tau$ with $\Theta$ and $\tau$ determined by (22) and (23), respectively, we can conclude that conditions 1) and 2) of the Key Technical Lemma are satisfied. Applying its Corollary proves that $\lim_{n \to \infty} V_n = 0$ with probability 1.
Since $V(\theta) = E_x\{e^2(\theta; y, \tilde{x})\}$ and Assumption (A2) holds, result (21) follows. ∎
To demonstrate the theoretical result given in Subsection 4.3, several simulations were conducted. First, we dealt with the same neural network and the same training samples as in ( [
$x^{(1)} = [0, 0]^T$, $y^{(1)} = 1$;
$x^{(2)} = [0, 1]^T$, $y^{(2)} = 0$;
$x^{(3)} = [1, 0]^T$, $y^{(3)} = 0$;
$x^{(4)} = [1, 1]^T$, $y^{(4)} = 1$.
Two numerical examples with different initial $\theta_0$ were considered. In Example 1 we set $w_{11}^0 = 0.95$, $w_{12}^0 = -0.084$, $w_{21}^0 = 0.079$, $w_{22}^0 = -0.079$, $w_{01}^0 = -0.089$, $w_{02}^0 = 0.075$, $\omega_1^0 = 0.357$, $\omega_2^0 = -0.357$, $\omega_0^0 = 0.354$. In Example 2 we set $w_{11}^0 = -0.090$, $w_{12}^0 = 0.225$, $w_{21}^0 = -0.138$, $w_{22}^0 = 0.139$, $w_{01}^0 = 0.222$, $w_{02}^0 = -0.084$, $\omega_1^0 = -0.356$, $\omega_2^0 = 0.357$, $\omega_0^0 = 0.353$.
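The setup of Example 1 can be reproduced along the following lines. This is a sketch under stated assumptions, not the authors' simulation code: the learning rate `eta`, the random seed, the uniform sampling of the four training points and the logistic choice for both activations are all assumptions, since none of them is specified in the text above. Whether and how fast the error decays depends on these choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training set from the text: four points in R^2 with binary targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([1.0, 0.0, 0.0, 1.0])

# Example 1 initial values: W_ext = [W | w_0], omega_ext = [omega, omega_0].
W_ext = np.array([[0.95, -0.084, -0.089],
                  [0.079, -0.079, 0.075]])
omega_ext = np.array([0.357, -0.357, 0.354])

def mean_squared_error(W_ext, omega_ext):
    """Average of e^2 over the four training points (uniform P(x) assumed)."""
    X_ext = np.hstack([X, np.ones((4, 1))])
    H = sigmoid(X_ext @ W_ext.T)
    out = sigmoid(H @ omega_ext[:-1] + omega_ext[-1])
    return float(np.mean((Y - out) ** 2))

rng = np.random.default_rng(0)   # assumed seed
eta = 0.5                        # assumed constant learning rate
history = [mean_squared_error(W_ext, omega_ext)]
for n in range(10000):
    k = rng.integers(4)          # draw x_n uniformly from X (assumption)
    x_ext = np.append(X[k], 1.0)
    h = sigmoid(W_ext @ x_ext)
    G_ext = np.append(h, 1.0)
    y_nn = sigmoid(omega_ext @ G_ext)
    common = -2.0 * (Y[k] - y_nn) * y_nn * (1.0 - y_nn)
    grad_W = np.outer(common * omega_ext[:-1] * h * (1.0 - h), x_ext)
    grad_omega = common * G_ext
    W_ext = W_ext - eta * grad_W           # update (7) without a penalty term
    omega_ext = omega_ext - eta * grad_omega
    if (n + 1) % 1000 == 0:
        history.append(mean_squared_error(W_ext, omega_ext))
```

The list `history` records the averaged squared error every 1000 steps, so it can be inspected or plotted to see how the error evolves for a given choice of `eta` and seed.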
Contrary to [
Results of two simulation experiments whose durations were 10000 iteration steps are presented in
Further, other simulation experiments were also conducted. In contrast with the previous experiments, they dealt with an infinite training set $X$. Namely, two simulations with the same nonlinear function as in Subsection 4.2 were first conducted, with $X$ being the infinite bounded set $X = [-2, 2]$. However, $\{x_n\}$ was now chosen as a stochastic sequence: it was generated as a pseudorandom i.i.d. sequence.
Two numerical examples were considered. In Example 3, the initial values of the neural network weights and biases were taken as follows: $w_1^0 = 0.529$, $w_2^0 = -0.5012$, $\omega_1^0 = -0.9168$, $\omega_2^0 = 1.0409$. In Example 4 we set $w_1^0 = -0.3756$, $w_2^0 = -0.572$, $\omega_1^0 = -0.9798$, $\omega_2^0 = 1.1436$.
Next, another nonlinearity
$$\varphi(x) = \frac{1}{1 + \exp[-a_1(x) - a_2(x) - 1]}$$
with $a_1(x) = [1 + \exp(-10x - 5)]^{-1}$ and $a_2(x) = [1 + \exp(-10x + 5)]^{-1}$, to be exactly approximated by a suitable neural network, was chosen as in [11, p. 12-4]. The following initial estimates were taken: $w_1^0 = 2.8$, $w_2^0 = -5.6$, $w_3^0 = -2.8$, $w_4^0 = -5.6$, $\omega_1^0 = 5.33$, $\omega_2^0 = 1.71$, $\omega_3^0 = -3.52$ (Example 5), and $w_1^0 = 0.27$, $w_2^0 = 0.19$, $w_3^0 = -3.09$, $w_4^0 = 3.96$, $\omega_1^0 = 1.64$, $\omega_2^0 = 0.72$, $\omega_3^0 = -2.21$ (Example 6).
Results of the two simulation experiments conducted with the initial estimates θ 0 given above are depicted in
From Figures 4-9 we can see that the learning processes converge and the performance index $J(\theta_n)$ tends to zero even though the penalty term is absent. It can be observed that if the initial vectors $\theta_0$ are different, then the sequences $\{\theta_n\}$ may converge to different final vectors $\theta_{\infty}$.
The simulation experiments show that the penalty term is not necessary, in principle, to achieve convergence of the online gradient learning procedure in three-layer neural networks if the conditions given by Assumptions (A1)-(A3) are satisfied. This fact supports our theoretical results.
In this paper, some important features of multilayer neural networks used as nonlinearly parameterized models of unknown nonlinear systems to be identified have been derived. A special case where the nonlinearity can be exactly approximated by a three-layer neural network has been studied. In contrast to the authors' previous papers, we dealt with a neural network having a nonlinear activation function in its output layer. It was shown that if the activation function of the hidden layer is nonlinear, then, for any input variables, there are at least two different network parameter vectors under which the network outputs will be the same, even if the output activation function is linear. This feature implies that the standard gradient online training algorithm with a constant learning rate may not converge if the training sequence is non-stochastic. Nevertheless, provided that this sequence is stochastic, it has been established theoretically that, under certain conditions, such an algorithm will converge with probability one. However, the ultimate values of the network parameters may be different. These facts were confirmed by simulation experiments.
The authors are grateful to the anonymous reviewer for valuable comments.
Zhiteckii, L.S., Azarskov, V.N., Nikolaienko, S.A. and Solovchuk, K.Yu. (2018) Some Features of Neural Networks as Nonlinearly Parameterized Models of Unknown Systems Using an Online Learning Algorithm. Journal of Applied Mathematics and Physics, 6, 247-263. https://doi.org/10.4236/jamp.2018.61024