Advances in Pure Mathematics
Vol. 08, No. 09 (2018), Article ID: 87483, 10 pages
10.4236/apm.2018.89048
Least Squares Method from the View Point of Deep Learning II: Generalization
Kazuyuki Fujii1,2
1International College of Arts and Sciences, Yokohama City University, Yokohama, Japan
2Department of Mathematical Sciences, Shibaura Institute of Technology, Saitama, Japan
Copyright © 2018 by author and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Received: August 30, 2018; Accepted: September 22, 2018; Published: September 25, 2018
ABSTRACT
The least squares method is one of the most fundamental methods in Statistics to estimate correlations among various data. On the other hand, Deep Learning is the heart of Artificial Intelligence and it is a learning method based on the least squares method, in which a parameter called learning rate plays an important role. It is in general very hard to determine its value. In this paper we generalize the preceding paper [K. Fujii: Least squares method from the view point of Deep Learning: Advances in Pure Mathematics, 8, 485-493, 2018] and give an admissible value of the learning rate, which is easily obtained.
Keywords:
Least Squares Method, Statistics, Deep Learning, Learning Rate, Gerschgorin’s Theorem
1. Introduction
This paper is a sequel to the preceding paper [1].
The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in huge and complicated data, we usually employ the method. See for example [2].
On the other hand, Deep Learning is the heart of Artificial Intelligence and will become one of the most important fields in Data Science in the near future. As to Deep Learning, see for example [3] - [10].
Deep Learning may be stated as a successive learning method based on the least squares method. Therefore, to reconsider the least squares method from the view point of Deep Learning is natural and instructive. We carry out thoroughly the calculation of the successive approximation called the gradient descent sequence, in which a parameter called the learning rate plays an important role.
One of the main points is to determine the range of the learning rate, which is a very hard problem [8]. We showed in [1] that a difference in methods between Statistics and Deep Learning leads to different results when the learning rate changes.
We generalize the preceding results to the case of the least squares method by polynomial approximation. Our results may give a new insight into both Statistics and Data Science.
2. Least Squares Method
Let us explain the least squares method by polynomial approximation [9]. The model function is a polynomial in $x$ of degree $M$ given by

$$f(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M. \tag{1}$$
For $N$ pieces of two dimensional real data

$$(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_N, y_N)$$

we assume that their scatter plot is given like Figure 1.
The coefficients of (1)

$$w = (w_0, w_1, \ldots, w_M)^T \tag{2}$$

must be determined by the data set ($T$ denotes the transposition of a vector or a matrix).
For this set of data the error function is given by

$$E = E(w) = \sum_{j=1}^{N} \left\{ y_j - \left(w_0 + w_1 x_j + w_2 x_j^2 + \cdots + w_M x_j^M\right) \right\}^2. \tag{3}$$
Figure 1. Scatter plot.
The aim of the least squares method is to minimize the error function (3) with respect to $w$ in (2). Usually this is done by solving the simultaneous equations

$$\frac{\partial E}{\partial w_i} = 0 \qquad (i = 0, 1, \ldots, M).$$

However, in this paper another approach based on a quadratic form is given, which is instructive.
Let us calculate the error function (3). By using the definition of the inner product

$$(a, b) = a^T b = \sum_{j=1}^{N} a_j b_j \qquad (a, b \in \mathbf{R}^N),$$

it is not difficult to see

$$E(w) = (y - Xw,\ y - Xw) = w^T X^T X\, w - 2\,(X^T y)^T w + y^T y, \tag{4}$$

where

$$y = (y_1, y_2, \ldots, y_N)^T$$

and

$$X =
\begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^M \\
1 & x_2 & x_2^2 & \cdots & x_2^M \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^M
\end{pmatrix}
\qquad (N \times (M+1)\ \text{matrix}).$$
Here we make an important

Assumption $N \ge M + 1$ and $\operatorname{rank} X = M + 1$ (full rank).
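For readers who wish to experiment, the following NumPy sketch builds the design matrix $X$ and evaluates the error function (4). The data values and the degree $M$ are invented purely for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical data and degree (illustrative values only, not from the paper)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.1, 1.9, 3.2, 4.8, 7.1, 9.8, 13.2])
M = 2                                     # degree of the model polynomial (1)
N = len(x)

# Design matrix X: row j is (1, x_j, x_j^2, ..., x_j^M), so X is N x (M+1)
X = np.vander(x, M + 1, increasing=True)

def error(w):
    """Error function (4): E(w) = (y - Xw, y - Xw)."""
    r = y - X @ w
    return r @ r

# The full-rank assumption can be checked directly
assert np.linalg.matrix_rank(X) == M + 1
print(error(np.zeros(M + 1)))             # E at w = 0 is simply (y, y)
```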
Let us deform (4). From

$$E(w) = w^T (X^T X)\, w - 2\,(X^T y)^T w + y^T y$$

we set for simplicity

$$A = X^T X, \qquad b = X^T y, \qquad c = y^T y.$$

Namely, we have a general quadratic form

$$E(w) = w^T A\, w - 2\, b^T w + c. \tag{5}$$
On the other hand, the deformation of (5) is well-known.
Formula For a symmetric and invertible matrix $A$ ($A^T = A$, $\det A \ne 0$) we have

$$w^T A\, w - 2\, b^T w + c = \left(w - A^{-1} b\right)^T A \left(w - A^{-1} b\right) + c - b^T A^{-1} b. \tag{6}$$

The proof is easy. Since $A^T = A$ (and hence $(A^{-1})^T = A^{-1}$) we obtain

$$\left(w - A^{-1} b\right)^T A \left(w - A^{-1} b\right) = w^T A\, w - 2\, b^T w + b^T A^{-1} b,$$

and this gives (6).
Therefore, our case becomes

$$E(w) = \left(w - (X^T X)^{-1} X^T y\right)^T (X^T X) \left(w - (X^T X)^{-1} X^T y\right) + y^T \left(E_N - X (X^T X)^{-1} X^T\right) y \tag{7}$$

because $X^T X$ is symmetric and invertible by the assumption.
If we choose

$$w = (X^T X)^{-1} X^T y, \tag{8}$$

then the minimum is given by

$$E_{\min} = y^T \left(E_N - X (X^T X)^{-1} X^T\right) y, \tag{9}$$

where $E_N$ is the $N$-dimensional identity matrix.
Our method is simple and clear (“smart” in our terminology).
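Continuing the illustrative sketch above, the closed-form solution (8) and the minimum (9) take a few lines of NumPy; the comparison with `np.linalg.lstsq` is only a sanity check and is not part of the derivation.

```python
# Closed-form least squares solution (8) and minimum value (9),
# reusing X, y, N from the previous sketch (illustrative data).
A = X.T @ X                               # (M+1) x (M+1), invertible by the assumption
w_hat = np.linalg.solve(A, X.T @ y)       # w = (X^T X)^{-1} X^T y

P = X @ np.linalg.inv(A) @ X.T            # projection onto the column space of X
E_min = y @ (np.eye(N) - P) @ y           # minimum value (9)

# Cross-checks against the error function and NumPy's own least squares routine
assert np.isclose(E_min, (y - X @ w_hat) @ (y - X @ w_hat))
print(w_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```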
3. Least Squares Method from Deep Learning
In this section we reconsider the least squares method in Section 2 from the view point of Deep Learning.
First we arrange the data in Section 2 like

$$\text{input data}:\ \{x_1, x_2, \ldots, x_N\}, \qquad \text{teacher signals}:\ \{y_1, y_2, \ldots, y_N\},$$

and consider a simple neuron model in [11] (see Figure 2).
Here we use the polynomial (1) instead of the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$.
In this case the square error function becomes

$$E = E(w) = \frac{1}{2} \sum_{j=1}^{N} \left\{ y_j - \left(w_0 + w_1 x_j + w_2 x_j^2 + \cdots + w_M x_j^M\right) \right\}^2. \tag{10}$$
Figure 2. Simple neuron model.
In general we use $\frac{1}{2}E$ instead of $E$ in (3).
Our aim is also to determine the parameters $w = (w_0, w_1, \ldots, w_M)^T$ in order to minimize $E(w)$. However, the procedure is different from the least squares method in Section 2. This is an important and interesting point.
The parameters are determined successively by the gradient descent method (see for example [12]): for $t = 0, 1, 2, \ldots$ we set

$$w(t) = \left(w_0(t), w_1(t), \ldots, w_M(t)\right)^T$$

and

$$w(t+1) = w(t) - \epsilon\, \frac{\partial E}{\partial w}\bigl(w(t)\bigr), \tag{11}$$

where

$$\frac{\partial E}{\partial w} = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_M}\right)^T \tag{12}$$

and $\epsilon\ (>0)$ is a small parameter called the learning rate.

The initial value $w(0)$ is given appropriately. Pay attention that $t$ is discrete time and $T$ is the transposition.
Let us calculate (11) explicitly. Since, from (12),

$$\frac{\partial E}{\partial w} = -X^T (y - X w) = X^T X\, w - X^T y,$$

the update (11) becomes

$$w(t+1) = w(t) - \epsilon \left(X^T X\, w(t) - X^T y\right) = \left(E_{M+1} - \epsilon X^T X\right) w(t) + \epsilon X^T y. \tag{13}$$
This equation is easily solved to be

$$w(t) = \left(E_{M+1} - \epsilon X^T X\right)^t w(0) + \left\{E_{M+1} - \left(E_{M+1} - \epsilon X^T X\right)^t\right\} (X^T X)^{-1} X^T y \tag{14}$$

for $t \ge 1$.
The proof is left to readers.
Since this is not a final form let us continue the calculation. From (14) we have

$$\lim_{t \to \infty} w(t) = (X^T X)^{-1} X^T y \tag{15}$$

if

$$\lim_{t \to \infty} \left(E_{M+1} - \epsilon X^T X\right)^t = O, \tag{16}$$

where $O$ is the $(M+1)$-dimensional zero matrix. (15) is just the equation (8) and it is independent of $w(0)$.
Let us evaluate (14) further. The matrix $X^T X$ is positive definite, so all its eigenvalues are positive. This can be shown as follows. Let us consider the eigenvalue equation

$$X^T X\, v = \lambda v \qquad (v \ne 0).$$

Then we have

$$\lambda\, (v, v) = (v, X^T X\, v) = (X v, X v) > 0 \ \Longrightarrow\ \lambda = \frac{(Xv, Xv)}{(v, v)} > 0,$$

because $Xv \ne 0$ by the full-rank assumption. Therefore we can arrange all the eigenvalues like

$$0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_{M+1}.$$
Since $X^T X$ is symmetric, it is diagonalized as

$$X^T X = Q D Q^T, \tag{17}$$

where $Q$ is an element in $O(M+1)$ ($Q^T Q = Q Q^T = E_{M+1}$) and $D$ is a diagonal matrix

$$D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{M+1}).$$

See for example [13].
By substituting (17) into (14) and using the equation

$$\left(E_{M+1} - \epsilon X^T X\right)^t = Q \left(E_{M+1} - \epsilon D\right)^t Q^T,$$

we finally obtain

Theorem I A general solution to (14) is

$$w(t) = Q \left(E_{M+1} - \epsilon D\right)^t Q^T w(0) + Q \left\{E_{M+1} - \left(E_{M+1} - \epsilon D\right)^t\right\} D^{-1} Q^T X^T y. \tag{18}$$
This is our main result.
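As a numerical check of Theorem I (again with the invented data of the earlier sketches), one can iterate the recursion (13) directly and compare the result with the closed form (18). The number of steps and the particular learning rate below are arbitrary choices satisfying (20).

```python
# Iterate (13) and compare with the closed form (18); A, X, y, M are from the earlier sketches.
lam, Q = np.linalg.eigh(A)                # eigenvalues and eigenvectors of X^T X (A = Q D Q^T)
eps = 1.0 / lam.max()                     # one learning rate satisfying (20)
t_steps = 200
w0 = np.zeros(M + 1)                      # arbitrary initial value w(0)

# w(t+1) = (E - eps A) w(t) + eps X^T y
w_iter = w0.copy()
for _ in range(t_steps):
    w_iter = w_iter - eps * (A @ w_iter - X.T @ y)

# Closed form (18): w(t) = Q (E - eps D)^t Q^T w(0) + Q {E - (E - eps D)^t} D^{-1} Q^T X^T y
decay = (1.0 - eps * lam) ** t_steps      # diagonal of (E - eps D)^t
w_closed = Q @ (decay * (Q.T @ w0)) + Q @ ((1.0 - decay) / lam * (Q.T @ (X.T @ y)))

print(np.allclose(w_iter, w_closed))      # True
```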
Next, let us show how to choose the learning rate $\epsilon$, which is a very important problem in Deep Learning [7] [8].
Let us remember

$$\left(E_{M+1} - \epsilon D\right)^t = \operatorname{diag}\left((1 - \epsilon\lambda_1)^t, (1 - \epsilon\lambda_2)^t, \ldots, (1 - \epsilon\lambda_{M+1})^t\right).$$

From (16) and (18) the equations

$$\lim_{t \to \infty} (1 - \epsilon\lambda_j)^t = 0 \qquad (j = 1, 2, \ldots, M+1) \tag{19}$$

determine the range of $\epsilon$. Noting

$$|1 - \epsilon\lambda_j| < 1 \ \Longleftrightarrow\ 0 < \epsilon < \frac{2}{\lambda_j}$$

and

$$0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_{M+1} \ \Longrightarrow\ \frac{2}{\lambda_{M+1}} \le \cdots \le \frac{2}{\lambda_2} \le \frac{2}{\lambda_1},$$

we obtain

Theorem II The learning rate $\epsilon$ must satisfy the inequality

$$0 < \epsilon < \frac{2}{\lambda_{M+1}}. \tag{20}$$
The greater the value of $\epsilon$, the faster the gradient descent (11) proceeds, so long as the convergence (19) is guaranteed. Let us note that the choice of the initial value $w(0)$ is irrelevant when the convergence condition (20) is satisfied.
Comment For example, if we choose $\epsilon$ like

$$\epsilon \ge \frac{2}{\lambda_{M+1}},$$

then we cannot recover (15), which shows a difference in methods between Statistics and Deep Learning.
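The boundary in Theorem II can be made visible by a small experiment on the same invented data (reusing `A`, `X`, `y`, `M` and the least squares solution `w_hat` from the earlier sketches): a learning rate below $2/\lambda_{M+1}$ converges to (8), while one above it blows up. The factors 0.9 and 1.1 and the step counts are arbitrary.

```python
# Effect of the learning rate: compare eps below and above the bound 2/lambda_max of (20).
lam_max = np.linalg.eigvalsh(A).max()

def run_gd(eps, steps):
    """Plain gradient descent (11) on the error function, starting from w = 0."""
    w = np.zeros(M + 1)
    for _ in range(steps):
        w = w - eps * (A @ w - X.T @ y)
    return w

w_good = run_gd(0.9 * 2.0 / lam_max, steps=20000)   # inside the range (20): converges to (8)
w_bad = run_gd(1.1 * 2.0 / lam_max, steps=200)      # outside the range: the iterates blow up

print(np.allclose(w_good, w_hat))         # True
print(np.linalg.norm(w_bad - w_hat))      # huge
```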
4. How to Estimate the Learning Rate
How do we calculate $\lambda_{M+1}$? Since $\lambda_1, \lambda_2, \ldots, \lambda_{M+1}$ are the eigenvalues of the matrix $X^T X$, they satisfy the equation

$$f(\lambda) = 0,$$

where $f(\lambda)$ is the characteristic polynomial of $X^T X$ given by

$$f(\lambda) = \det\left(\lambda E_{M+1} - X^T X\right). \tag{21}$$
This is abstract, so let us deform (21). For simplicity we write $X$ as

$$X = (a_0, a_1, \ldots, a_M), \qquad a_k = \left(x_1^k, x_2^k, \ldots, x_N^k\right)^T \quad (k = 0, 1, \ldots, M). \tag{22}$$

Then it is easy to see

$$X^T X = \bigl((a_i, a_j)\bigr)_{0 \le i, j \le M},$$

where the notation $(a, b) = a^T b$ is the (real) inner product of vectors.
For clarity let us write down (21) explicitly:

$$f(\lambda) = \det
\begin{pmatrix}
\lambda - (a_0, a_0) & -(a_0, a_1) & \cdots & -(a_0, a_M) \\
-(a_1, a_0) & \lambda - (a_1, a_1) & \cdots & -(a_1, a_M) \\
\vdots & \vdots & \ddots & \vdots \\
-(a_M, a_0) & -(a_M, a_1) & \cdots & \lambda - (a_M, a_M)
\end{pmatrix}.$$
As far as we know there is no viable method to determine the greatest root $\lambda_{M+1}$ of $f(\lambda) = 0$ when $M$ is very large¹. Therefore, let us content ourselves with an approximate value $\tilde{\lambda}$ which is both greater than or equal to $\lambda_{M+1}$ and easy to calculate.
For this purpose Gerschgorin's theorem is very useful². Let $A = (a_{ij})$ be an $n \times n$ complex (real in our case) matrix, and set

$$R_i = \sum_{j \ne i} |a_{ij}| \tag{23}$$

and

$$D(a_{ii}, R_i) = \{\, z \in \mathbf{C} : |z - a_{ii}| \le R_i \,\} \tag{24}$$

for each $i$. This is a closed disc centered at $a_{ii}$ with radius $R_i$, called the Gerschgorin disc.
Theorem (Gerschgorin [14]) For any eigenvalue $\lambda$ of $A$ we have

$$\lambda \in \bigcup_{i=1}^{n} D(a_{ii}, R_i). \tag{25}$$
The proof is simple. See for example [7].
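A quick numerical illustration of the theorem, applied to the matrix $X^T X$ of the running sketch (the same check works for any square matrix):

```python
# Gerschgorin check for A = X^T X: every eigenvalue lies in some disc D(a_ii, R_i) of (24).
centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)   # R_i of (23)

for lam_k in np.linalg.eigvalsh(A):
    assert np.any(np.abs(lam_k - centers) <= radii + 1e-12)
print("all eigenvalues lie in the union of the Gerschgorin discs")
```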
Our case is the real symmetric matrix $X^T X$ with $n = M + 1$, whose entries are the inner products $(a_i, a_j)$ for $i, j = 0, 1, \ldots, M$, so that

$$R_i = \sum_{j \ne i} |(a_i, a_j)|.$$

Therefore, all the eigenvalues satisfy

$$\lambda_k \in \bigcup_{i=0}^{M} I_i \qquad (k = 1, 2, \ldots, M+1), \tag{26}$$

where $I_i$ is a closed interval (all the eigenvalues are real) given by

$$I_i = \left[(a_i, a_i) - R_i,\ (a_i, a_i) + R_i\right].$$
If we define

$$\tilde{\lambda} = \max_{0 \le i \le M} \left\{ (a_i, a_i) + R_i \right\} = \max_{0 \le i \le M} \sum_{j=0}^{M} |(a_i, a_j)|, \tag{27}$$

then it is easy to see

$$\lambda_{M+1} \le \tilde{\lambda}$$

from (26).
Thus we arrive at an admissible value of the learning rate which is easily obtained.
Theorem III An admissible value of $\epsilon$ is

$$\epsilon = \frac{1}{\tilde{\lambda}} = \frac{1}{\displaystyle\max_{0 \le i \le M} \sum_{j=0}^{M} |(a_i, a_j)|}. \tag{28}$$
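In code, $\tilde{\lambda}$ of (27) is simply the maximum absolute row sum of $X^T X$, so an admissible learning rate is obtained in one line. Continuing the running sketch; the choice $\epsilon = 1/\tilde{\lambda}$ below is one convenient value, and any $\epsilon < 2/\tilde{\lambda}$ also satisfies (20).

```python
# Admissible learning rate from the Gerschgorin bound (27), for A = X^T X of the running sketch.
lambda_tilde = np.max(np.sum(np.abs(A), axis=1))      # max_i sum_j |(a_i, a_j)|
lambda_max = np.linalg.eigvalsh(A).max()

assert lambda_max <= lambda_tilde                     # the bound behind Theorem III
eps_admissible = 1.0 / lambda_tilde                   # one admissible value; any eps < 2 / lambda_tilde works
print(eps_admissible, 2.0 / lambda_max)               # compare with the exact bound of (20)
```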
Let us show an example in the case of $M = 1$ ([1]), which is very instructive for non-experts.
Example In this case it is easy to see

$$X = (a_0, a_1), \qquad a_0 = (1, 1, \ldots, 1)^T, \qquad a_1 = (x_1, x_2, \ldots, x_N)^T,$$

and we set

$$\alpha = (a_0, a_0) = N, \qquad \beta = (a_0, a_1) = \sum_{j=1}^{N} x_j, \qquad \gamma = (a_1, a_1) = \sum_{j=1}^{N} x_j^2$$

for simplicity. Moreover, we may assume $\beta > 0$. Then from (21) we have

$$f(\lambda) = \lambda^2 - (\alpha + \gamma)\lambda + \alpha\gamma - \beta^2$$

and

$$\lambda_2 = \frac{(\alpha + \gamma) + \sqrt{(\alpha - \gamma)^2 + 4\beta^2}}{2}.$$

On the other hand, from (27) we have

$$\tilde{\lambda} = \max\{\alpha + \beta,\ \beta + \gamma\} = \max\{\alpha, \gamma\} + \beta$$

because $\beta > 0$.

Then it is easy to show

$$\lambda_2 \le \tilde{\lambda}.$$

To check this inequality is left to readers. Therefore, from (28) the admissible value becomes

$$\epsilon = \frac{1}{\max\{\alpha, \gamma\} + \beta} = \frac{1}{\max\left\{N,\ \sum_{j=1}^{N} x_j^2\right\} + \sum_{j=1}^{N} x_j}.$$
We emphasize once more that $\tilde{\lambda}$ is easy to evaluate, while $\lambda_{M+1}$ is very hard to calculate when $M$ is large.
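A short check of the $M = 1$ case with hypothetical one-dimensional data (values invented for illustration, chosen so that $\beta = \sum_j x_j > 0$) confirms $\lambda_2 \le \tilde{\lambda}$:

```python
# M = 1 check of the Example: exact largest eigenvalue versus the Gerschgorin bound (27).
x1 = np.array([0.3, 0.9, 1.4, 2.2, 2.8])              # hypothetical data with positive sum
alpha, beta, gamma = len(x1), x1.sum(), (x1 ** 2).sum()

lam2 = ((alpha + gamma) + np.sqrt((alpha - gamma) ** 2 + 4 * beta ** 2)) / 2  # greatest root of (21)
lam_tilde = max(alpha + beta, beta + gamma)           # (27), using beta > 0

print(lam2 <= lam_tilde, 1.0 / lam_tilde)             # True, and one admissible eps
```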
5. Concluding Remarks
In this paper we have discussed the least squares method by polynomial approximation from the view point of Deep Learning and carried out the calculation of the gradient descent thoroughly. A difference in methods between Statistics and Deep Learning leads to different results when the learning rate changes. As far as we know, Theorem III is the first result to provide an admissible value of $\epsilon$.
Deep Learning plays an essential role in Data Science and perhaps in almost all fields of Science. Therefore it is desirable for undergraduates to master it at an early stage. To do so they must study Calculus, Linear Algebra and Statistics. My textbook [7] is recommended.
Acknowledgements
I wish to thank Ryu Sasaki for useful suggestions and comments.
Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.
Cite this paper
Fujii, K. (2018) Least Squares Method from the View Point of Deep Learning II: Generalization. Advances in Pure Mathematics, 8, 782-791. https://doi.org/10.4236/apm.2018.89048
References
- 1. Fujii, K. (2018) Least Squares Method from the View Point of Deep Learning. Advances in Pure Mathematics, 8, 485-493.
- 2. Wikipedia: Least Squares. https://en.m.wikipedia.org/wiki/Least_Squares
- 3. Wikipedia: Deep Learning. https://en.m.wikipedia.org/wiki/Deep_Learning
- 4. Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. The MIT Press, Cambridge.
- 5. Patterson, J. and Gibson, A. (2017) Deep Learning: A Practitioner’s Approach, O’Reilly Media, Inc., Sebastopol.
- 6. Alpaydin, J. (2014) Introduction to Machine Learning. 3rd Edition, The MIT Press, Cambridge.
- 7. Fujii, K. (2018) Introduction to Mathematics for Understanding Deep Learning. Scientific Research Publishing Inc., Wuhan.
- 8. Okaya, T. (2015) Deep Learning (In Japanese). Kodansha Ltd., Tokyo.
- 9. Nakai, E. (2015) Introduction to Theory of Machine Learning (In Japanese). Gijutsu-Hyouronn Co., Ltd., Tokyo.
- 10. Amari, S. (2016) Brain Heart Artificial Intelligence (In Japanese). Kodansha Ltd., Tokyo.
- 11. Fujii, K. (2018) Mathematical Reinforcement to the Minibatch of Deep Learning. Advances in Pure Mathematics, 8, 307-320. https://doi.org/10.4236/apm.2018.83016
- 12. Wikipedia: Gradient Descent. https://en.m.wikipedia.org/wiki/Gradient_descent
- 13. Kasahara, K. (2000) Linear Algebra (In Japanese). Saiensu Ltd., Tokyo.
- 14. Gerschgorin, S. (1931) Über die Abgrenzung der Eigenwerte einer Matrix. Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk, 6, 749-754.
NOTES
¹$X^T X$ is not a sparse matrix.
²In my opinion this theorem is not so popular. Why?