Advances in Pure Mathematics
Vol.08 No.05(2018), Article ID:84867,9 pages
10.4236/apm.2018.85027
Least Squares Method from the View Point of Deep Learning
Kazuyuki Fujii1,2
1International College of Arts and Sciences, Yokohama City University, Yokohama, Japan,
2Department of Mathematical Sciences, Shibaura Institute of Technology, Saitama, Japan
Copyright © 2018 by author and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Received: April 10, 2018; Accepted: May 26, 2018; Published: May 29, 2018
ABSTRACT
The least squares method is one of the most fundamental methods in Statistics for estimating correlations among various data. On the other hand, Deep Learning is the heart of Artificial Intelligence, and it is a learning method based on the least squares method. In this paper we reconsider the least squares method from the viewpoint of Deep Learning, and we carry out the computation of the gradient descent sequence thoroughly in a very simple setting. Depending on the value of the learning rate, an essential parameter of Deep Learning, the least squares methods of Statistics and of Deep Learning reveal an interesting difference.
Keywords:
Least Squares Method, Statistics, Deep Learning, Learning Rate, Linear Algebra
1. Introduction
The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in huge and complicated data, we usually employ this method. See for example [1].
On the other hand, Deep Learning is the heart of Artificial Intelligence and will become one of the most important fields of Data Science in the near future. As to Deep Learning, see for example [2] [3] [4] [5] [6].
Deep Learning may be stated as a successive learning method based on the least squares method. Therefore, to reconsider the least squares method from the viewpoint of Deep Learning is very natural, and we carry out the calculation of the successive approximation, called the gradient descent sequence, thoroughly.
When the learning rate changes, the difference in method between Statistics and Deep Learning gives different results.
Theorems I and II in the text are our main results, and a related problem (exercise) is presented for readers. Our results may give a new insight into both Statistics and Data Science.
First of all let us explain the least squares method for readers in a very simple setting. For $n$ pieces of two dimensional real data
$$\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}$$
we assume that their scatter plot is like Figure 1.
Then a model function is linear:
$$y=ax+b. \tag{1}$$
For this function the error (or loss) function is defined by
$$E(a,b)=\sum_{j=1}^{n}\left\{y_j-(ax_j+b)\right\}^2. \tag{2}$$
The aim of the least squares method is to minimize the error function (2) with respect to $a$ and $b$. A little calculation gives
$$E(a,b)=a^2\sum x_j^2+2ab\sum x_j+nb^2-2a\sum x_jy_j-2b\sum y_j+\sum y_j^2,$$
where $\sum$ is an abbreviation of $\sum_{j=1}^{n}$.
Then the equations for the stationarity,
$$\frac{\partial E}{\partial a}=0,\qquad \frac{\partial E}{\partial b}=0, \tag{3}$$
give a linear equation for $a$ and $b$:
$$\begin{pmatrix}\sum x_j^2 & \sum x_j\\ \sum x_j & n\end{pmatrix}\begin{pmatrix}a\\ b\end{pmatrix}=\begin{pmatrix}\sum x_jy_j\\ \sum y_j\end{pmatrix}, \tag{4}$$
Figure 1. Scatter plot 1.
and its solution is given by
$$\begin{pmatrix}a\\ b\end{pmatrix}=\begin{pmatrix}\sum x_j^2 & \sum x_j\\ \sum x_j & n\end{pmatrix}^{-1}\begin{pmatrix}\sum x_jy_j\\ \sum y_j\end{pmatrix}. \tag{5}$$
Explicitly, we have
$$a=\frac{n\sum x_jy_j-\sum x_j\sum y_j}{n\sum x_j^2-\left(\sum x_j\right)^2},\qquad b=\frac{\sum x_j^2\sum y_j-\sum x_j\sum x_jy_j}{n\sum x_j^2-\left(\sum x_j\right)^2}. \tag{6}$$
To check that a and b give the minimum of (2) is a good exercise.
Note We have the inequality
$$\left(\sum_{j=1}^{n}x_j\right)^2\le n\sum_{j=1}^{n}x_j^2, \tag{7}$$
and the equal sign holds if and only if $x_1=x_2=\cdots=x_n$. Since $x_1,x_2,\ldots,x_n$ are data we may assume that $x_i\ne x_j$ for some $i\ne j$. Therefore (7) gives
$$n\sum_{j=1}^{n}x_j^2-\left(\sum_{j=1}^{n}x_j\right)^2>0. \tag{8}$$
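The explicit formulas (6) are easy to check numerically. The following Python sketch is only an illustration (the sample data are made up and do not appear in this paper):

```python
# Closed-form least squares fit of y = a*x + b via the formulas (6).
# The sample data below are made up for illustration.
import numpy as np

def least_squares_fit(x, y):
    """Return (a, b) minimizing sum_j (y_j - (a*x_j + b))**2."""
    n = len(x)
    # the common denominator is the positive quantity (8)
    d = n * np.sum(x**2) - np.sum(x)**2
    a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / d
    b = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / d
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
a, b = least_squares_fit(x, y)
```

The denominator is exactly the quantity (8), which is positive unless all the $x_j$ coincide.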
2. Least Squares Method from Deep Learning
In this section we reconsider the least squares method in Section 1 from the viewpoint of Deep Learning.
First we arrange the data in Section 1 as input–teacher pairs like
$$x_j\ \longrightarrow\ y_j\quad (j=1,2,\ldots,n)$$
and consider a simple neuron model in [7] (see Figure 2).
Figure 2. Simple neuron model 1.
Here we use the linear function (1) instead of the sigmoid function $\sigma(x)=\dfrac{1}{1+e^{-x}}$.
In this case the square error function becomes
$$E=E(a,b)=\frac{1}{2}\sum_{j=1}^{n}\left\{(ax_j+b)-y_j\right\}^2.$$
We usually use $\frac{1}{2}\sum$ instead of $\sum$ in (2).
Our aim is also to determine the parameters $a$ and $b$ in order to minimize $E(a,b)$. However, the procedure is different from that of the least squares method in Section 1. This is an important and interesting point.
For later use let us perform a little calculation:
$$\frac{\partial E}{\partial a}=a\sum x_j^2+b\sum x_j-\sum x_jy_j,\qquad \frac{\partial E}{\partial b}=a\sum x_j+nb-\sum y_j. \tag{9}$$
We determine the parameters successively by the gradient descent method; see for example [8]. For $t=0,1,2,\ldots$ we set
$$a_{t+1}=a_t-\epsilon\,\frac{\partial E}{\partial a}(a_t,b_t)$$
and
$$b_{t+1}=b_t-\epsilon\,\frac{\partial E}{\partial b}(a_t,b_t), \tag{10}$$
where $\frac{\partial E}{\partial a}(a_t,b_t)$ and $\frac{\partial E}{\partial b}(a_t,b_t)$ are the derivatives (9) evaluated at $(a,b)=(a_t,b_t)$, and $\epsilon>0$ is small enough. The initial value $(a_0,b_0)$ is given appropriately. As will be shown shortly in Theorem I, its explicit value is not important.
Comment The parameter $\epsilon$ is called the learning rate, and it is very hard to choose $\epsilon$ properly, as emphasized in [7]. In this paper we provide an estimation (see Theorem II).
Let us write down (10) explicitly:
$$a_{t+1}=a_t-\epsilon\left(a_t\sum x_j^2+b_t\sum x_j-\sum x_jy_j\right)$$
and
$$b_{t+1}=b_t-\epsilon\left(a_t\sum x_j+nb_t-\sum y_j\right).$$
These are cast in a vector–matrix form:
$$\begin{pmatrix}a_{t+1}\\ b_{t+1}\end{pmatrix}=\begin{pmatrix}a_t\\ b_t\end{pmatrix}-\epsilon\left\{\begin{pmatrix}\sum x_j^2 & \sum x_j\\ \sum x_j & n\end{pmatrix}\begin{pmatrix}a_t\\ b_t\end{pmatrix}-\begin{pmatrix}\sum x_jy_j\\ \sum y_j\end{pmatrix}\right\}. \tag{11}$$
For simplicity, by setting
$$\mathbf{u}_t=\begin{pmatrix}a_t\\ b_t\end{pmatrix},\qquad A=\begin{pmatrix}\sum x_j^2 & \sum x_j\\ \sum x_j & n\end{pmatrix},\qquad \mathbf{f}=\begin{pmatrix}\sum x_jy_j\\ \sum y_j\end{pmatrix},$$
we have a simple equation
$$\mathbf{u}_{t+1}=(E-\epsilon A)\,\mathbf{u}_t+\epsilon\,\mathbf{f}, \tag{12}$$
where $E$ is a unit matrix. Due to (8) the matrix $A$ is invertible ($\det A=n\sum x_j^2-(\sum x_j)^2>0$); that is, we exclude the trivial and uninteresting case $x_1=x_2=\cdots=x_n$.
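Before solving (12) in closed form, one can simply run the recursion. A NumPy sketch (the sample data and the value of $\epsilon$ here are made up; $\epsilon$ is chosen small by hand):

```python
# Run the recursion (12): u_{t+1} = (E - eps*A) u_t + eps*f.
# Sample data and eps are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
n = len(x)

A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    float(n)]])
f = np.array([np.sum(x * y), np.sum(y)])

eps = 0.01                      # small enough for this data (cf. Theorem II)
u = np.array([0.0, 0.0])        # initial value u_0 = (a_0, b_0)
for _ in range(10000):
    u = u - eps * (A @ u - f)   # one gradient descent step of (12)
# u now approximates A^{-1} f, i.e. the least squares solution (5)
```

For this data the iterates converge to $A^{-1}\mathbf{f}$; Theorem II below makes the admissible range of $\epsilon$ precise.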
The solution is easy and given by
$$\mathbf{u}_t=(E-\epsilon A)^t\left(\mathbf{u}_0-A^{-1}\mathbf{f}\right)+A^{-1}\mathbf{f}. \tag{13}$$
Note Let us consider a simple difference equation
$$u_{t+1}=cu_t+d$$
for $c\ne 1$. Then, the solution is given by
$$u_t=c^t\left(u_0-\frac{d}{1-c}\right)+\frac{d}{1-c}.$$
Check this.
Comment The solution (13) gives
$$\lim_{t\to\infty}\mathbf{u}_t=A^{-1}\mathbf{f} \tag{14}$$
if
$$\lim_{t\to\infty}(E-\epsilon A)^t=O, \tag{15}$$
where $O$ is a zero matrix. (14) is just the equation (5).
Let us evaluate (13) further. For this purpose we make some preparations from Linear Algebra [9]. For simplicity we set
$$A=\begin{pmatrix}\alpha & \beta\\ \beta & n\end{pmatrix},\qquad \alpha=\sum_{j=1}^{n}x_j^2,\quad \beta=\sum_{j=1}^{n}x_j,$$
and want to diagonalize $A$.
The characteristic polynomial of $A$ is
$$f(\lambda)=|\lambda E-A|=\lambda^2-(\alpha+n)\lambda+(\alpha n-\beta^2),$$
and the solutions of $f(\lambda)=0$ are given by
$$\lambda_{\pm}=\frac{(\alpha+n)\pm\sqrt{(\alpha-n)^2+4\beta^2}}{2}. \tag{16}$$
It is easy to see
$$\lambda_{+}>\lambda_{-}>0, \tag{17}$$
because $\lambda_{+}\lambda_{-}=\det A=\alpha n-\beta^2>0$ by (8) and $\lambda_{+}+\lambda_{-}=\alpha+n>0$.
We set the two eigenvectors of the matrix $A$, corresponding to $\lambda_{+}$ and $\lambda_{-}$, in a matrix form
$$\left(\mathbf{v}_{+}\ \ \mathbf{v}_{-}\right)=\begin{pmatrix}\beta & \beta\\ \lambda_{+}-\alpha & \lambda_{-}-\alpha\end{pmatrix}.$$
It is easy to see
$$(\lambda_{+}-\alpha)(\lambda_{-}-\alpha)=-\beta^2$$
from (16), and we also set
$$Q=\left(\frac{\mathbf{v}_{+}}{\|\mathbf{v}_{+}\|}\ \ \frac{\mathbf{v}_{-}}{\|\mathbf{v}_{-}\|}\right)=\begin{pmatrix}\frac{\beta}{\sqrt{\beta^2+(\lambda_{+}-\alpha)^2}} & \frac{\beta}{\sqrt{\beta^2+(\lambda_{-}-\alpha)^2}}\\[2mm] \frac{\lambda_{+}-\alpha}{\sqrt{\beta^2+(\lambda_{+}-\alpha)^2}} & \frac{\lambda_{-}-\alpha}{\sqrt{\beta^2+(\lambda_{-}-\alpha)^2}}\end{pmatrix}. \tag{18}$$
Then it is easy to see
$$Q^{T}Q=QQ^{T}=E.$$
Namely, $Q$ is an orthogonal matrix. Then the diagonalization of $A$ becomes
$$A=Q\begin{pmatrix}\lambda_{+} & 0\\ 0 & \lambda_{-}\end{pmatrix}Q^{T}. \tag{19}$$
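The diagonalization (19), together with the closed-form eigenvalues (16), can be verified numerically. A sketch with made-up sample data:

```python
# Check the diagonalization (19): A = Q diag(l_plus, l_minus) Q^T,
# with the eigenvalues (16). Sample data are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)
alpha, beta = np.sum(x**2), np.sum(x)
A = np.array([[alpha, beta], [beta, float(n)]])

disc = np.sqrt((alpha - n)**2 + 4 * beta**2)
l_plus  = ((alpha + n) + disc) / 2     # lambda_+
l_minus = ((alpha + n) - disc) / 2     # lambda_-

# normalized eigenvectors (beta, lambda - alpha) as columns of Q, as in (18)
v_plus  = np.array([beta, l_plus - alpha])
v_minus = np.array([beta, l_minus - alpha])
Q = np.column_stack([v_plus / np.linalg.norm(v_plus),
                     v_minus / np.linalg.norm(v_minus)])
```

The orthogonality of the columns is exactly the identity $(\lambda_{+}-\alpha)(\lambda_{-}-\alpha)=-\beta^2$ noted above.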
By substituting (19) into (13) and using
$$(E-\epsilon A)^t=Q\begin{pmatrix}(1-\epsilon\lambda_{+})^t & 0\\ 0 & (1-\epsilon\lambda_{-})^t\end{pmatrix}Q^{T},$$
we finally obtain
Theorem I A general solution to (12) is
$$\mathbf{u}_t=Q\begin{pmatrix}(1-\epsilon\lambda_{+})^t & 0\\ 0 & (1-\epsilon\lambda_{-})^t\end{pmatrix}Q^{T}\left(\mathbf{u}_0-A^{-1}\mathbf{f}\right)+A^{-1}\mathbf{f}. \tag{20}$$
This is our main result.
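Theorem I can also be checked numerically by comparing the closed form (20) with direct iteration of (12). The sketch below uses made-up data and arbitrary choices of $\epsilon$, $t$ and $\mathbf{u}_0$:

```python
# Compare the closed form (20) with direct iteration of (12).
# Sample data, eps, t and u0 are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
n = len(x)
alpha, beta = np.sum(x**2), np.sum(x)
A = np.array([[alpha, beta], [beta, float(n)]])
f = np.array([np.sum(x * y), np.sum(y)])

disc = np.sqrt((alpha - n)**2 + 4 * beta**2)
l_plus, l_minus = ((alpha + n) + disc) / 2, ((alpha + n) - disc) / 2
v_plus  = np.array([beta, l_plus - alpha])
v_minus = np.array([beta, l_minus - alpha])
Q = np.column_stack([v_plus / np.linalg.norm(v_plus),
                     v_minus / np.linalg.norm(v_minus)])

eps, t = 0.01, 50
u0 = np.array([1.0, -1.0])
star = np.linalg.solve(A, f)                       # A^{-1} f

# closed form (20)
D = np.diag([(1 - eps * l_plus)**t, (1 - eps * l_minus)**t])
u_closed = Q @ D @ Q.T @ (u0 - star) + star

# direct iteration of (12)
u_iter = u0.copy()
for _ in range(t):
    u_iter = u_iter - eps * (A @ u_iter - f)
```

The two results agree to machine precision for any admissible choice of the parameters.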
Lastly, let us show how to choose the learning rate $\epsilon$, which is a very important problem in Deep Learning. Let us remember
$$\lambda_{+}>\lambda_{-}>0$$
from (17). From (15) the equations
$$|1-\epsilon\lambda_{+}|<1,\qquad |1-\epsilon\lambda_{-}|<1$$
determine the range of $\epsilon$. Noting
$$0<\frac{2}{\lambda_{+}}<\frac{2}{\lambda_{-}},$$
we obtain
Theorem II The learning rate $\epsilon$ must satisfy the inequality
$$0<\epsilon<\frac{2}{\lambda_{+}}=\frac{4}{(\alpha+n)+\sqrt{(\alpha-n)^2+4\beta^2}}. \tag{21}$$
From (21), $\epsilon$ becomes very small when $n$ is large enough, because $\lambda_{+}\ge\max(\alpha,n)$. It is easy to see that the second condition $|1-\epsilon\lambda_{-}|<1$ is automatically satisfied, because $0<\epsilon\lambda_{-}<\epsilon\lambda_{+}<2$.
Under Theorem II we can recover (14):
$$\lim_{t\to\infty}(E-\epsilon A)^t=Q\begin{pmatrix}\lim_{t\to\infty}(1-\epsilon\lambda_{+})^t & 0\\ 0 & \lim_{t\to\infty}(1-\epsilon\lambda_{-})^t\end{pmatrix}Q^{T}=O$$
by (19).
Comment For example, if we choose $\epsilon$ like
$$\epsilon\ge\frac{2}{\lambda_{+}},$$
then $|1-\epsilon\lambda_{+}|\ge 1$ and we cannot recover (14), which shows a difference between Statistics and Deep Learning. Let us emphasize that the choice of the initial value is irrelevant when the convergence condition (21) is satisfied.
As a result, how to choose $\epsilon$ properly in Deep Learning becomes a very important problem when the number of data is huge. As far as we know, a result like Theorem II has not been obtained before.
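The bound (21) can be illustrated numerically: with $\epsilon$ just below $2/\lambda_{+}$ the iteration (12) converges to $A^{-1}\mathbf{f}$, while just above it the iterates blow up. A sketch with made-up data:

```python
# Illustrate the learning rate bound (21): convergence of (12) for
# eps < 2/lambda_+, divergence otherwise. Sample data are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
n = len(x)
alpha, beta = np.sum(x**2), np.sum(x)
A = np.array([[alpha, beta], [beta, float(n)]])
f = np.array([np.sum(x * y), np.sum(y)])
l_plus = ((alpha + n) + np.sqrt((alpha - n)**2 + 4 * beta**2)) / 2

def iterate(eps, steps):
    u = np.zeros(2)                  # initial value (a_0, b_0) = (0, 0)
    for _ in range(steps):
        u = u - eps * (A @ u - f)
    return u

star = np.linalg.solve(A, f)
u_good = iterate(0.9 * 2 / l_plus, 5000)   # below the bound (21)
u_bad  = iterate(1.1 * 2 / l_plus, 200)    # above the bound (21)
```

Here `u_good` agrees with the least squares solution (5), while the norm of `u_bad` grows without bound, in accordance with Theorem II.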
3. Problem
In this section we present the outline of a simple generalization of the results in Section 2. The actual calculation is left as a problem (exercise) to readers.
For $n$ pieces of three dimensional real data
$$\{(x_1,y_1,z_1),(x_2,y_2,z_2),\ldots,(x_n,y_n,z_n)\}$$
we assume that their scatter plot is like Figure 3.
Then a model function is linear:
$$z=ax+by+c, \tag{22}$$
and the error (or loss) function is defined by
$$E(a,b,c)=\sum_{j=1}^{n}\left\{z_j-(ax_j+by_j+c)\right\}^2. \tag{23}$$
Figure 3. Scatter plot 2.
Figure 4. Simple neuron model 2.
The aim of the least squares method is to minimize the error function (23) with respect to $a$, $b$ and $c$.
As we want to treat the least squares method above from the viewpoint of Deep Learning, we again arrange the data as input–teacher pairs like
$$(x_j,y_j)\ \longrightarrow\ z_j\quad (j=1,2,\ldots,n)$$
and consider another simple neuron model (see Figure 4).
Then we present
Problem Carry out the calculation corresponding to that in Section 2.
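As a numerical starting point for the Problem (not the full solution, which is left to readers), the $3\times 3$ analogue of the matrix $A$ and the vector $\mathbf{f}$ in (12) for the model (22) can be set up as follows; the sample data are made up for illustration:

```python
# Set up and iterate the 3x3 analogue of (12) for the model (22),
# z = a*x + b*y + c. Sample data are made up; the diagonalization and
# the analogue of the bound (21) are left as the exercise.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
z = np.array([5.1, 4.2, 11.0, 9.1, 14.8])
n = len(x)

A = np.array([[np.sum(x**2), np.sum(x*y),  np.sum(x)],
              [np.sum(x*y),  np.sum(y**2), np.sum(y)],
              [np.sum(x),    np.sum(y),    float(n)]])
f = np.array([np.sum(x*z), np.sum(y*z), np.sum(z)])

# gradient descent (12) carries over verbatim with u_t = (a_t, b_t, c_t)
eps = 0.005
u = np.zeros(3)
for _ in range(20000):
    u = u - eps * (A @ u - f)
# u approximates A^{-1} f, the least squares plane fit
```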
4. Concluding Remarks
In this paper we discussed the least squares method from the viewpoint of Deep Learning and carried out the calculation of the gradient descent sequence thoroughly. The difference in methods between Statistics and Deep Learning delivers different results when the learning rate $\epsilon$ is changed. As far as we know, the result of Theorem II is the first of its kind.
Deep Learning plays an essential role in Data Science and maybe in almost all fields of Science. Therefore it is desirable for undergraduates to master it as soon as possible. To master it they must study Calculus, Linear Algebra and Statistics in Mathematics. However, we don’t know a good and compact textbook leading to Deep Learning.
I am planning to write a comprehensive textbook in the near future [10] .
Acknowledgements
We wish to thank Ryu Sasaki for useful suggestions and comments.
Cite this paper
Fujii, K. (2018) Least Squares Method from the View Point of Deep Learning. Advances in Pure Mathematics, 8, 485-493. https://doi.org/10.4236/apm.2018.85027
References
- 1. Wikipedia: Least Squares. https://en.wikipedia.org/wiki/Least_squares
- 2. Wikipedia: Deep Learning. https://en.wikipedia.org/wiki/Deep_learning
- 3. Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. The MIT Press, Cambridge.
- 4. Patterson, J. and Gibson, A. (2017) Deep Learning: A Practitioner’s Approach. O’Reilly Media, Inc., Sebastopol.
- 5. Okaya, T. (2015) Deep Learning (In Japanese). Kodansha Ltd., Tokyo.
- 6. Amari, S. (2016) Brain⋅Heart⋅Artificial Intelligence (In Japanese). Kodansha Ltd., Blue Backs B-1968, Tokyo.
- 7. Fujii, K. (2018) Mathematical Reinforcement to the Minibatch of Deep Learning. Advances in Pure Mathematics, 8, 307-320. https://doi.org/10.4236/apm.2018.83016
- 8. Wikipedia: Gradient Descent. https://en.wikipedia.org/wiki/Gradient_descent
- 9. Kasahara, K. (2000) Linear Algebra (In Japanese). Saiensu Ltd., Tokyo.
- 10. Fujii, K. Introduction to Mathematics for Understanding Deep Learning. In Preparation.