A class of rapid algorithms for independent component analysis (ICA) is presented. This method augments an existing fixed-point scheme for increasing non-Gaussianity with multi-step past information. This can be viewed as the addition of a variable-size momentum term. The use of past information comes from the idea of surrogate optimization. Including past information adds little cost to either software design or runtime execution. The speed of the algorithm is evaluated on both simulated and real-world data. The real-world data include color images and electroencephalograms (EEGs), which are an important source of data on human-computer interactions. From these experiments, it is found that the method presented here, the RapidICA, performs quickly, especially for the demixing of super-Gaussian signals.
Optimization of the amount of information leads to a rich class of learning algorithms for various types of signal processing. The learning phase is usually iterative, which is often the cause of their effective-but-slow nature. The speed of a learning algorithm is important, because this facilitates the application to real-world problems. Independent component analysis (ICA) [
Reflecting this necessity for speedup, this paper presents a class of rapid ICA algorithms. The core idea is the introduction of past information to the fixed-point ICA. This method inherits the idea for the speedup of the gradient-style algorithm using a surrogate function [3-5].
When a new class of learning algorithms is presented, it is necessary to check the following:
1) Does it provide a sufficient performance benefit?
2) Is it applicable to substantial real-world problems?
In this paper, we present the RapidICA algorithm and the results of testing it on both artificial and real-world data. As practical data, we adopted color images and electroencephalograms (EEGs) for pattern recognition. Extensive experiments confirm the effectiveness of the RapidICA over the existing fixed-point ICA.
The organization of this paper is as follows. In Section 2, preliminaries to the ICA problem are given. Section 3 shows three types of fast versions, starting with the simplest one. Next, a summarized version is given, along with a strategy for stabilization and a measure of the convergence. Section 4 presents the results of experiments comparing the RapidICA with the FastICA for both artificial and real-world data. The practical data were color images and electroencephalograms (EEGs). In Section 5, concluding remarks are given.
In this section, we present our notation and the minimum necessary preliminaries for ICA. Given a measured random vector x, the problem of ICA starts with the superposition relationship

$x = As.$ (1)

Here,

$x = (x_1, \ldots, x_n)^T$ (2)

is zero mean. The matrix A is an n × n unknown nonsingular matrix. The column vector

$s = (s_1, \ldots, s_n)^T$ (3)

is also unknown. Its components are assumed to be independent of each other, and to be non-Gaussian except for at most one component. The necessity of these assumptions indicates the following:
1) For the consistency of the ICA problem, it is desirable that the xi are far from a Gaussian distribution.
2) There are two classes that are non-Gaussian. One is super-Gaussian and the other is sub-Gaussian. But by the central limit theorem, the summation of only a few independent sub-Gaussian random variables easily becomes an almost Gaussian one. Therefore, real-world sub-Gaussian mixtures xi for ICA applications are rare. On the other hand, super-Gaussian mixtures xi are found in many signal processing sources. Thus, experiments on real-world data will be mainly on super-Gaussian mixtures.
Using the data production mechanism of (1)-(3), the problem of ICA is to estimate both A and s using only x. This is, however, a semi-parametric problem in which some uncertainty remains. Therefore, we will be satisfied if the following y and W are successfully obtained:

$y = Wx.$ (4)

Here, the random vector y is an estimated column vector with independent components. Note that the ordering of the components of y is allowed to be permuted from that of s. The amplitude of each $s_i$ is also uncertain in this generic ICA formulation. Therefore, W is an estimation of $\Lambda P A^{-1}$. Here, $\Lambda$ is a nonsingular diagonal matrix, and $P$ is a permutation matrix.
There are several cost functions that measure the independence of random variables. A popular target function is the minimization of the Kullback-Leibler divergence, or, equivalently, the minimum average mutual information. The methods we present in this paper are related to this information measure.
As is well known from information theory [, the dependence among the components $y_i$ of y can be measured by the average mutual information

$I(y) = \int p(y) \log \frac{p(y)}{\prod_i p_i(y_i)}\, dy.$ (5)

This average mutual information becomes zero if and only if the $y_i$ are independent of each other. Thus, differentiation with respect to W leads to a gradient descent algorithm. This method can be generalized by using a convex divergence that includes the average mutual information of (5) as a special case.
The convex divergence is an information quantity expressed as follows [:

$g(W) = \int q(y)\, f\!\left(\frac{p(y)}{q(y)}\right) dy, \qquad q(y) = \prod_i p_i(y_i).$ (6)

Here, $f$ is convex on $(0, \infty)$, and $f^*(u) = u f(1/u)$ is the dual convex function. Note that the special case of $f(u) = u \log u$ reduces (6) to the average mutual information (5). The differentiation of (6) with respect to W leads to a gradient descent update for the demixing matrix [
Here, t is the iteration index for the update, and $\tau$ is a delay. The gradient symbol represents the natural gradient of g in Equation (6) multiplied by c, where c is a positive constant associated with the Fisher information matrix [
The score function appearing in this gradient is unknown in practice, since the source pdf is unknown in the formulation of the ICA. Therefore, in practical ICA execution, the function is replaced by a fixed nonlinear one, such as $\tanh(y)$ or $y^3$.
An important point in this preliminary section appears in the last line of Equation (9). If the past information is dropped, then the second term disappears, and the update reduces to the method of minimum average mutual information using Equation (5). On the other hand, the method of minimum convex divergence generates the momentum term from the past information. This was the fastest ICA by the gradient descent method [
Non-Gaussianity can be measured by the Kullback-Leibler divergence (5) from a pdf to a Gaussian pdf of the same variance, which is called the negentropy:

$J(y) = H(y_{\mathrm{Gauss}}) - H(y).$ (11)
A criterion for the ICA of Equation (4) can be set as the maximization of the negentropy (11). It is not possible to maximize the negentropy directly, since the pdf of y is unknown. Therefore, in the FastICA [
Note that this assumption was not needed in the method of Section 2.2.1.
Then, using a predetermined contrast function G, the maximization of (11) is approximated by the following optimization with a constraint [:

$\max_w\ \left[E\{G(w^T z)\} - E\{G(\nu)\}\right]^2$ (12)
$\text{subject to } \|w\| = 1.$ (13)

Here, $\nu$ is a zero-mean, unit-variance Gaussian random variable. By virtue of the above drastic approximation, the FastICA with a couple of variants [ updates each row w of the demixing matrix in a fixed-point style:

$w \leftarrow E\{z\, g(w^T z)\} - E\{g'(w^T z)\}\, w.$ (14)

Here, $g$ and $g'$ are the first and second derivatives of the contrast function G. Examples of G are $y^4$, $\log\cosh y$, and $-\exp(-y^2/2)$. The update method (14) is a fixed-point algorithm and, because of its speed, it is the current de facto standard for ICA.
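To make the fixed-point update of Equation (14) concrete, here is a minimal numpy sketch of the one-unit version with the $\log\cosh$ contrast, so that $g = \tanh$ and $g' = 1 - \tanh^2$. All function and variable names are ours, and the toy mixture only illustrates the mechanics; this is a sketch, not the reference FastICA implementation.

```python
import numpy as np

def fastica_unit_update(w, Z):
    """One fixed-point update (14) for a single demixing row w.

    Z : (n, T) whitened data; w : (n,) unit-norm row vector.
    Contrast G(y) = log cosh y, so g = tanh and g' = 1 - tanh^2.
    """
    y = w @ Z                                   # projected signal, shape (T,)
    g, g_prime = np.tanh(y), 1.0 - np.tanh(y) ** 2
    w_new = (Z * g).mean(axis=1) - g_prime.mean() * w   # E{z g(y)} - E{g'(y)} w
    return w_new / np.linalg.norm(w_new)                # renormalize to the constraint

# usage: extract one component from a two-source toy mixture
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))                 # super-Gaussian sources
X = np.array([[1.0, 0.5], [0.3, 1.0]]) @ S
X = X - X.mean(axis=1, keepdims=True)           # whiten the mixture
d, E = np.linalg.eigh(np.cov(X))
Z = np.diag(d ** -0.5) @ E.T @ X
w = np.array([1.0, 0.0])
for _ in range(50):
    w = fastica_unit_update(w, Z)
```

At convergence, a further update changes w only up to sign, which is the indeterminacy discussed in Section 2.1.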
So far, two preliminary sections have been presented. The first one, Section 2.2.2, presents a direct minimization of the convex divergence between the current and independent pdfs. The last term of the last line of Equation (9) is the important one.
1) This term, which depends on the past information, is the core of the speedup [
2) The case of $f(u) = u \log u$ corresponds to the optimization of the Kullback-Leibler divergence or its subsidiary, the entropy.
The second preliminary, the FastICA derived from the entropy difference, updates Equation (14) in a fixed-point style. Based upon the theoretical foundation of the derivation of Equation (9), one may conjecture that acceleration methods for Equation (14) would be obtainable by using the strategy of Equation (9). The rest of this paper shows that this conjecture is answered in the affirmative. We comment here in advance that the process of obtaining the accelerated version of the fixed-point method is not naïve. It requires more artifice than the method of Section 2.2.1, since orthonormalization steps need to be interleaved. But we will see that the result is easy to code as software.
In this section, several steps towards the final rapid version are presented. From this point on, the description will be more software oriented.
Instead of the measured vector x, its whitened version z, obtained in the following way, is used as the target source to be demixed:

$z = Vx,$ (15)
$V = D^{-1/2} E^T.$ (16)

Here, D is the diagonal matrix constructed from the m largest eigenvalues of the covariance matrix of x, and $E = [e_1, \ldots, e_m]$ collects the corresponding eigenvectors $e_i$. Then, the update and orthonormalization steps of the fixed-point method are expressed as follows:
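The whitening of Equations (15)-(16) can be sketched in a few lines of numpy; the function and variable names are ours.

```python
import numpy as np

def whiten(X, m):
    """Whiten X (n, T): zero mean, identity covariance, reduced to m dimensions."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc))
    idx = np.argsort(eigvals)[::-1][:m]          # m largest eigenvalues
    D, E = eigvals[idx], eigvecs[:, idx]
    V = np.diag(D ** -0.5) @ E.T                 # whitening matrix V of Equation (16)
    return V @ Xc, V

# usage: a rank-3 mixture observed through 5 channels
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 2000))
Z, V = whiten(X, 3)
# cov(Z) is (numerically) the 3 x 3 identity matrix
```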
Update step: W is updated by the following computation.
Orthonormalization step: W is orthonormalized by using the eigenvalue decomposition.
Since we will use the speedup method of Section 2.2.1, it is necessary to rewrite (17) as an additive form.
Equation (19) can be derived by dividing the right-hand side of Equation (14) by $-E\{g'(w^T z)\}$. This is allowed since the orthonormalization step for W follows.
Then the basic method, which is the FastICA, is described as follows:
[Basic Method]
Step 1:
Step 2:
Step 3:
Beginning in the next subsection, a series of speedup versions is presented. Findings in each version are supported by extensive experiments. Except for the destination algorithm, intermediate experimental results are omitted in order to save space.
For the basic method, which is the FastICA, it has been pointed out that the orthonormalization of Step 3 sometimes hinders the learning of non-Gaussianity by causing a slow convergence. Therefore, we start by considering an algorithm that simply increases the update amount.
[Method 0: Naïve version]
Step 1:
Step 2:
Step 3:
Step 4:
This naïve method is too simple, in that convergence frequently fails for an unadjusted step size, even for small values. Therefore, we make this increase usable by inserting an orthonormalization step.
[Method 1: Opening version]
Step 1:
Step 2:
Step 3:
Step 4:
Step 5:
Step 6:
One might think that this version simply computes the same update twice, but the following observations lead us to better versions:
1) Step 2 requires much more computational power than do the others.
2) Without Step 4, the increase by could cause oscillations.
By integrating Method 1 and the usage of the past information described in Section 2.2.1, the following method is derived.
[Method 2: Second-order version]
Step 1:
Step 2:
Step 3:
Step 4:
Step 5:
Step 6:
Step 7:
The key point of this method is that there are two types of increments. The first type is in Step 3 and the second one is in Step 6. This method has the following properties:
1) Each increment uses different time information. This is equivalent to the usage of a higher-order strategy.
2) The computed increments of Step 5 gradually converge to the null matrix O. Therefore, this method is more stable than Method 1.
3) The computational load of Step 5 is negligible compared to that of Step 3. Therefore, the reduction of iterations will be directly reflected in the runtime speedup.
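The second increment of Method 2 can be sketched as follows. This is our reading of the steps: the increment is the change $\Delta W = W_t - W_{t-1}$ scaled by a small $\varepsilon$, followed by re-orthonormalization; the symmetric orthonormalization shown is one standard choice, and all names are ours.

```python
import numpy as np

def orth(W):
    """Symmetric orthonormalization: W <- (W W^T)^(-1/2) W."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W

def accelerated_step(W, W_old, eps=0.1):
    """Add a momentum-like increment built from past information,
    then re-orthonormalize (the cheap second increment of Method 2)."""
    dW = W - W_old                     # past information
    return orth(W + eps * dW)

# usage on random orthonormal matrices
rng = np.random.default_rng(2)
W_old = orth(rng.standard_normal((4, 4)))
W = orth(rng.standard_normal((4, 4)))
W_new = accelerated_step(W, W_old)
```

The increment itself costs only a matrix subtraction and scaling, which is why the extra work per iteration is negligible next to the data-dependent update step.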
In Method 2, the adjusting parameter of the increment was a scalar, such as 0.1. The next step is to find reasonable step-size values $\eta_i$ that depend on the row indices i of W.
[Method 3: Variable step-size version]
Step 1:
Step 2:
Step 3:
Step 4:
Step 5:
Step 6:
Step 7:
In this method, the main issue is how to find an effective step size $\eta_i$. Here, we adopt the idea of jointly using current and past information.
Equation (20) has the property that a large change in the direction of the current increment $\Delta w_i(t)$ from the past increment $\Delta w_i(t-1)$ causes the value of $\eta_i$ to be small. This property is useful in helping to avoid oscillations during the intermediate steps. However, the magnitude of the increment is not taken into account. Also, near the convergence maximum, the value of $\eta_i$ might become numerically unstable, since division by zero could occur. We now arrive at the final form:

$\eta_i = \beta\, \frac{\max\!\left(\langle \Delta w_i(t), \Delta w_i(t-1) \rangle,\ 0\right)}{\max\!\left(\|\Delta w_i(t)\|^2,\ \|\Delta w_i(t-1)\|^2\right) + \gamma}.$ (21)

Here, $\beta$ is a constant, and $\gamma$ is a small constant that can be omitted (i.e., $\gamma = 0$). The constant $\gamma$ serves the purpose of providing numerical stability by preventing division by zero (for 32-bit machines). Equation (21) has the following properties:
1) When $\Delta w_i(t)$ and $\Delta w_i(t-1)$ point in similar directions, the value of $\eta_i$ is large, since this direction needs to be emphasized. On the other hand, if the directions of $\Delta w_i(t)$ and $\Delta w_i(t-1)$ are considerably different (the extreme case is anti-parallel), the value of $\eta_i$ is close to zero.
2) Because of the numerator of (21), a small increment generates a small $\eta_i$.
Properties 1) and 2) give us hope that considerable speedup can be obtained with very little increase in computational complexity. There is one more property on which to comment.
3) It should be noted that every ICA algorithm has the possibility of non-convergence. The FastICA [
Here, $\mu$ is a slowdown constant. The case of $\mu = 1$ is the plain FastICA, which may not converge on some data. In our preliminary experiments with Method 3, a slowdown constant slightly below one worked well in achieving convergence without too much slowdown.
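The per-row step sizes are computed in the appendix R code; the following is a numpy transcription of that rule (names ours): $\eta_i$ is large when the current and past increments point the same way, zero when they oppose, and $\gamma$ guards against division by zero.

```python
import numpy as np

def row_step_sizes(dW, dW_old, beta=1.0, gamma=1e-6):
    """Per-row step sizes eta_i, following the rule in the appendix code:
    eta_i = beta * max(<dw_i, dw_i_old>, 0)
                 / (max(||dw_i||^2, ||dw_i_old||^2) + gamma)."""
    num = np.maximum((dW * dW_old).sum(axis=1), 0.0)    # clipped inner products
    den = np.maximum((dW ** 2).sum(axis=1), (dW_old ** 2).sum(axis=1))
    return beta * num / (den + gamma)

# usage: row 0 has parallel increments, row 1 anti-parallel increments
dW     = np.array([[1.0, 0.0], [1.0, 1.0]])
dW_old = np.array([[1.0, 0.0], [-1.0, -1.0]])
eta = row_step_sizes(dW, dW_old)
# row 0: eta close to 1 (emphasize); row 1: eta exactly 0 (suppress)
```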
Up to this point, we have presented steps that increase the speedup and stabilization. Integrating all these steps is expected to produce an even faster ICA than the existing ones. The following procedure summarizes the RapidICA.
[RapidICA]
Step 1:
Step 2:
Step 3:
where
Step 4:
Step 5:
Step 6:
Step 7:
with $\eta_i$ of Equation (21).
Step 8:
In the appendix, we will give the source code of the RapidICA in the programming language R.
It is important to emphasize the following:
1) The algorithm of the RapidICA utilizes past information over three steps in a single cycle. This is illustrated in
2) From
an exception, since acceleration methods for the natural gradient ICA have already been presented [3-5] based upon the idea of surrogate optimization of the likelihood ratio [
3) Computationally, the heaviest parts are Steps 1 and 3 of Methods 2 and 3. Let n be the number of independent components and T be the number of samples. Then the order of that computation is $O(n^2 T)$ per iteration. The computation of the step sizes $\eta_i$, however, is only $O(n^2)$. Since $T \gg n$ holds in most cases, the computational overhead for realizing the RapidICA from the FastICA remains small. Therefore, a reduction in the number of iterations leads directly to a speedup in CPU time.
In the next section, the speedup effects of the RapidICA will be measured by the number of iterations and the CPU time.
We evaluated the performance of the RapidICA with respect to simulated and real-world data. Because of the semi-parametric nature of the ICA formulation, methods to measure convergence are different between simulated and real-world data.
The evaluation of simulated data is important since the mixing matrix A is specified in advance. Note that this matrix is assumed to be unknown in the ICA setting.
For simulated data, the first measure of error counts how close the estimated demixing is to the inverse of the mixing. There is also the permutation uncertainty. Therefore, the error measure is based on the matrix

$P = WVA.$ (23)

Here, V is the transformation matrix of Equation (16). The error measure is then defined as follows [
Here, $p_{ij}$ is the $(i, j)$ element of the matrix P. If the amplitude ambiguity is removed, i.e., the source signal is regarded as a preprocessed one, then the matrix P becomes a permutation matrix.
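The paper's exact Equation (24) is not recoverable from this extraction; the commonly used choice for an error measure built from the elements $p_{ij}$ of $P = WVA$ is the Amari performance index, which vanishes exactly when P is a scaled permutation matrix. A sketch of that standard index:

```python
import numpy as np

def amari_index(P):
    """Standard Amari performance index for P = W V A.
    Zero iff P is a scaled permutation matrix (perfect separation)."""
    P = np.abs(P)
    row = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return row.sum() + col.sum()

# usage: a scaled permutation scores 0; a uniform matrix scores > 0
perm = np.array([[0.0, 2.0], [-3.0, 0.0]])
```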
The second error measure reflects the independence by using the averaged value of the contrast function (e.g., $-\log\cosh$).
This is applicable to both simulated and real-world data.
The third error criterion measures the convergence of the demixing matrix. Let $w_i$ and $w_i^{\mathrm{old}}$ be the row vectors of W and $W^{\mathrm{old}}$, respectively. Here, W is the currently orthonormalized version. Thus, the convergence measure is defined as follows:

$\mathrm{conv} = 1 - \frac{1}{n} \sum_{i=1}^{n} \left| \langle w_i, w_i^{\mathrm{old}} \rangle \right|.$ (26)
This is the most important convergence measure, since it can be used as a stop criterion for the iteration.
A typical value for the stopping threshold is on the order of $10^{-5}$ (the value used in the appendix code).
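This convergence measure also appears in the appendix R code; a numpy transcription (names ours) makes its sign-invariance explicit:

```python
import numpy as np

def convergence(W, W_old):
    """Convergence measure (26): 1 - (1/n) * sum_i |<w_i, w_i_old>|.
    Both matrices are row-orthonormal, so the value is 0 at a fixed point
    (up to sign flips of the rows) and positive otherwise."""
    n = W.shape[0]
    return 1.0 - np.abs(np.diag(W @ W_old.T)).sum() / n

# usage: sign flips of rows do not count as change
W = np.eye(3)
W_flip = np.diag([1.0, -1.0, 1.0])
```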
First, we generated a super-Gaussian source from Gaussian random numbers, as follows:
Step 1: A mixture matrix A was selected.
Step 2: We drew N(0, 1) Gaussian pseudo-random numbers.
Step 3: For each random number r, a nonlinear super-Gaussianizing transform was applied to obtain a source sample s.
Step 4: A total of 2000 such s were generated.
Step 5: The time series of s was renormalized to have zero mean and unit variance.
Step 6: A total of n = 20 such super-Gaussian sources were generated.
Step 7: The mixture signal x was generated by Equation (1).
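The steps above can be sketched as follows. The paper's exact transform in Step 3 is not recoverable from this text; cubing, `r ** 3`, is one standard choice that turns N(0, 1) numbers into a heavy-tailed, super-Gaussian signal, and is used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 20, 2000                          # Step 6: n sources; Step 4: T samples each
# Steps 2-3: N(0, 1) numbers pushed through a (hypothetical) transform
S = rng.standard_normal((n, T)) ** 3
# Step 5: renormalize each source to zero mean and unit variance
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
# Steps 1 and 7: mix with a chosen nonsingular A, as in Equation (1)
A = rng.standard_normal((n, n))
X = A @ S
```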
In Figures 3 and 4, both types of RapidICAs outperformed the FastICA. Also, the RapidICA with the variable $\eta_i$ of Equation (21) outperformed the version with a constant $\eta$. This means that the diagonal matrix of step sizes appropriately changed its elements. Since there are n = 20 elements, we computed their average so that the general trend could be more easily seen.
From
1) During the first two iterations, $\eta_i$ was set to zero, since there was no past information.
2) Once the adjustment of $\eta_i$ started, subsequent iterations used this information to adjust the step size and the direction of the increment. The average $\eta_i$ became close to zero as the iteration proceeded.
Digital color images are popular real-world data for ICA benchmarking. First, we checked to see the trend of the convergence by using a single typical natural image. We first present this, and then follow with the average performance on 200 images.
We applied the ICAs to obtain image bases. In this case, the true mixing matrix A was unknown.
Step 1: Image patches of x were collected directly from RGB source images. First, we drew a collection of 8 × 8 size patches. The total number was 15,000.
Step 2: Each patch x was regarded as a 192-dimensional vector (192 = 8 × 8 × 3).
Step 3: By the whitening of Equation (15), the dimension was reduced to 64.
In this experiment, the contrast function of Equations (28) and (29) was used.
It is important for ICA users to understand that there are difficulties in using $\log\cosh$ as a performance measure: its target value cannot be known in advance because of the semi-parametric formulation of the ICA problem. We can see this by comparing the vertical axis levels of Figures 4 and 6. For this reason, the convergence criterion (26) is the most practical one.
By examining
among the three ICA methods, but they are expected to be similar to each other except for the ordering permutation. Figures 8 and 9 are sets of bases obtained by the RapidICA and the FastICA at the 300th iteration. These are raster-scan visualizations of the column vectors of the matrix
Here, $V^{+}$ is a Moore-Penrose generalized inverse of V. The visualized matrix is an estimation of the unknown mixing matrix.
By considering the permutation, we compare two sets of bases
and
by the computation procedure described in
If the similarity measure S is close to its maximum, we can regard the two basis sets as having a similar role. By the computation of S in
We prepared a data set of 200 color images as follows.
As will be seen, the image data are mostly super-Gaussian.
Step 1: Each color image was resized to 150 × 112 pixels.
Step 2: From each image, a set of 16,800 patches of 8 × 8 pixels was drawn by overlapped sampling.
Step 3: Each set of patches was normalized to have zero mean.
Step 4: By the whitening method of Equations (15) and (16), the patch vector’s dimension of 192 = 8 × 8 × 3 was reduced to 64.
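Steps 2 and 3 of this preparation can be sketched as follows for a single image; `sample_patches` is our name, and the random stand-in array only illustrates the shapes (a real resized RGB image would be used in practice).

```python
import numpy as np

def sample_patches(img, patch=8, n_patches=16800, seed=0):
    """Step 2: draw overlapping patch-x-patch RGB patches from one image and
    flatten each to a vector of length patch*patch*3 (here 192).
    Step 3: zero-mean the resulting set."""
    rng = np.random.default_rng(seed)
    height, width, _ = img.shape
    rows = rng.integers(0, height - patch, n_patches)
    cols = rng.integers(0, width - patch, n_patches)
    X = np.stack([img[r:r + patch, c:c + patch].ravel()
                  for r, c in zip(rows, cols)], axis=1)   # (192, n_patches)
    return X - X.mean(axis=1, keepdims=True)

# stand-in for a 150 x 112 RGB image (Step 1)
img = np.random.default_rng(4).random((112, 150, 3))
X = sample_patches(img, n_patches=500)
```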
For the color image data set prepared as described above, we conducted an experiment to compare the RapidICA (Method 3) and the FastICA. The adopted contrast function was log cosh.
The vertical axis is the number of occurrences. The RapidICA outperformed the FastICA in all 200 cases. For the total of 200 ICA trials, the RapidICA required 5349 iterations with a CPU time of 1837 seconds on a conventional PC. On the other hand, the FastICA required 9848 iterations with a CPU time of 3031 seconds. Therefore, the average iteration speedup ratio was 1.67, and the CPU speedup ratio was 1.65. The most frequent speedup ratio for the iterations was 1.80, with a corresponding CPU speedup ratio of 1.78.
Alongside the 200 image experiments, we used the measure of $-\log\cosh$ to check the non-Gaussianity. This is illustrated in
As can be understood from
We prepared a data set of 200 EEG vector time series. As will be seen, EEG data are weakly super-Gaussian. That is, they are almost Gaussian, which can cause difficulties in ICA decomposition, as was stated in the ICA formulation of Section 2.1. The EEG data sets were prepared and preprocessed in the following way:
Step 1: 200 EEG time series were drawn at random from the data of [
Step 2: Each EEG time series was divided into chunks 8 seconds in length. This generated 8000 samples per chunk (1 kHz sampling).
Step 3: Each EEG time series was normalized to have zero mean.
Step 4: By the whitening method of Equations (15) and (16), the EEG channel dimension of 59 was reduced to 32.
In this paper, a class of speedy ICA algorithms was presented. This method, called the RapidICA, outperformed the FastICA. Since the increase in computation is very light, the reduction in iterations directly translates into a CPU speedup. The speedup is drastic if the sources are super-Gaussian, which is the case when the source data are natural images. Since ICA bases play the roles of expressing texture information for data compression and of measuring the similarity between different images, they can be used for similar-image retrieval [
For the case of signals that are nearly Gaussian, such as the EEGs, the RapidICA again outperformed the FastICA in terms of speed. But the margin of improvement was less than in the case of images. This is because cases of multiple Gaussian sources need to be excluded from the ICA formulation, for both the RapidICA and the FastICA. For such cases, if the ICA is to be applied, it is necessary to transform the raw EEG data into strongly super-Gaussian data. This is possible for the EEGs, since the purpose of using these data is only for the detection of event changes in the source signals. This approach will be presented in future papers.
This study was supported by the Ambient SoC Global COE Program of Waseda University from MEXT Japan. Waseda University Grant for Special Research Projects No. 2010B and the Grant-in-Aid for Scientific Research No. 22656088 are also acknowledged.
# settings
iter.max <- 100      # maximum number of iterations
eps      <- 1e-5     # stopping threshold for the convergence measure (26)
p.alpha  <- 1.0      # step size of the fixed-point update
p.beta   <- 1.0      # beta of Equation (21)
p.gamma  <- 1e-6     # gamma of Equation (21); prevents division by zero
# initialization
# (assumed given: d, the number of components; z, the whitened d x T data;
#  G1 and G2, the first and second derivatives of the contrast function;
#  orth(), symmetric orthonormalization)
W   <- diag(1, d, d)
dW1 <- matrix(0, d, d)
dW2 <- matrix(0, d, d)
eta <- matrix(0, d, d)
# iterations
for (p in 1:iter.max) {
  # (learning non-gaussianity)
  Wold <- W
  y <- W %*% z
  dW1 <- -p.alpha *
    diag(1 / rowSums(G2(y))) %*%
    (G1(y) %*% t(z))
  W <- W + dW1
  W <- orth(W)
  # (convergence test)
  conv <- 1 - sum(abs(diag(W %*% t(Wold)))) / d
  if (conv < eps) break
  # (acceleration steps)
  dW2old <- dW2
  dW2 <- W - Wold
  for (i in 1:d) {
    eta.i.num <- max(t(dW2[i, ]) %*% dW2old[i, ], 0)
    eta.i.den <- max(t(dW2[i, ]) %*% dW2[i, ],
                     t(dW2old[i, ]) %*% dW2old[i, ])
    eta[i, i] <- p.beta * eta.i.num / (eta.i.den + p.gamma)
  }
  W <- W + eta %*% dW2   # row i of dW2 scaled by eta[i, i]
  W <- orth(W)
}