Intelligent Information Management, 2011, 3, 87-103
doi:10.4236/iim.2011.33011 Published Online May 2011 (http://www.SciRP.org/journal/iim)

Supervised Fuzzy Mixture of Local Feature Models

Mingyang Xu, Michael Golay
Massachusetts Institute of Technology, Cambridge, USA
E-mail: xumy@alum.mit.edu, golay@mit.edu
Received January 17, 2011; revised April 7, 2011; accepted April 10, 2011

Abstract

This paper addresses an important issue in model combination, that is, model locality. Since a global linear model is usually unable to reflect nonlinearity and to characterize local features, especially in a complex system, we propose a mixture of local feature models to overcome these weaknesses. The basic idea is to split the entire input space into operating domains and to apply a recently developed feature-based model combination method to build a local model for each region. Realizing this idea requires three steps, namely clustering, local modeling and model combination, governed by a single objective function. An adaptive fuzzy parametric clustering algorithm is proposed to divide the whole input space into operating regimes, local feature models are created in each individual region by applying the feature-based model combination method, and finally they are combined into a single mixture model. Correspondingly, a three-stage procedure, which is actually a hybrid Genetic Algorithm (GA), is designed to optimize the complete objective function. Our simulation results show that the adaptive fuzzy mixture of local feature models turns out to be superior to global models.

Keywords: Adaptive Fuzzy Mixture, Supervised Clustering, Local Feature Model, PCA, ICA, Phase Transition, Fuzzy Parametric Clustering, Real-Coded Genetic Algorithm

1. Introduction

Models are useful in both explaining systems and predicting their future behavior. Usually, for a specific system there are many different competing models. In this situation, the question arises of how to improve model performance given all available information. One strategy is to select the single best model from the group of competing models. An alternative is to combine multiple competing models. Both theoretical and empirical work have shown that model combination can lead to better models than does selecting a single model. However, how to aggregate the information contained in candidate models and new observations in an efficient way is still an open research problem. To this end, Xu and Golay [1] proposed a feature-based model combination method which, compared to other combination methods, makes more efficient use of the information embedded in candidate models and extra data.

In this feature-based model combination approach, it is assumed that the true model can be expressed as

f_T(x) = \sum_{i=1}^{N} w_i(x) f_i(x)  (1)

where f_i(x) \in F, the factor set, w_i(x) is its corresponding weight, intensity, or factor loading as in factor analysis [2], N is the total number of factors, and x is an input variable. In a linear case, w_i(x) is constant, independent of x. Correspondingly, a candidate model, as an approximate representation of the system, can be expressed in a similar way; for example, the k-th candidate model is

f_k(x) = \sum_{i=1}^{N_k} w_{ki}(x) f_{ki}(x)  (2)

where f_{ki}(x) \in F_k, with F_k the set of factors for the k-th model and N_k the number of factors in F_k. Note that F_k is a subset of the feature space F.
Assuming the features f_i(x), i = 1, ..., N, are either uncorrelated or independent conditional on the input x, [1] proposes to apply either Principal Component Analysis (PCA) [3] or Independent Component Analysis (ICA) [4] to extract the features and then aggregate them through regression into a new optimal composite model based on data,

f_c(x) = w_0 + \sum_{i \in S_c} w_i f_i(x)  (3)

where S_c denotes the set of selected factors, and the factor weights w_i and the constant w_0 can be determined based upon data.
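To make the feature-based combination concrete, the following sketch extracts shared features from a matrix of candidate-model predictions with PCA or FastICA and then estimates the weights and constant of Equation (3) by regression. It is a minimal illustration, not the implementation of [1]: the function name, the use of scikit-learn, and the fixed number of features are all our own assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.linear_model import LinearRegression

def feature_based_combination(model_outputs, y, n_features=3, method="pca"):
    # model_outputs: (n_samples, n_models) predictions of the candidate models
    extractor = (PCA(n_components=n_features) if method == "pca"
                 else FastICA(n_components=n_features, random_state=0))
    H = extractor.fit_transform(model_outputs)  # extracted features f_i(x), Eqs. (1)-(2)
    reg = LinearRegression().fit(H, y)          # weights w_i and constant w_0, Eq. (3)
    return extractor, reg

# toy usage: five noisy candidate models of a common signal
rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, 200)
models = np.column_stack([s + 0.05 * rng.normal(size=200) for _ in range(5)])
y = 2.0 * s + 0.1 * rng.normal(size=200)
extractor, reg = feature_based_combination(models, y, n_features=2)
y_hat = reg.predict(extractor.transform(models))
```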
However, there are some weaknesses inherent in this approach that limit its application. Among them, two major issues are model locality and nonlinearity. As shown later, it can be unjustified to use a single global linear model to describe a complex system over its entire domain. In fact, this limitation is inherent in the application of global linear PCA or ICA for feature extraction, whose weaknesses have been investigated by Karhunen and Malaroiu [5] and other authors. In order to mitigate these limitations, we propose a mixture of local feature models. Basically, local models are built to approximate the complex system within operating regimes and are then combined by smooth interpolation into a complete global model. The split of the input space enables us to characterize the nonlinearity of the system, although somewhat coarsely, and local component analysis produces different sets of local features, which lead to different local models.

The outline of this paper is as follows: in section 2, the idea of a mixture of local models is proposed and justified in terms of different arguments; an adaptive fuzzy parametric clustering algorithm is presented in order to split the entire input space into sub-domains; local PCA or ICA is then used to extract local features, which constitute local models; finally, a method is proposed to piece the local models together into a mixture model. In section 3, a three-stage optimization algorithm is used to implement the procedure. Both an artificial example and a physical case are presented in section 4 to demonstrate the performance of this new approach. The last section summarizes the paper.

2. Mixture of Local Models

In this section, we introduce a mixture of local models in order to characterize model nonlinearity and locality. Local models are built to approximate the complex system locally and are then combined by smooth interpolation into a complete global model. In order to describe model locality, we introduce the concept of phase transition.

2.1. Why Local Models

In general, it is usually preferred to build a single global model to describe a system's behavior over its entire input space. However, in reality a model might not be able to cover the full range of the input space accurately due to high complexity. This can happen because of the need to describe the interactions between a large number of phenomena that appear within the domain. Rather, a well-defined model may only be appropriate over a specific prescribed subspace, namely its operating regime. For example, a model that fits the data quite well in a specific region may be inaccurate when extrapolated to other regions, because some assumption underlying the model can only be met in a certain range of inputs. Typically, for a complicated system the true model is nonlinear and highly complex, and a global linear model is far from adequate.

It is also conceivable that a physical system may undergo different phase changes that are governed by different underlying laws. This phenomenon is regarded as phase transition. Meanwhile, the set of important features of the system can vary over different domains, or the same set of features may apply but with varying influences. Both cases result in varied local behavior patterns. Correspondingly, this situation can lead to variation of model structures, that is, different models over different regions.
Thus, in the presence of phase transition, it can be difficult to incorporate the properties of distinct phases into a single global model.

Therefore, a single global linear model sometimes cannot describe a nonlinear system adequately or capture all the local features accurately. In order to deal with this nonlinearity and locality, it might be possible to identify the missing hidden variables needed to characterize nonlinearity or phase transition phenomena and to create a comprehensive complex global model. Such variables are called Arrhenius-type terms by Johansen and Foss [6]. However, it is usually difficult to find such extra hidden variables, not only because it requires greater knowledge concerning the system under investigation, but also because even if we do so, we may obtain an overly complicated global model or even an intractable one. Following the philosophy of divide-and-conquer, an alternative, simpler way to capture the locality is to divide the whole input space into several small regions and to perform local analysis in each local region, as in classification and regression trees (CART) and hierarchical mixtures of experts (HME). Local analysis leads to simple local models, each of which characterizes a complicated physical system over a specific regime called an operating regime. The local models are then combined into a global composite model. In contrast to a global model that is valid over the full range of the input space, a local model is valid only in a predefined operating region smaller than the input space. Typically, local models can be considerably simpler because a smaller number of phenomena are relevant and their interactions are simpler [6]. Therefore, this divide-and-conquer principle simplifies the modeling problem by transforming the task of modeling a complex system into one of simpler local modeling, whose results can be combined relatively easily to yield a satisfactory overall model.

Typically, in local learning, models are constructed as linear functions of local features. Then, they are pieced
together to form a global linear mixture model [7]. Although simpler, this mixture model improves the model accuracy because it reduces model bias by specifying the model more properly. By examining the bias/variance tradeoff for local and global learning, Murray-Smith and Johansen [8] show that local learning can be viewed as a simple form of regularization, which produces models with higher accuracy and greater robustness than do global learning methods.

However, as noted by Jordan and Jacobs [9], divide-and-conquer tends to increase the model variance. A remedy is to employ a soft split of the input space. A simple version of a soft partition is to overlap the operating regimes of the local models. Doing this helps smooth the switching between local models and thereby can reduce variance. This will be discussed in more detail as we proceed.

It is noteworthy that local models here are different from those used in local modeling [10], where a parametric function is fitted to data in the neighborhood around a query point x. This can be done by locally weighted regression [11], where the weights in Weighted Least Squares (WLS) depend upon the distance from a data point to the query point x, or by mixtures of local experts [12], where local experts are fitted to all data but not equally well in some local regions. A common drawback of these local learning methods is the complete lack of interpretability of the resultant models.

2.1.1. Phase Transition

Usually, local models are combined into a global model by smooth superposition. The main motivation for this is that the system often has some smoothness properties between regions, i.e., as the operating point changes, the phenomena or behavior change smoothly. However, one may occasionally come across processes that change abruptly. For example, in fluid dynamics a phase transition or flow pattern change can occur [13]. Below, a mixture-of-phases model is introduced to characterize a phase transition. This leads to use of a mixture of overlapping local models.

In general, phase transition means a system undergoes a discontinuous change in association with continuously changing parameters, transforming from one phase to another. For example, a liquid flow changes from laminar flow to turbulent flow as the Reynolds number increases. In the current case, by phase transition we mean that a system undergoes a qualitative change in its underlying local features, thereby leading to the need for different local models.

According to the modern classification scheme, phase transitions fall into two broad categories, namely the first- and second-order phase transitions. Under this scheme, phase transitions are labeled by the lowest derivative of the free energy that is discontinuous at the transition. First-order transitions exhibit a discontinuity in the first derivative of the free energy, or the value of the response variable. In contrast, second-order transitions are continuous in the value of the response variable but not in the second derivative of the free energy.

In this scheme, first-order transitions are associated with "mixed-phase regimes", where some parts of the system have completed the transition and others have not. A typical example of this class of transitions is boiling water: as the temperature increases, the water does not instantly turn into a gas but forms a turbulent mixture of water and water vapor.
Mixed-phase systems are unstable and difficult to study, because their dynamics are violent and hard to control. However, many important phase transitions fall in this category, including the solid/liquid/gas transitions.

In parallel with the soft partition, in the present case we can adopt soft phase boundaries, where multiple phases coexist. This is consistent with the overlapping operating regime concept mentioned earlier. In the coexistence region, the system can be considered to be a random mixture of multiple phases governed by different local models. This stochastic mixture model of phase transitions helps explain instability within the transitional regime: inside the coexistence region, the system behaves like one of the distinct phases with specific probabilities, and the distinct phases are quite different in behavior. Therefore, the expected behavior of the system is simply a probability-weighted average of those of the distinct phases. At the same time, outside the coexistence regime the system is dominated by a single phase, so a local model can be applied deterministically. This statistical mixture model of phase transition is not short of physical evidence. For example, Harrington et al. [14] reported, for a liquid-liquid phase transition, the coexistence of two different phases inside an unstable region.

In addition, we generally have no idea in advance where the transitional regions lie or how wide they are. Thus, their domains have to be estimated based upon data. Conveniently, each transitional region can be represented by two parameters, the onset and the end-point of the phase transition region. In fact, this parametric model is able to characterize both categories of phase transitions, namely first-order and second-order transitions. If the width of the coexistence regime turns out to be nil, it is a first-order phase transition; otherwise, it is a second-order phase transition.

Based upon the above argument, the framework of a mixture of local models is able to model phase transition phenomena via dynamic identification of operating regimes.
2.2. Adaptive Fuzzy Parametric Clustering

In our local model scheme, the input space is split into partitions, which overlap, reflecting the effect of a coexistence region. Each partition corresponds to a distinct operating regime, in which the system can be described by a local model. In implementing this scheme, the first and key problem is how to split the input space. This is actually a problem of clustering. For this purpose, we propose a supervised clustering algorithm.

Clustering can be considered the most important unsupervised learning problem; it is intended to organize objects into groups whose members are similar in some way. Clustering involves the task of dividing data points into homogeneous classes or clusters so that items in the same class are as similar as possible and items in different classes are as dissimilar as possible. Thus, the goal of clustering is to find common patterns or similarity.

The measure of similarity plays an essential role in clustering algorithms. Usually, distance is employed as the similarity criterion. However, this seems inappropriate for the current situation. In our scheme, if two points in the input space belong to the same phase, they are thought of as behaving similarly and hence as being in the same cluster. From the perspective of local features, points in the same cluster have the same underlying local features. Furthermore, since local features are currently extracted from candidate models through either PCA or ICA, similarity in local features is equivalent to similarity in the separating matrix [15]. Therefore, it is more reasonable to cluster data based on the similarity of the mixing matrix W, which is equal to the inverse of the separating matrix. Because of the similarity of points in the same cluster, it is reasonable to treat each cluster independently in the course of feature analysis and building local models.

In trying to meet different needs, many clustering algorithms have been proposed. They can be roughly grouped into two classes, namely hard clustering and soft clustering. Hard clustering assumes exclusive assignment of each datum to its respective cluster, which means that a specific datum belonging to a definite cluster cannot be included in another cluster simultaneously. So, hard clustering results in crisp clusters, where each data point belongs to exactly one cluster. An example of this class is the K-means clustering algorithm. Its application is mainly in pure local piecewise models such as CART [16]. On the contrary, soft clustering, also called overlapping clustering, allows each point to belong to two or more clusters simultaneously. Using the two existing uncertain reasoning techniques, fuzzy set theory and probability theory, this class of clustering algorithms can be further divided into fuzzy clustering and probabilistic clustering. In fuzzy clustering, fuzzy clusters are identified and the data points can belong to more than one cluster with associated membership grades, which indicate the degree to which the data points belong to the different clusters. The fuzzy c-means algorithm, developed by Dunn [17] and improved by Bezdek [18], is one of the most widely used fuzzy clustering algorithms.

Similar to fuzzy clustering, in probabilistic clustering each data point has a specific probability of belonging to a particular cluster. Use of probabilistic reasoning is implied by availability of only a restricted amount of evidence.
A widely used probabilistic clustering algorithm is the mixture of Gaussians, where the well-known Expectation-Maximization algorithm is applied to estimate parameters.

As pointed out in [9], the divide-and-conquer technique tends to increase the variance. A simple remedy to this problem is to apply a soft partition. This can be done in the current case, and thus we favor soft clustering. Another fact making soft clustering appealing is that many systems change behaviors smoothly as a function of inputs, and the soft transition between regimes introduced by the fuzzy set representation characterizes this feature in an elegant fashion. However, neither fuzzy clustering nor probabilistic clustering can be applied directly, because fundamentally they measure similarity based on distance and are separated from the modeling process, which is inappropriate in our case. Meanwhile, in comparison to probabilistic algorithms, fuzzy clustering is more natural and flexible in the current case. Consequently, we propose a new adaptive fuzzy clustering algorithm below to identify different phases over the entire input space. The characteristics of the new fuzzy clustering algorithm are described in detail in the following discussion.

1) Fuzzy clustering

A natural way to interpret overlapping operating regimes is to apply fuzzy set theory, where an operating point can fall into two or more operating regimes simultaneously. According to our scheme, in the overlapping regions multiple local models might be relevant, while outside the coexistence regions only one local model, called the dominant local model, is valid. The simple trapezoidal fuzzy membership function is specifically suitable for characterizing such operating regimes. Furthermore, the choice of a trapezoidal membership function produces more interpretable local models than other functions like Gaussians [15].

In order to be consistent with the concept of mixture modeling and superposition, a constraint on the fuzzy membership functions is imposed, which requires that at
any point x, \sum_{m=1}^{M} \mu_m(x) = 1, where \mu_m(x) is the membership degree of the m-th cluster. This results in smooth transitions between operating regimes.

Clusters or operating regimes are represented by fuzzy sets. A typical fuzzy clustering with three overlapping operating regimes is depicted in Figure 1, where [a, b] represents the overlapping region between clusters I and II and likewise [c, d] represents the overlapping region between clusters II and III. Any point in the input space can belong to multiple clusters with simultaneous memberships. From another, practical angle, the membership can be interpreted as describing how likely it is that an observation is generated by a specific local model. This type of membership function is consistent with the nature of phase transition and is able to model both classes of phase transition. It also keeps the interpretability of each local model corresponding to one cluster. Thus, we can argue that this type of membership function is a natural choice for physical models.

2) Supervised clustering

In our fuzzy clustering scheme, the fuzzy membership functions are parameterized by the number of clusters used and the locations of the splitting points.

As usual, the main task of fuzzy clustering is to identify fuzzy sets characterized by parametric fuzzy membership functions. For each cluster, the parameters include the locations of the boundaries and their widths. The locations of the boundaries can be chosen such that the similarity within a cluster is maximized, while the patterns of different clusters are as dissimilar as possible. Only then can the system within each cluster be better represented by a local model and the bias decreased.

The overlap, or the width of the coexistence region, plays a major role in smoothing the transition between local models. [8] further argues that overlap has a regularizing effect on the ill-conditioning of a learning problem, and that the level of overlap determines the amount of regularization. A high level of overlap leads to a high level of correlation between neighboring local models and decreased transparency of the local models, i.e., compatibility with the understanding of the system [6]. A low level of overlap results in an abrupt transition between local models. Hence, the optimal degree of overlap and softness depends upon the modeling problem through the objective function.

This algorithm is called adaptive in the sense that, in addition to the separators, the number of regions is determined based upon the data. If the number of clusters is not large enough, the nonlinearity of the system cannot be described adequately. On the other hand, an increasing number of operating regimes increases the model complexity. The overall effect of an increasing number of local models depends upon whether the decrease in bias is more significant than the increase in variance. Thus, the number, location and overlap of the operating regimes should be tuned dynamically so as to reach optimal values. These are determined by objective functions. In so doing, it can be ensured that there is an adequate amount of data within each operating regime to obtain a good local model.

3) Single global objective function

The goal of modeling processes is to minimize the predictive error. Likewise, local modeling also aims to minimize the generalization error across the entire input space. Both the splitting of the input space and the building of local models should be determined by this overall goal.
Nevertheless, in most previous work, such as local PCA [19], local ICA [20], and the use of local models [7], clustering is treated as an optimization problem separate from local learning and from obtaining a global mixture. This causes sub-optimality. In this paper, we optimize both problems jointly by incorporating them within a single objective function, which reflects the overall goal of minimizing the global generalization error. This goal can be realized in two steps, namely estimation of the clustering parameters given the number of clusters, and estimation of the number of clusters. The first step can be done by minimizing the empirical error. Until now, we have not discussed how to create local models and the global mixture model. Suppose the global mixture model to be f_g(x, \theta, \beta), where \theta denotes the clustering parameters, \beta refers to the other, local model parameters, and x is the model input. The partial objective can then be expressed as

(\hat{\theta}, \hat{\beta}) = \arg\min_{\theta, \beta} \sum_{i=1}^{n} (y_i - f_g(x_i, \theta, \beta))^2  (4)

where (x_i, y_i) denotes a pair of observations. Note that the squared loss is applied, which is equivalent to use of the MLE method under the assumption of normally distributed data.

From the partial objective function in Equation (4), it is seen that the input space is split so as to minimize the empirical error. In this sense, this clustering algorithm is supervised.

Figure 1. Illustration of the fuzzy clustering concept: three operating regimes (I, II, III) with transition areas [a, b] and [c, d] where the trapezoidal memberships overlap.
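The trapezoidal memberships of Figure 1 can be sketched as follows: each overlap interval such as [a, b] contributes a linear ramp, and the memberships sum to one at every point, as the constraint above requires. This is an illustrative reading of the scheme, with boundary values chosen arbitrarily, not the authors' code.

```python
import numpy as np

def trapezoid_memberships(x, boundaries):
    """Membership degrees for M regimes separated by overlap intervals
    [(a1, b1), (a2, b2), ...]; memberships sum to 1 at every x."""
    M = len(boundaries) + 1
    mu = np.zeros((len(x), M))
    for m in range(M):
        lo = boundaries[m - 1] if m > 0 else None   # overlap with regime to the left
        hi = boundaries[m] if m < M - 1 else None   # overlap with regime to the right
        w = np.ones_like(x, dtype=float)
        if lo is not None:
            a, b = lo
            w = np.minimum(w, np.clip((x - a) / (b - a), 0.0, 1.0))  # ramp up over [a, b]
        if hi is not None:
            a, b = hi
            w = np.minimum(w, np.clip((b - x) / (b - a), 0.0, 1.0))  # ramp down over [a, b]
        mu[:, m] = w
    return mu

x = np.linspace(0.0, 10.0, 101)
mu = trapezoid_memberships(x, [(3.0, 4.0), (7.0, 8.0)])  # three regimes, two overlaps
assert np.allclose(mu.sum(axis=1), 1.0)                  # the constraint holds everywhere
```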
Therefore, through the same objective function, fuzzy clustering is closely connected to the modeling process. This is exactly in agreement with [6], which holds that the creation of local models should not be separated from the choice of operating regimes. In this respect, it is also similar in spirit to the Mixture of Experts (MoE), where probabilistic clustering is mixed with learning.

Nevertheless, the partial objective function in Equation (4) does not involve the number of clusters, which is in fact another important part of our fuzzy clustering. This is because a different number of operating regimes leads to varied model structures, which cannot be reflected in the empirical error. Following the principle of parsimony in model selection, we use a complete objective function, which incorporates the effect of the number of clusters.

The divide-and-conquer principle reduces the model bias by specifying the model structure more properly, but the variance is increased at the same time, because with an increasing number of local models (increasing model structure) more parameters need to be estimated. This leads to larger variance in the parameter estimates. This phenomenon is well known in the statistics literature as the bias/variance tradeoff (see e.g. [21]). On one hand, if the input space is split into too many regions, overfitting will occur; on the other hand, use of too few regions may not capture the structure in enough detail and thus could lead to underfitting. The task here is to find the optimal balance point within the bias/variance tradeoff given a finite number of samples. To this end, generally two different strategies can be employed, namely model selection or regularization. In the current case, our purpose is to choose the optimal number of operating regimes, and thus model selection appears more appropriate.

With an increasing number of operating regimes, local models can fit the data much better, but the model complexity is also increased, which usually leads to deteriorating generalization. Generally, the higher the model complexity, the smaller the bias but the larger the variance. Most model selection criteria realize Occam's razor by penalizing the goodness-of-fit with increasing model complexity, thereby minimizing the generalization error. Among all model selection methods, the information-theoretic criteria like Akaike's Information Criterion (AIC) [22] and the Bayesian Information Criterion (BIC) [23] have a close connection to the maximum likelihood method, which seems to many statisticians an advantage. As a result, these information criteria can be easily applied in many circumstances without any additional computation. In addition, some of these information criteria, including AIC and BIC, can be justified within a Bayesian framework, which is also viewed by many statisticians as another big advantage (see [22] and [23]). Therefore, in this paper we will only utilize the BIC for the purposes of demonstration.

BIC was first derived by Schwarz in a Bayesian context with a uniform prior probability on each competing model and priors with everywhere-positive densities on the model parameters of each model. Choosing the model dimensionality with the highest posterior probability leads to the BIC criterion of Schwarz [23],

BIC = -2 \log L(x | \hat{\theta}) + k \log n,  (5)

where L(x | \hat{\theta}) is the likelihood function of the data x at the maximum likelihood estimate \hat{\theta}, k is the number of parameters or model dimension, and n is the sample size.
Note that the first term on the RHS comes directly from the maximum likelihood and the second term is a complexity penalty term.

Assuming that the errors in the data are Gaussian and homogeneous over the operating regimes, we can obtain the BIC formula as

BIC = n \log(2\pi) + n \log(RSS/n) + n + k \log n,  (6)

where RSS is the empirical sum of squared errors, i.e., RSS = \sum_{i=1}^{n} (y_i - f_g(x_i, \theta, \beta))^2.

In the current situation, where any data point can fall into more than one cluster simultaneously and the Gaussian errors are heterogeneous over operating regimes, the likelihood function can be written, similarly to the mixture of Gaussians (cf. [9]), as

L(y | x, \theta, \beta) = \prod_{i=1}^{M} \prod_{j=1}^{n} L_i(y_j | x_j)^{\mu_{ji}}  (7)

where M denotes the number of clusters or operating regimes, \mu_{ji} refers to the membership of the j-th data point in the i-th cluster, and L_i(\cdot | \cdot) denotes the likelihood function for the local model of the i-th cluster. Thus, the log likelihood becomes

l(y | x, \theta, \beta) = \log L(y | x, \theta, \beta) = \sum_{i=1}^{M} \sum_{j=1}^{n} \mu_{ji} \log L_i(y_j | x_j)  (8)

where the parameters, including \theta and \beta, are estimated by MLE.

Assuming that the local models are linear, i.e., f_i(x) = \beta_i^T x, and the errors are Gaussian, i.e., y_j \sim N(f_i(x_j), \sigma_i^2), the log likelihood for a local model can be expressed as

l_i(y | x) = \sum_{j=1}^{n} \mu_{ji} \log L_i(y_j | x_j) = -\frac{1}{2} \sum_{j=1}^{n} \mu_{ji} \left[ \log(2\pi) + \log \sigma_i^2 + \frac{(y_j - \beta_i^T x_j)^2}{\sigma_i^2} \right]  (9)
and therefore

\hat{\beta}_i = \arg\min_{\beta_i} \sum_{j=1}^{n} \mu_{ji} (y_j - \beta_i^T x_j)^2  (10)

and the variance for the i-th local model can be estimated as

\hat{\sigma}_i^2 = \frac{\sum_{j=1}^{n} \mu_{ji} (y_j - \hat{\beta}_i^T x_j)^2}{\sum_{j=1}^{n} \mu_{ji}} = \frac{WRSS_i}{n_i}  (11)

where the weighted RSS is obtained as WRSS_i = \sum_{j=1}^{n} \mu_{ji} (y_j - \hat{\beta}_i^T x_j)^2 and n_i = \sum_{j=1}^{n} \mu_{ji}.

Substituting the likelihood function into the BIC formula produces

BIC = \sum_{i=1}^{M} n_i \left[ \log(2\pi) + \log \hat{\sigma}_i^2 + 1 \right] + k(M) \log n.  (12)

Finally, we obtain the complete objective function as

(\hat{M}, \hat{\theta}, \hat{\beta}) = \arg\min_{M, \theta, \beta} \left\{ \sum_{i=1}^{M} n_i \left[ \log(2\pi) + \log \hat{\sigma}_i^2 + 1 \right] + k(M) \log n \right\}  (13)

where M refers to the number of operating regimes and the model dimensionality k(M) is a function of it.

Usually, the penalty term is equal to the number of free parameters that need to be estimated based upon the data. In the current case, the model parameters include clustering parameters and local model parameters. Therefore, the model complexity can be evaluated by

k(M) = 2(M - 1) + \sum_{m=1}^{M} p_m  (14)

where the first term refers to the number of clustering parameters specifying the boundary positions and p_m denotes the model dimensionality of each local model. However, as the number of operating regimes varies, the complexity of the local models does not change in terms of model structure and number of parameters. Therefore, for the purpose of choosing the number of clusters, the appropriate model complexity can be expressed as

k(M) = 2(M - 1).  (15)

Note from the above that the purpose of model selection is only to determine the optimal number of operating regimes, because the penalty term only depends upon the number of free parameters rather than upon their values. Through the complete objective function, this algorithm determines not only the regime locations, sizes and overlaps, but also the number of regimes. The number of clusters depends upon the sample size. Its upper bound should be such that in each cluster the number of data points having non-zero membership is greater than the number of features. Note that such an objective function favors parsimonious models rather than the under- or over-parameterized model structures obtained by optimizing only the number of local models.
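As a sketch of how the complete objective of Equations (12)-(15) might be evaluated, the following assumes the memberships and local predictions are already available as arrays; the function and argument names are ours, not the paper's.

```python
import numpy as np

def bic_objective(y, preds, mu, n_boundary_params=2):
    """Complete BIC objective of Eq. (12): weighted Gaussian log-likelihood
    per cluster plus a complexity penalty. preds[:, i] holds the i-th local
    model's predictions and mu[:, i] the fuzzy memberships mu_ji."""
    n, M = mu.shape
    total = 0.0
    for i in range(M):
        n_i = mu[:, i].sum()                              # effective sample size
        wrss = np.sum(mu[:, i] * (y - preds[:, i]) ** 2)  # weighted RSS
        sigma2 = wrss / n_i                               # Eq. (11)
        total += n_i * (np.log(2 * np.pi) + np.log(sigma2) + 1.0)
    k = n_boundary_params * (M - 1)                       # Eq. (15): clustering parameters only
    return total + k * np.log(n)
```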
2.3. Local Analysis

Similar to the global feature-based model combination in [1], local analysis includes two steps, namely extracting local features through local component analysis and creating local models by aggregation.

2.3.1. Local Component Analysis

ICA is a successful technique for reducing statistical dependence, and hence redundancy, between the candidate models. Dimension reduction is also achieved by eliminating a subset of independent components without significant loss of information.

PCA [3,25] is another effective dimension reduction technique, which relies only on second-order statistics and helps remove linear dependencies. As a result, the principal components, although uncorrelated, can be highly statistically dependent. In contrast, ICA takes into account higher-order statistics and thus is able to capture nonlinear dependency. Therefore, ICA can produce a more compact representation of the data and outperforms PCA in terms of statistical redundancy as well as dimension reduction. However, PCA is much easier to use than ICA, and in some cases where no significant nonlinear dependencies are involved, PCA can produce satisfactory results. Thus, although in the following we will mainly focus on local ICA, the arguments are also directly applicable to local PCA.

Standard ICA still has some limitations, namely its linearity and globality. First, standard ICA assumes that the data x are a linear superposition of independent components s, i.e.,

x(t) = W s(t)  (16)

where x(t) = [x_1(t), ..., x_n(t)]^T, s(t) = [s_1(t), ..., s_m(t)]^T, and W is the mixing matrix. However, this linearity assumption does not hold in general, even though in many cases linear ICA delivers useful results. For general nonlinear data structures, it can provide only a crude approximation and cannot describe nonlinear characteristics adequately. In view of this drawback, in recent years researchers have attempted to generalize linear ICA to nonlinear ICA, as

x(t) = f(s(t))  (17)

where f: R^m \to R^n can be an arbitrary nonlinear mixing function.
Second, standard ICA tries to describe all of the data using a single group of independent components called global features. This means that the mixing matrix W in Equation (16) and the corresponding separating matrix are assumed to be the same over the entire region. However, physical systems usually have varying characteristics, and thus varying mixing matrices in qualitatively different domains of the entire domain. This calls for using different local features in each domain in order to obtain an efficient representation.

In order to overcome the limitation of global linearity, researchers have recently proposed nonlinear ICA, as in Equation (17). However, the difficulty with nonlinear ICA is that the solution of the nonlinear ICA problem is usually highly non-unique (see e.g. [25]) and computationally rather demanding [20]. As a nonlinear extension of ICA, we propose to use local linear ICA, in which the data space is first partitioned into disjoint regions and ICA is then performed within each cluster. Local linear ICA can provide an approximation of nonlinear ICA because, using the Taylor expansion, the nonlinear mixing function f(·) in Equation (17) can be approximated locally at any point by linear functions. By choosing the number of regions adaptively, the nonlinear characteristics can be represented accurately given limited observations. At the same time, linear ICA is utilized to extract local features within each more homogeneous domain. Thus, multiple sets of local features rather than global features are produced. Therefore, local ICA can overcome some weaknesses of linear ICA while avoiding the problems associated with general nonlinear ICA.

Local ICA usually works in conjunction with a suitable clustering algorithm, which is responsible for partitioning the data space into clusters. For example, [20] proposed to use the k-means clustering algorithm, and [15] suggested using fuzzy c-varieties clustering [18], which partitions the data space based on the similarity of the mixing matrix. In our case a different, adaptive fuzzy clustering is proposed (see section 2.2), which is embedded in the local modeling process rather than standalone. Given the clustering, PCA or the FastICA algorithm [4] is applied in each cluster to extract local features. It would seem more appropriate to employ weighted ICA based upon the fuzzy memberships, because even intuitively the points having smaller fuzzy memberships, in the coexistence region for example, should have smaller influence upon feature extraction. However, since the coexistence regions are not very wide, for simplicity we apply the standard algorithms in each cluster.
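A minimal sketch of this local component analysis step: one extractor is fitted per regime, using scikit-learn's FastICA or PCA. For simplicity, points are hard-assigned to the regime of largest membership, which is our own simplification of the weighting issue just discussed.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

def local_features(X, mu, n_components=2, method="ica"):
    """Fit one feature extractor per operating regime. Points are assigned
    to the regime of largest membership; weighting by membership, as the
    paper suggests, is skipped for simplicity."""
    labels = mu.argmax(axis=1)
    extractors = []
    for m in range(mu.shape[1]):
        Xm = X[labels == m]                       # data falling in regime m
        ext = (FastICA(n_components=n_components, random_state=0)
               if method == "ica" else PCA(n_components=n_components))
        ext.fit(Xm)                               # local mixing/separating matrices
        extractors.append(ext)
    return extractors
```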
2.3.2. Build Local Models

Once local features are extracted, we can proceed to construct local models, each pertaining to one of the overlapping operating regimes of the input space. Since each local model is only relevant to one cluster, it is reasonable to train local models using the data points belonging to that cluster. Thus, before building local models, all observations need first to be assigned to their respective fuzzy clusters. This constitutes a fuzzy multi-category classification problem. Because the membership functions \mu_j(x) of all fuzzy clusters are known from the clustering step, the classification task is simply to evaluate the membership of each data point in all clusters, that is, to obtain

\mu_{ij} = \mu_j(x_i)  (18)

which specifies the influence of each data point in building the local models pertaining to the different operating regimes.

The creation of local composite models follows the same method as in [1]. They are built as multiple linear regression models,

f_m(x) = \sum_{j=1}^{p_m} \beta_{mj} h_{mj}(x)  (19)

where the h_{mj}(x) are the local features in the m-th fuzzy operating regime. Taking into account the varied influence of the data points, the parameters of the local models can be estimated by weighted least squares as

\hat{\beta}_m = \arg\min_{\beta_m} \sum_{i=1}^{n} \mu_{im} (y_i - f_m(x_i))^2,  (20)

which further encourages the locality of the local models.

Local feature selection is an integral part of building local models. The purpose of feature selection is to eliminate non-informative features and noise and to remove redundant information, thereby reducing model dimensionality. Since the multiple linear regression method is used in constructing the local linear models, feature selection is actually a variable selection problem. Feature selection results in parsimonious models, which are known to yield improved generalization.

From the above, it is easy to see that the building of each of the local models is separated from that of the others, except for the effects of the overlapping domains. Thus, local feature selection can also be performed separately in each operating regime. As a result, local feature selection only depends upon the empirical errors of the observations falling into a certain cluster.
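The weighted least squares estimate of Equation (20) has the usual closed form, which can be computed by rescaling the design matrix; a sketch, with hypothetical names, follows.

```python
import numpy as np

def fit_local_model(H, y, w):
    """Weighted least squares for one regime (Eq. 20): H holds the local
    features h_mj(x_i) as columns, w the fuzzy memberships mu_im."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * H, sw * y, rcond=None)
    return beta  # regression coefficients beta_mj of Eq. (19)
```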
2.4. Combine Local Feature Models

Once local models are built for all clusters, they can be combined into a global mixture model, or so-called operating-regime-based model according to [26], based upon our phase transition model. Based upon the fuzzy membership functions, the final global mixture model can easily be formulated as a fuzzy weighted average model,

f_g(x) = \sum_{m=1}^{M} \mu_m(x) f_m(x)  (21)

where the fuzzy membership function \mu_m(x) characterizes the operating regime of the m-th local model f_m(x). Each local ICA model can be expressed as

f_m(x) = \sum_{j=1}^{p} \beta_{mj} h_{mj}(x)  (22)

where h_{mj}(x) denotes the j-th local feature for the m-th cluster, \beta_{mj} is its corresponding regression coefficient, and p denotes the number of features.

In the current case a simple trapezoidal membership function is applied to represent the fuzzy operating regimes, and for any point x the constraint \sum_{m=1}^{M} \mu_m(x) = 1 is imposed, so the mixture of local models turns out to be simple. If, for a given point x outside the unstable phase transition regions, some \mu_m(x) = 1, the m-th local model f_m(x) is dominant and thus f_g(x) = f_m(x); otherwise, if x is inside a transition region, two different phases characterized by two different local models coexist. Based upon our stochastic modeling of phase transition, the mixture model f_g(x) is then constructed as a weighted linear superposition of the local models having nonzero membership,

f_g(x) = \mu_i(x) f_i(x) + \mu_j(x) f_j(x).  (23)

An example with two operating regimes is shown in Figure 2. From this example, it can be seen that the global mixture model is continuous. In fact, this is true as long as the width of the coexistence region is not equal to zero because, for example, at the onset x_a = a of the transition region,

f_g(x_a) = \mu_1(x_a) f_1(x_a) + \mu_2(x_a) f_2(x_a) = f_1(x_a),  (24)

where \mu_1(x_a) = 1 and \mu_2(x_a) = 0. Nevertheless, in general the first derivatives are not continuous. This is because with trapezoidal membership functions we have the result

f_g'(x_a^+) = \mu_1'(x_a) f_1(x_a) + \mu_1(x_a) f_1'(x_a) + \mu_2'(x_a) f_2(x_a) + \mu_2(x_a) f_2'(x_a) = f_1'(x_a) + c [f_2(x_a) - f_1(x_a)] \neq f_1'(x_a) = f_g'(x_a^-),  (25)

where \mu_1'(x_a^+) = -c and \mu_2'(x_a^+) = c.

Figure 2. An example with two operating regimes separated by the transition region [a, b]: (a) local models f_1(x), f_2(x); (b) the global mixture model f_g(x).
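Putting the pieces together, Equation (21) is just a membership-weighted sum of local predictions. The sketch below reuses trapezoid_memberships from the earlier sketch; the two linear local models are arbitrary illustrations, not the paper's examples.

```python
import numpy as np

def predict_mixture(mu, local_preds):
    """Fuzzy weighted average of Eq. (21): f_g(x) = sum_m mu_m(x) f_m(x).
    mu[:, m] are memberships, local_preds[:, m] local-model predictions."""
    return np.sum(mu * local_preds, axis=1)

x = np.linspace(0.0, 10.0, 101)
mu = trapezoid_memberships(x, [(4.0, 6.0)])   # two regimes, overlap [a, b] = [4, 6]
f1 = 1.0 + 0.5 * x                            # illustrative local model, regime I
f2 = 8.0 - 0.2 * x                            # illustrative local model, regime II
fg = predict_mixture(mu, np.column_stack([f1, f2]))
# fg equals f1 below x = 4, f2 above x = 6, and blends linearly in between
```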
3. Three-Stage Optimization Algorithm

For the problem under investigation, we need to jointly optimize the number of operating regimes, the locations of the boundaries and the parameters of the local models. This is analogous to the use of adaptive regression splines with free knots (cf. [27,28]). Splines are piecewise polynomial functions that are constrained to join smoothly at points called knots. In particular, a free-knot spline is a spline where the knot locations are considered parameters to be estimated from the data. Freeing the knots greatly improves the spline's approximating power [29]. However, it poses a very difficult problem, that of estimating the optimal number of knots and their locations, which is similar to our problem.

But there are important differences between them. First, our model is more flexible, having no smoothness constraints, which, on the other hand, leads to use of more free parameters. Second, in our model the regressors depend nonlinearly upon the boundary locations, while for splines the regressors are fixed as polynomials. Therefore, we expect the current optimization problem to be even more difficult than that of free-knot splines.

The difficulty lies in some undesirable characteristics of the complete objective function in Equation (13). In order to illustrate its properties, we can substitute f_g(x, \theta, \beta) from Equation (21) and rewrite it as

\hat{M} = \arg\min_M \min_\theta \min_\beta \left\{ n \log \sum_{i=1}^{n} \left( y_i - \sum_{m=1}^{M} \mu_m(x_i, \theta) \sum_{j=1}^{p_m} \beta_{mj} h_{mj}(x_i) \right)^2 + k(M) \log n \right\}.  (26)

First, it is a complex nonlinear function of the boundary locations \theta, because the membership functions and the local independent components depend nonlinearly upon \theta and they appear inside the square.

Second, it is non-differentiable, partly because of the non-differentiable membership functions. Furthermore, the explicit dependence of the local features, or equivalently the separating matrices, on the fuzzy clustering parameters cannot be known. Because of this non-differentiability, all gradient-based optimization algorithms, such as steepest descent, the Newton-Raphson method and conjugate gradient methods, will fail. However, some numerical search algorithms are still applicable.

Finally, the objective function is not strictly convex or concave but has many local optima. This can be seen by applying the "lethargy" theorem introduced by Jupp [27]. As with free-knot splines [27], the existence of multiple optima in the objective surface is related to the symmetry introduced by the exchangeability of the boundary parameters. For example, in a simple case with two clusters, the objective surface is symmetric along any normal to the line defined by two equal parameters. Consequently, the derivative along the normal at its intersection with the equal-parameter line is equal to zero. This property, called "lethargy" by [27], results in many stationary points and ridges along lines or planes in the parameter space where two or more parameters coincide. This property results in the failure of all local optimization algorithms, including gradient-based methods, line search and hill climbing, because they easily fix upon local optima, the locations of which depend upon the initialization. In order to overcome this problem, global optimization algorithms, including simulated annealing and genetic algorithms, should be applied instead.

These are some of the interrelated reasons that make it so difficult to locate the global optimum. Since there are many local optima in the objective surface, use of good starting parameter values is essential for finding the global optimum. Unfortunately, this is usually difficult. One possible way is to construct starting values based upon the data. First, we sort the inputs x and split the input range into segments s_1 through s_{n-1}. Clearly, each parameter \theta_i can be placed within any one of the segments. Then, the entire parameter space of the vector \theta is divided into (n-1)^{2M-2} subspaces. If we pick an initial value for \theta within each subspace, we will eventually find a local optimum within that subspace. Comparing all of these local optima will give us an approximate global optimum. However, because we cannot make sure there is only one local optimum within each subspace, the globality of the best identified optimum is not guaranteed. Furthermore, this procedure is computationally very expensive because there are O(n^{n/k}) possible initial values in total.

Originally developed by Holland [30], the genetic algorithm (GA) is a global stochastic search algorithm, which is less susceptible to getting fixed upon local optima than are gradient search methods. On the other hand, GAs tend to be computationally expensive. In practice, genetic algorithms work very well on mixed (continuous and discrete) combinatorial problems. For example, Pittman [31] suggests using GAs to optimize the knot locations in adaptive splines.
Here we propose a similar hybrid genetic algorithm, which combines global optimization with a local search. It differs from that in [31] in that a distinct genetic chromosomal representation and correspondingly different genetic operators are defined. Furthermore, in this scheme GAs are only used to identify good starting values for the local search rather than the global optimum. Doing this significantly reduces the computational time otherwise required by the slow convergence of GAs.

On the whole, following a strategy of problem splitting, the optimization problem can be solved in three stages.

First, we optimize the local model parameters given the fuzzy clusters. In order to encourage competition and locality of the local models, we can approximate the original objective function with a slightly different one,

\hat{\beta} = \arg\min_\beta \sum_{i=1}^{n} (y_i - f_g(x_i, \theta, \beta))^2 \approx \arg\min_\beta \sum_{m=1}^{M} \sum_{i=1}^{n} \mu_{im} (y_i - f_m(x_i))^2.  (27)

The only difference between the two objective functions arises in the coexistence regions. For convenience, denote by E_i the expectation over clusters with weights \mu_{im}, i.e., E_i[g_m] = \sum_{m=1}^{M} \mu_{im} g_m, so that f_g(x_i) = E_i[f_m(x_i)]. Using the constraint \sum_m \mu_{im} = 1, the weighted empirical error then decomposes as

\sum_{m=1}^{M} \mu_{im} (y_i - f_m(x_i))^2 = (y_i - f_g(x_i))^2 + \sum_{m=1}^{M} \mu_{im} (f_m(x_i) - f_g(x_i))^2  (28)

where the second term is usually small within the coexistence regions (and vanishes outside them, where a single membership equals one).

In fact, another advantage of this change is that it makes the optimization problem much easier, because the membership function is moved out of the square operator. Moreover, with this change the local models can be
built independently of each other (consistent with section 2.3.2).

At this stage the local model parameters \beta can be optimized as functions of \theta, which can easily be done by WLS as described in section 2.3.

Second, we need to identify the best fuzzy clustering given the number of clusters, by writing the sub-optimization problem as

\hat{\theta} = \arg\min_\theta \sum_{i=1}^{n} \sum_{m=1}^{M} \mu_{im}(x_i, \theta) (y_i - f_m(x_i, \hat{\beta}_m))^2 = \arg\min_\theta F(\theta)  (29)

where \theta = (\theta_1, \theta_2, ..., \theta_{2M-2}) with x_{min} \le \theta_1 \le \theta_2 \le ... \le \theta_{2M-2} \le x_{max}.

Note that under the assumption of Gaussian errors, the fuzzy clustering parameters are actually quantified by the Maximum Likelihood Estimator. Nevertheless, a well-known problem with MLE is the danger of overfitting. For a simple example, suppose we have two operating regimes, the number of data points falling in the first regime is equal to the number of candidate models, and all others fall in the other. In this case, the data can be fitted perfectly in the first regime and a little better in the other regime than in the single-regime case. As a result, the maximum likelihood is increased dramatically, but the resultant model most likely becomes worse. In order to overcome this pitfall, we apply a cross-validation approach, using the testing likelihood in place of the maximum likelihood. Correspondingly, the sub-optimization problem can be expressed as

\hat{\theta} = \arg\min_\theta \sum_{i=1}^{n_t} \sum_{m=1}^{M} \mu_{im}(x_{ti}, \theta) (y_{ti} - f_m(x_{ti}, \hat{\beta}_m))^2 = \arg\min_\theta F_t(\theta)  (30)

where the subscript t denotes the out-of-sample test data.

The sub-objective function F_t(\theta) in Equation (30) has all of the unpleasant properties mentioned earlier. In order to address this challenge, we propose a hybrid optimization algorithm combining GAs with multi-dimensional hill climbing. After performing both the global and local optimization procedures, a group of good candidate solutions is obtained, from which the solution with the smallest F_t(\theta) can easily be chosen as the optimal one.

The last stage is to choose the optimal number of local models (a model selection problem). Since we require that in each operating regime the number of data points be greater than the number of local features, there exists an upper bound on the number of local models, M_m, which is much smaller than the sample size n. Thus, this optimization problem can be expressed as

\hat{M} = \arg\min_{1 \le M \le M_m} \left\{ n \log \sum_{i=1}^{n_t} (y_{ti} - f_g(x_{ti}, \hat{\theta}, \hat{\beta}))^2 + k(M) \log n \right\} = \arg\min_{1 \le M \le M_m} G(M).  (31)

In order to determine the optimal number of clusters, we repeat the first two stages of our procedure for M ranging from 1 to M_m, and finally we choose the value that corresponds to the minimal model selection criterion value in Equation (31). In practice, a forward stepwise process can be utilized, which increases the value of M from 1 and stops when the model selection criterion value increases.
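This third stage reduces to a simple forward stepwise loop over M, sketched below; fit_mixture is a hypothetical stand-in for stages one and two, assumed to return G(M) of Equation (31).

```python
def select_num_regimes(fit_mixture, M_max):
    """Forward stepwise search over the number of regimes: stop as soon as
    the selection criterion G(M) of Eq. (31) increases. fit_mixture(M) is
    a hypothetical routine running stages one and two for a given M."""
    best_M, best_G = 1, fit_mixture(1)
    for M in range(2, M_max + 1):
        G = fit_mixture(M)
        if G >= best_G:
            break                     # criterion worsened: keep the previous M
        best_M, best_G = M, G
    return best_M, best_G
```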
3.1. Real-Coded Genetic Algorithm (GA)

In order to apply a real-coded genetic algorithm to facilitate the search for a global optimum, we need to design problem-specific fitness functions and operators.

The first crucial issue in using a GA is to define a proper fitness function, which indicates the quality of a candidate solution. Usually, a GA works by maximizing the fitness, but here we intend to minimize the objective function, or equivalently to maximize the likelihood. Therefore, the fitness function can be constructed from the objective function as

fitness_k = \max_j F_t^j - F_t^k,  (32)

where F_t(\theta) is the objective function in Equation (30), \max_j F_t^j stands for the maximum F_t in a population, and fitness_k refers to the fitness of the k-th individual chromosome.

The definition of fitness significantly influences the convergence behavior. For example, in the early stage a few "super individuals" tend to dominate the selection process, leading to premature convergence, whereas later, when the population is less diverse, the simulation tends to lose focus [32]. Therefore, in practice we would like to apply a more general and flexible fitness function obtained by scaling and shifting, i.e.,

fitness_k = a (\max_j F_t^j - F_t^k) + b,  (33)

where the scaling factor a and the shifting factor b are adjusted adaptively during the simulation to avoid premature convergence early on and to encourage convergence later. In our simulation, we choose b to be the minimum fitness and a to be the reciprocal of the average fitness.

As for selection, we utilize the fitness-weighted roulette wheel method, which is conceptually equivalent to giving each individual a slice of a roulette wheel equal in
area to the individual's fitness. The wheel is spun and the ball comes to rest on a wedge-shaped slice, by which means the corresponding individual is selected. Therefore, the probability of a chromosome being chosen is proportional to its fitness. A pair of "parents" is selected by spinning the wheel twice, to reproduce a pair of "children" via recombination and mutation.

Notably, GA success is also sensitive to the nature of the recombination and mutation operators. For example, it has been found that general, fixed, problem-independent recombination operators often break partial solutions and cause slow convergence. In order to avoid such problems, we design problem-specific crossover and mutation operators.

The recombination strategy that we applied is one-point arithmetic crossover. Let the parents be P_1 = (P_1^1, ..., P_1^L) and P_2 = (P_2^1, ..., P_2^L), respectively. Then, the two offspring are obtained as

C_1 = (P_1^1, ..., P_1^t, \lambda P_1^{t+1} + (1-\lambda) P_2^{t+1}, ..., \lambda P_1^L + (1-\lambda) P_2^L)  (34)

and

C_2 = (P_2^1, ..., P_2^t, \lambda P_2^{t+1} + (1-\lambda) P_1^{t+1}, ..., \lambda P_2^L + (1-\lambda) P_1^L)  (35)

where t is a random integer among 1, 2, ..., L and \lambda \in [0, 1] is the arithmetic mixing coefficient. This crossover operator is designed so as to guarantee that the resulting children are still ordered sequences of real numbers within a valid range, and that the number of clusters is maintained. In fact, this treatment is especially suitable for chromosomal representations of ordered sequences of real numbers. The crossover rate (i.e., the probability that crossover occurs) is generally around 0.5, and in this paper we set it at 0.6.

The mutation operator is defined as the addition of a normally distributed factor with mean value 0, i.e., D_i' = D_i + \epsilon, where D_i is an original parameter, D_i' is the mutated one, and \epsilon is a normally distributed random number, i.e., \epsilon \sim N(0, \sigma^2), where the value of \sigma^2 is tunable. \sigma plays a role similar to the step size in a line search. In this study, we choose

\sigma = \frac{x_{max} - x_{min}}{3n},  (36)

where n denotes the sample size.

[33] suggested that the optimal mutation rate, i.e., the probability that mutation occurs for a single gene in a chromosome, is approximately (S L^{1/2})^{-1}, where S is the population size and L is the length of the chromosome. Here we follow this rule.

Since the current optimization problem is constrained by the requirement x_{min} < \theta_1 < \theta_2 < ... < \theta_{2M-2} < x_{max}, in addition to the selection, crossover and mutation operators we need another check operator in order to create a new valid child chromosome. A valid chromosome has to satisfy certain constraints. First, a chromosome must be an ordered sequence of real numbers within the range [x_{min}, x_{max}]. Also, the number of data points falling into each cluster must be greater than the number of local independent components. If there is no crossover and mutation, a chromosome is simply copied to the next generation.

The stopping rule for the current iterative case is relatively simple, as our purpose is to search for promising initial inputs for a local optimization algorithm. Thus, when we observe that the convergence of the GA has become very slow, we stop it.
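The two problem-specific operators can be sketched as follows. The convex (arithmetic) combination in the crossover keeps genes inside [x_min, x_max]; the final re-sorting is our simple stand-in for the validity check operator, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(p1, p2, lam=0.5):
    """One-point arithmetic crossover (Eqs. 34-35): copy genes up to a random
    point t, then take convex combinations of the remaining genes."""
    t = rng.integers(1, len(p1))
    c1 = np.concatenate([p1[:t], lam * p1[t:] + (1 - lam) * p2[t:]])
    c2 = np.concatenate([p2[:t], lam * p2[t:] + (1 - lam) * p1[t:]])
    return np.sort(c1), np.sort(c2)   # re-sorting keeps the sequence ordered

def mutate(chrom, x_min, x_max, n, rate):
    """Gaussian gene-wise mutation with sigma = (x_max - x_min) / (3n), Eq. (36)."""
    sigma = (x_max - x_min) / (3 * n)
    mask = rng.random(len(chrom)) < rate          # each gene mutates with prob. `rate`
    out = chrom + mask * rng.normal(0.0, sigma, len(chrom))
    return np.sort(np.clip(out, x_min, x_max))
```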
In summary, the main steps of the GA are the following:

1) Build an initial population of S chromosomes randomly selected within [x_min, x_max];
2) Calculate the fitness of each chromosome;
3) Select chromosomes from the parent generation to reproduce a child generation:
i) Select two parent chromosomes;
ii) Generate a random number in [0, 1]. (If it is smaller than the crossover rate, recombine them by one-point arithmetic crossover; otherwise, enter the next step);
iii) Generate a random number in [0, 1]. (If it is smaller than the mutation rate, perform mutation on a gene in a chromosome. Repeat this for each gene in both chromosomes);
iv) Add the two resulting chromosomes to the next generation. Repeat steps i) through iv) until S new chromosomes are reproduced.
4) Finally, if the stopping criterion is satisfied, then exit; otherwise, return to step 2).
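A compact driver for these steps might look like the following; it reuses crossover and mutate from the previous sketch and treats F_t as a black-box objective. The handling of the scaled fitness of Equation (33) and the stopping rule (a fixed generation count) are simplified assumptions on our part.

```python
import numpy as np

def run_ga(F_t, S, L, x_min, x_max, n_samples, n_gen=50, p_cross=0.6, seed=1):
    """GA of section 3.1 (a sketch): roulette-wheel selection on the scaled
    fitness, then crossover/mutation. Returns the final population, to be
    used as starting points for the local search."""
    rng = np.random.default_rng(seed)
    p_mut = 1.0 / (S * np.sqrt(L))                  # mutation rate suggested by [33]
    pop = np.sort(rng.uniform(x_min, x_max, (S, L)), axis=1)
    for _ in range(n_gen):
        obj = np.array([F_t(c) for c in pop])
        raw = obj.max() - obj                       # Eq. (32)
        fit = raw / max(raw.mean(), 1e-12) + raw.min()   # scaled and shifted, Eq. (33)
        prob = fit / fit.sum() if fit.sum() > 0 else np.full(S, 1.0 / S)
        children = []
        while len(children) < S:
            i, j = rng.choice(S, size=2, p=prob)    # spin the roulette wheel twice
            c1, c2 = pop[i].copy(), pop[j].copy()
            if rng.random() < p_cross:
                c1, c2 = crossover(c1, c2)
            children.append(mutate(c1, x_min, x_max, n_samples, p_mut))
            children.append(mutate(c2, x_min, x_max, n_samples, p_mut))
        pop = np.array(children[:S])
    return pop
```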
3.2. Adaptive Multi-Dimensional Hill Climbing

By means of GA optimization, we obtain a set of globally good initial guesses of the best vector of fuzzy clustering parameters, namely the last generation produced by the GA. In practice, it is also useful to keep track of the "best" chromosome throughout the whole GA simulation history. The next task is to search for the optima corresponding to these good initial guesses.

As noted, our objective functions are quite complicated, and thus it is difficult to apply classical gradient-based optimization methods. However, the problem can be solved numerically by a derivative-free approach, for instance, hill climbing. We propose a derivative-free method that optimizes the parameters one by one while keeping the others fixed, with an adaptively tuned step size. Since the parameters are not independent of one another, the overall optimization has to be performed iteratively. Our algorithm, described below, is in effect a multi-dimensional version of adaptive hill climbing.

It consists of multiple loops, in each of which the individual parameters are optimized one at a time. Consider the optimization of θi, whose current value is θi(0) and where the current model evaluation value is Ft(θ)(0). Let k = 1 and θi(k) = θi(k−1) + kd, where d is a small positive number, and keep the other p − 1 parameters unchanged. Then recalculate the model evaluation value Ft(θ)(k). If Ft(θ)(1) > Ft(θ)(0), that is, if the fuzzy model gets worse, then return to θi(0), let k = 1, and replace d by −d; otherwise, continue to search in the same direction within the interval [xmin, xmax] until Ft(θ)(k+1) > Ft(θ)(k). The final θi(k) is taken as the optimum in the current loop. Then we turn to the next parameter, θi+1. Each computational loop starts with the first parameter and ends with the last one, θ2M−2. Once a loop is completed, another one is started or not, depending upon the stopping criterion.

At the beginning of each loop we calculate the resultant model's value of Ft(θ), and we do the same at the end of the loop. If the relative difference between these two values is small enough, namely

$$
\frac{\left| F_t(\theta)^{(j)} - F_t(\theta)^{(j-1)} \right|}{F_t(\theta)^{(j-1)}} < \delta, \qquad (37)
$$

where δ is very small (e.g., 10^-5), we say that the minimum has been reached and we stop the local search. In view of the facts that i) there exists an (unknown) lower bound for Ft(θ)(k), and ii) the sequence of Ft(θ)(k) is non-increasing, convergence is guaranteed according to the Cauchy convergence criterion.
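In code, one pass of this procedure could look like the sketch below. It is our reading of the algorithm, not a verbatim transcription: `F_t` is the black-box model evaluation to be minimized, `theta` holds θ1, ..., θ2M−2, `d0` is the initial step, and the ordering constraint on theta is omitted for brevity.

```python
def climb_coordinate(F_t, theta, i, x_min, x_max, d):
    """Move theta[i] by growing increments k*d while F_t keeps improving;
    undo the last move once the model gets worse. Returns the best value."""
    f_best, k = F_t(theta), 0
    while True:
        k += 1
        candidate = theta[i] + k * d
        if not (x_min < candidate < x_max):   # stay inside the valid interval
            return f_best
        previous = theta[i]
        theta[i] = candidate
        f_new = F_t(theta)
        if f_new > f_best:                    # model got worse: step back, stop
            theta[i] = previous
            return f_best
        f_best = f_new

def hill_climb(F_t, theta, x_min, x_max, d0=1e-2, delta=1e-5, max_loops=100):
    """Multi-dimensional adaptive hill climbing: optimize the parameters one
    at a time, looping until the relative-change test of Eq. (37) is met."""
    f_prev = F_t(theta)
    for _ in range(max_loops):
        for i in range(len(theta)):
            before = F_t(theta)
            # probe the +d direction; if it yields no improvement, try -d
            if climb_coordinate(F_t, theta, i, x_min, x_max, d0) >= before:
                climb_coordinate(F_t, theta, i, x_min, x_max, -d0)
        f_now = F_t(theta)
        if abs(f_prev - f_now) < delta * abs(f_prev):   # Eq. (37)
            break
        f_prev = f_now
    return theta
```

Because every accepted move lowers F_t and the step direction is reversed on failure, the objective sequence is non-increasing, mirroring the convergence argument given above.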
3.3. Procedure of the Mixture of Local Feature Models

To this point, the development of the new method for a mixture of local feature models, from model structure to parameter estimation, is essentially complete. In summary, the input space of the system is first decomposed into fuzzy subspaces by means of the adaptive fuzzy clustering algorithm, and then in each subspace the system is approximated by a local linear feature model. This is somewhat analogous to the Takagi-Sugeno fuzzy model [34] in the context of predictive control. The entire modeling procedure can be summarized as follows.
1) Set M = 1.
2) Given the number of fuzzy clusters, optimize the fuzzy clustering using a hybrid GA:
   a) Build an initial population of S chromosomes randomly;
   b) Calculate the fitness of each chromosome:
      i) classify the data points into the fuzzy clusters;
      ii) perform local PCA/ICA within each fuzzy cluster;
      iii) create local PCA/ICA models based upon the data, using the multiple regression method with fuzzy variable selection;
      iv) mix the local PCA/ICA models;
      v) calculate the weighted residual sum of squared errors;
   c) Generate the next generation of chromosomes by selection, crossover and mutation;
   d) If the "stop" criterion is met, go to e); otherwise, go to b);
   e) Treat the last GA generation as the starting parameter values and determine the local minima around them.
3) Choose the best local optimum as the global optimum and assess the resulting optimal model by means of G(M) in Equation (29). If G(M) < G(M − 1) for M ≥ 2, go to step 4); otherwise, go to step 5).
4) Set M = M + 1 and go to step 2).
5) Return the final optimal mixture model, including the optimal fuzzy clustering.
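The outer loop, steps 1) through 5), thus amounts to growing the number of regimes M while the complexity-penalized score G(M) of Equation (29) keeps improving. A schematic driver is sketched below; the naming is ours, and `fit_mixture(M)` is a hypothetical stand-in that bundles the hybrid GA and local search of steps 2)-3) and returns the fitted model together with its G(M) score.

```python
def select_model_order(fit_mixture, max_M=10):
    """Grow the number of fuzzy regimes M while G(M) keeps decreasing;
    return the mixture model with the smallest score seen.

    fit_mixture(M): runs the hybrid GA plus local search for a fixed M
    and returns (model, G) -- a stand-in for steps 2) and 3)."""
    best_model, best_G = fit_mixture(1)        # step 1): start with M = 1
    M = 2
    while M <= max_M:
        model, G = fit_mixture(M)
        if G < best_G:                         # G(M) < G(M-1): accept, keep growing
            best_model, best_G = model, G
            M += 1                             # step 4)
        else:                                  # score worsened: stop
            break
    return best_model, best_G                  # step 5)
```

The cap `max_M` is a practical safeguard of ours; the procedure in the text stops purely on the G(M) comparison.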
4. Numerical Simulation Study

In this section, we present results from our numerical simulation studies. In order to compare directly with a global composite model method, we apply the new method to the same examples used in [1]. It is first applied to an artificial example in which the true model is assumed to be known; this allows us to demonstrate how the method works and its advantage over global models. Then the method is applied to a real physical case. The numerical simulations also support our argument that a mixture of local models is suited to situations in which severe nonlinearity is involved, with the linear local models reflecting the nonlinear characteristics.

4.1. Artificial Example

In this example, artificial models and data are used to demonstrate the effectiveness of the new method in treating complex nonlinearity. Assume that the true model is expressed as

$$
y = 150 - 150e^{-x^2/3} + 0.1x^4 + 30e^{-x/3}\sin x + 15\sin(1.5x) + 20\ln(1 + x). \qquad (38)
$$

Correspondingly, the data-generating model can be written as

$$
\tilde{y} = y(x) + \varepsilon, \qquad (39)
$$

where ε obeys a normal distribution, i.e., ε ~ N(0, σ²), with σ² equal to 64. From this generative model we gathered a data set of size n = 50, i.e., (xi, yi), where the xi are evenly distributed over [0, 10] (see Figure 3) and the yi are the corresponding noisy observations.

We also propose a class of candidate models as follows:

$$
\begin{aligned}
f_1(x) &= 150 - 150e^{-x^2/4} + 15\sin(1.5x) + 20\log(1 + x); \\
f_2(x) &= 150 - 150e^{-x^2/3} + 0.1x^4; \\
f_3(x) &= 150 - 150e^{-x^2/3} + 0.1x^4 + 30e^{-x/3}\sin x; \\
f_4(x) &= 150 - 150e^{-x^2/6} + 30e^{-x/3}\sin x + 20\log(1 + x); \\
f_5(x) &= 150 - 150e^{-x^2/3} + 0.1x^4 + 15\sin(1.5x); \\
f_6(x) &= 150 - 150e^{-0.04x^2} + 15\cos(2x),
\end{aligned}
\qquad (40)
$$

as shown in Figure 3. Thus, for the same input these candidate models give different results. Note that each candidate model is either incomplete or erroneous, or both. As shown in Figure 3, these models approximate the true model to varying degrees over the input space.

Figure 3. Artificial candidate models and generated data.

Once the candidate models are formulated and the data are collected, we are ready to employ our new approach: determine local domains by fuzzy clustering, build local models within each fuzzy cluster, and finally mix the local models into a global model. The results are shown in Table 1, compared with a single best model and a global linear model. In Table 1, the test error refers to the cross-validation mean squared error. The fuzzy membership functions and the resultant mixture models are plotted in Figures 4 and 5, respectively.

From the results in Table 1, we see that the new method works well, as expected. This is understandable, because this example involves highly nonlinear dependencies among the candidate models and severe nonlinearity in the data, which are exactly the two primary problems that our new approach is designed to deal with.

Table 1. Simulation results of the artificial example.

Model                    Domains                 BIC      Test error
A single best model      [0, 10]                 N/A      104.67
Global combination       [0, 10]                 583      18.02
2-Regime local models    [0, 0.57, 4.865, 10]    55.15    11.68

Figure 4. Membership functions.

Figure 5. Local models and the mixture model.
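In our reading, the mixture plotted in Figure 5 is a membership-weighted sum of the local composite models. The Python sketch below illustrates only that combination step; the logistic membership shape, its `softness` parameter, and the use of a single break point per transition are illustrative stand-ins for the paper's fitted parametric memberships, whose two-regime transition zone here spans [0.57, 4.865].

```python
import math

def memberships(x, boundaries, softness=1.0):
    """Illustrative fuzzy memberships for the regimes separated by a sorted
    list of break points, built from logistic transitions and normalized to
    sum to one (a stand-in for the fitted parametric memberships)."""
    lows = [-math.inf] + boundaries
    highs = boundaries + [math.inf]
    raw = []
    for lo, hi in zip(lows, highs):
        up = 1.0 if lo == -math.inf else 1.0 / (1.0 + math.exp(-(x - lo) / softness))
        down = 1.0 if hi == math.inf else 1.0 / (1.0 + math.exp((x - hi) / softness))
        raw.append(up * down)       # high only inside this regime's interval
    total = sum(raw)
    return [r / total for r in raw]

def mixture_predict(x, local_models, boundaries):
    """Membership-weighted combination of the local feature models f_m(x)."""
    mu = memberships(x, boundaries)
    return sum(m * f(x) for m, f in zip(mu, local_models))
```

For the two-regime artificial fit, `local_models` would hold the two local composite models and `boundaries` a single point inside the fitted transition zone, e.g. its midpoint; the smooth hand-off between the two local models is what produces the blended curve of Figure 5.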
4.2. Physical Case Study

In the above subsection, we demonstrated the effectiveness of our method using artificial data. Here, we address a physical example: the peak ground acceleration (PGA) attenuation models used in seismology. In this example, the purpose is to build a more accurate mixture of local models applicable to southern California in the United States. A sample data set of size 102 is obtained from the literature [35], and the logarithms of the observations are assumed to include Gaussian noise. Correspondingly, the candidate attenuation models include those of Boore et al. (1997) [36], Sadigh et al. (1997) [37], Abrahamson and Silva (1997) [38], Campbell and Bozorgnia (1997) [39], Spudich et al. (1997) and Idriss (1995) [40]. All of these attenuation relations may be found in Seismological Research Letters, Volume 68, Number 1, January/February 1997. All of them were developed for shallow crustal earthquakes in active tectonic regions, and thus they should be applicable to southern California.

The candidate models are plotted together with the sample data in Figure 6. From Figure 6, it is easy to see that all of the models are close to straight lines, which means that, unlike in the artificial example, the dependencies among the candidate models are mostly linear.

Figure 6. Candidate peak ground acceleration (PGA) attenuation models and data.

As in the artificial example, we apply the new method to optimize the fuzzy domains, create local models and finally create a mixture model. The simulation results are shown in Table 2, where the test error again refers to the cross-validation mean squared error, and the mixture model is plotted in Figure 7 together with the data.

Table 2. Simulation results of the physical case study.

Model                   Domains               BIC       Test error
Best single model       [0, 120]              N/A       0.1935
Global combination      [0, 120]              -66.72    0.1567
2-Regime local model    [0, 23, 23.9, 120]    -69.37    0.1303

Figure 7. Local models and the mixture model.

From the above results, we see that the test error of the mixture model using two operating regimes is smaller than that of the global model by about 12%. Also, the model plotted in Figure 7 appears reasonable. First of all, the peak ground acceleration (PGA) decreases with increasing distance from the earthquake epicenter. The model also shows that in two regions, namely near the epicenter and far from the epicenter, the attenuation as a function of distance behaves somewhat differently. One possible reason is the effect of the depth of the seismic source. Likewise, a possible explanation of the turning point near 23 km is that for shallow crustal earthquakes the average depth of ruptures is about 25 km [39].

5. Summary

In this paper, we propose a mixture of local ICA models to overcome the weaknesses of global models in dealing with nonlinearity and locality. The method consists of three components: fuzzy clustering, local feature modeling and model combination. A supervised adaptive fuzzy clustering algorithm is proposed in order to divide the entire input space into operating regimes; within each individual region, local PCA or ICA analysis is carried out and the feature-based model combination method is applied to create a local feature model. Finally, the local feature models are combined into a single mixture model. Correspondingly, a three-stage optimization procedure, which is in fact a hybrid GA, is designed to optimize the complete objective function.
In order to demonstrate its effectiveness, the new method is applied to both an artificial example and a physical case in a seismic study. Our simulation results show that the adaptive fuzzy mixture of local ICA models is superior to global models.

6. References

[1] M. Xu and M. Golay, "Data-Guided Model Combination by Decomposition and Aggregation," Machine Learning, Vol. 63, No. 1, 2005, pp. 43-67.
[2] D. J. Bartholomew and M. Knott, "Latent Variable Models and Factor Analysis," Arnold, London; Oxford University Press, New York, 1999.
[3] I. T. Jolliffe, "Principal Component Analysis," Springer-Verlag, New York, 1986.
[4] A. Hyvärinen, "Fast and Robust Fixed-Point Algorithms for Independent Component Analysis," IEEE Transactions on Neural Networks, Vol. 10, No. 3, 1999, pp. 626-634.
[5] J. Karhunen and S. Malaroiu, "Locally Linear Independent Component Analysis," Proceedings of the International Joint Conference on Neural Networks, 1999.
[6] T. A. Johansen and B. A. Foss, "Operating Regime Based Process Modeling and Identification," Computers and Chemical Engineering, Vol. 21, 1997, pp. 159-176. doi:10.1016/0098-1354(95)00260-X
[7] G. J. McLachlan and K. E. Basford, "Mixture Models: Inference and Application to Clustering," Marcel Dekker, New York, 1988.
[8] R. Murray-Smith and T. A. Johansen, "Local Learning in Local Model Networks," Proceedings of the IEE International Conference on Artificial Neural Networks, Cambridge, UK, 1995, pp. 40-46. doi:10.1049/cp:19950526
[9] M. I. Jordan and R. A. Jacobs, "Hierarchical Mixtures of Experts and the EM Algorithm," Neural Computation, Vol. 6, 1994, pp. 181-214. doi:10.1162/neco.1994.6.2.181
[10] J. Fan, "Local Modelling," Encyclopedia of Statistical Sciences, 1995.
[11] W. S. Cleveland and S. J. Devlin, "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting," Journal of the American Statistical Association, Vol. 83, 1988, pp. 596-610.
[12] R. A. Jacobs, M. I. Jordan, S. J. Nowlan and G. E. Hinton, "Adaptive Mixtures of Local Experts," Neural Computation, Vol. 3, 1991, pp. 79-87. doi:10.1162/neco.1991.3.1.79
[13] U. Söderman, J. Top and J.-E. Strömberg, "The Conceptual Side of Mode Switching," Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Le Touquet, France, 1993, pp. 245-250.
[14] S. Harrington, R. Zhang, P. H. Poole, F. Sciortino and H. E. Stanley, "Liquid-Liquid Phase Transition: Evidence from Simulations," Physical Review Letters, Vol. 78, No. 12, 1997, pp. 2409-2412. doi:10.1103/PhysRevLett.78.2409
[15] K. Honda, H. Ichihashi, M. Ohue and K. Kitaguchi, "Extraction of Local Independent Components Using Fuzzy Clustering," Proceedings of the 6th International Conference on Soft Computing, 2000.
[16] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, "Classification and Regression Trees," Wadsworth, Belmont, CA, 1984.
[17] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters," Journal of Cybernetics, Vol. 3, 1973, pp. 32-57. doi:10.1080/01969727308546046
[18] J. C. Bezdek, "Pattern Recognition with Fuzzy Objective Function Algorithms," Plenum Press, New York, 1981.
[19] N. Kambhatla and T. Leen, "Dimension Reduction by Local Principal Component Analysis," Neural Computation, Vol. 9, 1997, pp. 1493-1516. doi:10.1162/neco.1997.9.7.1493
[20] J. Karhunen and S. Malaroiu, "Local Independent Component Analysis Using Clustering," Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation, 1999, pp. 43-48.
[21] S. Geman, E. Bienenstock and R. Doursat, "Neural Networks and the Bias/Variance Dilemma," Neural Computation, Vol. 4, 1992, pp. 1-58.
[22] H. Akaike, "Information Theory and an Extension of the Maximum Likelihood Principle," Proceedings of the 2nd International Symposium on Information Theory (B. N. Petrov and F. Csáki, Eds.), 1973, pp. 267-281.
[23] G. Schwarz, "Estimating the Dimension of a Model," Annals of Statistics, Vol. 6, 1978, pp. 461-464. doi:10.1214/aos/1176344136
[24] H. Hotelling, "Analysis of a Complex of Statistical Variables into Principal Components," Journal of Educational Psychology, Vol. 24, 1933, pp. 417-441. doi:10.1037/h0071325
[25] A. Hyvärinen and P. Pajunen, "Nonlinear Independent Component Analysis: Existence and Uniqueness Results," Neural Networks, Vol. 12, No. 2, 1999, pp. 209-219.
[26] R. Murray-Smith and T. A. Johansen (Eds.), "Multiple Model Approaches to Nonlinear Modeling and Control," Taylor and Francis, London, UK, 1997.
[27] D. L. B. Jupp, "Approximation to Data by Splines with Free Knots," SIAM Journal on Numerical Analysis, Vol. 15, No. 2, 1978, pp. 328-343. doi:10.1137/0715022
[28] J. Friedman, "Multivariate Adaptive Regression Splines (with Discussion)," Annals of Statistics, Vol. 19, 1991, pp. 1-141. doi:10.1214/aos/1176347963
[29] H. G. Burchard, "Splines (with Optimal Knots) Are Better," Applicable Analysis, Vol. 3, 1974, pp. 309-319. doi:10.1080/00036817408839073
[30] J. H. Holland, "Adaptation in Natural and Artificial Systems," The University of Michigan Press, Ann Arbor, MI, 1975.
[31] J. Pittman, "Adaptive Splines and Genetic Algorithms," Journal of Computational and Graphical Statistics, Vol. 11, No. 3, 2002, pp. 1-24.
[32] D. E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning," Addison-Wesley, Reading, MA, 1989.
[33] J. Hesser and R. Männer, "Towards an Optimal Mutation Probability for Genetic Algorithms," Proceedings of the First Workshop on Parallel Problem Solving from Nature, Lecture Notes in Computer Science, Vol. 496, Springer-Verlag, Berlin, 1991, pp. 23-31.
[34] T. Takagi and M. Sugeno, "Fuzzy Identification of Systems and Its Application to Modeling and Control," IEEE Transactions on Systems, Man and Cybernetics, Vol. 15, 1985, pp. 116-132.
[35] J. H. Steidl and Y. Lee, "The SCEC Phase III Strong-Motion Database," Bulletin of the Seismological Society of America, Vol. 90, No. 6B, 2000, pp. S113-S135. doi:10.1785/0120000511
[36] D. M. Boore, W. B. Joyner and T. E. Fumal, "Equations for Estimating Horizontal Response Spectra and Peak Acceleration from Western North American Earthquakes: A Summary of Recent Work," Seismological Research Letters, Vol. 68, No. 1, 1997, pp. 128-153.
[37] K. Sadigh, C.-Y. Chang, J. A. Egan, F. Makdisi and R. R. Youngs, "Attenuation Relations for Shallow Crustal Earthquakes Based on California Strong Motion Data," Seismological Research Letters, Vol. 68, No. 1, 1997, pp. 180-189.
[38] N. A. Abrahamson and W. J. Silva, "Empirical Response Spectral Attenuation Relations for Shallow Crustal Earthquakes," Seismological Research Letters, Vol. 68, No. 1, 1997, pp. 94-127.
[39] K. W. Campbell, "Empirical Near-Source Attenuation Relations for Horizontal and Vertical Components of Peak Ground Acceleration, Peak Ground Velocity, and Pseudo-Absolute Acceleration Response Spectra," Seismological Research Letters, Vol. 68, No. 1, 1997, pp. 154-179.
[40] I. M. Idriss, "An Overview of Earthquake Ground Motion Pertinent to Seismic Zonation," Proceedings of the 5th International Conference on Seismic Zonation, 1995, pp. 17-19.