Open Journal of Statistics
Vol.08 No.06(2018), Article ID:88954,17 pages
10.4236/ojs.2018.86059

Linear Regression Analysis for Symbolic Interval Data

Jin-Jian Hsieh, Chien-Cheng Pan

Department of Mathematics, National Chung Cheng University, Taiwan

Copyright © 2018 by authors and Scientific Research Publishing Inc.

This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).

http://creativecommons.org/licenses/by/4.0/

Received: October 24, 2018; Accepted: December 1, 2018; Published: December 4, 2018

ABSTRACT

In the era of network technology, collected data are becoming larger and more complex. In this article, we focus on estimating the linear regression parameters for symbolic interval data. We propose two approaches for estimating the regression parameters under two different data models and compare them with existing methods via simulations. Finally, we analyze two real datasets with the proposed methods for illustration.

Keywords:

Linear Regression, Symbolic Interval Data, Centre Method, Least Squares Estimate

1. Introduction

In classical statistical analysis, the collected data usually take exact values. In the era of network technology, however, the collected data are often of symbolic type. Diday [1] introduced symbolic data, which are presented in the form of intervals, histograms, lists and so on. Unlike classical data, symbolic data can take more forms in a p-dimensional space $\mathbb{R}^p$. In this article we discuss symbolic interval data, in which each observation is an interval rather than an exact value. With this change, the classical methods may no longer be applicable, so it is necessary to develop new methods for the analysis of symbolic data. We study parameter estimation in linear regression with covariates for symbolic interval data.

Billard and Diday [2] used the center point of each interval to fit the linear regression model. Carvalho et al. [3] used the center point and the range of each interval to fit two linear regression models. Xu [4] applied the symbolic covariance method, introduced by Billard [5], to symbolic interval data. In this article, we present two approaches to estimate regression parameters for symbolic interval data. The first method is the endpoints least squares estimate, and the second is the least squares estimate with an interval weighted function.

This paper is organized as follows. Section 2 introduces symbolic interval data, Model 1 and Model 2. In Section 3, we propose two methods to estimate the regression coefficients for symbolic interval data. In Section 4, the proposed methods are compared with some existing methods via simulations. In Section 5, we analyze two real datasets with the proposed approaches. Finally, we make some concluding remarks in Section 6.

2. Data and Models

2.1. Symbolic Interval Data

In this article, we study regression for symbolic interval-valued data. We first introduce symbolic interval data. In classical data, the exact values of the variables of interest can usually be observed. In the era of network technology, collected data are becoming more complex and are no longer single points. Diday [1] introduced a new data format called symbolic data. Symbolic data come in several types, such as intervals, histograms and lists, and these types call for new methods. For example, because of privacy issues, exact values often cannot be collected from respondents, so questionnaires are usually designed to collect symbolic interval data instead. For notation, define $X_{ij} = [a_{ij}, b_{ij}]$, $Y_i = [c_i, d_i]$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, $Y = (Y_1, \ldots, Y_n)$, $X_i = (X_{i1}, \ldots, X_{ip})$ and $X = (X_1, \ldots, X_n)$. Thus, the observed data are $\{(Y_i, X_{i1}, \ldots, X_{ip}) : i = 1, \ldots, n\}$.
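As a small illustration of this notation, interval data can be stored as arrays of (lower, upper) pairs, from which the centre points and lengths used by the methods discussed later are easy to compute. The sketch below uses NumPy; the toy values and the helper names `midpoints` and `lengths` are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical toy data: n = 3 observations, p = 2 interval covariates.
# Each interval is stored as a (lower, upper) pair along the last axis.
X = np.array([[[1.0, 2.0], [0.5, 1.5]],
              [[2.0, 4.0], [1.0, 1.2]],
              [[0.0, 3.0], [2.0, 2.5]]])   # shape (n, p, 2): [a_ij, b_ij]
Y = np.array([[3.0, 5.0],
              [4.0, 9.0],
              [1.0, 6.0]])                 # shape (n, 2): [c_i, d_i]

def midpoints(intervals):
    """Centre point of each interval, (lower + upper) / 2."""
    return intervals.mean(axis=-1)

def lengths(intervals):
    """Length of each interval, |upper - lower|."""
    return np.abs(intervals[..., 1] - intervals[..., 0])

Y_c, Y_r = midpoints(Y), lengths(Y)   # shapes (n,) and (n,)
X_c, X_r = midpoints(X), lengths(X)   # shapes (n, p) and (n, p)
```

Keeping the endpoints on the last axis means the same two helpers work for both the response and the covariates.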

2.2. Model 1

This model considers the linear regression model for symbolic interval data as

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i, \tag{1}$$

$$\begin{bmatrix} Y_i^U \\ Y_i^L \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 X_{i1}^U + \cdots + \beta_p X_{ip}^U + \varepsilon_i^U \\ \beta_0 + \beta_1 X_{i1}^L + \cdots + \beta_p X_{ip}^L + \varepsilon_i^L \end{bmatrix}, \tag{2}$$

where $i = 1, \ldots, n$, $(\beta_1, \ldots, \beta_p)$ are the parameters of interest, and $\varepsilon_i$ is the error term. Here, we assume $X_{ij}^L < X_{ij}^U$, as in Billard and Diday [2]; therefore $a_{ij} = X_{ij}^L$ and $b_{ij} = X_{ij}^U$. Because the $\beta$'s are unknown, the order of $Y_i^L$ and $Y_i^U$ cannot be identified: $Y_i^U$ is either $c_i$ or $d_i$, and the remaining one is $Y_i^L$. This model implies that the length of $Y_i$ depends on the lengths of $X_i$. In practice, however, the length of $Y_i$ may not depend on the lengths of $X_i$. We therefore also consider a different model, Model 2.
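To see why Model 1 ties the length of $Y_i$ to the lengths of the $X_{ij}$, one can subtract the lower-endpoint equation from the upper-endpoint equation in (2). This short derivation is an illustrative addition that uses only the notation defined above:

$$Y_i^U - Y_i^L = \sum_{j=1}^{p} \beta_j \left( X_{ij}^U - X_{ij}^L \right) + \left( \varepsilon_i^U - \varepsilon_i^L \right).$$

Apart from the error difference, the signed length of $Y_i$ is a linear combination of the covariate lengths. Since $X_{ij}^U - X_{ij}^L > 0$, a negative $\beta_j$ can make the right-hand side negative, which is why it is not known in advance whether $Y_i^U$ corresponds to $c_i$ or to $d_i$.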

2.3. Model 2

In typical statistical analysis, the linear regression model is

$$Y_i^* = \beta_0 + \beta_1 X_{i1}^* + \cdots + \beta_p X_{ip}^* + \varepsilon_i, \quad i = 1, \ldots, n, \tag{3}$$

where $Y_i^*$ and $X_{ij}^*$ are single points, $(\beta_1, \ldots, \beta_p)$ are the parameters of interest, and $\varepsilon_i$ is the error term. In practice, $(Y_i^*, X_{ij}^*)$ may not be observed because of privacy issues or other reasons. Usually, the proxies $Y_i = [c_i, d_i]$ and $X_{ij} = [a_{ij}, b_{ij}]$ of $(Y_i^*, X_{ij}^*)$ can be collected. Note that $Y_i^* \in Y_i$ and $X_{ij}^* \in X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$. Thus, the collected data are $\{(Y_i, X_{i1}, \ldots, X_{ip}) : i = 1, \ldots, n\}$. In this model, the length of $Y_i$ does not depend on the lengths of $X_i$.

3. The Proposed Estimations

3.1. Method 1: Endpoints Least Squares Estimate

Based on Model 1, we propose the endpoints least squares estimation approach to estimate $(\beta_0, \ldots, \beta_p)$. We assume that $X_{ij}^L < X_{ij}^U$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, as in Billard and Diday [2]. Because the $\beta_j$'s are unknown, the order of $(Y_i^L, Y_i^U)$ cannot be identified directly, so we use the following procedure to identify it. Under Model 1,

$$\begin{aligned} Y_i^U &= \beta_0 + \beta_1 X_{i1}^U + \beta_2 X_{i2}^U + \cdots + \beta_p X_{ip}^U + \varepsilon_i^U, \\ Y_i^L &= \beta_0 + \beta_1 X_{i1}^L + \beta_2 X_{i2}^L + \cdots + \beta_p X_{ip}^L + \varepsilon_i^L, \end{aligned} \tag{4}$$

where $i = 1, \ldots, n$. To identify the order of $Y_i^L$ and $Y_i^U$, we first apply the centre method [2] to obtain the estimate $\hat{\beta}^c$ of $\beta$.

Then, compute $\hat{Y}_i^{Uc}$ and $\hat{Y}_i^{Lc}$ as

$$\begin{aligned} \hat{Y}_i^{Uc} &= \hat{\beta}_0^c + \hat{\beta}_1^c X_{i1}^U + \hat{\beta}_2^c X_{i2}^U + \cdots + \hat{\beta}_p^c X_{ip}^U, \\ \hat{Y}_i^{Lc} &= \hat{\beta}_0^c + \hat{\beta}_1^c X_{i1}^L + \hat{\beta}_2^c X_{i2}^L + \cdots + \hat{\beta}_p^c X_{ip}^L. \end{aligned} \tag{5}$$

When $\hat{Y}_i^{Lc} < \hat{Y}_i^{Uc}$, set $Y_i^L = c_i$ and $Y_i^U = d_i$. When $\hat{Y}_i^{Uc} < \hat{Y}_i^{Lc}$, set $Y_i^U = c_i$ and $Y_i^L = d_i$. We then obtain the estimates of $\beta$ by the endpoints least squares estimate as

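$$\hat{\beta}^U = \left( (X^U)^\top X^U \right)^{-1} (X^U)^\top Y^U, \qquad \hat{\beta}^L = \left( (X^L)^\top X^L \right)^{-1} (X^L)^\top Y^L, \tag{6}$$

where $Y^U = (Y_1^U, \ldots, Y_n^U)^\top$, $Y^L = (Y_1^L, \ldots, Y_n^L)^\top$, $X^U = (X_1^U, \ldots, X_n^U)^\top$, $X^L = (X_1^L, \ldots, X_n^L)^\top$, $X_i^U = (X_{i1}^U, \ldots, X_{ip}^U)$ and $X_i^L = (X_{i1}^L, \ldots, X_{ip}^L)$. Then, set

$$\hat{\beta} = \frac{\hat{\beta}^U + \hat{\beta}^L}{2}. \tag{7}$$

$\hat{\beta}$ is the estimator of $\beta$ in Model 1.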
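The four steps of Method 1 (centre-method fit, fitted endpoints, ordering of the response endpoints, and the averaged endpoint fits) can be sketched in NumPy as follows. This is a minimal sketch rather than the authors' implementation. It assumes that the design matrices are augmented with a column of ones so that the intercept can be estimated, and that ties in the fitted endpoints are resolved by keeping $c_i$ as the lower value. The function names `ols` and `endpoints_lse` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least squares fit; a column of ones is prepended for the intercept (assumption)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def endpoints_lse(X_low, X_up, c, d):
    """Sketch of Method 1.

    X_low, X_up: (n, p) lower/upper covariate endpoints with X_low < X_up.
    c, d:        (n,) observed response endpoints with c <= d.
    """
    # Step 1: centre method, i.e. least squares on the interval midpoints.
    beta_c = ols((X_low + X_up) / 2, (c + d) / 2)
    # Step 2: centre-method fitted values at the upper and lower X endpoints.
    fit_up = beta_c[0] + X_up @ beta_c[1:]
    fit_low = beta_c[0] + X_low @ beta_c[1:]
    # Step 3: decide which observed endpoint plays the role of Y^L and Y^U.
    keep_order = fit_low <= fit_up          # ties keep c as Y^L (assumption)
    Y_low = np.where(keep_order, c, d)
    Y_up = np.where(keep_order, d, c)
    # Step 4: separate endpoint fits, then average the two coefficient vectors.
    return (ols(X_up, Y_up) + ols(X_low, Y_low)) / 2

# Noise-free usage example on hypothetical data with a negative slope.
rng = np.random.default_rng(0)
draws = rng.normal(size=(50, 2, 2))
X_low, X_up = draws.min(axis=-1), draws.max(axis=-1)
beta_true = np.array([10.0, -10.0, 5.0])
y_low = beta_true[0] + X_low @ beta_true[1:]
y_up = beta_true[0] + X_up @ beta_true[1:]
c, d = np.minimum(y_low, y_up), np.maximum(y_low, y_up)
beta_hat = endpoints_lse(X_low, X_up, c, d)
```

In this noise-free example the centre method already recovers the coefficients, so the ordering step assigns every endpoint correctly and the averaged estimate matches `beta_true` up to rounding.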

3.2. Method 2: Interval Weighted Least Squares Estimate

Method 2 is designed for Model 2, in which the length of $Y_i$ does not depend on the lengths of $X_i$. The centre method [2] estimates the regression parameters by least squares applied to the center points of the interval data. Building on the centre method [2], we expect the lengths of the intervals to carry additional information for the estimation procedure. We therefore use the interval lengths to construct weighted functions, which give each observation a different impact in the least squares procedure. Denote the weighted function by $W_{ik}$. We suggest the interval weighted least squares estimation method

$$\min_{\beta} \sum_{i=1}^{n} W_i^* \left( Y_i^c - \beta_0 - \beta_1 X_{i1}^c - \beta_2 X_{i2}^c - \cdots - \beta_p X_{ip}^c \right)^2, \tag{8}$$

where $W_i^* = W_{ik} / \sum_{i=1}^{n} W_{ik}$, $i = 1, \ldots, n$, $k = 1, 2, 3$, and $Y_i^c$ and $X_{ij}^c$ are the center points of $Y_i$ and $X_{ij}$. The minimizer $\hat{\beta}$ of (8) is the estimator of $\beta$ in Model 2. Based on simulation experiments, we suggest the following three weighted functions of the interval lengths. Denote the interval lengths by $Y_i^r = |d_i - c_i|$ and $X_{ij}^r = |b_{ij} - a_{ij}|$, and let $M_Y^r = \max_i (Y_i^r)$ and $M_{X_j}^r = \max_i (X_{ij}^r)$, $i = 1, \ldots, n$, $j = 1, \ldots, p$. The first weighted function is designed as

$$W_{i1} = \left( a_1^* + b_1^* \times \exp\left( -\frac{Y_i^r}{M_Y^r} \right) \right) + \sum_{j=1}^{p} \left( c_1^* + d_1^* \times \exp\left( -\frac{X_{ij}^r}{M_{X_j}^r} \right) \right), \tag{9}$$

where $i = 1, \ldots, n$ and $a_1^*, b_1^*, c_1^*, d_1^*$ are positive constants. This weighted function declines exponentially as the interval lengths increase. The second weighted function is given as

$$W_{i2} = \left( a_2^* + b_2^* \times \left( -\frac{Y_i^r}{M_Y^r} \right) \right) + \sum_{j=1}^{p} \left( c_2^* + d_2^* \times \left( -\frac{X_{ij}^r}{M_{X_j}^r} \right) \right), \tag{10}$$

where $i = 1, \ldots, n$, $a_2^*, b_2^*, c_2^*, d_2^*$ are positive constants, $b_2^* < a_2^*$ and $d_2^* < c_2^*$. This weighted function declines linearly as the interval lengths increase. Define the standardized lengths of the interval data as $SY_i^r = Y_i^r / M_Y^r$ and $SX_{ij}^r = X_{ij}^r / M_{X_j}^r$, and let $\bar{S}Y^r = \frac{1}{n} \sum_{i=1}^{n} SY_i^r$ and $\bar{S}X_j^r = \frac{1}{n} \sum_{i=1}^{n} SX_{ij}^r$ be their averages. The third weighted function is designed as

W i 3 = ( 1 2 a 3 * exp ( | S Y i r S ¯ Y r a 3 * | ) ) 1 2 a 3 * + j = 1 p ( ( 1 2 a 3 * exp ( | S X i j r S ¯ X j r a 3 * | ) ) 1 2 a 3 * ) , (11)

where $i = 1, \ldots, n$ and $a_3^*$ is a positive constant. This weighted function decreases while the standardized length is below the average standardized length and increases once the standardized length exceeds that average. We compare all methods via simulations in Section 4.
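The interval weighted least squares fit in (8) with the first two weighted functions can be sketched as follows. This is a hedged illustration, not the authors' code. It assumes that the exponent in (9) and the slope in (10) are negative, as the stated declining behaviour requires, that at least one interval per variable has non-zero length, and that the design matrix carries a column of ones for the intercept. The default constants are the values used in the simulation section, and the function names are hypothetical. The third weighted function is not sketched here.

```python
import numpy as np

def weight_w1(Y_r, X_r, a=1.0, b=0.2, c=1.0, d=0.15):
    """First weighted function: exponential decline in the relative lengths."""
    return (a + b * np.exp(-Y_r / Y_r.max())
            + np.sum(c + d * np.exp(-X_r / X_r.max(axis=0)), axis=1))

def weight_w2(Y_r, X_r, a=1.0, b=0.2, c=1.0, d=0.15):
    """Second weighted function: linear decline; requires b < a and d < c."""
    return (a - b * Y_r / Y_r.max()
            + np.sum(c - d * X_r / X_r.max(axis=0), axis=1))

def interval_wls(Y_c, X_c, w):
    """Minimise sum_i W_i^* (Y_i^c - beta_0 - sum_j beta_j X_ij^c)^2."""
    w_star = w / w.sum()                              # normalised weights W_i^*
    design = np.column_stack([np.ones(len(X_c)), X_c])
    sw = np.sqrt(w_star)                              # weighted LS via row scaling
    beta, *_ = np.linalg.lstsq(design * sw[:, None], Y_c * sw, rcond=None)
    return beta

# Noise-free usage example on hypothetical midpoints and lengths.
rng = np.random.default_rng(1)
X_c = rng.normal(size=(30, 2))                        # midpoints X_ij^c
Y_c = 10 + X_c @ np.array([10.0, 5.0])                # midpoints Y_i^c
Y_r = rng.uniform(0.1, 1.0, size=30)                  # lengths Y_i^r
X_r = rng.uniform(0.1, 1.0, size=(30, 2))             # lengths X_ij^r
beta_w1 = interval_wls(Y_c, X_c, weight_w1(Y_r, X_r))
beta_w2 = interval_wls(Y_c, X_c, weight_w2(Y_r, X_r))
```

Scaling each row by the square root of its weight turns the weighted problem into an ordinary least squares problem, which avoids forming and inverting the weighted normal equations explicitly.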

4. Simulation

In this section, we compare our proposed methods, the endpoints least squares estimator (M1) and the interval weighted least squares estimator (M2), with the existing methods CM [2], CRM [3] and SCM [4] on simulated datasets. We consider two data-generating schemes, one for Model 1 and one for Model 2. For each table, we present the bias, the empirical standard deviation (SD), the average of the jackknife standard deviation (JackSD), the mean squared error (MSE), and the 95% coverage probability (CP). Data are simulated with sample sizes n = 50 and 100, and R = 500 replications.
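The jackknife standard deviation reported in the tables can be computed for any estimator that maps data arrays to a coefficient vector. The sketch below assumes the standard delete-one jackknife and, for the confidence interval, a normal approximation with the 1.96 quantile; the text does not spell out either choice, so both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def jackknife_sd(estimator, *arrays):
    """Delete-one jackknife standard deviation of estimator(*arrays).

    Every array must have the observations along its first axis.
    """
    n = len(arrays[0])
    # Re-estimate n times, leaving out observation i each time.
    leave_one_out = np.array([
        estimator(*(np.delete(a, i, axis=0) for a in arrays))
        for i in range(n)
    ])
    centred = leave_one_out - leave_one_out.mean(axis=0)
    return np.sqrt((n - 1) / n * np.sum(centred ** 2, axis=0))

def normal_ci(estimate, sd, z=1.96):
    """Normal-approximation confidence interval (an assumed construction)."""
    estimate, sd = np.asarray(estimate), np.asarray(sd)
    return estimate - z * sd, estimate + z * sd

# Usage example with the sample mean as the estimator.
x = np.array([1.0, 2.0, 4.0, 7.0])
sd_mean = jackknife_sd(np.mean, x)
ci_low, ci_high = normal_ci(x.mean(), sd_mean)
```

For the sample mean, the delete-one jackknife reproduces the usual standard error $s / \sqrt{n}$, which gives a quick sanity check of the implementation.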

For Model 1: we first generate two independent values from $N(0, \sigma_{X_j}^2)$, and let $X_{ij}^U$ be the larger one and $X_{ij}^L$ be the smaller one, where $i = 1, \ldots, n$, $j = 1, 2$. The error terms are $\varepsilon_i^L \sim N(0, \sigma_{\varepsilon L}^2)$ and $\varepsilon_i^U \sim N(0, \sigma_{\varepsilon U}^2)$, $i = 1, \ldots, n$. Then, we generate $Y_i^L$ and $Y_i^U$ as

$$\begin{aligned} Y_i^L &= \beta_0 + \beta_1 X_{i1}^L + \beta_2 X_{i2}^L + \varepsilon_i^L, \\ Y_i^U &= \beta_0 + \beta_1 X_{i1}^U + \beta_2 X_{i2}^U + \varepsilon_i^U. \end{aligned} \tag{12}$$

The coefficients $\beta = (\beta_0, \beta_1, \beta_2)$ are set as $(10, 10, 5)$ and $(10, -10, 5)$, and $(\sigma_{X_1}^2, \sigma_{X_2}^2, \sigma_{\varepsilon L}^2, \sigma_{\varepsilon U}^2)$ are set as $(1, 1, 0.1, 0.1)$. In Table 1 and Table 2, the two endpoints $(Y_i^L, Y_i^U)$ share the same error term, that is, $\varepsilon_i^L = \varepsilon_i^U \sim N(0, 0.1)$. In Table 3 and Table 4, the error terms of $(Y_i^L, Y_i^U)$ are different, that is, $\varepsilon_i^L \sim N(0, 0.1)$ and $\varepsilon_i^U \sim N(0, 0.1)$ are generated separately. From the results, the endpoints least squares estimator (M1) has a smaller standard deviation than the other methods for $\beta_1$ and $\beta_2$ in Table 1, Table 2 and Table 4. SCM performs better than the other methods in Table 3. Note that SCM performs poorly when $(\beta_0, \beta_1, \beta_2) = (10, -10, 5)$ in Table 2 and Table 4.
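The Model 1 data generation described above can be sketched as follows. The function name `simulate_model1` and its arguments are hypothetical; the `same_error` flag switches between the shared-error setting and the setting with separately generated errors. Only the sorted response endpoints are returned, since the order of the lower and upper response values is treated as unknown by the estimation methods.

```python
import numpy as np

def simulate_model1(n, beta, sigma2_x=(1.0, 1.0), sigma2_eps=0.1,
                    same_error=True, rng=None):
    """Generate interval data under Model 1 (illustrative sketch).

    beta:     (beta_0, beta_1, ..., beta_p).
    sigma2_x: variances of the p covariates (one entry per covariate).
    Returns X_low, X_up of shape (n, p) and c, d of shape (n,).
    """
    rng = np.random.default_rng(rng)
    beta = np.asarray(beta, dtype=float)
    p = len(beta) - 1
    sd_x = np.sqrt(np.asarray(sigma2_x, dtype=float))
    # Two independent normal draws per covariate; the smaller one is the
    # lower endpoint and the larger one the upper endpoint.
    draws = rng.normal(scale=sd_x[:, None], size=(n, p, 2))
    X_low, X_up = draws.min(axis=-1), draws.max(axis=-1)
    # Error terms: shared between endpoints or drawn separately.
    sd_eps = np.sqrt(sigma2_eps)
    eps_low = rng.normal(scale=sd_eps, size=n)
    eps_up = eps_low if same_error else rng.normal(scale=sd_eps, size=n)
    Y_low = beta[0] + X_low @ beta[1:] + eps_low
    Y_up = beta[0] + X_up @ beta[1:] + eps_up
    # The analyst only observes the interval [c_i, d_i], not which end is Y^U.
    c, d = np.minimum(Y_low, Y_up), np.maximum(Y_low, Y_up)
    return X_low, X_up, c, d

# Usage example: one replication with n = 50 and a negative slope.
X_low, X_up, c, d = simulate_model1(50, (10, -10, 5), same_error=False, rng=0)
```

Note that the settings above are stated as variances, so the standard deviation passed to the normal generator is the square root of 0.1, not 0.1 itself.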

Table 1. Estimations of $\beta$ under Model 1 with $(\beta_0, \beta_1, \beta_2) = (10, 10, 5)$.

Table 2. Estimations of $\beta$ under Model 1 with $(\beta_0, \beta_1, \beta_2) = (10, -10, 5)$.

Table 3. Estimations of $\beta$ under Model 1 with $(\beta_0, \beta_1, \beta_2) = (10, 10, 5)$.

Table 4. Estimations of $\beta$ under Model 1 with $(\beta_0, \beta_1, \beta_2) = (10, -10, 5)$.

For Model 2: we first generate the single points $X_{ij}^* \sim N(0, \sigma_{X_j}^2)$, $j = 1, 2$, and $\varepsilon_i \sim N(0, \sigma_{\varepsilon}^2)$, and set

$$Y_i^* = \beta_0 + \beta_1 X_{i1}^* + \beta_2 X_{i2}^* + \varepsilon_i, \tag{13}$$

where $i = 1, \ldots, n$. To construct the interval data, the ranges are generated from uniform distributions. Denote the upper ranges of $(Y_i^*, X_{ij}^*)$ by $(Y_i^{Uh}, X_{ij}^{Uh})$ and the lower ranges of $(Y_i^*, X_{ij}^*)$ by $(Y_i^{Lh}, X_{ij}^{Lh})$, $i = 1, \ldots, n$, $j = 1, 2$. We then build the interval-valued data as $[c_i, d_i] = [Y_i^* - Y_i^{Lh}, Y_i^* + Y_i^{Uh}]$ and $[a_{ij}, b_{ij}] = [X_{ij}^* - X_{ij}^{Lh}, X_{ij}^* + X_{ij}^{Uh}]$, $i = 1, \ldots, n$, $j = 1, 2$, which gives the interval data $(Y_i, X_{i1}, X_{i2})$, $i = 1, \ldots, n$. For the settings, $\beta = (\beta_0, \beta_1, \beta_2)$ are set as $(10, 10, 5)$ and $(10, -10, 5)$, and $(\sigma_{X_j}^2, \sigma_{\varepsilon}^2)$ are set as $(1, 0.5)$, $j = 1, 2$. The ranges are generated from uniform distributions: $Y_i^{Lh}$ and $Y_i^{Uh}$ from $U(0, b_1^h)$, $X_{i1}^{Lh}$ and $X_{i1}^{Uh}$ from $U(0, b_2^h)$, and $X_{i2}^{Lh}$ and $X_{i2}^{Uh}$ from $U(0, b_3^h)$, $i = 1, \ldots, n$. Note that M2 (1) is the interval weighted LSE with the first weighted function, W1, and $(a_1^*, b_1^*, c_1^*, d_1^*) = (1, 0.2, 1, 0.15)$; M2 (2) is the method with the second weighted function, W2, and $(a_2^*, b_2^*, c_2^*, d_2^*) = (1, 0.2, 1, 0.15)$; M2 (3) is the method with the third weighted function, W3, and $a_3^* = 0.5$; M2 (4) is the method with the third weighted function, W3, and $a_3^* = 1$; M2 (5) is the method with the third weighted function, W3, and $a_3^* = 2$. The simulation results are shown in Tables 5-8. From the results, the interval weighted least squares estimator with W3 performs better than the other methods.
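The Model 2 data generation can be sketched in the same spirit: latent points are drawn from the classical regression model and then wrapped in intervals whose lower and upper ranges are independent uniform draws. The function name `simulate_model2` and its arguments are hypothetical, and `b_h` collects the uniform upper bounds for the response and the covariates.

```python
import numpy as np

def simulate_model2(n, beta, sigma2_x=1.0, sigma2_eps=0.5,
                    b_h=(0.5, 0.5, 0.5), rng=None):
    """Generate interval data under Model 2 (illustrative sketch).

    beta: (beta_0, beta_1, ..., beta_p).
    b_h:  uniform upper bounds; b_h[0] for Y and b_h[1:] for the p covariates.
    Returns ((c, d), (a, b), (Y_star, X_star)).
    """
    rng = np.random.default_rng(rng)
    beta = np.asarray(beta, dtype=float)
    p = len(beta) - 1
    # Latent single-point data from the classical linear regression model.
    X_star = rng.normal(scale=np.sqrt(sigma2_x), size=(n, p))
    eps = rng.normal(scale=np.sqrt(sigma2_eps), size=n)
    Y_star = beta[0] + X_star @ beta[1:] + eps
    # Lower and upper ranges are independent uniform draws, so the length
    # of Y_i does not depend on the lengths of the X_ij.
    b_y, b_x = b_h[0], np.asarray(b_h[1:], dtype=float)
    c = Y_star - rng.uniform(0, b_y, size=n)
    d = Y_star + rng.uniform(0, b_y, size=n)
    a = X_star - rng.uniform(0, b_x, size=(n, p))
    b = X_star + rng.uniform(0, b_x, size=(n, p))
    return (c, d), (a, b), (Y_star, X_star)

# Usage example: one replication with n = 100 and all bounds equal to 1.
(c, d), (a, b), (Y_star, X_star) = simulate_model2(
    100, (10, 10, 5), b_h=(1.0, 1.0, 1.0), rng=0)
```

Because the lower and upper ranges are drawn separately, the latent point is generally not the midpoint of its interval, so methods based on center points work with a noisy proxy of the latent value.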

Table 5. Estimations of $\beta$ under Model 2 with $(\beta_0, \beta_1, \beta_2) = (10, 10, 5)$ and $b_1^h = b_2^h = b_3^h = 0.5$.

Table 6. Estimations of $\beta$ under Model 2 with $(\beta_0, \beta_1, \beta_2) = (10, -10, 5)$ and $b_1^h = b_2^h = b_3^h = 0.5$.

Table 7. Estimations of $\beta$ under Model 2 with $(\beta_0, \beta_1, \beta_2) = (10, 10, 5)$ and $b_1^h = b_2^h = b_3^h = 1$.

Table 8. Estimations of $\beta$ under Model 2 with $(\beta_0, \beta_1, \beta_2) = (10, -10, 5)$ and $b_1^h = b_2^h = b_3^h = 1$.

5. Real Data Analysis

In this section, we apply our proposed methods to two datasets, the mushroom data and the medical data, which are interval data corresponding to Model 1 and Model 2, respectively. The first dataset is the mushroom data from the Fungi of California Species Index. The complete data can be downloaded from http://www.mykoweb.com/CAF/species_index.html. Three features are represented by three variables: Y = the width of the pileus cap, X1 = the length of the stipe, and X2 = the thickness of the stipe. These measurements are interval-valued (in cm). There are 311 observations from the Fungi of California Species Index. Because the lengths of these variables should depend on each other, the dataset belongs to Model 1. We analyze the dataset with Method 1 and Method 2, using the same settings as in the simulations, and present the results in Table 9. Table 9 reports the estimates of $\beta$ ($\hat{\beta}$), the jackknife standard deviation (JackSD) and the 95% confidence interval (95% CI). From the results in Table 9, the M1 approach has the smaller standard deviation in the estimation of $\beta_1$ and $\beta_2$. The SCM approach has the smaller standard deviation in the estimation of $\beta_0$.

The second dataset is the medical data from Billard and Diday [6], which contains 10,000 classical observations. Xu [4] classified the entire dataset into 42 categories by age group × diabetes × race ($7 \times 3 \times 2 = 42$). For this dataset, we consider three variables: Y = cholesterol (chol), X1 = age, and X2 = income. The medical dataset should belong to Model 2, because the lengths of the variables do not depend on each other. We apply Method 1 and Method 2, with the same settings as in the simulations, and present the results in Table 10. Table 10 reports the estimates of $\beta$ ($\hat{\beta}$), the jackknife standard deviation (JackSD) and the 95% confidence interval (95% CI). From the results, the interval weighted LSE (M2) with W3 has a smaller standard deviation than the other methods, which coincides with the simulation results. The age variable is significant and the income variable is not significant. Furthermore, average cholesterol increases by about 0.59 for each additional year of age.

6. Conclusion

In the era of network technology, collected data are becoming larger and more complex, which makes them difficult to analyze with standard statistical tools. Diday [1] introduced a new data format called symbolic data, which can take many forms. In this paper, we focus on parameter estimation in linear regression for symbolic interval data and propose two approaches. For Model 1 data, which are considered by Billard and Diday [2], Carvalho et al. [3], and Xu [4], we develop the endpoints least squares estimator for the regression coefficients. Data of this kind, however, imply that the interval lengths of the dependent variable and of the independent variables are correlated with each other. In some applications, the interval lengths of the variables may not depend on each other. For that situation, we consider Model 2 data and suggest the interval weighted least squares estimation method. In addition, we compare our proposed methods with CM proposed by Billard and Diday [2], CRM proposed by Carvalho et al. [3] and SCM proposed by Xu [4] via simulations. In the simulation studies, the performance of the endpoints LSE is similar to that of the other methods for Model 1 data, and the interval weighted LSE with W3 performs better for Model 2 data. Finally, we analyze two real datasets for illustration, and the results coincide with those of the simulation studies.

Table 9. Estimations of $\beta$ for mushroom data.

Table 10. Estimations of $\beta$ for medical data (Y: chol, X1: age, X2: income).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Cite this paper

Hsieh, J.-J. and Pan, C.-C. (2018) Linear Regression Analysis for Symbolic Interval Data. Open Journal of Statistics, 8, 885-901. https://doi.org/10.4236/ojs.2018.86059

References

  1. Diday, E. (1987) Introduction à l'Approche Symbolique en Analyse des Données. Premières Journées Symbolique-Numérique. CEREMADE, Université Paris, 21-56.

  2. Billard, L. and Diday, E. (2000) Regression Analysis for Interval-Valued Data. In: Bock, H.-H. and Diday, E., Eds., Data Analysis, Classification and Related Methods, Springer-Verlag, Berlin, 369-374. https://doi.org/10.1007/978-3-642-59789-3_58

  3. Carvalho, F.A.T., Neto, L. and Tenorio, C.P. (2004) A New Method to Fit a Linear Regression Model for Interval-Valued Data. Annual Conference on Artificial Intelligence: KI2004 Advances in Artificial Intelligence, Ulm, 20-24 September 2004, 295-306. https://doi.org/10.1007/978-3-540-30221-6_23

  4. Xu, W. (2010) Symbolic Data Analysis: Interval-Valued Data Regression. Ph.D. Thesis, The University of Georgia, Athens.

  5. Billard, L. (2008) Sample Covariance Functions for Complex Quantitative Data. In: Mizuta, M. and Nakano, J., Eds., Proceedings of the International Association of Statistical Computing Conference 2008, 157-163.

  6. Billard, L. and Diday, E. (2004) Symbolic Data Analysis: Definitions and Examples. Technical Report.