Engineering
Vol.06 No.12(2014), Article ID:51360,14 pages
10.4236/eng.2014.612074
Effect Modeling of Count Data Using Logistic Regression with Qualitative Predictors*
Haeil Ahn
Department of Industrial Engineering, Seokyeong University, Seoul, Republic of Korea
Email: hiahn@skuniv.ac.kr
Copyright © 2014 by author and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/



Received 25 August 2014; revised 24 September 2014; accepted 9 October 2014
ABSTRACT
We modeled binary count data with categorical predictors, using logistic regression to develop a statistical method. We found that ANOVA-type analyses often performed unsatisfactorily, even when using different transformations. The logistic transformation of fraction data could be an alternative, but it is not desirable in the statistical sense. We concluded that such methods are not appropriate, especially in cases where the fractions were close to 0 or 1. The major purpose of this paper is to demonstrate that logistic regression with an ANOVA-model like parameterization aids our understanding and provides a somewhat different, but sound, statistical background. We examined a simple real-world example to show that we can efficiently test the significance of regression parameters, look for interactions, estimate related confidence intervals, and calculate the difference between the mean values of the referent and experimental subgroups. This paper demonstrates that precise confidence interval estimates can be obtained using the proposed ANOVA-model like approach. The method discussed here can be extended to any type of experimental fraction data analysis, particularly for experimental design.
Keywords:
Logistic Regression, Logit, Logistic Response, Categorical, Binary Count Data

1. Introduction
When manufacturing high-end goods, there is often a trade-off involving the yield rate or, equivalently, the fraction of nonconforming goods. A slight change in the process can drastically affect the yield rate or fraction of defective products, which results in a considerable increase or decrease in product turnover.
To develop a better process, purposeful changes should be made to the input variables of a process or production system, so that we can identify the reasons for changes in either the continuous or categorical outcomes and improve the manufacturing conditions. For this reason, it is commonplace in industry to analyze fraction data such as yield rates, percentages, and units of conforming or nonconforming product. When the input variables or regression predictors are all qualitative and the responses are countable, the data are often called categorical outcomes. Analysis of variance (ANOVA) has long been a favorite technique for investigating this type of data, as discussed in Rao [1] , Wiener et al. [2] and Toutenburg and Shalabh [3] .
Unfortunately, however, there are many cases where the fraction of nonconforming units of a product is close to zero or the yield rate of conforming units is close to one. In these cases, conventional analysis techniques often result in yield rate estimates exceeding 100%, or negative defective fraction estimates, as noted by many authors.
The drawbacks of using ANOVA for fraction data were noted by Cochran [4] . According to him, even the square-root or arcsine-square-root transformations of ANOVA-type data do not work properly. As Taguchi noted in Ross [5] , the additive property of fraction data does not hold, especially when the fraction is lower than 20% or higher than 80%. He made use of what he called the omega (Ω) transformation for data conversion. Although the omega transformation has its merits, it is not satisfactory in the statistical sense. Jaeger [6] investigated the problem from the point of view of psychological or behavioral sciences, and found that ANOVA can yield spurious results even after applying the arcsine-square-root transformation to proportional data. ANOVAs over proportions can result in hard-to-interpret results, because the confidence intervals can extend beyond the interpretable values (0 and 1). As an alternative, he recommended logistic regression models, which he called the ordinary logit and/or mixed logit models.
To avoid the above-mentioned phenomena, it is better to consider the logistic transformation. Dyke and Patterson [7] appear to be the first to have used a logistic transformation to analyze ANOVA-type factorial data. Much theoretical background on logistic regression for categorical data analysis (CDA) is available. Montgomery et al. [8] , Kleinbaum and Klein [9] , and Agresti [10] discussed the theoretical background in some detail. Some dichotomous response data were touched on in Dobson and Barnett [11] and Sloan et al. [12] in relation to contingency table analysis, while some polytomous response data were dealt with in Strokes et al. [13] and Dobson and Barnett [11] .
In most cases, they dealt with quantitative explanatory variables. However, there are many cases when qualitative predictors are appropriate for modeling and analyses. Even in the comprehensive book by Agresti [10] , logistic models with categorical predictors were not fully discussed. He did mention that logistic regression should include qualitative explanatory variables, often called categorical factors. In Agresti [10] , the author touched on the ANOVA-type representation of factors and use of the logistic regression model. But the suggested model is quite limited to the case of one factor, and hence is not informative enough for practitioners who want to extend it to models of multiple factors. In Strokes et al. [13] , the authors briefly introduced model fitting for logistic regression with two quantitative explanatory variables. In our opinion, however, their parameterization is a little confusing, and the ANOVA-model like parameterization is preferable.
Fortunately, modern statistics has presented many ways of extending logistic models. In this study, we consider a binary response variable (i.e., countable or categorical) and explanatory variables or predictors with three or more levels that are qualitative or categorical. The response variable may, for example, represent the units of a product manufactured under a certain condition. When trying to determine an appropriate statistical method for analyzing countable data within categorical settings, we have excluded the ANOVA-type analyses. However, we have used an ANOVA-model like parameterization with logistic regression and qualitative predictors. First, we examined the limitations of ANOVA-type analysis in connection to the defective rate or percentage data. Second, we considered logistic regression modeling of two-way binary count data with categorical variables. Then, we examined the behavior of the logistic regression model when fitted to the two-way count data within the logistic regression framework. We investigated this as an alternative to the ANOVA-type model, in an effort to combine logistic regression and qualitative predictors.
When implementing an experiment and analyzing the results, the optimal condition is sought by testing the significance of regression parameters, evaluating the existence of interactions, estimating related confidence intervals (CIs), assessing the difference of mean values, and so on. The significance of model parameters and the fraction estimates are used by the experimenter to identify and interpret the model.
The objectives of this study can be summarized as follows:
● To extend the ANOVA models with qualitative factors to logistic models with qualitative predictors.
● To estimate the main effects and/or interactions of ANOVA-model like parameterization.
● To estimate the confidence intervals (CIs) for model parameters and fractions.
● To ensure that the CIs for fractions are appropriate (between 0 and 1).
● To discuss the interpretation of the analysis results.
We have used a simple, but real, illustrative example to explain how to test the significance of model parameters, ascertain the existence of interactions, estimate the confidence intervals, and find the difference of the mean values. We have used the logistic regression programs of SAS, as described in Allison [14] , and MINITAB [15] to examine the efficiency of models and demonstrate the usefulness of logistic regression with ANOVA-model like parameterization.
2. Logistic Model with Categorical Predictors
2.1. Logistic Transformation
Let $p$ be a fraction representing the probability of the event occurring; then $p/(1-p)$ is the “odds”. The naturally logged odds $\ln\left(\frac{p}{1-p}\right)$ are defined as the logistic transformation, also called the “logit” link function. The logit link function converts fractions $p$ between 0 and 1 to values between minus and plus infinity. For example, if we let $\hat{p} = y/n$ denote the sample fraction of $y$ events in $n$ trials, then the random variable $\hat{p}$ has its own probability mass function, which is discrete, non-negative, and asymmetric. The normal approximation of this random variable might cause the aforementioned problems, especially when $p$ is close to 0 or close to 1, due to the lack of normality, as explained in Montgomery et al. [8] . If we take the sample log-odds $\ln\left(\frac{\hat{p}}{1-\hat{p}}\right)$, then the shape of the distribution becomes the logistic function, which is close to the normal distribution function. The cumulative distribution function is a monotonically increasing function. As another example, Ross [5] introduced Taguchi’s omega (Ω) transformation formula for calculating a db (decibel) value, which is similar to the log-odds. The omega transformation formula is $\Omega(\mathrm{db}) = 10\log_{10}\left(\frac{p}{1-p}\right)$, $0 < p < 1$. In this study, however, only the “logit” conversion is considered.
On the other hand, the logistic model is set up to ensure that whatever estimate for success or failure we have, it will always be some number between 0 and 1. Thus, we can never get a success or failure estimate either above 1 or below 0. For the variable $z$, the standard logistic response function, called $f(z)$, is given by

$$f(z) = \frac{e^{z}}{1 + e^{z}}$$



For linear variety, 

Let us think of a simple regression analysis where there exist several types of responses such as observations, regression line, confidence interval, and prediction interval. The logistic response function transforms the responses into some number between 0 and 1, which results in S-shaped curves.
Generally, for a linear predictor


Typical response functions with and without an interaction term are depicted in (b) and (a) of Figure 1, respectively. The logit link function, also called the “log-odds”, and the logistic response function are inverses of each other. The logistic model is widely used for binomial data and is implemented in many statistical programs, such as SAS and MINITAB.
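The relationship between the logit link and the logistic response function can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `logit` and `logistic` are illustrative choices, not names taken from SAS or MINITAB.

```python
import math

def logit(p):
    """Log-odds of a fraction p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Standard logistic response; equals e^z / (1 + e^z) and lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Fractions near 0 or 1 are stretched far out on the logit scale,
# and the logistic response maps them back to the original fraction.
for p in (0.01, 0.2, 0.5, 0.8, 0.99):
    z = logit(p)
    print(f"p = {p:.2f}   logit = {z:+.3f}   logistic(logit) = {logistic(z):.2f}")
```

Because `logistic` always returns a value strictly between 0 and 1 for moderate inputs, any estimate or interval limit computed on the logit scale and mapped back remains a valid fraction.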
2.2. Models for Two-Way Responses
Let us consider two-way 



Figure 1. Response of (a) 

where 





Such an ANOVA model can be transformed into a regression model. One way of defining the regression model corresponding to this model is as follows:

This model is also subject to the constraints in Equation (5). This type of modeling is often called “effect modeling” or “incremental effect parameterization”.
2.3. Odds Ratio Assessment
The odds ratio (OR) is defined as the ratio of the odds of any experimental subgroup to the odds of the referent subgroup.
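As a small numerical illustration of this definition, the sketch below computes an odds ratio from two fractions. The fractions are hypothetical values chosen for illustration, not data from this study, and the helper names are assumptions.

```python
import math

def odds(p):
    """Odds of an event whose probability is p (0 < p < 1)."""
    return p / (1.0 - p)

def odds_ratio(p_experimental, p_referent):
    """Odds of the experimental subgroup divided by the odds of the referent subgroup."""
    return odds(p_experimental) / odds(p_referent)

# Hypothetical nonconforming fractions, for illustration only.
p_ref = 0.10   # referent subgroup
p_exp = 0.04   # experimental subgroup

or_value = odds_ratio(p_exp, p_ref)
log_or = math.log(or_value)   # equals the difference of the two logits

print(f"odds ratio = {or_value:.3f}, log odds ratio = {log_or:.3f}")
```

An odds ratio below one indicates that the experimental subgroup has lower odds of nonconformance than the referent; on the logit scale the same comparison is a negative difference.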

In this study, we found that if the odds ratio is less than or equal to one; i.e., 

If we are sure that the upper and lower limits of 



2.4. Interaction Assessment
As defined in Kleinbaum and Klein [9] , an equation for assessing interaction can be identified as follows. We begin with the null hypothesis that:

It is interesting to note that the interaction effects can be expressed to be multiplicative. One way to state this null hypothesis, in terms of odds ratios, is that 



If the equation of this null hypothesis is satisfied, we say that there is “no interaction on a multiplicative scale.” In contrast, if this expression does not hold, we say that there is “evidence of interaction on a multiplicative scale.”
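The multiplicative condition can be sketched numerically for a 2 x 2 layout. The notation below (OR_10, OR_01, OR_11 for the odds ratios of each subgroup against the referent) follows the usual presentation of multiplicative interaction and is an assumption here, as are the hypothetical fractions.

```python
import math

def odds(p):
    return p / (1.0 - p)

def log_interaction(p):
    """log of OR_11 / (OR_10 * OR_01); zero means no interaction on a multiplicative scale.

    p maps (level of factor A, level of factor B) to a fraction; (0, 0) is the referent.
    """
    ref = odds(p[(0, 0)])
    or_10 = odds(p[(1, 0)]) / ref
    or_01 = odds(p[(0, 1)]) / ref
    or_11 = odds(p[(1, 1)]) / ref
    return math.log(or_11 / (or_10 * or_01))

# Hypothetical fractions, for illustration only.
p_obs = {(0, 0): 0.10, (1, 0): 0.06, (0, 1): 0.08, (1, 1): 0.02}

# A companion table in which the (1, 1) cell is set so that OR_11 = OR_10 * OR_01.
o11 = odds(p_obs[(1, 0)]) * odds(p_obs[(0, 1)]) / odds(p_obs[(0, 0)])
p_null = dict(p_obs)
p_null[(1, 1)] = o11 / (1.0 + o11)

print(f"observed table : {log_interaction(p_obs):+.3f}")
print(f"null table     : {log_interaction(p_null):+.3f}")
```

On the logit scale the quantity returned by `log_interaction` is the interaction term of a two-factor model with indicator coding, which is why a multiplicative statement about odds ratios becomes an additive one about regression parameters.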
We can make use of this formula to test the null hypothesis 

2.5. Estimation of Regression Parameters
We consider generalized linear models in which the outcome variables are measured on a binary scale, as explained in Dobson and Barnett [11] . For example, the responses may be “success” or “failure”, or nonconforming or conforming. “S” and “F”, denoted by 1 and 0, are used as generic terms for the two categories. First, the binary random variable is defined.

with probabilities 










In Dobson and Barnett [11] , the one factor case of 




In this study, we intend to extend this one-way, single-factor case to the two-way, two-factor case.
We consider the case of 


If we define

Since

Table 1. Frequencies of 
As defined in Equation (7),



The partial derivative and the Hessian of this likelihood function with respect to 






This is usually called generalized estimating equation (GEE). For more details, refer to Montgomery et al. [8]
and Dobson and Barnett [11] . The information matrix corresponds to
matrix 



where the 








2.6. Interval Estimation of Fractions
The odds can be estimated by 

This is the point estimate for the fraction of each subgroup of outcomes. To obtain the interval estimation, prediction vectors are needed. Let 


The expectation and variance can be given by:

The 



By taking the reciprocals, 


In the estimation process, the precision varies depending on the sample size of each subgroup. The bigger the sample size is, the more accurate the fraction estimates are.
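The interval estimation described in this section can be sketched as follows: form the linear predictor from a prediction vector, compute its variance from the covariance matrix of the parameter estimates, build a Wald interval on the logit scale, and map the limits back with the logistic response. The coefficient and covariance values below are hypothetical, and the function name `fraction_interval` is an illustrative assumption.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def fraction_interval(x, beta, cov, z_crit=1.96):
    """Lower limit, point estimate, and upper limit of a fraction.

    x: prediction vector; beta: estimated coefficients;
    cov: covariance matrix of beta; z_crit: normal quantile (1.96 for 95%).
    """
    k = len(x)
    eta = sum(x[i] * beta[i] for i in range(k))                   # linear predictor
    var = sum(x[i] * cov[i][j] * x[j] for i in range(k) for j in range(k))
    half = z_crit * math.sqrt(var)
    return logistic(eta - half), logistic(eta), logistic(eta + half)

# Hypothetical estimates for an intercept plus one indicator, for illustration only.
beta_hat = [-2.2, -0.9]
cov_hat = [[0.05, -0.05],
           [-0.05, 0.15]]

lower, point, upper = fraction_interval([1.0, 1.0], beta_hat, cov_hat)
print(f"fraction = {point:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Since both limits pass through the logistic response, the interval cannot extend below 0 or above 1, in contrast to an interval computed directly on the fraction scale.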
There exist excellent computer programs that implement maximum-likelihood estimation for logistic regression, such as SAS PROC LOGISTIC in Allison [14] and MINITAB [15] . We have only to apply ourselves to modeling and parameterization.
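For readers who wish to see what such programs do internally, the following is a minimal sketch of a Newton-Raphson maximum-likelihood fit for grouped binomial counts, written with the Python standard library only. It is not the algorithm of any particular package; the function names and the two-subgroup data are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logistic(X, y, n, iters=25, tol=1e-10):
    """Fit logit(pi_i) = x_i' beta to counts y[i] out of n[i] by Newton-Raphson."""
    k, g = len(X[0]), len(X)
    beta = [0.0] * k
    for _ in range(iters):
        pi = [1.0 / (1.0 + math.exp(-sum(X[i][j] * beta[j] for j in range(k))))
              for i in range(g)]
        # Score vector X'(y - n*pi) and information matrix X' W X, W = diag(n*pi*(1-pi)).
        score = [sum(X[i][j] * (y[i] - n[i] * pi[i]) for i in range(g)) for j in range(k)]
        info = [[sum(X[i][a] * n[i] * pi[i] * (1.0 - pi[i]) * X[i][b] for i in range(g))
                 for b in range(k)] for a in range(k)]
        step = solve(info, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    # Columns of the inverse information matrix give the covariance of beta.
    cov = [solve(info, [1.0 if r == c else 0.0 for r in range(k)]) for c in range(k)]
    se = [math.sqrt(cov[j][j]) for j in range(k)]
    return beta, se

# Hypothetical data: referent subgroup 10/100, candidate subgroup 4/100 nonconforming.
X = [[1, 0], [1, 1]]
beta, se = fit_logistic(X, y=[10, 4], n=[100, 100])
print("beta =", [round(b, 4) for b in beta], " se =", [round(s, 4) for s in se])
```

With one parameter per subgroup, as here, the fitted intercept is the logit of the referent fraction and the second coefficient is the log odds ratio of the candidate subgroup against the referent.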
3. Illustrative Example of Qualitative Predictors
3.1. Illustrative Example
A manufacturing company produces touch-screen panels on a massive scale for delivery to a customer company, which produces smart phones and tablet PCs. Currently, resistive touch-screen panels (TSP) are widely used. The company plans to produce capacitive TSP (CTSP) to minimize the thickness. However, several problems may arise during the fabrication of CTSP. For example, after the screen printing process, when an Ag paste is cured at a high temperature, cracks may occur in a fine indium tin oxide (ITO) line. Moreover, many defects, such as air bubbles, foreign substances, and scratches, may occur during the interlayer lamination process. The defective items are the major source of the failure cost of the product.
An experimenter is seeking a method of fabricating the CTSP that can efficiently reduce the cost of CTSP fabrication, which can be assessed in terms of the yield rate or the fraction nonconforming. There are four patented methods of fabrication and four types of facilities available for the process operation. The experimenter is concerned with handling explanatory variables that are qualitative and contain three or more levels.
3.2. Units of Nonconforming
Since the example lends itself to the problem of two-way binary data, let us consider two qualitative factor experiments with 





Expressed in terms of the combination of factor levels, the subgroup 



The experimenter is concerned about the optimal subgroup and the significance of the fractions of these experimental subgroups, eventually to find out the improved experimental subgroup, if any, which gives the lowest 
3.3. ANOVA with Logistic Transformation
The logistically transformed log-odds data shown in Table 3 can be regarded as a two-way layout without replications.
Table 2. Units of product nonconforming.
Table 3. Logistically transformed data.
The following model seems to be relevant as far as the ANOVA is concerned.

Since this is a two-way layout without replication, the interactions are not considered in the model. Based on the model, the ANOVA can be conducted as in Table 4.
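The two-way ANOVA without replication on logit-transformed data can be sketched as below. The counts are hypothetical and do not reproduce Table 2 or Table 4; the layout (four levels per factor, equal subgroup sizes) is assumed for illustration.

```python
import math

# Hypothetical nonconforming counts out of n units per subgroup
# (rows: levels of factor A, columns: levels of factor B).
n = 200
counts = [[20, 16, 12, 18],
          [14, 10,  8, 12],
          [10,  6,  4,  9],
          [16, 12,  9, 15]]

# Logit of each observed fraction c / n.
z = [[math.log(c / (n - c)) for c in row] for row in counts]
a, b = len(z), len(z[0])

grand = sum(map(sum, z)) / (a * b)
row_mean = [sum(r) / b for r in z]
col_mean = [sum(z[i][j] for i in range(a)) / a for j in range(b)]

ss_a = b * sum((m - grand) ** 2 for m in row_mean)
ss_b = a * sum((m - grand) ** 2 for m in col_mean)
# Without replication the residual absorbs any interaction.
ss_e = sum((z[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
           for i in range(a) for j in range(b))
ss_t = sum((z[i][j] - grand) ** 2 for i in range(a) for j in range(b))

df_a, df_b, df_e = a - 1, b - 1, (a - 1) * (b - 1)
f_a = (ss_a / df_a) / (ss_e / df_e)
f_b = (ss_b / df_b) / (ss_e / df_e)

print(f"SS_A = {ss_a:.4f}, SS_B = {ss_b:.4f}, SS_E = {ss_e:.4f}, SS_T = {ss_t:.4f}")
print(f"F_A = {f_a:.2f}, F_B = {f_b:.2f} on ({df_a}, {df_e}) degrees of freedom")
```

The F ratios would then be compared with the critical values of the F distribution for the stated degrees of freedom, as in a conventional ANOVA table.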
Factor 


Since the number of effective replication is given by:

The 95% confidence interval for 

The result of Minitab processing can be obtained as in Figure 2.
The point estimate for 

Likewise, the 95% confidence interval for 

The result is quite plausible in that the interval is narrower than before and the lower limit can never be negative. However, the optimal manufacturing condition of ANOVA is

3.4. Logistic Regression for the Illustrative Example
The data structure for the illustrative example shown in Table 2 is identified as follows:

where 





where 


There are several ways of handling these constraints as in Dobson and Barnett [11] . One of those methods is to set to zero the first term of each constraint. That is,

If 








Let 






The explanatory variables are all indicators. Notice that the columns corresponding to











3.5. Estimation of Parameters
Model 1 in its full form is fitted to the data in Table 2. The Minitab logistic regression output is displayed in Figure 3.
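The indicator coding behind such a full model, in which the first level of each factor is set to zero and acts as the referent, can be sketched as follows. The column ordering (intercept, main effects of the first factor, main effects of the second factor, then interactions) and the function name `design_row` are assumptions made for illustration and need not match the ordering used by Minitab or SAS.

```python
from itertools import product

def design_row(i, j, a=4, b=4):
    """Indicator row for subgroup (i, j), 1-based, with level 1 of each factor as referent."""
    row = [1]                                               # intercept
    row += [1 if i == k else 0 for k in range(2, a + 1)]    # main effects, first factor
    row += [1 if j == l else 0 for l in range(2, b + 1)]    # main effects, second factor
    row += [1 if (i == k and j == l) else 0                 # interaction indicators
            for k in range(2, a + 1) for l in range(2, b + 1)]
    return row

# One row per subgroup of a 4 x 4 layout: 16 rows and 16 columns in the full model.
X = [design_row(i, j) for i, j in product(range(1, 5), repeat=2)]

for (i, j), row in zip(product(range(1, 5), repeat=2), X):
    print((i, j), row)
```

The referent subgroup (1, 1) activates only the intercept, while any subgroup with both factors away from the referent activates the intercept, two main-effect indicators, and one interaction indicator.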
Table 4. ANOVA table.
*: significant at 5% **: significant at 1%
Figure 2. 95% confidence and prediction interval (Minitab).
Figure 3. Parameter estimates for model 1.
The p-values and the CI’s of odds ratios can be regarded as measures of the significance tests of regression parameters. Some parameters look significant, but others do not. As a matter of fact, regardless of whether parameters are significant or not, we can eliminate any rows or columns from the table on purpose without affecting the estimation of other parameters, owing to the incremental effect parameterization.
For example, we are interested in comparison of reference subgroup and strong candidate subgroup for optimality. Since the combination 







Likewise, the logistic regression model reduces to the following.

The design matrix 


Table 5. First reduced table.
The parameter estimates are shown in Figure 4. It is worth noting that the parameter estimates remain the same. We can thus confirm that the elimination of rows and columns does not affect the parameter estimates. This simplifying property is the major difference between ANOVA-type and incremental effect modeling.
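One way to see why the estimates are unaffected is to assume that the full model is saturated, that is, it has one parameter per subgroup. Under that assumption the maximum-likelihood estimates are simple contrasts of the observed subgroup logits against the referent, so removing other rows or columns cannot change them. The sketch below uses hypothetical counts and illustrative parameter labels.

```python
import math

def effects(counts, n):
    """Incremental effect estimates for a saturated two-way logit model.

    counts[i][j] is the nonconforming count out of n for subgroup (i, j);
    the first row label and first column label act as the referent.
    """
    rows = list(counts)
    cols = list(next(iter(counts.values())))
    L = {(i, j): math.log(counts[i][j] / (n - counts[i][j])) for i in rows for j in cols}
    r0, c0 = rows[0], cols[0]
    est = {"mu": L[r0, c0]}
    for i in rows[1:]:
        est[f"alpha_{i}"] = L[i, c0] - L[r0, c0]
    for j in cols[1:]:
        est[f"beta_{j}"] = L[r0, j] - L[r0, c0]
    for i in rows[1:]:
        for j in cols[1:]:
            est[f"ab_{i}{j}"] = L[i, j] - L[i, c0] - L[r0, j] + L[r0, c0]
    return est

# Hypothetical 4 x 4 table of counts out of 200 units per subgroup.
full = {1: {1: 20, 2: 16, 3: 12, 4: 18},
        2: {1: 14, 2: 10, 3: 8, 4: 12},
        3: {1: 10, 2: 6, 3: 4, 4: 9},
        4: {1: 16, 2: 12, 3: 9, 4: 15}}

# Reduced table: keep the referent level and level 3 of each factor only.
reduced = {i: {j: full[i][j] for j in (1, 3)} for i in (1, 3)}

est_full = effects(full, n=200)
est_reduced = effects(reduced, n=200)
unchanged = all(est_full[k] == est_reduced[k] for k in est_reduced)
print("estimates unchanged after reduction:", unchanged)
```

In a conventional ANOVA parameterization with sum-to-zero constraints, by contrast, every effect depends on the grand mean and therefore on all rows and columns retained in the table.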
On the one hand, 



The design matrix 


The parameter estimates are shown in Figure 5.
The estimate for 



3.6. Existence of Interactions
Notice that the estimate for the interaction 






Usually, equality does not hold, unless



3.7. Estimation of Confidence Intervals
We can ensure that the point estimates for each fraction can be obtained as follows:

Figure 4. Parameter estimates for model 2.
Figure 5. Parameter estimates for model 3.
From a conventional statistical view point, we might like to calculate the confidence intervals. To find confidence intervals, we have to know the standard errors 
matrix

90% CI for
90% CI for
In the same way, we can calculate the confidence intervals for 



In general, the standard errors corresponding to



where 




The confidence intervals can provide us with the information on whether the sample size is large enough or not. For example, the interval estimates for 


Table 6. Point and interval estimates for fractions of model 3.
3.8. Collection of More Data
Sometimes we might have to see the problem in another way. In order to make the two subgroups 


The logistic regression model becomes as simple as the following.

The parameter 



The parameter estimates are shown in Figure 6.
The interval estimate (0.21, 1.63) for 


Simulated data are given in Table 8, where the numbers of replications are increased up to 1000 with the referent subgroup 

Model 4 is fitted into the data. The design matrix 

The estimate 



If we are sure that the upper and lower limits of 
Seen from the conventional view point of statistics, the point and interval estimates for fractions can be given as in Table 9. Since the confidence intervals overlap, the intervals are not discriminative enough.
In this manner, an experimenter can decide on whether parameter estimates are significant, whether the model is appropriate, whether sample size is large enough, and whether the fraction of candidate subgroup is smaller, until he or she is convinced that the candidate subgroup is superior to the current one.
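The effect of collecting more data on the odds ratio interval can be sketched as follows. The two fractions are hypothetical and are held fixed while the subgroup size grows; the Wald interval on the log scale is a standard large-sample approximation and is an assumption here, not necessarily the interval reported by Minitab.

```python
import math

def or_interval(y_exp, n_exp, y_ref, n_ref, z_crit=1.96):
    """Odds ratio of the experimental subgroup against the referent, with a Wald interval."""
    log_or = math.log(y_exp / (n_exp - y_exp)) - math.log(y_ref / (n_ref - y_ref))
    se = math.sqrt(1 / y_exp + 1 / (n_exp - y_exp) + 1 / y_ref + 1 / (n_ref - y_ref))
    return (math.exp(log_or - z_crit * se), math.exp(log_or), math.exp(log_or + z_crit * se))

# Hypothetical fractions: referent 10% nonconforming, candidate 6% nonconforming.
p_ref, p_exp = 0.10, 0.06

intervals = {}
for n in (100, 500, 1000, 5000):
    low, est, high = or_interval(round(p_exp * n), n, round(p_ref * n), n)
    intervals[n] = (low, est, high)
    verdict = "contains 1" if low < 1.0 < high else "excludes 1"
    print(f"n = {n:5d}   OR = {est:.3f}   95% CI = ({low:.3f}, {high:.3f})   {verdict}")
```

With the same underlying fractions, a small subgroup size leaves the interval straddling one, whereas a larger size narrows it enough to exclude one, which mirrors the reasoning for collecting more data.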
4. Conclusion and Further Study
In reality, there are many cases where an experimenter has to analyze fraction data, usually provided in the form of percentages or yield rates as the outcomes of an experiment. The input variables are quantitative, qualitative, or both. In this study, the case in which both input variables are qualitative and the responses are countable is considered, in order to extend the model in Agresti [10] and leverage the logistic model in Strokes et al. [13] . That is to say, an attempt is made to address the problem of binary outcomes with two categorical predictors by utilizing logistic regression. In this study, we excluded ANOVA-type analyses, but we adopted ANOVA-model like parameterization, that is, incremental effect modeling.
Table 7. Second reduced table.
Table 8. Simulated data table.
Table 9. Point and interval estimates for fractions of model 3.
Figure 6. Parameter estimates for model 4.
Figure 7. Parameter estimates for model 4.
The optimal manufacturing condition can be ensured, mainly by testing the significance of regression parameters, testing the existence of interactions, estimating related confidence intervals, testing the difference of mean values, and so on. The conventional ANOVA-type analyses are based on the assumptions of normality, independence, and equality of variances of experimental observations. For this reason, the ANOVA-type model is detrimental to the goodness-of-fit test and to the efficient and precise estimation of regression parameters, mainly because the additive property of fraction data is no longer valid, especially when the fractions are close to zero or one, as discussed by Jaeger [6] .
As is always the case with logistic regression, the point estimates are more accurate than those of ANOVA-type modeling. Not only is the lower limit always positive, but the upper limit is also always less than one. The significance test of a parameter can be performed by checking whether the confidence interval of the corresponding odds ratio contains one or not, based on the assumption that the null hypothesis 
When dealing with logistic regression with categorical predictors, the generalized estimating equations (GEE) must be utilized to estimate the parameters. This demerit, nevertheless, can be easily overcome by making use of commercial computer programs such as SAS PROC LOGISTIC and MINITAB. The analysis process is somewhat different from the conventional statistical analysis method. We might have to abandon our conventional ANOVA-type way of interpreting the analysis results.
The use of logistic regression has its merits: 1) the analyst can never get a yield rate or defective rate estimate either above 1 or below 0; 2) the parameter estimates are more efficient and accurate than those of the ANOVA-type model, since the logistic regression model describes the intrinsic nature of the count data more accurately; and 3) the significance test of regression parameters is easily performed by checking the interval estimates for odds ratios.
There exist other types of transformations not mentioned in this study, such as the probit and complementary log-log transformations, which seem worth trying. The logistic regression model is sometimes called the ordinary logit model to distinguish it from the so-called mixed logit model. The mixed logit model could be the next topic of this study.
The analysis method discussed throughout this study can be extended to the case of multiple qualitative predictors for count data, just as a variety of models are available in the literature, especially in the area of experimental design and regression analysis.
Acknowledgements
This research was supported by Seokyeong University in 2013.
References
- Rao, M.M. (1960) Some Asymptotic Results on Transformations in the Analysis of Variance. ARL Technical Note, Aerospace Research Laboratory, Wright-Patterson Air Force Base, Dayton, 60-126.
- Wiener, B.J., Brown, D.R. and Michels, K.M. (1971) Statistical Principles in Experimental Design. McGraw Hill, New York.
- Toutenburg, H. and Shalabh (2009) Statistical Analysis of Designed Experiments. 3rd Edition, Springer Texts in Statistics.
- Cochran, W.G. (1940) The Analysis of Variances When Experimental Errors Follow the Poisson or Binomial Laws. The Annals of Mathematical Statistics, 11, 335-347. http://dx.doi.org/10.1214/aoms/1177731871
- Ross, P.J. (1989) Taguchi Techniques for Quality Engineering. McGraw Hill, Singapore.
- Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (Transformation or Not) and towards Logit Mixed Models. Journal of Memory and Language, 59, 434-446. http://dx.doi.org/10.1016/j.jml.2007.11.007
- Dyke, G.V. and Patterson, H.D. (1952) Analysis of Factorial Arrangements When the Data Are Proportions. Biometrics, 8, 1-12. http://dx.doi.org/10.2307/3001521
- Montgomery, D.C., Peck, E.A., and Vining, G.G. (2006) Introduction to Linear Regression Analysis. 4th Edition, John Wiley & Sons, Inc., Hoboken.
- Kleinbaum, D.G. and Klein, M. (2010) Logistic Regression: A Self Learning Text. 3rd Edition, Springer, New York. http://dx.doi.org/10.1007/978-1-4419-1742-3
- Agresti, A. (2013) Categorical Data Analysis. 3rd Edition, John Wiley & Sons Inc., Hoboken.
- Dobson, A.J. and Barnett, A.G. (2008) An Introduction to Generalized Linear Models. 3rd Edition, CRC Press, Chapman & Hall, Boca Raton.
- Sloan, D. and Morgan, S.P. (1996) An Introduction to Categorical Data Analysis. Annual Review of Sociology, 22, 351-375. http://dx.doi.org/10.1146/annurev.soc.22.1.351
- Strokes, M.E., Davis, C.S. and Koch, G.G. (2000) Categorical Data Analysis Using the SAS System. 2nd Edition, SAS Institute Inc., Cary, NC.
- Allison, P.D. (1999) Logistic Regression Using the SAS System—Theory and App. SAS Institute Inc., Cary, NC.
- Minitab (2011) Minitab Manual. Minitab Inc. http://www.minitab.com/en-us/
- Hsieh, F.Y., Bloch, D.L. and Larsen, M.D. (1998) A Simple Method of Sample Size Calculation for Linear and Logistic Regression. Statistics in Medicine, 17, 1623-1634. http://dx.doi.org/10.1002/(SICI)1097-0258(19980730)17:14<1623::AID-SIM871>3.0.CO;2-S
NOTES
*Effect modeling means “incremental effect parameterization”.









