
Published in final edited form as: Stat Modelling. 2016 June; 16(3): 238–248. Author manuscript available in PMC 2017 June 01.

Discussion of ‘Regularized Regression for Categorical Data’

Bryan E. Shepherd and Qi Liu

Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA

1 Introduction

We congratulate Drs. Tutz and Gertheiss (TG) on their nice review article, ‘Regularized Regression for Categorical Data,’ and more generally on their excellent contributions to the statistical modelling field. Their approaches for dealing with nominal and ordered categorical variables using various penalization strategies that favor fusing categories are interesting, sensible, and applicable. We like the ideas and we learned a lot from reading this manuscript.


In our discussion of TG’s article, we will first examine their approach in some simple settings with a single ordinal predictor. Applying TG’s approaches to these simple scenarios has helped us better understand their methods for penalizing ordinal predictors. In the setting of ordinal predictors, we also consider how this approach compares with isotonic regression and alternative approaches treating ordinal variables as continuous. Finally, we will discuss more general aspects of regularization – specifically, interpretation, assumptions, and inference.

2 Some simple examples with a single ordinal predictor


We assume we were asked to discuss this article because of our interest and research in methods for ordinal data analysis. A few years ago, Chun Li and I (B. Shepherd) became interested in approaches for assessing the association between ordinal X and ordinal Y conditional on multivariable Z. We were aware of standard regression techniques for ordinal Y (e.g., proportional odds models), but we were not satisfied with existing solutions for including the ordinal predictor X in the model. It seems that in practice X is almost always treated as either a nominal variable or a continuous variable. In the former case, order information is ignored, presumably resulting in a loss of power. In the latter case, arbitrary numbers have to be assigned to the categories, and results are generally not robust to the choice of numbers. Our proposed solution to this problem was to fit an appropriate multinomial (or ordinal) regression model of X on Z, an appropriate regression model of Y on Z, obtain probability-scale residuals from these models, and then estimate the residual correlation (Li and Shepherd, 2010, 2012). This approach results in a covariate-adjusted partial Spearman’s rank correlation between X and Y that incorporates the order information of these variables but does not assign arbitrary numbers to the categories; the null hypothesis

Address for correspondence: Bryan Shepherd, Department of Biostatistics, Vanderbilt University School of Medicine, 2525 West End, Suite 11000, Nashville, TN 37203, USA. Email: [email protected]. Phone: 615-343-3496. Fax: 615-343-4924.


of conditional independence can be tested. Some additional work has considered other outcome types (Liu et al., 2016; Shepherd et al., 2016). The penalization methods highlighted by Tutz and Gertheiss (TG) address a different, broader class of problems. Whereas our methods have been primarily designed to measure the strength of conditional associations with ordinal variables, TG’s methods are designed from a statistical modelling perspective, with improved prediction as the presumed primary goal. As with our methods, TG’s methods can address situations with a single ordinal predictor; however, TG’s methods can also simultaneously handle multiple ordered categorical predictors.
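The probability-scale residual at the heart of our approach has a simple closed form: for an observed ordinal value x and a fitted model, the residual is P(X* < x) − P(X* > x), where X* follows the fitted distribution. Our software is in R; the following Python sketch, with made-up fitted probabilities purely for illustration, shows the computation. The covariate-adjusted partial Spearman correlation is then obtained by correlating the residuals from the X-on-Z and Y-on-Z models.

```python
import numpy as np

def probability_scale_residual(obs, fitted_probs):
    """PSR for an ordinal variable: P(X* < x_obs) - P(X* > x_obs)
    under the fitted category probabilities (one row per subject,
    categories coded 0, 1, ..., K-1)."""
    cum = np.cumsum(fitted_probs, axis=1)
    n = len(obs)
    below = np.where(obs > 0, cum[np.arange(n), np.maximum(obs - 1, 0)], 0.0)
    above = 1.0 - cum[np.arange(n), obs]
    return below - above

# toy data: 3 subjects, 3 ordinal categories; probabilities are invented
obs = np.array([0, 1, 2])
probs = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.3, 0.6]])
r = probability_scale_residual(obs, probs)
# the partial Spearman-type statistic correlates the PSRs from the
# X|Z model with those from the Y|Z model, e.g. np.corrcoef(r_x, r_y)
```

The residual lies in (−1, 1) and has mean zero under the fitted model, which is what makes the residual correlation interpretable as a rank correlation.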


Despite the differences between our methods, it has been helpful for us to consider the behavior of the TG approach in some simple settings that we used in our earlier work to investigate the performance of our methods. We consider the ordinal predictor penalization approaches of TG in examples with a single ordered categorical X, following the simulations of Li and Shepherd (2010) except that, for simplicity, we let Y be a continuous outcome. Specifically, we first generated Z from a standard normal distribution, Z ~ N(0, 1). We then generated X with five categories from the proportional odds model logit{P(X ≤ j | Z = z)} = α_j − β_X z, with intercepts α_j and β_X = 1. The outcome variable Y was then generated given X = x and Z = z from a normal distribution, N(β_0 + η_x + β_Y z, 1), with β_0 = −0.5 and β_Y = 0.5. We consider the following values for η:

- Monotone steps: η = {0, 0, 1.5, 1.5, 3},

- Monotone, non-linear: η = {0, 2.4, 2.5, 2.6, 2.7},

- Linear: η = {0, 0.5, 1, 1.5, 2},

- Non-monotone: η = {0, 0.75, 1.5, 0.75, 0}.

Each η describes a different relationship between ordinal X and Y. The latter three were used by Li and Shepherd (2010); we added the first because it is a situation where the utility of the TG approach is readily apparent. In each example, we simulate n = 500 observations.
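The data-generating mechanism above can be sketched in a few lines. Our actual simulations were done in R; the Python sketch below is for illustration only, and the cutpoint values in `alpha` are illustrative placeholders rather than the values used by Li and Shepherd (2010). The sign convention in the proportional odds model is one common choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

# illustrative cutpoints for 5 categories (not the published values)
alpha = np.array([-1.5, -0.5, 0.5, 1.5])
beta_X = 1.0

z = rng.standard_normal(n)
# proportional odds: P(X <= j | Z = z) = expit(alpha_j - beta_X * z)
cum = expit(alpha[None, :] - beta_X * z[:, None])   # n x 4 cumulative probs
u = rng.uniform(size=n)
x = (u[:, None] > cum).sum(axis=1)                  # categories coded 0..4

# monotone-steps effect and outcome model
eta = np.array([0.0, 0.0, 1.5, 1.5, 3.0])
beta0, beta_Y = -0.5, 0.5
y = beta0 + eta[x] + beta_Y * z + rng.standard_normal(n)
```

Swapping the `eta` vector for any of the other three choices above reproduces the corresponding scenario.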


With each simulated dataset, we apply the penalty given in equation (8) of TG, which is meant to enforce the fusion of adjacent categories. As weights for a ‘standard’ approach, we use those suggested by TG (Gertheiss and Tutz, 2010), namely [(n_r + n_s)/n]^{1/2}, where n_r and n_s are the numbers of observations in the rth and sth adjacent categories of X. For an ‘adaptive’ approach, we use these weights divided by the difference between the OLS estimates, as described in section 3.1.3 of TG. (In an earlier iteration of our analyses we did not include [(n_r + n_s)/n]^{1/2} in our weights; results were very similar, presumably because imbalance across the categories of X in our example is not severe.) The key components of the analysis code were kindly provided to us by Drs. Tutz and Gertheiss. Our simulation and analysis code using R statistical software is posted at biostat.mc.vanderbilt.edu/ArchivedAnalyses.
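Both weighting schemes are straightforward to compute. A minimal Python sketch, assuming categories are coded with sortable labels and that the adaptive weights divide by the absolute OLS differences as described above:

```python
import numpy as np
from collections import Counter

def fusion_weights(x, ols_coefs=None):
    """Weights w_{r,s} for the adjacent-difference fusion penalty.
    Standard: sqrt((n_r + n_s)/n) for adjacent categories r, s;
    adaptive: additionally divided by |difference of OLS estimates|."""
    n = len(x)
    counts = Counter(x)
    cats = sorted(counts)
    w = np.array([np.sqrt((counts[r] + counts[s]) / n)
                  for r, s in zip(cats[:-1], cats[1:])])
    if ols_coefs is not None:
        w = w / np.abs(np.diff(ols_coefs))
    return w

# balanced example: 5 categories with 100 observations each
x = [1] * 100 + [2] * 100 + [3] * 100 + [4] * 100 + [5] * 100
w_std = fusion_weights(x)                       # all equal sqrt(200/500)
w_ada = fusion_weights(x, ols_coefs=[0.0, 0.2, 1.4, 1.6, 3.0])
```

With balanced categories the standard weights are constant, which is consistent with our observation that omitting them changed little in our examples; the adaptive weights up-weight fusion of categories whose OLS estimates are close.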



2.1 Monotone Steps Association


Figure 1 shows coefficient paths for the simulation where the relationship between X and Y is described by monotone steps. The coefficients for X = 2, …, 5 and Z are labeled, with X = 1 as the reference category (i.e., the coefficient for category X = 1 is set to 0). Coefficients are shown as a function of s/smax, with s denoting the value of the penalty and smax denoting the value of s corresponding to the OLS estimates. The intercept’s path is shown by the dashed curve. Using TG’s approach with standard weighting (left panel), the coefficients for categories 1 and 2 are the first to fuse (i.e., they fuse at the largest s/smax), then the coefficients for categories 4 and 5, and finally the coefficients for categories 3, 4, and 5. Interestingly, the coefficients for categories 4 and 5 fused before those for 3 and 4, despite the equality of the coefficients for categories 3 and 4 in the true underlying data generation mechanism and the closer proximity of their OLS estimates. TG’s approach with adaptive weighting performs better in this example, with the coefficients for categories 3 and 4 coming together before meeting the coefficient for category 5. All plots also show the parameter estimates based on the tuning parameter that yields the lowest mean-squared error (MSE) under 5-fold cross-validation (dotted vertical line) and the parameter estimates based on the smallest s within one standard error of the best MSE (dashed vertical line), which favors parsimonious models (Hastie et al., 2009). A model with no fusing yields the lowest MSE, likely because in this example the sample size is fairly large (n = 500) relative to the number of parameters (p = 6). The more parsimonious model (dashed vertical line) may be more interesting in our examples. The parsimonious model using the adaptive TG approach is what we would hope to select given our knowledge of the data generation mechanism, with categories 1 and 2 fused together and categories 3 and 4 fused together.

2.2 Monotone Non-linear Association


The coefficient paths for the TG approach in the monotone non-linear simulation are shown in Figure 2. These seem to be well behaved, with the coefficients for categories 2, 3, 4, and 5 grouping together fairly quickly using both the standard and the adaptive weights. The parsimonious adaptive model (dashed vertical line) fuses the four higher categories into a single category, which is reasonable in this setting.

2.3 Linear Association


Figure 3 shows the coefficient paths when data were generated under a linear relationship between X and Y. In this case, the grouping of coefficients occurs more slowly, as would be expected, and in a manner consistent with the stochastic variation in these simulations. Both TG approaches chosen after 5-fold cross-validation select 4 separate coefficients for X as the optimal model. The more parsimonious models (dashed lines) fuse categories 4 and 5 using the standard weights, and categories 4 and 5 together with categories 1 and 2 using the adaptive weights. However, the most parsimonious model in this setting would be to assign the numbers 1–5 to the 5 categories and treat X as a continuous variable with a single parameter describing the linear relationship. Although what TG’s approach does in this setting is reasonable, we believe this example illustrates the potential price one can pay in additional parameters by using these methods. In this example, the additional parameters are not likely to be much of a problem. However, as the number of categories of the ordinal variable gets large, the choice to treat it as a categorical variable rather than assigning numbers may lead to more dramatic over-parameterization. This will be discussed further in Section 4.

2.4 Non-monotone Association


Finally, the coefficient paths under the non-monotone scenario are shown in Figure 4. We find these results interesting. Notice that under this data generation scenario, the true coefficients for categories 1 and 5 are identical, as are the coefficients for categories 2 and 4. Although the coefficient for category 5 is close to zero (the coefficient for category 1, the reference category), it is not zero, and it groups with zero very late – after all other coefficients have been combined. Similarly, the coefficients for categories 2 and 4 do not come together until after those for categories 3 and 4 have combined. This behavior, therefore, is not optimal. The more parsimonious model would have first grouped together categories 1 and 5 and categories 2 and 4. However, this less parsimonious fusing is not surprising given that the penalization scheme favors grouping adjacent categories. Presumably, if X had been treated as a nominal categorical variable, the fusing would have been more consistent with the data generation mechanism. We believe this illustrates an important point – ordinal data do not necessarily imply the existence of monotone relationships, and the TG approach may be suboptimal when monotonicity does not hold. This is not a criticism of the approach (our correlation methods similarly break down in this non-monotone scenario), but simply a recognition that many ordinal methods do not work well if relationships are not monotone.
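Throughout these examples we selected the ‘parsimonious’ tuning value by the smallest-s-within-one-standard-error rule described in Section 2.1. For reference, that rule can be sketched as follows (in Python; the grid and MSE values below are invented for illustration). Recall that smaller s means a tighter constraint and hence more fusing.

```python
import numpy as np

def one_se_rule(s_grid, cv_mse, cv_se):
    """One-standard-error rule: among tuning values whose CV MSE is
    within one SE of the minimum, pick the most heavily penalized
    (here, the smallest s)."""
    best = np.argmin(cv_mse)
    threshold = cv_mse[best] + cv_se[best]
    eligible = np.where(cv_mse <= threshold)[0]
    return s_grid[eligible[np.argmin(s_grid[eligible])]]

# invented cross-validation summary over a grid of constraint values s
s = np.array([0.1, 0.3, 0.5, 0.7, 1.0])
mse = np.array([1.20, 1.05, 1.00, 1.02, 1.10])
se = np.array([0.05, 0.05, 0.05, 0.05, 0.05])
chosen = one_se_rule(s, mse, se)
# minimum MSE is at s = 0.5, but s = 0.3 is within one SE, so it is chosen
```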

3 Isotonic Regression

The TG penalization approaches remind us a little of isotonic regression (Barlow et al., 1972). An isotonic regression solution to the above problems would be to force monotonicity in the coefficients of the ordinal categories by merging categories where monotonicity does not hold. In Figure 2, for example, the OLS estimates of the coefficients for categories 4 and 5 are reversed in order. Isotonic regression would group these categories together and yield nearly identical estimates to those chosen using 5-fold cross-validation with TG’s adaptive approach. With that said, results using isotonic regression will generally differ from those using TG’s penalization approaches. For example, in Figure 1, isotonic regression would group together categories 1 and 2, because the order of the OLS estimates of their coefficients is reversed, but not categories 3 and 4, because the order of their OLS coefficient estimates agrees with the natural ordering. Additional fusing of categories can be achieved using larger penalization parameters with the TG approach, but this is not the case for standard isotonic regression approaches. Because the TG approach groups categories with similar coefficients regardless of the exact order of adjacent categories, and because of its greater flexibility, the TG approach seems preferable to isotonic regression.
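For readers unfamiliar with isotonic regression, the pool-adjacent-violators algorithm (PAVA) below is a minimal Python sketch of the merging behavior described here: adjacent categories whose estimates violate the ordering are pooled, which fuses their fitted values. The example input loosely mimics the Figure 2 setting, with the order of the last two category means reversed.

```python
import numpy as np

def pava(y, w=None):
    """Pool-adjacent-violators: weighted least-squares fit that is
    non-decreasing in the category index; violating adjacent blocks
    are merged (their fitted values fused)."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    levels, weights = list(y), list(w)
    idx = [[i] for i in range(len(y))]      # categories covered by each block
    i = 0
    while i < len(levels) - 1:
        if levels[i] > levels[i + 1]:       # violation: merge the two blocks
            tot = weights[i] + weights[i + 1]
            levels[i] = (weights[i] * levels[i]
                         + weights[i + 1] * levels[i + 1]) / tot
            weights[i] = tot
            idx[i] += idx.pop(i + 1)
            levels.pop(i + 1)
            weights.pop(i + 1)
            i = max(i - 1, 0)               # back up: a new violation may appear
        else:
            i += 1
    fitted = np.empty(len(y))
    for lev, block in zip(levels, idx):
        fitted[block] = lev
    return fitted

# OLS-type category means with the last two reversed in order
fit = pava([0.0, 2.4, 2.5, 2.7, 2.6])
# categories 4 and 5 are pooled at their average, 2.65
```

Unlike the TG penalty, there is no tuning parameter here: only order violations trigger fusion, which is exactly the limitation noted above.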

4 Continuous versus Categorical

In TG’s example data analysis in their section 3, we were struck by the manner in which the following two variables were included: the number of persons in the household (possible values 1, 2, 3, 4, 5, and 6 or more) and the monthly household net income, which was categorized into 12 ordinal categories. We found it notable that these variables were included


in the analysis as ordered categories rather than as numbers. Other than its largest value (6 or more), the number of persons in a household is numerical, and differences between values are sensible and meaningful. Household income as collected in this survey is ordinal, but is based on the categorization of a continuous variable. Furthermore, with 12 categories, many parameters are required to model this variable as an ordered categorical variable, even after applying TG’s approach to fuse categories. A more parsimonious approach would be to simply assign the numbers 1–12 to the categories, treat these numbers as continuous, and fit flexible regression models using splines, perhaps choosing the optimal number of knots with cross-validation. This roughly follows the line of thought described by TG in their section 3.2.1. We would be interested in seeing how results compare when treating these variables as continuous and modeling them using splines, and in comparing the number of parameters chosen in the final model. We believe it would be substantially fewer than the number obtained treating these variables as categorical. In some of our work, we have seen that the benefits (in terms of power) of treating ordinal variables as ordinal rather than continuous decrease when there are large numbers of categories.
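The alternative we have in mind can be sketched as follows: assign scores 1–12 to the income categories and fit a cubic spline by least squares. In this Python sketch the data, knot locations, and the log-shaped relationship are all invented for illustration; with two interior knots the fit uses 6 parameters rather than up to 12 dummy coefficients.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, plus (x - k)_+^3
    for each interior knot k."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
n = 400
income_cat = rng.integers(1, 13, size=n).astype(float)  # scores 1..12
y = np.log(income_cat) + 0.3 * rng.standard_normal(n)   # smooth, invented truth

knots = [4.0, 8.0]                                      # 2 interior knots
B = spline_basis(income_cat, knots)                     # 6 columns
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
```

Knot number and placement could of course be chosen by cross-validation, as suggested above.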


With that said, a somewhat crazy idea – but perhaps a natural extension to the ideas in TG – would be to see what happens if we treat a truly continuous predictor (i.e., a variable where the number of distinct values equals the sample size) as ordinal and then apply TG’s penalization methods. In essence, this approach could be thought of as an empirical attempt to find an optimal categorization of the continuous variable. We tried it, and our results are described below. In a first simulation, we generated Z from a N(0,1) distribution and then generated Y given Z from a normal distribution with mean 1 + 5Z and variance 1. Our sample size was 200. For simplicity, in these simulations we have only a single predictor. Figure 5 shows the coefficient paths treating Z as ordered categorical, starting with the OLS estimates (199 dummy variables), and fusing categories together as a function of the amount of penalization. These plots are not very helpful other than to show the large number of potential categories and their gradual fusion. The models favored by cross-validation have many fewer categories than the OLS estimates and are located at the lower end of the s/smax fraction. Figure 6 shows the coefficient estimates chosen after cross-validation as dummy variables. Each point represents the estimated difference between the expectation of Y given Z and that for the reference, the smallest value of Z. The linear relationship between Z and Y is clear, and the slope is approximately 5, which is the truth – although it took ≥ 54 parameters to estimate this relationship. Figure 7 shows estimates chosen using TG’s approach after cross-validation on another simulation using continuous data as before, except now Y ~ N(1 + Z^3, 1). Again, the cubic nature of the relationship is captured using TG’s approach.
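To make the set-up of this experiment concrete, the sketch below (Python; the actual fitting used the R code described earlier and is not reproduced) builds the design matrix for treating continuous Z as ordinal using split coding: column k indicates that Z exceeds its kth order statistic, so each coefficient is the jump between adjacent ‘categories’, and the adjacent-difference penalty shrinks those jumps toward zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.standard_normal(n)
y = 1 + 5 * z + rng.standard_normal(n)

# one ordered "category" per distinct value of z
ranks = np.empty(n, dtype=int)
ranks[np.argsort(z)] = np.arange(n)

# split coding: column k indicates rank >= k+1; coefficient k is the
# jump between adjacent categories, which the fusion penalty shrinks
D = (ranks[:, None] >= np.arange(1, n)[None, :]).astype(float)
# with one observation per category, the unpenalized fit (intercept plus
# 199 jumps) is saturated; only the fusion penalty makes this sensible
```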


Although interesting, we doubt such an approach is useful in practice, as it results in models with far too many parameters; relationships could have been described using many fewer parameters if the predictor had been analyzed as continuous. Of course, we could have generated continuous data from distributions with step functions and then TG’s approach would likely have estimated the step function better than splines; however, such data are rarely seen in practice.


Perhaps the bigger point is this: What do we believe is the underlying (maybe latent) nature of our data? Step functions, in which case grouping categories together is desirable and the methods of TG are suitable? Or smooth curves, in which case it might be better to treat the data as continuous and use flexible splines to model them? We believe that as the number of categories of the ordinal predictor variable increases, it becomes more reasonable simply to treat ordinal data as continuous, as generally we believe in smooth relationships.

5 Assumptions and Interpretation


Although most of our discussion has focused on ordinal predictors, we were fascinated to see the interesting ways in which penalization could be used to drive a model. For example, TG suggest that instead of assuming a proportional odds model, one can use penalization on multinomial logistic regression models to let the data dictate what model should be used (Section 4 of TG). If the data support a proportional odds model, then coefficients for covariates across outcome categories may fuse together. However, if proportional odds does not hold, a better model will emerge. Similarly, instead of fitting random effects models, one could assign a fixed effect to each of the grouping categories and then use penalization to essentially cluster categories (Section 5 of TG). These are very interesting ideas. With that said, there are certainly a few drawbacks.


First, we are often willing to make assumptions that may not lead to the optimal fit to the data, but that are sufficient for our purposes and lead to improved interpretation. The proportional odds assumption is certainly one of these, and absent substantial violations, interpretation with a common odds ratio is easier than having some covariates with coefficients that vary by categorical outcome level and some that do not, or that vary across different levels. Often we are willing to sacrifice optimality of fit for ease of interpretation. With regard to random effects: one of TG’s motivations for the penalization methods – that random effects assume different effects for each group/individual and that one may not want to make this assumption – is not very strong. We believe in random effects – that, for example, the impact of distinct groups/individuals is slightly different for each of them. We can see the benefits of clustering individuals into groups; we can also see random effects methods being criticized for making strong distributional assumptions on the random effects. But we believe it is natural to assume that random effects are not identical across any units.

6 Inference

Finally, although beyond the scope of TG’s review manuscript, we are left wondering how to make inference after performing these penalization approaches. For example, if we wanted to construct 95% confidence intervals for the coefficients of the categories of a penalized ordinal predictor, how would one do this? Surely, one cannot just use estimates based on the optimal tuning parameter and ignore the model selection procedure. Similarly, if we wanted to test the null hypothesis of no association conditional on the other variables in the model, how would this be done after penalization? Likelihood ratio tests no longer seem valid.


Although we appreciate the benefit of these methods for creating prediction models, in much of our collaborative work there is also interest in these inferential quantities.

7 Conclusion


Although we have highlighted some perceived limitations and areas of potential disagreement, our overall impression of the methods of TG is favorable. We agree that the group lasso approaches are reasonable when dealing with categorical predictors, as an entire categorical variable should be considered (and potentially removed) as a whole. We also agree that with ordinal data it makes sense to consider penalization based on adjacent categories to promote fusing coefficients/categories. A nice feature of TG’s approaches is that they simultaneously remove entire categorical variables and fuse categories within other variables. In our simple simulations, the approaches seem to do what they advertise: categories are generally grouped as one would expect based on the adjacent-category penalization. As noted above, there are some settings where TG’s methods for dealing with ordinal predictors will likely be suboptimal. These include situations where the relationship between the outcome and an ordinal predictor is not monotonic. With large numbers of categories or linear associations, we believe treating the ordinal predictor as a categorical variable will generally be inferior to simply assigning numbers to the categories and treating it as a continuous variable modeled with splines. More generally, we are often willing to sacrifice optimal model fit for a more interpretable model, which may favor making reasonable modeling assumptions rather than letting the data drive the choice through strategic penalization. And it would be helpful to have additional guidance on creating confidence intervals and testing hypotheses of conditional independence after regularization.


These potential difficulties/limitations noted, we again congratulate Drs. Tutz and Gertheiss on their interesting article. We look forward to their rejoinder and thank the editor for inviting us to be discussants.

Acknowledgments

This work was funded in part by U.S. National Institutes of Health grant number R01 AI093234.

References


Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD. Statistical Inference under Order Restrictions. New York: Wiley; 1972.

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. New York: Springer; 2009. p. 244.

Li C, Shepherd BE. Test of association between two ordinal variables while adjusting for covariates. Journal of the American Statistical Association. 2010; 105:612–620.

Li C, Shepherd BE. A new residual for ordinal outcomes. Biometrika. 2012; 99:473–480.

Liu Q, Shepherd BE, Li C. Covariate-adjusted Spearman’s rank correlation with probability-scale residuals. Submitted, 2016.

Shepherd BE, Li C, Liu Q. Probability-scale residuals for continuous, discrete, and censored data. Submitted, 2016.


Gertheiss J, Tutz G. Sparse modeling of categorical explanatory variables. Annals of Applied Statistics. 2010; 4:2150–2180.


Figure 1. Coefficient paths for the simulation generated using monotone steps.

Figure 2. Coefficient paths for the simulation generated under a monotone non-linear association.

Figure 3. Coefficient paths for the simulation generated under a linear association.

Figure 4. Coefficient paths for the simulation generated under a non-monotone association.

Figure 5. Coefficient paths for the simulation with continuous linear X.

Figure 6. Selected coefficients for the simulation with continuous linear X.

Figure 7. Selected coefficients for the simulation with continuous cubic X.
