INVITED REVIEW SERIES: MODERN STATISTICAL METHODS IN RESPIRATORY MEDICINE
SERIES EDITORS: RORY WOLFE AND MICHAEL ABRAMSON

Interpretation of commonly used statistical regression models

JESSICA KASZA1,2 AND RORY WOLFE1,2

1Department of Epidemiology and Preventive Medicine, Monash University, and 2Victorian Centre for Biostatistics (ViCBiostat), Melbourne, Victoria, Australia

ABSTRACT

This article reviews some regression models commonly used in respiratory health applications: simple linear regression, multiple linear regression, logistic regression and ordinal logistic regression. The focus is on the interpretation of the regression coefficients of each model, which is illustrated by applying these models to a respiratory health research study.

Key words: linear model, logistic model, ordinal logistic model, regression analysis.

Abbreviations: CI, confidence interval; FEV1, forced expiratory volume in 1 s; FEV1%, percentage of predicted forced expiratory volume in 1 s.

Correspondence: Jessica Kasza, Department of Epidemiology and Preventive Medicine, The Alfred Centre, Monash University, 99 Commercial Road, Melbourne, Vic. 3004, Australia. Email: [email protected]

The Authors: Dr. Jessica Kasza, BSc, PhD, a research fellow in biostatistics at the Department of Epidemiology and Preventive Medicine at Monash University, has research interests that include healthcare provider comparison and the estimation of causal effects. Professor Rory Wolfe, BSc, PhD, Professor of Biostatistics at the School of Public Health and Preventive Medicine, has broad research interests in biostatistics.

Received 16 September 2013; accepted 1 October 2013.

© 2013 The Authors. Respirology © 2013 Asian Pacific Society of Respirology

INTRODUCTION

Often, we seek to understand the association between a measure of respiratory health and one or more patient characteristics, be they socio-demographic, genetic, environmental, molecular, clinical/psychological or behavioural. For example, a researcher may wish to investigate how lung function, as measured by forced expiratory volume in 1 s (FEV1), is related to occupational exposure to biological dust, while allowing for its well-known relationships with the height, age and sex of individuals. Regression models provide tools for explaining how an outcome variable (e.g. a measure of disease status) varies for different values of explanatory variables. They provide an estimate of the average relationship between the outcome and explanatory variables in the population from which the research study's participants have been sampled. Such models can also be used to predict future outcomes for new individuals based on their set of explanatory variable values. The models do not explicitly differentiate which of the explanatory variables are causes of the outcome, which are effects of the outcome, and which are non-causally related to the outcome, that is, associated with the outcome only because of mutual relationships with some other causative factor. These aspects are dealt with elsewhere in this series.1,2

Different types of outcome variables require different forms of regression model. FEV1 can be thought of as a continuous variable, potentially taking any value within a sensible range, and as such, linear regression is appropriate. Respiratory symptoms, for example dyspnoea, wheeze and cough, are often assessed as binary: either the patient has the symptom or does not. For binary outcomes, logistic regression models are appropriate. An outcome may fall into one of a number of ordered categories (e.g. lung cancer stages), and for this type of outcome, ordinal logistic regression (also known as a proportional odds model, although this term can be confused with the proportional hazards model used in survival analysis3) provides a tailor-made method of analysis. We will discuss each of these forms of regression in turn, focussing on the interpretation of the models.
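To make the ideas of an estimated average relationship and of prediction for a new individual concrete, the following is a minimal Python sketch of a simple linear regression fitted by ordinary least squares. The heights and FEV1 values are made up purely for illustration and are not taken from the study discussed in this article; the function names `fit_simple_linear_regression` and `predict` are likewise hypothetical.

```python
# Hypothetical (made-up) data: height in cm and FEV1 in mL for six people.
# These values are NOT from the study described in this article.
heights = [155.0, 160.0, 165.0, 170.0, 175.0, 180.0]
fev1 = [2400.0, 2650.0, 2950.0, 3300.0, 3500.0, 3850.0]


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_new):
    """Estimated average outcome for an individual with explanatory value x_new."""
    return intercept + slope * x_new


intercept, slope = fit_simple_linear_regression(heights, fev1)

# The slope is the estimated average difference in FEV1 (mL) between
# individuals whose heights differ by 1 cm.
print(f"Slope: {slope:.1f} mL per cm")
print(f"Intercept: {intercept:.1f} mL")
# Prediction for a new individual (172 cm is an arbitrary illustrative height).
print(f"Predicted FEV1 at 172 cm: {predict(intercept, slope, 172.0):.0f} mL")
```

The sketch covers only the continuous-outcome case. For binary and ordered categorical outcomes the same linear combination of explanatory variables is used, but it is related to the log-odds of the outcome rather than to its mean.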

Example

To illustrate these regression models, we consider data from a study on the relationship between occupational exposure to biological dust and chronic obstructive pulmonary disease.4 This was a cross-sectional study of risk factors for chronic obstructive pulmonary disease in individuals aged between 45 and 70 years, living in Melbourne, Victoria, Australia, randomly selected from electoral rolls. The importance of random sampling in obtaining a sample of research participants from whom we can generalize to the population of interest is discussed in the introduction to this series.1 We consider data from 1212

Respirology (2014) 19, 14–21. doi: 10.1111/resp.12221

[Figure: FEV1 (mL) on the vertical axis]

Table 1 Coefficients, standard errors (SE), 95% confidence intervals (CI) and P-value for the simple linear regression of forced expiratory volume in 1 s (mL) on height (cm)

Variable | Coefficient | SE | 95% CI | P-value
Height (cm) | 59.3 | 1.9 | 55.6 to 63.0 |
Intercept | −6829.4 | | |