Visualization Viewpoints

Editor: Theresa-Marie Rhyne

Watch Out for Superman: First Visualize, Then Analyze

Marcin Kozak, University of Information Technology and Management in Rzeszow

An old truth: data analysis without data visualization is no data analysis.1–4 This truth is already a cliché, but mainly among professional applied statisticians who are aware of the surprises you can meet in data. Unfortunately, practical data analysis still insufficiently exploits visualization methods, even though they can be extremely powerful and sometimes the only tool to show something. What is this something? You never know; it could be a phenomenon hidden in the data, awaiting an analyst’s mistake; an outlier or a bunch of outliers; a strange, interesting, or undesirable pattern; or something else calling for attention.

However, finding the appropriate display can require time, practice, and experience. To illustrate how the failure to properly visualize data can lead to a serious misinterpretation of the phenomena underlying the data, I’ve developed an artificial dataset. The related analysis requires some statistical skills, but that’s not my point. Even people with only a basic knowledge of statistics will immediately understand the results. Even more important, they’ll learn (and likely remember) that data visualization should come before data analysis. To make the point even more strongly, I asked Superman to help me out, and he agreed!


Two Previous Arguments for Data Visualization

Figure 1. Francis Anscombe’s datasets.6 Each graph has the same fitted regression model (y = 3 + 0.5x), determination coefficient (R² = 66.7%), and p-value for the slope (p = 0.002). However, only the data in the top-left graph are suitable for linear regression. Applying linear regression to the data in the other graphs won’t work (although the data in the top-right graph are suitable for polynomial regression). Panels: y1 against x1 (top left), y2 against x2 (top right), y3 against x3 (bottom left), and y4 against x4 (bottom right).

May/June 2012

David Farnsworth showed that when fitting a linear model, you should always graphically check its residuals.5 Otherwise, you could find a fish there! (Or anything else, of course.) Francis Anscombe suggested you shouldn’t fit a linear-regression model before graphing the data; otherwise, you can get nonsensical results.6 He presented four datasets, each consisting of 11 observations but representing rather different relationships between the dependent and independent variables. Nevertheless, each dataset yields the same fitted simple regression model. Figure 1 shows the datasets and the consequences of fitting those models without prior graphing. Anscombe’s example makes his point perfectly. Next, I show how I’ve adapted his idea.
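Anscombe’s point is easy to check numerically. The following is a minimal sketch in Python (the article itself reports output from R, so the language choice and the helper name fit_line are assumptions made for illustration). It uses the standard published values of Anscombe’s four datasets and fits a simple least-squares line to each one.

```python
# Anscombe's quartet (standard published values; datasets 1-3 share the same x).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared). The helper name is illustrative.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# One fit per dataset, keyed by dataset number.
FITS = {k: fit_line(x, y) for k, (x, y) in QUARTET.items()}

for k, (intercept, slope, r_squared) in FITS.items():
    print(f"dataset {k}: y = {intercept:.2f} + {slope:.2f}x, R^2 = {r_squared:.1%}")
```

All four fits come out essentially the same, even though only a graph reveals how different the four datasets are.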

Published by the IEEE Computer Society

0272-1716/12/$31.00 © 2012 IEEE

A Visit from Superman

The dataset comprises 100 observations of the quantitative variables y and x plus the qualitative variable group. (The dataset is in the supplementary file Superman.txt at http://doi.ieeecomputersociety.org/10.1109/MCG.2012.46.) The analysis aims to study how x influences y given group, because we can’t assume that the relationship is the same in both groups being analyzed. This might be a typical general linear-model situation, but we must check it.

Table 1 summarizes the data. Of course, not too much follows from that. We see differences in the estimated means of y and x between the two groups. However, we can’t yet determine whether they’re significant; this calls for a more advanced statistical analysis. We also need to analyze the relationship of y to x. But isn’t it high time to graph the data? Actually, we should have done that before making the summaries.

Table 1. A summary of the analyzed dataset.

Group   Mean of y   Standard deviation of y   Mean of x   Standard deviation of x
A       4.20        2.28                      0.94        1.00
B       5.05        2.42                      1.31        0.79

In Figure 2, we can immediately see that the relationship in both groups resembles an exponential association of y against x. This simple graph offers important information. However, if we don’t graph the data but simply employ linear modeling (as unfortunately far too many applied researchers do), we derive the model in Figure 3.

Figure 2. The association between y and x conditioned on the qualitative variable group. The relationship in groups A and B resembles an exponential association of y against x. (Panels A and B; x runs from −1 to 2 and y from 0 to 8.)

Figure 3. Forcing a linear association. Despite statistical significance, the relationships don’t look fully correct.

Okay, so we now know the relationship could be exponential. Let’s fit a model of y against e^x conditioned on group (see Table 2, which shows shortened, simplified output from the R language; www.r-project.org). High determination, significant coefficients, and nice graphs for residual analysis (not shown): fantastic!

However, the two slopes for e^x estimated in the previous model (0.99 and 1.03) seem quite similar. An F-test comparing the previous model with the model with a common slope shows that we don’t need two different slopes: a common slope will suffice (p = 0.143). Table 3 shows the revised model. Again, it looks very good, but there’s a hint that we don’t need the main effect of group. The model without it (that is, with only the slope for y | e^x) wasn’t worse (p = 0.116), yielding Table 4. In this model’s residuals (not shown), some patterns are visible, but given the high determination coefficient (R² = 98%), perhaps the model isn’t that bad?

But did we really do everything we should? We played with statistical models: fitted them, checked residuals, tested them, and compared them with simpler models. But we haven’t graphed the data of y against e^x. We’ve only graphed the data of y against x, which basically means we didn’t graph the model we fitted. Why not give it a try?
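The model comparison described above (separate slopes for e^x in each group versus one common slope) is a nested-model F-test. The sketch below illustrates the mechanics in Python rather than in R, and it runs on synthetic data generated inside the script, not on Superman.txt. The generating values (intercepts 0.6 and 0.3, slope 1.0, noise 0.3), the seed, and the helper name ols_rss are all assumptions made for illustration.

```python
import math
import random


def ols_rss(rows, y):
    """Least-squares fit via the normal equations; returns (beta, RSS).

    `rows` is the design matrix as a list of lists. Illustrative helper only.
    """
    p = len(rows[0])
    # Augmented normal-equation system [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(p):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [vk - factor * vc for vk, vc in zip(a[k], a[col])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, rss


# Synthetic stand-in data (assumption): 50 observations per group,
# y = intercept_g + 1.0 * exp(x) + noise.
random.seed(1)
data = []
for i in range(100):
    group = "A" if i < 50 else "B"
    x = random.uniform(-1.0, 2.0)
    y = (0.6 if group == "A" else 0.3) + 1.0 * math.exp(x) + random.gauss(0.0, 0.3)
    data.append((group, x, y))

y_obs = [y for _, _, y in data]
is_b = [1.0 if g == "B" else 0.0 for g, _, _ in data]
ex = [math.exp(x) for _, x, _ in data]

# Full model: intercept, groupB shift, and a separate exp(x) slope per group.
full = [[1.0, b, (1.0 - b) * e, b * e] for b, e in zip(is_b, ex)]
# Reduced model: intercept, groupB shift, and one common exp(x) slope.
common = [[1.0, b, e] for b, e in zip(is_b, ex)]

beta_full, rss_full = ols_rss(full, y_obs)
beta_common, rss_common = ols_rss(common, y_obs)

# F statistic for dropping one parameter (4 versus 3 coefficients).
df_full = len(data) - 4
f_stat = (rss_common - rss_full) / (rss_full / df_full)

print("slopes per group:", round(beta_full[2], 3), round(beta_full[3], 3))
print("F statistic (1 and", df_full, "df):", round(f_stat, 3))
```

Turning the F statistic into a p-value requires the F distribution, for example the survival function `scipy.stats.f.sf` in SciPy. The sketch leaves that step out so that it runs with the standard library alone.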

Table 2. Fitting a model of y against e^x conditioned on the qualitative variable group.

Effect          Estimated coefficient   Standard error   t        Pr(>|t|)
Intercept       0.62890                 0.09465          6.645    1.83e-9
groupB          -0.30424                0.14664          -2.075   0.0407
groupA:exp(x)   0.98784                 0.02224          44.426
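Each t value in Table 2 is the estimated coefficient divided by its standard error. The short Python sketch below checks this for the three rows visible in the table. The row values are copied from Table 2, and the helper name t_value is an assumption made for illustration.

```python
# (effect, estimate, standard error, reported t), copied from Table 2.
TABLE2_ROWS = [
    ("Intercept", 0.62890, 0.09465, 6.645),
    ("groupB", -0.30424, 0.14664, -2.075),
    ("groupA:exp(x)", 0.98784, 0.02224, 44.426),
]


def t_value(estimate, std_error):
    """t statistic for testing that a coefficient equals zero."""
    return estimate / std_error


for effect, estimate, std_error, reported in TABLE2_ROWS:
    computed = t_value(estimate, std_error)
    print(f"{effect:15s} computed t = {computed:8.3f}   reported t = {reported}")
```

The computed values agree with the reported ones up to the rounding of the printed estimates and standard errors.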
