Regression and Other Stories - Notes and Excerpts

Categories: R, statistics

Author: Vishal Bakshi

Published: May 17, 2023

In this blog post, I will work through the textbook Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari. I’ll be posting excerpts (sometimes verbatim), notes (paraphrased excerpts) and my thoughts about the content (and any other related reading that I come across while understanding the textbook topics).

Starting in Chapter 2, I recreate the plots shown in the textbook, first by following the code provided on the supplemental website and then by recreating them using tidyverse and ggplot. This process gives me practice reading and understanding other people’s code, as well as practice applying tidyverse and ggplot syntax and principles to produce a similar result with different code.

Before I get into the reading, I’ll note that I have not posted my solutions to the exercises, to honor the request of one of the authors, Aki Vehtari:

A tweet thread of Aki Vehtari explaining why they do not want solutions to the exercises published.

However, to learn by writing, I will write about the process of doing the exercises, the results I got, and what I learned from it all.

Installation

The data for this textbook is available in the rosdata package.
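The package isn’t on CRAN as far as I know; the snippet below is how I installed it, with the GitHub repository name and subdirectory written from memory, so verify them against the ROS-Examples README before running it:

# install the rosdata package from the book's GitHub repository
# (repo and subdir assumed from memory -- check the ROS-Examples README)
install.packages("remotes")
remotes::install_github("avehtari/ROS-Examples", subdir = "rpackage")

# the datasets then load with data()
library(rosdata)
data(hibbs)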

Here are a couple of errors I had to resolve, documented for future reference.

If you get the following error when installing a package:

Error in file(filename, "r", encoding = encoding) : 
  cannot open the connection
Calls: source -> file
In addition: Warning message:
In file(filename, "r", encoding = encoding) :
  cannot open file 'renv/activate.R': No such file or directory
Execution halted

remove the line source("renv/activate.R") from both the project directory and home directory .Rprofile files.

If you get the following error when running quarto render:

there is no package called …

the issue might be that Quarto is not pointing to the right .libPaths(). In your home directory’s .Rprofile, add .libPaths(c("path/to/lib", "path/to/lib", …)); the added paths will then show up in quarto check, which may resolve the issue.
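For example, a minimal .Rprofile entry might look like the sketch below; the paths are placeholders for your actual library locations:

# in ~/.Rprofile: point R (and therefore Quarto) at the correct package libraries.
# .libPaths() takes a single character vector of paths.
.libPaths(c("/path/to/user/library", "/path/to/project/library"))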

Preface

  • After reading this book and working through the exercises, you should be able to simulate regression models on the computer and build, critically evaluate, and use them for applied problems.
  • The other special feature of this book is its broad coverage: basics of statistics and measurement, linear regression, multiple regression, Bayesian inference, logistic regression and generalized linear models, extrapolation from sample to population, and causal inference.
  • After completing Part 1 you should have access to the tools of mathematics, statistics, and computing that will allow you to work with regression models.
  • Part 1 Goals
    • displaying and exploring data
    • computing and graphing linear relations
    • understanding basic probability distributions and statistical inferences
    • simulation of random processes to represent inferential and forecast uncertainty
  • After completing Part 2 you should be able to build, fit, understand, use, and assess the fit of linear regression models.
  • After completing Part 3, you should be able to similarly work with logistic regression and other generalized linear models.
  • Part 4 covers data collection and extrapolation from sample to population.
  • Part 5 covers causal inference.
  • Part 6 introduces more advanced regression models.

What you should be able to do after reading and working through this book

  • Part 1: Review key tools and concepts in mathematics, statistics and computing

    • Chapter 1: Have a sense of the goals and challenges of regression

    • Chapter 2: Explore data and be aware of issues of measurement and adjustment

    • Chapter 3: Graph a straight line and know some basic mathematical tools and probability distributions

    • Chapter 4: Understand statistical estimation and uncertainty assessment, along with the problems of hypothesis testing in applied statistics

    • Chapter 5: Simulate probability models and uncertainty about inferences and predictions

  • Part 2: Build linear regression models, use them in real problems, and evaluate their assumptions and fit to data

    • Chapter 6: Distinguish between descriptive and causal interpretations of regression, understanding these in historical context

    • Chapter 7: Understand and work with simple linear regression with one predictor

    • Chapter 8: Gain a conceptual understanding of least squares fitting and be able to perform these fits on the computer

    • Chapter 9: Perform and understand probabilistic and simple Bayesian information aggregation, and be introduced to prior distributions and Bayesian inference

    • Chapter 10: Build, fit, and understand linear models with multiple predictors

    • Chapter 11: Understand the relative importance of different assumptions of regression models and be able to check models and evaluate their fit to data

    • Chapter 12: Apply linear regression more effectively by transforming and combining predictors

  • Part 3: Build and work with logistic regression and generalized linear models

    • Chapter 13: Fit, understand, and display logistic regression models for binary data

    • Chapter 14: Build, understand and evaluate logistic regressions with interactions and other complexities

    • Chapter 15: Fit, understand, and display generalized linear models, including the Poisson and negative binomial regression, ordered logistic regression, and other models

  • Part 4: Design studies and use data more effectively in applied settings

    • Chapter 16: Use probability theory and simulation to guide data-collection decisions, without falling into the trap of demanding unrealistic levels of certainty

    • Chapter 17: Use poststratification to generalize from sample to population, and use regression models to impute missing data

  • Part 5: Implement and understand basic statistical designs and analyses for causal inference

    • Chapter 18: Understand assumptions underlying causal inference with a focus on randomized experiments

    • Chapter 19: Perform causal inference in simple settings, using regression to estimate treatment effects and interactions

    • Chapter 20: Understand the challenges of causal inference from observational data and statistical tools for adjusting for differences between treatment and control groups

    • Chapter 21: Understand the assumptions underlying more advanced methods that use auxiliary variables or particular data structures to identify causal effects, and be able to fit these models to data

  • Part 6: Become aware of more advanced regression models

    • Chapter 22: Get a sense of the directions in which linear and generalized linear models can be extended to attack various classes of applied problems
  • Appendixes

    • Appendix A: Get started in the statistical software R, with a focus on data manipulation, statistical graphics, and fitting using regressions

    • Appendix B: Become aware of some important ideas in regression workflow

After working through the book, you should be able to fit, graph, understand, and evaluate linear and generalized linear models and use these model fits to make predictions and inferences about quantities of interest, including causal effects of treatments and exposures.

Part 1: Fundamentals

Chapter 1: Overview

Alternate title: Prediction as a unifying theme in statistics and causal inference

1.1 The three challenges of statistics

  1. Generalizing from sample to population
  2. Generalizing from treatment to control group
  3. Generalizing from observed measurements to the underlying constructs of interest

Prediction: for new people or new items that are not in the sample, future outcomes under different potentially assigned treatments, and underlying constructs of interest, if they could be measured exactly.

Key skills you should learn from this book:

  • Understanding regression models: mathematical models for predicting an outcome variable from a set of predictors, starting with straight-line fits and moving to various nonlinear generalizations

  • Constructing regression models: with many options involving the choice of what variables to include and how to transform and constrain them

  • Fitting regression models to data

  • Displaying and interpreting the results

Inference: using mathematical models to make general claims from particular data.

1.2 Why learn regression?

Regression: a method that allows researchers to summarize how predictions or average values of an outcome vary across individuals defined by a set of predictors.

# load the packages needed below: rosdata for the dataset, rstanarm for stan_glm
library(rosdata)
library(rstanarm)

# load the data
data(hibbs)

# make a scatterplot
plot(
  hibbs$growth, 
  hibbs$vote, 
  xlab = "Average recent growth in personal income",
  ylab = "Incumbent party's vote share")

# estimate the regression: y = a + bx + error
M1 <- stan_glm(vote ~ growth, data=hibbs)
# make a scatter plot
plot(
  hibbs$growth, 
  hibbs$vote, 
  xlab = "Average recent growth in personal income",
  ylab = "Incumbent party's vote share")

# add fitted line to the graph
abline(coef(M1), col = 'gray')

# display the fitted model
print(M1)
stan_glm
 family:       gaussian [identity]
 formula:      vote ~ growth
 observations: 16
 predictors:   2
------
            Median MAD_SD
(Intercept) 46.3    1.7  
growth       3.0    0.7  

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.9    0.7   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

y = 46.3 + 3.0x

MAD_SD: the median absolute deviation of the posterior simulations, scaled so it is comparable to a standard deviation; a robust summary of uncertainty.

sigma: the scale of the variation in the data not explained by the regression model (the scatter of points above and below the regression line). The linear model predicts vote share roughly to an accuracy of 3.9 percentage points.
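As a quick sanity check (a sketch that assumes M1 and hibbs from the code above are still in the workspace), the raw scatter of the data around the fitted line should land near the reported sigma:

# compute the residual standard deviation by hand and compare it to sigma
a <- coef(M1)[1]  # intercept, roughly 46.3
b <- coef(M1)[2]  # slope, roughly 3.0
resids <- hibbs$vote - (a + b * hibbs$growth)
sd(resids)  # should be close to the reported sigma of about 3.9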

Some of the most important uses of regression:

  • Prediction: Modeling existing observations or forecasting new data.

  • Exploring associations: Summarizing how well one variable, or set of variables, predicts the outcome. One can use a model to explore associations, stratifications, or structural relationships between variables.

  • Extrapolation: Adjusting for known differences between the sample (observed data) and a population of interest.

  • Causal inference: Estimating treatment effects. A key challenge of causal inference is ensuring that treatment and control groups are similar, on average, before exposure to treatment, or else adjusting for differences between these groups.

1.3 Some examples of regression

Estimating public opinion from an opt-in internet survey

  • A characteristic problem of big data: a very large sample, relatively inexpensive to collect, but not immediately representative of the larger population

A randomized experiment on the effect of an educational television program

Estimating the effects of United Nations peacekeeping, using pre-treatment variables to adjust for differences between treatment and control groups

  • Selection bias: perhaps peacekeepers chose the easy cases, which would explain the difference in outcomes. The treatment (peacekeeping) was not randomly assigned. This was an observational study rather than an experiment, so we must do our best to adjust for pre-treatment differences between the treatment and control groups.

  • Censoring: certain ranges of data cannot be observed (e.g. countries where civil war had not yet returned by the time data collection ended)

  • After adjusting for pre-treatment “badness,” it turns out that peacekeeping was performed in tougher conditions, on average. As a result, the adjustment increases the estimated beneficial effect of peacekeeping, at least during the period of the study.

Estimating the effects of gun laws, and the difficulty of inference using regression with a large number of predictors

  • In this sort of regression with 50 data points and 30 predictors and no prior information to guide the inference, the coefficient estimates will be hopelessly noisy and compromised by dependence among predictors.

  • The treatments are observational and not externally applied. There is no reason to believe that the big differences in gun-related deaths between states are mostly attributable to particular policies.

Comparing the peacekeeping and gun-control studies

  • In both cases, policy conclusions have been drawn from observational data, using regression modeling to adjust for differences between treatment and control groups.

  • It is more practical to perform adjustments when there is a single goal (peacekeeping study)

    • particular concern that the UN might be more likely to step in when the situation on the ground was not so bad

    • the data analysis found the opposite

    • the measure of badness is constructed based on particular measured variables and so it is possible that there are important unmeasured characteristics that would cause adjustments to go the other way

  • The gun-control model adjusts for many potential causal variables at once

    • The effect of each law is estimated conditional on all others being held constant (not realistic—no particular reason for their effects to add up in a simple manner)

    • The comparison is between states, which vary in systematic ways (it is not at all clear that a simple model can hope to adjust for the relevant differences)

    • Regression analysis was taken naively to be able to control for variation and give valid causal inference from observational data

1.4 Challenges in building, understanding, and interpreting regressions

Two different ways in which regression is used for causal inference: estimating a relationship and adjusting for background variables.

Regression to estimate a relationship of interest

Randomization: a design in which people—or, more generally, experimental units—are randomly assigned to treatment or control groups.

There are various ways to attain approximate comparability of treatment and control groups, and to adjust for known or modeled differences between the groups.

We assume the comparability of the groups assigned to different treatments so that a regression analysis predicting the outcome given the treatment gives us a direct estimate of the causal effect.

It is always possible to estimate a linear model, even if it does not fit the data.

Interactions: treatment effects that vary as a function of other predictors in a model.

  • Example: the relation between cancer rate and radon is different for smokers and nonsmokers. The effect of radon is assumed to be linear but with an interaction with smoking.
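A minimal sketch of what such an interaction looks like in R model syntax, using simulated data (all variable names and numbers below are made up for illustration, not the book’s radon data):

# simulate data in which the slope on radon differs for smokers and nonsmokers
set.seed(1)
n <- 500
radon <- runif(n, 0, 10)
smoker <- rbinom(n, 1, 0.3)
cancer_rate <- 0.1 + 0.02 * radon + 0.3 * smoker + 0.05 * radon * smoker +
  rnorm(n, 0, 0.1)
fake <- data.frame(radon, smoker, cancer_rate)

# radon * smoker expands to radon + smoker + radon:smoker,
# so the radon slope is allowed to differ by smoking status
fit <- lm(cancer_rate ~ radon * smoker, data = fake)
print(fit)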

Regression to adjust for differences between treatment and control groups

In most real-world causal inference problems, there are systematic differences between experimental units that receive treatment and control. In such settings it is important to adjust for pre-treatment differences between the groups, and we can use regression to do this.

Adjusting for background variables is particularly important when there is imbalance so that the treated and control groups differ on key pre-treatment predictors.

The estimated treatment effect is necessarily model based.

Interpreting coefficients in a predictive model

A model fit to survey data: earnings = 11000 + 1500 * (height - 60) + error, with errors having mean 0 and a standard deviation of about 22000.

  • Prediction: useless for forecasting because errors are too large

  • Exploring an association: best fit for this example, as the estimated slope is positive, can lead to further research to study reasons taller people earn more than shorter people

  • Sampling inference: the regression coefficients can be interpreted directly to the extent that people in the survey are a representative sample of the population of interest

  • Causal inference: height is not a randomly assigned treatment. Tall and short people may differ in many other ways.
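To see why the height-and-earnings model is useless for forecasting individual earnings, here is a quick simulation sketch of the model as stated above (the height distribution is made up just for illustration):

# simulate earnings from the model: 11000 + 1500*(height - 60) + error
set.seed(1)
n <- 1000
height <- rnorm(n, 66, 4)
earnings <- 11000 + 1500 * (height - 60) + rnorm(n, 0, 22000)

# the systematic spread due to height is small relative to the error scale
sd(1500 * (height - 60))  # about 6000
sd(earnings)              # dominated by the error sd of 22000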

Building, interpreting and checking regression models

Statistical analysis cycles through four steps:

  1. Model building
  2. Model fitting
  3. Understanding model fits
  4. Criticism

Accept that we can learn from statistical analysis—we can generalize from sample to population, from treatment to control, and from observed measurements to underlying constructs of interest—even while these inferences can be flawed.

Overinterpretation of noisy data: the gun control study took existing variation among states and too eagerly attributed it to available factors.

No study is perfect.

We should recognize challenges in extrapolation and then work to adjust for them.

1.5 Classical and Bayesian Inference

Three concerns related to fitting models to data and using models to make predictions:

  1. what information is being used in the estimation process
  2. what assumptions are being made
  3. how estimates and predictions are interpreted, in a classical or Bayesian framework

Information

  • The starting point for any regression problem is data on an outcome variable y and one or more predictors x.

  • Information should be available on what data were observed at all

  • We typically have prior knowledge coming from sources other than the data at hand. Where local data are weak it would be foolish to draw conclusions without using prior knowledge

Assumptions

  • Three sorts of assumptions that are essential to any regression model of an outcome y given predictors x.

    • The functional form of the relation between x and y

    • Where the data comes from (strongest assumptions tend to be simple and easy to understand, weaker assumptions, being more general, can also be more complicated)

    • Real-world relevance of the measured data

      • The interpretation of a regression of y on x depends also on the relation between the measured x and the underlying predictors of interest, and on the relation between the measured y and the underlying outcomes of interest

Classical inference

The traditional approach to statistical analysis is based on summarizing the information in the data, not using prior information, but getting estimates and predictions that have well-understood statistical properties, low bias and low variance.

Unbiasedness: estimates should be true on average

Coverage: confidence intervals should cover the true parameter value 95% of the time
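A small simulation sketch of what the coverage property means, using a simple normal-mean example rather than a regression:

# repeatedly simulate data, compute a classical 95% confidence interval for the
# mean, and count how often the interval contains the true value
set.seed(1)
true_mean <- 5
covered <- replicate(10000, {
  y <- rnorm(20, mean = true_mean, sd = 2)
  ci <- t.test(y)$conf.int
  ci[1] < true_mean && true_mean < ci[2]
})
mean(covered)  # should be close to 0.95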

Conservatism: sometimes data are weak and we can’t make strong statements, but we’d like to be able to say, at least approximately, that our estimates are unbiased and our intervals have the advertised coverage.

In classical statistics there should be a clear and unambiguous (“objective”) path from data to inferences, which in turn should be checkable, at least in theory, based on their frequency properties.

Bayesian inference

Incorporates prior information into inferences, going beyond the goal of merely summarizing existing data. The analysis gives more reasonable results and can be used to make direct predictions about future outcomes and about the results of future experiments.

The prior distribution represents the arena over which any predictions will be evaluated.

We have a choice: classical inference, leading to pure summaries of data which can have limited value as predictions; or Bayesian inference, which in theory can yield valid predictions even with weak data, but relies on additional assumptions.

All Bayesian inferences are probabilistic and thus can be represented by random simulations. For this reason, whenever we want to summarize uncertainty in estimation beyond simple confidence intervals, and whenever we want to use regression models for predictions, we go Bayesian.
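A sketch of what working with those simulations looks like in rstanarm, assuming M1 is the election-forecasting fit from Section 1.2 above (the 2% growth value is just an illustrative input):

# each row of the simulation matrix is one posterior draw of the parameters
sims <- as.matrix(M1)
colnames(sims)  # "(Intercept)", "growth", "sigma"

# summarize uncertainty about the slope beyond a single interval
quantile(sims[, "growth"], c(0.025, 0.25, 0.5, 0.75, 0.975))

# probabilistic prediction for a hypothetical election with 2% income growth
pred <- posterior_predict(M1, newdata = data.frame(growth = 2))
mean(pred > 50)  # estimated probability the incumbent party wins the popular vote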

1.6 Computing least squares and Bayesian regression

In general, we recommend using Bayesian inference for regression

  • If prior information is available, you can use it

  • if not, Bayesian regression with weakly informative default priors still has the advantage of yielding stable estimates and producing simulations that enable you to express inferential and predictive uncertainty (estimates with uncertainties and probabilistic predictions or forecasts)

Bayesian regression in R:

fit <- stan_glm(y ~ x, data=mydata)

# stan_glm can run slow for large datasets, make it faster by running in optimizing mode
fit <- stan_glm(y ~ x, data=mydata, algorithm="optimizing")

least squares regression (classical statistics)

fit <- lm(y ~ x, data=mydata)

Bayesian and simulation approaches become more important when fitting regularized regression and multilevel models.

1.8 Exercises

It took me 8-9 days to complete the 10 exercises at the end of this chapter. A few observations:

  • Experiment design, even with simple paper helicopters, still requires time, effort and attention to detail. Even then, a well thought out and executed plan can still end up with uninspiring results. It is therefore the process of experimentation and not the result where I learned the most. It was important to continue experimentation (as it resulted in finding a high-performing copter) even after my initial approach didn’t seem to work as I had thought it would.

  • Finding good and bad examples of research is hard on your own! I relied on others’ existing work where they explicitly called out (and explained) good and bad research, sometimes in great detail, in order to answer exercises 1.7 and 1.8.

  • Think about the problem in real life! I don’t know if my answer to 1.9 is correct, but it helped to think about the physical phenomenon that was taking place (a helicopter falling through air, pulled by gravity) in order to come up with a better model fit than the example provided. I found that thinking of physical constraints of the real world problem helped narrow the theoretical model form.

  • Pick something you’re interested in and is worth your while! For exercise 1.10, the dataset I have chosen is the NFL Play-by-Play dataset, as I’m already working on a separate project where I’ll be practicing writing SQL queries to analyze the data. Play-by-play data going back to 1999 will provide me with ample opportunities to pursue different modeling choices. My interest in football will keep me dedicated through the tough times.

Chapter 2: Data and Measurement

In this book we will be:

  • fitting lines and curves to data.

  • making comparisons and predictions and assessing our uncertainties in the resulting inferences.

  • discussing the assumptions underlying regression models, methods for checking these assumptions, and directions for improving fitted models.

  • discussing the challenges of extrapolating from available data to make causal inferences and predictions for new data.

  • using computer simulations to summarize the uncertainties in our estimates and predictions.

2.1 Examining where data comes from

I’ll recreate the charts shown in the textbook by following the code provided in the website supplement to the book, using tidyverse where it makes sense:

# load the Human Development Index data
# start using DT to produce better table outputs
library(DT)
hdi <- read.table("../../../ROS_data/HDI/data/hdi.dat", header=TRUE)
DT::datatable(head(hdi))
# load income data via the votes data
library(foreign)
votes <- read.dta("../../../ROS_data/HDI/data/state vote and income, 68-00.dta")
DT::datatable(head(votes), options=list(scrollX=TRUE))
# preprocess
income2000 <- votes[votes[,"st_year"]==2000, "st_income"]

# it seems like the income dataset doesn't have DC
# whereas HDI does
# so we're placing an NA for DC in the income list
state.income <- c(income2000[1:8], NA, income2000[9:50])
state.abb.long <- c(state.abb[1:8], "DC", state.abb[9:50])
state.name.long <- c(state.name[1:8], "Washington, D.C.", state.name[9:50])
hdi.ordered <- rep(NA, 51)
can <- rep(NA, 51)

for (i in 1:51) {
  ok <- hdi[,"state"]==state.name.long[i]
  hdi.ordered[i] <- hdi[ok, "hdi"]
  can[i] <- hdi[ok, "canada.dist"]
}
no.dc <- state.abb.long != "DC"
# plot average state income and Human Development Index
par(mar=c(3, 3, 2.5, 1), mgp=c(1.5, .2, 0), tck=-0.01, pty='s')
plot(
  state.income, 
  hdi.ordered, 
  xlab="Average state income in 2020",
  ylab="Human Development Index",
  type="n")

text(state.income, hdi.ordered, state.abb.long)

The pattern between HDI numbers and average state income is strong and nonlinear. If anyone is interested in following up on this, we suggest looking into South Carolina and Kentucky, which are so close in average income and so far apart on HDI.

# plot rank of average state income and Human Development Index
par(mar=c(3, 3, 2.5, 1), mgp=c(1.5, 0.2, 0), tck=-0.01, pty='s')
plot(
  rank(state.income[no.dc]), 
  rank(hdi.ordered[no.dc]),
  xlab="Rank of average state income in 2000",
  ylab="Rank of Human Development Index", 
  type="n")

text(rank(state.income[no.dc]), rank(hdi.ordered[no.dc]), state.abb)

There is a high correlation between the ranking of HDI and state income.

I will redo the above data processing and plotting with tidyverse and ggplot for practice:

# plot average state income and Human Development Index
# using tidyverse and ggplot
merged_data <- hdi %>% dplyr::left_join(
  votes %>% filter(st_year==2000),
  by = c("state" = "st_state")
) 

p <- ggplot(
  merged_data,
  aes(x = st_income, y = hdi, label =  st_stateabb)
) + theme(
    plot.margin = margin(3, 3, 2.5, 1, "lines"),
    panel.border = element_rect(colour = "black", fill=NA, size=1),
    panel.background = element_rect(fill = 'white'),
    aspect.ratio = 1
  ) +
  labs(
    x = "Average state income in 2020",
    y = "Human Development Index"
  )
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
p + geom_text()
Warning: Removed 1 rows containing missing values (`geom_text()`).

None of the tidyverse methods I found (row_number, min_rank, dense_rank, percent_rank, cume_dist, or ntile) work quite like rank (where “ties” are averaged), so I continued using rank for this plot:

# plot rank of average state income and Human Development Index
p <- ggplot(
  merged_data %>% filter(state != 'Washington, D.C.'), 
  aes(x = rank(st_income), y = rank(hdi), label = st_stateabb),
  ) + theme(
    plot.margin = margin(3, 3, 2.5, 1, "lines"),
    panel.border = element_rect(colour = "black", fill=NA, size=1),
    panel.background = element_rect(fill = 'white'),
    aspect.ratio = 1
  ) +
  labs(
    x = "Rank of average state income in 2000",
    y = "Rank of Human Development Index"
  )

p + geom_text()
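As a quick illustration of the tie-handling difference mentioned above:

# base rank() averages tied values; dplyr's ranking helpers do not
x <- c(10, 20, 20, 30)
rank(x)             # 1.0 2.5 2.5 4.0
dplyr::min_rank(x)  # 1 2 2 4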

The relevance of this example is that we were better able to understand the data by plotting them in different ways.

Details of measurement can be important

In the code chunks below, I will recreate the plots in Figure 2.3 in the text (distribution of political ideology by income and distribution of party identification by income), using their code provided on the supplemental website.

pew_data_path <- file.path(data_root_path, "Pew/data/pew_research_center_june_elect_wknd_data.dta")
pew_pre <- read.dta(pew_data_path)
n <- nrow(pew_pre)
DT::datatable(head(pew_pre), options=list(scrollX=TRUE))
# weight
pop.weight0 <- pew_pre$weight

head(unique(pop.weight0))
[1] 1.326923 0.822000 0.493000 0.492000 2.000000 1.800000
# income (1-9 scale)
inc <- as.numeric(pew_pre$income)

# remove "dk/refused" income level
inc[inc==10] <- NA

levels(pew_pre$income)
 [1] "less than $10,000" "$10,000-$19,999"   "$20,000-$29,999"  
 [4] "$30,000-$39,999"   "$40,000-$49,000"   "$50,000-$74,999"  
 [7] "$75,000-$99,999"   "$100,000-$149,999" "$150,000+"        
[10] "dk/refused"       
unique(inc)
 [1]  6  8 NA  4  7  5  9  1  3  2
# I believe these are the midpoints of the income levels
value.inc <- c(5,15,25,35,45,62.5,87.5,125,200)

# maximum income value
n.inc <- max(inc, na.rm = TRUE)
n.inc
[1] 9
# party id
pid0 <- as.numeric(pew_pre[,"party"])
lean0 <- as.numeric(pew_pre[,"partyln"])

levels(pew_pre[,"party"])
[1] "missing/not asked" "republican"        "democrat"         
[4] "independent"       "no preference"     "other"            
[7] "dk"               
levels(pew_pre[,"partyln"])
[1] "missing/not asked" "lean republican"   "lean democrat"    
[4] "other/dk"         
unique(pid0)
[1]  3  2  4  7  5  6 NA
unique(lean0)
[1] NA  3  4  2
# assigning new integers for party id
pid <- ifelse(pid0==2, 5,  # Republican
         ifelse(pid0==3, 1,  # Democrat
         ifelse(lean0==2, 4, # Lean Republican
         ifelse(lean0==4, 2, # Lean Democrat
         3)))) # Independent

pid.label <- c("Democrat", "Lean Dem.", "Independent", "Lean Rep.", "Republican")
n.pid <- max(pid, na.rm=TRUE)
# ideology
ideology0 <- as.numeric(pew_pre[,"ideo"])

levels(pew_pre[,"ideo"])
[1] "missing/not asked" "very conservative" "conservative"     
[4] "moderate"          "liberal"           "very liberal"     
[7] "dk/refused"       
unique(ideology0)
[1]  5  4  3  6  2  7 NA
# assign new integers for ideology
ideology <- ifelse(ideology0==2, 5, # Very conservative
              ifelse(ideology0==3, 4, # Conservative
              ifelse(ideology0==6, 1, # Very liberal
              ifelse(ideology0==5, 2, # Liberal
              3)))) # Moderate
ideology.label <- c("Very liberal", "Liberal", "Moderate", "Conservative", "Very conservative")
n.ideology <- max(ideology, na.rm=TRUE)
# plot settings
par(mar=c(3,2,2.5,1), mgp=c(1.5, .7, 0), tck=-0.01)

# this creates an empty plot
plot(c(1,n.inc), c(0,1), xaxs="i", yaxs="i", type="n", xlab="", ylab="", xaxt="n", yaxt="n")

# add x-axis tick marks with empty string labels
axis(1, 1:n.inc, rep("", n.inc))

# add x-axis labels
axis(1, seq(1.5, n.inc-.5, length=3), c("Low income", "Middle income", "High income"), tck=0)

# add y-axis tick marks and labels
axis(2, c(0, 0.5, 1), c(0, "50%", "100%"))

center <- floor((1 + n.inc) / 2)

incprop <- array(NA, c(n.pid + 1, n.inc))

incprop[n.pid + 1,] <- 1


for (i in 1:n.pid) {
  for (j in 1:n.inc) {
    incprop[i, j] <- weighted.mean((pid<i)[inc==j], pop.weight0[inc==j], na.rm = TRUE)
  }
}

for (i in 1:n.pid){
  polygon(c(1:n.inc, n.inc:1), c(incprop[i,], rev(incprop[i+1,])), col=paste("gray", 40+10*i, sep=""))
  lines(1:n.inc, incprop[i,])
  text(center, mean(incprop[c(i,i+1),center]), pid.label[i])
}
mtext("Self-declared party identification, by income", side=3, line=1, cex=1.2)

par(mar=c(3,2,2.5,1), mgp=c(1.5,.7,0), tck=-.01)
plot(c(1,n.inc), c(0,1), xaxs="i", yaxs="i", type="n", xlab="", ylab="", xaxt="n", yaxt="n")
axis(1, 1:n.inc, rep("",n.inc))
axis(1, seq(1.5,n.inc-.5,length=3), c("Low income", "Middle income", "High income"), tck=0)
axis(2, c(0,.5,1), c(0,"50%","100%"))
center <- floor((1+n.inc)/2)
incprop <- array(NA, c(n.ideology+1,n.inc))
incprop[n.ideology+1,] <- 1
for (i in 1:n.ideology){
  for (j in 1:n.inc){
    incprop[i,j] <- weighted.mean((ideology<i)[inc==j], pop.weight0[inc==j], na.rm=TRUE)
  }
}
for (i in 1:n.ideology){
  polygon(c(1:n.inc, n.inc:1), c(incprop[i,], rev(incprop[i+1,])), col=paste("gray", 40+10*i, sep=""))
  lines(1:n.inc, incprop[i,])
  text(center, mean(incprop[c(i,i+1),center]), ideology.label[i])
}
mtext("Self-declared political ideology, by income", side=3, line=1, cex=1.2)

I’ll recreate the two plots using tidyverse and ggplot. I referenced this article by the Pew Research Center, which shows how to calculate weighted estimates.

pew_pre <- pew_pre %>%
  mutate(
    # create an integer column for income levels
    inc = as.numeric(income),
    # remove "dk/refuse" value for income
    inc = case_when(
      inc == 10 ~ NA,
      TRUE ~ inc
    ),
    # set political party integer column
    pid0 = as.numeric(party),
    # set lean integer column
    lean0 = as.numeric(partyln),
    # set ideology integer column
    ideology0 = as.numeric(ideo),
    # re-assign party values
    pid = case_when(
      pid0 == 2 ~ 5, # Republican
      pid0 == 3 ~ 1, # Democrat
      lean0 == 2 ~ 4, # Lean Republican
      lean0 == 4 ~ 2, # Lean Democrat
      is.na(pid0) ~ NA,
      TRUE ~ 3 # Independent
    ),
    # reassign ideology values
    ideology = case_when(
      ideology0 == 2 ~ 5, # Very conservative
      ideology0 == 3 ~ 4, # Conservative
      ideology0 == 6 ~ 1, # Very liberal
      ideology0 == 5 ~ 2, # Liberal
      is.na(ideology0) ~ NA,
      TRUE ~ 3 # Moderate
    )
  )

# constants
n_inc <- max(pew_pre$inc, na.rm = TRUE)
n_pid <- max(pew_pre$pid, na.rm = TRUE)
n_ideology <- max(pew_pre$ideology, na.rm = TRUE)
# calculate income proportions using population weight
inc_totals <- pew_pre %>% group_by(
  inc
) %>% summarize(
  total_inc = n()
)

pid_weighted_estimates <- pew_pre %>% dplyr::left_join(
  inc_totals
) %>% group_by(
  inc,
  pid
) %>% summarise(
  weighted_n = sum(weight)
) %>% mutate(
  weighted_group_size = sum(weighted_n),
  weighted_estimate = weighted_n / weighted_group_size
) %>% arrange(
  inc
)
Joining with `by = join_by(inc)`
`summarise()` has grouped output by 'inc'. You can override using the `.groups`
argument.
DT::datatable(pid_weighted_estimates)
ggplot(pid_weighted_estimates, aes(x=inc, y=weighted_estimate)) + 
  geom_area(aes(group = pid, fill = pid), position = position_stack(reverse = TRUE), show.legend = FALSE) +
  annotate("text", x=5, y=.2, label="Democrat", color="white") + 
  annotate("text", x=5, y=.425, label="Lean Dem.", color="white") + 
  annotate("text", x=5, y=.55, label="Independent", color="white") + 
  annotate("text", x=5, y=.68, label="Lean Rep.") + 
  annotate("text", x=5, y=.87, label="Republican") + 
  scale_x_continuous("Average Income Level", breaks=c(1,5,9), labels = c("Low Income", "Middle Income", "High Income"), , expand = c(0, 0)) +
  scale_y_continuous("Proportion of Sample", breaks=c(0,.5, 1), labels = c("0%", "50%", "100%"), expand = c(0, 0)) +
  theme(
    plot.margin = margin(3, 3, 2.5, 1, "lines"),
    panel.border = element_rect(colour = "black", fill=NA, size=1),
    plot.title = element_text(size=14,hjust = 0.5),
    axis.text = element_text(size = 10)
  ) + 
  ggtitle("Party Identification by Income Levels")
Warning: Removed 6 rows containing non-finite values (`stat_align()`).

# ideology data prep
ideo_weighted_estimates <- pew_pre %>% dplyr::left_join(
  inc_totals
) %>% group_by(
  inc,
  ideology
) %>% summarise(
  weighted_n = sum(weight)
) %>% mutate(
  weighted_group_size = sum(weighted_n),
  weighted_estimate = weighted_n / weighted_group_size
) %>% arrange(
  inc
)
Joining with `by = join_by(inc)`
`summarise()` has grouped output by 'inc'. You can override using the `.groups`
argument.
DT::datatable(ideo_weighted_estimates)
ggplot(ideo_weighted_estimates, aes(x=inc, y=weighted_estimate)) + 
  geom_area(aes(group = ideology, fill = ideology), position = position_stack(reverse = TRUE), show.legend = FALSE) +
  annotate("text", x=5, y=.03, label="Very Liberal", color="white", size=3) + 
  annotate("text", x=5, y=.13, label="Liberal", color="white", size=3) + 
  annotate("text", x=5, y=.4, label="Moderate", color="white", size=3) + 
  annotate("text", x=5, y=.75, label="Conservative", size=3) + 
  annotate("text", x=5, y=.95, label="Very Conservative", size=3) + 
  scale_x_continuous("Average Income Level", breaks=c(1,5,9), labels = c("Low Income", "Middle Income", "High Income"), , expand = c(0, 0)) +
  scale_y_continuous("Proportion of Sample", breaks=c(0,.5, 1), labels = c("0%", "50%", "100%"), expand = c(0, 0)) +
  theme(
    plot.margin = margin(3, 3, 2.5, 1, "lines"),
    panel.border = element_rect(colour = "black", fill=NA, size=1),
    plot.title = element_text(size=14,hjust = 0.5),
    axis.text = element_text(size = 10)
  ) + 
  ggtitle("Ideology by Income Levels")
Warning: Removed 6 rows containing non-finite values (`stat_align()`).

Overall, I think the plots I made look pretty similar to the textbook’s. The main bottleneck was learning how to calculate proportions based on weighted estimates. Once that took shape, I was able to figure out how to present the data visually in a way that resembles the authors’ approach.

The figure “Ideology by Income Levels” shows that the proportion of political liberals, moderates, and conservatives is about the same for all income levels. The figure “Party Identification by Income Levels” shows a strong relation between income and Republican partisanship, at least as of 2008. The partisanship and ideology example is a reminder that even very similar measures can answer quite different questions.

2.2 Validity and reliability

Data analysis reaches a dead end if we have poor data.

Two reasons for discussing measurement:

  • we need to understand what our data actually mean, otherwise we cannot extract the right information

  • learning about accuracy, reliability and validity will set a foundation for understanding variance, correlation, and error, which will be useful in setting up linear models in the forthcoming chapters

The property of being precise enough is a combination of the properties of the scale and what we are trying to use it for.

In social science, the way to measure what we are trying to measure is not as transparent as it is in everyday life.

Other times, the thing we are trying to measure is pretty straightforward, but a little bit fuzzy, and the ways to tally it up aren’t obvious.

Sometimes we are trying to measure something that we all agree has meaning, but which is subjective for every person and does not correspond to a “thing” we can count or measure with a ruler.

Attitudes are private; you can’t just weigh them or measure their widths.

It can be helpful to take multiple measurements on an underlying construct of interest.

Validity

A measure is valid to the degree that it represents what you are trying to measure. Asking people how satisfied they are with some government service might not be considered a valid measure of the effectiveness of that service.

Valid measures are ones in which there is general agreement that the observations are closely related to the intended construct.

validity: the property of giving the right answer on average across a wide range of plausible scenarios. To study validity in an empirical way, ideally you want settings in which there is an observable true value and multiple measurements can be taken.

Reliability

A reliable measure is one that is precise and stable. We would hope the variability in our sample is due to real differences among people or things, and not due to random error incurred during the measurement process.
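A small simulation sketch of this idea: when measurement error is comparable in size to the real differences among people, two measurements of the same people agree only weakly, and averaging repeated measurements recovers more of the true variation (all numbers below are made up for illustration):

# true scores vary across people; each measurement adds independent random error
set.seed(1)
n <- 1000
true_score <- rnorm(n, 0, 1)                        # real differences among people
measure <- function() true_score + rnorm(n, 0, 1)   # one noisy measurement

cor(measure(), measure())    # test-retest reliability, about 0.5 here

# averaging four measurements shrinks the error variance by a factor of 4
avg_of_4 <- (measure() + measure() + measure() + measure()) / 4
cor(avg_of_4, true_score)^2  # about 0.8 of the variance is now "real"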

Sample selection

selection: the idea that the data you see may be a nonrepresentative sample of a larger population you will not see.

In addition to selection bias, there are also biases from nonresponse to particular survey items, partially observed measurements, and choices in coding and interpretation of data.

2.3 All graphs are comparisons

Simple scatterplots

health_data_path = file.path(data_root_path, "HealthExpenditure/data/healthdata.txt")
health <- read.table(health_data_path, header=TRUE)
DT::datatable(head(health))
par(mgp=c(1.7,.5,0), tck=-.01, mar=c(3,3,.1,.1))
ggplot(data=health, aes(x=spending, y=lifespan, label=country)) +
  geom_text() + 
  labs(x="Health care spending (PPP US$)",y="Life expectancy (years)") + 
  theme(
    panel.border = element_rect(colour = "black", fill=NA, size=1),
    panel.background = element_rect(fill = 'white'),
    aspect.ratio = 0.5
  ) 

The graph shows the exceptional position of the United States (highest health care spending, low life expectancy) and also shows the relation between spending and lifespan in other countries.

Displaying more information on the graph

At least in theory, you can display five variables easily with a scatterplot:

  • x position

  • y position

  • symbol

  • symbol size

  • symbol color
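A minimal ggplot sketch of mapping all five channels at once, using the built-in mtcars data as a stand-in:

library(ggplot2)

# x position, y position, symbol, symbol size, and symbol color
ggplot(mtcars, aes(x = wt, y = mpg,
                   shape = factor(cyl),     # symbol
                   size  = hp,              # symbol size
                   color = factor(am))) +   # symbol color
  geom_point() +
  labs(x = "Weight", y = "Miles per gallon")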

We prefer clearly distinguishable colors or symbols such as open circles, solid circles, crosses and dots. When a graph has multiple lines, label them directly.

Multiple plots

Looking at data in unexpected ways can lead to discovery.

We can learn by putting multiple related graphs in a single display.

There is no single best way to display a dataset.

In the code chunks below I’ll recreate plots 2.6, 2.7, 2.8 and 2.9 using their code provided in the supplemental website.

allnames_path = file.path(data_root_path, "Names/data/allnames_clean.csv")
allnames <- read.csv(allnames_path)
DT::datatable(head(allnames), options=list(scrollX=TRUE))
# create a vector identifying which rows have Female names (TRUE) and which have Male names (FALSE)
girl <- as.vector(allnames$sex) == "F"

# Since there is a `names` base R function, I'm renaming the names vector to names_vec
names_vec <- as.vector(allnames$name)
# plot data

# calculate the number of characters in each name
namelength <- nchar(names_vec)

# create a vector of last letters of each name
lastletter <- substr(names_vec, namelength, namelength)

# create a vector of first letters of each name
firstletter <- substr(names_vec, 1, 1)

# plotting function
# i am renaming it to names_histogram since there is an `arm` package function called `discrete.histogram`
names_histogram <- function(x, prob, prob2=NULL, xlab="x", ylab="Probability", xaxs.label=NULL, yaxs.label=NULL, bar.width=NULL, ...) {
  if (length(x) != length(prob)) stop()
  
  x.values <- sort (unique(x))
  n.x.values <- length (x.values)
  gaps <- x.values[2:n.x.values] - x.values[1:(n.x.values-1)]
  if (is.null(bar.width)) bar.width <- min(gaps)*.2
  par(mar=c(3,3,2,2), mgp=c(1.7,.3,0), tck=0)
  plot(range(x)+c(-1,1)*bar.width, c(0,max(prob)),
    xlab=xlab, ylab=ylab, xaxs="i", xaxt="n",  yaxs="i",
    yaxt=ifelse(is.null(yaxs.label),"s","n"), bty="l", type="n", ...)
  if (is.null(xaxs.label)){
    axis(1, x.values)
  }
  else {
    n <- length(xaxs.label[[1]])
    even <- (1:n)[(1:n)%%2==0]
    odd <- (1:n)[(1:n)%%2==1]
    axis(1, xaxs.label[[1]][even], xaxs.label[[2]][even], cex.axis=.9)
    axis(1, xaxs.label[[1]][odd], xaxs.label[[2]][odd], cex.axis=.9)
  }
  if (!is.null(yaxs.label)){
    axis(2, yaxs.label[[1]], yaxs.label[[2]], tck=-.02)
  }
  for (i in 1:length(x)){
    polygon(x[i] + c(-1,-1,1,1)*bar.width/2, c(0,prob[i],prob[i],0),
      border="gray10", col="gray10")
    if (!is.null(prob2))
      polygon(x[i] + c(-1,-1,1,1)*bar.width/10, c(0,prob2[i],prob2[i],0),
        border="red", col="black")
  }
}

for (year in c(1900,1950,2010)){
  # get the column X1900, X1950 and X2010 from the data.frame `allnames`
  thisyear <- allnames[,paste("X",year,sep="")]
  # create 2D empty arrays to store Male and Female last and first letter of names
  lastletter.by.sex <- array(NA, c(26,2))
  firstletter.by.sex <- array(NA, c(26,2))
  # construct the 2D arrays of Male/Female last/first letter of names
  for (i in 1:26){
    lastletter.by.sex[i,1] <- sum(thisyear[lastletter==letters[i] & girl])
    lastletter.by.sex[i,2] <- sum(thisyear[lastletter==letters[i] & !girl])
    firstletter.by.sex[i,1] <- sum(thisyear[firstletter==LETTERS[i] & girl])
    firstletter.by.sex[i,2] <- sum(thisyear[firstletter==LETTERS[i] & !girl])
  }
  
  # plot the histogram for the given year
  names_histogram(1:26, 100*(lastletter.by.sex[,2])/sum(lastletter.by.sex[,2]), xaxs.label=list(1:26,letters), yaxs.label=list(seq(0,30,10),seq(0,30,10)), xlab="", ylab="Percentage of boys born", main=paste("Last letter of boys' names in", year), cex.axis=.9, cex.main=.9, bar.width=.8)
  for (y in c(10,20,30)) abline (y,0,col="gray",lwd=.5)
  
  names_histogram(1:26, 100*(lastletter.by.sex[,1])/sum(lastletter.by.sex[,1]), xaxs.label=list(1:26,letters), yaxs.label=list(seq(0,30,10),seq(0,30,10)), xlab="", ylab="Percentage of girls born", main=paste("Last letter of girls' names in", year), cex.main=.9)
  # adding the horizontal grid lines for the girls plots
  for (y in c(10,20,30)) abline (y,0,col="gray",lwd=.5)
}

I’ll recreate similar plots using tidyverse and ggplot.

First, I’ll make the data long (instead of its current wide shape) so that I can aggregate the data more easily for plotting:

# reshape the data to make it longer
allnames_long <- allnames %>% rename(
  id = 'X'
) %>% pivot_longer(
  starts_with("X"),
  names_to = "year",
  values_to = "name_n"
) %>% mutate(
  # remove X at start of the four-digit year
  year = str_remove(year, "X"),
  # extract last letter of each name
  last_letter = str_sub(name, -1)
) 

DT::datatable(head(allnames_long))

I then aggregate the data by year, sex and last letter, and calculate the count and percentage of last letters in each year for each sex:

allnames_agg <- allnames_long %>% group_by(
  year,
  sex,
  last_letter
) %>% summarise(
  last_letter_n = sum(name_n)
) %>% mutate(
  total_n = sum(last_letter_n),
  last_letter_pct = last_letter_n / total_n
)
`summarise()` has grouped output by 'year', 'sex'. You can override using the
`.groups` argument.
DT::datatable(head(allnames_agg, n=30))

I then plot one year’s data to make sure I’m getting the same result as the textbook:

ggplot(allnames_agg %>% filter(year == "1900", sex == "F"), aes(x = last_letter, y = last_letter_pct)) + 
  geom_bar(stat="identity", fill="black") + 
  theme(
    plot.margin = margin(3, 3, 2.5, 1, "lines"),
    panel.border = element_rect(color = "black", fill=NA, size=1),
    panel.background = element_rect(color = "white", fill=NA),
    plot.title = element_text(size=14,hjust = 0.5),
    axis.text = element_text(size = 10)
  ) + 
  scale_x_discrete("Last Letter of Name", expand = c(0, 0)) +
  scale_y_continuous("Percentage of Total", limits=c(0,0.4), breaks=c(0,.1,.2,.3,.4), labels=c("0%", "10%", "20%", "30%", "40%"), expand = c(0, 0)) +
  ggtitle("Last Letter of Girls' Names in 1900")

Finally, I’ll plot histograms to match Figures 2.6 and 2.7:

# plotting function 
plot_last_letter_distribution <- function(year_arg, sex_arg) {
  # prepare string for plot title
  title_sex <- if(sex_arg == 'F') "Girls'" else "Boys'"
  title_str <- paste("Last Letter of", title_sex, "Names in", year_arg)
  
  # prep data
  plot_data <- allnames_agg %>% filter(year == year_arg, sex == sex_arg)
  
  # calculate y-axis limits, break points and labels
  # calculate max y limit to 0.05 more than the nearest 0.05
  limits_max <- round(max(plot_data$last_letter_pct)/0.05) * 0.05 + 0.05
  
  # calculate y-axis breaks
  breaks_val = seq(0, limits_max, by=0.05)
  
  # calculate y-axis labels
  labels_val = c()
  
  for (val in breaks_val) {
    labels_val <- append(labels_val, paste0(val*100, "%"))
  }
  
  print(ggplot(plot_data, aes(x = last_letter, y = last_letter_pct)) + 
  geom_bar(stat="identity", fill="black") + 
  theme(
    plot.margin = margin(3, 3, 2.5, 1, "lines"),
    panel.border = element_rect(color = "black", fill=NA, size=1),
    panel.background = element_rect(color = "white", fill=NA),
    plot.title = element_text(size=14,hjust = 0.5),
    axis.text = element_text(size = 10)
  ) + 
  scale_x_discrete("Last Letter of Name", expand = c(0, 0)) +
  scale_y_continuous("Percentage of Total", limits = c(0, limits_max), breaks = breaks_val, labels = labels_val, expand = c(0, 0)) +
  ggtitle(title_str)
  )
}

for (year in c(1906, 1956, 2006)) {
  plot_last_letter_distribution(year, sex = 'M')
}

The following code chunks recreate Figure 2.8 using their code provided on the supplemental website:

# define full range of years in the dataset
yrs <- 1880:2010

# total number of years
n.yrs <- length(yrs)

# create empty 3D arrays to hold last and first letter frequencies
lastletterfreqs <- array(NA, c(n.yrs,26,2))
firstletterfreqs <- array(NA, c(n.yrs,26,2))

# define the names of each dimension of the letter frequency 3D arrays
# as the list of years, letters and the list c("girls", "boys")
dimnames(lastletterfreqs) <- list(yrs, letters, c("girls","boys"))
dimnames(firstletterfreqs) <- list(yrs, letters, c("girls","boys"))

# construct the arrays lastletterfreqs and firstletterfreqs
for (i in 1:n.yrs){
  thisyear <- allnames[,paste("X",yrs[i],sep="")]
  for (j in 1:26){
    # sum of last letters for girls
    lastletterfreqs[i,j,1] <- sum(thisyear[lastletter==letters[j] & girl])
    
    # sum of last letters for boys
    lastletterfreqs[i,j,2] <- sum(thisyear[lastletter==letters[j] & !girl])
    
    # sum of first letters for girls
    firstletterfreqs[i,j,1] <- sum(thisyear[firstletter==LETTERS[j] & girl])
    
    # sum of first letters for boys
    firstletterfreqs[i,j,2] <- sum(thisyear[firstletter==LETTERS[j] & !girl])
  }
  for (k in 1:2){
    # percentage of each last letter (of all last letters that year)
    lastletterfreqs[i,,k] <- lastletterfreqs[i,,k]/sum(lastletterfreqs[i,,k])
    
    # percentage of each first letter (of all first letters that year)
    firstletterfreqs[i,,k] <- firstletterfreqs[i,,k]/sum(firstletterfreqs[i,,k])
  }
}
# plot of percentage of last letters over time

# plot settings
par(mar=c(2,3,2,1), mgp=c(1.7,.3,0), tck=-.01)

# index of letters N, Y and D
popular <- c(14,25,4)

# width and line type for all letters except N, Y and D
width <- rep(.5,26)
type <- rep(1,26)

# width and line type for N, Y and D
width[popular] <- c(2,3,3)
type[popular] <- c(1,3,2)

plot(range(yrs), c(0,41), type="n", xlab="", ylab="Percentage of all boys' names that year", bty="l", xaxt="n", yaxt="n", yaxs="i", xaxs="i")
  axis(1, seq(1900,2000,50))
  axis(2, seq(0,40,20), paste(seq(0,40,20), "%", sep=""))
  for (j in 1:26){
    # I don't think the following two variables, `maxfreq` and `best` are used in plotting
    maxfreq <- max(lastletterfreqs[,j,2])
    best <- (1:n.yrs)[lastletterfreqs[,j,2]==maxfreq]
    
    # plotting line for each letter for all years for boys
    lines(yrs, 100*lastletterfreqs[,j,2], col="black", lwd=width[j], lty=type[j])
  }
  
# plot annotations
text(2000, 35, "N")
text(1935, 20, "D")
text(1975, 15, "Y")
mtext("Last letters of boys' names", side=3, line=.5)

Next, I’ll recreate the plot of Figure 2.8 using tidyverse and ggplot for practice:

# define new linewidth vector

# index of letters N, Y and D
popular <- c(14,25,4)

# width and line type for all letters except N, Y and D
width <- rep(.25,26)

# width and line type for N, Y and D
width[popular] <- c(1,1,1)

ggplot(allnames_agg %>% filter(sex == "M"), aes(x=year, y=last_letter_pct, linetype=factor(last_letter), linewidth = factor(last_letter))) + 
  geom_line(aes(group = last_letter)) +
  scale_linetype_manual(values = type, guide = "none") + 
  scale_linewidth_manual(values = width, guide = "none") + 
  scale_x_discrete("Year", breaks = c(1900, 1950, 2000), expand = c(0, 0)) +
  scale_y_continuous("Percentage of all boys' names that year", breaks = c(0, 0.1, 0.2, 0.3, 0.4), labels = c("0%", "10%", "20%", "30%", "40%"), expand = c(0, 0)) +
  ggtitle("Last letters of boys' names") + 
  theme(
    panel.border = element_rect(color = "black", fill=NA, size=1),
    panel.background = element_rect(color = "white", fill=NA),
    plot.title = element_text(size=14,hjust = 0.5),
    axis.text = element_text(size = 10)
  ) + 
  annotate("text", x = "2000", y = 0.35, label = "N") + 
  annotate("text", x = "1935", y = 0.20, label = "D") + 
  annotate("text", x = "1975", y = 0.15, label = "Y")

The following code chunks recreate Figure 2.9 using their code provided on the supplemental website:

# create an empty 2D array (years x sex) to hold the yearly percentage of top ten names
topten_percentage <- array(NA, c(length(yrs), 2))

for (i in 1:length(yrs)){
  # get data for the given year
  thisyear <- allnames[,paste("X",yrs[i],sep="")]
  
  # get boys' data for this year
  boy.totals <- thisyear[!girl]
  
  # calculate percentages of boys' names
  boy.proportions <- boy.totals/sum(boy.totals)
  
  # get total percentage of the top ten names
  index <- rev(order(boy.proportions))
  popularity <- boy.proportions[index]
  topten_percentage[i,2] <- 100*sum(popularity[1:10])
  
  # do the same for girls' top ten names
  girl.totals <- thisyear[girl]
  girl.proportions <- girl.totals/sum(girl.totals)
  index <- rev(order(girl.proportions))
  popularity <- girl.proportions[index]
  topten_percentage[i,1] <- 100*sum(popularity[1:10])
}
# plot settings
par(mar=c(4,2,1,0), mgp=c(1.3,.2,0), tck=-.02)

# plot girls and boys top ten name percentages over time
plot(yrs, topten_percentage[,2], type="l", xaxt="n", yaxt="n", xaxs="i", yaxs="i", ylim=c(0,45), bty="l", xlab="Year", ylab="", cex.lab=.8)
lines(yrs, topten_percentage[,1])
axis(1, c(1900,1950,2000), cex.axis=.8)
axis(2, c(0,20,40), c("0%","20%","40%"), cex.axis=.8)
text(1902, 35, "Boys", cex=.75, adj=0)
text(1911, 20, "Girls", cex=.75, adj=0)
mtext("Total popularity of top ten names each year, by sex", cex=.8)
mtext("Source:  Social Security Administration, courtesy of Laura Wattenberg", 1, 2.5, cex=.5, adj=0)

Next, I’ll recreate the plot of Figure 2.9 using tidyverse and ggplot for practice:

DT::datatable(head(allnames_long))
allnames_topten_names <- allnames_long %>% group_by(
  year,
  sex
) %>% mutate(
  name_pct = name_n / sum(name_n)
) 
DT::datatable(head(allnames_topten_names %>% arrange(year, sex, desc(name_pct))))
allnames_top_ten_pct <- allnames_topten_names %>% arrange(
  year, 
  sex, 
  desc(name_pct)
) %>% group_by(
  year,
  sex
) %>% slice(
  1:10
) %>% summarise(
  top_ten_pct = sum(name_pct)
)
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
DT::datatable(head(allnames_top_ten_pct))
# plot the lines

ggplot(allnames_top_ten_pct, aes(x = year, y = top_ten_pct)) + 
  geom_line(aes(group = sex)) +
  scale_x_discrete("Year", breaks = c(1900, 1950, 2000), expand = c(0, 0)) +
  scale_y_continuous("Percentage of top ten names that year", breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5), labels = c("0%", "10%", "20%", "30%", "40%", "50%"), expand = c(0, 0)) +
  ggtitle("Total popularity of top ten names each year, by sex") + 
  theme(
    panel.border = element_rect(color = "black", fill=NA, size=1),
    panel.background = element_rect(color = "white", fill=NA),
    plot.title = element_text(size=14,hjust = 0.5),
    axis.text = element_text(size = 10)
  ) +
  annotate("text", x = "1902", y = 0.36, label = "Boys") + 
  annotate("text", x = "1911", y = 0.19, label = "Girls")

Grids of plots

Realistically, it can be difficult to read a plot with more than two colors.

Constructing a two-way grid of plots to represent two more variables (the approach of small multiples) can be more effective than trying to cram five variables onto a single plot.

In the following code chunks, I’ll recreate Figure 2.10 following their code provided on the supplemental website.

# view the data
# (`congress` is a list of matrices, one per election year, loaded from the
# supplemental .asc files; the loading code is not shown in this chunk)
congress[[27]][1,]
[1]   1948      1      1     -1 127802 103294
congress[[28]][1,]
[1]   1950      1      1      1 134258  96251
congress[[37]][10,]
[1]   1968      3      1     -1      0 140419
congress[[38]][10,]
[1]   1970      3      1     -1      0 117045
congress[[47]][20,]
[1]   1988      4      1     -1  86623 131824
congress[[48]][50,]
[1]  1990    13    11     1 36286     0
# plot settings
par(mfrow=c(3,5), mar=c(0.1,3,0,0), mgp=c(1.7, .3, 0), tck=-.02, oma=c(1,0,2,0))

# region names used as column labels below (needed by `region_name[j]` in the loop)
region_name <- c("Northeast", "Midwest", "South", "West")

# plot data for each of the selected year pairs: 1948/50, 1968/70, 1988/90
for (i in c(27, 37, 47)) {
  # calculate year based on index
  year <- 1896 + 2*(i-1)
  
  # get the data for the first election year
  cong1 <- congress[[i]]
  
  # get data for the next election year
  cong2 <- congress[[i+1]]
  
  # the second column contains the state code
  state_code <- cong1[,2]
  
  # convert from state code to region index
  region <- floor(state_code/20) + 1
  
  # the fourth column contains whether the election has an incumbent
  inc <- cong1[,4]
  
  # the fifth and sixth columns contain democrat and republican vote counts
  dvote1 <- cong1[,5]/(cong1[,5] + cong1[,6])
  dvote2 <- cong2[,5]/(cong2[,5] + cong2[,6])
  
  # their definition of an election being "contested" if the 
  # proportion of dem votes is between 20 and 80 percent each election year
  contested <- (abs(dvote1 - 0.5)) < 0.3 & (abs(dvote2 - 0.5) < 0.3)
  
  # plot function
  plot(c(0, 1), c(0, 1), type="n", xlab="", ylab="", xaxt="n", yaxt="n", bty="n")
  text(0.8, 0.5, paste(year,"\nto\n", year+2, sep=""), cex=1.1)
  
  # four columns, one for each region
  for (j in 1:4){
    plot(c(.2, .8), c(-.4, .3), type="n", xlab= "" , ylab=if (j==1) "Vote swing" else "", xaxt="n", yaxt="n", bty="n", cex.lab=.9)
    if (i==47) {
      # x-axis labels for the final pair of elections 1988/90
      text(c(.25, .5, .75), rep(-.4, 3), c("25%", "50%", "75%"), cex=.8)
      abline(-.35, 0, lwd=.5, col="gray60")
      segments(c(.25, .5, .75), rep(-.35, 35), c(.25, .5, .75), rep(-.37, 3), lwd=.5)
      mtext("Dem. vote in election 1", side=1, line=.2, cex=.5)
    }
    # y-axis for each plot
    axis(2, c(-0.25, 0, 0.25), c("-25%", "0", "25%"),  cex.axis=.8)
    abline(0, 0)
    # region name above the first pair of elections 1948/50
    if (i==27) mtext(region_name[j], side=3, line=1, cex=.75)
    
    # plotting contested elections for the region with incumbent
    ok <- contested & abs(inc)==1 & region==j
    points(dvote1[ok], dvote2[ok] - dvote1[ok], pch=20, cex=.3, col="gray60")
    
    # plotting contested elections for the region with open seats
    ok <- contested & abs(inc)==0 & region==j
    points(dvote1[ok], dvote2[ok] - dvote1[ok], pch=20, cex=.5, col="black")
  }
}

Next, I’ll recreate similar plots with tidyverse and ggplot.

# get list of all .asc file paths
congress_data_folder <- file.path(data_root_path, "Congress/data")
congress_data_files <- list.files(congress_data_folder, full.names=TRUE, pattern = "*.asc")

# helper function which loads data into a CSV and adds a year column
read_congress_data <- function(fpath) {
  year_str = str_remove(basename(fpath), ".asc")
  
  out <- readr::read_table(
    fpath,
    col_names = c("state", "col2", "incumbent", "dem", "rep"),
    col_types = cols(state = "i", col2 = "c", incumbent = "i", dem = "i", rep = "i")
  ) %>% mutate(
    year = year_str
  )
  
  return(out)
}

# read all asc files into a single data.frame
# filter for the years in question
congress_data <- congress_data_files %>% map(
  read_congress_data
) %>% list_rbind(
) %>% filter(
  year %in% c(1948, 1950, 1968, 1970, 1988, 1990)
)
# view the data
DT::datatable(congress_data)

Next, I’ll create two data.frames, cong1 and cong2, with 1948/1968/1988 and 1950/1970/1990 data respectively, then join them on a lookup key in order to perform year-to-year calculations:

# 1948, 1968, and 1988 data
cong1 <- congress_data %>% filter(
  year %in% c(1948, 1968, 1988)
) %>% rename(
  # rename columns to avoid duplicates later on when joining the two data.frames
  state1 = state,
  col21 = col2,
  incumbent1 = incumbent,
  dem1 = dem,
  rep1 = rep,
  year1 = year
) %>% mutate(
  # create a lookup value to join the two data.frames by
  election = case_when(
    year1 == 1948 ~ "1948 to 1950",
    year1 == 1968 ~ "1968 to 1970",
    year1 == 1988 ~ "1988 to 1990"
  ),
  lookup = paste0(state1, "-", col21, "-", election)
)

# 1950, 1970 and 1990 data
cong2 <- congress_data %>% filter(
  year %in% c(1950, 1970, 1990)
) %>% rename(
  # rename columns to avoid duplicates later on when joining the two data.frames
  state2 = state,
  col22 = col2,
  incumbent2 = incumbent,
  dem2 = dem,
  rep2 = rep,
  year2 = year
) %>% mutate(
  # create a lookup value to join the two data.frames by
  election = case_when(
    year2 == 1950 ~ "1948 to 1950",
    year2 == 1970 ~ "1968 to 1970",
    year2 == 1990 ~ "1988 to 1990"
  ),
  lookup = paste0(state2, "-", col22, "-", election)
)


# merge the two data.frames 
cong <- cong1 %>% dplyr::left_join(
  cong2,
  by = "lookup"
) %>% mutate(
  # calculate democrat vote proportion
  dvote1 = dem1 / (dem1 + rep1),
  dvote2 = dem2 / (dem2 + rep2),
  # calculate difference in democrat vote proportion between election years
  dvote_diff= dvote2 - dvote1,
  # calculate whether an election is contested
  # an election is contested if the dem vote in both years is 
  # between 20% and 80%
  contested = case_when(
    (abs(dvote1 - 0.5)) < 0.3 & (abs(dvote2 - 0.5) < 0.3) ~ 1,
    TRUE ~ 0
  ),
  # standardize incumbent flag in the first election year
  inc = abs(incumbent1),
  # remove values that are not applicable
  inc = na_if(inc, 9),
  inc = na_if(inc, 3),
  inc = as.character(inc),
  # calculate region code
  region = floor(state1/20) + 1
) %>% filter(
  # only include contested elections
  contested == 1,
  # only include four main regions
  region %in% c(1,2,3,4)
) 

# %>% mutate(
#   # change region to text
#   region = case_when(
#     region == 1 ~ "Northeast",
#     region == 2 ~ "Midwest",
#     region == 3 ~ "South",
#     region == 4 ~ "West"
#   )
# )
DT::datatable(cong, options=list(scrollX=TRUE))

Finally, I can plot the data:

# labels for regions
region_labeller <- c(
  "1" = "Northeast",
  "2" = "Midwest",
  "3" =  "South",
  "4" = "West"
)

ggplot(cong, aes(x=dvote1, y = dvote_diff)) + 
  geom_point(aes(color=inc, size=inc), show.legend = FALSE) + 
  facet_grid(rows = vars(election.x), cols = vars(region), switch = "y", labeller = labeller(region = region_labeller)) + 
  theme(
    panel.background = element_rect(fill = "white"),
    strip.placement = "outside",
    axis.line=element_line()
  ) + 
  scale_color_manual(values = c("black", "grey"), na.translate = FALSE) + 
  scale_size_manual(values = c(0.75, 0.5), na.translate = FALSE) + 
  geom_abline(slope = 0, intercept = 0) +
  scale_y_continuous("Vote Swing", breaks = c(-0.25, 0, 0.25), labels = c("-25%", "0%", "25%")) +
  scale_x_continuous("Dem. Vote in Election 1", breaks = c(.25, .50, .75), labels = c("25%", "50%", "75%"))
Warning: Removed 38 rows containing missing values (`geom_point()`).

While I wasn’t able to get exactly the same output as the textbook, I’m content with the grid of plots I created with ggplot.

Applying graphical principles to numerical displays and communication more generally

Avoid overwhelming the reader with irrelevant material.

Do not report numbers to too many decimal places. You should display precision in a way that respects the uncertainty and variability in the numbers being presented. It makes sense to save lots of digits for intermediate steps in computations.

You can often make a list or table of numbers more clear by first subtracting out the average (or for a table, row and column averages).
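A small sketch of doing this for a table in R with sweep() (the numbers are made up):

# subtract row and column averages so the remaining structure stands out
tab <- matrix(c(52, 48, 55, 61, 57, 66, 47, 44, 50), nrow = 3)
centered <- sweep(tab, 1, rowMeans(tab))            # remove row averages
centered <- sweep(centered, 2, colMeans(centered))  # then column averages
round(centered, 1)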

A graph can almost always be made smaller than you think and still be readable. This then leaves room for more plots on a grid, which then allows more patterns to be seen at once and compared.

Never display a graph you can’t talk about. Give a full caption for every graph (this explains to yourself and others what you are trying to show and what you have learned from each plot).

Avoid displaying graphs that have been made simply because they are conventional.

Graphics for understanding statistical models

Three uses of graphs in statistical analysis:

  1. Displays of raw data. Exploratory analysis.
  2. Graphs of fitted models and inferences (and simulated data).
  3. Graphs presenting your final results.

The goal of any graph is communication to self and others. Graphs are comparisons: to zero, to other graphs, to horizontal lines, and so forth. The unexpected is usually not an “outlier” or aberrant point but rather a systematic pattern in some part of the data.

Graphs as comparisons

When making a graph, line things up so that the most important comparisons are clearest.

Graphs of fitted models

It can be helpful to graph a fitted model and data on the same plot. Another use of graphics with fitted models is to plot predicted datasets and compare them visually to actual data.