R Shiny Census App

R
data analysis
SQL
An explanation of my development process for a census data shiny app
Author

Vishal Bakshi

Published

September 21, 2021

In this blog post, I’ll walk through my development process for a U.S. Census data visualization web app I created using the Shiny package in R.

You can access the app at vbakshi.shinyapps.io/census-app.

Table of Contents

Backstory

I started this project by reading the handbook Understanding and Using American Community Survey Data: What State and Local Government Users Need to Know published by the U.S. Census Bureau. I recreated the handbook’s first case study in R, in which they make comparisons across geographic areas, create custom geographic areas from census tracts and calculate margins of error for derived estimates for Minnesota Census Tract 5-year earnings estimates.

During the process of recreating the derived median earnings estimate calculations, I was unable to recreate a key value from the handbook (the Standard Error for the 50% proportion, calculated to be 0.599) because I was unable to deduce the values used in the following formula referenced from page 17 of the PUMS Accuracy of the Data documentation:

Standard Error equals Design Factor times square root of the product of 95 over 5B and 50 squared

The documentation defines B as the base, which is the calculated weighted total. I chose the value of 1.3 for the design factor DF since it corresponds to STATE = Minnesota, CHARTYP = Population, CHARACTERISTIC = Person Earnings/Income in the Design Factors CSV published by the Census Bureau.

I called the Census Bureau Customer Help Center for assistance and was transferred to a member of the ACS Data User Support team with whom I discussed my woes. He was unable to confirm the values of the design factor DF or B, and was unable to pull up the contact information for the statistical methodology team, so I emailed him my questions. After a few email exchanges, the statistical methodology team provided the following:

  • DF = 1.3
  • B = the total population estimate for which the median is being calculated, which is 82488 for the case study calculation (Minnesota Rural Male Full Time Workers)
  • The term 95/5 is associated with the finite population correction factor (100 - f) divided by the sample fraction (f), where f = 5% (later on I note in the documentation that this 95/5 term is based on a 68% confidence interval). The data used in the handbook case study is from 5-year estimates. 1-year estimates sample 2.5% of the population, so the 5-year estimates represent a 5 * 2.5 = 12.5% sample. Instead of 95/5, the ratio becomes (100 - 12.5)/12.5 = 87.5/12.5

The updated formula is then:

Standard Error equals Design Factor times square root of the product of 87.5 over 12.5B and 50 squared

I was able to calculate the median earnings estimate (and associated standard error and margin of error) within a few percent of the values given in the handbook. This provided me with confirmation that I was ready to expand my code to calculate median earnings estimates for other subgroups.

Codebase

I built this app using the R package Shiny which handles both the UI and the server. I store the data in a sqlite database and access it with queries written using the RSQLite package which uses the DBI API. The following sections break down the R scripts based on functionality. Click on the script name to navigate to that section.

app.R

A shiny app has three fundamental components:

ui <- (...)
server <- (...)
shinyApp(ui, server,...)

The ui object holds all UI layout, input and output objects which define the front-end of your app. The server object holds all rendering functions which are assigned to outputs that appear on the UI. The shinyApp function takes a ui and server object (along with other arguments) and creates a shiny app object which can be run in a browser by passing it to the runApp function. Person inputs (such as selections in a dropdown) are assigned to a global input object.

What’s in my ui?

All of my UI objects are wrapped within a fluidPage call which returns a page layout which “consists of rows which in turn include columns” (from the docs).

My app’s UI has four sections:

  1. Dropdowns to select state, sex and work status for which the person using the app wants ACS 5-year earnings estimates

Dropdowns to select state, sex and work status for which the person using the app wants ACS 5-year earnings estimates

  1. A table with the estimate, standard error and margin of error for median earnings

A table with the estimate, standard error and margin of error for median earnings

  1. A bar plot of population estimates for earnings levels for the selected state, sex, work status and RUCA (Rural-Urban Commuting Areas) level

A bar plot of population estimates for earnings levels for the selected state, sex, work status and RUCA (Rural-Urban Commuting Areas) level

  1. A table with population estimates for earnings levels for each RUCA level for the selected state, sex and work status

Each section has a download button so that people can get the CSV files or plot image for their own analysis and reporting. Each section is separated with markdown('---') which renders an HTML horizontal rule (<hr>).

Tables

Tables (the HTML <table> element) are a type of UI Output. I define each with an outputId for reference in the server.

tableOutput(outputId = "...")

Plot

Similarly, a plot (which is rendered as an HTML <img> element) is a type of UI Output. I define each with an outputId.

plotOutput(outputId = "...")

Download Buttons

The download button (an HTML <a> element) is also a type of UI Output. I define each with an outputId and label (which is displayed as the HTML textContent attribute of the <a> element).

downloadButton(
  outputId = "...",
  label = "..."
)

What does my server do?

The server function has three parameters: input, output and session. The input object is a ReactiveValues object which stores all UI Input values, which are accessed with input$inputId. The output object similarly holds UI Output values at output$outputId. I do not use the session object in my app (yet).

My app’s server has four sections:

  1. Get data from the SQLite database
  2. Render table and plot outputs
  3. Prepare dynamic text (for filenames and the plot title)
  4. Handle data.frame and plot downloads

Get Data

There are three high-level functions which call query/format/calculation functions to return the data in the format necessary to produce table, text, download and plot outputs:

  • The earnings_data function passes the person-selected dropdown options input$sex, input$work_status and input$state to the get_b20005_ruca_aggregate_earnings function to get a query result from the SQLite database. That function call is passed to format_earnings, which in turn is passed to the reactive function to make it a reactive expression. Only reactive expressions (and reactive endpoints in the output object) are allowed to access the input object which is a reactive source. You can read more about Shiny’s “reactive programming model” in this excellent article.
earnings_data <- reactive(
  format_earnings(
    get_b20005_ruca_aggregate_earnings(
      input$sex, 
      input$work_status, 
      input$state)))
  • The design_factor function passes the input$state selection to the get_design_factor function which in turn is passed to the reactive function.
design_factor <- reactive(get_design_factor(input$state))
  • The median_data function passes the return values from earnings_data() and design_factor() to the calculate_median function which in turn is passed to the reactive function.
median_data <- reactive(calculate_median(earnings_data(), design_factor()))

Render Outputs

I have two reactive endpoints for table outputs, and one endpoint for a plot. The table outputs use renderTable (with row names displayed) with the data.frame coming from median_data() and earnings_data(). The plot output uses renderPlot, and a helper function make_plot to create a bar plot of earnings_data() for a person-selected input$ruca_level with a title created with the helper function earnings_plot_title().

output$median_data <- renderTable(
  expr = median_data(), 
  rownames = TRUE)
  
output$earnings_data <- renderTable(
  expr = earnings_data(), 
  rownames = TRUE)
    
output$earnings_histogram <- renderPlot(
  expr = make_plot(
    data=earnings_data(), 
    ruca_level=input$ruca_level, 
    plot_title=earnings_plot_title()))

Prepare Dynamic Text

I created four functions that generate filenames for the downloadHandler call when the corresponding downloadButton gets clicked, one function that generates the title used to generate the bar plot, and one function which takes computer-readable character objects (e.g. "Large_Town") and maps it to and returns a more human-readable character object (e.g. "Large Town"). I chose to keep filenames more computer-readable (to avoid spaces) and the plot title more human-readable.

get_pretty_text <- function(raw_text){
  text_map <- c("M" = "Male", 
  "F" = "Female",
  "FT" = "Full Time",
  "OTHER" = "Other",
  "Urban" = "Urban",
  "Large_Town" = "Large Town",
  "Small_Town" = "Small Town",
  "Rural" = "Rural")
  return(text_map[raw_text])
  }
 
earnings_plot_title <- function(){
  return(paste(
    input$state,
    get_pretty_text(input$sex),
    get_pretty_text(input$work_status),
    input$ruca_level,
    "Workers",
    sep=" "))
  }

b20005_filename <- function(){
    return(paste(
      input$state,
      get_pretty_text(input$sex),
      input$work_status,
      "earnings.csv",
      sep="_"
    ))
  }
  
median_summary_filename <- function() {
  paste(
    input$state,  
    get_pretty_text(input$sex), 
    input$work_status, 
    'estimated_median_earnings_summary.csv',  
    sep="_")
  }
  
ruca_earnings_filename <- function() {
  paste(
    input$state,  
    get_pretty_text(input$sex),  
    input$work_status, 
    'estimated_median_earnings_by_ruca_level.csv',  
    sep="_")
  }
  
earnings_plot_filename <- function(){
  return(paste(
    input$state,
    get_pretty_text(input$sex),
    input$work_status,
    input$ruca_level,
    "Workers.png",
    sep="_"))
  }

Handle downloads

I have five download buttons in my app: two which trigger a download of a zip file with two CSVs, two that downloads a single CSV, and one that downloads a single PNG. The downloadHandler function takes a filename and a content function to write data to a file.

In order to create a zip file, I use the zip base package function and pass it a vector with two filepaths (to which data is written using the base package’s write.csv function) and a filename. I also specify the contentType as "application/zip". In the zip file, one of the CSVs contains a query result from the b20005 SQLite database table with earnings data, and the other file, "b20005_variables.csv" contains B20005 table variable names and descriptions. In order to avoid the files being written locally before download, I create a temporary directory with tempdir and prepend it to the filename to create the filepath.

For the bar plot image download, I use the ggplot2 package’s ggsave function, which takes a filename, a plot object (returned from the make_plot helper function) and the character object "png" (for the device parameter).

output$download_selected_b20005_data <- downloadHandler(
    filename = "b20005_data.zip",
    content = function(fname) {
      # Create a temporary directory to prevent local storage of new files
      temp_dir <- tempdir()
      
      # Create two filepath character objects and store them in a list
      # which will later on be passed to the `zip` function
      path1 <- paste(temp_dir, '/', b20005_filename(), sep="")
      path2 <- paste(temp_dir, "/b20005_variables.csv", sep="")
      fs <- c(path1, path2)
      
      # Create a CSV with person-selection input values and do not add a column
      # with row names
      write.csv(
        get_b20005_earnings(input$state, input$sex, input$work_status), 
        path1,
        row.names = FALSE)
      
      # Create a CSV for table B20005 variable names and labels for reference
      write.csv(
        get_b20005_ALL_labels(),
        path2,
        row.names = FALSE)
      
      # Zip together the files and add flags to maximize compression
      zip(zipfile = fname, files=fs, flags = "-r9Xj")
    },
    contentType = "application/zip"
  )
  
output$download_all_b20005_data <- downloadHandler(
  filename = "ALL_B20005_data.zip",
  content = function(fname){
    path1 <- "ALL_B20005_data.csv"
    path2 <- "b20005_variables.csv"
    fs <- c(path1, path2)
    
    write.csv(
      get_b20005_earnings('ALL', 'ALL', 'ALL'),
      path1,
      row.names = FALSE)
    
    write.csv(
      get_b20005_ALL_labels(),
      path2,
      row.names = FALSE)
    
    zip(zipfile = fname, files=fs, flags = "-r9Xj")
    },
    contentType = "application/zip"
  )
  
output$download_median_summary <- downloadHandler(
  filename = median_summary_filename(),
  content = function(file) {
    write.csv(median_data(), file)
    }
  )
  
output$download_earnings_plot <- downloadHandler(
  filename = earnings_plot_filename(),
  content = function(file) {
    ggsave(
      file, 
      plot = make_plot(
        data=earnings_data(), 
        ruca_level=input$ruca_level, 
        plot_title=earnings_plot_title()), 
        device = "png")
      }
  )
  
output$download_ruca_earnings <- downloadHandler(
  filename = ruca_earnings_filename(),
  content = function(file) {
    write.csv(earnings_data(), file)
  }
  )

prep_db.R

This script is meant to be run locally, and is not deployed, as doing so would create a long delay to load the app.

Database Tables

The database diagram is shown below (created using dbdiagram.io):

Database diagram showing the database table schemas and their relationships

I have five tables in my database:

b20005

Holds the data from the ACS 2015-2019 5-year detailed table B20005 (Sex By Work Experience In The Past 12 Months By Earnings In The Past 12 Months). This includes earnings estimates and margins of errors for Male and Female, Full Time and Other workers, for earning ranges (No earnings, $1 - $2499, $2500 - $4999, …, $100000 or more). The following table summarizes the groupings of the (non-zero earnings) variables relevant to this app:


Variable Demographic
B20005_003 to B20005_025 Male Full Time Workers
B20005_029 to B20005_048 Male Other Workers
B20005_050 to B20005_072 Female Full Time Workers
B20005_076 to B20005_095 Female Other Workers


b20005_vars

Has the name (e.g. B20005_003E) and label (e.g. “Estimate!!Total!!Male!!Worked full-time, year-round in the past 12 months”) for all B20005 variables. Variable names ending with an E are estimates, and those ending with M are margins of error. - ruca contains RUCA (Rural-Urban Commuting Area) codes published by the U.S. Department of Agriculture Economic Research Service which classify U.S. census tracts using measures of population density. The following table shows the code ranges relevant to this app:


RUCA Code RUCA Level
1-3 Urban
4-6 Large Town
7-9 Small Town
10 Rural
99 Zero Population


codes

olds state FIPS (Federal Information Processing Standards) codes and RUCA levels - design_factors contains Design Factors for different characteristics (e.g. Person Earnings/Income) which are used to determine “the standard error of total and percentage sample estimates”, and “reflect the effects of the actual sample design and estimation procedures used for the ACS.” (2015-2019 PUMS 5-Year Accuracy of the Data).

In prep_db.R, I use the DBI package, censusapi and base R functions to perform the following protocol for each table:

Load the Data

  • For tables b20005 and b20005_vars, I use the censusapi::getCensus and censusapi::listCensusMetadata repsectively to get the data
# TABLE b20005_vars ------------------------------
b20005_vars <- listCensusMetadata(
  name = 'acs/acs5',
  vintage = 2015,
  type = 'variables',
  group = 'B20005')
  
 # TABLE b20005 ----------------------------------
 b20005 <- getCensus(
  name = 'acs/acs5',
  region = "tract:*",
  regionin = regionin_value,
  vintage = 2015,
  vars = b20005_vars$name,
  key="..."
  )
  • For tables codes, ruca, and design_factors I load the data from CSVs that I either obtained (in the case of the Design Factors) or created (in the case of the codes and RUCA levels)
 # TABLE codes ----------------------------------
state_codes <- read.csv(
  "data/state_codes.csv",
  colClasses = c(
    "character", 
    "character", 
    "character")
)

ruca_levels <- read.csv(
  "data/ruca_levels.csv",
  colClasses = c(
    "character",
    "character",
    "character")
)

Create Tables

Once the data is ready, I use DBI::dbExecute to run a SQLite command to create each table. The relationships shown in the image above dictate which fields create the primary key (in some cases, a compound primary key) as listed below:

Table Primary Key Notes
b20005 (state, county, tract)) Foreign key for table ruca
b20005_vars name e.g. B20005_001E
ruca TRACTFIPS Foreign key for table b20005
codes (CODE, DESCRIPTION) e.g. (1, "Urban")
design_factors (ST, CHARACTERISTIC) e.g. ("27", "Person Earnings/Income")

Write to Tables

Once the table has been created in the database, I write the data.frame to the corresponding table with the following call:

dbWriteTable(census_app_db, "<table name>", <data.frame>, append = TRUE

get_b20005_ruca_aggregate_earnings.R

The function inside this script (with the same name), receives inputs from the server, sends queries to the database and returns the results. This process involves two steps:

Get Variable Names

The person using the app selects Sex (M or F), Work Status (Full Time or Other) and State (50 states + D.C. + Puerto Rico) for which they want to view and analyze earnings data. As shown above, different variables in table b20005 correspond to different sexes and work statuses, and each tract for which there is all that earnings data resides in a given state.

I first query b20005_vars to get the relevent variables names which will be used in the query to b20005, as shown below. names that end with “M” (queried with the wilcard '%M') are for margins of error and those that end with “E” (wildcard '%E') are for estimates.

vars <- dbGetQuery(
    census_app_db, 
    "SELECT name FROM b20005_vars 
    WHERE label LIKE $label_wildcard 
    AND name LIKE '%M'",
    params=list(label_wildcard=label_wildcard))

The b20005_vars.label column holds long string labels (which follow a consistent pattern, which is captured by the $label_wildcard) that describe the variable’s contents. Here are a couple of examples:

b20005_vars.name b20005_vars.label
B20005_053E "Estimate!!Total!!Female!!Worked full-time, year-round in the past 12 months!!With earnings")
B20005_076M "Margin of Error!!Total!!Female!!Other!!With earnings!!$1 to $2,499 or loss"


Since the label string contains the sex and work status, I assign a label_wildcard based on the person inputs from the sex and work status UI dropdowns.

# Prepare wildcard for query parameter `label_wildcard`
  if (sex == 'M') {
    if (work_status == 'FT') { label_wildcard <- "%!!Male!!Worked%" }
    if (work_status == 'OTHER') { label_wildcard <- "%!!Male!!Other%" }
  }
  
  if (sex == 'F') {
    if (work_status == 'FT') { label_wildcard <- "%!!Female!!Worked%" }
    if (work_status == 'OTHER') { label_wildcard <- "%!!Female!!Other%" }
  }

Derive RUCA Level Estimates and MOE

Once the variables are returned, the actual values are queried from b20005, grouped by RUCA level. The ACS handbook Understanding and Using American Community Survey Data: What All Data Users Need to Know shows how to calculate that margin of error for derived estimates. In our case, the margin of error for a RUCA level such as “Urban” for a given state is derived from the margin of error of individual Census Tracts using the formula below:

The MOE for a sum of estimates is the square root of the sum of MOEs squared

Translating this to a SQLite query:

# Construct query string to square root of the sum of margins of error squared grouped by ruca level
query_string <- paste0(
    "SQRT(SUM(POWER(b20005.", vars$name, ", 2))) AS ", vars$name, collapse=",")

Where vars$name is a list of variable names, and the collapse parameter converts a list or vector to a string. The beginning of that query_string looks like:

"SQRT(SUM(POWER(b20005.B20005_001M, 2))) AS B20005_001M, SQRT(..."

The query is further built by adding the rest of the SQL statements:

query_string <- paste(
    "SELECT ruca.DESCRIPTION,",
    query_string,
    "FROM 'b20005' 
    INNER JOIN ruca 
    ON b20005.state || b20005.county || b20005.tract = ruca.TRACTFIPS
    WHERE 
    b20005.state = $state
    GROUP BY ruca.DESCRIPTION"
  )

The ruca.DESCRIPTION column, which contains RUCA levels (e.g. "Urban") is joined onto b20005 from the ruca table using the foreign keys representing the Census Tract FIPS code (TRACTFIPS for the ruca table and the concatenated field state || county || tract for b20005). The $state parameter is assigned the person-selected state input, and the columns are aggreaggated by RUCA levels (i.e. GROUP BY ruca.DESCRIPTION). Finally, the RUCA level and square root of the sum of MOEs squared are SELECTed from the joined tables.

The query for estimates is simpler than MOEs, because estimates only need to be summed over RUCA levels:

# Construct a query to sum estimates grouped by ruca level
  query_string <- paste0("SUM(b20005.",vars$name, ") AS ", vars$name, collapse=",")

get_b20005_ruca_aggregate_earnings returns the query result data.frames in a named list:

return(list("estimate" = estimate_rs, "moe" = moe_rs))

calculate_median.R

The procedure for calculating a median earnings data estimate is shown starting on page 17 of the Accuracy of PUMS documentation. This script follows it closely:

Create Frequency Distribution

  1. Obtain the weighted frequency distribution for the selected variable. data is a data.frame with earning estimate values. The rows are the earning ranges and the columns are ruca_levels:


cum_percent <- 100.0 * cumsum(data[ruca_level]) / sum(data[ruca_level])

Calculate Weighted Total

  1. Calculate the weighted total to yield the base, B.


B <- colSums(data[ruca_level])

Approximate Standard Error

  1. Approximate the standard error of a 50 percent proportion using the formula in Standard Errors for Totals and Percentages. The design_factor is passed to this function by the server who uses the get_design_factor function explained below to query the design_factors table.


se_50_percent <- design_factor * sqrt(87.5/(12.5*B) * 50^2)

Calculate Median Estimate Bounds

  1. Create the variable p_lower by subtracting the SE from 50 percent. Create p_upper by adding the SE to 50 percent.


p_lower <- 50 - se_50_percent
p_upper <- 50 + se_50_percent
  1. Determine the categories in the distribution that contain p_lower and p_upper


# Determine the indexes of the cumulative percent data.frame corresponding  
# to the upper and lower bounds of the 50% proportion estimate
cum_percent_idx_lower <- min(which(cum_percent > p_lower))
cum_percent_idx_upper <- min(which(cum_percent > p_upper))

.._If p_lower and p_upper fall in the same category, follow step 6. If p_lower and p_upper fall in different categories, go to step 7…_


# The median estimation calculation is handled differently based on 
# whether the upper and lower bound indexes are equal
    if (cum_percent_idx_lower == cum_percent_idx_upper) {
  1. If p_lower and p_upper fall in the same category, do the following:
  • Define A1 as the smallest value in that category.


# A1 is the minimum earnings value (e.g. 30000) of the earning range 
# (e.g. 30000 to 34999) corresponding to the lower bound cumulative percent
A1 <- earnings[cum_percent_idx_lower, "min_earnings"]
  • Define A2 as the smallest value in the next (higher) category.


# A2 is the minimum earnings value of the earning range above the 
# earning range corresponding to the upper bound cumulative percent
A2 <- earnings[cum_percent_idx_lower + 1, "min_earnings"]
  • Define C1 as the cumulative percent of units strictly less than A1.


# C1 is the cumulative percentage of earnings one row below the 
# lower bound cumulative percent
C1 <- cum_percent[cum_percent_idx_lower - 1, ]
  • Define C2 as the cumulative percent of units strictly less than A2.


# C2 is the cumulative percentage of the earnings below the 
# lower bound cumulative percent
C2 <- cum_percent[cum_percent_idx_lower, ]
  • Use the following formulas to approximate the lower and upper bounds for a confidence interval about the median:


# the lower bound of the median 
lower_bound <- (p_lower - C1) / (C2 - C1) * (A2 - A1) + A1
      
# the upper bound of the median
upper_bound <- (p_upper - C1) / (C2 - C1) * (A2 - A1) + A1
  1. If p_lower and p_upper fall in different categories, do the following:
  • For the category containing p_lower: Define A1, A2, C1, and C2 as described in step 6. Use these values and the formula in step 6 to obtain the lower bound.


# A1, A2, C1 and C2 are calculated using the lower bound cumulative percent
# to calculate the lower bound of the median estimate
A1 <- earnings[cum_percent_idx_lower, "min_earnings"]
A2 <- earnings[cum_percent_idx_lower + 1, "min_earnings"]
C1 <- cum_percent[cum_percent_idx_lower - 1, ]
C2 <- cum_percent[cum_percent_idx_lower, ]
lower_bound <- (p_lower - C1) / (C2 - C1) * (A2 - A1) + A1
  • For the category containing p_upper: Define new values for A1, A2, C1, and C2 as described in step 6. Use these values and the formula in step 6 to obtain the upper bound.


# A1, A2, C1 and C2 are calculated using the upper bound cumulative percent
# to calculate the upper bound of the median estimate
A1 <- earnings[cum_percent_idx_upper, "min_earnings"]
A2 <- earnings[cum_percent_idx_upper + 1, "min_earnings"]
C1 <- cum_percent[cum_percent_idx_upper - 1,]
C2 <- cum_percent[cum_percent_idx_upper,]
upper_bound <- (p_upper - C1) / (C2 - C1) * (A2 - A1) + A1
  1. Use the lower and upper bounds approximated in steps 6 or 7 to approximate the standard error of the median. SE(median) = 1/2 X (Upper Bound – Lower Bound)


# The median earning estimate is the average of the upper and lower bounds
# of the median estimates calculated above in the if-else block
median_earnings <- 0.5 * (lower_bound + upper_bound)
    
# The median SE is half the distance between the upper and lower bounds
# of the median estimate
median_se <- 0.5 * (upper_bound - lower_bound)

# The 90% confidence interval critical z-score is used to calculate 
# the margin of error
median_90_moe <- 1.645 * median_se

Reshape the Data

Finally, a data.frame is returned, which will be displayed in a tableOutput element.


# A data.frame will be displayed in the UI
median_data <- data.frame(
  "Estimate" = median_earnings,
  "SE" = median_se,
  "MOE" = median_90_moe
)

format_query_result.R

The purpose of this function is to receive two data.frame objects, one for earnings estimate values, and one for the corresponding moe values, and return a single data.frame which is ready to be displayed in a tableOutput.

Extract data.frame Objects from List

Since get_b20005_ruca_aggregate_earnings returns a named list, I first pull out the estimate and moe data.frame objects:


# Pull out query result data.frames from the list
estimate <- rs[["estimate"]]
moe <- rs[["moe"]]

Reshape data.frame Objects

These data.frame objects have RUCA levels in the column DESCRIPTION and one column for each population estimate. For example, the estimate for Alabama Full Time Female workers looks like this:

DESCRIPTION B20005_053E B20005_054E B20005_055E
1 Large Town 149 257 546
2 Rural 75 66 351
3 Small Town 28 162 634
4 Urban 468 1061 4732
5 Zero Population 0 0 0

The moe data.frame has a similar layout.

However, in the UI, I want the table to look like this:

Population estimates for earnings levels from $1 to $2499 up to $100000 and more for Alabama Full Time Female Workers

To achieve this, I first transpose the estimate and moe data.frames…


# Transpose the query results
col_names <- estimate[,"DESCRIPTION"]
estimate <- t(estimate[-1])
colnames(estimate) <- col_names
  
col_names <- moe[,"DESCRIPTION"]
moe <- t(moe[-1])
colnames(moe) <- col_names

…then zip them together, keeping in mind that not all states have tracts designated with all RUCA levels:


# Create a mapping to make column names more computer-readable
format_ruca_level <- c(
  "Urban" = "Urban", 
  "Large Town" = "Large_Town", 
  "Small Town" = "Small_Town", 
  "Rural" = "Rural",
  "Zero Population" = "Zero_Population")

# bind together estimate and corresponding moe columns
# some states do not have all RUCA levels
# for example, Connecticut does not have "Small Town" tracts

# Create empty objects
output_table <- data.frame(temp = matrix(NA, nrow = nrow(estimate), ncol = 0))
col_names <- c()

for (ruca_level in c("Urban", "Large Town", "Small Town", "Rural")) {
  if (ruca_level %in% colnames(estimate)) {
    output_table <- cbind(output_table, estimate[,ruca_level], moe[,ruca_level])
    
    # paste "_MOE" suffix for MOE columns
    col_names <- c(
      col_names,
      format_ruca_level[[ruca_level]],
      paste0(format_ruca_level[[ruca_level]], "_MOE"))
  }
}

# Replace old names with more computer-readable names
colnames(output_table) <- col_names


Add Descriptive Labels

Finally, merge the output_table data.frame with labels (long form description of the B20005 variables) which are retrieved from the database using the get_b20005_labels function explained later on in this post. Remember that the label is delimited with "!!" and the last substring contains earnings ranges (e.g. “$30,000 to $34,999”):


# name rows as long-form labels, by splitting them by '!!' and 
# grabbing the last chunk which has dollar ranges e.g. 
# $30000 to $34999
output_table <- merge(output_table, labels, by.x = 0, by.y = "name")
split_label <- data.frame(
  do.call(
    'rbind', 
    strsplit(as.character(output_table$label),'!!',fixed=TRUE)))

rownames(output_table) <- split_label$X6



get_b20005_labels.R

This script contains two helper functions to retrieve the label column from the b20005_vars table.

Get Earnings Population Estimate Labels

The first one, get_b20005_labels retrieves the variable name and label for earning range strings (e.g. “$30,000 to $34,999”):


get_b20005_labels <- function() {
  census_app_db <- dbConnect(RSQLite::SQLite(), "census_app_db.sqlite")
  rs <- dbGetQuery(
    census_app_db, 
    "SELECT 
      name, label
    FROM 'b20005_vars' 
    WHERE 
      label LIKE '%$%'
    ORDER BY name"
    )
  dbDisconnect(census_app_db)
  return(rs)
}


Get All Labels

The second function, get_b20005_ALL_labels returns the whole table:


get_b20005_ALL_labels <- function() {
  census_app_db <- dbConnect(RSQLite::SQLite(), "census_app_db.sqlite")
  rs <- dbGetQuery(
    census_app_db, 
    "SELECT 
      name, label
    FROM 'b20005_vars' 
    ORDER BY name"
  )
  dbDisconnect(census_app_db)
  return(rs)
}



get_b20005_tract_earnings.R

This function is similar to get_b20005_ruca_aggregate_earnings but does not aggregate by RUCA level, and also includes Census Tracts that are not designated a RUCA level. The label_wildcard is constructed the same way as before.

Get Variable Names

The variable names are obtained for both margin of error and estimates in the same query:


 # Get b20005 variable names (estimates and moe)
vars <- dbGetQuery(
  census_app_db, 
  "SELECT name FROM b20005_vars 
  WHERE label LIKE $label_wildcard",
  params=list(label_wildcard=label_wildcard)
  )


Join Tables

The tract-level earnings are queried with the following, using a LEFT JOIN between b20005 and ruca tables to include tracts that do not have a RUCA level.


# Construct query to get tract-level earnings data
query_string <- paste(
  "SELECT ruca.DESCRIPTION,
  b20005.state || b20005.county || b20005.tract AS TRACTFIPS,",
  paste0(vars$name, collapse=","),
  "FROM b20005 
  LEFT JOIN ruca 
  ON b20005.state || b20005.county || b20005.tract = ruca.TRACTFIPS
  WHERE 
  b20005.state LIKE $state")



get_b20005_states.R

This function retrieves state codes and names from the codes table, and is used to assign choices to selectInput dropdowns. "United States" which has a FIPS code of "00" is excluded because the b20005 table contains state-level data only. The query result is sorted by the state name so that the dropdown menu choices are in ascending alphabetical order.

states <- dbGetQuery(
  census_app_db, 
  "SELECT DESCRIPTION, CODE
  FROM codes 
  WHERE CATEGORY = 'state'
  AND CODE <> '00'
  ORDER BY DESCRIPTION")



get_design_factor.R

This function retrieves a single numeric Design Factor for the “Person Earnings/Income” characteristic from the design_factors table for a given state parameter:


rs <- dbGetQuery(
  census_app_db, 
  "SELECT DESIGN_FACTOR FROM design_factors
  WHERE ST = $state
  AND CHARACTERISTIC = 'Person Earnings/Income'",
  params = list(state=state))

rs <- as.numeric(rs[1, "DESIGN_FACTOR"])



make_plot.R

This is function creates a ggplot.bar_plot object using a given data, RUCA level, and title. The x-axis labels are rotated, both axis labels are resized, and plot title and subtitle are formatted.


make_plot <- function(data, ruca_level, plot_title){
  # Prepare x-axis factor for `aes` parameter
  xs <- rownames(data)
  xs <- factor(xs, xs)

  bar_plot <- ggplot(
    data=data,
    aes(x=xs, y=get(ruca_level))) + 
    geom_bar(stat='identity') + 

    theme(
      # Rotate x-axis labels
      axis.text.x=element_text(
        angle = -90, 
        vjust = 0.5, 
        hjust=1, 
        size=12),

      # Resize x-axis labels and move them away from axis
      axis.title.x=element_text(vjust=-0.75,size=14),

      # Resize y-axis labels
      axis.text.y=element_text(size=12),
      axis.title.y=element_text(size=14),

      # Set plot title and subtitle font and placement
      plot.title = element_text(size = 18, hjust=0.5, face='bold'),
      plot.subtitle = element_text(size = 12, hjust=0.5)) +

    labs(x="Earnings", y="Population Estimate") + 
    ggtitle(plot_title, subtitle="Population Estimate by Earnings Level")

  return (bar_plot)
}