R Shiny Census App
In this blog post, I’ll walk through my development process for a U.S. Census data visualization web app I created using the Shiny package in R.
You can access the app at vbakshi.shinyapps.io/census-app.
Table of Contents
Backstory
I started this project by reading the handbook Understanding and Using American Community Survey Data: What State and Local Government Users Need to Know published by the U.S. Census Bureau. I recreated the handbook’s first case study in R, in which they make comparisons across geographic areas, create custom geographic areas from census tracts, and calculate margins of error for derived estimates, all using Minnesota Census Tract 5-year earnings estimates.
During the process of recreating the derived median earnings estimate calculations, I was unable to recreate a key value from the handbook (the Standard Error for the 50% proportion, calculated to be 0.599) because I was unable to deduce the values used in the following formula referenced from page 17 of the PUMS Accuracy of the Data documentation:
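Written out in plain notation (my transcription, consistent with the terms defined next and the `se_50_percent` calculation later in this post), the formula for the standard error of a percentage p is:

```
SE(p) = DF * sqrt( (95 / (5 * B)) * p * (100 - p) )
```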
The documentation defines B as the base, which is the calculated weighted total. I chose the value of 1.3 for the design factor DF since it corresponds to STATE = Minnesota, CHARTYP = Population, CHARACTERISTIC = Person Earnings/Income in the Design Factors CSV published by the Census Bureau.
I called the Census Bureau Customer Help Center for assistance and was transferred to a member of the ACS Data User Support team with whom I discussed my woes. He was unable to confirm the values of the design factor DF or B, and was unable to pull up the contact information for the statistical methodology team, so I emailed him my questions. After a few email exchanges, the statistical methodology team provided the following:
- DF = 1.3
- B = the total population estimate for which the median is being calculated, which is 82488 for the case study calculation (Minnesota Rural Male Full Time Workers)
- The term 95/5 is associated with the finite population correction factor (100 - f) divided by the sample fraction (f), where f = 5% (later on I note in the documentation that this 95/5 term is based on a 68% confidence interval). The data used in the handbook case study is from 5-year estimates. 1-year estimates sample 2.5% of the population, so the 5-year estimates represent a 5 * 2.5 = 12.5% sample. Instead of 95/5, the ratio becomes (100 - 12.5)/12.5 = 87.5/12.5
The updated formula is then:
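In plain notation (my transcription, consistent with the `se_50_percent` calculation shown later in this post):

```
SE(p) = DF * sqrt( (87.5 / (12.5 * B)) * p * (100 - p) )
```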
I was able to calculate the median earnings estimate (and associated standard error and margin of error) within a few percent of the values given in the handbook. This provided me with confirmation that I was ready to expand my code to calculate median earnings estimates for other subgroups.
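As a quick sanity check, plugging in the values above (DF = 1.3, B = 82488, p = 50) reproduces the handbook's 0.599:

```r
# Standard error of the 50% proportion with the corrected 87.5/12.5 ratio
DF <- 1.3    # design factor: Minnesota, Person Earnings/Income
B  <- 82488  # base: Minnesota Rural Male Full Time Workers

se_50 <- DF * sqrt(87.5 / (12.5 * B) * 50^2)
round(se_50, 3)  # 0.599
```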
Codebase
I built this app using the R package Shiny
which handles both the UI and the server. I store the data in a SQLite
database and access it with queries written using the RSQLite
package which uses the DBI API. The following sections break down the R scripts based on functionality. Click on the script name to navigate to that section.
app.R
- UI and server functions to handle person inputs and plot/table/text outputs
prep_db.R
- Import, clean, combine and then load data into the census_app_db.sqlite database
get_b20005_ruca_aggregate_earnings.R
- Queries the database for earnings and associated margins of error for RUCA levels derived from Census Tracts
calculate_median.R
- Derives estimate, standard error and margin of error of median earnings for RUCA levels
format_query_result.R
- Formats calculate_median query results
get_b20005_labels.R
- Queries the database for descriptive labels of B20005 table variables
get_b20005_tract_earnings.R
- Queries the database for Census Tract-level earnings and associated margins of error
get_b20005_states.R
- Queries the SQLite database for a list of U.S. states
get_design_factor.R
- Queries database for the design factor used for the median earnings estimation calculation
make_plot.R
- Creates a bar plot object
app.R
A Shiny app has three fundamental components:
ui <- (...)
server <- (...)
shinyApp(ui, server, ...)
The ui
object holds all UI layout, input and output objects which define the front-end of your app. The server
object holds all rendering functions which are assigned to outputs that appear on the UI. The shinyApp
function takes a ui
and server
object (along with other arguments) and creates a shiny app object which can be run in a browser by passing it to the runApp
function. Person inputs (such as selections in a dropdown) are assigned to a global input
object.
What’s in my ui
?
All of my UI objects are wrapped within a fluidPage
call which returns a page layout which “consists of rows which in turn include columns” (from the docs).
My app’s UI has four sections:
- Dropdowns to select state, sex and work status for which the person using the app wants ACS 5-year earnings estimates
- A table with the estimate, standard error and margin of error for median earnings
- A bar plot of population estimates for earnings levels for the selected state, sex, work status and RUCA (Rural-Urban Commuting Areas) level
- A table with population estimates for earnings levels for each RUCA level for the selected state, sex and work status
Each section has a download button so that people can get the CSV files or plot image for their own analysis and reporting. Each section is separated with markdown('---')
which renders an HTML horizontal rule (<hr>
).
Dropdowns
Dropdowns (the HTML <select>
element) are a type of UI Input. I define each with an inputId
which is a character
object for reference on the server-side, a label character
object which is rendered above the dropdown, and a list
object which defines the dropdown options.
selectInput(
inputId = "...",
label = "...",
choices = list(...)
)
In some cases, I want the person to see a character
object in the dropdown that is more human-readable (e.g. "Large Town"
) but use a corresponding input value in the server which is more computer-readable (e.g. "Large_Town"
). To achieve this, I use a named character
vector where the names are displayed in the dropdown, and the corresponding values are assigned to the global input
:
selectInput(
inputId = "ruca_level",
label = "Select RUCA Level",
choices = list(
"RUCA LEVEL" = c(
"Urban" = "Urban",
"Large Town" = "Large_Town",
"Small Town" = "Small_Town",
"Rural" = "Rural"))
)
In this case, if the person selects "Large Town"
the value assigned to input$ruca_level
is "Large_Town"
.
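The same name-to-value behavior can be seen with a plain named character vector outside of Shiny (a minimal sketch):

```r
# Names are what the person sees; values are what the server receives
ruca_choices <- c(
  "Urban" = "Urban",
  "Large Town" = "Large_Town",
  "Small Town" = "Small_Town",
  "Rural" = "Rural")

# Indexing by the displayed name returns the computer-readable value
ruca_choices[["Large Town"]]  # "Large_Town"
```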
Tables
Tables (the HTML <table>
element) are a type of UI Output. I define each with an outputId
for reference in the server.
tableOutput(outputId = "...")
Plot
Similarly, a plot (which is rendered as an HTML <img>
element) is a type of UI Output. I define each with an outputId
.
plotOutput(outputId = "...")
What does my server
do?
The server function has three parameters: input
, output
and session
. The input
object is a ReactiveValues
object which stores all UI Input values, which are accessed with input$inputId
. The output
object similarly holds UI Output values at output$outputId
. I do not use the session
object in my app (yet).
My app’s server has four sections:
- Get data from the SQLite database
- Render table and plot outputs
- Prepare dynamic text (for filenames and the plot title)
- Handle data.frame and plot downloads
Get Data
There are three high-level functions which call query/format/calculation functions to return the data in the format necessary to produce table, text, download and plot outputs:
- The earnings_data function passes the person-selected dropdown options input$sex, input$work_status and input$state to the get_b20005_ruca_aggregate_earnings function to get a query result from the SQLite database. That result is passed to format_earnings, which in turn is passed to the reactive function to make it a reactive expression. Only reactive expressions (and reactive endpoints in the output object) are allowed to access the input object, which is a reactive source. You can read more about Shiny’s “reactive programming model” in this excellent article.

earnings_data <- reactive(
  format_earnings(
    get_b20005_ruca_aggregate_earnings(
      input$sex,
      input$work_status,
      input$state)))
- The design_factor function passes the input$state selection to the get_design_factor function, which in turn is passed to the reactive function.

design_factor <- reactive(get_design_factor(input$state))
- The median_data function passes the return values from earnings_data() and design_factor() to the calculate_median function, which in turn is passed to the reactive function.

median_data <- reactive(calculate_median(earnings_data(), design_factor()))
Render Outputs
I have two reactive endpoints for table outputs, and one endpoint for a plot. The table outputs use renderTable
(with row names displayed) with the data.frame
coming from median_data()
and earnings_data()
. The plot output uses renderPlot
, and a helper function make_plot
to create a bar plot of earnings_data()
for a person-selected input$ruca_level
with a title created with the helper function earnings_plot_title()
.
output$median_data <- renderTable(
  expr = median_data(),
  rownames = TRUE)

output$earnings_data <- renderTable(
  expr = earnings_data(),
  rownames = TRUE)

output$earnings_histogram <- renderPlot(
  expr = make_plot(
    data = earnings_data(),
    ruca_level = input$ruca_level,
    plot_title = earnings_plot_title()))
Prepare Dynamic Text
I created four functions that generate filenames for the downloadHandler
call when the corresponding downloadButton
gets clicked, one function that generates the bar plot title, and one function which takes a computer-readable character object (e.g. "Large_Town") and returns a more human-readable character object (e.g. "Large Town"). I chose to keep filenames more computer-readable (to avoid spaces) and the plot title more human-readable.
get_pretty_text <- function(raw_text){
  text_map <- c("M" = "Male",
                "F" = "Female",
                "FT" = "Full Time",
                "OTHER" = "Other",
                "Urban" = "Urban",
                "Large_Town" = "Large Town",
                "Small_Town" = "Small Town",
                "Rural" = "Rural")
  return(text_map[raw_text])
}
earnings_plot_title <- function(){
  return(paste(
    input$state,
    get_pretty_text(input$sex),
    get_pretty_text(input$work_status),
    input$ruca_level,
    "Workers",
    sep=" "))
}

b20005_filename <- function(){
  return(paste(
    input$state,
    get_pretty_text(input$sex),
    input$work_status,
    "earnings.csv",
    sep="_"))
}

median_summary_filename <- function() {
  paste(
    input$state,
    get_pretty_text(input$sex),
    input$work_status,
    'estimated_median_earnings_summary.csv',
    sep="_")
}

ruca_earnings_filename <- function() {
  paste(
    input$state,
    get_pretty_text(input$sex),
    input$work_status,
    'estimated_median_earnings_by_ruca_level.csv',
    sep="_")
}

earnings_plot_filename <- function(){
  return(paste(
    input$state,
    get_pretty_text(input$sex),
    input$work_status,
    input$ruca_level,
    "Workers.png",
    sep="_"))
}
Handle Downloads
I have five download buttons in my app: two which trigger a download of a zip file with two CSVs, two that download a single CSV, and one that downloads a single PNG. The downloadHandler
function takes a filename
and a content
function to write data to a file.
In order to create a zip file, I use the zip function from the utils package and pass it a vector with two filepaths (to which data is written using the write.csv function) and a filename. I also specify the contentType
as "application/zip"
. In the zip file, one of the CSVs contains a query result from the b20005
SQLite database table with earnings data, and the other file, "b20005_variables.csv"
contains B20005 table variable names and descriptions. In order to avoid the files being written locally before download, I create a temporary directory with tempdir
and prepend it to the filename to create the filepath.
For the bar plot image download, I use the ggplot2
package’s ggsave
function, which takes a filename, a plot object (returned from the make_plot
helper function) and the character
object "png"
(for the device
parameter).
output$download_selected_b20005_data <- downloadHandler(
  filename = "b20005_data.zip",
  content = function(fname) {
    # Create a temporary directory to prevent local storage of new files
    temp_dir <- tempdir()

    # Create two filepath character objects and store them in a vector
    # which will later on be passed to the `zip` function
    path1 <- paste(temp_dir, '/', b20005_filename(), sep="")
    path2 <- paste(temp_dir, "/b20005_variables.csv", sep="")
    fs <- c(path1, path2)

    # Create a CSV with person-selected input values and do not add a column
    # with row names
    write.csv(
      get_b20005_earnings(input$state, input$sex, input$work_status),
      path1, row.names = FALSE)

    # Create a CSV for table B20005 variable names and labels for reference
    write.csv(
      get_b20005_ALL_labels(),
      path2, row.names = FALSE)

    # Zip together the files and add flags to maximize compression
    zip(zipfile = fname, files = fs, flags = "-r9Xj")
  },
  contentType = "application/zip"
)
output$download_all_b20005_data <- downloadHandler(
  filename = "ALL_B20005_data.zip",
  content = function(fname){
    path1 <- "ALL_B20005_data.csv"
    path2 <- "b20005_variables.csv"
    fs <- c(path1, path2)

    write.csv(
      get_b20005_earnings('ALL', 'ALL', 'ALL'),
      path1, row.names = FALSE)

    write.csv(
      get_b20005_ALL_labels(),
      path2, row.names = FALSE)

    zip(zipfile = fname, files = fs, flags = "-r9Xj")
  },
  contentType = "application/zip"
)
output$download_median_summary <- downloadHandler(
  filename = median_summary_filename(),
  content = function(file) {
    write.csv(median_data(), file)
  }
)
output$download_earnings_plot <- downloadHandler(
  filename = earnings_plot_filename(),
  content = function(file) {
    ggsave(
      file,
      plot = make_plot(
        data = earnings_data(),
        ruca_level = input$ruca_level,
        plot_title = earnings_plot_title()),
      device = "png")
  }
)

output$download_ruca_earnings <- downloadHandler(
  filename = ruca_earnings_filename(),
  content = function(file) {
    write.csv(earnings_data(), file)
  }
)
prep_db.R
This script is meant to be run locally and is not deployed, since running it at startup would create a long delay when loading the app.
Database Tables
The database diagram is shown below (created using dbdiagram.io):
I have five tables in my database:
b20005
Holds the data from the ACS 2015-2019 5-year detailed table B20005 (Sex By Work Experience In The Past 12 Months By Earnings In The Past 12 Months). This includes earnings estimates and margins of errors for Male and Female, Full Time and Other workers, for earning ranges (No earnings, $1 - $2499, $2500 - $4999, …, $100000 or more). The following table summarizes the groupings of the (non-zero earnings) variables relevant to this app:
Variable | Demographic |
---|---|
B20005_003 to B20005_025 | Male Full Time Workers |
B20005_029 to B20005_048 | Male Other Workers |
B20005_050 to B20005_072 | Female Full Time Workers |
B20005_076 to B20005_095 | Female Other Workers |
b20005_vars
Has the name (e.g. B20005_003E) and label (e.g. “Estimate!!Total!!Male!!Worked full-time, year-round in the past 12 months”) for all B20005 variables. Variable names ending with an E are estimates, and those ending with M are margins of error.
ruca
Contains RUCA (Rural-Urban Commuting Area) codes published by the U.S. Department of Agriculture Economic Research Service which classify U.S. census tracts using measures of population density. The following table shows the code ranges relevant to this app:

RUCA Code | RUCA Level |
---|---|
1-3 | Urban |
4-6 | Large Town |
7-9 | Small Town |
10 | Rural |
99 | Zero Population |

codes
Holds state FIPS (Federal Information Processing Standards) codes and RUCA levels.
design_factors
Contains Design Factors for different characteristics (e.g. Person Earnings/Income) which are used to determine “the standard error of total and percentage sample estimates”, and “reflect the effects of the actual sample design and estimation procedures used for the ACS.” (2015-2019 PUMS 5-Year Accuracy of the Data).
In prep_db.R
, I use the DBI
package, censusapi
and base
R functions to perform the following protocol for each table:
Load the Data
- For tables b20005 and b20005_vars, I use censusapi::getCensus and censusapi::listCensusMetadata respectively to get the data
# TABLE b20005_vars ------------------------------
b20005_vars <- listCensusMetadata(
  name = 'acs/acs5',
  vintage = 2015,
  type = 'variables',
  group = 'B20005')

# TABLE b20005 ----------------------------------
b20005 <- getCensus(
  name = 'acs/acs5',
  region = "tract:*",
  regionin = regionin_value,
  vintage = 2015,
  vars = b20005_vars$name,
  key = "..."
)
- For tables codes, ruca, and design_factors I load the data from CSVs that I either obtained (in the case of the Design Factors) or created (in the case of the codes and RUCA levels)
# TABLE codes ----------------------------------
state_codes <- read.csv(
  "data/state_codes.csv",
  colClasses = c(
    "character",
    "character",
    "character")
)

ruca_levels <- read.csv(
  "data/ruca_levels.csv",
  colClasses = c(
    "character",
    "character",
    "character")
)
Create Tables
Once the data is ready, I use DBI::dbExecute
to run a SQLite command to create each table. The relationships shown in the image above dictate which fields create the primary key (in some cases, a compound primary key) as listed below:
Table | Primary Key | Notes |
---|---|---|
b20005 | (state, county, tract) | Foreign key for table ruca |
b20005_vars | name | e.g. B20005_001E |
ruca | TRACTFIPS | Foreign key for table b20005 |
codes | (CODE, DESCRIPTION) | e.g. (1, "Urban") |
design_factors | (ST, CHARACTERISTIC) | e.g. ("27", "Person Earnings/Income") |
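For example, the b20005 CREATE statement could look something like the following (a sketch with an abbreviated column list; the real table has one column per B20005 variable):

```sql
-- Compound primary key on (state, county, tract)
CREATE TABLE b20005 (
  state  TEXT,
  county TEXT,
  tract  TEXT,
  B20005_001E REAL,
  B20005_001M REAL,
  PRIMARY KEY (state, county, tract)
);
```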
Write to Tables
Once the table has been created in the database, I write the data.frame
to the corresponding table with the following call:
dbWriteTable(census_app_db, "<table name>", <data.frame>, append = TRUE)
get_b20005_ruca_aggregate_earnings.R
The function inside this script (with the same name) receives inputs from the server, sends queries to the database and returns the results. This process involves two steps:
Get Variable Names
The person using the app selects Sex (M or F), Work Status (Full Time or Other) and State (50 states + D.C. + Puerto Rico) for which they want to view and analyze earnings data. As shown above, different variables in table b20005
correspond to different sexes and work statuses, and each tract holding that earnings data resides in a given state.
I first query b20005_vars
to get the relevant variable names which will be used in the query to b20005, as shown below. Names that end with “M” (queried with the wildcard '%M') are for margins of error and those that end with “E” (wildcard '%E') are for estimates.
vars <- dbGetQuery(
  census_app_db,
  "SELECT name FROM b20005_vars
  WHERE label LIKE $label_wildcard
  AND name LIKE '%M'",
  params=list(label_wildcard=label_wildcard))
The b20005_vars.label
column holds long string labels (which follow a consistent pattern, captured by the $label_wildcard
) that describe the variable’s contents. Here are a couple of examples:
b20005_vars.name | b20005_vars.label |
---|---|
B20005_053E | "Estimate!!Total!!Female!!Worked full-time, year-round in the past 12 months!!With earnings" |
B20005_076M | "Margin of Error!!Total!!Female!!Other!!With earnings!!$1 to $2,499 or loss" |
Since the label
string contains the sex and work status, I assign a label_wildcard
based on the person inputs from the sex and work status UI dropdowns.
# Prepare wildcard for query parameter `label_wildcard`
if (sex == 'M') {
if (work_status == 'FT') { label_wildcard <- "%!!Male!!Worked%" }
if (work_status == 'OTHER') { label_wildcard <- "%!!Male!!Other%" }
}
if (sex == 'F') {
if (work_status == 'FT') { label_wildcard <- "%!!Female!!Worked%" }
if (work_status == 'OTHER') { label_wildcard <- "%!!Female!!Other%" }
}
Derive RUCA Level Estimates and MOE
Once the variables are returned, the actual values are queried from b20005
, grouped by RUCA level. The ACS handbook Understanding and Using American Community Survey Data: What All Data Users Need to Know shows how to calculate margins of error for derived estimates. In our case, the margin of error for a RUCA level such as “Urban” for a given state is derived from the margins of error of individual Census Tracts using the formula below:
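Written out, the handbook's aggregation rule says the MOE of a sum of estimates is the square root of the sum of the squared tract-level MOEs:

```
MOE_agg = sqrt( MOE_1^2 + MOE_2^2 + ... + MOE_n^2 )
```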
Translating this to a SQLite query:
# Construct query string: square root of the sum of squared margins of error,
# grouped by ruca level
query_string <- paste0(
  "SQRT(SUM(POWER(b20005.", vars$name, ", 2))) AS ", vars$name, collapse=",")
Where vars$name
is a vector of variable names, and the collapse parameter joins the pieces into a single string. The beginning of that query_string
looks like:
"SQRT(SUM(POWER(b20005.B20005_001M, 2))) AS B20005_001M, SQRT(..."
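The collapse behavior can be checked on a small made-up subset of variable names:

```r
# Two illustrative MOE variable names (the real vector holds one per column)
vars_name <- c("B20005_001M", "B20005_002M")

query_string <- paste0(
  "SQRT(SUM(POWER(b20005.", vars_name, ", 2))) AS ", vars_name, collapse=",")

# query_string is now a single comma-separated string of two SELECT terms:
# "SQRT(SUM(POWER(b20005.B20005_001M, 2))) AS B20005_001M,SQRT(SUM(POWER(b20005.B20005_002M, 2))) AS B20005_002M"
```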
The query is further built by adding the rest of the SQL statements:
query_string <- paste(
  "SELECT ruca.DESCRIPTION,",
  query_string,
  "FROM 'b20005'
  INNER JOIN ruca
  ON b20005.state || b20005.county || b20005.tract = ruca.TRACTFIPS
  WHERE
  b20005.state = $state
  GROUP BY ruca.DESCRIPTION"
)
The ruca.DESCRIPTION
column, which contains RUCA levels (e.g. "Urban"
) is joined onto b20005
from the ruca
table using the foreign keys representing the Census Tract FIPS code (TRACTFIPS
for the ruca
table and the concatenated field state || county || tract
for b20005
). The $state
parameter is assigned the person-selected state
input, and the columns are aggregated by RUCA level (i.e. GROUP BY ruca.DESCRIPTION
). Finally, the RUCA level and square root of the sum of MOEs squared are SELECT
ed from the joined tables.
The query for estimates is simpler than the one for MOEs, because estimates only need to be summed over RUCA levels:
# Construct a query to sum estimates grouped by ruca level
query_string <- paste0("SUM(b20005.", vars$name, ") AS ", vars$name, collapse=",")
get_b20005_ruca_aggregate_earnings returns the query result data.frames in a named list:
return(list("estimate" = estimate_rs, "moe" = moe_rs))
calculate_median.R
The procedure for calculating a median earnings data estimate is shown starting on page 17 of the Accuracy of PUMS documentation. This script follows it closely:
Create Frequency Distribution
- Obtain the weighted frequency distribution for the selected variable.
data is a data.frame with earning estimate values. The rows are the earning ranges and the columns are ruca_levels:

cum_percent <- 100.0 * cumsum(data[ruca_level]) / sum(data[ruca_level])
Calculate Weighted Total
- Calculate the weighted total to yield the base, B.
B <- colSums(data[ruca_level])
Approximate Standard Error
- Approximate the standard error of a 50 percent proportion using the formula in Standard Errors for Totals and Percentages. The design_factor is passed to this function by the server, which uses the get_design_factor function (explained below) to query the design_factors table.

se_50_percent <- design_factor * sqrt(87.5/(12.5*B) * 50^2)
Calculate Median Estimate Bounds
- Create the variable p_lower by subtracting the SE from 50 percent. Create p_upper by adding the SE to 50 percent.
p_lower <- 50 - se_50_percent
p_upper <- 50 + se_50_percent
- Determine the categories in the distribution that contain p_lower and p_upper…
# Determine the indexes of the cumulative percent data.frame corresponding
# to the upper and lower bounds of the 50% proportion estimate
cum_percent_idx_lower <- min(which(cum_percent > p_lower))
cum_percent_idx_upper <- min(which(cum_percent > p_upper))
…If p_lower and p_upper fall in the same category, follow step 6. If p_lower and p_upper fall in different categories, go to step 7…
# The median estimation calculation is handled differently based on
# whether the upper and lower bound indexes are equal
if (cum_percent_idx_lower == cum_percent_idx_upper) {
- If p_lower and p_upper fall in the same category, do the following:
- Define A1 as the smallest value in that category.
# A1 is the minimum earnings value (e.g. 30000) of the earning range
# (e.g. 30000 to 34999) corresponding to the lower bound cumulative percent
A1 <- earnings[cum_percent_idx_lower, "min_earnings"]
- Define A2 as the smallest value in the next (higher) category.
# A2 is the minimum earnings value of the earning range above the
# earning range corresponding to the lower bound cumulative percent
A2 <- earnings[cum_percent_idx_lower + 1, "min_earnings"]
- Define C1 as the cumulative percent of units strictly less than A1.
# C1 is the cumulative percentage of earnings one row below the
# lower bound cumulative percent
C1 <- cum_percent[cum_percent_idx_lower - 1, ]
- Define C2 as the cumulative percent of units strictly less than A2.
# C2 is the cumulative percentage of earnings at the row of the
# lower bound cumulative percent
C2 <- cum_percent[cum_percent_idx_lower, ]
- Use the following formulas to approximate the lower and upper bounds for a confidence interval about the median:
# the lower bound of the median
lower_bound <- (p_lower - C1) / (C2 - C1) * (A2 - A1) + A1

# the upper bound of the median
upper_bound <- (p_upper - C1) / (C2 - C1) * (A2 - A1) + A1
- If p_lower and p_upper fall in different categories, do the following:
- For the category containing p_lower: Define A1, A2, C1, and C2 as described in step 6. Use these values and the formula in step 6 to obtain the lower bound.
# A1, A2, C1 and C2 are calculated using the lower bound cumulative percent
# to calculate the lower bound of the median estimate
A1 <- earnings[cum_percent_idx_lower, "min_earnings"]
A2 <- earnings[cum_percent_idx_lower + 1, "min_earnings"]
C1 <- cum_percent[cum_percent_idx_lower - 1, ]
C2 <- cum_percent[cum_percent_idx_lower, ]
lower_bound <- (p_lower - C1) / (C2 - C1) * (A2 - A1) + A1
- For the category containing p_upper: Define new values for A1, A2, C1, and C2 as described in step 6. Use these values and the formula in step 6 to obtain the upper bound.
# A1, A2, C1 and C2 are calculated using the upper bound cumulative percent
# to calculate the upper bound of the median estimate
A1 <- earnings[cum_percent_idx_upper, "min_earnings"]
A2 <- earnings[cum_percent_idx_upper + 1, "min_earnings"]
C1 <- cum_percent[cum_percent_idx_upper - 1, ]
C2 <- cum_percent[cum_percent_idx_upper, ]
upper_bound <- (p_upper - C1) / (C2 - C1) * (A2 - A1) + A1
- Use the lower and upper bounds approximated in steps 6 or 7 to approximate the standard error of the median. SE(median) = 1/2 X (Upper Bound – Lower Bound)
# The median earnings estimate is the average of the upper and lower bounds
# of the median estimate calculated above in the if-else block
median_earnings <- 0.5 * (lower_bound + upper_bound)

# The median SE is half the distance between the upper and lower bounds
# of the median estimate
median_se <- 0.5 * (upper_bound - lower_bound)

# The 90% confidence interval critical z-score is used to calculate
# the margin of error
median_90_moe <- 1.645 * median_se
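The whole procedure can be exercised on a small made-up distribution (four earning ranges, 100 workers, DF = 1; all numbers invented for illustration). Here both bounds land in the same category, so the step 6 branch applies:

```r
# Toy frequency distribution: lower edge of each earning range, and counts
min_earnings <- c(0, 10000, 20000, 30000)
counts       <- c(10, 20, 40, 30)

B <- sum(counts)                          # base (weighted total): 100
cum_percent <- 100 * cumsum(counts) / B   # 10, 30, 70, 100
design_factor <- 1.0                      # illustrative design factor

se_50 <- design_factor * sqrt(87.5 / (12.5 * B) * 50^2)
p_lower <- 50 - se_50
p_upper <- 50 + se_50

# Both proportions fall in the third category (cumulative 70%)
idx <- min(which(cum_percent > p_lower))
A1 <- min_earnings[idx]        # 20000
A2 <- min_earnings[idx + 1]    # 30000
C1 <- cum_percent[idx - 1]     # 30
C2 <- cum_percent[idx]         # 70

lower_bound <- (p_lower - C1) / (C2 - C1) * (A2 - A1) + A1
upper_bound <- (p_upper - C1) / (C2 - C1) * (A2 - A1) + A1

median_earnings <- 0.5 * (lower_bound + upper_bound)  # 25000, by symmetry
median_se <- 0.5 * (upper_bound - lower_bound)
```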
Reshape the Data
Finally, a data.frame
is returned, which will be displayed in a tableOutput
element.
# A data.frame will be displayed in the UI
median_data <- data.frame(
  "Estimate" = median_earnings,
  "SE" = median_se,
  "MOE" = median_90_moe
)
format_query_result.R
The purpose of this function is to receive two data.frame
objects, one for earnings estimate
values, and one for the corresponding moe
values, and return a single data.frame
which is ready to be displayed in a tableOutput
.
Extract data.frame
Objects from List
Since get_b20005_ruca_aggregate_earnings
returns a named list, I first pull out the estimate
and moe
data.frame
objects:
# Pull out query result data.frames from the list
estimate <- rs[["estimate"]]
moe <- rs[["moe"]]
Reshape data.frame Objects
These data.frame
objects have RUCA levels in the column DESCRIPTION
and one column for each population estimate. For example, the estimate
for Alabama Full Time Female workers looks like this:
 | DESCRIPTION | … | B20005_053E | B20005_054E | B20005_055E | … |
---|---|---|---|---|---|---|
1 | Large Town | … | 149 | 257 | 546 | … |
2 | Rural | … | 75 | 66 | 351 | … |
3 | Small Town | … | 28 | 162 | 634 | … |
4 | Urban | … | 468 | 1061 | 4732 | … |
5 | Zero Population | … | 0 | 0 | 0 | … |
The moe
data.frame
has a similar layout.
However, in the UI, I want the table to look like this:
To achieve this, I first transpose the estimate and moe data.frames…
# Transpose the query results
col_names <- estimate[,"DESCRIPTION"]
estimate <- t(estimate[-1])
colnames(estimate) <- col_names

col_names <- moe[,"DESCRIPTION"]
moe <- t(moe[-1])
colnames(moe) <- col_names
…then zip them together, keeping in mind that not all states have tracts designated with all RUCA levels:
# Create a mapping to make column names more computer-readable
format_ruca_level <- c(
  "Urban" = "Urban",
  "Large Town" = "Large_Town",
  "Small Town" = "Small_Town",
  "Rural" = "Rural",
  "Zero Population" = "Zero_Population")
# bind together estimate and corresponding moe columns
# some states do not have all RUCA levels
# for example, Connecticut does not have "Small Town" tracts
# Create empty objects
output_table <- data.frame(temp = matrix(NA, nrow = nrow(estimate), ncol = 0))
col_names <- c()

for (ruca_level in c("Urban", "Large Town", "Small Town", "Rural")) {
  if (ruca_level %in% colnames(estimate)) {
    output_table <- cbind(output_table, estimate[,ruca_level], moe[,ruca_level])

    # paste "_MOE" suffix for MOE columns
    col_names <- c(
      col_names,
      format_ruca_level[[ruca_level]],
      paste0(format_ruca_level[[ruca_level]], "_MOE"))
  }
}
# Replace old names with more computer-readable names
colnames(output_table) <- col_names
Add Descriptive Labels
Finally, merge
the output_table
data.frame
with labels
(long form description of the B20005 variables) which are retrieved from the database using the get_b20005_labels
function explained later on in this post. Remember that the label
is delimited with "!!"
and the last substring contains earnings ranges (e.g. “$30,000 to $34,999”):
# name rows as long-form labels, by splitting them by '!!' and
# grabbing the last chunk which has dollar ranges e.g.
# $30000 to $34999
output_table <- merge(output_table, labels, by.x = 0, by.y = "name")
split_label <- data.frame(
  do.call(
    'rbind',
    strsplit(as.character(output_table$label), '!!', fixed=TRUE)))
rownames(output_table) <- split_label$X6
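A quick look at how one of these labels splits (using the example label shown earlier in this post):

```r
label <- "Margin of Error!!Total!!Female!!Other!!With earnings!!$1 to $2,499 or loss"
parts <- strsplit(label, "!!", fixed = TRUE)[[1]]

# The sixth (last) chunk is the earnings range
parts[6]  # "$1 to $2,499 or loss"
```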
get_b20005_labels.R
This script contains two helper functions to retrieve the label
column from the b20005_vars
table.
Get Earnings Population Estimate Labels
The first one, get_b20005_labels
retrieves the variable name
and label
for earning range strings (e.g. “$30,000 to $34,999”):
get_b20005_labels <- function() {
  census_app_db <- dbConnect(RSQLite::SQLite(), "census_app_db.sqlite")
  rs <- dbGetQuery(
    census_app_db,
    "SELECT
      name, label
    FROM 'b20005_vars'
    WHERE
      label LIKE '%$%'
    ORDER BY name")
  dbDisconnect(census_app_db)
  return(rs)
}
Get All Labels
The second function, get_b20005_ALL_labels
returns the whole table:
get_b20005_ALL_labels <- function() {
  census_app_db <- dbConnect(RSQLite::SQLite(), "census_app_db.sqlite")
  rs <- dbGetQuery(
    census_app_db,
    "SELECT
      name, label
    FROM 'b20005_vars'
    ORDER BY name")
  dbDisconnect(census_app_db)
  return(rs)
}
get_b20005_tract_earnings.R
This function is similar to get_b20005_ruca_aggregate_earnings
but does not aggregate by RUCA level, and also includes Census Tracts that are not designated a RUCA level. The label_wildcard
is constructed the same way as before.
Get Variable Names
The variable names are obtained for both margins of error and estimates in the same query:
# Get b20005 variable names (estimates and moe)
vars <- dbGetQuery(
  census_app_db,
  "SELECT name FROM b20005_vars
  WHERE label LIKE $label_wildcard",
  params=list(label_wildcard=label_wildcard)
)
Join Tables
The tract-level earnings are queried with the following, using a LEFT JOIN
between b20005
and ruca
tables to include tracts that do not have a RUCA level.
# Construct query to get tract-level earnings data
query_string <- paste(
  "SELECT ruca.DESCRIPTION,
  b20005.state || b20005.county || b20005.tract AS TRACTFIPS,",
  paste0(vars$name, collapse=","),
  "FROM b20005
  LEFT JOIN ruca
  ON b20005.state || b20005.county || b20005.tract = ruca.TRACTFIPS
  WHERE
  b20005.state LIKE $state")
get_b20005_states.R
This function retrieves state codes and names from the codes
table, and is used to assign choices
to selectInput
dropdowns. "United States"
which has a FIPS code of "00"
is excluded because the b20005
table contains state-level data only. The query result is sorted by the state name so that the dropdown menu choices
are in ascending alphabetical order.
states <- dbGetQuery(
  census_app_db,
  "SELECT DESCRIPTION, CODE
  FROM codes
  WHERE CATEGORY = 'state'
  AND CODE <> '00'
  ORDER BY DESCRIPTION")
get_design_factor.R
This function retrieves a single numeric Design Factor for the “Person Earnings/Income” characteristic from the design_factors
table for a given state
parameter:
rs <- dbGetQuery(
  census_app_db,
  "SELECT DESIGN_FACTOR FROM design_factors
  WHERE ST = $state
  AND CHARACTERISTIC = 'Person Earnings/Income'",
  params = list(state=state))
rs <- as.numeric(rs[1, "DESIGN_FACTOR"])
make_plot.R
This function creates a ggplot bar plot object using given data, a RUCA level, and a title. The x-axis labels are rotated, axis labels on both axes are resized, and the plot title and subtitle are formatted.
make_plot <- function(data, ruca_level, plot_title){
  # Prepare x-axis factor for `aes` parameter
  xs <- rownames(data)
  xs <- factor(xs, xs)

  bar_plot <- ggplot(
    data=data,
    aes(x=xs, y=get(ruca_level))) +
geom_bar(stat='identity') +
theme(
# Rotate x-axis labels
axis.text.x=element_text(
angle = -90,
vjust = 0.5,
hjust=1,
size=12),
# Resize x-axis labels and move them away from axis
axis.title.x=element_text(vjust=-0.75,size=14),
# Resize y-axis labels
axis.text.y=element_text(size=12),
axis.title.y=element_text(size=14),
# Set plot title and subtitle font and placement
plot.title = element_text(size = 18, hjust=0.5, face='bold'),
plot.subtitle = element_text(size = 12, hjust=0.5)) +
labs(x="Earnings", y="Population Estimate") +
ggtitle(plot_title, subtitle="Population Estimate by Earnings Level")
return (bar_plot)
}