ACS data workflow • camiller

This is a simple workflow that makes use of functions in camiller to analyze and visualize some Census American Community Survey data. The data was downloaded with tidycensus, cleaned up a little bit, and is all included in this package.

library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
library(camiller)
library(showtext)

font_add_google("PT Sans", "ptsans")
showtext_auto()

The dataset pov_age contains estimates and margins of error (MOEs) of residents by age group in different ratio brackets compared to the federal poverty line by town in Greater New Haven.

head(pov_age)
#> # A tibble: 6 × 5
#>   name    age           ratio                     estimate   moe
#>   <chr>   <chr>         <chr>                        <dbl> <dbl>
#> 1 Bethany Under 6 years Poverty status determined      274   146
#> 2 Bethany Under 6 years Under .50                        0    17
#> 3 Bethany Under 6 years .50 to .74                       0    17
#> 4 Bethany Under 6 years .75 to .99                       8    15
#> 5 Bethany Under 6 years 1.00 to 1.24                     0    17
#> 6 Bethany Under 6 years 1.25 to 1.49                     0    17

The function add_grps calls the function make_grps, which makes it easy to aggregate sums by collapsing multiple subgroups into one larger group. These functions take a list of larger categories, with the indexes of different subgroups in that column. These are easy to figure out by calling unique on the column of interest.

unique(pov_age$ratio)
#>  [1] "Poverty status determined" "Under .50"                
#>  [3] ".50 to .74"                ".75 to .99"               
#>  [5] "1.00 to 1.24"              "1.25 to 1.49"             
#>  [7] "1.50 to 1.74"              "1.75 to 1.84"             
#>  [9] "1.85 to 1.99"              "2.00 to 2.99"             
#> [11] "3.00 to 3.99"              "4.00 to 4.99"             
#> [13] "5.00 and over"

unique(pov_age$age)
#>  [1] "Under 6 years"     "6 to 11 years"     "12 to 17 years"   
#>  [4] "18 to 24 years"    "25 to 34 years"    "35 to 44 years"   
#>  [7] "45 to 54 years"    "55 to 64 years"    "65 to 74 years"   
#> [10] "75 years and over"

ratio_grps <- list(
  determined = 1,
  poverty = 2:4,
  low_income = 2:9
)

age_grps <- list(
  young_children = 1,
  children = 1:3,
  seniors = 9:10
)

Alternatively, show_uniq will print out the unique values of a column with their indexes, and return the original data frame unchanged. This is convenient for finding positions without having to break a workflow.

pov_age %>%
  show_uniq(age) %>%
  group_by(name, ratio)
#> 
#> 1: Under 6 years      2: 6 to 11 years      3: 12 to 17 years     
#> 4: 18 to 24 years     5: 25 to 34 years     6: 35 to 44 years     
#> 7: 45 to 54 years     8: 55 to 64 years     9: 65 to 74 years     
#> 10: 75 years and over
#> # A tibble: 1,690 × 5
#> # Groups:   name, ratio [169]
#>    name    age           ratio                     estimate   moe
#>    <chr>   <chr>         <chr>                        <dbl> <dbl>
#>  1 Bethany Under 6 years Poverty status determined      274   146
#>  2 Bethany Under 6 years Under .50                        0    17
#>  3 Bethany Under 6 years .50 to .74                       0    17
#>  4 Bethany Under 6 years .75 to .99                       8    15
#>  5 Bethany Under 6 years 1.00 to 1.24                     0    17
#>  6 Bethany Under 6 years 1.25 to 1.49                     0    17
#>  7 Bethany Under 6 years 1.50 to 1.74                    47    66
#>  8 Bethany Under 6 years 1.75 to 1.84                     0    17
#>  9 Bethany Under 6 years 1.85 to 1.99                     0    17
#> 10 Bethany Under 6 years 2.00 to 2.99                    20    23
#> # … with 1,680 more rows

Using dplyr::group_by(name, ratio) and then add_grps gives aggregates of estimates and MOEs for ratio levels and towns, with the original age groups collapsed into the desired, larger age groups. MOE calculations are done using tidycensus. For example:

c("Under 6 years", "6 to 11 years", "12 to 17 years")

becomes children. The same is then done to collapse ratios. calc_shares then calculates shares of residents in each group over the denominator "determined", and, optionally, calculates MOEs for this proportion.

pov_rates <- pov_age %>%
  mutate(across(c(age, ratio), as_factor)) %>%
  group_by(name, ratio) %>%
  add_grps(age_grps, group = age, value = estimate, moe = moe) %>%
  group_by(name, age) %>%
  add_grps(ratio_grps, group = ratio, value = estimate, moe = moe) %>%
  calc_shares(group = ratio, denom = "determined", value = estimate, moe = moe)

head(pov_rates)
#> # A tibble: 6 × 7
#> # Groups:   name, age [2]
#>   name    age            ratio      estimate   moe  share sharemoe
#>   <chr>   <fct>          <fct>         <dbl> <dbl>  <dbl>    <dbl>
#> 1 Bethany young_children determined      274   146 NA       NA    
#> 2 Bethany young_children poverty           8    28  0.029    0.101
#> 3 Bethany young_children low_income       55    79  0.201    0.268
#> 4 Bethany children       determined     1259   230 NA       NA    
#> 5 Bethany children       poverty          25    43  0.02     0.034
#> 6 Bethany children       low_income      102    92  0.081    0.072

The function brk_labels allows for cleaning up break labels, such as those generated using cut. See ?brk_labels for formatting options. theme_din is a clean ggplot2 theme that works well for dot plots and bar charts.

This legend is unnecessary and not a good idea, just a place to display the output of brk_labels.

pal <- c("#FDCC8A", "#FC8D59", "#D7301F")
pov_rates %>%
  filter(ratio == "low_income", age == "children") %>%
  mutate(share = signif(share, digits = 2)) %>%
  ungroup() %>%
  mutate(name = as.factor(name) %>% fct_reorder(share, max)) %>%
  arrange(name) %>% 
  mutate(brk = cut(share, breaks = c(min(share), 0.14, 0.3, max(share)), include.lowest = T)) %>%
  ggplot(aes(x = name, y = share, color = brk)) +
    geom_point(size = 4) +
    coord_flip() +
    scale_color_manual(values = pal, 
                       labels = function(x) brk_labels(x, format = "percent", mult_by = 100)) +
    theme_din(base_family = "ptsans", base_size = 10, xgrid = T, ygrid = "dotted") +
    scale_y_continuous(labels = function(x) sprintf("%0g", x * 100)) +
    labs(x = NULL, y = NULL, color = "Rate", title = "Child low-income rate by town", subtitle = "Greater New Haven towns, 2016", caption = "Source: US Census Bureau 2016 ACS 5-year estimate")

The moe_test function applies t-tests for differences between two estimates, given their MOEs. This works well for comparing values in one year to those in another, or between related locations or groups. The grunt-work of these calculations is done with tidycensus functions. Included in the package is the pov_trend tibble for testing out significance testing on data in both 2010 and 2016.

pov_trend <- pov_age_10_16 %>%
  mutate(across(c(age, ratio), as_factor)) %>%
  group_by(name, year, ratio) %>%
  add_grps(age_grps, group = age, value = estimate, moe = moe) %>%
  group_by(name, year, age) %>%
  add_grps(ratio_grps, group = ratio, value = estimate, moe = moe) %>%
  calc_shares(group = ratio, denom = "determined", value = estimate, moe = moe)

The output of moe_test includes optional intermediary calculations, such as standard errors and Z-scores used for significance testing.

pov_sigs <- pov_trend %>%
  filter(ratio == "low_income") %>%
  select(-estimate, -moe, -ratio) %>%
  ungroup() %>%
  mutate(year = paste0("y", year)) %>%
  make_wide(share, sharemoe, group = year) %>%
  moe_test(est1 = y2010_share, moe1 = y2010_sharemoe, est2 = y2016_share, moe2 = y2016_sharemoe, alpha = 0.1)

pov_sigs %>%
  select(name, age, diff:isSig_90) %>%
  filter(name %in% c("New Haven", "Hamden"))
#> # A tibble: 6 × 8
#>   name      age                diff    se1    se2     se z_score isSig_90
#>   <chr>     <fct>             <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <lgl>   
#> 1 Hamden    young_children -0.089   0.0444 0.0353 0.0567  -1.57  FALSE   
#> 2 Hamden    children        0.0210  0.0237 0.0255 0.0348   0.603 FALSE   
#> 3 Hamden    seniors         0.0100  0.0188 0.0207 0.0280   0.357 FALSE   
#> 4 New Haven young_children  0.0610  0.0401 0.0401 0.0567   1.07  FALSE   
#> 5 New Haven children        0.0680  0.0225 0.0207 0.0305   2.23  TRUE    
#> 6 New Haven seniors        -0.00500 0.0213 0.0201 0.0292  -0.171 FALSE