Title: | Operationalizing Social Determinants of Health Data for Researchers |
---|---|
Description: | Accesses raw data via API and calculates social determinants of health measures for user-specified locations in the US, returning them in tidyverse- and sf-compatible data frames. |
Authors: | Nik Krieger [aut, cre], Jarrod Dalton [aut], Cindy Wang [aut], Adam Perzynski [aut], National Institutes of Health/National Institute on Aging [fnd] (The development of this software package was supported by a research grant from the National Institutes of Health/National Institute on Aging (Principal Investigators: Jarrod E. Dalton, PhD and Adam T. Perzynski, PhD; Grant Number: 5R01AG055480-02). All of its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.) |
Maintainer: | Nik Krieger <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.2.5 |
Built: | 2025-02-09 05:01:11 UTC |
Source: | https://github.com/clevelandclinicqhs/sociome |
A two-column data set of the American Community Survey variable names and their descriptions. Contains counts of various subdivisions of the population based on age, sex, race, and ethnicity.
acs_age_sex_race_ethnicity_vars
A tibble with 65 rows and 2 variables:
ACS variable name
A description of who is present in the count
These variable names have been consistent throughout the existence of the ACS from its beginning through 2020.
This data set is used to support synthetic_population().
decennial_age_sex_race_ethnicity_vars
A dataset of the ACS variable names used to calculate the Area Deprivation Index (ADI) and Berg Indices (ADI-3).
acs_vars
A tibble with 139 rows and 10 variables:
ACS variable name
Brief description of the data the variable contains
Logical, indicating the variables to be used when calculating ADI and ADI-3 using the 1- or 3-year estimates from 2011 and later or when using the 5-year estimates from 2012 or later
Logical, indicating the variables to be used when calculating ADI and ADI-3 at the block group level using the 2015 or 2016 estimates
Logical, indicating the variables to be used when calculating ADI using the 2011 5-year estimates
Logical, indicating the variables to be used when calculating ADI and ADI-3 using the 2010 1- or 3-year estimates
Logical, indicating the variables to be used when calculating ADI and ADI-3 using the 2010 5-year estimates
Logical, indicating the variables to be used when calculating ADI and ADI-3 using the 2008 or 2009 1-year estimates
Logical, indicating the variables to be used when calculating ACS estimates not previously mentioned, including the 2009 5-year estimates
Logical, indicating the variables to use in conjunction with the few actual 2010 decennial census variables when running get_adi(year = 2010, dataset = "decennial")
Note that not all year/estimate combinations are currently supported by the census API and/or tidycensus, and some may never be supported.
Runs cluster::daisy() on a data frame, breaks up the columns of the resulting dissimilarity matrix into a list, and adds this list to the data frame as a list column. In addition or instead, it adds a transformed version of the dissimilarity list, which can be used as sampling weights.
append_dissimilarities( data, cols = dplyr::everything(), dissimilarity_measure_name = "dissimilarities", sampling_weight_name = "sampling_weights", metric = "gower", ... )
data |
A data frame that has at least one row and at least one column. |
cols |
< |
dissimilarity_measure_name , sampling_weight_name
|
The names of the list
columns that will be added to |
metric , ...
|
Passed to |
All columns are fed to cluster::daisy() by default, but the user can select which ones using the cols argument.
Once the full dissimilarity matrix is obtained, the columns are separated into a list via asplit() and appended to data. Each element of the list is therefore a double vector with nrow(data) values. For any given row, its dissimilarity vector represents the row's dissimilarity to every row.
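As a rough sketch of this step (not the package's internal code), the object returned by cluster::daisy() can be expanded to a full square matrix and split by column with asplit(). The object names below are illustrative, and only three numeric mtcars columns are used to keep the output small.

```r
# Illustrative sketch only; append_dissimilarities() may differ internally.
if (requireNamespace("cluster", quietly = TRUE)) {
  example_data <- mtcars[, c("mpg", "hp", "wt")]

  # Pairwise Gower dissimilarities between all rows
  d <- cluster::daisy(example_data, metric = "gower")

  # Expand to a full square matrix, then split it into one vector per column
  d_matrix <- as.matrix(d)
  d_list <- asplit(d_matrix, 2)

  # One list element per row of the data, each holding nrow(example_data) values
  length(d_list)
  length(d_list[[5]])

  # A row's dissimilarity to itself is zero
  d_list[[5]][5]
}
```

Each element of d_list is what would be stored in one cell of the list column.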
The optional/alternative "sampling weight" column is a transformed version of the dissimilarity list:
1. All dissimilarity measures of 0 are replaced with the next smallest dissimilarity value in the vector. In effect, this means that a row's dissimilarity to itself (and any rows identical to it) is replaced with the dissimilarity value of its next most similar row. (Exception: if all elements are 0, all of them are replaced with 1.)
2. Then the reciprocal of each element is taken so that larger values represent greater similarity.
3. Each element is divided by the sum of the vector, which standardizes the elements to add to 1.
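The three steps can be sketched in base R as follows. The helper dissimilarity_to_weights() is hypothetical and written only for illustration; it assumes a non-negative numeric vector with no missing values and is not the package's own implementation.

```r
# Hypothetical helper illustrating the three steps; not part of sociome.
dissimilarity_to_weights <- function(d) {
  if (all(d == 0)) {
    # Exception: every element is 0, so all of them become 1
    d[] <- 1
  } else {
    # Step 1: replace zeros with the smallest nonzero dissimilarity
    d[d == 0] <- min(d[d > 0])
  }
  # Step 2: take reciprocals so that larger values mean greater similarity
  w <- 1 / d
  # Step 3: standardize so that the weights add to 1
  w / sum(w)
}

# The two zeros receive the same weight as the next most similar element (0.2)
weights_mixed <- dissimilarity_to_weights(c(0, 0.2, 0.5, 0))
weights_mixed

# All-zero input yields equal weights
weights_zero <- dissimilarity_to_weights(c(0, 0, 0))
weights_zero
```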
Requires the package cluster to be installed.
A data frame, specifically the data argument with one or two more columns added to the end.
# Running this on all mtcars columns
mtdissim <- append_dissimilarities(mtcars)
# Therefore, these numbers represent the dissimilarity of each row to the
# fifth row:
mtdissim$dissimilarities[[5]]
# And these are the dissimilarities' corresponding sampling weights:
mtdissim$sampling_weights[[5]]
# Now we run it on mtcars without the wt and qsec columns so that we purposely
# end up with some duplicate rows (the first and second).
mtdissim_dup <- append_dissimilarities(mtcars, cols = !c(wt, qsec))
# These represent each row's dissimilarity to the first row.
# Since we specifically told it not to take wt and qsec into account, the
# first two rows are identical. Therefore, both values are zero.
mtdissim_dup$dissimilarities[[1]]
# Here are the corresponding sampling weights. Notice that the first two
# rows' sampling weights are the same as the sampling weight of row 30, which
# is the next most similar row.
mtdissim_dup$sampling_weights[[1]]
Calculate the Area Deprivation Index and Berg Indices (ADI-3) using decennial US census or American Community Survey (ACS) variables.
calculate_adi(data_raw, keep_indicators = FALSE, seed = NA)
data_raw |
A data frame. The columns of this data frame must be named according to the elements of
the The easiest way to obtain data like this is to run
|
keep_indicators |
Logical indicating whether or not to keep the
component indicators of the ADI and ADI-3 as well as the original census
variables used to calculate them. Defaults to See |
seed |
Passed to the |
The function get_adi() calls this function by default as its final step, but some users may want to calculate ADI and ADI-3 values for different combinations of areas in a given data set. get_adi(raw_data_only = TRUE) returns the raw census data used to calculate ADI and ADI-3. Users may select subsets of such a data set and pipe them into calculate_adi().
This function discerns which kind of census data data_raw contains (ACS, or one of the decennial censuses) by checking for the existence of key variables unique to each kind of data set.
Areas listed as having zero households are excluded from ADI and ADI-3 calculation. Their resulting ADIs and ADI-3s will be NA.
If calling this function directly (i.e., not via get_adi()) on a data set that contains median household income (B19013_001) and does not contain median family income (B19113_001), median household income will be used in place of median family income, with a warning(). See the "Missingness and imputation" section of get_adi().
A tibble (or sf) with the same number of rows as data_raw. Columns include GEOID, NAME, ADI, Financial_Strength, Economic_Hardship_and_Inequality, and Educational_Attainment. Further columns containing the indicators and raw values will also be present if keep_indicators = TRUE.
For more information, see get_adi(), especially the sections titled "ADI and ADI-3 factor loadings" and "Missingness and imputation".
## Not run:
# Wrapped in \dontrun{} because these examples require a Census API key.
raw_census <- get_adi("state", year = 2017, raw_data_only = TRUE)
calculate_adi(raw_census)
calculate_adi(raw_census, keep_indicators = TRUE)
## End(Not run)
A three-column data set of the Decennial Census variable names, their descriptions, and their decennial census year. Contains counts of various subdivisions of the population based on age, sex, race, and ethnicity.
decennial_age_sex_race_ethnicity_vars
A tibble with 130 rows and 3 variables:
The year of the decennial census with which the variable is associated.
Decennial census variable name
A description of who is present in the count
Currently, the 2000 and 2010 Decennial Census variables are available.
This data set is used to support synthetic_population().
acs_age_sex_race_ethnicity_vars
A dataset of the decennial census variable names used to calculate the Area Deprivation Index (ADI) and the Berg Indices (ADI-3).
decennial_vars
A tibble with 137 rows and 4 variables:
Decennial census variable name
The summary tape file of the decennial census variable
The year of the decennial census variable
Brief description of the data the variable contains
Returns the ADI and ADI-3 of user-specified areas.
get_adi( geography, state = NULL, county = NULL, geoid = NULL, zcta = NULL, year, dataset = c("acs5", "acs3", "acs1", "decennial"), geometry = FALSE, keep_indicators = FALSE, raw_data_only = FALSE, cache_tables = TRUE, key = NULL, seed = NA, ... )
geography |
A character string denoting the level of census geography
whose ADIs and ADI-3s you'd like to obtain. Must be one of |
state |
A character string specifying states whose ADI and ADI-3 data is
desired. Defaults to |
county |
A vector of character strings specifying the counties whose ADI
and ADI-3 data you're requesting. Defaults to |
geoid |
A character vector of GEOIDs (use quotation marks and leading
zeros). Defaults to |
zcta |
A character vector of ZCTAs or the leading digit(s) of ZCTAs (use
quotation marks and leading zeros). Defaults to Strings under 5 digits long will yield all ZCTAs that begin with those digits. Requires that |
year |
Single integer specifying the year of US Census data to use. |
dataset |
The data set used to calculate ADIs and ADI-3s. Must be one of
When The 2010 decennial census did not include the long-form questionnaire used in the 1990 and 2000 censuses, so this function uses the 5-year estimates from the 2010 ACS to supply the data not included in the 2010 decennial census. In fact, the only 2010 decennial variables used are H003002, H014002, P020002, and P020008. Important: data are not always available depending on the level of geography and data set chosen. See https://www.census.gov/programs-surveys/acs/guidance/estimates.html. |
geometry |
Logical value indicating whether or not shapefile data should
be included in the result, making the result an The shapefile data that is returned is somewhat customizable by passing
certain arguments along to the |
keep_indicators |
Logical value indicating whether or not the resulting
See |
raw_data_only |
Logical, indicating whether or not to skip calculation
of the ADI and ADI-3 and only return the census variables. Defaults to
|
cache_tables |
The plural version of the |
key |
Your Census API key as a character string. Obtain one at
http://api.census.gov/data/key_signup.html. Defaults to |
seed |
Passed to |
... |
Additional arguments to be passed onto This may be found to be helpful when setting |
Returns a tibble or sf object of the Area Deprivation Indices (ADIs) and Berg Indices (ADI-3s) of user-specified locations in the United States, utilizing US Census data. Locations that are listed as having zero households are excluded from ADI and ADI-3 calculation: their ADI and ADI-3 values will be NA.
If geometry = FALSE (the default), a tibble. If geometry = TRUE is specified, an sf.
The concept of "reference area" is important to understand when using this function. The algorithm that produced the original ADIs employs factor analysis. As a result, the ADI is a relative measure; the ADI of a particular location is dynamic, varying depending on which other locations were supplied to the algorithm. In other words, ADI will vary depending on the reference area you specify.
For example, the ADI of Orange County, California is x when calculated alongside all other counties in California, but it is y when calculated alongside all counties in the US. The get_adi() function enables the user to define a reference area by feeding a vector of GEOIDs to its geoid parameter (or alternatively for convenience, states and/or counties to state and county). The function then gathers data from those specified locations and performs calculations using their data alone.
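The dependence on the reference area can be illustrated with a toy example. The sketch below is not the ADI algorithm: it swaps in the first principal component from stats::prcomp() on three mtcars columns purely to show that a relative score for the same row changes when the set of rows it is computed alongside changes. The helper first_pc_score() is hypothetical.

```r
# Toy illustration of a relative index; this is NOT the ADI algorithm.
first_pc_score <- function(data, row_name) {
  # Standardize the columns and extract the row's score on the first component
  pca <- stats::prcomp(data, center = TRUE, scale. = TRUE)
  unname(pca$x[row_name, "PC1"])
}

indicators <- mtcars[, c("mpg", "hp", "wt")]

# "Reference area" 1: all 32 rows
score_all <- first_pc_score(indicators, "Fiat 128")

# "Reference area" 2: only the four-cylinder cars
score_subset <- first_pc_score(indicators[mtcars$cyl == 4, ], "Fiat 128")

# Same row, different scores, because the comparison set differs
c(all_rows = score_all, four_cylinder_only = score_subset)
```

The sign of a principal component is arbitrary, so only the magnitudes are meaningfully comparable here.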
The Berg Indices (ADI-3) were developed with this principle of relativity in mind, and as such there is no set of seminal ADI-3 values. Thus, the terms "Berg Indices" and "ADI-3" refer more nearly to any values generated using the algorithm employed in this package.
Areas listed as having zero households are excluded from the reference area, and their ADI and ADI-3 values will be NA.
The geoid parameter
Elements of geoid can represent different levels of geography, but they all must be either 2 digits (for states), 5 digits (for counties), 11 digits (for tracts), or 12 digits (for block groups). It must contain character strings, so use quotation marks as well as leading zeros where applicable.
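The length rule above can be sketched as a small lookup. The helper geoid_level() is hypothetical (it is not a sociome function) and checks only the number of characters, not whether the GEOID exists; the 12-character value in the usage line is a made-up block group ID.

```r
# Hypothetical helper, not part of sociome: classify GEOIDs by their length.
geoid_level <- function(geoid) {
  stopifnot(is.character(geoid))
  levels_by_length <- c(
    "2" = "state",
    "5" = "county",
    "11" = "tract",
    "12" = "block group"
  )
  # Look up each GEOID's level by its character count
  level <- unname(levels_by_length[as.character(nchar(geoid))])
  if (anyNA(level)) {
    stop("Each GEOID must have 2, 5, 11, or 12 digits.")
  }
  level
}

# Leading zeros survive only because these are character strings
geoid_levels <- geoid_level(c("10", "06059", "36061021600", "360610216001"))
geoid_levels
```

Storing GEOIDs as numbers would silently drop leading zeros (for example, 06059 would become 6059), which is why character strings are required.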
The returned tibble or sf is of class adi, and it contains an attribute called loadings, which contains a tibble of the PCA loadings of each factor. This is accessible through attr(name_of_tibble, "loadings").
While this function allows flexibility in specifying reference areas (see the Reference area section above), data from the US Census are masked for sparsely populated places, resulting in many missing values.
Imputation is attempted via mice::mice(m = 1, maxit = 50, method = "pmm", seed = seed). If imputation is unsuccessful, an error is thrown, but the dataset of indicators on which imputation was unsuccessful is available via rlang::last_error()$adi_indicators and the raw census data are available via rlang::last_error()$adi_raw_data. The former excludes areas with zero households, but the latter includes them.
One of the indicators of both ADI and the Financial Strength component of ADI-3 is median family income, but methodological issues with the 2015 and 2016 ACS have rendered this variable unavailable at the block group level for those years. When requested, this function will use median household income in its place, with a warning(). See https://www.census.gov/programs-surveys/acs/technical-documentation/user-notes/2016-01.html.
Depending on user input, this function may call its underlying functions (tidycensus::get_acs() or tidycensus::get_decennial()) many times in order to accommodate their behavior. When these calls are broken up by state or by state and county, a message is printed indicating the state or state and county whose data is being pulled. These calls are wrapped in purrr::insistently(purrr::rate_delay(), quiet = FALSE), meaning that they are attempted over and over until success, and tidycensus error messages are printed as they occur.
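The retry behavior can be sketched with a stand-in function. In the example below, flaky_request() is a hypothetical placeholder for a Census API call that fails twice before succeeding; the pause and max_times values are chosen only to keep the example fast and bounded, and they are not the settings used by the package.

```r
# Sketch of retry-until-success with purrr; flaky_request() is hypothetical.
if (requireNamespace("purrr", quietly = TRUE)) {
  attempts <- 0

  # Stand-in for an unreliable API call: errors on the first two attempts
  flaky_request <- function() {
    attempts <<- attempts + 1
    if (attempts < 3) stop("temporary server error")
    "ok"
  }

  # Retry after a fixed delay; quiet = FALSE reports each failure as it occurs
  persistent_request <- purrr::insistently(
    flaky_request,
    rate = purrr::rate_delay(pause = 0.1, max_times = 5),
    quiet = FALSE
  )

  result <- persistent_request()
  result
  attempts
}
```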
Please note that this function calls data from US Census servers, so execution may take a long time depending on the user's internet connection and the amount of data requested.
For advanced users, if changing the dataset argument, be sure to know the advantages and limitations of the 1-year and 3-year ACS estimates. See https://www.census.gov/programs-surveys/acs/guidance/estimates.html for details.
## Not run:
# Wrapped in \dontrun{} because all these examples take >5 seconds
# and require a Census API key.
# ADI of all census tracts in Cuyahoga County, Ohio
get_adi(geography = "tract", year = 2017, state = "OH", county = "Cuyahoga")
# ADI and ADI-3 of all counties in Connecticut, using the 2014 ACS1 survey.
# Returns a warning because there are only 8 counties.
# A minimum of 30 locations is recommended.
get_adi(geography = "county", state = "CT", year = 2014, dataset = "acs1")
# Areas with zero households will have an ADI and ADI-3 of NA:
queens <- get_adi(
  "tract",
  year = 2017,
  state = "NY",
  county = "Queens",
  keep_indicators = TRUE,
  geometry = TRUE
)
queens %>%
  dplyr::as_tibble() %>%
  dplyr::select(GEOID, NAME, ADI, households = B11005_001) %>%
  dplyr::filter(is.na(ADI) | households == 0) %>%
  print(n = Inf)
# geoid argument allows for highly customized reference populations.
# ADI of all census tracts in the GEOIDs stored in "delmarva_geoids" below:
# Notice the mixing of state- ("10") and county-level GEOIDs (the others).
delmarva_geoids <- c("10", "51001", "51131", "24015", "24029", "24035", "24011", "24041", "24019", "24045", "24039", "24047")
delmarva <- get_adi(
  geography = "tract",
  geoid = delmarva_geoids,
  dataset = "acs5",
  year = 2009,
  geometry = TRUE
)
# Demonstration of geom_sf() integration:
require(ggplot2)
# The na.value argument changes the fill of NA ADI areas.
delmarva %>% ggplot() + geom_sf(aes(fill = ADI), lwd = 0)
# Setting direction = -1 makes the less deprived areas the lighter ones
# The argument na.value changes the color of zero-household areas
queens %>%
  ggplot() +
  geom_sf(aes(fill = ADI), lwd = 0) +
  scale_fill_viridis_c(na.value = "red", direction = -1)
# Obtain factor loadings:
attr(queens, "loadings")
## End(Not run)
Returns a tibble containing the census areas whose centers of population are closest to some user-specified center. To specify the center, the user can manually enter longitude/latitude coordinates or use the helper function lon_lat_from_area() to automatically grab the longitude/latitude coordinates of the center of population of an area. The cutoff point for how many areas will be returned depends on the function used.
areas_in_radius(
  geography = c("state", "county", "tract", "block group"),
  center = lon_lat_from_area(state = "DC"),
  radius = 5,
  units = "miles",
  measure_from = "center of population",
  year = 2020,
  distance_fun = geosphere::distVincentyEllipsoid,
  batch_size = 50L
)
closest_n_areas(
  geography = c("state", "county", "tract", "block group"),
  center = lon_lat_from_area(state = "DC"),
  n = 50,
  measure_from = "center of population",
  year = 2020,
  distance_fun = geosphere::distVincentyEllipsoid,
  units = NULL,
  batch_size = 50L
)
closest_population(
  geography = c("state", "county", "tract", "block group"),
  center = lon_lat_from_area(state = "DC"),
  population = 1e+06,
  measure_from = "center of population",
  year = 2020,
  distance_fun = geosphere::distVincentyEllipsoid,
  units = NULL,
  batch_size = 50L
)
geography |
The type of census areas that the resulting table will
contain. One of |
center |
The longitude/latitude coordinates of the center of the circle.
A double vector of length 2 whose elements are finite numbers. Passed to
the The first element is the longitude coordinate (positive for east, negative for west). The second element is the latitude coordinate (positive for north, negative for south). The convenience function Defaults to the center of population of the District of Columbia according to the 2020 decennial census. |
radius |
A single, non-negative number specifying the radius of the circle. Defaults to 5. |
units |
A single string specifying the units of the resulting For |
measure_from |
Currently can only be |
year |
Must be 2020, 2010, or 2000. Defaults to 2020. |
distance_fun |
Passed to the |
batch_size |
The number of distances calculated in each iterative call
to |
n |
A single positive integer specifying how many of the areas closest
to |
population |
A single positive integer specifying the target total population of the areas returned. See Details. |
areas_in_radius() returns all areas whose centers of population are within the user-specified radius around center.
closest_n_areas() returns the n areas whose centers of population are closest to center.
Conceptually, closest_population() sequentially gathers the next closest area to center until the total population of the areas meets or exceeds population.
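The three cutoff rules can be sketched on a toy table in base R. The data frame below is made up (hypothetical area IDs, distances, and populations), and treating the radius as inclusive is an assumption; the real functions compute distances from census centers of population.

```r
# Toy data: hypothetical areas with precomputed distances (miles) to center
areas <- data.frame(
  geoid = c("A", "B", "C", "D", "E"),
  distance = c(12, 3, 7, 1, 20),
  population = c(400, 250, 300, 100, 900)
)

# Sort from nearest to farthest
areas <- areas[order(areas$distance), ]

# Like areas_in_radius(): every area within the radius (assumed inclusive)
in_radius <- areas[areas$distance <= 5, ]

# Like closest_n_areas(): the n nearest areas
closest_n <- head(areas, 3)

# Like closest_population(): add the next nearest area until the cumulative
# population meets or exceeds the target
target <- 600
running_total <- cumsum(areas$population)
n_needed <- which(running_total >= target)[1]
if (is.na(n_needed)) n_needed <- nrow(areas)  # target exceeds total population
closest_pop <- areas[seq_len(n_needed), ]

in_radius$geoid
closest_n$geoid
closest_pop$geoid
```

Because whole areas are added one at a time, the total population returned by the last rule will usually overshoot the target somewhat.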
Distances are determined with geosphere::distm().
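A minimal sketch of such a distance calculation follows, assuming the geosphere package is installed. The coordinates are hypothetical example points in (longitude, latitude) order, and the conversion to miles is done by hand here rather than through the units package.

```r
# Sketch of a geosphere::distm() call; coordinates are illustrative only.
if (requireNamespace("geosphere", quietly = TRUE)) {
  # Center point as c(longitude, latitude)
  center <- c(-77.0369, 38.9072)

  # Two other points, one per row, also in (longitude, latitude) order
  points <- rbind(
    c(-76.6122, 39.2904),
    c(-75.1652, 39.9526)
  )

  # distm() returns a matrix of distances in meters (here 1 row, 2 columns)
  distances_m <- geosphere::distm(
    center,
    points,
    fun = geosphere::distVincentyEllipsoid
  )

  # Manual conversion: 1 mile = 1609.344 meters
  distances_miles <- as.vector(distances_m) / 1609.344
  distances_miles
}
```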
Requires the packages USpopcenters and geosphere to be installed. Requires the units package to be installed unless units = NULL.
Centers of population are based on the decennial census data. Only states, counties, tracts, and block groups are currently supported. See the documentation of the USpopcenters package and https://www.census.gov/geographies/reference-files/time-series/geo/centers-population.html for more information.
A tibble with each of the columns found in the corresponding USpopcenters table, with two columns appended:
geoid - all FIPS code columns combined with paste0().
distance - the number of units the area's LONGITUDE/LATITUDE center of population is away from the coordinates given in center.
if (requireNamespace("USpopcenters", quietly = TRUE) &&
    requireNamespace("geosphere", quietly = TRUE)) {
  # All states whose centers of population are within 300 kilometers of the
  # center of population of New York County, New York (i.e., Manhattan):
  areas_in_radius(
    geography = "state",
    center = lon_lat_from_area(state = "NY", county = "New York"),
    radius = 300,
    units = "km"
  )
  # The four census tracts whose centers of population are closest to the
  # Four Corners (distance column is in meters due to setting units = NULL):
  closest_n_areas("tract", center = c(-109.0452, 36.9991), n = 4, units = NULL)
  # The counties closest to center of population of Kauai County, Hawaii whose
  # total population reaches 3 million people:
  closest_population(
    geography = "county",
    center = lon_lat_from_area("15007"),
    population = 3e6,
    units = "barleycorns"
  )
}
Returns a tibble or sf of GEOIDs, names, and decennial census population of user-specified locations.
get_geoids( geography, state = NULL, county = NULL, geoid = NULL, year = 2010, geometry = FALSE, cache_tables = TRUE, key = NULL, ... )
geography |
A character string denoting the level of census geography
whose GEOIDs you'd like to obtain. Must be one of Note that block-level data cannot be obtained from 1990 and 2000 decennial
census data due to limitations in |
state , county , geoid , geometry , cache_tables , key
|
See the descriptions of
the arguments in |
year |
Single integer specifying the year of US Census data to use.
Defaults to 2010. Based on this year, data from the most recent decennial
census will be returned (specifically, |
... |
Additional arguments to be passed to
|
This allows users to quickly obtain all GEOIDs in a specified location at a specific level of geography without having to manually look them up somewhere else.
This facilitates calls to get_adi() that involve somewhat complicated reference areas.
## Not run:
# Wrapped in \dontrun{} because it requires a Census API key.
# Get all tract GEOIDs for Manhattan
tracts <- get_geoids(geography = "tract", state = "New York", county = "New York")
tracts
# Get all block GEOIDs for the fifth tract on that list
get_geoids(geography = "block", geoid = tracts$GEOID[5])
## End(Not run)
The user specifies a census area, and the function returns the longitude/latitude coordinates of the area's center of population according to the decennial census.
lon_lat_from_area(geoid = NULL, state = NULL, county = NULL, year = 2020)
geoid |
A single string specifying the geoid of a census area. Must be
2, 5, 11, or 12 digits. Must be |
state |
A single string containing the FIPS code, two-letter
abbreviation, or full state name of a US state or the District of Columbia
or Puerto Rico. Not case sensitive. Must be |
county |
A single string specifying the name of a county in |
year |
One of 2020, 2010, or 2000. Defaults to 2020. |
Centers of population are based on the decennial census. Only states, counties, tracts, and block groups are currently supported. See the documentation of the USpopcenters package and https://www.census.gov/geographies/reference-files/time-series/geo/centers-population.html for more information.
Requires the data package USpopcenters to be installed.
A double vector of length 2. The first element is LONGITUDE (positive for east, negative for west). The second element is LATITUDE (positive for north, negative for south).
if (requireNamespace("USpopcenters", quietly = TRUE)) {
  # The center of population of Alaska
  lon_lat_from_area(state = "alAskA")
  # The center of population of Cook County, Illinois.
  lon_lat_from_area(state = "IL", county = "Cook")
  # The center of population of some tract in Manhattan
  lon_lat_from_area(geoid = "36061021600")
}
Returns a data set of synthetic individuals based on user-specified US Census areas. The age, sex, race, and ethnicity of each individual is probabilistic, based on the demographics of the areas as reported in a user-specified US Census data set.
synthetic_population( geography, state = NULL, county = NULL, geoid = NULL, zcta = NULL, year, dataset = c("acs5", "acs3", "acs1", "decennial"), geometry = FALSE, cache_tables = TRUE, max_age = 115, rate = 0.25, key = NULL, seed = NULL, ... )
geography |
A character string denoting the level of US census geography at which you want to create a synthetic population. Required. |
state |
A character string specifying states whose population you want
to synthesize. Defaults to |
county |
A vector of character strings specifying the counties whose
population you want to synthesize. Defaults to |
geoid |
A character vector of GEOIDs (use quotation marks and leading
zeros). Defaults to |
zcta |
A character vector of ZCTAs or the leading digit(s) of ZCTAs (use
quotation marks and leading zeros). Defaults to Strings under 5 digits long will yield all ZCTAs that begin with those digits. Requires that |
year , dataset
|
Specifies the US Census data set on which to base the demographic profile of your synthetic population.
When Important: data are not always available depending on the level of geography and data set chosen. See https://www.census.gov/programs-surveys/acs/guidance/estimates.html. |
geometry |
Logical value indicating whether or not shapefile data should be included in the result, making the result an sf object. The shapefile data that is returned is somewhat customizable by passing certain arguments along to the |
cache_tables |
The plural version of the |
max_age |
A single integer representing the largest possible age that can appear in the data set. Simulated age values exceeding this value will be top-coded to this value. Defaults to 115. See details. |
rate |
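A single number, passed to stats::rexp() as the rate of the exponential distribution that generates ages in the highest age bracket. See details. |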
key |
Your Census API key as a character string. Obtain one at http://api.census.gov/data/key_signup.html. Defaults to NULL. |
seed |
Passed onto |
... |
Additional arguments to be passed onto This may be found to be helpful when setting |
Returns a tibble or sf object where each row represents a synthetic person. Each person has an age, sex, race, and ethnicity. The probability that a person has a given age/sex/race/ethnicity combination equals that combination's proportion of the population in their census area, as reported in the user-specified US Census data set (e.g., 2010 Decennial Census or 2017 ACS 5-year estimates). The number of rows in the data set will equal the number of people living in the user-specified US Census areas, as reported in the same US Census data set.
If geometry = FALSE (the default), a tibble. If geometry = TRUE is specified, an sf object.
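The count-weighted draw described above can be illustrated with a minimal base R sketch. This is not the package's internal code: the area_counts table, its column names, and its numbers are made up for illustration, and the sketch assumes that each synthetic person is drawn independently with probability proportional to the reported count of each demographic cell.

```r
# Hypothetical counts for one census area (made-up numbers, not Census data)
area_counts <- data.frame(
  sex    = c("Male", "Male", "Female", "Female"),
  age_lo = c(0, 18, 0, 18),
  age_hi = c(17, 84, 17, 84),
  count  = c(120, 380, 110, 390)
)

set.seed(1)

# One synthetic person per person counted in the area
n_people <- sum(area_counts$count)

# Assumption: each person's demographic cell is sampled with probability
# proportional to that cell's count
rows <- sample(
  seq_len(nrow(area_counts)),
  size = n_people,
  replace = TRUE,
  prob = area_counts$count
)

synthetic <- area_counts[rows, c("sex", "age_lo", "age_hi")]
rownames(synthetic) <- NULL

nrow(synthetic)
table(synthetic$sex)
```

Because the cells are sampled rather than replicated exactly, the simulated totals per cell will be close to, but not necessarily identical to, the reported counts.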
US Census data provide counts of the number of people in age brackets of varying widths. The age_lo and age_hi columns in the output give the age bracket of each individual in the synthetic population. The age column contains a probabilistically generated, non-whole-number age within that bracket. A uniform distribution (via stats::runif()) guides this age generation for all age brackets except the highest one ("age 85 and over" in the extant ACS and Decennial Census data). An exponential distribution (via stats::rexp()) guides the age generation for this highest age bracket, and the user can specify rate to customize the exponential distribution that is used.
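The age generation described above can be sketched as follows. This is a hedged illustration, not the package's implementation. The helper simulate_age() is hypothetical, and it rests on three assumptions: the open-ended top bracket is marked with age_hi = Inf, a closed bracket such as 20 to 24 covers ages in [20, 25), and the exponential draw is added to the lower bound of the top bracket. Top-coding at max_age follows the description of the max_age argument.

```r
# Hypothetical helper: simulate a non-whole-number age for each person
simulate_age <- function(age_lo, age_hi, max_age = 115, rate = 0.25) {
  # Assumption: the highest ("85 and over") bracket has age_hi = Inf
  open_ended <- is.infinite(age_hi)
  age <- numeric(length(age_lo))

  # Closed brackets: uniform draw across the bracket
  # (assumption: bracket 20-24 covers ages from 20 up to, not including, 25)
  age[!open_ended] <- stats::runif(
    sum(!open_ended),
    min = age_lo[!open_ended],
    max = age_hi[!open_ended] + 1
  )

  # Highest bracket: lower bound plus an exponential draw
  age[open_ended] <- age_lo[open_ended] +
    stats::rexp(sum(open_ended), rate = rate)

  # Top-code simulated ages that exceed max_age
  pmin(age, max_age)
}

set.seed(42)
simulate_age(
  age_lo = c(20, 85, 85),
  age_hi = c(24, Inf, Inf),
  max_age = 87,
  rate = 10
)
```

The mean of an exponential distribution is 1 / rate, so under these assumptions the default rate = 0.25 places people in the top bracket about four years past its lower bound on average, while rate = 10 keeps nearly everyone within a fraction of a year of it.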
## Not run:
# Wrapped in \dontrun{} because all these examples take >5 seconds
# and require a Census API key.

# Synthetic population for Utah, using the 2019 ACS 5-year estimates:
synthetic_population(geography = "state", state = "UT", year = 2019)

# Same, but make it so that survival past age 85 is highly unlikely
# (via rate = 10), and so that 87 is the maximum possible age
synthetic_population(
  geography = "state",
  state = "UT",
  year = 2019,
  max_age = 87,
  rate = 10
)

# Synthetic population of the Delmarva Peninsula at the census tract level,
# using 2000 Decennial Census data
synthetic_population(
  geography = "tract",
  geoid =
    # This two-digit GEOID is the state of Delaware.
    c("10",
      # These five-digit GEOIDs are specific counties in Virginia and Maryland
      "51001", "51131", "24015", "24029", "24035", "24011", "24041", "24019",
      "24045", "24039", "24047"),
  year = 2000,
  dataset = "decennial"
)

## End(Not run)