4.3 Reading the survey and direct estimates
This code reads the Jamaican labor survey from a file in RDS format into the encuesta object and prepares it for estimation. Here is the breakdown:

Data reading: The survey is read from the RDS file located at 'Recursos/05_Empleo/01_data_JAM.rds' and stored in the encuesta object.

Data transformation: Using the %>% pipe and the transmute() function from the dplyr package, which keeps only the columns it names, the following transformations are applied to the survey:
- The columns dam2, RFACT, PAR_COD, CONST_NUMBER, ED_NUMBER, STRATA, and EMPSTATUS are taken from the survey.
- More descriptive columns are built from them: fep is a copy of RFACT; upm concatenates PAR_COD, CONST_NUMBER, and ED_NUMBER; estrato takes the value of STRATA and falls back to the strata column where STRATA is missing; empleo_label and empleo are both derived from EMPSTATUS and hold the category labels and the category values, respectively.
In summary, the code reads a labor survey in Jamaica and performs a series of transformations on selected columns, renaming and reorganizing them for future analyses or processing.
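As a small-scale illustration of two of these transformations, the sketch below builds upm and estrato on a made-up data frame. The column names mirror the survey, but every value is hypothetical, and the lowercase strata column is assumed to be a fallback source for the stratum code.

```r
library(dplyr)

# Hypothetical toy data: column names mirror the survey, values are made up
toy <- data.frame(
  PAR_COD      = c("01", "01", "02"),
  CONST_NUMBER = c("03", "03", "01"),
  ED_NUMBER    = c("007", "012", "004"),
  STRATA       = c(11, NA, 23),
  strata       = c(11, 12, 23),      # assumed fallback column
  RFACT        = c(150.5, 98.2, 210.0)
)

toy_clean <- toy %>%
  transmute(
    fep     = RFACT,                                      # survey weight
    upm     = paste0(PAR_COD, CONST_NUMBER, ED_NUMBER),   # PSU identifier
    estrato = ifelse(is.na(STRATA), strata, STRATA)       # fill missing stratum
  )

toy_clean$upm      # "0103007" "0103012" "0201004"
toy_clean$estrato  # 11 12 23
names(toy_clean)   # "fep" "upm" "estrato"
```

Only the three columns named inside transmute() remain, which is why the real code lists dam2 explicitly in order to keep it.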
encuesta <- readRDS('Recursos/05_Empleo/01_data_JAM.rds')

# Name of the column that identifies the domain (used for grouping later)
id_dominio <- "dam2"

encuesta <-
  encuesta %>%
  transmute(
    dam2,
    fep = RFACT,                                     # survey weight
    upm = paste0(PAR_COD, CONST_NUMBER, ED_NUMBER),  # PSU identifier
    estrato = ifelse(is.na(STRATA), strata, STRATA), # fall back to strata if STRATA is NA
    empleo_label = as_factor(EMPSTATUS, levels = "labels"),
    empleo = as_factor(EMPSTATUS, levels = "values")
  )

The next block of code defines the sampling design used to analyze the survey in R. The first line sets the option that controls how strata containing a single PSU (primary sampling unit) are handled in standard error calculations; with 'adjust', the variance contribution of such a stratum is computed by centering at the overall sample mean rather than at the stratum mean. The second statement uses the as_survey_design() function from the srvyr package, a dplyr-style interface to the survey package, to define the sampling design. The function takes encuesta as input, together with the following parameters:
- strata: The variable defining the strata in the survey, in this case the estrato variable.
- ids: The variable identifying the PSUs in the survey, here the upm variable.
- weights: The variable holding the survey weight of each observation, in this case the fep variable.
- nest: A logical parameter. When set to TRUE, PSU identifiers are treated as nested within strata, so the same upm code appearing in two different strata is counted as two distinct PSUs.
Together, these steps allow defining a sampling design that takes into account the sampling characteristics and the weights assigned to each observation in the survey. This is necessary to obtain precise and representative estimations of the parameters of interest.
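The effect of these parameters can be seen on a toy sample. The sketch below is only an illustration under stated assumptions: the data frame, the variable y, and all values are made up, and the PSU codes 1 and 2 are deliberately reused in both strata to show why nest = TRUE matters.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample: two strata, each with PSUs coded 1 and 2
toy <- data.frame(
  estrato = c("A", "A", "A", "A", "B", "B", "B", "B"),
  upm     = c(1, 1, 2, 2, 1, 1, 2, 2),
  fep     = c(10, 10, 20, 20, 5, 5, 15, 15),
  y       = c(1, 0, 1, 1, 0, 0, 1, 0)
)

toy_design <- toy %>%
  as_survey_design(
    strata  = estrato,
    ids     = upm,
    weights = fep,
    nest    = TRUE   # PSU 1 in stratum A and PSU 1 in stratum B are distinct
  )

# Weighted mean of y with its design-based standard error
toy_est <- toy_design %>%
  summarise(p = survey_mean(y, vartype = "se"))

toy_est
```

The point estimate is the weighted mean, sum(fep * y) / sum(fep) = 65 / 100 = 0.65, while the standard error depends on the strata and PSU structure declared in the design. With reused PSU codes like these, the design would usually not be accepted without nest = TRUE, because the clusters would not appear nested within strata.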
options(survey.lonely.psu = 'adjust')

diseno <- encuesta %>%
  as_survey_design(
    strata = estrato,
    ids = upm,
    weights = fep,
    nest = TRUE
  )

The following code computes direct estimates by domain from the survey design represented by the object diseno.
Grouping and filtering: The code uses the %>% operator to chain operations. It first groups the data by the domain identifier (id_dominio) using group_by_at() and then keeps only the observations where the variable empleo takes the values 3, 4, or 5.

Variable summary: With the summarise() function, it computes several summaries for the categories of empleo within each domain. These include the unweighted sample counts of employed, unemployed, and inactive individuals (n_ocupado, n_desocupado, n_inactivo), obtained with unweighted(). It also uses the survey_mean() function to estimate the weighted proportion of each category of empleo, requesting the standard error and the variance (vartype = c("se", "var")) as well as the design effect (deff).
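To make the difference between the two kinds of summary concrete, the following base R sketch reproduces the arithmetic behind the point estimates for a single domain. The employment codes follow the ones used in this section (3 = employed, 4 = unemployed, 5 = inactive), but the six observations and their weights are hypothetical, and the sketch does not compute standard errors or design effects, which need the full design object.

```r
# Hypothetical toy domain: employment codes and made-up survey weights
empleo <- c(3, 3, 4, 5, 3, 5)
fep    <- c(100, 150, 80, 120, 50, 100)

# Unweighted sample count, analogous to unweighted(sum(empleo == 3))
n_ocupado <- sum(empleo == 3)

# Weighted proportion: the point estimate behind survey_mean(empleo == 3)
p_ocupado <- sum(fep * (empleo == 3)) / sum(fep)

n_ocupado  # 3
p_ocupado  # 0.5
```

Here three of the six respondents are employed, and they carry 300 of the 600 units of weight. The count and the proportion agree on 50% only because of how these weights were chosen; in general they differ.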
indicador_dam <-
  diseno %>% group_by_at(id_dominio) %>%
  filter(empleo %in% c(3:5)) %>%
  summarise(
    n_ocupado = unweighted(sum(empleo == 3)),
    n_desocupado = unweighted(sum(empleo == 4)),
    n_inactivo = unweighted(sum(empleo == 5)),
    Ocupado = survey_mean(empleo == 3,
      vartype = c("se", "var"),
      deff = TRUE
    ),
    Desocupado = survey_mean(empleo == 4,
      vartype = c("se", "var"),
      deff = TRUE
    ),
    Inactivo = survey_mean(empleo == 5,
      vartype = c("se", "var"),
      deff = TRUE
    )
  )

- UPM counts by domain: The next code operates on the survey data. It first selects the columns id_dominio and upm, removes duplicate rows, and then counts the number of unique upm values for each id_dominio. It then performs an inner join of these counts with the existing object indicador_dam on the id_dominio column, so that each domain's estimates are accompanied by its number of unique PSUs.