Chapter 2 Session 2- Generalized Variance Function
One of the most important inputs in the area model is the variance of the direct estimator at the domain level, which cannot be calculated directly. Accordingly, this value must be estimated from the data collected in each domain. However, in domains with very small sample sizes, these estimations will not perform well. Hence, it is very useful to use a smoothing model for variances to eliminate noise and volatility from these estimations and extract the true signal of the process.
Hidiroglou (2019) states that \(E_{\mathscr{MP}}\left(\hat{\theta}^{dir}_d\right)=\boldsymbol{x}^{T}_{d}\boldsymbol{\beta}\) and \(V_{\mathscr{MP}}\left(\hat{\theta}^{dir}_d\right)=\sigma_{u}^2+\tilde{\sigma}^2_{d}\), where the subscript \(\mathscr{MP}\) refers to the double inference that must be taken into account in these adjustments and defines the joint probability measure between the model and the sampling design.
\(\mathscr{M}\) refers to the probability measure induced by modeling and the inclusion of auxiliary covariates (\(\boldsymbol{x}_{d}\)).
\(\mathscr{MP}\) refers to the probability measure induced by the complex sampling design that yields direct estimations.
The solution proposed here is known as the Generalized Variance Function, which involves fitting a log-linear model to the estimated direct variance. Starting from the fact that an unbiased estimator of \(\sigma^2\) denoted by \(\hat{\sigma}^2\) is available, it follows that: \[ E_{\mathscr{MP}}\left(\hat{\sigma}_{d}^{2}\right)=E_{\mathscr{M}}\left(E_{\mathscr{P}}\left(\hat{\sigma}_{d}^{2}\right)\right)=E_{\mathscr{M}}\left(\sigma_{d}^{2}\right)=\tilde{\sigma}_{d}^{2} \]
The above equality can be interpreted as an unbiased and simple estimator of \(\tilde{\sigma}_{d}^{2}\), denoted as \(\hat{\sigma}_{d}^{2}\). However, this sampling estimator is unstable when the sample size is small, which is precisely the dominant paradigm in small area estimation. Rivest and Belmonte (2000) consider smoothing models for the estimation of direct variances defined as follows:
\[ \log\left(\hat{\sigma}_{d}^{2}\right)=\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}+\boldsymbol{\varepsilon}_{d} \]
Where \(\boldsymbol{z}_{d}\) is a vector of explanatory covariates that are functions of \(\boldsymbol{x}_{d}\), \(\boldsymbol{\alpha}\) is a vector of parameters to be estimated, \(\boldsymbol{\varepsilon}_{d}\) are random errors with zero mean and constant variance, assumed to be identically distributed conditionally on \(\boldsymbol{z}_{d}\). From the above model, the smoothed estimation of the sampling variance is given by:
\[ \tilde{\sigma}_{d}^{2}=E_{\mathscr{MP}}\left(\sigma_{d}^{2}\right)=\exp\left(\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}\right)\times\Delta \]
Where \(E_{\mathscr{MP}}\left(\varepsilon_{d}\right)=\Delta\). There’s no need to specify a parametric distribution for the errors of this model. Using the method of moments, the following unbiased estimator for \(\Delta\) is obtained:
\[ \hat{\Delta}=\frac{\sum_{d=1}^{D}\hat{\sigma}_{d}^{2}}{\sum_{d=1}^{D}\exp\left(\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}\right)} \]
Similarly, using standard procedures in linear regression, the estimation of the regression parameter coefficients is given by the following expression:
\[ \hat{\boldsymbol{\alpha}}=\left(\sum_{d=1}^{D}\boldsymbol{z}_{d}\boldsymbol{z}_{d}^{T}\right)^{-1}\sum_{d=1}^{D}\boldsymbol{z}_{d}\log\left(\hat{\sigma}_{d}^{2}\right) \]
Finally, the smoothed estimator of the sampling variance is defined as:
\[ \hat{\tilde{\sigma}}_{d}^{2}=\exp\left(\boldsymbol{z}_{d}^{T}\hat{\boldsymbol{\alpha}}\right)\hat{\Delta} \]
Survey Data:
The following code processes data using various R packages such as survey, tidyverse, srvyr, TeachingSampling, and haven.
Library Loading: The code loads the necessary libraries (
survey,tidyverse,srvyr,TeachingSampling,haven) required for data manipulation and analysis.Survey Data Set Reading: The code reads the dataset named ‘data_JAM.rds’ using the
readRDSfunction.Data Manipulation:
- It creates new variables (
dam2,fep,upm,estrato,ingreso,pobreza) using themutatefunction from thedplyrpackage. - Filters the dataset based on a specific condition using the
filterfunction.
- It creates new variables (
library(survey)
library(tidyverse)
library(srvyr)
library(TeachingSampling)
library(haven)
#read in data set
# Q518A: Gross average income From Employment
encuesta <- read_rds("Recursos/02_FGV/01_data_JAM.rds")
encuesta <-
encuesta %>%
mutate(
dam2,
fep = RFACT/4,
upm = paste0(PAR_COD , CONST_NUMBER, ED_NUMBER),
estrato = ifelse(is.na(STRATA) ,strata,STRATA),
ingreso = case_when(!is.na(Q518A) & !is.na(Q518B) ~ Q518A + Q518B,
!is.na(Q518A) & !is.na(Q518B) ~ NA_real_,
!is.na(Q518A) & is.na(Q518B) ~ Q518A,
is.na(Q518A) & !is.na(Q518B) ~ Q518B),
pobreza = ifelse(ingreso < 50,1,0)
) %>% filter(ingreso > 11)dam2: Corresponds to the code assigned to the country’s second administrative division.
The income definition is structured in this manner to illustrate the process utilized in the small area estimation methodology for poverty estimation.
| ID | QUARTER | EMPSTATUS | STRATA | CLUSTER | CONST_NUMBER | ED_NUMBER | URCODE | SEX | AGE | Q518A | Q518AP | Q518B | Q518BP | RFACT | PAR_COD | PARISH | strata | cluster | dam2 | fep | upm | estrato | ingreso | pobreza |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 02010105902170101 | 3 | 5 | 014 | 014011 | 01 | 059 | 1 | 2 | 44 | 9 | NA | 9e+00 | NA | 127.4657 | 01 | Kingston | NA | NA | 0101 | 31.8664 | 0101059 | 014 | 18 | 1 |
| 02020207900130103 | 3 | 5 | 235 | 235472 | 02 | 079 | 2 | 1 | 36 | 0 | NA | 1e+05 | 4 | 164.6809 | 12 | Manchester | NA | NA | 1202 | 41.1702 | 1202079 | 235 | 100000 | 0 |
| 02020404901820105 | 3 | 5 | 239 | 239488 | 04 | 049 | 2 | 1 | 25 | NA | NA | 1e+04 | 3 | 152.1773 | 12 | Manchester | NA | NA | 1204 | 38.0443 | 1204049 | 239 | 10000 | 0 |
| 04010400700370102 | 3 | 5 | 036 | 036073 | 04 | 007 | 1 | 2 | 34 | 9 | NA | 9e+00 | NA | 188.7227 | 02 | St Andrew | NA | NA | 0204 | 47.1807 | 0204007 | 036 | 18 | 1 |
| 04030406400160106 | 3 | 5 | 221 | 221453 | 04 | 064 | 3 | 1 | 20 | 15000 | 1 | 0e+00 | NA | 308.0567 | 11 | St Elizabeth | NA | NA | 1104 | 77.0142 | 1104064 | 221 | 15000 | 0 |
| 05020506600550101 | 3 | 5 | 164 | 164352 | 05 | 066 | 2 | 2 | 60 | 9 | NA | 9e+00 | NA | 94.0227 | 08 | St James | NA | NA | 0805 | 23.5057 | 0805066 | 164 | 18 | 1 |
| 06030106300920101 | 3 | 5 | 122 | 122227 | 01 | 063 | 3 | 1 | 44 | 9 | NA | 9e+00 | NA | 252.1380 | 05 | St Mary | NA | NA | 0501 | 63.0345 | 0501063 | 122 | 18 | 1 |
| 06030200300140102 | 3 | 5 | 228 | 228464 | 02 | 003 | 3 | 2 | 70 | 9 | NA | 9e+00 | NA | 49.1197 | 12 | Manchester | NA | NA | 1202 | 12.2799 | 1202003 | 228 | 18 | 1 |
| 06030204100960101 | 3 | 5 | 084 | 084180 | 02 | 041 | 3 | 2 | 54 | 9 | NA | 9e+00 | NA | 97.2597 | 03 | St Thomas | NA | NA | 0302 | 24.3149 | 0302041 | 084 | 18 | 1 |
| 06030208501550102 | 3 | 5 | 234 | 234474 | 02 | 085 | 3 | 2 | 77 | 0 | NA | 3e+04 | 2 | 70.9262 | 12 | Manchester | NA | NA | 1202 | 17.7315 | 1202085 | 234 | 30000 | 0 |
In the following code block, the libraries survey and srvyr are used to create a sampling design from a survey database. The sampling design encompasses information about primary sampling units (PSUs), sampling weights (wkx), and strata (estrato) utilized in the sampling. Additionally, the “survey.lonely.psu” option is employed to adjust sample sizes within groups of primary sampling units that lack other primary sampling units within the same group.
library(survey)
library(srvyr)
options(survey.lonely.psu = "adjust")
id_dominio <- "dam2"
diseno <-
as_survey_design(
ids = upm,
weights = fep,
strata = estrato,
nest = TRUE,
.data = encuesta
)
#summary(diseno)´
- Indicator Calculation:
- Groups the sampling design by domain ID and calculates indicators related to poverty (
n_pobrezaandpobreza).n_pobrezacounts the number of instances wherepobrezaequals 1 (indicating poverty), whilepobrezacomputes the survey mean of thepobrezavariable with specific variance estimations.
- Groups the sampling design by domain ID and calculates indicators related to poverty (
# Calculating indicators related to poverty and counts of UPMS per domain
# Indicator Calculation:
# Grouping the sampling design by domain ID and summarizing variables related to poverty.
indicador_dam <-
diseno %>% group_by_at(id_dominio) %>%
summarise(
n_pobreza = unweighted(sum(pobreza == 1)),
pobreza = survey_mean(pobreza,
vartype = c("se", "var"),
deff = T
)
)- Counts of UPMS per Domain:
- Extracts domain ID and UPMs from the survey dataset, obtaining unique UPMs per domain and counting them.
- Joins the count of unique UPMs per domain with the previously calculated indicators related to poverty.
# Counts of UPMS per domain:
# Selecting domain ID and UPMs, obtaining unique UPMs per domain, and counting them.
# Joining the count of unique UPMs per domain with previously calculated indicators related to poverty.
indicador_dam <- encuesta %>% select(id_dominio, upm) %>%
distinct() %>%
group_by_at(id_dominio) %>%
tally(name = "n_upm") %>%
inner_join(indicador_dam, by = id_dominio)
# saveRDS(directodam2, "Recursos/02_FGV/indicador_dam.Rds")| dam2 | n_upm | n_pobreza | pobreza | pobreza_se | pobreza_var | pobreza_deff |
|---|---|---|---|---|---|---|
| 0101 | 17 | 237 | 0.9980 | 0.0021 | 0.0000 | 5.228000e-01 |
| 0102 | 12 | 197 | 0.9836 | 0.0089 | 0.0001 | 1.053700e+00 |
| 0103 | 9 | 65 | 1.0000 | 0.0000 | 0.0000 | 3.065785e+32 |
| 0201 | 11 | 244 | 1.0000 | 0.0000 | 0.0000 | NaN |
| 0202 | 11 | 141 | 0.9391 | 0.0292 | 0.0009 | 2.317700e+00 |
| 0203 | 12 | 101 | 0.8117 | 0.0775 | 0.0060 | 4.795200e+00 |
| 0204 | 11 | 224 | 1.0000 | 0.0000 | 0.0000 | NaN |
| 0205 | 9 | 85 | 0.9646 | 0.0331 | 0.0011 | 2.932600e+00 |
| 0206 | 8 | 102 | 1.0000 | 0.0000 | 0.0000 | NaN |
| 0207 | 11 | 149 | 1.0000 | 0.0000 | 0.0000 | NaN |
Domains with 5 or more UPMs and all those with a deff greater than 1 are now filtered.
Next, the transformation \(\log(\hat{\sigma}^2_d)\) is performed. Additionally, the selection of the municipality identifier columns (dam2), the direct estimation (pobreza), the number of people in the domain (nd), and the estimated variance for the direct estimation (vardir) is carried out, the latter being transformed using the log() function.