Chapter 2 Session 2- Generalized Variance Function

One of the most important inputs in the area model is the variance of the direct estimator at the domain level, which cannot be calculated directly. Accordingly, this value must be estimated from the data collected in each domain. However, in domains with very small sample sizes, these estimations will not perform well. Hence, it is very useful to use a smoothing model for variances to eliminate noise and volatility from these estimations and extract the true signal of the process.

Hidiroglou (2019) states that \(E_{\mathscr{MP}}\left(\hat{\theta}^{dir}_d\right)=\boldsymbol{x}^{T}_{d}\boldsymbol{\beta}\) and \(V_{\mathscr{MP}}\left(\hat{\theta}^{dir}_d\right)=\sigma_{u}^2+\tilde{\sigma}^2_{d}\), where the subscript \(\mathscr{MP}\) refers to the double inference that must be taken into account in these adjustments and defines the joint probability measure between the model and the sampling design.

  • \(\mathscr{M}\) refers to the probability measure induced by modeling and the inclusion of auxiliary covariates (\(\boldsymbol{x}_{d}\)).

  • \(\mathscr{MP}\) refers to the probability measure induced by the complex sampling design that yields direct estimations.

The solution proposed here is known as the Generalized Variance Function, which involves fitting a log-linear model to the estimated direct variance. Starting from the fact that an unbiased estimator of \(\sigma^2\) denoted by \(\hat{\sigma}^2\) is available, it follows that: \[ E_{\mathscr{MP}}\left(\hat{\sigma}_{d}^{2}\right)=E_{\mathscr{M}}\left(E_{\mathscr{P}}\left(\hat{\sigma}_{d}^{2}\right)\right)=E_{\mathscr{M}}\left(\sigma_{d}^{2}\right)=\tilde{\sigma}_{d}^{2} \]

The above equality can be interpreted as an unbiased and simple estimator of \(\tilde{\sigma}_{d}^{2}\), denoted as \(\hat{\sigma}_{d}^{2}\). However, this sampling estimator is unstable when the sample size is small, which is precisely the dominant paradigm in small area estimation. Rivest and Belmonte (2000) consider smoothing models for the estimation of direct variances defined as follows:

\[ \log\left(\hat{\sigma}_{d}^{2}\right)=\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}+\boldsymbol{\varepsilon}_{d} \]

Where \(\boldsymbol{z}_{d}\) is a vector of explanatory covariates that are functions of \(\boldsymbol{x}_{d}\), \(\boldsymbol{\alpha}\) is a vector of parameters to be estimated, \(\boldsymbol{\varepsilon}_{d}\) are random errors with zero mean and constant variance, assumed to be identically distributed conditionally on \(\boldsymbol{z}_{d}\). From the above model, the smoothed estimation of the sampling variance is given by:

\[ \tilde{\sigma}_{d}^{2}=E_{\mathscr{MP}}\left(\sigma_{d}^{2}\right)=\exp\left(\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}\right)\times\Delta \]

Where \(E_{\mathscr{MP}}\left(\varepsilon_{d}\right)=\Delta\). There’s no need to specify a parametric distribution for the errors of this model. Using the method of moments, the following unbiased estimator for \(\Delta\) is obtained:

\[ \hat{\Delta}=\frac{\sum_{d=1}^{D}\hat{\sigma}_{d}^{2}}{\sum_{d=1}^{D}\exp\left(\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}\right)} \]

Similarly, using standard procedures in linear regression, the estimation of the regression parameter coefficients is given by the following expression:

\[ \hat{\boldsymbol{\alpha}}=\left(\sum_{d=1}^{D}\boldsymbol{z}_{d}\boldsymbol{z}_{d}^{T}\right)^{-1}\sum_{d=1}^{D}\boldsymbol{z}_{d}\log\left(\hat{\sigma}_{d}^{2}\right) \]

Finally, the smoothed estimator of the sampling variance is defined as:

\[ \hat{\tilde{\sigma}}_{d}^{2}=\exp\left(\boldsymbol{z}_{d}^{T}\hat{\boldsymbol{\alpha}}\right)\hat{\Delta} \]

Survey Data:

The following code processes data using various R packages such as survey, tidyverse, srvyr, TeachingSampling, and haven.

  1. Library Loading: The code loads the necessary libraries (survey, tidyverse, srvyr, TeachingSampling, haven) required for data manipulation and analysis.

  2. Survey Data Set Reading: The code reads the dataset named ‘data_JAM.rds’ using the readRDS function.

  3. Data Manipulation:

    • It creates new variables (dam2, fep, upm, estrato, ingreso, pobreza) using the mutate function from the dplyr package.
    • Filters the dataset based on a specific condition using the filter function.
library(survey)
library(tidyverse)
library(srvyr)
library(TeachingSampling)
library(haven)
 
#read in data set

# Q518A: Gross average income From Employment
  
encuesta <- read_rds("Recursos/02_FGV/01_data_JAM.rds") 

encuesta <-
  encuesta %>%
  mutate(
    dam2,
    fep = RFACT/4,
    upm = paste0(PAR_COD , CONST_NUMBER, ED_NUMBER),
    estrato = ifelse(is.na(STRATA) ,strata,STRATA),
    ingreso = case_when(!is.na(Q518A) & !is.na(Q518B) ~ Q518A + Q518B, 
                        !is.na(Q518A) & !is.na(Q518B) ~ NA_real_,
                        !is.na(Q518A) & is.na(Q518B) ~ Q518A,
                        is.na(Q518A) & !is.na(Q518B) ~ Q518B),
    pobreza = ifelse(ingreso < 50,1,0)
  ) %>% filter(ingreso > 11)

dam2: Corresponds to the code assigned to the country’s second administrative division.

The income definition is structured in this manner to illustrate the process utilized in the small area estimation methodology for poverty estimation.

ID QUARTER EMPSTATUS STRATA CLUSTER CONST_NUMBER ED_NUMBER URCODE SEX AGE Q518A Q518AP Q518B Q518BP RFACT PAR_COD PARISH strata cluster dam2 fep upm estrato ingreso pobreza
02010105902170101 3 5 014 014011 01 059 1 2 44 9 NA 9e+00 NA 127.4657 01 Kingston NA NA 0101 31.8664 0101059 014 18 1
02020207900130103 3 5 235 235472 02 079 2 1 36 0 NA 1e+05 4 164.6809 12 Manchester NA NA 1202 41.1702 1202079 235 100000 0
02020404901820105 3 5 239 239488 04 049 2 1 25 NA NA 1e+04 3 152.1773 12 Manchester NA NA 1204 38.0443 1204049 239 10000 0
04010400700370102 3 5 036 036073 04 007 1 2 34 9 NA 9e+00 NA 188.7227 02 St Andrew NA NA 0204 47.1807 0204007 036 18 1
04030406400160106 3 5 221 221453 04 064 3 1 20 15000 1 0e+00 NA 308.0567 11 St Elizabeth NA NA 1104 77.0142 1104064 221 15000 0
05020506600550101 3 5 164 164352 05 066 2 2 60 9 NA 9e+00 NA 94.0227 08 St James NA NA 0805 23.5057 0805066 164 18 1
06030106300920101 3 5 122 122227 01 063 3 1 44 9 NA 9e+00 NA 252.1380 05 St Mary NA NA 0501 63.0345 0501063 122 18 1
06030200300140102 3 5 228 228464 02 003 3 2 70 9 NA 9e+00 NA 49.1197 12 Manchester NA NA 1202 12.2799 1202003 228 18 1
06030204100960101 3 5 084 084180 02 041 3 2 54 9 NA 9e+00 NA 97.2597 03 St Thomas NA NA 0302 24.3149 0302041 084 18 1
06030208501550102 3 5 234 234474 02 085 3 2 77 0 NA 3e+04 2 70.9262 12 Manchester NA NA 1202 17.7315 1202085 234 30000 0

In the following code block, the libraries survey and srvyr are used to create a sampling design from a survey database. The sampling design encompasses information about primary sampling units (PSUs), sampling weights (wkx), and strata (estrato) utilized in the sampling. Additionally, the “survey.lonely.psu” option is employed to adjust sample sizes within groups of primary sampling units that lack other primary sampling units within the same group.

library(survey)
library(srvyr)
options(survey.lonely.psu = "adjust")
id_dominio <- "dam2"

diseno <-
  as_survey_design(
    ids = upm,
    weights = fep,
    strata = estrato,
    nest = TRUE,
    .data = encuesta
  )

#summary(diseno)

´

  1. Indicator Calculation:
    • Groups the sampling design by domain ID and calculates indicators related to poverty (n_pobreza and pobreza). n_pobreza counts the number of instances where pobreza equals 1 (indicating poverty), while pobreza computes the survey mean of the pobreza variable with specific variance estimations.
# Calculating indicators related to poverty and counts of UPMS per domain

# Indicator Calculation:
# Grouping the sampling design by domain ID and summarizing variables related to poverty.
indicador_dam <-
  diseno %>% group_by_at(id_dominio) %>% 
  summarise(
    n_pobreza = unweighted(sum(pobreza == 1)),
    pobreza = survey_mean(pobreza,
                          vartype = c("se",  "var"),
                          deff = T
    )
  )
  1. Counts of UPMS per Domain:
    • Extracts domain ID and UPMs from the survey dataset, obtaining unique UPMs per domain and counting them.
    • Joins the count of unique UPMs per domain with the previously calculated indicators related to poverty.
# Counts of UPMS per domain:
# Selecting domain ID and UPMs, obtaining unique UPMs per domain, and counting them.
# Joining the count of unique UPMs per domain with previously calculated indicators related to poverty.
indicador_dam <- encuesta %>% select(id_dominio, upm) %>%
  distinct() %>% 
  group_by_at(id_dominio) %>% 
  tally(name = "n_upm") %>% 
  inner_join(indicador_dam, by = id_dominio)
# saveRDS(directodam2, "Recursos/02_FGV/indicador_dam.Rds")
dam2 n_upm n_pobreza pobreza pobreza_se pobreza_var pobreza_deff
0101 17 237 0.9980 0.0021 0.0000 5.228000e-01
0102 12 197 0.9836 0.0089 0.0001 1.053700e+00
0103 9 65 1.0000 0.0000 0.0000 3.065785e+32
0201 11 244 1.0000 0.0000 0.0000 NaN
0202 11 141 0.9391 0.0292 0.0009 2.317700e+00
0203 12 101 0.8117 0.0775 0.0060 4.795200e+00
0204 11 224 1.0000 0.0000 0.0000 NaN
0205 9 85 0.9646 0.0331 0.0011 2.932600e+00
0206 8 102 1.0000 0.0000 0.0000 NaN
0207 11 149 1.0000 0.0000 0.0000 NaN

Domains with 5 or more UPMs and all those with a deff greater than 1 are now filtered.

base_sae <- indicador_dam %>%
  filter(n_upm >= 5,
         pobreza_deff > 1) 

Next, the transformation \(\log(\hat{\sigma}^2_d)\) is performed. Additionally, the selection of the municipality identifier columns (dam2), the direct estimation (pobreza), the number of people in the domain (nd), and the estimated variance for the direct estimation (vardir) is carried out, the latter being transformed using the log() function.

baseFGV <-  base_sae %>% 
  select(dam2, pobreza, nd = n_pobreza, vardir = pobreza_var) %>%
  mutate(ln_sigma2 = log(vardir))