Chapter 2 Session 2- Generalized Variance Function

One of the most important inputs in the area model is the variance of the direct estimator at the domain level, which cannot be calculated directly. Accordingly, this value must be estimated from the data collected in each domain. However, in domains with very small sample sizes, these estimations will not perform well. Hence, it is very useful to use a smoothing model for variances to eliminate noise and volatility from these estimations and extract the true signal of the process.

Hidiroglou (2019) states that \(E_{\mathscr{MP}}\left(\hat{\theta}^{dir}_d\right)=\boldsymbol{x}^{T}_{d}\boldsymbol{\beta}\) and \(V_{\mathscr{MP}}\left(\hat{\theta}^{dir}_d\right)=\sigma_{u}^2+\tilde{\sigma}^2_{d}\), where the subscript \(\mathscr{MP}\) refers to the double inference that must be taken into account in these adjustments and defines the joint probability measure between the model and the sampling design.

\(\mathscr{M}\) refers to the probability measure induced by modeling and the inclusion of auxiliary covariates (\(\boldsymbol{x}_{d}\)).
\(\mathscr{MP}\) refers to the probability measure induced by the complex sampling design that yields direct estimations.

The solution proposed here is known as the Generalized Variance Function, which involves fitting a log-linear model to the estimated direct variance. Starting from the fact that an unbiased estimator of \(\sigma^2\) denoted by \(\hat{\sigma}^2\) is available, it follows that: \[ E_{\mathscr{MP}}\left(\hat{\sigma}_{d}^{2}\right)=E_{\mathscr{M}}\left(E_{\mathscr{P}}\left(\hat{\sigma}_{d}^{2}\right)\right)=E_{\mathscr{M}}\left(\sigma_{d}^{2}\right)=\tilde{\sigma}_{d}^{2} \]

The above equality can be interpreted as an unbiased and simple estimator of \(\tilde{\sigma}_{d}^{2}\), denoted as \(\hat{\sigma}_{d}^{2}\). However, this sampling estimator is unstable when the sample size is small, which is precisely the dominant paradigm in small area estimation. Rivest and Belmonte (2000) consider smoothing models for the estimation of direct variances defined as follows:

\[ \log\left(\hat{\sigma}_{d}^{2}\right)=\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}+\boldsymbol{\varepsilon}_{d} \]

Where \(\boldsymbol{z}_{d}\) is a vector of explanatory covariates that are functions of \(\boldsymbol{x}_{d}\), \(\boldsymbol{\alpha}\) is a vector of parameters to be estimated, \(\boldsymbol{\varepsilon}_{d}\) are random errors with zero mean and constant variance, assumed to be identically distributed conditionally on \(\boldsymbol{z}_{d}\). From the above model, the smoothed estimation of the sampling variance is given by:

\[ \tilde{\sigma}_{d}^{2}=E_{\mathscr{MP}}\left(\sigma_{d}^{2}\right)=\exp\left(\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}\right)\times\Delta \]

Where \(E_{\mathscr{MP}}\left(\varepsilon_{d}\right)=\Delta\). There’s no need to specify a parametric distribution for the errors of this model. Using the method of moments, the following unbiased estimator for \(\Delta\) is obtained:

\[ \hat{\Delta}=\frac{\sum_{d=1}^{D}\hat{\sigma}_{d}^{2}}{\sum_{d=1}^{D}\exp\left(\boldsymbol{z}_{d}^{T}\boldsymbol{\alpha}\right)} \]

Similarly, using standard procedures in linear regression, the estimation of the regression parameter coefficients is given by the following expression:

\[ \hat{\boldsymbol{\alpha}}=\left(\sum_{d=1}^{D}\boldsymbol{z}_{d}\boldsymbol{z}_{d}^{T}\right)^{-1}\sum_{d=1}^{D}\boldsymbol{z}_{d}\log\left(\hat{\sigma}_{d}^{2}\right) \]

Finally, the smoothed estimator of the sampling variance is defined as:

\[ \hat{\tilde{\sigma}}_{d}^{2}=\exp\left(\boldsymbol{z}_{d}^{T}\hat{\boldsymbol{\alpha}}\right)\hat{\Delta} \]

Survey Data:

The following code processes data using various R packages such as survey, tidyverse, srvyr, TeachingSampling, and haven.

Library Loading: The code loads the necessary libraries (survey, tidyverse, srvyr, TeachingSampling, haven) required for data manipulation and analysis.
Survey Data Set Reading: The code reads the dataset named ‘data_JAM.rds’ using the readRDS function.
Data Manipulation:
- It creates new variables (dam2, fep, upm, estrato, ingreso, pobreza) using the mutate function from the dplyr package.
- Filters the dataset based on a specific condition using the filter function.

library(survey)
library(tidyverse)
library(srvyr)
library(TeachingSampling)
library(haven)
 
#read in data set

# Q518A: Gross average income From Employment
  
encuesta <- read_rds("Recursos/02_FGV/01_data_JAM.rds") 

encuesta <-
  encuesta %>%
  mutate(
    dam2,
    fep = RFACT/4,
    upm = paste0(PAR_COD , CONST_NUMBER, ED_NUMBER),
    estrato = ifelse(is.na(STRATA) ,strata,STRATA),
    ingreso = case_when(!is.na(Q518A) & !is.na(Q518B) ~ Q518A + Q518B, 
                        !is.na(Q518A) & !is.na(Q518B) ~ NA_real_,
                        !is.na(Q518A) & is.na(Q518B) ~ Q518A,
                        is.na(Q518A) & !is.na(Q518B) ~ Q518B),
    pobreza = ifelse(ingreso < 50,1,0)
  ) %>% filter(ingreso > 11)

dam2: Corresponds to the code assigned to the country’s second administrative division.

The income definition is structured in this manner to illustrate the process utilized in the small area estimation methodology for poverty estimation.

ID	QUARTER	EMPSTATUS	STRATA	CLUSTER	CONST_NUMBER	ED_NUMBER	URCODE	SEX	AGE	Q518A	Q518AP	Q518B	Q518BP	RFACT	PAR_COD	PARISH	strata	cluster	dam2	fep	upm	estrato	ingreso	pobreza
02010105902170101	3	5	014	014011	01	059	1	2	44	9	NA	9e+00	NA	127.4657	01	Kingston	NA	NA	0101	31.8664	0101059	014	18	1
02020207900130103	3	5	235	235472	02	079	2	1	36	0	NA	1e+05	4	164.6809	12	Manchester	NA	NA	1202	41.1702	1202079	235	100000	0
02020404901820105	3	5	239	239488	04	049	2	1	25	NA	NA	1e+04	3	152.1773	12	Manchester	NA	NA	1204	38.0443	1204049	239	10000	0
04010400700370102	3	5	036	036073	04	007	1	2	34	9	NA	9e+00	NA	188.7227	02	St Andrew	NA	NA	0204	47.1807	0204007	036	18	1
04030406400160106	3	5	221	221453	04	064	3	1	20	15000	1	0e+00	NA	308.0567	11	St Elizabeth	NA	NA	1104	77.0142	1104064	221	15000	0
05020506600550101	3	5	164	164352	05	066	2	2	60	9	NA	9e+00	NA	94.0227	08	St James	NA	NA	0805	23.5057	0805066	164	18	1
06030106300920101	3	5	122	122227	01	063	3	1	44	9	NA	9e+00	NA	252.1380	05	St Mary	NA	NA	0501	63.0345	0501063	122	18	1
06030200300140102	3	5	228	228464	02	003	3	2	70	9	NA	9e+00	NA	49.1197	12	Manchester	NA	NA	1202	12.2799	1202003	228	18	1
06030204100960101	3	5	084	084180	02	041	3	2	54	9	NA	9e+00	NA	97.2597	03	St Thomas	NA	NA	0302	24.3149	0302041	084	18	1
06030208501550102	3	5	234	234474	02	085	3	2	77	0	NA	3e+04	2	70.9262	12	Manchester	NA	NA	1202	17.7315	1202085	234	30000	0

In the following code block, the libraries survey and srvyr are used to create a sampling design from a survey database. The sampling design encompasses information about primary sampling units (PSUs), sampling weights (wkx), and strata (estrato) utilized in the sampling. Additionally, the “survey.lonely.psu” option is employed to adjust sample sizes within groups of primary sampling units that lack other primary sampling units within the same group.

library(survey)
library(srvyr)
options(survey.lonely.psu = "adjust")
id_dominio <- "dam2"

diseno <-
  as_survey_design(
    ids = upm,
    weights = fep,
    strata = estrato,
    nest = TRUE,
    .data = encuesta
  )

#summary(diseno)

Indicator Calculation:
- Groups the sampling design by domain ID and calculates indicators related to poverty (n_pobreza and pobreza). n_pobreza counts the number of instances where pobreza equals 1 (indicating poverty), while pobreza computes the survey mean of the pobreza variable with specific variance estimations.

# Calculating indicators related to poverty and counts of UPMS per domain

# Indicator Calculation:
# Grouping the sampling design by domain ID and summarizing variables related to poverty.
indicador_dam <-
  diseno %>% group_by_at(id_dominio) %>% 
  summarise(
    n_pobreza = unweighted(sum(pobreza == 1)),
    pobreza = survey_mean(pobreza,
                          vartype = c("se",  "var"),
                          deff = T
    )
  )

Counts of UPMS per Domain:
- Extracts domain ID and UPMs from the survey dataset, obtaining unique UPMs per domain and counting them.
- Joins the count of unique UPMs per domain with the previously calculated indicators related to poverty.

# Counts of UPMS per domain:
# Selecting domain ID and UPMs, obtaining unique UPMs per domain, and counting them.
# Joining the count of unique UPMs per domain with previously calculated indicators related to poverty.
indicador_dam <- encuesta %>% select(id_dominio, upm) %>%
  distinct() %>% 
  group_by_at(id_dominio) %>% 
  tally(name = "n_upm") %>% 
  inner_join(indicador_dam, by = id_dominio)
# saveRDS(directodam2, "Recursos/02_FGV/indicador_dam.Rds")

dam2	n_upm	n_pobreza	pobreza	pobreza_se	pobreza_var	pobreza_deff
0101	17	237	0.9980	0.0021	0.0000	5.228000e-01
0102	12	197	0.9836	0.0089	0.0001	1.053700e+00
0103	9	65	1.0000	0.0000	0.0000	3.065785e+32
0201	11	244	1.0000	0.0000	0.0000	NaN
0202	11	141	0.9391	0.0292	0.0009	2.317700e+00
0203	12	101	0.8117	0.0775	0.0060	4.795200e+00
0204	11	224	1.0000	0.0000	0.0000	NaN
0205	9	85	0.9646	0.0331	0.0011	2.932600e+00
0206	8	102	1.0000	0.0000	0.0000	NaN
0207	11	149	1.0000	0.0000	0.0000	NaN

Domains with 5 or more UPMs and all those with a deff greater than 1 are now filtered.

base_sae <- indicador_dam %>%
  filter(n_upm >= 5,
         pobreza_deff > 1)

Next, the transformation \(\log(\hat{\sigma}^2_d)\) is performed. Additionally, the selection of the municipality identifier columns (dam2), the direct estimation (pobreza), the number of people in the domain (nd), and the estimated variance for the direct estimation (vardir) is carried out, the latter being transformed using the log() function.

baseFGV <-  base_sae %>% 
  select(dam2, pobreza, nd = n_pobreza, vardir = pobreza_var) %>%
  mutate(ln_sigma2 = log(vardir))