2.2 Variance Model

The code fits a multiple linear regression model (using the lm() function), where ln_sigma2 is the response variable and the predictor variables include pobreza, nd, and various transformations of these variables. The goal of this model is to estimate the generalized variance function (FGV) for the observed domains.

library(gtsummary)
FGV1 <- lm(ln_sigma2 ~ -1 +  pobreza +
             I(pobreza*nd),
     data = baseFGV)

tbl_regression(FGV1) %>% 
  add_glance_table(include = c(r.squared, adj.r.squared))

Characteristic	Beta	95% CI¹	p-value
pobreza	-12	-20, -4.5	0.003
I(pobreza * nd)	0.02	-0.04, 0.07	0.5
R²	0.345
Adjusted R²	0.311
¹ CI = Confidence Interval

After obtaining the model estimation, the value of the constant \(\Delta\) must be obtained, for which the following code is used.

delta.hat = sum(baseFGV$vardir) / 
  sum(exp(fitted.values(FGV1)))

From which it is derived that \(\Delta = 0.110434\). Finally, it is possible to obtain the smoothed variance by executing the following command.

hat.sigma <- 
  data.frame(dam2 = baseFGV$dam2,
             hat_var = delta.hat * exp(fitted.values(FGV1)))

baseFGV <- left_join(baseFGV, hat.sigma)
tba(head(baseFGV, 10))

dam2	pobreza	nd	vardir	ln_sigma2	hat_var
0102	0.9836	197	0.0001	-9.4372	0e+00
0103	1.0000	65	0.0000	-76.3620	0e+00
0202	0.9391	141	0.0009	-7.0701	0e+00
0203	0.8117	101	0.0060	-5.1159	0e+00
0205	0.9646	85	0.0011	-6.8149	0e+00
0212	0.8304	59	0.0101	-4.5936	0e+00
0301	0.9419	45	0.0013	-6.6569	0e+00
0302	0.9109	159	0.0017	-6.3681	0e+00
0401	0.8063	319	0.0006	-7.4369	4e-04
0402	0.8311	259	0.0012	-6.7172	1e-04

Model validation for the FGV

par(mfrow = c(2, 2))
plot(FGV1)

Smoothed variance prediction

base_sae <- left_join(indicador_dam,
                      baseFGV %>% select(id_dominio, hat_var),
                      by = id_dominio) %>%
  mutate(
    pobreza_var = ifelse(is.na(hat_var), NA_real_, pobreza_var),
    pobreza_deff = ifelse(is.na(hat_var), NA_real_, pobreza_deff)
  )

Now, we make a line graph to see the volatility and the estimates of the variances.

nDom <- sum(!is.na(base_sae$hat_var))
temp_FH <- base_sae %>% filter(!is.na(hat_var))
ggplot(temp_FH %>%
         arrange(n_pobreza), aes(x = 1:nDom)) +
  geom_line(aes(y = pobreza_var, color = "VarDirEst")) +
  geom_line(aes(y = hat_var, color = "FGV")) +
  labs(y = "Varianzas", x = "Sample size", color = " ") +
  scale_x_continuous(breaks = seq(1, nDom, by = 10),
                     labels = temp_FH$n_pobreza[order(temp_FH$n_pobreza)][seq(1, nDom, by = 10)]) +
  scale_color_manual(values = c("FGV" = "Blue", "VarDirEst" = "Red"))

This code performs several transformations on the dataset base_sae:

Creation of new variables:
- pobreza_deff: Replaces NaN values with 1 if they exist; otherwise, it keeps the original value.
- deff_FGV: Computes a new Design Effect (DEFF) using the formula hat_var / (pobreza_var / pobreza_deff) when pobreza_var is not equal to 0.
- n_eff_FGV: Calculates the effective number of surveyed individuals as n_pobreza / deff_FGV.
Modification of the variable pobreza:
- If hat_var is NA, it replaces pobreza values with NA; otherwise, it retains the original value.

base_FH <- base_sae %>%
  mutate(
    pobreza_deff = ifelse(is.nan(pobreza_deff), 1, pobreza_deff),
    deff_FGV = ifelse(pobreza_var == 0 ,
      1,
      hat_var / (pobreza_var / pobreza_deff) #Fórmula del nuevo DEFF
    ),
    # Criterio MDS para regularizar el DeffFGV
    n_eff_FGV = n_pobreza / deff_FGV, #Número efectivo de personas encuestadas

     pobreza = ifelse(is.na(hat_var), NA_real_, pobreza) 
  )


#saveRDS(object = base_FH, "Recursos/02_FGV/base_FH_2020.rds")