2.2 Variance Model

The code fits a multiple linear regression model (using the lm() function), where ln_sigma2 is the response variable and the predictor variables include pobreza, nd, and various transformations of these variables. The goal of this model is to estimate the generalized variance function (FGV) for the observed domains.

library(gtsummary)
FGV1 <- lm(ln_sigma2 ~ -1 +  pobreza +
             I(pobreza*nd),
     data = baseFGV)

tbl_regression(FGV1) %>% 
  add_glance_table(include = c(r.squared, adj.r.squared))
Characteristic Beta 95% CI1 p-value
pobreza -12 -20, -4.5 0.003
I(pobreza * nd) 0.02 -0.04, 0.07 0.5
0.345

Adjusted R² 0.311

1 CI = Confidence Interval

After obtaining the model estimation, the value of the constant \(\Delta\) must be obtained, for which the following code is used.

delta.hat = sum(baseFGV$vardir) / 
  sum(exp(fitted.values(FGV1)))

From which it is derived that \(\Delta = 0.110434\). Finally, it is possible to obtain the smoothed variance by executing the following command.

hat.sigma <- 
  data.frame(dam2 = baseFGV$dam2,
             hat_var = delta.hat * exp(fitted.values(FGV1)))

baseFGV <- left_join(baseFGV, hat.sigma)
tba(head(baseFGV, 10))
dam2 pobreza nd vardir ln_sigma2 hat_var
0102 0.9836 197 0.0001 -9.4372 0e+00
0103 1.0000 65 0.0000 -76.3620 0e+00
0202 0.9391 141 0.0009 -7.0701 0e+00
0203 0.8117 101 0.0060 -5.1159 0e+00
0205 0.9646 85 0.0011 -6.8149 0e+00
0212 0.8304 59 0.0101 -4.5936 0e+00
0301 0.9419 45 0.0013 -6.6569 0e+00
0302 0.9109 159 0.0017 -6.3681 0e+00
0401 0.8063 319 0.0006 -7.4369 4e-04
0402 0.8311 259 0.0012 -6.7172 1e-04

Model validation for the FGV

par(mfrow = c(2, 2))
plot(FGV1)

Smoothed variance prediction

base_sae <- left_join(indicador_dam,
                      baseFGV %>% select(id_dominio, hat_var),
                      by = id_dominio) %>%
  mutate(
    pobreza_var = ifelse(is.na(hat_var), NA_real_, pobreza_var),
    pobreza_deff = ifelse(is.na(hat_var), NA_real_, pobreza_deff)
  )

Now, we make a line graph to see the volatility and the estimates of the variances.

nDom <- sum(!is.na(base_sae$hat_var))
temp_FH <- base_sae %>% filter(!is.na(hat_var))
ggplot(temp_FH %>%
         arrange(n_pobreza), aes(x = 1:nDom)) +
  geom_line(aes(y = pobreza_var, color = "VarDirEst")) +
  geom_line(aes(y = hat_var, color = "FGV")) +
  labs(y = "Varianzas", x = "Sample size", color = " ") +
  scale_x_continuous(breaks = seq(1, nDom, by = 10),
                     labels = temp_FH$n_pobreza[order(temp_FH$n_pobreza)][seq(1, nDom, by = 10)]) +
  scale_color_manual(values = c("FGV" = "Blue", "VarDirEst" = "Red"))

This code performs several transformations on the dataset base_sae:

  1. Creation of new variables:
    • pobreza_deff: Replaces NaN values with 1 if they exist; otherwise, it keeps the original value.
    • deff_FGV: Computes a new Design Effect (DEFF) using the formula hat_var / (pobreza_var / pobreza_deff) when pobreza_var is not equal to 0.
    • n_eff_FGV: Calculates the effective number of surveyed individuals as n_pobreza / deff_FGV.
  2. Modification of the variable pobreza:
    • If hat_var is NA, it replaces pobreza values with NA; otherwise, it retains the original value.
base_FH <- base_sae %>%
  mutate(
    pobreza_deff = ifelse(is.nan(pobreza_deff), 1, pobreza_deff),
    deff_FGV = ifelse(pobreza_var == 0 ,
      1,
      hat_var / (pobreza_var / pobreza_deff) #Fórmula del nuevo DEFF
    ),
    # Criterio MDS para regularizar el DeffFGV
    n_eff_FGV = n_pobreza / deff_FGV, #Número efectivo de personas encuestadas

     pobreza = ifelse(is.na(hat_var), NA_real_, pobreza) 
  )


#saveRDS(object = base_FH, "Recursos/02_FGV/base_FH_2020.rds")