2.2 Variance Model
The code fits a multiple linear regression model (using the lm() function), where ln_sigma2 is the response variable and the predictor variables include pobreza, nd, and various transformations of these variables. The goal of this model is to estimate the generalized variance function (FGV) for the observed domains.
library(gtsummary)
FGV1 <- lm(ln_sigma2 ~ -1 + pobreza +
I(pobreza*nd),
data = baseFGV)
tbl_regression(FGV1) %>%
add_glance_table(include = c(r.squared, adj.r.squared))| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| pobreza | -12 | -20, -4.5 | 0.003 |
| I(pobreza * nd) | 0.02 | -0.04, 0.07 | 0.5 |
| R² | 0.345 | ||
| Adjusted R² | 0.311 | ||
| 1 CI = Confidence Interval | |||
After obtaining the model estimation, the value of the constant \(\Delta\) must be obtained, for which the following code is used.
From which it is derived that \(\Delta = 0.110434\). Finally, it is possible to obtain the smoothed variance by executing the following command.
hat.sigma <-
data.frame(dam2 = baseFGV$dam2,
hat_var = delta.hat * exp(fitted.values(FGV1)))
baseFGV <- left_join(baseFGV, hat.sigma)
tba(head(baseFGV, 10))| dam2 | pobreza | nd | vardir | ln_sigma2 | hat_var |
|---|---|---|---|---|---|
| 0102 | 0.9836 | 197 | 0.0001 | -9.4372 | 0e+00 |
| 0103 | 1.0000 | 65 | 0.0000 | -76.3620 | 0e+00 |
| 0202 | 0.9391 | 141 | 0.0009 | -7.0701 | 0e+00 |
| 0203 | 0.8117 | 101 | 0.0060 | -5.1159 | 0e+00 |
| 0205 | 0.9646 | 85 | 0.0011 | -6.8149 | 0e+00 |
| 0212 | 0.8304 | 59 | 0.0101 | -4.5936 | 0e+00 |
| 0301 | 0.9419 | 45 | 0.0013 | -6.6569 | 0e+00 |
| 0302 | 0.9109 | 159 | 0.0017 | -6.3681 | 0e+00 |
| 0401 | 0.8063 | 319 | 0.0006 | -7.4369 | 4e-04 |
| 0402 | 0.8311 | 259 | 0.0012 | -6.7172 | 1e-04 |
Model validation for the FGV
Smoothed variance prediction
base_sae <- left_join(indicador_dam,
baseFGV %>% select(id_dominio, hat_var),
by = id_dominio) %>%
mutate(
pobreza_var = ifelse(is.na(hat_var), NA_real_, pobreza_var),
pobreza_deff = ifelse(is.na(hat_var), NA_real_, pobreza_deff)
)Now, we make a line graph to see the volatility and the estimates of the variances.
nDom <- sum(!is.na(base_sae$hat_var))
temp_FH <- base_sae %>% filter(!is.na(hat_var))
ggplot(temp_FH %>%
arrange(n_pobreza), aes(x = 1:nDom)) +
geom_line(aes(y = pobreza_var, color = "VarDirEst")) +
geom_line(aes(y = hat_var, color = "FGV")) +
labs(y = "Varianzas", x = "Sample size", color = " ") +
scale_x_continuous(breaks = seq(1, nDom, by = 10),
labels = temp_FH$n_pobreza[order(temp_FH$n_pobreza)][seq(1, nDom, by = 10)]) +
scale_color_manual(values = c("FGV" = "Blue", "VarDirEst" = "Red"))This code performs several transformations on the dataset base_sae:
- Creation of new variables:
pobreza_deff: Replaces NaN values with 1 if they exist; otherwise, it keeps the original value.deff_FGV: Computes a new Design Effect (DEFF) using the formulahat_var / (pobreza_var / pobreza_deff)whenpobreza_varis not equal to 0.n_eff_FGV: Calculates the effective number of surveyed individuals asn_pobreza / deff_FGV.
- Modification of the variable
pobreza:- If
hat_varis NA, it replacespobrezavalues with NA; otherwise, it retains the original value.
- If
base_FH <- base_sae %>%
mutate(
pobreza_deff = ifelse(is.nan(pobreza_deff), 1, pobreza_deff),
deff_FGV = ifelse(pobreza_var == 0 ,
1,
hat_var / (pobreza_var / pobreza_deff) #Fórmula del nuevo DEFF
),
# Criterio MDS para regularizar el DeffFGV
n_eff_FGV = n_pobreza / deff_FGV, #Número efectivo de personas encuestadas
pobreza = ifelse(is.na(hat_var), NA_real_, pobreza)
)
#saveRDS(object = base_FH, "Recursos/02_FGV/base_FH_2020.rds")