Application 2.4: Predictions

Engel curve for food expenditure

In this application we estimate an Engel curve relating household food expenditure (\(GALIM\)) to disposable income (\(RENTA\)), using a sample of 235 American households:

\[GALIM_{i} = \beta_1 + \beta_2 RENTA_{i} + e_{i}\]

Once the model has been estimated, the fitted equation is used to predict the expected value of food expenditure for different types of household, as a function of household income.
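For reference, the point prediction and the confidence interval for the expected value at a new income level \(RENTA_0\) take the standard simple-regression form below. The notation is an assumption introduced here, not taken from the original: \(b_1\) and \(b_2\) are the OLS estimates, \(\hat{\sigma}\) is the residual standard error, \(N\) is the sample size, and \(\overline{RENTA}\) is the sample mean of income.

\[\widehat{GALIM}_0 = b_1 + b_2 RENTA_0\]

\[\widehat{GALIM}_0 \pm t_{1-\alpha/2,\,N-2}\;\hat{\sigma}\sqrt{\frac{1}{N} + \frac{(RENTA_0 - \overline{RENTA})^2}{\sum_{i=1}^{N}(RENTA_i - \overline{RENTA})^2}}\]

The interval widens as \(RENTA_0\) moves away from the sample mean, so predictions for unusually low or high incomes are less precise.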

Code
# Load packages
library(tidyverse)
# Read the data
ENGEL_ALIM <- read_delim("data/ENGEL_ALIM_USA.csv", ";", 
                         escape_double = FALSE, trim_ws = TRUE)
# Scatter plot of RENTA against GALIM
ggplot(ENGEL_ALIM, aes(x = RENTA, y = GALIM)) + 
  geom_point() + 
  scale_x_continuous(limits = c(350, 5000), expand = c(0, 0)) + 
  theme_bw() + 
  labs(x = "Income", y = "Food expenditure")

Code
# Scatter plot with the fitted OLS regression line overlaid
ggplot(ENGEL_ALIM, aes(x = RENTA, y = GALIM)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  scale_x_continuous(limits = c(350, 5000), expand = c(0, 0)) + 
  theme_bw() + 
  labs(x = "Income", y = "Food expenditure")

Code
# OLS estimation of a linear Engel curve
lin_model <- lm(GALIM ~ RENTA, data = ENGEL_ALIM)
summary(lin_model)

Call:
lm(formula = GALIM ~ RENTA, data = ENGEL_ALIM)

Residuals:
    Min      1Q  Median      3Q     Max 
-725.70  -60.24   -4.32   53.41  515.77 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 147.47539   15.95708   9.242   <2e-16 ***
RENTA         0.48518    0.01437  33.772   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 114.1 on 233 degrees of freedom
Multiple R-squared:  0.8304,    Adjusted R-squared:  0.8296 
F-statistic:  1141 on 1 and 233 DF,  p-value: < 2.2e-16
Code
# Prediction
# Data frame holding the new values of the explanatory variable
new_RENTA <- data.frame(RENTA=c(400, 2000, 4500))
new_RENTA
  RENTA
1   400
2  2000
3  4500
Code
# Point prediction
pred_GALIM <- predict(lin_model, new_RENTA)
names(pred_GALIM) <- c("Renta = 400", "2000", "4500")
pred_GALIM
Renta = 400        2000        4500 
   341.5468   1117.8322   2330.7783 
Code
# Prediction of the expected value with a 95% confidence interval
pred_GALIM_IC <- predict(lin_model, new_RENTA, interval="confidence", level=0.95)
pred_GALIM_IC
        fit       lwr       upr
1  341.5468  319.4814  363.6122
2 1117.8322 1085.5127 1150.1518
3 2330.7783 2230.1418 2431.4148
Code
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# Read the data
ENGEL_ALIM = pd.read_csv('data/ENGEL_ALIM_USA.csv', delimiter=';')
# Estimate the model by OLS
model = smf.ols('GALIM ~ RENTA', data=ENGEL_ALIM)
lin_model = model.fit()
print(lin_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  GALIM   R-squared:                       0.830
Model:                            OLS   Adj. R-squared:                  0.830
Method:                 Least Squares   F-statistic:                     1141.
Date:                Sun, 09 Feb 2025   Prob (F-statistic):           9.92e-92
Time:                        13:14:04   Log-Likelihood:                -1445.7
No. Observations:                 235   AIC:                             2895.
Df Residuals:                     233   BIC:                             2902.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    147.4754     15.957      9.242      0.000     116.037     178.914
RENTA          0.4852      0.014     33.772      0.000       0.457       0.513
==============================================================================
Omnibus:                       68.110   Durbin-Watson:                   1.411
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              927.676
Skew:                          -0.670   Prob(JB):                    3.61e-202
Kurtosis:                      12.641   Cond. No.                     2.38e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.38e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
# Prediction
# Build a data frame holding the new values of the explanatory variable
new_RENTA = pd.DataFrame({'RENTA': [400, 2000, 4500]}, index=['newRENTA1', 'newRENTA2', 'newRENTA3'])
print(f'new_RENTA: \n{new_RENTA}\n')
new_RENTA: 
           RENTA
newRENTA1    400
newRENTA2   2000
newRENTA3   4500
Code
# Point prediction
pred_GALIM = lin_model.predict(new_RENTA)
print(f'pred_GALIM: \n{pred_GALIM}\n')
pred_GALIM: 
newRENTA1     341.546758
newRENTA2    1117.832236
newRENTA3    2330.778295
dtype: float64
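The point predictions are simply the fitted equation evaluated at each new income level. The following minimal sketch reproduces them by hand, assuming the coefficients are copied as printed in the regression output (so they are rounded); the function name `predict_galim` is hypothetical and used only for illustration.

```python
# Coefficients copied from the regression output (rounded as printed)
B1 = 147.47539  # intercept
B2 = 0.48518    # slope on RENTA


def predict_galim(renta):
    """Evaluate the fitted Engel curve at a given income level."""
    return B1 + B2 * renta


for renta in (400, 2000, 4500):
    print(f"RENTA = {renta}: predicted GALIM = {predict_galim(renta):.2f}")
```

Because the slope is rounded to five decimals, the hand-computed values agree with the `predict` output only to about two decimal places.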
Code
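# Prediction with 95% intervals: mean_ci_* is the confidence interval for the
# expected value, obs_ci_* is the prediction interval for a single observation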
pred_GALIM_IC = lin_model.get_prediction(new_RENTA).summary_frame(alpha=0.05)
print(f'pred_GALIM_IC: \n{pred_GALIM_IC}\n')
pred_GALIM_IC: 
          mean    mean_se  ...  obs_ci_lower  obs_ci_upper
0   341.546758  11.199590  ...    115.651327    567.442189
1  1117.832236  16.404210  ...    890.705804   1344.958668
2  2330.778295  51.079406  ...   2084.466352   2577.090238

[3 rows x 6 columns]
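The printed frame is truncated: only the `obs_ci_*` columns (prediction intervals for an individual household) are visible, while the hidden `mean_ci_*` columns hold the confidence intervals for the expected value that R reports as `lwr` and `upr`. As a rough check, both can be rebuilt from the `mean` and `mean_se` columns. This sketch makes two assumptions: a Student's t critical value of about 1.9702 for 233 degrees of freedom, and the residual standard error of 114.1 as rounded in the R summary. The helper name `intervals` is hypothetical.

```python
import math

T_CRIT = 1.9702  # assumed: approx. 97.5% quantile of Student's t with 233 df
SIGMA = 114.1    # residual standard error, as rounded in the R summary

# (mean, mean_se) pairs copied from the summary_frame output
predictions = {
    400: (341.546758, 11.199590),
    2000: (1117.832236, 16.404210),
    4500: (2330.778295, 51.079406),
}


def intervals(mean, mean_se):
    """Return (confidence interval, prediction interval) as (lower, upper) tuples."""
    ci_half = T_CRIT * mean_se
    # A single observation adds the error variance to the variance of the mean
    pi_half = T_CRIT * math.sqrt(mean_se**2 + SIGMA**2)
    return (mean - ci_half, mean + ci_half), (mean - pi_half, mean + pi_half)


for renta, (mean, mean_se) in predictions.items():
    ci, pi = intervals(mean, mean_se)
    print(f"RENTA={renta}: CI=({ci[0]:.2f}, {ci[1]:.2f})  PI=({pi[0]:.2f}, {pi[1]:.2f})")
```

The confidence intervals should agree closely with the R output; the prediction intervals differ slightly in the second decimal because the residual standard error is rounded. To see every column directly, one option is `print(pred_GALIM_IC.to_string())`.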