DESCRIBE THE VARIABLES, DATA AND CORRELATION
Describe the variables
The regression function in this report includes the following variables:
Dependent variable: salepric – Sale price of a house in two communities of California, Dove Canyon and Coto de Caza (thousands of dollars)
Independent variables:
- sqft – Living area in square feet
- garage – Number of car spaces
- city – City: 1 for Coto de Caza and 0 for Dove Canyon
Describe the data
We collected data on the sale prices and characteristics of houses in two communities of California, Dove Canyon and Coto de Caza, from the Ramanathan dataset distributed with Gretl.
Describe the correlation between variables
Correlation Matrix for Linear – linear Model:
Correlation coefficients, using the observations 1-224
5% critical value (two-tailed) = 0.1311 for n = 224
- salepric is positively correlated with sqft; the correlation coefficient is quite high
- salepric is positively correlated with garage; the correlation coefficient is moderate
- salepric is positively correlated with city; the correlation coefficient is moderate
Correlation Matrix for Log – linear Model:
Correlation coefficients, using the observations 1-224
5% critical value (two-tailed) = 0.1311 for n = 224

                l_salepric    sqft      garage    city
l_salepric      1.0000        0.8857    0.6135    0.6486
- l_salepric is positively correlated with sqft; the correlation coefficient (0.8857) is quite high
- l_salepric is positively correlated with garage; the correlation coefficient (0.6135) is moderate
- l_salepric is positively correlated with city; the correlation coefficient (0.6486) is moderate
Correlation Matrix for Log – log Model:
Correlation coefficients, using the observations 1-224
5% critical value (two-tailed) = 0.1311 for n = 224

                l_salepric    l_sqft    l_garage    city
l_salepric      1.0000        0.9001    0.5988      0.6486
- l_salepric is positively correlated with l_sqft; the correlation coefficient (0.9001) is quite high
- l_salepric is positively correlated with l_garage; the correlation coefficient (0.5988) is moderate
- l_salepric is positively correlated with city; the correlation coefficient (0.6486) is moderate
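The correlation matrices above come from Gretl. As an illustration, a similar matrix and the 5% critical value for r can be sketched in Python; the data below are synthetic stand-ins (hypothetical values), since the actual Ramanathan sample is not reproduced here.

```python
import numpy as np
import pandas as pd
from scipy.stats import t

# Synthetic stand-in for the 224-observation housing sample (hypothetical values).
rng = np.random.default_rng(0)
n = 224
sqft = rng.uniform(1500, 6000, n)
garage = rng.integers(2, 5, n).astype(float)
city = rng.integers(0, 2, n).astype(float)
salepric = 0.22 * sqft + 129 * garage + 101 * city + rng.normal(0, 120, n)

df = pd.DataFrame({"salepric": salepric, "sqft": sqft,
                   "garage": garage, "city": city})
print(df.corr().round(4))  # pairwise Pearson correlation matrix

# 5% two-tailed critical value for a correlation coefficient:
# r_crit = t_crit / sqrt(t_crit^2 + n - 2)
t_crit = t.ppf(0.975, n - 2)
r_crit = t_crit / np.sqrt(t_crit**2 + n - 2)
print(round(r_crit, 4))  # → 0.1311, matching the Gretl output for n = 224
```

A correlation is statistically significant at the 5% level when its absolute value exceeds this critical value.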
ESTIMATED MODEL AND STATISTICAL INFERENCES
Linear – linear model
Model 1: OLS, using observations 1-224
Dependent variable: salepric

             Coefficient   Std. Error   t-ratio   p-value

Mean dependent var   642.9294     S.D. dependent var   371.3762
Sum squared resid    3641423      S.E. of regression   128.6543
R-squared            0.881604     Adjusted R-squared   0.879989
Log-likelihood       −1403.821    Akaike criterion     2815.642
Schwarz criterion    2829.289     Hannan-Quinn         2821.150
Interpretation of the estimated values:
- The population regression function: salepric_i = β1 + β2·sqft_i + β3·garage_i + β4·city_i + u_i
- The sample regression function (estimated equation): salepric = −704.854 + 0.220060·sqft + 129.286·garage + 101.275·city
β2 = 0.220060: When sqft increases by 1 square foot, holding garage and city constant, the estimated salepric increases by 0.220060 thousand dollars.
β3 = 129.286: When garage increases by one car space, holding sqft and city constant, the estimated salepric increases by 129.286 thousand dollars.
β4 = 101.275: Holding sqft and garage constant, the expected sale price of a house in Coto de Caza is higher than in Dove Canyon by 101.275 thousand dollars.
R² measures the proportion of the variability of the response around its mean that the model explains.
R² = 0.881604 is quite high, which suggests the model fits well: 88.16% of the variation in sale prices is explained by the independent variables (living area, number of car spaces, and city location).
2 Testing
2.1 Testing hypothesis
2.1.1 Testing an individual regression coefficient
Purpose: Test for the statistical significance of the effect of each independent variable on the dependent variable. We have: α = 0.05
Testing the variable of Living area in square feet (sqft):
Given that the hypothesis is: H0: β2 = 0; H1: β2 ≠ 0
We see: the p-value of sqft is < 0.0001 < 0.05 → Reject H0 → The coefficient β2 is statistically significant
Testing the variable of Number of car spaces (garage):
Given that the hypothesis is: H0: β3 = 0; H1: β3 ≠ 0
We see: the p-value of garage is < 0.0001 < 0.05 → Reject H0 → The coefficient β3 is statistically significant
Testing the variable of City:
Given that the hypothesis is: H0: β4 = 0; H1: β4 ≠ 0
We see: the p-value of city is < 0.0001 < 0.05 → Reject H0 → The coefficient β4 is statistically significant
Purpose: Test the null hypothesis that none of the explanatory variables has an effect on the dependent variable. We have: α = 0.05
Given that the hypothesis is: H0: β2 = β3 = β4 = 0; H1: at least one βj ≠ 0 (j = 2, 3, 4)
We have: P-value(F) = 1.3e-101 < α = 0.05 → Reject H0 → The slope parameters are not all simultaneously zero → at least one variable has an effect on the dependent variable
→ The model is statistically significant overall
Given that the hypothesis is: H0: the model is correctly specified; H1: the model is misspecified
Auxiliary regression for RESET specification test OLS, using observations 1-224
Dependent variable: salepric

             coefficient     std. error     t-ratio    p-value
const        419.370         278.731         1.505     0.1339
sqft         −0.0255144      0.0615390      −0.4146    0.6788
garage       −14.1284        41.3909        −0.3413    0.7332
city         53.0970         27.8779         1.905     0.0581   *
yhat^2       0.000847862     0.000268149     3.162     0.0018   ***
yhat^3       −1.83128e-07    7.65971e-08    −2.391     0.0177   **
Method: Because of the limits of this research, we will spend more time reading further documents to identify which variable is omitted.
We use the "vif" command in Gretl to examine multicollinearity. The command reports the variance inflation factor of each regressor; if a variable's VIF exceeds 10, the model may suffer from multicollinearity. The result is the following:
Variance Inflation Factors
Minimum possible value = 1.0
Values > 10.0 may indicate a collinearity problem

  sqft     1.742
  garage   1.512
  city     1.224

VIF(j) = 1/(1 − R(j)^2), where R(j) is the multiple correlation coefficient between variable j and the other independent variables.

Belsley-Kuh-Welsch collinearity diagnostics:

                     --- variance proportions ---
  lambda    cond      const    sqft     garage   city
  3.591     1.000     0.002    0.004    0.001    0.022
  0.354     3.185     0.008    0.003    0.004    0.858
  0.044     9.020     0.169    0.787    0.013    0.120
  0.011     18.192    0.821    0.206    0.981    0.001

  lambda = eigenvalues of X'X, largest to smallest
  cond = condition index
  note: variance proportions columns sum to 1.0
We see: VIF(sqft) = 1.742 < 10, VIF(garage) = 1.512 < 10, VIF(city) = 1.224 < 10
→ The model does not suffer from serious multicollinearity
Given that the hypothesis is: H0: the errors are homoskedastic; H1: the errors are heteroskedastic
White's test for heteroskedasticity OLS, using observations 1-224
Dependent variable: uhat^2

             coefficient    std. error    t-ratio    p-value
const        93359.5        63100.7        1.480     0.1405
sqft         −55.8510       15.7454       −3.547     0.0005    ***
garage       26287.7        31028.8        0.8472    0.3978
city         −41070.9       62326.7       −0.6590    0.5106
sq_sqft      0.0106259      0.000988871   10.75      7.78e-022 ***
Test statistic: TR^2 = 149.342835, with p-value = P(Chi-square(8) > 149.342835) = 2.68837e-028
We see: p-value = P(Chi-square(8) > 149.342835) = 2.68837e-028 < α = 0.05 → Reject H0 → The model has a heteroskedasticity problem
Method: Using Robust to fix the problem:
Model 2: OLS, using observations 1-224
Dependent variable: salepric
Heteroskedasticity-robust standard errors, variant HC1

             Coefficient   Std. Error   t-ratio   p-value

Mean dependent var   642.9294     S.D. dependent var   371.3762
Sum squared resid    3641423      S.E. of regression   128.6543
R-squared            0.881604     Adjusted R-squared   0.879989
Log-likelihood       −1403.821    Akaike criterion     2815.642
Schwarz criterion    2829.289     Hannan-Quinn         2821.150
→ Robust standard errors make the statistical inference (t-ratios and p-values) valid, but they do not remove the heteroskedasticity itself from the data
Given that the hypothesis is: H0: the residuals are normally distributed; H1: the residuals are not normally distributed
Using the normality-of-residuals test in Gretl:
Test for normality of residual - Null hypothesis: error is normally distributed Test statistic: Chi-square(2) = 265.203 with p-value = 2.58197e-058
We see: Chi-square(2) = 265.203 with p-value = 2.58197e-058 < α = 0.05 → Reject H0 → The residuals are not normally distributed.
Method: Increasing the number of observations until n ≥ 384.
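Gretl's residual normality test (Doornik-Hansen) is chi-square with 2 degrees of freedom; the Jarque-Bera statistic available in statsmodels is a closely related chi-square(2) test based on the residuals' skewness and kurtosis. A sketch on hypothetical residual series:

```python
import numpy as np
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(7)
# Hypothetical residual series: one clearly skewed, one drawn from a normal.
skewed_resid = rng.exponential(1.0, 224) - 1.0
normal_resid = rng.normal(0.0, 1.0, 224)

for resid in (skewed_resid, normal_resid):
    jb_stat, jb_pvalue, skew, kurt = jarque_bera(resid)
    print(round(float(jb_stat), 3),
          "reject normality" if jb_pvalue < 0.05 else "fail to reject")
```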
Log – linear model
Model 3: OLS, using observations 1-224
Dependent variable: l_salepric

             Coefficient   Std. Error   t-ratio   p-value

Mean dependent var   6.365959     S.D. dependent var   0.403646
Sum squared resid    4.038026     S.E. of regression   0.135479
R-squared            0.888862     Adjusted R-squared   0.887346
Log-likelihood       131.9375     Akaike criterion     −255.8749
Schwarz criterion    −242.2283    Hannan-Quinn         −250.3665
Interpretation of the estimated values:
- The population regression function: ln(salepric_i) = β1 + β2·sqft_i + β3·garage_i + β4·city_i + u_i
- The sample regression function (estimated equation): ln(salepric) = 5.01704 + 0.000207498·sqft − 0.117941·garage + 0.267482·city
β2 = 0.000207498: When sqft increases by 1 square foot, holding garage and city constant, the expected salepric increases by about 0.0207498%.
β3 = −0.117941: When garage increases by one car space, holding sqft and city constant, the expected salepric decreases by about 11.7941%.
β4 = 0.267482: Holding sqft and garage constant, the expected sale price of a house in Coto de Caza is about 26.7482% higher than in Dove Canyon.
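For the city dummy in a log-linear model, 100·β is only an approximation of the percentage effect; the exact effect is 100·(exp(β) − 1). A short check using the report's estimate:

```python
import math

beta_city = 0.267482  # city coefficient from the log-linear model above
approx_pct = 100 * beta_city                 # the report's approximation
exact_pct = 100 * (math.exp(beta_city) - 1)  # exact percentage premium
print(round(approx_pct, 2))  # → 26.75
print(round(exact_pct, 2))   # roughly 30.7, noticeably above the approximation
```

The gap between the two grows with |β|, so for dummies with large coefficients the exact formula is preferable.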
R² measures the proportion of the variability of the response around its mean that the model explains.
R² = 0.888862 indicates a strong fit: 88.89% of the variation in ln(salepric) is explained by the independent variables (square footage, garage, and city).
2.1 Testing hypothesis
2.1.1 Testing an individual regression coefficient
Purpose: Test for the statistical significance of the effect of each independent variable on the dependent variable. We have: α = 0.05
Given that the hypothesis is: H0: β2 = 0; H1: β2 ≠ 0
We see: the p-value of sqft is < 0.0001 < 0.05 → Reject H0 → The coefficient β2 is statistically significant
Given that the hypothesis is: H0: β3 = 0; H1: β3 ≠ 0
We see: the p-value of garage is < 0.0001 < 0.05 → Reject H0 → The coefficient β3 is statistically significant
Given that the hypothesis is: H0: β4 = 0; H1: β4 ≠ 0
We see: the p-value of city is < 0.0001 < 0.05 → Reject H0 → The coefficient β4 is statistically significant
Purpose: Test the null hypothesis that none of the explanatory variables has an effect on the dependent variable. We have: α = 0.05
Given that the hypothesis is: H0: β2 = β3 = β4 = 0; H1: at least one βj ≠ 0 (j = 2, 3, 4)
We have: P-value(F) = 1.2e-104 < α = 0.05 → Reject H0 → The slope parameters are not all simultaneously zero → at least one variable has an effect on the dependent variable
→ The model is statistically significant overall
Given that the hypothesis is: H0: the model is correctly specified; H1: the model is misspecified
Auxiliary regression for RESET specification test OLS, using observations 1-224
Dependent variable: l_salepric

             coefficient    std. error    t-ratio   p-value
const        −86.4806       28.9110       −2.991    0.0031   ***
sqft         −0.00647575    0.00218279    −2.967    0.0033   ***
garage       −3.68898       1.24248       −2.969    0.0033   ***
city         −8.40197       2.80299       −2.998    0.0030   ***
yhat^2       4.90549        1.53920        3.187    0.0016   ***
yhat^3       −0.247302      0.0748301     −3.305    0.0011   ***
Method: Because of the limits of this research, we will spend more time reading further documents to identify which variable is omitted.
We use the "vif" command in Gretl to examine multicollinearity. The command reports the variance inflation factor of each regressor; if a variable's VIF exceeds 10, the model may suffer from multicollinearity. The result is the following:
Variance Inflation Factors: minimum possible value = 1.0; values exceeding 10.0 may signal collinearity. The VIF values for square footage (1.742), garage (1.512), and city (1.224) indicate low collinearity. The formula is VIF(j) = 1/(1 − R(j)^2), where R(j) is the multiple correlation coefficient between variable j and the other independent variables. The Belsley-Kuh-Welsch collinearity diagnostics provide further insight:
                     --- variance proportions ---
  lambda    cond      const    sqft     garage   city
  3.591     1.000     0.002    0.004    0.001    0.022
  0.354     3.185     0.008    0.003    0.004    0.858
  0.044     9.020     0.169    0.787    0.013    0.120
  0.011     18.192    0.821    0.206    0.981    0.001

  lambda = eigenvalues of X'X, largest to smallest
  cond = condition index
  note: variance proportions columns sum to 1.0
We see: VIF(sqft) = 1.742 < 10, VIF(garage) = 1.512 < 10, VIF(city) = 1.224 < 10
→ The model does not suffer from serious multicollinearity
Given that the hypothesis is: H0: the errors are homoskedastic; H1: the errors are heteroskedastic
White's test for heteroskedasticity OLS, using observations 1-224
Dependent variable: uhat^2

             coefficient      std. error      t-ratio     p-value
const        0.0199661        0.0562010        0.3553     0.7227
sqft         −1.76611e-05     1.40238e-05     −1.259      0.2093
garage       0.0326536        0.0276360        1.182      0.2387
city         −0.0776614       0.0555116       −1.399      0.1633
sq_sqft      −2.70427e-012    8.80744e-010    −0.003070   0.9976
X2_X3        6.84208e-06      3.88359e-06      1.762      0.0795   *
X2_X4        7.65262e-06      9.38796e-06      0.8152     0.4159
sq_garage    −0.0127315       0.00500544      −2.544      0.0117   **
Test statistic: TR^2 = 48.607688, with p-value = P(Chi-square(8) > 48.6077) = 7.55903e-008
We see: p-value = P(Chi-square(8) > 48.6077) = 7.55903e-008 < α = 0.05 → Reject H0 → The model has a heteroskedasticity problem
Method: Using Robust to fix the problem:
Model 4: OLS, using observations 1-224
Dependent variable: l_salepric
Heteroskedasticity-robust standard errors, variant HC1

             Coefficient   Std. Error   t-ratio   p-value

Mean dependent var   6.365959     S.D. dependent var   0.403646
Sum squared resid    4.038026     S.E. of regression   0.135479
R-squared            0.888862     Adjusted R-squared   0.887346
Log-likelihood       131.9375     Akaike criterion     −255.8749
Schwarz criterion    −242.2283    Hannan-Quinn         −250.3665
→ Robust standard errors make the statistical inference (t-ratios and p-values) valid, but they do not remove the heteroskedasticity itself from the data
Given that the hypothesis is: H0: the residuals are normally distributed; H1: the residuals are not normally distributed
Using the normality-of-residuals test in Gretl:
Frequency distribution for uhat1, obs 1-224: number of bins = 15, mean = 3.17207e-017, s.d. = 0.135479
Test for null hypothesis of normal distribution:
We see: Chi-square(2) = 16.779 with p-value = 0.00023 < α = 0.05 → Reject H0 → The residuals are not normally distributed
Method: Increasing the number of observations until n ≥ 384.
Log – log model
Model 5: OLS, using observations 1-224
Dependent variable: l_salepric

             Coefficient   Std. Error   t-ratio   p-value
const        −3.35140      0.361870     −9.261

Test statistic: TR^2 = 60.581711, with p-value = P(Chi-square(8) > 60.581711) = 3.58349e-010
We see: p-value = 3.58349e-010 < α = 0.05 → Reject H0 → The model has a heteroskedasticity problem
Method: Using Robust to fix the problem:
Model 6: OLS, using observations 1-224
Dependent variable: l_salepric
Heteroskedasticity-robust standard errors, variant HC1

             Coefficient   Std. Error   t-ratio   p-value

Mean dependent var   6.365959     S.D. dependent var   0.403646
Sum squared resid    4.088755     S.E. of regression   0.136328
R-squared            0.887465     Adjusted R-squared   0.885931
Log-likelihood       130.5392     Akaike criterion     −253.0784
Schwarz criterion    −239.4318    Hannan-Quinn         −247.5700
→ Robust standard errors make the statistical inference (t-ratios and p-values) valid, but they do not remove the heteroskedasticity itself from the data
Given that the hypothesis is: H0: the residuals are normally distributed; H1: the residuals are not normally distributed
Using the normality-of-residuals test in Gretl:
Frequency distribution for uhat1, obs 1-224: number of bins = 15, mean = 1.41157e-015, s.d. = 0.136328
Test for null hypothesis of normal distribution:
We see: the Chi-square(2) test statistic has p-value = 0.00669 < α = 0.05 → Reject H0 → The residuals are not normally distributed.
Method: Increasing the number of observations until n ≥ 384
Our study reveals a significant relationship between the sale prices of homes in Coto de Caza and Dove Canyon, California, and key factors such as living area, number of car spaces, and location. Notably, living area emerges as the most impactful variable: larger living spaces correlate with higher sale prices. An increase in the number of car spaces also contributes positively to sale prices, and the location of the property serves as a critical explanatory variable for demand. Through descriptive analysis and regression modeling, we found that all three variables demonstrated statistical significance. Our hypothesis testing confirmed that each overall model is statistically significant, and we identified no serious multicollinearity across the three tested models: linear-linear, log-linear, and log-log. However, we encountered common issues, including omitted-variable bias, heteroskedasticity, and non-normality of residuals. We addressed the heteroskedasticity issue using robust standard errors, while further research is needed to resolve the other challenges. Ultimately, there is no definitive best model; our approach focuses on estimating and testing various models to improve their accuracy over time.
Despite the existing flaws in the models, the significant impact of the independent variables on the dependent variable offers hope for further research and the development of improved models. Such advances can empower investors, particularly Vietnamese investors, to succeed in one of the world's most dynamic real estate markets, the USA. Improved models can also assist policymakers in creating effective housing policies, alleviating the housing burden on the population.