This article analyzes the factors that affect the price of houses near Boston. Our data include many candidate factors: the initial model of house price contained 13 predictor variables. Not all of these factors can be expected to affect the house price, so to obtain an appropriate model for house prices in the Boston suburbs it is necessary to exclude the less important ones. Indeed, according to the statistical tests, most of these factors had little impact on the response variable, house price, so we removed them, and the remaining predictor variables were checked again. The final model explains about 86% of the variation in house prices. That is, the predictors account for about 86% of the variability of prices around their mean; it does not mean that 86 predictions out of 100 will be accurate.
The data used in our project were obtained from the UCI Machine Learning Repository, a collection of databases that has been widely used by students and researchers around the world since 1987. Among its many data sets on different topics, we chose the Housing data set, maintained at Carnegie Mellon University. The variable names are listed on one page (http://archive.ics.uci.edu/ml/datasets/Housing), and the numerical data are on another (http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data); we saved the latter as a txt document. The data set provides information about house prices in the suburbs of Boston in 1993. We found it interesting because we wanted to find out which variables the house price depended on and how much houses cost near Boston in the year we were born. The variable that intrigued us most was the 12th: the proportion of Black residents by town. We wanted to answer the question: were Black people still being discriminated against in the US in 1993? We approached this question by looking at the relation between house price and the proportion of Black residents in the suburbs of Boston. NOTE: We first ran a linear regression and found outliers that hurt our model. We then removed all outliers from the whole data set (validation and regression data) and ran the regression again.
Firstly, we divided our data into two parts: the first part was used to obtain our regression function and the second to validate it. We then fitted a linear model using all of our variables. As the table below shows, the p-values of some variables (crim, indus, age) are greater than 0.01, so we decided to change the model and remove the insignificant variables. After identifying the model, we checked its assumptions: normality of errors, independence of errors, constant variance of errors, and linearity of the regression. First, to check the independence of the residuals, we looked at the plots below of residuals against the predictor variables and found that lstat and rm need squared terms.
To add the squared terms of lstat and rm, we used a polynomial regression model. The variable lstat1 is lstat centered, and lstat1_sq is the square of lstat1; rm1 and rm1_sq were constructed in the same way. We centered these variables to reduce the correlation between each variable and its square. The table below shows that the p-values of all coefficients are less than 0.05, so they can be considered significant. In addition, our model explains 83.22 percent of the variation in medvalue, and the p-value of the overall F-test is very small (< 2.2e-16). Therefore, this model is significant.
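The effect of centering before squaring can be illustrated with a small sketch. The predictor values below are made up (they are not taken from the Housing data), and the helper names `pearson` and `center_and_square` are hypothetical; only the names lstat1 and lstat1_sq follow the text.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def center_and_square(xs):
    """Return the centered values and their squares (like lstat1, lstat1_sq)."""
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]
    return centered, [c * c for c in centered]

# Made-up predictor values, for illustration only.
lstat = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]

raw_sq = [x * x for x in lstat]                 # square of the raw variable
lstat1, lstat1_sq = center_and_square(lstat)    # centered variable and its square

corr_raw = pearson(lstat, raw_sq)               # close to 1: strong collinearity
corr_centered = pearson(lstat1, lstat1_sq)      # 0 here, because the values are symmetric
```

With these symmetric values the correlation drops from nearly 1 to exactly 0; with real, skewed data the drop is usually smaller but still substantial.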
These graphs support our assumptions of independence of errors and constant variance of errors. The histogram of residuals above shows one outlying point, which corresponds to index 373 in the data. We removed this point from our data and fitted our last model again without it. There were no significant changes in the coefficients or in R-squared. The Shapiro test gives a p-value of 0.01189, which is not very high, so the normality of the residuals is not very good. In the validation, MSPR/MSE > 2, so the validation did not work out. Note, however, that we had excluded the outlier with index 373 from our model. We therefore need to understand this outlying effect and identify it in the whole data set, because our model will not work in such cases.
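The validation check based on MSPR/MSE can be sketched as follows. This is a minimal illustration with made-up numbers, not the project's data; the function names are hypothetical, and the threshold of 2 is the one quoted in the text.

```python
def mspr(observed, predicted):
    """Mean squared prediction error on the validation set."""
    n = len(observed)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n

def mse(observed, fitted, n_params):
    """Error mean square of the fitted model: SSE / (n - p)."""
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, fitted))
    return sse / (len(observed) - n_params)

# Made-up model-building data: observed values and fitted values.
train_obs = [10.0, 12.0, 14.0, 16.0, 18.0]
train_fit = [10.5, 11.5, 14.5, 15.5, 18.0]

# Made-up validation data: observed values and model predictions.
valid_obs = [11.0, 15.0, 19.0]
valid_pred = [12.0, 14.0, 21.0]

mse_value = mse(train_obs, train_fit, n_params=2)   # 2 = intercept + one slope
mspr_value = mspr(valid_obs, valid_pred)
ratio = mspr_value / mse_value

# A ratio well above 1 (here compared with 2, as in the text) suggests the
# model predicts new data noticeably worse than it fits the building data.
validation_ok = ratio <= 2
```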
By investigating the data, we found that the outlying effect involved two variables: dis (weighted distance to five Boston employment centers) was less than 2 miles, and the house price was high, above approximately 45. We obtained a model in which every variable was significant (p-value < 0.05), with Multiple R-squared: 0.8669. The fitted model explains 86.69% of the variation in house prices, and the p-value of the overall F-test is very small (< 2.2e-16). Therefore, this model is significant. From the plot above of bk against the residuals, we can notice that the variance across bk may not be constant. We used the Brown-Forsythe test on bk; the p-value of the test was 0.9141. Thus, we fail to reject H0 and conclude that the variance is constant.
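One common form of the Brown-Forsythe test for regression residuals splits the residuals into two groups by the value of a predictor (here it would be bk) and compares their absolute deviations from the group medians with a two-sample t statistic. The sketch below assumes this two-group form; the text does not say exactly how the test was set up, the residuals are made up, and the function name is hypothetical. Only the statistic is computed; the p-value would come from a t distribution with n - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, median

def brown_forsythe_t(residuals_low, residuals_high):
    """Two-group Brown-Forsythe t statistic for constancy of error variance."""
    # Absolute deviations of each residual from its own group median.
    d1 = [abs(e - median(residuals_low)) for e in residuals_low]
    d2 = [abs(e - median(residuals_high)) for e in residuals_high]
    n1, n2 = len(d1), len(d2)
    d1_bar, d2_bar = mean(d1), mean(d2)
    # Pooled variance of the absolute deviations.
    pooled = (sum((d - d1_bar) ** 2 for d in d1)
              + sum((d - d2_bar) ** 2 for d in d2)) / (n1 + n2 - 2)
    return (d1_bar - d2_bar) / sqrt(pooled * (1 / n1 + 1 / n2))

# Made-up residuals for observations with low and high predictor values.
same_spread = brown_forsythe_t([-2.0, -1.0, 0.0, 1.0, 2.0],
                               [8.0, 9.0, 10.0, 11.0, 12.0])
wider_low = brown_forsythe_t([-4.0, -2.0, 0.0, 2.0, 4.0],
                             [8.0, 9.0, 10.0, 11.0, 12.0])
```

A statistic near zero, as in the first call, is consistent with constant variance; a large absolute value, relative to the t distribution, points to unequal spread between the two groups.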
The regression model is valid and explains 86.69% of the variation in the median value of houses. However, our model will not predict the price of houses that are located near the city center and are expensive. The final fitted model involves a predictor variable, the proportion of Black residents by town, that is specific to the US. Therefore, in practice this model can be used for suburbs of US cities, after adjusting dollar values into real terms to account for inflation, but it does not work for houses in other countries where the variable Bk (proportion of blacks) is not meaningful. Moreover, as found in section 3, the outliers have two characteristics: 1. they are located too close to Boston employment centers (less than 2 miles), and 2. they are too expensive (>$45,000). The model therefore will not predict the price of such exceptional luxury houses.
As an example of using our model, suppose that we are interested in the house price in a suburb near Boston where 10% of the population is of lower status, the nitric oxides concentration is 0.5 (parts per 10 million), the average number of rooms per dwelling is 5.5, the pupil-teacher ratio by town is 16, the ratio of blacks and whites by town is 1.178, and the weighted distance to five Boston employment centers is 5 miles.
The leverage value of this new point was within the usual range of values in the model, so no extrapolation was involved. Numerically, being within the usual range means being less than 0.072, the maximum allowed leverage; in this case the leverage was 0.028, which is far below this upper bound. The resulting prediction interval for this house price is (15.29489; 28.24503). Hence, with 95% confidence, we conclude that the price of a single new house in a Boston suburb with 10% lower status of the population, a nitric oxides concentration of 0.5 (parts per 10 million), an average of 5.5 rooms per dwelling, a pupil-teacher ratio by town of 16, a ratio of blacks and whites by town of 1.178, and a weighted distance to five Boston employment centers of 5 miles will be between $15,295 and $28,245, which is
In another example, we changed only the weighted distance to five Boston employment centers, to 1 mile. In this case, the leverage value of the new point was 0.066, which is within the usual range of values in the model, so no extrapolation was involved. Note, however, that it is close to 0.072, the maximum allowed leverage. This is because a distance of 1 mile to the five Boston employment centers is one feature of the outliers. The resulting prediction interval for this house price is (18.42295; 31.61059). Hence, with 95% confidence, we conclude that the price of a single new house in a Boston suburb with 10% lower status of the population, a nitric oxides concentration of 0.5 (parts per 10 million), an average of 5.5 rooms per dwelling, a pupil-teacher ratio by town of 16, a ratio of blacks and whites by town of 1.178, and a weighted distance to five Boston employment centers of 1 mile will be between $18,423 and $31,611. As the examples show, for two houses in the same Boston suburb, one nearer to the employment centers and one farther away, the model predicts a higher price for the nearer one.
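The leverage check and the 95% prediction interval used in these two examples can be sketched for the simplest case of a single predictor. This is an illustration under stated assumptions, not the fitted model from the project: the data are made up, the function names are hypothetical, and the critical t value is passed in by hand. The project's model has several predictors, where the leverage of a new point is computed in matrix form, but the logic is the same.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares line; returns intercept, slope, MSE, mean of x, and Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse / (n - 2), x_bar, sxx

def predict_with_interval(x_new, xs, ys, t_crit):
    """Leverage, point prediction, and prediction interval for one new point."""
    intercept, slope, mse, x_bar, sxx = fit_line(xs, ys)
    # Leverage of the new point: grows as x_new moves away from the mean of x.
    leverage = 1 / len(xs) + (x_new - x_bar) ** 2 / sxx
    y_hat = intercept + slope * x_new
    # Interval for a single new observation, hence the "1 +" term.
    half_width = t_crit * sqrt(mse * (1 + leverage))
    return leverage, y_hat, (y_hat - half_width, y_hat + half_width)

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT = 3.182  # 97.5th percentile of t with n - 2 = 3 degrees of freedom

leverage_mid, y_mid, interval_mid = predict_with_interval(3.0, xs, ys, T_CRIT)
leverage_far, y_far, interval_far = predict_with_interval(6.0, xs, ys, T_CRIT)
```

The point at the center of the data has the smallest leverage and the narrowest interval; the point outside the observed range has a larger leverage and a wider interval, which is why a leverage close to the allowed maximum is a warning sign.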
Before starting this work, we guessed what the final model would look like, including the expected signs of the coefficients. According to our initial expectation, the coefficients of % lower status of the population, pupil-teacher ratio by town, weighted distances to five Boston employment centers, and nitric oxides concentration (parts per 10 million) should be negative, and the resulting model supported these expectations.
We also initially guessed that the relation between the average number of rooms per dwelling and the house price would be positive, since in reality the more rooms a house has, the more expensive it is. However, our project gave a negative linear coefficient (β2 = -31.343) for the average number of rooms per dwelling, which means that with one additional room the house price decreases by $31,343; this did not support our expectation. From the project, we learned several interesting things:
1st is that samples picked at random by the R program do not always lead to the same conclusion, which is why it is very important to work with the initial sorted samples.
2nd is that there is still discrimination against African Americans in the US. We concluded this from the p-value of the variable for the proportion of Black residents: the variable was significant, which means it still has an effect on the house price. If race did not matter in the housing market, the variable for the proportion of Black residents would not be included in the model.
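Regarding the first lesson, one way to keep a random split reproducible is to fix the seed of the random number generator, so that the same model-building and validation samples are drawn on every run. The sketch below is in Python rather than R and rests on assumptions: the function name is hypothetical, the 50/50 split fraction and the seed are arbitrary, and 506 is the number of rows in the Housing data set. In R, calling set.seed before sampling plays the same role.

```python
import random

def split_indices(n_rows, train_fraction, seed):
    """Split row indices into model-building and validation sets, reproducibly."""
    rng = random.Random(seed)      # private generator with a fixed seed
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    return sorted(indices[:cut]), sorted(indices[cut:])

# The same seed gives the same split on every run.
train_a, valid_a = split_indices(506, 0.5, seed=2024)
train_b, valid_b = split_indices(506, 0.5, seed=2024)
same_split = (train_a == train_b) and (valid_a == valid_b)
```

Recording the seed together with the results makes it possible to rerun the analysis later on exactly the same samples.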