Posted by Jim:
This seems to imply that you can use the entire dataset of all sales of SFRs without regard to segmentation, and as long as R-squared is above about 90%, you've got a "good" model. Correct? Incorrect?
You have a knack. I hardly know where to begin, but I would like to point out that Mark Twain predates regression.
It is hard for me to say where sales analysis leaves off and regression analysis begins. Some brilliant appraiser wrote an article showing how you could pair, graph, or use the regression equation to get the same answer under certain conditions. However, there is a big difference between 1) working with a line of best fit (or trend line, as Austin calls it), which you can even draw by hand on your workfile folder, and 2) a fully tested, statistically significant correlation.
Regression was developed in the context of random samples and populations. When we start sorting sales, we de-randomize the sample. (This is where I start to disagree with my esteemed colleague Austin). Working with small and de-randomized samples, the whole idea of “correlation” gets distorted. If the sold properties are similar enough, then price per doorknob and price per light bulb will come in with Rsq’s of 90% because they are (as Austin likes to say) covariant with square footage. That is, more house means more doors and more lights.
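To make the doorknob point concrete, here is a small simulation with made-up numbers (the house sizes, prices, and knob counts are all hypothetical): prices are driven entirely by square footage, and doorknob count simply tracks square footage too. Price ends up strongly correlated with doorknobs even though doorknobs explain nothing.

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical tract of similar homes: price is driven by square footage.
sqft = [random.uniform(1200, 2400) for _ in range(20)]
price = [150 * s + random.gauss(0, 15000) for s in sqft]
# Doorknobs merely covary with size: roughly one per 70 sq ft.
knobs = [round(s / 70 + random.gauss(0, 1)) for s in sqft]

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

r = pearson_r(price, knobs)
print(f"price-vs-doorknob R-squared: {r ** 2:.2f}")  # high, yet meaningless
```

The "correlation" here is pure covariance with size: more house means more doors, not the other way around.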
The “test” statistics that Austin refers to include a calculation (the p-value) that indicates the chance that your “correlation” occurred by accident. Suppose a scientist wants to test the hypothesis that a flipped coin will always land on heads. He spends six months testing the coin. He flips it twice, it comes up heads both times, and he publishes his findings in a paper. The problem is that this pattern of results occurs by random chance 25% of the time. So even though his “correlation” is 100%, the results are not “statistically significant.”
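The arithmetic behind that 25% is just (1/2) squared. A quick sketch also shows how many heads in a row a fair coin would need before the pattern counts as "statistically significant" at the conventional 5% level:

```python
# Chance of n heads in n flips of a fair coin is (1/2) ** n.
p_two_heads = 0.5 ** 2
print(f"2 heads in 2 flips happens by pure luck {p_two_heads:.0%} of the time")

# How many straight heads before the chance of pure luck drops below 5%?
n = 1
while 0.5 ** n >= 0.05:
    n += 1
print(f"need {n} heads in a row for p < 0.05")  # 5 heads: 0.5 ** 5 = 0.03125
```

In other words, two flips cannot establish anything at the usual threshold; the scientist needed at least five straight heads before anyone should listen.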
I know that example is off the wall, but it illustrates a point. We all knew the experiment had an absurd design, and yet a 25% chance of coincidence actually sounds low compared to the magnitude of the absurdity. This, in my opinion, is why it is important to run the whole show if you are going to tell the client it is “regression” and you have high “correlation.” It is better in some ways to have 70% correlation that is almost surely real than 95% correlation that could easily be an illusion.
We had one of these discussions a little over a year ago about a specific three-sale sample. The graph pattern had an important quirk. Two sales were close together, one above the other. The third sale was off to the right. The line of best fit had to pass through the third point and halfway between the first two. There is just nowhere else for it to go. Now imagine a graph as big as the solar system and move only that third point out along the same line to about where Pluto is. The line of best fit still passes through the third point and halfway between the first two. The first two points have not moved, and their distance from the line has not changed. Yet the Rsq “correlation” does not suffer; if anything, the faraway point inflates the total variation the line appears to “explain” and pushes Rsq even closer to 100%.
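That quirk is easy to check with a few lines of code (the sale figures below are made up for illustration). We fit the line twice: once with the third sale nearby, once with it pushed far out along the same trend line. The residuals at the first two sales are identical in both fits, and R-squared stays lofty no matter how far out the third sale sits:

```python
def fit_stats(xs, ys):
    """Ordinary least-squares fit: returns (residuals, R-squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    resid = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    ss_res = sum(e ** 2 for e in resid)
    ss_tot = sum((y - my) ** 2 for y in ys)
    return resid, 1 - ss_res / ss_tot

# Two nearby sales of same-size homes ($10k apart), plus a third sale.
near = fit_stats([1000, 1000, 2000], [100_000, 110_000, 150_000])
# Same two sales; third sale moved way out along the same trend line.
far = fit_stats([1000, 1000, 10_000], [100_000, 110_000, 510_000])

print("residuals near:", [round(e) for e in near[0]], "Rsq:", round(near[1], 4))
print("residuals far: ", [round(e) for e in far[0]], "Rsq:", round(far[1], 4))
```

The fit at the first two sales never improves (they sit $5,000 off the line in both cases), yet the distant point drags R-squared from roughly 0.96 to well above 0.99.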
This is a good illustration of how little a high correlation can mean in a small sample. There are just too many ways to get this result by accident. With three sales that are visually similar and on the same street, I think 90% price-size correlation would be “low.” So Jim, we have to be careful lest Mark Twain's ghost rise and say: there are lies, damned lies, and three-sale R-squareds.
