. . . AVMs receive "confidence scores", from low to high, indicating what level of confidence the user . . . .
Statistical models throw off a variety of variance measures: standard error, R^2, coefficient of variation, confidence intervals, F-statistics, p-values, etc. The one the general public is most familiar with is the "plus or minus" you see on a political poll. All it means is that political opinion surveys, and valuation predictions, are a range and not a specific value. It also means that if the same survey were re-conducted in exactly the same manner and with the same method, the results would (usually) fall within that range. The confidence interval isn't going to help you if you can't reach vast swaths of the voting public on a cell phone, nor is it insightful if Trump's electorate declines to answer or answers falsely.
Thus, to say one politician or opinion is ahead or more popular at 46% vs. 44% {with 10% other} is statistically meaningless. To get the 3% or so that you see on political polls, the honest companies commonly survey 1,000+ people, and they have to carefully obtain a representative sample of each demographic segment. If you survey 1,000 people at a Monster Truck Rally or at a TED Talk, all it means is that this is the likely result for that demographic -- not the nation or state! The famous "Dewey Defeats Truman" headline resulted from pollsters relying on phone interviews in an era when the lower and middle classes did not universally have phone lines, thus skewing the "who" -- the sampling -- that gets surveyed. The dishonest pollsters, especially those working for special interest groups, know they can tweak the results, or push-poll, through "what" is being measured and "how" it is asked.
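Here is a minimal sketch of where that "plus or minus 3" comes from, assuming a simple random sample and a 95% confidence level. The function and numbers are illustrative only, not any pollster's actual method:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion (simple random sample assumed)."""
    return z * math.sqrt(p * (1 - p) / n)

print(f"{margin_of_error(0.50, 100):.1%}")    # ~9.8% -- why small polls are worthless
print(f"{margin_of_error(0.50, 1000):.1%}")   # ~3.1% -- the familiar "plus or minus 3"

# A 46% vs. 44% "lead" sits well inside +/- 3.1 points -- statistical noise.
```

Note the square root: quadrupling the sample only halves the margin, which is why honest pollsters stop around 1,000 to 1,500 respondents and spend their effort on who gets sampled instead.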
What is being measured is as important as who. This becomes messy because of definitions. If we are sampling Koreans, do we mean North Koreans, South Koreans, both, or Korean-Americans? If we are sampling blonde women, or single women, or soccer moms, or angry white men, or restaurants, or McMansions, then what is that, exactly? Traditional statistical models require homogeneity of the data. If we are asking what the average male height is, then we have to define who the average male is. 18 and older? 18 to 65? 25 to 65? Americans? All races or one? If all races, is the sample representative of the demographic distribution?
Narrow, tight definitions have more knowledge and less noise -- but they are less broadly useful. Consider this question:
"What brand of toothpaste do white single mothers aged 30 to 35, living in 1980s multi-family housing in Lakewood, Colorado, prefer?" (Or who did they vote for, Hillary or Trump? Or how much house payment can they afford?)
Now that is useful knowledge! Very little noise if you survey 30 or more of these people. It can be successfully interpolated across the entire narrow population.
Yet the data is useless if you change any one of those variables: race; marital status; family status; age range; housing type; city; state. Facebook, Kroger, and Visa collect this type of detailed data for every little pocket across America.
Broad, loose definitions have little knowledge and extensive noise.
"What brand of toothpaste do Americans prefer?" (Who did they vote for, Hillary or Trump? How much house payment can they afford?)
This is useless knowledge: vast amounts of noise that doesn't mean much for making behavioral predictions about the statistical population or any sub-population.
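To put the narrow-vs-broad contrast in numbers, here is a toy simulation. Every figure is invented for illustration -- this is not survey data -- but it shows the pattern: a tightly defined segment clusters around its average, while the broad "Americans" group is so spread out that its average predicts almost nothing about any one household.

```python
import random
import statistics

random.seed(1)

# Hypothetical monthly housing budgets (all numbers invented for illustration).
# Broad group: "Americans" -- renters, retirees, high earners, all mixed together.
broad = [random.lognormvariate(7.2, 0.8) for _ in range(1000)]

# Narrow group: one age band, one family type, one neighborhood -- homogeneous.
narrow = [random.gauss(1400, 150) for _ in range(30)]

for label, sample in [("broad", broad), ("narrow", narrow)]:
    mean = statistics.mean(sample)
    spread = statistics.stdev(sample)
    print(f"{label:6s} mean ~ ${mean:8,.0f}   std. dev. ~ ${spread:8,.0f}")
```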
Bringing this back home, pun intended, the gigantic problems for AVMs are (1) the data, (2) the definitions, and (3) the small sample sizes.
- Real estate data is sloppy, incomplete, wrong, out of date.
- The definitions can be too precise (e.g., breaking down the full array of bathroom mix, which is simply too much unnecessary information), too generalized (e.g., tossing basement square footage in with above-grade square footage), or simply unclear: what exactly is a "condition" rating? Real estate academics have a tendency to think "if I have this data variable, then it must be important and I should use it." I call this the "everything in the blender" problem. So they may use "end of cul-de-sac" or not as their "location" variable, but not include school district because they lack that variable in their acquired dataset -- and they will grab sales in dozens of neighborhoods that no sane appraiser, even the skippy appraisers, would consider.
- The sample sizes are so small that the standard error becomes gigantic. It is like conducting a survey and concluding the average man is 5' to 7' tall. Duh. It only truly works in cookie-cutter subdivisions.
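Here is a minimal sketch of that sample-size problem, using invented sale prices: with five comps from a mixed area, the 95% range around the average price is so wide it borders on useless, and it only tightens when you have dozens of genuinely comparable sales -- which, outside cookie-cutter subdivisions, you rarely do.

```python
import random
import statistics

random.seed(7)

def ci_95(prices):
    """Rough 95% confidence interval for the average price (z-approximation)."""
    mean = statistics.mean(prices)
    se = statistics.stdev(prices) / len(prices) ** 0.5
    return round(mean - 1.96 * se), round(mean + 1.96 * se)

# Invented sale prices from a mixed, non-cookie-cutter area:
five_comps = [285_000, 362_000, 410_000, 298_000, 455_000]
fifty_comps = [random.gauss(362_000, 72_000) for _ in range(50)]

print(ci_95(five_comps))    # roughly (299k, 425k) -- a range too wide to be useful
print(ci_95(fifty_comps))   # far tighter -- but fifty true comps rarely exist
```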
Appraising in Texas, I would get AVMs citing sales that didn't exist. They [CoreLogic] would pull mortgages and call them sales, treating them as purchases at 80% LTV. I would research the data and find they were refinance loans, often at 90% and higher LTVs.
CoreLogic's statistical modelers, trying to solve the sample-size problem, create a "synthetic sale" -- but this compounds the problem by making the real estate data wrong. It is an elegant lie.
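The arithmetic behind that kind of synthetic sale looks something like the sketch below -- my own illustration of the idea with invented numbers, not CoreLogic's actual code. Back a "price" out of a recorded loan at an assumed 80% LTV, and if the loan was really a 92% refinance, the implied value is overstated by roughly 15%.

```python
def implied_price(loan_amount, assumed_ltv):
    """Back a 'sale price' out of a recorded mortgage by assuming a loan-to-value ratio."""
    return loan_amount / assumed_ltv

loan = 288_000  # a recorded mortgage (invented number)

# The modeler assumes a purchase loan at 80% LTV:
synthetic_sale = implied_price(loan, 0.80)    # 360,000

# But if it was actually a refinance at 92% LTV, the "sale" never happened
# and the implied value is too high:
actual_value = implied_price(loan, 0.92)      # ~313,000

print(round(synthetic_sale), round(actual_value))
print(f"overstated by ~{synthetic_sale / actual_value - 1:.0%}")   # ~15%
```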
It gets worse with real-estate-specific problems: extrapolation of age or size outside the dataset; unknown or uncategorized externalities; the biggest house on the block; the nicest {or worst} house on the block; potential to raze; distressed or non-arm's-length sales; renovations and condition.
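As one example of the extrapolation problem, here is a naive price-versus-size line fitted by hand to four invented comps between 1,200 and 2,400 sf. Asked to value a 4,500 sf home, it happily extends the same per-square-foot slope far outside the data, even though very large homes typically sell at a lower price per square foot.

```python
# Naive least-squares line fitted to four invented comps (1,200-2,400 sf).
comps = [(1200, 240_000), (1600, 300_000), (2000, 352_000), (2400, 398_000)]

n = len(comps)
mean_sf = sum(sf for sf, _ in comps) / n
mean_price = sum(price for _, price in comps) / n
slope = (sum((sf - mean_sf) * (price - mean_price) for sf, price in comps)
         / sum((sf - mean_sf) ** 2 for sf, _ in comps))
intercept = mean_price - slope * mean_sf

def predict(sf):
    return intercept + slope * sf

print(round(predict(2_000)))   # inside the data range: plausible
print(round(predict(4_500)))   # far outside the data range: the model keeps
                               # adding ~$131/sf that real buyers won't pay
```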
It gets even worse when you consider model construction -- but this is enough for now.
Said more simply: