How to Model Data Incorrectly

Wednesday 22 April 2020, 7:00 am

Some days you read an article or a blog post that makes you shake your head – you groan and go on to the next thing. Then there are the other days when you read something that makes you want to scream, “NO NO NO!” That’s what happened when I read the recent call for us to return to “neo-classical” models to solve our business problems. Larry’s introduction got me excited, as it echoed my own sentiments about our industry’s boundless enthusiasm for things new and shiny, regardless of how well they work. Alas, the intro was the best part of this blog.

The author claims that a lot of the newer, more complex modeling techniques don’t work well and may not be an improvement over simple linear models. He begins to explain why he believes this and I’m ready to bite – I’ve seen lots of models fail for any number of reasons and I’m hoping he has some new ammunition for me. I’m even ready to accept a re-hash of the topic. Maybe he’ll point out that these complex models can be black-boxy or they are often over-specified or they are inappropriate for the data source. Instead, I get six reasons why we should go back to simple linear models – not a single one of which is legitimate. Let’s look at the reasons the author brings up:

1. Simple linear models are simple to understand and communicate – this is both untrue and a terrible reason to choose a model unless we are adopting Occam’s Razor among equally good choices.
2. Simple models are democratic, in that anyone can run one – this is an even worse reason to choose a model.
3. Simple models work with small data sets (where complex models may need larger data sets) – this is not a reason to choose a model, although it may fall into the “I’m doing the best I can with the data I have” category. Use it knowing you may be missing something.
4. Simple linear models are less likely to lead to the wrong conclusions – that is a wrong conclusion. Linear models are, in and of themselves, no more or less likely to lead to wrong conclusions than complex models.
5. Simple linear models are cheaper and faster to run – the Yugo was a cheap car that kind of went fast. That didn’t make it a good car, and being cheap and fast doesn’t make a model a good choice either.
6. Simple linear models can easily be turned into ones that are complex – which would seem to be contrary to the call for simple linear models.

The only reason you adopt a simple linear model is that the relationships in your data are reasonably linear. If the relationships between variables are non-linear, then you don’t use a linear model. Let me give you an example. If you look at the relationship between sales and the price spread of a private label product against its national brand, you’ll find an inverted U-shaped function. There’s a sweet spot for the price gap at the top of the curve; moving away from that optimal price gap, either by increasing or decreasing the price spread, hurts sales. If you fit a simple linear model across the whole range, you get a non-significant result: the rising and falling halves of the curve cancel each other out, so the linear correlation between price spread and sales is close to zero even though the two are strongly related. You need a more complex, non-linear model to describe this data.
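To make the point concrete, here is a minimal sketch in Python with NumPy. The data are invented purely for illustration: the 25% sweet spot, the curvature, and the noise level are all assumptions, not figures from any real private label study. It compares a straight-line fit with a quadratic fit on an inverted U.

```python
import numpy as np

# Hypothetical data: price gap (in %) between the national brand and the
# private label, with sales assumed to peak at a 25% gap. All numbers here
# are made up for illustration.
rng = np.random.default_rng(0)
price_gap = np.linspace(5, 45, 81)
sales = 1000 - 2.0 * (price_gap - 25) ** 2 + rng.normal(0, 20, price_gap.size)


def r_squared(y, y_hat):
    """Share of the variance in y that is explained by the fitted values."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot


# Straight-line fit: sales = a * gap + b
linear_fit = np.polyval(np.polyfit(price_gap, sales, 1), price_gap)

# Quadratic fit: sales = a * gap^2 + b * gap + c
quad_coefs = np.polyfit(price_gap, sales, 2)
quad_fit = np.polyval(quad_coefs, price_gap)

pearson_r = np.corrcoef(price_gap, sales)[0, 1]
r2_linear = r_squared(sales, linear_fit)
r2_quad = r_squared(sales, quad_fit)

# The vertex of the fitted parabola estimates the sweet spot.
best_gap = -quad_coefs[1] / (2 * quad_coefs[0])

print(f"Pearson r (linear): {pearson_r:.3f}")
print(f"R^2, straight line: {r2_linear:.3f}")
print(f"R^2, quadratic:     {r2_quad:.3f}")
print(f"Estimated best gap: {best_gap:.1f}%")
```

With data shaped like this, the straight line explains almost none of the variation while the quadratic explains nearly all of it. A quadratic is only one of several ways to capture the curve; the point is that the model has to be able to bend where the data bend.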

John Tukey wrote his book Exploratory Data Analysis back in 1977 to encourage us to look at the shape of our data before analyzing it. I write this because this blog is filled with really bad advice and can only lead you, the researcher, down a dangerous road (even if you had good intentions, you know where that road takes you).

A good researcher will read the above, shake her head, and move on. A smart researcher will ask an expert about the criteria for good models. A really smart researcher will look up Kevin Gray. He writes often and well about what makes for good and bad models in a language that you don’t need a Stat degree to comprehend.
