Data scientists like me spend a lot of time
worrying about which type of model to fit and the best way of fitting it.
As one example, we think about regression, its many
pitfalls, and its arguable overuse in market research, and we look for
alternatives that deal with issues such as collinearity, non-linear
relationships, non-numeric dependent variables, non-‘normal’ predictors, and
missing values, not to mention data files that are invariably imperfect (e.g.
when a value of zero can mean either ‘0’ or ‘Don’t know’ or ‘the question
wasn’t asked of that respondent’).
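To make the ambiguous-zero problem concrete, here is a minimal Python sketch. It assumes the data file carries a separate status flag alongside each raw value; that flag, the `clean` helper, and the six made-up responses are purely illustrative and do not come from any real survey.

```python
import statistics

# Hypothetical survey extract: (raw_value, status).
# The status column is an assumption; many real files do not have one.
responses = [
    (7, "answered"),
    (0, "answered"),    # a genuine zero
    (0, "dont_know"),   # zero used as a "Don't know" code
    (0, "not_asked"),   # zero used because the question was skipped
    (9, "answered"),
    (8, "answered"),
]

def clean(raw, status):
    """Keep the value only when the respondent actually answered."""
    return raw if status == "answered" else None

# Naive approach: every zero is treated as a real rating.
naive_mean = statistics.fmean([raw for raw, _ in responses])

# Cleaned approach: non-answers become missing and are left out.
valid = [v for v in (clean(r, s) for r, s in responses) if v is not None]
clean_mean = statistics.fmean(valid)

print(naive_mean, clean_mean)
```

The two means differ only because two of the three zeros were never answers in the first place, which is exactly the kind of distortion a regression would inherit.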
In fact, the frontier for modelling in market
research is now very much associated with approaches from the data mining
field, such as (you know I am going to say this) Random Forests and even
Conditional Random Forests.
Random Forests and similar approaches
(which together fall into the category of ‘ensemble modelling’, which is more
or less the quantitative version of Wisdom of Crowds) form the backbone of
competitive attempts to predict the near-impossible, via the various
competitions on Kaggle (http://www.kaggle.com/).
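The ‘Wisdom of Crowds’ idea can be sketched in a few lines of Python. This is not a Random Forest: it is a toy bagged ensemble of one-split regression ‘stumps’ fitted to simulated data, and every name in it (`fit_stump`, `mse`, the quadratic `truth` function) is an assumption made for illustration.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimising squared error."""
    overall = statistics.fmean(ys)
    best_sse, best_rule = float("inf"), None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best_sse:
            best_sse, best_rule = sse, (t, lm, rm)
    if best_rule is None:  # degenerate sample with a single distinct x
        return lambda x: overall
    t, lm, rm = best_rule
    return lambda x: lm if x < t else rm

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

random.seed(0)
truth = lambda x: x * x  # assumed 'true' relationship
train_x = [random.random() for _ in range(100)]
train_y = [truth(x) + random.gauss(0, 0.1) for x in train_x]
test_x = [i / 200 for i in range(200)]
test_y = [truth(x) for x in test_x]

# Each stump sees a different bootstrap resample of the training data.
stumps = []
for _ in range(50):
    idx = [random.randrange(len(train_x)) for _ in range(len(train_x))]
    stumps.append(fit_stump([train_x[i] for i in idx], [train_y[i] for i in idx]))

# The ensemble prediction is simply the average of the individual stumps.
ensemble = lambda x: statistics.fmean([s(x) for s in stumps])

individual_mse = statistics.fmean([mse(s, test_x, test_y) for s in stumps])
ensemble_mse = mse(ensemble, test_x, test_y)
print(individual_mse, ensemble_mse)
```

Because squared error is convex, the averaged prediction can never do worse than the average individual stump, and in practice it usually does noticeably better.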
In my search for alternatives, a few years back I
came across the unfortunately spelt ‘Eureqa’ software, and have written about
it previously.
Eureqa Desktop uses something called ‘Symbolic
Regression’, and its mode of operation is essentially to conduct a search for
the best model amongst all possible models.
That is, it doesn’t just say “Here is the model you’ve asked for and I
will now calibrate it”. It says “Here
are a zillion possible different models, and I will calibrate them all and let
you know which is the best one.”
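The ‘calibrate them all and pick the best’ idea can be illustrated with a deliberately tiny Python sketch. It is not how Eureqa works internally: Eureqa searches a vast space of expressions with an evolutionary algorithm, whereas this toy simply brute-forces six hand-picked forms of the type y = a·f(x) + b. The candidate list, the simulated data, and the function names are all assumptions.

```python
import math
import random

# A handful of candidate functional forms (an illustrative, hand-picked list).
CANDIDATES = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "sqrt(x)": math.sqrt,
    "log(x)": math.log,
    "sin(x)": math.sin,
    "exp(x)": math.exp,
}

def calibrate(f, xs, ys):
    """Ordinary least squares for y = a*f(x) + b; returns (a, b, mse)."""
    zs = [f(x) for x in xs]
    n = len(zs)
    zbar, ybar = sum(zs) / n, sum(ys) / n
    szz = sum((z - zbar) ** 2 for z in zs)
    szy = sum((z - zbar) * (y - ybar) for z, y in zip(zs, ys))
    a = szy / szz
    b = ybar - a * zbar
    err = sum((a * z + b - y) ** 2 for z, y in zip(zs, ys)) / n
    return a, b, err

def search(xs, ys):
    """Calibrate every candidate and return the name of the lowest-error one."""
    results = {name: calibrate(f, xs, ys) for name, f in CANDIDATES.items()}
    best = min(results, key=lambda name: results[name][2])
    return best, results

# Simulated data from y = 3*sqrt(x) + 1 plus a little noise.
random.seed(1)
xs = [random.uniform(0.5, 5.0) for _ in range(60)]
ys = [3 * math.sqrt(x) + 1 + random.gauss(0, 0.02) for x in xs]

best_name, results = search(xs, ys)
best_a, best_b, best_err = results[best_name]
print(best_name, round(best_a, 2), round(best_b, 2))
```

With clean simulated data the search recovers the generating form easily; the open question for market research is how well this kind of search copes when the data are far noisier.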
Eureqa Desktop is available from http://www.nutonian.com/
and is, unfortunately, no longer free.
However, the Excel version is (at the time of
writing) in beta testing, and is
free. And unlike Eureqa Desktop, Eureqa
for Excel pretty much makes all the decisions for you (such as what functional
forms to allow in the models that it examines).
I’m still to be convinced about Eureqa … but I do
understand, for example, that if you input the co-ordinates of a swinging
pendulum, Eureqa will eventually come up with Newton’s second law of motion as
the best model for the data!
However, in market research, we rarely have data
sets that display the same degree of precision as a swinging pendulum, and
finding a model that predicts satisfactorily, however good the software used,
is probably always going to be at least a little problematic.
Ultimately it comes down to finding an approach
that (a) may not be perfect but works well enough and (b) the outputs of which
make sense, in the light of what else we know about the particular situation
being analysed.
As the statistician G.E.P. Box wrote in 1979 … “All
models are wrong, but some are useful.”
Eureqa Desktop is still free for academics and students (.edu emails); however, they offer a paid service for cloud computing.