Tuesday, 20 May 2014

I've found it ! (again)

Data scientists like me spend a lot of time worrying about which type of model to fit and the best way of fitting it. 
As one example, we think about regression, its many pitfalls, and its arguable overuse in market research, and we look for alternatives that deal with issues such as collinearity, non-linear relationships, non-numeric dependent variables, non-‘normal’ predictors, and missing values, not to mention data files that are invariably imperfect (e.g. when a value of zero can mean ‘0’, ‘Don’t know’, or ‘the question wasn’t asked of that respondent’).
In fact, the frontier for modelling in market research is now very much associated with approaches from the data mining field, such as (you know I am going to say this) Random Forests and even Conditional Random Forests. 
Random Forests and similar approaches (which together fall into the category of ‘ensemble modelling’, which is more or less the quantitative version of Wisdom of Crowds) form the backbone of competitive attempts to predict the near-impossible, via the various competitions on Kaggle http://www.kaggle.com/.
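For readers who like to see the ‘Wisdom of Crowds’ idea in code, here is a minimal sketch of bagging, the averaging step that sits underneath Random Forests. It is not a Random Forest: the models are one-split ‘stumps’ rather than full trees, there is no random selection of predictors, and the data, the seed and the function names (fit_stump, bagged_predict, mse) are all made up for illustration.

```python
import random
import statistics


def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs by least squares."""
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best = (sse, threshold, lm, rm)
    if best is None:
        # Every x in the sample was identical: fall back to the overall mean.
        overall = statistics.fmean(y for _, y in points)
        return lambda x: overall
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm


def bagged_predict(models, x):
    """The 'crowd' prediction: the plain average of every model's prediction."""
    return statistics.fmean(m(x) for m in models)


# Made-up training data: y = x^2 plus a little noise (seed fixed for repeatability).
rng = random.Random(0)
train = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.1)) for x in range(40)]

# Each stump is fitted to its own bootstrap resample of the training data.
models = []
for _ in range(50):
    sample = [rng.choice(train) for _ in train]
    models.append(fit_stump(sample))

# Compare errors against the noise-free curve at points between the training x values.
test_xs = [x / 10 + 0.05 for x in range(40)]


def mse(predict):
    return statistics.fmean((predict(x) - x ** 2) ** 2 for x in test_xs)


single_errors = [mse(m) for m in models]
ensemble_error = mse(lambda x: bagged_predict(models, x))
print("average error of the individual stumps:", statistics.fmean(single_errors))
print("error of the averaged ensemble:", ensemble_error)
```

The squared error of an averaged prediction can never exceed the average squared error of the individual predictions, which is the arithmetic behind the crowd usually beating its typical member.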
In my search for alternatives, a few years back I came across the unfortunately spelt ‘Eureqa’ software, and have written about it previously. 
Eureqa Desktop uses something called ‘Symbolic Regression’, and its mode of operation is essentially to conduct a search for the best model amongst all possible models.  That is, it doesn’t just say “Here is the model you’ve asked for and I will now calibrate it”.  It says “Here are a zillion possible different models, and I will calibrate them all and let you know which is the best one.” 
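To make the ‘calibrate them all and report the best’ idea concrete, here is a toy sketch. It is emphatically not how Eureqa works internally: a real symbolic regression tool builds and evolves expressions for itself, whereas this sketch simply tries a short, hand-written list of candidate forms (the CANDIDATES dictionary is hypothetical), fits each one by ordinary least squares, and ranks them by error. The data are made up, and x is assumed to be positive so that the square root and logarithm are defined.

```python
import math

# Hypothetical candidate forms, each written as y = a*f(x) + b.
# A real symbolic-regression tool generates these itself.
CANDIDATES = {
    "a*x + b": lambda x: x,
    "a*x^2 + b": lambda x: x * x,
    "a*sqrt(x) + b": lambda x: math.sqrt(x),
    "a*log(x) + b": lambda x: math.log(x),
    "a*exp(x) + b": lambda x: math.exp(x),
}


def fit_linear(u, y):
    """Ordinary least squares for y = a*u + b; returns (a, b)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(y) / n
    var_u = sum((ui - mean_u) ** 2 for ui in u)
    cov = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
    a = cov / var_u
    return a, mean_y - a * mean_u


def search(xs, ys):
    """Calibrate every candidate form and rank them by mean squared error."""
    results = []
    for name, transform in CANDIDATES.items():
        u = [transform(x) for x in xs]
        a, b = fit_linear(u, ys)
        mse = sum((a * ui + b - yi) ** 2 for ui, yi in zip(u, ys)) / len(ys)
        results.append((mse, name, a, b))
    return sorted(results)  # smallest error first


# Made-up, noise-free data generated from y = 3*sqrt(x) + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * math.sqrt(x) + 1.0 for x in xs]

best_mse, best_name, best_a, best_b = search(xs, ys)[0]
print(best_name, round(best_a, 3), round(best_b, 3))
```

With real, noisy survey data the ranking would be far less clear-cut, and a sensible search would also penalise more complicated forms so that it does not simply reward the most flexible one.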
Eureqa Desktop is available from http://www.nutonian.com/ and is, unfortunately, no longer free. 
However, the Excel version is (at the time of writing) in beta testing, and is free.  And unlike Eureqa Desktop, Eureqa for Excel pretty much makes all the decisions for you (such as what functional forms to allow in the models that it examines).
I’m still to be convinced about Eureqa … but I do understand, for example, that if you input the co-ordinates of a swinging pendulum, Eureqa will eventually come up with Newton’s laws of motion as the best model for the data!
However, in market research we rarely have data sets that display the same degree of precision as a swinging pendulum, and finding a model that predicts satisfactorily, however good the software used, is probably always going to be at least a little problematic.
Ultimately it comes down to finding an approach that (a) may not be perfect but works well enough and (b) the outputs of which make sense, in the light of what else we know about the particular situation being analysed.
As the statistician G.E.P. Box wrote in 1979 … “All models are wrong, but some are useful.” 


  2. Eureqa Desktop is still free for academics and students (.edu emails); however, they offer a paid service for cloud computing.