Tuesday, 26 February 2013

Does Excel™ really excel ?

A good friend of mine, when asked exactly what it is I do for a living, sometimes replies “he does sums”.

There is a fair amount of truth in this;  indeed much of the quantitative analysis that anyone does can involve doing lots of ‘sums’.

And to help us, we use various software tools such as SPSS, Q, R, Minitab, Stata, Excel,etc.

But occasionally we can notice something slightly weird, with an unexpected result, or a computation that doesn’t seem to lead to where it should.

Very often, this is simply a data cleaning issue (or, rather, a lack of data cleaning issue).  For the heavy quant people, there are well-known mantras such as “Step 1:  clean your data; Step 2:  clean your data again; Step 3:  repeat Steps 1 and 2”. 

Or: “95% of advanced analysis is getting the datafile into shape”. 

Or even, as lamented by Sherlock Holmes, in the Conan-Doyle story “The adventure of the Copper Beeches”:   ‘"Data! Data! Data!" he cried impatiently.  “I can’t make bricks without clay!”’

Another good quant analysis rubric is:  “If it looks unusual, it’s probably wrong.”

In the above vein, I recently came across a rather disturbing article(a), published only a couple of years ago, that deals with a claimed plethora of computational errors that are literally built into Excel™.

After conducting a large number of tests (admittedly, some of them using datasets that might be described as ‘slightly esoteric’), the authors nonetheless conclude that “…it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer.  Persons who wish to conduct statistical analyses should use some other package.”

I discussed this with a senior statistical consultant, who replied:

“This paper criticising Excel freaked me out when I first read it ...  However, over time, I have become less concerned.  I looked into some of the tests … a few general conclusions:

a)         It is disappointing that Microsoft doesn’t fix these things.

b)         The errors are at the margins.  That is, we are talking about inaccuracies that tend to occur when the techniques are unreliable anyway (e.g., severe multicollinearity). 

c)         There is more than a degree of unfairness in the critique.  For example, in the case of Solver, I have found it repeatedly to do a better job than the various optimisers in R.”

So, given all the above, and all other things being equal, it is probably best to take the bad news concerning Excel with a grain of statistical salt.  Nonetheless, it may sometimes be wise to use two alternative computational means when working with something really critical, just to be sure.

No comments:

Post a Comment