A good friend of mine, when asked exactly what it is I do
for a living, sometimes replies “he does sums”.
There is a fair amount of truth in this; indeed much of the quantitative analysis that
anyone does can involve doing lots of ‘sums’.
And to help us, we use various software tools such as SPSS,
Q, R, Minitab, Stata, Excel,etc.
But occasionally we can notice something slightly weird,
with an unexpected result, or a computation that doesn’t seem to lead to where
it should.
Very often, this is simply a data cleaning issue (or,
rather, a lack of data cleaning issue).
For the heavy quant people, there are well-known mantras such as “Step
1: clean your data; Step 2: clean your data again; Step 3: repeat Steps 1 and 2”.
Or: “95% of advanced analysis is getting the datafile into
shape”.
Or even, as lamented by Sherlock Holmes, in the Conan-Doyle
story “The adventure of the Copper Beeches”:
‘"Data! Data! Data!" he cried impatiently. “I can’t make bricks without clay!”’
Another good quant analysis rubric is: “If it looks unusual, it’s probably wrong.”
In the above vein, I recently came across a rather
disturbing article(a), published only a couple of years ago, that
deals with a claimed plethora of computational errors that are literally built into Excel™.
After conducting a large number of tests (admittedly, some
of them using datasets that might be described as ‘slightly esoteric’), the
authors nonetheless conclude that “…it is not safe to assume that Microsoft
Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical
analyses should use some other package.”
I discussed this with a senior statistical consultant, who
replied:
“This paper criticising
Excel freaked me out when I first read it ...
However, over time, I have become less concerned. I looked into some of the tests … a few
general conclusions:
a) It is disappointing
that Microsoft doesn’t fix these things.
b) The errors are at the
margins. That is, we are talking about
inaccuracies that tend to occur when the techniques are unreliable anyway
(e.g., severe multicollinearity).
c) There is more than a
degree of unfairness in the critique. For
example, in the case of Solver, I have found it repeatedly to do a better job
than the various optimisers in R.”
So, given all the above, and all other things being equal,
it is probably best to take the bad news concerning Excel with a grain of
statistical salt. Nonetheless, it may
sometimes be wise to use two alternative computational means when
working with something really critical, just to be sure.
No comments:
Post a Comment