Rearranging The Deckchairs

Frank O'Dwyer's blog

Scientific Replication Is Not Rote Repetition

The results of CRUTEM have been replicated, reimplemented and/or corroborated at least six times by now, using a variety of analyses, written in different languages, by different people. Not counting the satellite data and other lines of physical evidence, we have:

I’ll have to admit I was a little astounded at the agreement between Jones’ and my analyses, especially since I chose a rather ad-hoc method of data screening that was not optimized in any way. Note that the linear temperature trends are essentially identical; the correlation between the monthly anomalies is 0.91.

So we have by now temperature analyses all pointing to the same result and written in a variety of languages: R, Fortran, Python, STATA, Perl, and so on, some based on reimplementing code, and some based only on descriptions of the analysis steps.

This rather gives the lie to the notion put about by ‘sceptics’ that this type of replication is impossible without ‘the code’. Indeed, further than that, it shows that the result cannot be highly sensitive to low-level implementation detail, to the categorisation of ‘good’ and ‘bad’ data, or even to some high-level details of the data processing.
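To make that concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the station data are synthetic (a made-up shared trend plus noise, not CRUTEM or any real record), and `analysis_a` and `analysis_b` are hypothetical stand-ins for two people implementing the same described protocol in different ways, one of them with an arbitrary screening rule. It is not anybody's actual method.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def make_stations(n_stations=40, n_months=360, seed=0):
    """Synthetic monthly temperatures. NOT real station data."""
    rng = random.Random(seed)
    # Shared regional signal: assumed 0.02 C/year trend plus monthly variability.
    shared = [0.02 / 12 * t + rng.gauss(0, 0.3) for t in range(n_months)]
    stations = []
    for _ in range(n_stations):
        offset = rng.uniform(-5, 15)  # each station has its own baseline
        stations.append([
            offset
            + 10 * math.cos(2 * math.pi * (t % 12) / 12)  # seasonal cycle
            + shared[t]
            + rng.gauss(0, 0.5)  # station-level noise
            for t in range(n_months)
        ])
    return stations


def analysis_a(stations):
    """Anomalies per station against its calendar-month means, then average."""
    n_months = len(stations[0])
    anomalies = []
    for s in stations:
        clim = [mean(s[m::12]) for m in range(12)]
        anomalies.append([s[t] - clim[t % 12] for t in range(n_months)])
    return [mean([a[t] for a in anomalies]) for t in range(n_months)]


def analysis_b(stations):
    """Ad-hoc screening (drop every third station), centre each station,
    average, then remove the seasonal cycle from the combined series."""
    kept = [s for i, s in enumerate(stations) if i % 3 != 0]
    n_months = len(kept[0])
    centred = []
    for s in kept:
        station_mean = mean(s)
        centred.append([x - station_mean for x in s])
    combined = [mean([c[t] for c in centred]) for t in range(n_months)]
    cycle = [mean(combined[m::12]) for m in range(12)]
    return [combined[t] - cycle[t % 12] for t in range(n_months)]


def trend_per_decade(series):
    """Ordinary least-squares slope, converted from per-month to per-decade."""
    n = len(series)
    t_bar, y_bar = (n - 1) / 2, mean(series)
    num = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return num / den * 120


def correlation(xs, ys):
    """Pearson correlation between two equal-length series."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


stations = make_stations()
a, b = analysis_a(stations), analysis_b(stations)
print("trend A (C/decade):", round(trend_per_decade(a), 3))
print("trend B (C/decade):", round(trend_per_decade(b), 3))
print("correlation of monthly anomalies:", round(correlation(a, b), 3))
```

The two functions share no code and do not even use the same stations, yet they should report nearly the same trend and highly correlated monthly anomalies, because both follow the essential features of the protocol. A toy like this proves nothing about the real record, of course; it only illustrates why agreement between independent implementations is unsurprising when the signal is really in the data.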

When ‘sceptics’ talk about replication of temperature analysis the analogy is often made to a chemical or physical experiment, where you supposedly need the lab books etc in order to exactly replicate what was done to the decimal point. However the reality is that those who are replicating results typically do so in their own labs using their own scientific instruments, and simply follow the essential features of the protocol that they are trying to replicate. They may even try to falsify the result using a completely different experiment design with different controls. Getting exactly the same answer to the decimal point isn’t expected, because the circumstances can never be identical.

Think about your high school physics: when you rolled a weight down a ramp, you didn’t roll the same weight down the same ramp as everyone else in the world did. Nor, I presume, did you book a trip to the leaning tower of Pisa to chuck weights off the top. You probably didn’t go up in a space shuttle to test F=ma either. In ‘sceptic’ land, this makes the theory of gravity a hoax and you a scientific fraud. By the same analogy, the demand from ‘sceptics’ for ‘the code’ is like insisting that it’s insufficient for replication if experimenters simply describe the off-the-shelf equipment they used and what they did with it.

It is like insisting that replication is impossible unless the original scientists provide them with the same scientific instruments they used in their experiment and access to their lab. Worse than that, such is the degree of nitpicking that it is like demanding that scientists record details that aren’t even obviously relevant, such as what they had for breakfast when they did the experiment, how much they weighed that day, or what was the number 1 record that week.

Over on Bishop Hill, dcardno argues:

The problem is that the “instrument” used by various climate investigators was largely software - it didn’t really exist other than in the code, and had never been seen before. The analogy is not to an often-performed drug test, but an entirely new and unproven method of determining eligibility. Consider results of a chemical experiment: If I were to reference results from a mass spectrometer, I can be reasonably confident that any other experimenter can reproduce those results; “mass spectrometer” is a reasonably well-defined product. If they cannot reproduce my results, we might look at their lab techniques, process temperature control, timing and so on before we suspected that a difference in spectrometer is to blame. Eventually, though, if that was the only uncontrolled variable, we would investigate there, as well - and if different spectrometers gave significantly different results, we would learn something, either about my claims or the state of our instruments. On the other hand, if I produce a result, but base it on a newly-developed ‘mass spectophotometron’ other investigators have no way to verify or disprove my results, particularly if my ‘high level’ description of what the device does is either inadvertently or deliberately incomplete. In that case, in order to be taken seriously, I have to demonstrate that my ‘mass spectophotometron’ actually measures some relevant property, and that it does so in a consistent and meaningful way.

Exactly so. However, all experiments use off-the-shelf equipment in a novel protocol or they wouldn’t tell us anything new, so this objection, if valid, would apply to absolutely any experiment. And the point is that the statistical processing done on the temperature record is also made up of well-known and easily described off-the-shelf techniques. This is not brand new stuff - it is well-known techniques assembled to address a new problem. So again it comes down to describing what you did, and that means describing the details that matter. In the case of the temperature record, the fact that a variety of other analyses exist with very similar results obviously implies that people did have enough information to replicate. Not only that, but it implies that the answer is right.

dcardno continues (my emphasis):

This is analogous to the CRU / Mann / etc. results - based on the descriptions in their papers, capable investigators could not reproduce their findings, and could not verify that the black box code (their ‘mass spectophotometron’) functioned as described. Rather than demonstrate that their analysis was correct, they hid their methodologies and prevented independent access to their raw data. As shown by the various e-mails, they knew that their results were irreproducible - so they blustered, appealed to authority, and attacked anyone who actually wanted to see their ‘mass spectophotometron’ and see evidence that it actually worked.

How could they have known something that we now know for sure is not true? The CRU results have been reproduced. The Pedant-General argues:

If I do my own stuff and come up with a different answer, the team, the media and the IPCC ignore it entirely. […] The ONLY way to strike at the team and the consensus is to demonstrate that the consensus scientists have arrived at their results incorrectly. You have to demonstrate flaws in THEIR work.

To which my response is, so what if you are ignored? You have a right to speak but not a right to be listened to. I don’t buy it in the first place - if a result based on reasonable analysis of the same or even similar data existed, it would be addressed. This has already happened in the case of the UAH analysis, for example.

If the ‘sceptics’ really have an analysis that gives a different answer to CRUTEM or Mann or whoever then let’s see it. Let’s see your work. Of course, that means actually doing analysis and science, and not ‘auditing’. In essence what the ‘sceptics’ demand amounts to a proposal that the scientific method which has worked well for centuries should now be re-engineered to meet a different requirement. Instead of being optimised to find out the truth about the world, in spite of error, fraud and human fallibility, it should now be re-engineered to serve the needs of nitpickers and those in search of fraud.

But Science doesn’t care whether somebody added up a column of numbers correctly, and it’s already perfectly capable of routing around errors like that, malicious or otherwise. Science only cares about the damn answer. You get that by doing Science. Ultimately the fact is that there isn’t evidence of a serious problem in CRUTEM and there never was.

Certainly ‘open source science’ and exactly what details should be archived for exact reproduction is an interesting question to consider in its own right. Based on my own experience of archiving my own code for my own future reference, I would argue that you probably can’t do this easily without stashing away a virtual machine containing the complete implementation, and/or a source code control system containing not only your code but almost all dependencies.
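As a rough illustration of how much there is to record, here is a minimal Python sketch that captures a small part of the environment an analysis ran in. The function name `environment_manifest` and the choice of fields are my own assumptions for the example, and it falls well short of a virtual machine: it says nothing about the compiler, system libraries, operating system patches or hardware.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def environment_manifest(source_text):
    """Record interpreter, platform, installed packages and a source hash.

    Hypothetical helper for illustration; a partial record only.
    """
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
        except Exception:
            continue  # skip installs whose metadata cannot be read
        if name:
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
        "source_sha256": hashlib.sha256(source_text.encode("utf-8")).hexdigest(),
    }


manifest = environment_manifest("print('analysis goes here')")
print(json.dumps(manifest, indent=2)[:300])
```

Even this much only tells a future reader what was installed, not how to get it running again, which is rather the point.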

However it is rather ironic for ‘sceptics’ to insist based on no evidence at all that there is a transparency problem with implications so serious that it requires dropping everything and turning the world upside down to address it, while they themselves ignore the far more compelling evidence for the need to do something about greenhouse gas emissions.

Perhaps in 30 or 60 years, when the ‘sceptics’ have proved beyond any doubt that releasing code and data to the level of detail they demand is necessary, worth the costs, and not harmful, we can do it then. Meanwhile, maybe releasing ‘the code’ is helpful or maybe it just propagates error. Maybe archiving absolutely every detail of every analysis such that it can be reproduced 50 years hence is worth the onerous costs, maybe it isn’t. And meanwhile, you now have the code for 6 or so implementations which you haven’t investigated yet, but you’re stuck on the one from CRU. What about the others? Why haven’t you got cracking on them? Don’t like the ones in Fortran - try the one in Perl. Don’t like that the Fortran ones don’t have unit tests etc. (what were you expecting, poetry?) - try the Python implementation.

And when it comes to reproducibility of results years hence, you can also think about this: how would you propose to reproduce the results of Gilbert N. Plass, which used early computer models in the 1950s?

Good luck getting those to run on your Mac.