Thursday, February 11, 2016

Reproducibility in Computer Science

There has been a lot of discussion lately about reproducibility in the sciences, especially the social sciences. The result that garnered the most attention was the Nosek study, where the authors tried to reproduce the results of 98 studies published in psychology journals. They found that they were able to reproduce only about 40% of the published results.

Now it's computer science's turn to go under the spotlight. I think this is good, for a number of reasons:

  1. In computer science there is a lot of emphasis placed on annual conferences, as opposed to refereed journal articles. Yes, these conferences are usually refereed, but the reports are generally done rather quickly and there is little time for revision. This emphasis has the unfortunate consequence that computer science papers are often written quite hastily, a week or less before the deadline, in order to make it into the "important" conferences in your area.

  2. These conferences are typically quite selective and accept only 10% to 30% of all submissions. So there is pressure to hype your results and sometimes to claim a little more than you actually got done. (You can rationalize it by saying you'll get it done by the time the conference presentation rolls around.)

    (In contrast, the big conferences in mathematics are often "take-anything" affairs. At the American Mathematical Society meetings, pretty much anyone can present a paper; they sometimes have a special session for the papers that are whispered to be junk or crackpot stuff. Little prestige is associated with conferences in mathematics; the main thing is to publish in journals, which have a longer time frame suitable for good preparation and reflection.)

  3. A lot of research in computer science, especially the "systems" area, seems pretty junky to me. It always amazes me that in some cases you can get a Ph.D. just for writing some code, or, even worse, just modifying a previous graduate student's code.

  4. Computer science is one of the areas where reproducibility should (in theory) be the easiest. Usually, no complicated lab setups or multimillion-dollar equipment are needed. You don't need to recruit test subjects or pass through ethics reviews. All you have to do is compile something and run it!

  5. A lot of computer science research is done using public funds, and as a prerequisite for obtaining those funds, researchers agree to share their code and data with others. That kind of sharing should be routine in all the sciences.

Now my old friend and colleague Christian Collberg (who has one of the coolest web pages I've ever seen) has taken up the cudgel of reproducibility in computer science. In a paper to appear in the March 2016 issue of Communications of the ACM, Collberg and co-authors Todd Proebsting and Alex M. Warren relate their experiences in (1) trying to obtain the code described in papers and then (2) trying to compile and run it. They did not attempt to reproduce the results in papers, just the very basics of compiling and running. They did this for 402 (!) papers from recent issues of major conferences and journals.

The results are pretty sad. Many authors had e-mail addresses that failed (probably because they moved on to other institutions or left academia). Many simply did not reply to the request for code (in some cases Collberg filed freedom of information requests to try to get it). Among the authors who did reply, the code in many cases failed for a number of different reasons, like important files missing. Ultimately, only about half of all papers had code that passed the very basic tests of compiling and running.

This is going to be a blockbuster result when it comes out next month. For a preview, you can look at a technical report describing their results. And don't forget to look at the appendices, where Collberg describes his ultimately unsuccessful attempt to get code for a system that interested him.

Now it's true that there are many reasons (which Collberg et al. detail) why this state of affairs exists. Many software papers are written by teams, including graduate students who come and go. Sometimes the code is not adequately archived, and disk crashes can result in losses. Sometimes the current system has been greatly modified from what's in the paper, and nobody saved the old one. Sometimes systems ran under older operating systems but not the new ones. Sometimes code is "fragile" and not suitable for distribution without a great deal of extra work which the authors don't want to do.

So in their recommendations Collberg et al. don't demand that every such paper provide working code when it is submitted. Instead, they suggest a much more modest goal: that at the time of submission to conferences and journals, authors mention what the state of their code is. More precisely, they advocate that "every article be required to specify the level of reproducibility a reader or reviewer should expect". This information can include a permanent e-mail contact (probably of the senior researcher), a website from which the code can be downloaded (if that is envisioned), the degree to which the code is proprietary, availability of benchmarks, and so forth.

Collberg tells me that as a result of his paper, he is now "the most hated man in computer science". That is not the way it should be. His suggestions are well-thought-out and reasonable. They should be adopted right away.

P. S. Ironically, some folks at Brown are now attempting to reproduce Collberg's study. Many people take issue with specific evaluations in the paper. I hope this doesn't detract from Collberg's recommendations.

Tuesday, February 09, 2016

More Silly Philosopher Tricks

Here's a review of four books about science in the New York Times. You already know the review is going to be shallow and uninformed because it is written not by a scientist or even a science writer, but by James Ryerson. Ryerson is more interested in philosophy and law than science; he has an undergraduate degree from Amherst, and apparently no advanced scientific training.

In the review he discusses a new book by James W. Jones entitled Can Science Explain Religion? and says,

"If presented with this argument, Jones imagines, we would surely make several objections: that the origin of a belief entails nothing about its truth or falsity (if you learn that the earth is round from your drunk uncle, that doesn’t mean it’s not)..."

Now I can't tell if this is Jones or Ryerson speaking, but either way it illustrates the difference between the way philosophers think and the way everyone else thinks. For normal people who live in a physical world, where conclusions are nearly always based on partial information, the origin of a belief does and should impact your evaluation of its truth.

For example, I am being perfectly reasonable when I have a priori doubts about anything that Ted Cruz says, because of his established record for lying: only 20% of his statements were evaluated as "true" or "mostly true". Is it logically possible that Cruz could tell the truth? Sure. It's also logically possible that monkeys could fly out of James Ryerson's ass, but I wouldn't be required to believe it if he said they did.

For non-philosophers, when we evaluate statements, things like the speaker's reputation for veracity are important, as are evidence, the Dunning-Kruger effect, the funding of the person making the statement, and so forth. Logic alone does not rule in an uncertain world; in the real world these things matter. So when a religion professor and Episcopal priest like Jones writes a book about science, I am not particularly optimistic he will have anything interesting to say. And I can be pretty confident I know his biases ahead of time. The same goes for staff editors of the New York Times without scientific training.

Friday, February 05, 2016

3.37 Degrees of Separation

This is pretty interesting: Facebook has a tool that estimates the average number of intermediate people needed to link you, via the shortest path, to anyone else on Facebook. Mine is 3.37, which means the average path length (number of links) to me is 4.37, or that the average number of people in a shortest chain connecting others with me (including me and the person at the end) is 5.37.

What's yours?
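To make the three numbers concrete (intermediates, links, and people in the chain), here is a minimal Python sketch on a made-up five-person graph. The names, the graph, and the function `links_from` are all hypothetical, and it uses exact breadth-first search, which is only feasible on a toy example; it is not how Facebook computes its estimate.

```python
from collections import deque

# Hypothetical toy friendship graph (invented for illustration, not real data).
FRIENDS = {
    "me": ["ann", "bob"],
    "ann": ["me", "cat"],
    "bob": ["me", "cat"],
    "cat": ["ann", "bob", "dan"],
    "dan": ["cat"],
}

def links_from(graph, source):
    """Breadth-first search: number of links on a shortest path to each person."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist

dist = links_from(FRIENDS, "me")
others = [d for person, d in dist.items() if person != "me"]

avg_links = sum(others) / len(others)   # average shortest-path length in links
avg_intermediates = avg_links - 1       # people strictly between the two endpoints
avg_people = avg_links + 1              # everyone in the chain, endpoints included

print(avg_intermediates, avg_links, avg_people)
```

On any graph the three averages differ by exactly one at each step, which is the same relationship as 3.37, 4.37, and 5.37 above.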

An interesting aspect of this is that they use the Flajolet-Martin algorithm to estimate the path length. The paper of Flajolet-Martin deals with a certain correction factor φ, which is defined as follows: \(\varphi = 2^{-1/2} e^{\gamma} \alpha^{-1}\), where γ = 0.57721... is Euler's constant and α is the constant \(\alpha = \prod_{n \ge 1} \left(\frac{2n}{2n+1}\right)^{(-1)^{t(n)}}\), where t(n) is the Thue-Morse sequence, the sequence that counts the parity of the number of 1's in the binary expansion of n.

The Thue-Morse sequence has long been a favorite of mine, and Allouche and I wrote a survey paper about it some time ago, where we mentioned the Flajolet-Martin formula. The Thue-Morse sequence comes up in many different areas of mathematics and computer science. And we also wrote a paper about a constant very similar to α: it is \(\prod_{n \ge 0} \left(\frac{2n+1}{2n+2}\right)^{(-1)^{t(n)}}\). Believe it or not, it is possible to evaluate this constant in closed form: it is equal to \(\sqrt{2}/2\)!

By contrast, nobody knows a similar simple evaluation for α. In fact, I have offered $50 for a proof that α is irrational or transcendental.
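For readers who want to play with these products, here is a small numerical sketch in Python. The function names are my own (hypothetical), and the truncation point N is arbitrary. It assumes the leading factor in φ is 2^(-1/2), which makes φ come out near 0.7735, the value quoted for the Flajolet-Martin correction factor. Floating-point partial products are evidence, of course, not proof.

```python
import math

def thue_morse(n):
    """t(n): parity of the number of 1's in the binary expansion of n."""
    return bin(n).count("1") % 2

def signed_product(term, start, stop):
    """Partial product of term(n) ** ((-1) ** t(n)) for start <= n < stop."""
    prod = 1.0
    for n in range(start, stop):
        factor = term(n)
        # Exponent is +1 when t(n) = 0 and -1 when t(n) = 1.
        prod *= factor if thue_morse(n) == 0 else 1.0 / factor
    return prod

N = 2 ** 14  # arbitrary truncation point

# The product with the closed form: should approach sqrt(2)/2.
closed_form_product = signed_product(lambda n: (2 * n + 1) / (2 * n + 2), 0, N)

# The constant alpha, for which no simple closed form is known.
alpha = signed_product(lambda n: (2 * n) / (2 * n + 1), 1, N)

# Correction factor phi, assuming the leading factor is 2 ** (-1/2).
EULER_GAMMA = 0.5772156649015329
phi = 2 ** -0.5 * math.exp(EULER_GAMMA) / alpha

print(closed_form_product, math.sqrt(2) / 2)
print(alpha, phi)
```

Grouping the terms for n = 2m and n = 2m+1, which always have opposite Thue-Morse signs, shows that the truncation error shrinks roughly like 1/N, so a few thousand terms already give several correct digits.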

Friday, January 29, 2016

Yet More Bad Creationist Mathematics

It's not just biology that creationists resolutely refuse to understand. Their willful ignorance extends to many other fields. Take mathematics, for example.

At the creationist blog Uncommon Descent we have longtime columnist "kairosfocus" (Gordon Mullings) claiming that "a set of integers that spans to infinity will have members that are transfinite", showing that he doesn't understand even the most basic things about the natural numbers.

And we also have Jonathan Bartlett asking "can you develop an effective procedure for checking proofs?" and answering "The answer is, strangely, no."

Actually the answer is "yes". A mathematical proof can indeed be checked and easily so (in principle). This has nothing to do with the statement of Bartlett that follows it: "It turns out that there are true facts that cannot be proved via mechanical means." Yes, that's so; but it has nothing to do with an effective procedure for checking proofs. Such a procedure would simply verify that each line follows from the previous one by an application of the axioms.

If a statement S has a proof, there is a semi-algorithm that will even produce the proof: simply enumerate all proofs in order of length and check whether each one is a proof of S. The problem arises when a true statement simply does not have a proof. It has nothing to do with checking a given proof.
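Here is a minimal Python sketch of both ideas, under a loud assumption: instead of a real axiom system, it uses Hofstadter's toy MIU system (one axiom, "MI", and four rewriting rules), which is my choice for illustration and not anything Bartlett or I discussed above. The function names are hypothetical. `check_proof` is the effective procedure: it verifies that each line follows from the previous one. `search_proof` is the semi-algorithm: it enumerates derivations in order of length and runs the checker on each.

```python
from collections import deque

AXIOM = "MI"  # the single axiom of the toy MIU system

def successors(s):
    """All strings that follow from s by one application of a rule."""
    out = set()
    if s.endswith("I"):
        out.add(s + "U")                          # rule 1: xI  -> xIU
    if s.startswith("M"):
        out.add(s + s[1:])                        # rule 2: Mx  -> Mxx
    for i in range(len(s) - 2):
        if s[i:i + 3] == "III":
            out.add(s[:i] + "U" + s[i + 3:])      # rule 3: III -> U
    for i in range(len(s) - 1):
        if s[i:i + 2] == "UU":
            out.add(s[:i] + s[i + 2:])            # rule 4: UU  -> (nothing)
    return out

def check_proof(lines, statement):
    """Effective proof checking: always halts with True or False."""
    if not lines or lines[0] != AXIOM or lines[-1] != statement:
        return False
    return all(b in successors(a) for a, b in zip(lines, lines[1:]))

def search_proof(statement, max_lines=None):
    """Enumerate derivations in order of length and check each one.

    With max_lines=None this is a semi-algorithm: it returns a proof if
    one exists, and otherwise never halts.
    """
    queue = deque([[AXIOM]])
    seen = {AXIOM}
    while queue:
        proof = queue.popleft()
        if check_proof(proof, statement):
            return proof
        if max_lines is not None and len(proof) >= max_lines:
            continue
        for nxt in sorted(successors(proof[-1])):
            if nxt not in seen:   # a shorter derivation of nxt is already queued
                seen.add(nxt)
                queue.append(proof + [nxt])
    return None

print(search_proof("MUI"))
print(search_proof("MU", max_lines=5))
```

The string "MU" has no derivation in this system, so `search_proof("MU")` without a bound would run forever, while `check_proof` on any proposed derivation of it still answers instantly. That is exactly the distinction: checking a given proof is easy; the trouble only starts when no proof exists.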

Can't creationists even get the most basic things correct?

Saturday, January 09, 2016

Our Car's Fibonacci Odometer

Been waiting for this for 11 years, and it finally happened!

Saturday, January 02, 2016

You Don't Have to Be a Sociopath to Become a Theist....

...but apparently it helps, at least judging from this video.

Several things came to mind when I watched this. First, if David Wood's story is largely true, then he's clearly a sociopath and why should we believe anything he says? He could just be manipulating us for some sick purpose. On the other hand, if his story is largely false, then he's clearly a pathological liar, and why should we believe anything he says? Of course, his story could be partly true and partly false (my guess), but then the same conclusion holds.

Second is how persuasive even a terrible design argument like the one proposed here can be for a diseased or weak mind. Don't bother studying any mathematics, or computer science, or biology. Just assert that there is no evidence for the scientific world view, and voilà!

Third is what an ignorant bastard the guy is for someone who thought he was the greatest person in the world. He thinks shingles are caused by vitamin deficiency, fer chrissake!

Oh well. I am comforted by the fact that there are lots of decent people who are religionists. They're not all sociopaths like David Wood.

Monday, December 21, 2015

10th Blogiversary!

Ten years ago, this blog, Recursivity, was born.

I've had a lot of fun with it, even though I never really had very much time to devote to it. A thousand posts in ten years sounds like a lot, but I wish I could have written a thousand more.

Generally speaking, my readers have been great. In ten years, I think I only had to ban two or three commenters, including one Holocaust denier. Thank you to everyone who read what I had to say, and even more thanks to those who took the time to comment.

Here are 25 of my favorite posts from the last ten years:

  1. Why We Never Lied to Our Kids About Santa: my absolute favorite, and still appropriate. You can criticize atheism and religion, but if you really want to get a reaction, just criticize the myth of Santa Claus.
  2. Robert J. Marks II refuses to answer a simple question: still waiting, more than a year later.
  3. Hell would be having to listen to Francis Spufford: Damn, he was boring.
  4. By the Usual Compactness Argument: for mathematicians only.
  5. Ten Common LaTeX Errors
  6. I defend a conservative politician's right to speak on campus
  7. Science books have errata. Holy books don't
  8. No Formula for the Prime Numbers?: Debunking a common assertion.
  9. In Memory of Sheng Yu (1950-2012): my colleague - I still miss him.
  10. Another Fake Magnet Man Scams AP
  11. William Lane Craig Does Mathematics
  12. Why Do William Lane Craig's Views Merit Respect?: Nobody gave a good answer, by the way!
  13. Stephen Meyer's Bogus Information Theory
  14. Religion Makes Smart People Stupid
  15. Test Your Knowledge of Information Theory
  16. David Berlinski, King of Poseurs
  17. Graeme MacQueen at the 9/11 Denier Evening
  18. Mathematics in a Jack Reacher novel
  19. The Prime Game: This appeared in my 2008 textbook, too.
  20. Debunking Crystal Healing
  21. Nancy Pearcey, The Creationists' Miss Information
  22. Academic Vanity Scams
  23. Time Travel: my second favorite, which nobody seemed to like that much.
  24. Janis Ian Demo Tape: the best part was that Janis Ian herself stopped by to comment!
  25. The Subversive Skepticism of Scooby Doo: my third favorite.

Happy Holidays to everyone, and may 2016 be a great year for you.