Thursday, September 25, 2014

Barry Arrington's Silly Misunderstanding


Ever since the ID creationist blog Uncommon Descent was taken over by Barry Arrington, it's been a first-class show of the irremediable arrogance and ignorance of creationists. I don't post there because Arrington routinely bans dissenters, but I do sometimes enjoy the show.

I particularly enjoyed this post because it touches on the subject of my Winter 2015 course here at the University of Waterloo. Arrington displays two strings of symbols and says "the second string is not a group of random letters because it is highly complex and also conforms to a specification". By implication he thinks the first string is a group of random letters, or at the very least, more random than the second.

Here are the two strings in question, cut-and-pasted from Arrington's post:

#1:

OipaFJPSDIOVJN;XDLVMK:DOIFHw;ZD
VZX;Vxsd;ijdgiojadoidfaf;asdfj;asdj[ije888
Sdf;dj;Zsjvo;ai;divn;vkn;dfasdo;gfijSd;fiojsa
dfviojasdgviojao’gijSd’gvijsdsd;ja;dfksdasd
XKLZVsda2398R3495687OipaFJPSDIOVJN
;XDLVMK:DOIFHw;ZDVZX;Vxsd;ijdgiojadoi
Sdf;dj;Zsjvo;ai;divn;vkn;dfasdo;gfijSd;fiojsadfvi
ojasdgviojao’gijSd’gvijssdv.kasd994834234908u
XKLZVsda2398R34956873ACKLVJD;asdkjad
Sd;fjwepuJWEPFIhfasd;asdjf;asdfj;adfjasd;ifj
;asdjaiojaijeriJADOAJSD;FLVJASD;FJASDF;
DOAD;ADFJAdkdkas;489468503-202395ui34

#2:

To be, or not to be, that is the question—
Whether ’tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing, end them? To die, to sleep—
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? ‘Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there’s the rub,
For in that sleep of death, what dreams may come,
When we have shuffled off this mortal coil,

Needless to say, Arrington -- a CPA and lawyer who apparently has no advanced training in the mathematics involved -- doesn't specify what he means by "group of random letters". I think a reasonable interpretation would be that he is imagining that each message is generated by a stochastic process where each letter is generated independently, with uniform probability, from some finite universe of symbols.

Even with just a cursory inspection of the two strings, we see that neither one of them is likely to be "random" in this sense. We immediately see this about the second string because the set of reasonable English texts is quite small among the set of all possible strings. But we also see the same thing about the first because (for example) the trigram "asd" occurs much more often than one could reasonably expect for a random string. Looking at a keyboard, it's a reasonable interpretation that somebody, probably Arrington, dragged his hands repeatedly over the keyboard in a fashion he thought was "random" -- but evidently is not. (It is much harder to generate random strings than most untrained people think.)
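The trigram observation can be checked with a few lines of code. This is only a sketch: the function names are mine, the sample is a single line of string #1 as pasted above, and the alphabet size of 64 is an assumption chosen for illustration (Arrington never says what his universe of symbols is).

```python
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count every overlapping run of three consecutive characters."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def expected_count(length: int, alphabet_size: int) -> float:
    """Expected occurrences of one fixed trigram in a string whose
    characters are drawn independently and uniformly."""
    positions = max(length - 2, 0)
    return positions / alphabet_size ** 3


# One line of string #1, copied from the post.
sample = "Sd;fjwepuJWEPFIhfasd;asdjf;asdfj;adfjasd;ifj"

observed = trigram_counts(sample)["asd"]
# ASSUMPTION: 64 symbols in the universe; the true size is unknown.
expected = expected_count(len(sample), 64)
print(f"observed 'asd': {observed}, expected under uniform model: {expected:.6f}")
```

Under any plausible alphabet size the expected count of a fixed trigram in so short a string is far below one, so repeated occurrences are strong evidence against the uniform model.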

If we want to test this in a quantitative sense, we can use a lossless compression scheme such as gzip, an implementation of Lempel-Ziv. A truly random file will, with very high probability, not be significantly compressible. So a good test of randomness is simply to attempt to compress the file and see if it is roughly the same size as the original. The larger the produced file, the more random the original string was.
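Here is a minimal sketch of that test. It uses Python's zlib module, which implements the same DEFLATE algorithm as gzip but with a smaller header, so the byte counts will not match the gzip command exactly. The two inputs are my own illustrations, not the strings from Arrington's post.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the DEFLATE-compressed data at maximum effort."""
    return len(zlib.compress(data, 9))


# 4096 bytes from the operating system's random source.
random_bytes = os.urandom(4096)
# 4096 bytes of an obviously patterned string.
repetitive = b"asd;" * 1024

for name, data in [("random", random_bytes), ("repetitive", repetitive)]:
    print(f"{name}: {len(data)} bytes -> {compressed_size(data)} bytes")
```

The random input should come out at about its original size (slightly larger, because of format overhead), while the patterned input shrinks to a tiny fraction of it.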

Here are the results. String #1 is of length 502, as counted by the "wc" program. (This also counts characters like the carriage returns separating the lines.) String #2 is of length 545.

Using gzip on Darwin OS on my Mac, I get the following results: string #1 compresses to a file of size 308 and string #2 compresses to a file of size 367. String #2's compressed version is bigger, and therefore string #2 is more random than string #1: exactly the opposite of what Arrington implied!

I suppose one could argue that the right measure of "randomness" is not the size of the compressed file, but rather the difference in size between the compressed file and the original. The smaller this difference is, the more random the original string was. So let's do that test, too. I find that for string #1, this difference is 502-308 = 194, and for string #2, this difference is 545-367 = 178. Again, for string #2 this difference is smaller and hence again string #2 is more random than string #1.

Finally, one could argue that we're comparing apples and oranges because the strings aren't the same size. Maybe we should compute the percentage of compression achieved. For string #1 this percentage is 194/502, or 38.6%. For string #2 this percentage is 178/545, or 32.7%. String #2 was compressed less in terms of percentage and hence once again is more random than string #1.
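For anyone who wants to check the arithmetic, the three measures can be recomputed from the four sizes reported above. This is just a sketch; the function name is mine.

```python
def measures(original: int, compressed: int) -> tuple[int, int, float]:
    """Return (compressed size, bytes saved, percentage saved)."""
    saved = original - compressed
    return compressed, saved, 100 * saved / original


# (original size, gzip-compressed size) as reported in the post.
sizes = {"#1": (502, 308), "#2": (545, 367)}

for name, (original, compressed) in sizes.items():
    size, saved, percent = measures(original, compressed)
    print(f"string {name}: compressed={size}, saved={saved}, saved%={percent:.1f}")
```

On all three measures (larger compressed size, fewer bytes saved, smaller percentage saved) string #2 comes out as the less compressible of the two.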

Barry's implications have failed spectacularly in every measure I tried.

Ultimately, the answer is that it is completely reasonable to believe that neither of Barry's two strings is "random" in the sense of likely to have been generated randomly and uniformly from a given universe of symbols. A truly random string would be very hard to compress. (Warning: if you try to do this with gzip make sure you use the entire alphabet of symbols available to you; gzip is quite clever if your universe is smaller.)
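The warning about alphabet size is easy to demonstrate. In the sketch below (my own illustration, using zlib's DEFLATE rather than the gzip command, and a seeded pseudo-random generator so the run is repeatable), both inputs are drawn independently and uniformly, but one uses all 256 byte values and the other only four symbols.

```python
import random
import zlib


def ratio(data: bytes) -> float:
    """Compressed size divided by original size."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(2014)  # fixed seed, chosen arbitrarily
n = 20000

# Uniform over the full alphabet of 256 byte values.
full = bytes(rng.randrange(256) for _ in range(n))
# Uniform over a four-symbol alphabet.
small = bytes(rng.choice(b"acgt") for _ in range(n))

print(f"256 symbols: ratio {ratio(full):.3f}")
print(f"  4 symbols: ratio {ratio(small):.3f}")
```

Each symbol from a four-letter alphabet carries only 2 bits of the 8 it occupies, so the compressor can shrink that input substantially even though it is perfectly "random" over its own universe.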

By the way, I should point out that Barry's "conforms to a specification" is the usual ID creationist nonsense. He doesn't even understand Dembski's criterion (not surprising, since Dembski stated it so obscurely). String #2 can be said to "conform" to many, many different specifications: English text, English text written by Shakespeare, messages of length less than 545, and so forth. But the same can be said for string #1. We addressed this in detail in our long paper published in Synthese, but it seems most ID creationists haven't read it. For one thing, it's not good enough to assert just "specification"; even by Dembski's own claims, one must determine that the specification is "independent" and one must compute the size of the space of strings that conforms to the specification. For Dembski, it's not the probability of the string being generated that is of concern; it's the relative measures of the universe of strings and the strings matching the specification that matter! Most ID creationists don't understand this basic point.

Elsewhere, Arrington says he thinks string #1 is more complex than string #2 (more precisely he says the "thesis ... that the first string is less complex than the second string ... is indefensible").

Maybe Barry said the exact opposite of what he meant; his writing is so incoherent that it wouldn't surprise me. But his statement, as given, is wrong again. For mathematicians and computer scientists, the complexity of a string can be measured as the size of the optimal compressed version of that string (its Kolmogorov complexity). We don't have a way to compute Kolmogorov complexity in general, so in practice one can use a lossless compression scheme as we did above. The larger the compressed result, the more complex the original string. And the results are clear: string #1 is, as measured by gzip, somewhat less complex than string #2.
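Any lossless compressor gives an upper bound on this kind of complexity (up to an additive constant for the decompressor), and trying several compressors and keeping the smallest output tightens the bound. The sketch below does this with three compressors from Python's standard library; the function name and the choice of compressors are mine, not something used for the figures in this post.

```python
import bz2
import lzma
import zlib


def complexity_upper_bound(data: bytes) -> int:
    """Smallest compressed size over several lossless compressors.

    This is only an upper bound on the compressed description length;
    the true optimum is not computable in general.
    """
    candidates = (
        len(zlib.compress(data, 9)),
        len(bz2.compress(data, 9)),
        len(lzma.compress(data)),
    )
    return min(candidates)


text = b"To be, or not to be, that is the question" * 50
print(len(text), "->", complexity_upper_bound(text))
```

For short inputs the fixed header overhead of each format matters, so comparisons between strings are only fair when the same set of compressors is applied to both.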

ID creationists, as I've noted previously, usually turn the notion of Kolmogorov complexity on its head, pretending that random strings are not complex at all. We made fun of this in our proposal for "specified anti-information" in the long version of our paper refuting Dembski. Oddly enough, some ID creationists have now adopted this proposal as a serious one, although of course they don't cite us.

Finally, one unrelated point: Barry talks about his disillusion when his parents lied to him about the existence of a supernatural figure --- namely, Santa Claus. But he doesn't have enough introspection to understand that the analogy he tries to draw (with "materialist metaphysics") is completely backwards. Surely the right analogy is Santa Claus to Jesus Christ. Both are mythical figures, both are celebrated by parents and taught by them to their children, both supposedly have supernatural powers, both are depicted as wise and good, and both are comforting to small children. The list could go on and on. How un-self-aware does one have to be to miss this?

15 comments:

Unknown said...

I'm reminded of my old ID Challenge which, sadly, never got any serious takers.

OgreMkV said...

BA has the common problem that most IDers share. That is, they confuse information with meaning.

Because the 1st string contains no "information" that he can understand, then it must be less complex than the carefully crafted words of an author.

What he doesn't understand (none of them do) is that this is the entire purpose of cryptography. That is, to change meaningful information into a form that is only meaningful to a minuscule group of people.

I once provided a plain text paragraph and the same paragraph run through an RSA algorithm and tried to have a meaningful discussion about it with IDists. They refused. I guess they knew it was a trap.

Rich Hughes said...

Barry's butthurt is on display in a new post:

http://www.uncommondescent.com/intelligent-design/jeffrey-shallit-design-detector/

Barry is of course a lawyer; math and science are beyond him. Barry claims "to only ban trolls" but fortunately there is a more honest account here:

http://www.antievolution.org/cgi-bin/ikonboard/ikonboard.cgi?s=5424691117d050c9;act=ST;f=14;t=5141

Since Barry took over UD we've had more apologetics and less pretend-science. Creationism 2.0. Well, make that 1.1

William Spearshake said...

I was the commenter who stated that the first string is not more complex than the second. Then he responded with his normal pompousness and arrogance, but never really explained his rationale.

In short, I was wrong simply because I disagreed with him. I might add that this was after being banned, under a different name, for disagreeing with him on another OP.

John Farrell said...

And now comes Dembski's new tome, Being and Communion, which promises (according to the DI's breathless PR machine) to constitute a paradigm shift.

Piotr Gąsiorowski said...

OgreMkV:

They have serious fundamental problems with both information and meaning. My challenge to them (how to detect "meaning" even if you don't know how it arose) remains unanswered.

AllanMiller said...

The first string is remarkably well patterned. Numbers and letters cluster, caps and lower-case also group ... I suspect that there is some kind of electrostatic phenomenon at work. Something Shakespeare never had to contend with; it would have really screwed with his prose.

SELBLOG said...

So how do you measure information in the two strings statistically? Are we to conclude that string one has more meaning (is meaning equal to information at all?).

Jeffrey Shallit said...

Did you even read the post? I described one way to measure information.

"Meaning" is vague and seemingly has little to do with information, in the sense that the word "information" is ordinarily understood by mathematicians and computer scientists.

Diogenes said...

"Using gzip on Darwin OS on my Mac, I get the following results: string #1 compresses to a file of size 308 and string #2 compresses to a file of size 367. String #2's compressed version is bigger and therefore more random than string #1: exactly the opposite of what Arrington implied!"

I rarely write LOL.

But LOL.

AllanMiller said...

Ouch. In the follow-up post Arrington attempts to use 'compressibility' as a distinction:

"whether a given string of text is “specified” is determined by whether the description of the string can be compressed. Take the second group of text as an example. It can be compressed to “first 12 lines of Hamlet’s soliloquy.” This is simply not possible for the first string. The shortest full description of the first string is nothing less than the string itself.

One could "compress" the first string as ... well, "the first string", in like manner.

Jeffrey Shallit said...

Exactly, Allan. He doesn't even know what a legitimate compression consists of.

Tom English said...

Of no real importance here, but for future reference: gzip the standard input to avoid putting the name of the source file in the header of the output.
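For reference, the effect can be seen from Python as well. The gzip format has an optional header field that stores the original file name followed by a zero byte, so a named source costs exactly the length of the name plus one. This is only a sketch: the helper name and the file name "string1.txt" are hypothetical, and the modification time is pinned to 0 so both runs are otherwise identical.

```python
import gzip
import io


def gzip_size(data: bytes, filename: str = "") -> int:
    """Size of the gzip output, optionally recording a file name in the header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf, mtime=0) as f:
        f.write(data)
    return len(buf.getvalue())


data = b"To be, or not to be, that is the question\n"
without_name = gzip_size(data)
with_name = gzip_size(data, "string1.txt")  # hypothetical file name
print(without_name, with_name, with_name - without_name)
```

The overhead is the same for any input, so it does not change which of two files compresses better as long as the file names have the same length.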

Unknown said...

"So how do you measure information in the two strings statistically ? Are we to conclude that string one has more meaning (is meaning equal to information at all ?)."

SELBLOG, Shannon's theory would determine the amount of information by how much each symbol reduces the uncertainty for that symbol among all the possible symbols in the code. At the level of each character, the possible symbols are the letters of the alphabet, space, and punctuation. That constitutes a code.
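At the character level, a rough version of this calculation can be sketched as follows. It estimates each symbol's probability from its frequency in the message itself and treats symbols as independent, which is a simplifying assumption: it ignores the dependence on preceding symbols discussed in the rest of this comment. The function name is my own.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text: str) -> float:
    """Empirical Shannon entropy, H = -sum(p * log2(p)), in bits per symbol."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


for message in ["aaaaaaaa", "abababab", "To be, or not to be"]:
    print(repr(message), round(entropy_bits_per_symbol(message), 3))
```

A message that repeats one symbol has zero entropy under this estimate, and a message that uses k symbols equally often has log2(k) bits per symbol.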

At the level of the words, the level of uncertainty for each symbol (each word) is the receiver's uncertainty for which word will appear next out of the tens of thousands of words in the vocabulary shared between the sender and the receiver. The shared vocabulary of words between sender and receiver is also a code.

For a human receiver, the letters code is not very important. Let us just assume that the human is interested in the code of words. As such, not every possible word in the shared vocabulary is equally probable as a next word, given the words that preceded it.

But that notion of how much uncertainty for which of the possible symbols is next lays the basis for computing the amount of information in the message. By the way, if the receiver is someone who has memorized Hamlet, that message contains no information at all, because it does not affect the uncertainty of which symbols appear after their predecessors.

To wit, if I called you on the phone to tell you that you write a blog called SELBLOG, my message would not reduce your uncertainty in any way. In other words you would not be further informed by my message. Although it seems like just an engineering notion, Shannon would ask what is the minimum number of bits required to send you that message to reduce your uncertainty. Since the reduction is zero, the minimum number of bits required for me to not reduce your uncertainty for the name of your blog is zero.

I would appreciate any comments from anyone as to whether I explained it correctly and well.

Unknown said...

"whether a given string of text is “specified” is determined by whether the description of the string can be compressed. Take the second group of text as an example. It can be compressed to “first 12 lines of Hamlet’s soliloquy.” This is simply not possible for the first string. The shortest full description of the first string is nothing less than the string itself."

From Shannon's theory, the information content in "first 12 lines of Hamlet's soliloquy" would be the same as the actual text of the first 12 lines. It would reduce the uncertainty of the receiver by the same amount.

In fact, if the receiver were a specialized computer program that dealt only with lines of Shakespeare, a code that was something like HA1-12 (for Hamlet, lines 1-12) would also be equivalent. The computer program's uncertainty about which lines of Shakespeare are being specified by the message, out of all possible lines of Shakespeare, would be reduced to zero just like sending the actual lines to you and me. You could probably devise an even more compact code in terms of numbers of bits that could precisely specify a set of lines in a particular play.