Avoiding the Kool-Aid
Jun 10th, 2009
For the past day I’ve been having a frustrating email discussion with a colleague who has drunk from too much of the Computational Linguistics kool-aid. And by “the kool-aid” I mean the notion that nothing is science unless you can measure it, and that anything that’s not science is somehow voodoo.
The particular topic has to do with collective noun agreement patterns in British English. As we all know, the agreement patterns of nouns in English (of any variety) can be complex and confusing. See this Language Log classic for a crash course in the issues. We also know that British English is somewhat more complex on this front than American English in that it generally allows plural agreement with singular nouns denoting an entity comprised of individuals in those cases where the speaker is emphasizing the members of the group over the group itself. So, to take an example from the Economist’s Style Guide - you can say things like “the preceding generation are all dead.” What allows it in this case is that the dying was done by all the individuals in the generation individually, and that the generation as a whole can’t be dead until all of its members are. Since the emphasis here is on the members more than the group, plural agreement is allowed. By the same token, the Economist’s style poo-bahs seem to think that “the me generation has run its course” should take singular agreement - since the emphasis is on the group as a political/cultural entity rather than as a collection of individuals. All of which is to say that British English encodes in its syntax a semantic distinction that American English doesn’t necessarily. The cliche is that British English sacrifices syntactic consistency for semantic accuracy or richness of expression, where American English prefers grammatical consistency and leaves it up to context to sort out the emphasis.
Of course, since this is language, the truth is that neither variety is completely consistent here. Americans do sometimes use plural agreement with collective entity nouns - we just don’t do it nearly as often as the British do (indeed, in most cases it sounds weird). And the British, for their part, disallow plural agreement in some cases where consistent application of the “semantic emphasis” rule would prescribe it. For example - you can’t say “England have just voted resoundingly to send Labour packing” (I’m writing this in early summer 2009!) - it’s “England has.” Period.
In fact, the dispute I’m having with my (equally American, though one frequently suspects he wishes he weren’t) colleague is about my suspicion that the allowance for plural agreement on collective entities is more permissive in some domains than others. In particular, I have the impression that I hear it a lot more in British sports commentary than I do elsewhere. Granted, I’m not a native British English speaker, but I do watch a lot of British TV and read a lot of British media and so come into contact with their speech patterns a bit more than the average American. And my gut feeling on the matter is that they’re generally almost as consistent about using singular agreement as we are - save for these distinguished domains.
I got some pretty unexpectedly straightforward confirmation of this from Brian Micklethwait’s blog - where in an entry about cricket he writes the following:
Well, England have survived. Had I put ‘England has survived,’ you’d know that this was about something important, like the recent elections we’ve been having, but ‘have’ means it can only be sport, and indeed it is.
In other words, this native speaker (one presumes, with a name like “Micklethwait”) of British English is reporting that he can sense the topic domain based on whether “England” gets singular or plural agreement, and he further seems to expect his readership to share this intuition. Pleased, I sent the quote in an email to said colleague - who replied with the snarky response that “data is not the plural of anecdote.”
Well, no, it’s not. Actually, “data” isn’t the plural of anything in American English as far as I can tell. The link goes to a previous post where I admit that in grad school I’ve picked up the pretentious “the data are consistent with x” a bit. But the decisive point for me is that I don’t have “datum,” which is supposedly the singular form of “data.” I can’t even say that word with a straight face, and I would never actually use it in a sentence (e.g. “this datum says one thing, whereas that one says another.” GOOD GOD ALMIGHTY NO!). No - I talk about “data points” and “pieces of data” when I want “data” to be just one. Which is consistent with other mass nouns - “a line of coke,” for example.
But alright, nitpicking aside, my colleague is, of course, right that no single testimony from any single Englishman constitutes scientific certainty that the collective entity plural is more prevalent in some domains than others. But this is what I mean about how annoying this “if I can’t measure it it isn’t science” obsession among computational linguists is. I wasn’t presenting it as science! It was just a nice bit of corroboratory evidence that I found on the internets, that’s all! Just as no behaviorist ever seriously measured the sweat on his palms or the increase in his heart rate to determine whether he was really in love with his wife, surely we linguists can satisfy our natural curiosity about language outside the laboratory without always and everywhere having to prove the case to scientific certainty? I wouldn’t attempt to publish a peer-reviewed paper on the basis of Mr. Micklethwait’s testimony, no - but it’s enough to answer what personal questions I have about this phenomenon for myself.
When I said as much to my colleague, suggesting that we use the word “evidence” rather than the more domain-specific “data,” he replied that he still disagreed. Which is, frankly, just A LIE. Now, granted, as an aspiring dialectometrist, he has a certain justifiable prejudice here. Since he measures (erm, aspires to measure) dialect differences for a living, naturally he doesn’t want to encourage sloppy intuition-based discussion of dialects. Notwithstanding - it’s just A LIE to say that none of his opinions whatever about how people use language are informed by anything outside of scientific data collection! In fact, MOST of his opinions about language use - up to and including the way he himself uses his own language - are formed outside of controlled studies.
Now, of course it’s possible that we were simply talking past each other, and he was thinking of the exchange as an exchange between language professionals rather than the more informal way in which I intended it. All the same, it illustrates something very important that’s wrong with the way the kool-aid drinkers view the world. The fact of the matter is that people consult their linguistic intuitions every day in order to make sense of the input they receive. Even if Mr. Micklethwait is some kind of bizarre aberration who has acquired the “England have = sport” vs. “England has = politics” distinction completely by accident and totally at odds with how most Britons use their language, the fact remains that he has intuitions about this (possibly phantom) distinction and makes interpretive decisions on the basis of them. Which is what we all do every time we hear or read any kind of linguistic utterance, in fact. Intuitions ARE linguistic data. Not only are they valid linguistic data, they are the PRIMARY linguistic data. Corpus studies are approximations to actual use. They have certain efficiency advantages for certain kinds of studies, yes. They also have the advantage of yielding quantifiable results, which is a sine qua non for certain kinds of applications, yes (especially when comparison between two approaches to an engineering problem is what’s at issue). So corpus studies are indispensable - no arguments from me. But the fact of their being “indispensable” does not allow one to conclude - as so many do - that they are the whole of the law. They are not. At best, they are a convenient tool. But ultimately, corpora are NOT what we Linguists study! We study language use, and language use is an intuition made concrete as an utterance. Intuitions ARE the data.
My colleague can well insist that we want more than just Mr. Micklethwait’s word on this, of course. But how many does it really take to draw a conclusion here? Notice that Mr. Micklethwait isn’t just insisting that he has this intuition, he clearly expects his audience to share it. And since he is, as far as I can tell, a successful speaker of British English - that is, he communicates daily with other Britons, more often than not understanding and being understood - his intuitions about what intuitions other speakers have can’t be too far off the mark. Linguistic communication, after all, involves arranging (some might say “merging,” heh heh) and inflecting lexical items in accordance with a shared communication system. We put the words in this order with this inflection and this intonation because we expect that the person we’re talking to, who has internalized the same rules, will decode them in the obvious and expected way. How else do the kool-aid drinkers think this process works?
We’ve been holding a summer version of the Parsing Reading Group that is rapidly devolving into a Machine Translation Reading Group instead. Since all the reading material consists of papers by kool-aid drinkers, we get a first-hand look at a lot of their follies. For example, one member is fond of repeating an account of someone (Franz Och?) claiming at a conference that because his machine translation system outscored some human translators on the BLEU score, it has “super-human” performance. Obviously, this is laughable (and maybe he meant it as a joke). But while the papers don’t quite take it to that extreme, they still labor under the delusion in a lot of cases that they’re getting closer to human-like performance. They’re not. Even a shallow glance at a statistical machine translation system should be enough to convince someone that whatever it’s doing, it’s NOT mimicking the process by which human bilinguals translate between languages. A quick look at the errors should lay any such vanity to rest. Statistical systems make mistakes that humans simply don’t. Under ANY circumstances. But it’s more than that. The method by which they arrive at even their correct results can’t be entirely right either. Sure, there’s something right about it, and some aspects of what they do must be correct (for example, taking a word in the source language and looking up a translation for it in a lexicon). But humans, when translating, almost certainly don’t form a list of hundreds of possible translations that their mental corpus of utterances heard over their lifetime would suggest are valid and then, taking context and frequency into account, score them and pick the best fit! Bullshit!
What they almost certainly do is exactly what the hated Noam Chomsky would suggest: they have some kind of super-abstract starting point which selects an array of lexical items in the target language on the basis of semantic intention, and they then apply the rules of that language to assemble them into a meaningful utterance. Just so.
So sure, if you’re a dialectometrist and interested in quantifying the differences between dialects, then Mr. Micklethwait’s testimony alone will not be sufficient. But this is because it is incompatible with your quantitative goal. It is NOT because Mr. Micklethwait is by himself an inadequate example of how the dialect is spoken. Quite the contrary - as someone who regularly uses the dialect himself with great success he is a perfectly adequate example of how it is spoken. Generally, of course, we like to check with a few more native speakers to make sure - just because we know that there are sometimes individual idiosyncrasies. But the suggestion that I don’t know anything about the dialect until I’ve run a controlled statistical study over a corpus of a large number of its speakers is bogus and deceptive. That corpora are the only available way to study certain kinds of phenomena does not in any way imply that they are the only way to study any or even most kinds of language phenomena. It comes down to this: are you more interested in your self-image as a hardened scientist, or in actually learning something about natural language? Sadly, for many computational linguists, it seems to be the former.