on Jun 7th, 2008: Now That’s Good Advice!

Admittedly not from a linguistic source, but cool nonetheless:

The plural of anecdote is not data.

Now if only we could get Sociolinguists in the habit of repeating that to themselves with their morning coffee every day…

on Jun 3rd, 2008: What’s an Optimal Dictionary?

Mark Changizi is something of a sensation after his recent SciAm appearances, so I picked up one of his papers - the one about economical hierarchy organization in dictionaries.

This is a really interesting bit of work - one of those “why didn’t anyone consider this before?” kinds of things. Of course, people have thought of it before, in terms of efficiencies in semantic hierarchies, but Changizi is the first I’m aware of to consider a dictionary as a language optimization problem.

I remember that when I first fell in love with Computational Linguistics, it was because of research exactly like this. I came to IU to study Cognitive Science, something like Psycholinguistics, actually. I sat in on a CL class when I really should’ve been taking Andy Clark’s Philosophical Foundations seminar (the two classes met at the same time) - his last at IU, as it turns out. But I’m glad I made the rash decision I did - because I got exposed to Zipf’s Law and Shannon’s Information Theory, and WOW! Something about the meta-scientific nature of it all really caught me. It was like applied philosophy. We weren’t exactly doing “down and dirty” empirical research, but neither were we playing with building blocks spawned by Chomsky’s imagination. The idea that there were mathematical limits on language production seems obvious in retrospect, but at the time it was a kind of revelation.

Changizi takes me back to some of that.

The question is this: what would an “optimal dictionary” look like? We can speculate that it would have two characteristics.

First, it would involve an ideal tradeoff between expressive power and compression. That is, it would need to be as compact (in terms of the number of items needed to define all its entries - think of it as “total size”) as possible without giving up on complete coverage of the data (i.e. all the words attested in the language). This desideratum has to mostly be studied in terms of hierarchical levels - for two reasons. The first of these is a priori - it’s just in the nature of a “good dictionary definition” that it defines its entry in terms of less-specific, more abstract terms. The second of these is realistic: since there’s no point in setting an upper bound on how many concepts an ideal language needs (it needs as many as people find useful, obviously), and no way to measure how well the concepts in use in a language are covered by the words (no one, that I know of, has a model for “conceptual redundancy” between items that can be tested), the best we can do is hypothesize that the optimal vocabulary will arrange its hierarchy of words in such a way that dictionary coverage is “compact” in the sense of “shortest token-length definitions without sacrificing coverage” mentioned earlier.

Second, it would involve a “strict definitional hierarchy.” That is, there would be a set of “atomic” words which are not defined in terms of any other words (something like the “semantic primes” of Natural Semantic Metalanguage theory), and each level in the hierarchy would draw only from words in the immediately preceding level.

With these assumptions, Changizi is able to lay out some empirically testable conditions for “economy” in dictionaries like the OED. First, it should have an “optimum” number of levels in its hierarchy.

To aid in understanding this concept, consider a binary alphabet - 0 and 1. Obviously there are four possible two-letter words we can make with this: 00, 01, 10, and 11. By the same reasoning, there are 16 four-letter words we can make (I won’t spell them all out - exercise for the reader and all that). But what if we have an intermediate level in which all four of the two-letter words are represented by letters? I.e. a = 00, b = 01, c = 10, d = 11. Well, we can still get our target output of 16 possible words with these letters, but the words themselves only require two characters of storage space rather than four (because they are “ab” instead of “0001,” etc.). So to store 16 “concepts” in a “dictionary” that only allows definitions in terms of strings of the original two “semantic primes” (0 and 1), we need 64 (16 strings of four letters each) characters. But to store the same number of concepts with an intermediate layer of letters that act as standins for combinations of the primes, we can store the same 16 “concepts” with only 44 characters (8 characters to spell out the two-letter definitions of each letter on the intermediate level, 16 two-letter definitions for the output level - that’s 8 + 4 (the letter labels) + 32 = 44). So intermediate levels “optimize” the size of a dictionary by reducing storage space.
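For anyone who’d rather not count characters by hand, here is a minimal sketch of the bookkeeping in the toy example. The function names are my own invention for illustration, not anything from Changizi’s paper, and the accounting (definitions plus labels plus output words) simply follows the paragraph above.

```python
from itertools import product

PRIMES = "01"  # the two "semantic primes" of the toy example


def flat_cost(word_length):
    """Characters needed when every word is spelled out directly in primes."""
    words = ["".join(bits) for bits in product(PRIMES, repeat=word_length)]
    return sum(len(word) for word in words)


def layered_cost():
    """Characters needed with one intermediate level (a, b, c, d)."""
    # Intermediate level: each label is defined by a two-prime string.
    middle = {label: "".join(bits)
              for label, bits in zip("abcd", product(PRIMES, repeat=2))}
    definitions = sum(len(d) for d in middle.values())  # 4 definitions * 2 chars
    labels = sum(len(label) for label in middle)         # 4 one-character labels
    # Output level: every ordered pair of intermediate labels is one word.
    output = ["".join(pair) for pair in product(middle, repeat=2)]
    return definitions + labels + sum(len(word) for word in output)


print(flat_cost(4))    # 64
print(layered_cost())  # 44
```

Note that the flat count charges nothing for labels while the layered count does; that asymmetry is in the original example, and the sketch keeps it as-is.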

The relevant correlate in a natural language dictionary is meant to be “hypernyms,” that is, words like “vehicle” that cover for “car” and “buggy” and “rocket ship,” etc.

So, crunch some math and you find that to capture a vocabulary of 150,000 or so words (i.e. the size of an average pocket dictionary), the “optimal” number of levels is 7. Anything between 5 and 10 levels would be within 10% of optimum, actually. But any number of levels more or less than 7 is less efficient at reducing dictionary size while maintaining coverage.

A second prediction involves the “growth factor” by level. Returning to our example, we saved ourselves 20 characters by adding an intermediate level (down to 44 characters from 64). This savings can be captured in terms of a “level-level combinatorial growth exponent.” In the original example - where we had an alphabet of 2 and an output of 16 and only two levels (the alphabet and the output layer) - this exponent is obviously 4, because 2 times 2 times 2 times 2 is 16. Another way of saying it is that to get from the input layer to the output layer, we need four-character definitions, or we need the size of the dictionary to grow by a power of 4.

When we put in the middle layer, however, it’s not so dramatic. Now we need a factor of only 2 to define the middle layer (2 times 2 to get us the 4 words of the middle layer), and a factor of 2, in turn, to define the output layer from the middle layer (4 times 4 is sixteen = we need two letters each of a four-letter alphabet to define 16 words).
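The same arithmetic as a quick sketch, assuming (as the toy example does) that each level’s size is the previous level’s size raised to some exponent. The helper name `growth_exponent` is mine, purely for illustration.

```python
import math


def growth_exponent(lower_size, upper_size):
    """Exponent e such that lower_size ** e equals upper_size."""
    return math.log(upper_size) / math.log(lower_size)


# Two levels: a 2-letter alphabet straight to 16 output words.
two_level = growth_exponent(2, 16)

# Three levels: 2 primes -> 4 intermediate letters -> 16 output words.
three_level = [growth_exponent(2, 4), growth_exponent(4, 16)]

print(two_level, three_level)
```

The two-level case comes out at (approximately) 4, and both steps of the three-level case come out at 2, which is the drop described above.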

So by adding the middle layer, we drop our growth exponent from 4 to 2. Changizi hypothesizes that we can then make a second prediction based on this. Remember, our original prediction was that there would be 7 plus-or-minus two (hmmmm… where have I heard that before?) levels in the “optimal” dictionary’s hierarchy. If that’s the case, and if every level contains a uniform definitional length, then we can guess that an “optimal” combinatorial growth exponent should be 1.3. (I’ll leave the number-crunching to the reader - either that or look it up in the paper. Or trust me that it comes out right - I checked it.)

The third prediction is asserted rather than justified: namely, that thing about strict hierarchy I mentioned earlier. Each level should only use words in the level immediately before it in its definitions.

Alright, so these are all very cool concepts and have given me a lot of brain food over the past day. But …

… here comes the goofiness.

Leaving aside the issue of whether Changizi actually finds this kind of structure in the OED or not (he claims to - but I have grave doubts about his methods, and graver doubts about the validity of his conclusions), a lot of these assumptions simply don’t make sense for natural language. The most obvious question is: why should a natural language restrict itself to the strict hierarchy? Continuing with the earlier example - let’s say we have a concept that, rather than being 0001, i.e. three parts ‘0’ and one part ‘1’ (whatever that means in semantics!) - or “ab” in terms of our intermediate level - is just “001”? Put differently, what if we have a concept that is “a1”? This is something of a problem for the funness of the model, not only because it doesn’t let us crunch our simple exponential growth numbers anymore (for variable-length definitions at each level, we have to do more complicated math), but also because it introduces ambiguities. “001” can, after all, be represented as “0b” or “a1” equally well. So there are obvious model-theoretic reasons why we would want to prevent this - but I can’t think of any compelling real-world reasons to make these assumptions.

Indeed, I can think of compelling reasons to make the opposite assumptions. If we’re playing with an alphabet of semantic atoms (which Changizi gives good reasons to assume should have 10-60 members for English), it seems like we would want as many combinations of these atoms as the system will allow, for maximal expressive power. Indeed, the whole point of Zipf’s Laws is to demonstrate that variable length in phonemic specification is something of an optimization. Zipf doesn’t predict that we’ll have a set of pronounceable words of all the same length in human languages! Quite the contrary - the fact that some of these words are shorter than others is the result of a tradeoff between effort and specificity. Frequent words should be maximally abstract and maximally short. Infrequent words should be more specific and longer. I see no reason why this shouldn’t be just as true of a semantic-conceptual space as it is of a phonemic-lexical space. We would expect that some concepts in use in natural language should be “more specific” than others, and that these concepts should involve longer definitions than others.

Now - in theory Changizi has this in the form of the intermediate levels. But I see no reason why the intermediate levels should be prohibited from deviating from the ideal exponential growth factor of 1.3 at each level. That prediction amounts, really, to predicting systematic gaps in the lexicon at predictable levels of conceptual specificity. I need more convincing before I believe that such a thing is even an “optimization characteristic” of language, let alone that it actually obtains in English!
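To make the ambiguity worry about “001” concrete, here is a small sketch that lists every way of spelling a string of primes once definitions are allowed to mix levels freely. The `MIDDLE` table is the hypothetical intermediate level from the toy example, and `spellings` is a name I made up; nothing here comes from the paper itself.

```python
# Hypothetical intermediate level from the toy example above.
MIDDLE = {"a": "00", "b": "01", "c": "10", "d": "11"}


def spellings(bits):
    """All ways to write `bits` when primes and intermediate labels may be mixed."""
    if not bits:
        return [""]
    results = []
    # Option 1: keep the first prime as it is.
    results += [bits[0] + rest for rest in spellings(bits[1:])]
    # Option 2: replace the first two primes with their intermediate label.
    for label, expansion in MIDDLE.items():
        if bits.startswith(expansion):
            results += [label + rest for rest in spellings(bits[2:])]
    return results


print(spellings("001"))   # ['001', '0b', 'a1']
print(spellings("0001"))  # includes 'ab', the only spelling a strict hierarchy allows
```

Under the strict hierarchy each concept has exactly one spelling at its level; drop the restriction and even a three-prime concept has three.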

Now, I suppose the argument here is that dictionaries should organize themselves in this way, not that the conceptual space necessarily should. But I don’t see how we can get away from the idea of a dictionary as a proxy for the conceptual space. Doing so would be like saying that the dictionary sacrifices accuracy for the purpose of making itself optimally small. But there is no reason whatever to believe that such a process would happen in the real world. The purpose of a dictionary is to accurately record all words in use. The economic considerations of paper saving seem trivial in cost compared with the economic fallout from delivering a product that doesn’t live up to its stated purpose. It would be like worrying first about saving on metal and only a distant second on speed in designing a racecar. True, in the general case we have reason to believe that skimping on metal will make the car lightweight and presumably faster, but the ultimate purpose is to build something that goes fast, and if there are cases where spending a bit more on a bit more metal will accomplish that goal, then damnit that’s what you do!

In short, I find this paper a valuable first step, with massive bonus points for “interesting concept,” but I’m still skeptical. It seems to me that this is a much harder problem in reality than the model here can capture. It also seems that a lot of the assumptions need further thinking. We might be crossing domains - imposing constraints from one domain in an improper way on another. It’s going to take more sophisticated math to solve this problem properly, I think, and a more thorough explication of the assumptions before I’m completely convinced they’re on the right track.

However, as a thought experiment this paper is inspiring, and I do believe it lays the groundwork properly. This is an interesting question, and the answer given here is almost certainly on the right track, if maybe not complete.

on Jun 3rd, 2008: Linguistic Beauty Contest

Well this was inevitable, I suppose. Estonia is throwing a “language beauty contest,” evidently spurred by the urban myth that it once came in second to Italian in an Italian language beauty contest.

I’m gonna go out on a limb and say it gets first prize this time.

Honestly, what is the point of a language beauty contest? If ever there was a subject to which de gustibus non est disputandum applies… Every statement about how “beautiful” or “intelligent” a language is says more about the cultural prejudices of the speaker than it does about the target language.

Which, I suppose, is an interesting subject in its own right. Alright, fine, let’s have language beauty contests. But let’s do it right. Meaning, balance for male and female speakers and listeners, have naive listeners from all cultural backgrounds, devise some kind of a category rating system so that we can data crunch the results, and for Jebus sake NO GORAM “LANGUAGE EXPERT” JUDGES!!! Actually, come to think of it, this might just fall into that elusive category of useful things Sociolinguists could do (but generally don’t).

on Jun 3rd, 2008: Enron Dataset

Ben Fry has a cool graphic illustrating some word frequency data from the Enron Email Dataset. The graphic itself isn’t very useful (Fry admits this in the caption), but it’s certainly interesting as an attempt to make you feel “inside the information,” Matrix-like.

on Jun 3rd, 2008: Forvo

Via Omniglot I’ve learned of a new internets resource called Forvo. It’s a pronunciation repository.

Forvo is the place where you’ll find words pronounced in their original languages. Ever wondered how a word is pronounced? Ask for that word or name, and another user will pronounce it for you. You can also help others recording your pronunciations in your own language.

Useful kit, right? But here’s where it gets weird. Esperanto is one of the languages. I mean, don’t get me wrong - I count myself a mild supporter of the Esperanto movement, and I’m certainly interested in constructed languages as a researcher. But surely there’s a problem including it on a site that’s meant to give native pronunciations? Ok, fine, there are probably as many as 2,000 native Esperanto speakers. But they’re scattered all over the world, forming no cohesive linguistic community, and in all cases they acquired the language from nonnative speakers. There’s simply no meaningful sense in which one can “sound like” a native Esperanto speaker, is there?

Of course, I’m not sure how useful a repository of words pronounced in isolation is ever going to be, really. Useful, certainly, but limited.

on Jun 3rd, 2008: Welsh Corpus

There are now Welsh and Scottish Gaelic corpora available for download from the Language Engineering Resources for the Indigenous Minority Languages of the British Isles and Ireland project at Lancaster University. Kewl.

on Jun 3rd, 2008: Cross-linguistic Cursing

LanguageHat has a post called Chechens don’t lightly curse that’s about just what it says. Chechens, apparently, take cursing very seriously - to the point that telling someone to “fuck their mother,” a common expression in Russian, is a killing (or at least a fighting) matter in Chechen. Or, so says Anatol Lieven in a book explaining Russia’s loss in the Chechen Wars.

The ‘granddads’ forced the younger soldiers to buy useless things from them, hand over all their pay - and 20 marks a month was all we got. One young soldier in my squad had had to give most of his pay for a broken clock. I took it to the ‘granddad’, asked him, ‘Why did you sell him this?’ He cursed me. Now we Chechens don’t lightly curse each other - for us, this is a serious business. I broke the clock over his head. I got another three days in the cooler for that…

The “granddad” is, of course, a more senior member of the Russian military - in which hazing is apparently quite fierce. So one presumes this exchange took place in Russian. And indeed, in the footnotes:

Incidentally, it is not quite true that Chechens do not use the Russian expression, ‘xxxx your mother!’ when speaking to each other; but they only do so when speaking in Russian

Which does sort of expose this as typical ethnic posturing. Consider: most Chechens are, one assumes, completely competent in Russian, probably almost to the level at which they speak Chechen (Ethnologue reports that “most speakers are quite fluent in Russian”). If so, the Chechens who speak Russian near-natively would have linguistic intuitions about cursing in Russian. That is, they wouldn’t simply understand the literal translation of the phrase “fuck your mother,” they’d also know how it felt - meaning approximately how acceptable it is in Russian society. Now, surely offense at a curseword depends entirely upon social context and speaker intent. So, for example, if in English I tell someone to “fuck off,” whether or not it’s offensive at all has everything to do with my tone, the conversational context, and my relationship to the listener. Say it at a poker table and no harm no foul; saying it to my 84-year-old grandmother under any circumstances might well land her in the hospital.

Now - I’m completely open to the idea that different languages may attach different levels of social cost to taboo words (factoring in the context question, of course). My grandmother, after all, will be less tolerant of cursing in general than members of my own generation simply because her generation attaches a greater social cost in general to such words than mine does. In some important sense, we speak different dialects of English, and not just because she’s from Georgia and I’m from North Carolina. Another example - when I lived in Germany (over 10 years ago), there was a hit song called Ich find’ Dich Scheisse (“I think you’re shit”). Which was sort of funny for me, because of course the FCC would never allow such a thing on pop radio here, and yet there it was, playing in our kitchen almost every day for a couple of months. Obviously, whether as a result of the FCC, or just as a general cultural thing, there’s a lower social cost for cursewords in German than there is in English.

But that observation depends to a great extent on my status as a non-native speaker of German. Were I roughly as fluent in German as I am in English, then it probably wouldn’t have occurred to me that there was anything odd about the #4 song having the word “shit” in the chorus.

Which really does make one wonder how someone with presumably near-native proficiency in Russian takes it upon himself to give a taboo phrase more weight than it’s due. This is especially pertinent considering that the author himself admits, in a footnote, that most Chechens regularly use cursewords when speaking in Russian. So which is it? Is Chechen culture fundamentally opposed to taboo words regardless of the language in which they’re uttered, even if it is known to the listener that the word doesn’t carry as much weight in the native language? Or can they, in fact, distinguish between languages, in which case whence the outrage with the Russian soldier for having told him to go fuck his mother?

What’s going on, most likely, is that since the book is for foreign audiences unlikely to be too familiar with Chechnya, the author is projecting his culture’s self-image onto a group of willing listeners. In Japan and Korea most of my conversations with locals I’d just met involved them trying to convince me that their culture was special and, in some way, clearly superior to the rest of the world. Probably nothing different is going on here.

Nevertheless, my own experiences living in foreign countries lead me to think that there are significant differences in how socially acceptable the inventories of cursewords are across languages. There’s certainly room for some languages to have stricter taboos than others. Which all raises the very interesting question of why certain cultures seem to need taboo words more than others.

I’ve been considering doing a weekly post on “research sociolinguists could do that would actually be useful,” both as a way of blowing off steam about the dismal state of language and gender research and also as a way of talking about a field of Linguistics that I find fascinating, even if I prefer to think of it as a subfield of Sociology rather than of Linguistics. One useful thing a sociolinguist can do (probably has done and I just don’t know about it) would be to do a cross-linguistic comparison of the force of taboo words. I expect that this could even be done online for some languages - with frequency acting as a kind of proxy for acceptability - though of course with the caveat that it has to be frequency across genres. Which, in CompLing terms, would be like saying that you take all internet pages in a particular language as a corpus and impose, as a requirement for “acceptability,” that the word have a high Juilland’s D score.
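For the curious, here is a rough sketch of the dispersion measure mentioned above (usually spelled Juilland’s D). I’m assuming the common textbook form, D = 1 - V / sqrt(n - 1), where V is the coefficient of variation of the word’s frequency across n equally sized corpus parts (genres, say), computed with the population standard deviation. The function name and the frequency lists are made up for illustration, and real genre slices would need normalizing to equal size first.

```python
import math


def juilland_d(part_frequencies):
    """Juilland's D for one word, given its frequency in n equal-sized corpus parts."""
    n = len(part_frequencies)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    mean = sum(part_frequencies) / n
    if mean == 0:
        raise ValueError("word does not occur in the corpus")
    # Population variance of the per-part frequencies (divide by n, not n - 1).
    variance = sum((f - mean) ** 2 for f in part_frequencies) / n
    coefficient_of_variation = math.sqrt(variance) / mean
    return 1 - coefficient_of_variation / math.sqrt(n - 1)


# Invented counts for two hypothetical words across four genres.
evenly_spread = juilland_d([5, 5, 5, 5])    # same frequency in every genre
one_genre_only = juilland_d([20, 0, 0, 0])  # confined to a single genre
print(evenly_spread, one_genre_only)
```

A word used equally in every genre scores 1, and a word confined to one genre scores (about) 0, so a cutoff on D is one way to cash out “frequent across genres.”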

In any case, what I always find interesting about cultural chauvinists is that their arguments tend to rely on the general sameness of people across cultures. If Western and Chechen cultures were really radically different, then we wouldn’t be able to appreciate the “goodness” of putting a high social premium on cursewords. Getting this point across requires the author to rely on the universal existence of cursewords - to varying levels of social acceptability - across all cultures. Meaning that everyone understands, in the required sense, that cursewords are used to insult - even the barbaric Russians - and Chechnya doesn’t turn out to be so special after all.

on May 26th, 2008: WALS Online

This is one of the more exciting new sites I’ve seen on the web in quite some time. It’s called WALS Online - for “World Atlas of Linguistic Structures,” and it’s a map database of language-by-characteristic-features. From the homepage:

WALS consists of 141 maps with accompanying texts on diverse features (such as vowel inventory size, noun-genitive order, passive constructions, and “hand”/“arm” polysemy), each of which is the responsibility of a single author (or team of authors). Each map shows between 120 and 1110 languages, each language being represented by a symbol, and different symbols showing different values of the feature. Altogether 2,650 languages are shown on the maps, and more than 58,000 datapoints give information on features in particular languages.

I haven’t played with it much yet, but it looks rather promising. A news blog is available here.

on May 26th, 2008: Language Module

Greetings. My name is Joshua Herring, and I’m a PhD student in Computational Linguistics at Indiana University. Not that you’d know it by the blog I’ve been keeping for the past two years. That thing is mostly about politics - which really, honestly, was never the intention, but there you have it. The internets do play strange tricks on the mind.

Recently (meaning several months ago, but who’s counting?) I ditched blogger and opened up a website of my very own - jwherring.com - the intention being to “reboot.” This blog is for Computational Linguistics. Well, and technology. Every time I write about politics or law on this blog, God kills a kitten.