on Jun 3rd, 2008What’s an Optimal Dictionary?

Mark Changizi is something of a sensation after his recent SciAm appearances, so I picked up one of his papers - the one about economical hierarchy organization in dictionaries.

This is a really interesting bit of work - one of those “why didn’t anyone consider this before?” kinds of things. Of course, people have thought of it before, in terms of efficiencies in semantic hierarchies, but Changizi is the first I’m aware of to consider a dictionary as a language optimization problem.

I remember when I first fell in love with Computational Linguistics it was because of research exactly like this. I actually came to IU to study Cognitive Science, something like Psycholinguistics, actually. I sat in on a CL class when I really should’ve been taking Andy Clark’s Philosophical Foundations seminar (the two classes met at the same time) - his last at IU, as it turns out. But I’m glad I made the rash decision I did - because I got exposed to Zipf’s Law and Shannon’s Information Theories, and WOW! Something about the meta-scientific nature of it all really caught me. It was like applied philosophy. We weren’t exactly doing “down and dirty” empirical research, but neither were we playing with building blocks spawned by Chomsky’s imagination. The idea that there were mathematical limits on language production seems obvious in retrospect, but at the time it was a kind of revelation.

Changizi takes me back to some of that.

The question is this: what would an “optimal dictionary” look like? We can speculate that it would have two characteristics.

First, it would involve an ideal tradeoff between expressive power and compression. That is, it would need to be as compact (in terms of the number of items needed to define all its entries - think of it as “total size”) as possible without giving up on complete coverage of the data (i.e. all the words attested in the language). This desideratum has to mostly be studied in terms of hierarchical levels - for two reasons. The first of these is a priori - it’s just in the nature of a “good dictionary definition” that it defines its entry in terms of less-specific, more abstract terms. The second of these is realistic: since there’s no point in setting an upper bound on how many concepts an ideal language needs (it needs as many as people find useful, obviously), and no way to measure how well the concepts in use in a language are covered by the words (no one, that I know of, has a model for “conceptual redundancy” between items that can be tested), the best we can do is hypothesize that the optimal vocabulary will arrange its hierarchy of words in such a way that dictionary coverage is “compact” in the sense of “shortest token-length definitions without sacrificing coverage” mentioned earlier.

Second, it would involve a “strict definitional hierarchy.” That is, there would be a set of “atomic” words which are not defined in terms of any other words (something like the “semantic primes” of Natural Semantic Metalanguage theory), and each level in the hierarchy would draw only from words in the immediately preceding level.

With these assumptions, Changizi is able to lay out some empirically testable conditions for “economy” in dictionaries like the OED. First, it should have an “optimum” number of levels in its hierarchy.

To aid in understanding this concept, consider a binary alphabet - 0 and 1. Obviously there are four possible two-letter words we can make with this: 00, 01, 10, and 11. By the same reasoning, there are 16 four-letter words we can make (I won’t spell them all out - exercise for the reader and all that). But what if we have an intermediate level in which all four of the two-letter words are represented by letters? I.e. a = 00, b = 01, c = 10, d = 11. Well, we can still get our target output of 16 possible words with these letters, but the words themselves only require two characters of storage space rather than four (because they are “ab” instead of “0001,” etc.). So to store 16 “concepts” in a “dictionary” that only allows definitions in terms of strings of the original two “semantic primes” (0 and 1), we need 64 (16 strings of four letters each) characters. But to store the same number of concepts with an intermediate layer of letters that act as standins for combinations of the primes, we can store the same 16 “concepts” with only 44 characters (8 characters to spell out the two-letter definitions of each letter on the intermediate level, 16 two-letter definitions for the output level - that’s 8 + 4 (the letter labels) + 32 = 44). So intermediate levels “optimize” the size of a dictionary by reducing storage space.

The relevant correlate in a natural language dictionary is meant to be “hypernyms,” that is, words like “vehicle” that cover for “car” and “buggy” and “rocket ship,” etc.

So, crunch some math and you find that to capture a vocabulary of 150,000 or so words (i.e. the size of an average pocket dictionary), the “optimal” number of levels is 7. Anything between 5 and 10 levels would be within 10% of optimum, actually. But any number of levels more or less than 7 is less efficient at reducing dictionary size while maintaining coverage.

A second prediction involves the “growth factor” by level. Returning to our example, we saved ourselves 20 characters by adding an intermediate level (down to 44 characters from 64). This savings can be captured in terms of a “level-level combinatorial growth exponent.” In the original example - where we had an alphabet of 2 and an output of 16 and only two levels (the alphabet and the output layer), this exponent is obviously 4 - because 2 times 2 times 2 times 2 is 16. Another way of saying it is that to get from the input layer to the ouput layer, we need four-character definitions, or we need the size of the dictionary to grow by a power of 4.

When we put in the middle layer, however, it’s not so dramatic. Now we need a factor of only 2 to define the middle layer (2 times 2 to get us the 4 words of the middle layer), and a factor of 2, in turn, to define the output layer from the middle layer (4 times 4 is sixteen = we need two letters each of a four-letter alphabet to define 16 words).

So by adding the middle layer, we drop our growth exponent from 4 to 2. Changizi hypothesizes that we can then make a second prediction based on this. Remember, our original prediction was that there would be 7 plus-or-minus two (hmmmm… where have I heard that before?) levels in the “optimal” dictionary’s hierarchy. If that’s the case, and if every level contains a uniform definitional length, then we can guess that an “optimal” combinatorial growth exponent should be 1.3. (I’ll leave the number-crunching to the reader - either that or look it up in the paper. Or trust me that it comes out right - I checked it.)

The third prediction is asserted rather than justified: namely, that thing about strict hierarchy I mentioned earlier. Each level should only use words in the level immediately before it in its definitions.

Alright, so these are all very cool concepts and have given me a lot of brain food over the past day. But …

… here comes the goofiness.

Leaving aside the issue of whether Changizi actually finds this kind of structure in the OED or not (he claims to - but I have grave doubts about his methods, and graver doubts about the veracity of his conclusions), a lot of these assumptions simply don’t make sense for natural language. The most obvious being - why should a natural language restrict itself to the strict hierarchy? Continuing with the earlier example - let’s say we have a concept that, rather than being 0001, i.e. three parts ‘0′ and one part ‘1′ (whatever that means in semantics!) - or is “ab” in terms of our intermediate level - is just “001?” Put differently, what if we have a concept that is “a1?” This is something of a problem for the funness of the model, not only because it doesn’t let us crunch our simple exponential growth numbers anymore (for variable-length definitions at each level, we have to do more complicated math), but also because it introduces ambiguities. “001″ can, after all, be represented as “0b” or “a1″ equally well. So there are obvious model-theoretic reasons why we would want to prevent this - but I can’t think of any compelling real-world reasons to make these assumptions. Indeed, I can think of compelling reasons to make the opposite assumptions. If we’re playing with an alphabet of semantic atoms (which Changizi gives good reasons to assume should have 10-60 members for English), it seems like we would want as many combinations of these atoms as the system will allow, for maximal expressive power. Indeed, the whole point of Zipf’s Laws are to demonstrate that variable length in phonemic specification is something of an optimization. Zipf doesn’t predict that we’ll have a set of pronounceable words of all the same length in human languages! Quite the contrary - the fact that some of these words are shorter than others is the result of a tradeoff between effort and specificity. Frequent words should be maximally abstract and maximally short. Infrequent words should be more specific and longer. I see no reason why this shouldn’t be just as true of a semantic-conceptual space as it is of a phonemic-lexical space. We would expect that some concepts in use in natural language should be “more specific” than others, and that these concepts should involve longer definitions than others. Now - in theory Changizi has this in the form of the intermediate levels. But I see no reason why the intermediate levels should be prohibited from deviating from the ideal exponential growth factor of 1.3 at each level. That prediction amounts, really, to predicting systematic gaps in the lexicon at predictable levels of conceptual specificity. I need more convincing before I believe that such a thing is even an “optimization characteristic” of language, let alone that it actually obtains in English!

Now, I suppose the argument here is that dictionaries should organize themselves in this way, not that the conceptual space necessarily should. But I don’t see how we can get away from the idea of a dictionary as a proxy for the conceptual space. Doing so would be like saying that the dictionary sacrifices accuracy for the purpose of making itself optimally small. But there is no reason whatever to believe that such a process would happen in the real world. The purpose of a dictionary is to accurately record all words in use. The economic considerations of paper saving seem trivial in cost compared with the economic fallout from delivering a product that doesn’t live up to its stated purpose. It would be like worrying first about saving on metal and only a distant second on speed in designing a racecar. True, in the general case we have reason to believe that skimping on metal will make the car lightweight and presumably faster, but the ultimate purpose is to build something that goes fast, and if there are cases where spending a bit more on a bit more metal will accomplish that goal, then damnit that’s what you do!

In short, I find this paper a valuable first step, with massive bonus points for “interesting concept,” but I’m still skeptical. It seems to me that this is a much harder problem in reality than the model here can capture. It also seems that a lot of the assumptions need further thinking. We might be crossing domains - imposing constraints from one domain in an improper way on another. It’s going to take more sophisticated math to solve this problem properly, I think, and a more thorough explication of the assumptions before I’m completely convinced they’re on the right track.

However, as a thought experiment this paper is inspiring, and I do believe it lays the groundwork properly. This is an interesting question, and the answer given here is almost certainly on the right track, if maybe not complete.

One Response to “What’s an Optimal Dictionary?”

  1. theory of constraints definitionon 21 Jun 2008 at 12:41 am

    [...] [...]

Trackback URI | Comments RSS

Leave a Reply