One of the terms that's always confused me in linguistics is that of languages being genetically related. Linguistic relatedness doesn't necessarily have anything to do with biological relatedness, so why use this term?

As it turns out, the term "genetic" here doesn't mean "having to do with genes", but rather "having a common origin" (think "genesis"). I.e., there are two distinct meanings of the word genetic, one for biology and one for linguistics. Of course, no one's ever bothered to explain this to me... a glaring oversight, especially since the biological meaning is the common one.

Now how about the Chinese translation? In Chinese, to say languages x, y, and z are genetically related, you can use an awkward phrase like this: "x, y, z 等語言在發生學上具有淵源關係". WTF? What does that even mean? I mean, I know it's supposed to mean "x, y, z etc. are genetically related", but really what it translates to is something like "in genetics, x, y, and z have a common-origin relationship", since 發生學, short for 發育生物學 (according to the Wikipedia redirect), means genetics or developmental biology, in the biological sense. This is a terrible mistranslation of the English term, imbuing it with a biological significance that it really shouldn't have.

So, I object to the use of the term 發生學 to mean "genetic" in the linguistic sense. It's a good thing I've figured this out... Whereas before I would furrow my eyebrows in confusion whenever I encountered the term in Chinese, I will now shake my head in disapproval instead.

On Penkyamp and other atrocities

I recently stumbled across a romanization system for Cantonese called Pênkyämp (which, amazingly enough, is supposed to be pronounced [pʰɪŋ³³jɐm⁵⁵], or "ping yum", for the non-IPA-readers). Yes, it is even discussed on mainland bulletin boards and promoted in blogs.

Pênkyämp is, hands down, the most confusing Cantonese romanization ever devised. I suppose the distinguishing feature is that it encodes the length distinction, e.g. between [sɐm⁵⁵] 'heart' and [saːm⁵⁵] 'three'. But it does this by adding a consonant symbol at the end of the short-vowel syllable. So, 'three' is spelled "sam", which is fine, but 'heart' is spelled "samp", which just looks ridiculous. Similarly, 'square' [fɔːŋ⁵⁵] is "fong", whereas 'wind' [fʊŋ⁵⁵] is "fonk"; and 'eat' [sɪk²²] is "sek", while 'rock' [sɛk²²] is "seg".
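To make the rule concrete, here is a rough Python sketch of what a reader has to do to recover vowel length from a spelling. It is only my reading of the pattern in the six examples above, plus the coda letters (-p, -t, -k, -y, -w) discussed further down in this post; it is not an official specification of Pênkyämp, and the names `SHORT_CODAS` and `vowel_is_short` are made up for illustration.

```python
# Final letters that, going by the examples in this post, signal a SHORT vowel.
# Assumption: inferred from "samp", "fonk", "sek" and the -p/-t/-k/-y/-w list,
# not taken from any official Penkyamp documentation.
SHORT_CODAS = set("ptkyw")


def vowel_is_short(syllable: str) -> bool:
    """Guess vowel length from the last letter of a toneless spelling."""
    return syllable[-1].lower() in SHORT_CODAS


# The six example spellings from the paragraph above.
EXAMPLES = [
    ("sam", "three"), ("samp", "heart"),
    ("fong", "square"), ("fonk", "wind"),
    ("seg", "rock"), ("sek", "eat"),
]

for spelling, gloss in EXAMPLES:
    length = "short" if vowel_is_short(spelling) else "long"
    print(f"{spelling:5} {gloss:7} {length}")
```

Note that the function never looks at the vowel letter at all: everything it needs is at the far end of the syllable.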

This scheme offends me because it's wildly non-iconic: things that are longer should look longer. But in "sam" vs. "samp", the vowels, which are the crucial difference here, look exactly the same, and to make the vowel shorter you add something to the end of the syllable rather than modify the vowel symbol itself. Modifying the vowel symbol would be the desirable thing, I think, since the vowel is the most salient part of the syllable. Most systems for Cantonese do exactly that, viz. "saam" vs. "sam".

Now, it is true cross-linguistically that voiceless coda consonants (like -p, -t, -k) make vowel durations shorter, and voiced ones (-b, -d, -g) make vowel durations longer. However, this is only subphonemically true for languages that already have a voicing contrast in that position. I know of no orthographies that use this fact to indicate actual, phonemic vowel length. Furthermore, syllables like [saːm⁵⁵] and [sɐm⁵⁵] tend to have the same overall duration. When the vowel shortens, the coda consonant lengthens to make up for it. So if you're going to tack something on the end, why not write "saam" and "samm"? Isn't that much more intuitive?

My next objection is the tone marks. Apparently you can choose between numbers and diacritics, and since the numbers are standard (1 through 6 stand for high 55, rising 25, mid 33, low 21/11, low rising 23, and low-mid 22), I have no problems with them. The terrible design choice is the diacritic tone marks. Going from tones 1 through 6, we have ä, ã, â, a, á, à.
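For reference, the number-to-diacritic correspondence just listed can be written down as a small Python table. This is a sketch based only on the series ä, ã, â, a, á, à given here; which letter of a syllable is supposed to carry the mark is not something I've described, so the (made-up) function `mark_tone` simply marks whatever vowel letter you hand it.

```python
import unicodedata

# Unicode combining marks for tones 1-6, following the series above.
TONE_MARKS = {
    1: "\u0308",  # diaeresis (umlaut): high 55
    2: "\u0303",  # tilde: rising 25
    3: "\u0302",  # circumflex: mid 33
    4: "",        # unmarked: low 21/11
    5: "\u0301",  # acute: low rising 23
    6: "\u0300",  # grave: low-mid 22
}


def mark_tone(vowel: str, tone: int) -> str:
    """Attach the diacritic for `tone` to `vowel` and compose the result (NFC)."""
    return unicodedata.normalize("NFC", vowel + TONE_MARKS[tone])


print("".join(mark_tone("a", t) for t in range(1, 7)))  # äãâaáà
print("P" + mark_tone("e", 3) + "nky" + mark_tone("a", 1) + "mp")  # Pênkyämp
```

The second line rebuilds the system's own name from its tones: mid (3) on the first syllable and high (1) on the second, matching [pʰɪŋ³³jɐm⁵⁵].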

First of all: an umlaut for the high tone? An umlaut?! Umlauts, whenever they are used, are used to change vowel quality, i.e., the vowel itself, and not length or pitch or stress or whatever. Umlauts are not appropriate for marking tone. Ever. (As an example, look at pinyin "u" vs. "ü".) And besides, what's wrong with the macron? Wouldn't "ā" do just as well, if not better?

Next, the second and fifth tones. The second tone is by far the more common of the two, and so should get the less weird tone mark. If you're going to use an acute accent for rising tone, then "á" should mark second tone, not the fifth tone. I suppose the use of the tilde for rising tone may be inspired by its use for the glottalized rising tone in Vietnamese; however, in most orthographies, the tilde is used for nasalization. The tilde is also reminiscent of the IPA falling-rising tone mark, a complex symbol which looks like this: [a᷈]. But this association also seems inappropriate for the straightforwardly rising tone in Cantonese. For the fifth tone, a haček would seem more appropriate (cf. the third tone in Mandarin).

The third tone, a level tone marked with a circumflex, is completely puzzling to me. Why mark a level tone with a hat? It makes no sense. In Vietnamese, circumflexes are used to distinguish vowels; in IPA, they're used for falling tones. There's just no motivation for this usage here.

Finally, the low (fourth) tone is left unmarked in Pênkyämp. This decision also seems counter-intuitive to me. If any tone should be unmarked, it should be the first tone. This is the tone that most (stressed syllables of) loanwords and many onomatopoeic words have. Yale romanization doesn't take this strategy, choosing instead to mark the more extreme (high or low) tones, and leaving the mid tones unmarked, which I suppose is also a reasonable strategy. But an unmarked low tone for Cantonese? Again, there is no motivation for this, and no easy way to remember this.

The choice of vowel notation for [y] and [œ], namely "eu" and "eo", is also sub-optimal. Cantonese romanizations have been using "ue" for [y] for years. Jyutping uses "yu", which is also OK. "eu" looks like it should be [ew]; alternatively, "eu" is the common (and Yale) romanization for [œ] (cf. my own name 祥 "Cheung"). This is a poor design choice. Similarly, why use "eo" for [œ] when "oe" looks more like IPA and "eu" is the romanization you see on the street? These choices are especially illogical considering that the [-ɛ] rhyme is spelled "-e", and considering the existence of the rhyme [-ɛːw]. Well, OK, that should be spelled "-eu", right? No! That's been taken by [-y] already, so instead Pênkyämp makes an awkward work-around and spells it "-eau". This problem could have been avoided by choosing more sensible vowel spellings in the first place.

But back to vowel length. This system makes a choice. It chooses to represent the vowel length distinction in Cantonese as primary, and kind of ignores vowel quality. Most other romanizations go the other way, distinguishing vowel quality but not representing length. But the fact of the matter is that you get both. There is a length distinction, and the short vowels all happen to be higher and more central than their long vowel counterparts. So who's right? Is it vowel length, or vowel quality? The answer is that it's both; the system is redundant. Why don't we just let our romanization system be redundant as well? Take, for example, the case of Taiwanese romanization, where the [-wa] rhyme is spelled "-oa", and the [-wi] final is spelled "-ui". Why not use "u" or "w" as the medial for both? Well, because they're different rhymes, and you might as well make them visually distinct. It might be a surprising design choice, but it's not a bad one.

Pênkyämp basically makes everything it can make obscure, obscure. The vowels are spelled funny. The tones are marked funny. Short/long vowels are distinguished, but not in any normal way: no doubled letters, no colons or IPA length marks, no macrons. No, to figure out if a vowel is long or short (which, remember, essentially changes what vowel it is), you have to glance over one or two letters and see if there's a -p, -t, -k, -y, or -w there, then modify the vowel in your head to match. (One could argue that you're supposed to read the entire rhyme as a unit, but the question remains: how to make these units, which are composed of alphabetic symbols, most easily learned/parsed?) Moreover, making this short/long distinction serves no purpose. It just makes it more confusing.

I actually tried reading a sample text written in Pênkyämp, and it was pure torture. When every symbol is used in a nonconventional way, which is what Pênkyämp does, it becomes a monumental task just to parse one syllable. Does Pênkyämp offer any ideas or insights of value to the larger issue of Cantonese romanization? I'm afraid the answer to that is an emphatic "no".

Everybody's Cantonese

I've been going through an old book entitled Everybody's Cantonese (1949, by Chan Yeung Kwong), and although the vocabulary is pretty basic, I did find some old pronunciations and interesting characters. For example, 咁 is transcribed as gom3, with a back rounded vowel (nowadays usually pronounced gɐm3); and 粒 is transcribed as nɐp5 (which I've always heard as lɐp5). These appear to be the old pronunciations which have gone out of fashion.

Interesting characters include 氈 dzin1 'blanket', 笪 daːt3 'classifier for places', 樖 pɔ1 'classifier for trees', and 擸𢶍 laːp6saːp3 'trash' (now usually written 垃圾). I've always wondered about the word for trash, which in mainland Mandarin is pronounced la1ji1, but in Taiwan is pronounced le4se4. Why the difference? Are one or both of the variants related to the Cantonese word, and how?

The newspapers are making a big deal about how the mainland translation skips out on "communism" and "dissent", which got me looking for the full, uncut translation from Hong Kong-based broadcaster Phoenix Satellite Television, which is mentioned—but, rather inconveniently, not linked to—in the English-language media. So I extracted the text and have posted it below for general (and translators') interest's sake.

各位同胞: My fellow citizens:
今天我站在這裡,為眼前的重責大任感到謙卑,對各位的信任心懷感激,對先賢的犧牲銘記在心。我要謝謝布希總統為這個國家的服務,也感謝他在政權轉移期間的寬厚和配合。 I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.

It's always interesting when you learn a character for a word you know but didn't know had a character. Especially when it's super-complex. Take, for example, Cantonese zok6 'chisel': 鑿

Looks like 丵 zok6/zhuó 'dense grass' (?) is the phonetic, then there's a mortar (臼), and some beating (殳) with a metal object (金).


Here are some characters that look alike.

戊 wù/mou6 '5th of the 10 Heavenly Stems'. Used as phonetic in 茂 mào/mau6 'luxuriant; profuse'

戌 xū/seot1 '11th of the 12 Earthly Branches'.

戎 róng/jung4 'weapons; military affairs'. Used as phonetic in 絨 'down, velvet'. Interestingly, 'thief' 賊 zéi/caak6 looks like it has this as a component, but the original character had phonetic 則 zé/zak and the weapon radical 戈 gē/gwo1. According to Wenlin, there's an obsolete form 𧵪.

戍 shù/syu3 'defend; garrison'.

"Better Cantonese"

Recently I've been listening to these one-minute segments on how to make your Cantonese more correct, called 粵講粵啱一分鐘.

It's basically a big prescriptivist-fest, telling native Cantonese speakers to correct their "lazy sounds" and not merge /n-/ and /l-/, not drop initial /ng-/, etc. There are two "hosts": the guy is 何文匯, whose name is on various Cantonese dictionaries, and the girl is 黃念欣, who I don't know.

For some reason, the guy is really good at making you feel stupid. "People are too lazy to look up the dictionary," he says. Or, "if people would just think logically, they wouldn't pronounce things all wrong." The girl is much more encouraging, but says things like, "if you say things wrong, you'll sound really childish."

Despite the tone, there are some interesting etymological/philological notes in some of the segments, and I did learn some obscure characters which I've seen before but never knew how to read.

Although they can get rather pedantic, in some ways it's kind of reassuring that people are being prescriptivist about Cantonese, because that means people actually care about preserving the language... and that's a luxury not all languages enjoy.

how to read 躍?

Brushing up on my Chinese, I came across this character: 躍. “I wonder how you pronounce it,” I said to myself, and tried to look it up under pinyin yào, since it shares the phonetic with 耀 ‘shine’. No dice.

As it turns out, 躍 ‘leap’ is Mand. yuè / Canto. joek6 (or joek3). But in Middle Chinese, it was homophonous with 藥 yào / joek6 ‘medicine’ (they’re listed right next to each other in the Guangyun). They’re still homophonous in Cantonese, but in Mandarin they’ve taken on separate readings. A similar thing happened with 角 ‘corner’ (Canto. gok3), which is sometimes read jiǎo and sometimes jué.

To add to the confusion, here are some more characters with the same phonetic but read differently:

  • 耀 yào / jiu6 ‘shine’
  • 戳 chuō / coek3 ‘jab, poke’
  • 擢 zhuó / zok6 ‘pull out’

While reading up on Tibetan, I came across a disturbing entry in the dictionary: 暫 (the book's in Chinese). Apparently, the pronunciation in China is zàn, but in Taiwan it's zhàn. Not too big of a deal, but it's one more thing to remember when looking things up in the dictionary, which is annoying.

I've moved my blog to my new domain. My friend Adrian is generously giving me some of his server space to host my web presence, and... well, this is huge! Now I've got more database and blogging and content management capabilities than I can shake a stick at!

So here's the plug: if you need a place for your web site, go to for all your web hosting needs. Reliable hosting, great prices, and friendly support. And you can't beat the 30-day money-back guarantee.

So, why blyt, you ask? Well, if you look up 筆 'pen' in ancient Chinese texts, you'll find the phrase 不律為筆. Roughly translated, it says "No rules is pen." This makes no sense unless you read it not for meaning, but for pronunciation: "The character 筆 is pronounced like 不 + 律", which linguists nowadays guess might have sounded like b-liwət. Of course, that's rather inconvenient for a domain name, and naturally was taken, so here we are. Welcome to, where I shall put digital pen to digital paper.

