Unicode and "character encodings"

The only thing that computers understand, fundamentally, is the difference between a one and a zero. Numbers. That's all. To express letters so that computers can manipulate them, each letter has to be assigned a unique number.
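To make that concrete, here is a minimal sketch using Python's built-in `ord` and `chr` functions (the choice of Python is mine, not something this post depends on). It only shows that a letter and its number are two views of the same thing.

```python
# Every character maps to a number (its "code point") and back again.
letter = "A"
number = ord(letter)   # character -> number
print(number)          # 65
print(chr(number))     # number -> character: A
print(ord("a"))        # 97: lowercase "a" is a different character
```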

Self-evidently, the world's languages use far more letters than the English alphabet contains, so when computers spread beyond Anglophone countries, letter-representation systems had to expand and adjust. The eventual result was Unicode.

The 1960s and 1970s witnessed many competing attempts to assign numbers to letters. Increasing internationalization only made the problem worse and more urgent. Eventually (simplifying extremely!) the "solution" settled upon was a plethora of "character encodings" using the same set of numbers to represent different sets of letters; any document sent had to signal somehow what encoding it was using, so that the computer could display the proper letters.
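As a rough illustration of "the same number, different letters," the sketch below decodes one byte under three legacy encodings that ship with Python. The three encodings are just examples I picked; many others behave the same way.

```python
# One byte (the number 233) read under three different legacy encodings.
raw = bytes([0xE9])

decoded = {}
for encoding in ("latin-1", "cp1251", "iso8859_7"):
    decoded[encoding] = raw.decode(encoding)
    print(encoding, decoded[encoding])

# latin-1 (Western European) gives an accented Latin e,
# cp1251 (Cyrillic) gives a Cyrillic letter,
# iso8859_7 (Greek) gives a Greek letter.
```

Nothing in the byte itself says which reading is right, which is why the document has to signal its encoding.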

If you've ever gotten a spam email that looked like gobbledygook in your email application, you've been the victim of a character encoding problem. Either the email didn't correctly signal which encoding it was using, or your computer doesn't know how to handle that particular encoding. Or, of course, both.
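Here is a small sketch of how that gobbledygook arises, assuming a message written in UTF-8 and a mail program that wrongly reads it as Windows-1252 (one common mismatch among many; the word "café" is just a stand-in).

```python
# Text is sent as UTF-8 bytes but read back with the wrong encoding.
original = "caf\u00e9"                 # "café"
sent = original.encode("utf-8")        # what goes over the wire
garbled = sent.decode("cp1252")        # what the wrong guess displays
print(garbled)                         # two junk characters replace the é

# If the wrong guess is known, the damage can often be reversed.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

The repair step only works when you know (or can guess) which wrong encoding was applied, and when no bytes were lost along the way.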

The real nail in the coffin of these character-encoding systems is a multilingual document or database, such as a research library's catalogue. A single encoding label cannot cover records written in several scripts at once.

Finally, in the late 1980s, the International Organization for Standardization (ISO) and, in a parallel effort, the engineers who went on to found the Unicode Consortium decided enough was enough. They determined that the way forward was to expand the set of numbers used to represent letters hugely, and assign each letter its own unique number, so that there was no possibility of confusion.

Of course it didn't turn out to be that simple. Exactly what a "letter" is became a vexed question. Also, there are several schemes (each, confusingly, called an "encoding"; examples include UTF-8 and UTF-16) that computers can use to store Unicode characters as bytes. Still, Unicode is and will likely remain the gold standard for computer representation of the world's languages.
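A quick sketch of what that means in practice: one Unicode character has one number, but UTF-8 and UTF-16 write that number down as different bytes. The example character is my own choice.

```python
# One character, one Unicode number, two different byte representations.
char = "\u00e9"                      # é, Unicode number U+00E9
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")

print(hex(ord(char)))                # 0xe9
print(utf8_bytes.hex())              # c3a9
print(utf16_bytes.hex())             # 00e9

# Both decode back to the same character.
print(utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be"))
```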

What does this mean for you? Well, if you have been using MARC's anglicizations of foreign alphabets in your catalogue, now may be the time to look into converting your records to Unicode. This has been done mechanically on very large and linguistically diverse catalogues with impressively accurate results.

If you have internationalized content on your web pages, you should check what character encoding it is in; if that encoding is not Unicode, you should absolutely consider conversion. Be aware, also, that some popular web-design applications (FrontPage is an egregious offender) do not use Unicode; they use Microsoft-proprietary and heavily discredited encodings instead. Fix these, or ask someone else to! Future web-page maintainers will thank you.
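For whoever ends up doing the fixing, here is a minimal sketch of the check-and-convert step. The function name `to_utf8` and the Windows-1252 fallback are my assumptions; your pages may use a different legacy encoding, and a wrong guess will silently produce wrong letters.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the page content as UTF-8 bytes.

    Hypothetical helper: 'fallback' is an assumed legacy encoding and
    must be replaced with whatever the page really uses.
    """
    try:
        raw.decode("utf-8")          # already valid UTF-8? leave it alone
        return raw
    except UnicodeDecodeError:
        text = raw.decode(fallback)  # interpret using the assumed encoding
        return text.encode("utf-8")  # write the same text back as UTF-8


legacy_page = "caf\u00e9".encode("cp1252")   # stand-in for a file's bytes
converted = to_utf8(legacy_page)
print(converted)                             # b'caf\xc3\xa9'
```

After converting, the page's own encoding declaration (the charset in its header or meta tag) has to be changed to match, or browsers will still guess wrong.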

Jukka Korpela has an occasionally techie but still readable tutorial on character encodings. Recommended.

Could you give us a pointer to more information about the successful mechanical conversion to Unicode on large catalogs you speak of? The way you put it, it almost sounds like they mechanically converted from romanization/anglicization to Unicode vernacular... but that doesn't seem likely? Anyhow, a pointer to more information would be great!

What I know about this comes from talking to the cataloguers at the University of Wisconsin at Madison. The numbers I heard (allowing for hazy memory) were a few hundred records having to be kicked back for manual attention (typos in the original records, usually) out of the entire catalogue, which contains a substantial East Asian collection along with, of course, materials from all over the world.

The conversion pioneers, I believe, were at Yale. I don't know if they've published anything about their experience, but I'll check it out and get back to you.

Karen Coyle has an excellent, excellent article on this topic: "Unicode: The Universal Character Set Part 2: Unicode in Library Systems." Journal of Academic Librarianship 32:1 (Jan 2006) p101-3. Full-text is in Elsevier ScienceDirect for those who have access to that.

The article cites this page from the Library of Congress that (among other things) lays out the relationships between MARC encoding and Unicode in both human- and computer-readable forms.

They do not, alas, provide actual translation software, but given those tables, the translation shouldn't be too terribly horrendous a programming job (assuming your ILS has some way for you to batch-manipulate records).
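To give a feel for the shape of that programming job, here is a very rough sketch. Everything in it is an assumption for illustration: the function name, the two-entry table (which should be checked against, and replaced by, the full Library of Congress tables), and the simplification that only plain ASCII letters and combining accents appear. It relies on the MARC-8 convention that an accent byte comes before its base letter, whereas Unicode puts the combining mark after; escape sequences that switch character sets are not handled at all.

```python
import unicodedata

# Hypothetical two-entry excerpt (assumed values: grave, acute).
# A real converter would load the complete Library of Congress tables.
COMBINING = {0xE1: "\u0300", 0xE2: "\u0301"}


def marc8_to_unicode(data: bytes) -> str:
    out = []
    pending = []                      # accents waiting for their base letter
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
        elif byte < 0x80:             # plain ASCII base character
            out.append(chr(byte))
            out.extend(pending)       # Unicode wants the accent after it
            pending.clear()
        else:
            raise ValueError(f"byte {byte:#04x} is not in the sample table")
    if pending:
        raise ValueError("accent with no base letter")
    # Combine letter + accent into a single character where one exists.
    return unicodedata.normalize("NFC", "".join(out))


print(marc8_to_unicode(b"caf\xe2e"))  # café
```

Raising an error on anything unrecognized, rather than guessing, is what lets the problem records be kicked back for manual attention.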

Just wanted to add a pointer to another great intro to Unicode, written by Ardie Bausenbach (LC), published in DigiNews at http://www.rlg.org/en/page.php?Page_ID=17068#article2. Ardie originally started writing about character encoding issues for the RLG Descriptive Metadata Guidelines (http://www.rlg.org/en/page.php?Page_ID=214), but what she produced turned out to be too good and in-depth to be buried as an appendix to the guidelines, so we placed it in DigiNews. Enjoy!

Thank you for this!
