Whiz Kid Technomagic i18n Tools



Copyright © 1999 G. Adam Stanislav.
All rights reserved.
  Can you imagine the silence if everyone said only what he knows!

    — Karel Čapek

What is i18n

I18n is short for internationalization (i, plus 18 other characters, plus n).

The creators of the original personal computer used a 7-bit character code, commonly known as ASCII, which is capable of encoding only the very basic characters of the Roman alphabet. That makes it useless for just about any language other than English and Latin.

To solve this problem, the International Organization for Standardization (ISO) developed a family of standards called ISO-8859. Each of these standards either combines the accented characters of several languages into an 8-bit code, or allows the use of a different alphabet (such as Cyrillic, Greek, and others).

Each of the standards starts with the same encoding as ASCII. It then defines additional characters.

For example, ISO-8859-2 encodes the character sets of various Central-European languages into an 8-bit set.
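
The shared ASCII half and the differing upper half can be checked with a short Python sketch. Python is not one of the tools described on this page; its built-in codecs merely stand in for the ISO-8859 tables, and the choice of byte 0xA1 is arbitrary.

```python
# Illustration only: Python's built-in codecs stand in for the ISO-8859 tables.
ascii_bytes = bytes(range(128))          # every 7-bit ASCII code
parts = ["iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7"]

for name in parts:
    # The lower half of each ISO-8859 part decodes exactly like ASCII.
    assert ascii_bytes.decode(name) == ascii_bytes.decode("ascii")

# The upper half is where the parts differ: same byte, different characters.
upper = bytes([0xA1])
for name in parts:
    print(name, repr(upper.decode(name)))
```

Each line of output shows what the single byte 0xA1 means under one part of the standard, which is why a reader has to know which part a file was written in.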

Alas, while this solves the problem of using a personal computer in any one language, it is not possible to encode all characters used by the various languages of the world into eight bits.

Is this a problem?

That, of course, depends on your needs.

If, for example, you want to design a web page in a language other than English, you can probably use one of the ISO-8859 encodings. In that case, it is not a problem.

But, suppose you were asked to help a Buddhist scholar design a web site to publish the results of a study of various Buddhist texts. You will need to display the text of the study itself in English (or whatever language it is written in). You will also need to display some of the original texts. Suppose they are in Sanskrit, Tibetan, and Chinese. Each of these languages uses a different character set. Add to it the fact that Chinese alone cannot fit into a pure 8-bit representation.

What options do you have?

  • Multiple pages

    You could design a separate page for the study itself, another one for the Sanskrit sources, yet another for the Tibetan sources. You would need a different solution for the Chinese sources.

    Even disregarding the problem with Chinese, this solution has the problem that you have to flip between different pages if you want to refer to the sources while reading the text of the study.

  • Frames

    You could design separate pages as above, then display each in a separate frame.

    That means making the display of each page much smaller, probably forcing the reader to scroll through the various sections using scroll bars.

    This option will not work at all if the user visits the site with a browser that does not support frames.

  • Images

    You could use plain text for the study, graphical images for the sources. This is the first option so far that would work with Chinese.

    The main problem with this solution is that images are considerably larger than plain text. The user of your site would have to wait for the images to download.

    Add to it that browsers like Lynx (very popular on Unix systems) work in text mode. They do not display images.

None of these options offers a truly satisfactory solution.

Multi-byte encoding

The obvious solution to i18n problems is the use of multi-byte encoding, i.e., using more than one 8-bit byte (or octet, in Internet parlance) per character.

Several multi-byte encodings have been developed over the years. Some of them are suitable only for Chinese or Japanese. But one system that has emerged as clearly superior is Unicode.

The Unicode standard uses 16-bit mapping. In other words, it assigns a 16-bit integer to the various characters of all alphabets currently in existence. Additionally, it maps some other pictographs, such as mathematical symbols, dingbats, and others.

It also reserves part of the map for private use. Anyone can use the private section to map any glyphs or pictographs they want.

ISO 10646, the international standard that shares its character map with Unicode, also defines a four-byte form that uses 31 bits (referred to as UCS-4). This leaves enough room for fictional alphabets (e.g., Klingon) and extinct scripts (e.g., Egyptian hieroglyphs), while remaining backwards compatible with the Unicode standard.

Note that the Unicode standard only maps glyphs to a 16-bit integer. It does not specify how the 16-bit value is to be represented on the computer. The main reason for this is that there are two principal ways of doing it in the world of computers. One places the least (or less) significant byte (LSB) before the most (or more) significant byte (MSB). The other does the exact opposite: It places the LSB after the MSB.

This is quite irrelevant as long as a computer is not connected to a network: It simply uses its own native representation of 16-bit integers. But the moment the computer needs to exchange Unicode data with another computer without knowing its native format (which happens all the time on the Internet), some kind of encoding protocol is required.
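
To make the two byte orders concrete, here is a minimal Python sketch (Python is used only for illustration and is not part of the tools on this page). It writes the 16-bit value 0x0160 both ways and shows what happens when a reader assumes the wrong order.

```python
import sys

code = 0x0160                             # a 16-bit Unicode value

little = code.to_bytes(2, "little")       # LSB first
big = code.to_bytes(2, "big")             # MSB first

print("LSB first:", little.hex())
print("MSB first:", big.hex())
print("native order of this machine:", sys.byteorder)

# A reader that guesses the wrong order recovers a different value.
wrong = int.from_bytes(little, "big")
print("misread as:", hex(wrong))
```

The two computers agree on the value but not on the bytes, which is exactly the gap an encoding protocol has to close.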

RFC 2279 defines UTF-8, an encoding of any 31-bit value to a unique combination of one to six octets (8-bit bytes). This is the preferred Unicode encoding protocol on the WWW. It is also fully compatible with UCS-4.
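
The octet layout of RFC 2279 can be sketched in a few lines of Python. This is an illustration written for this page, not the code of libutf-8; the function name utf8_encode_rfc2279 is made up for the example.

```python
def utf8_encode_rfc2279(value: int) -> bytes:
    """Encode a 31-bit value as one to six octets, following the RFC 2279 layout."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value must fit in 31 bits")
    if value < 0x80:
        return bytes([value])                  # one octet, identical to ASCII
    # (exclusive upper limit, number of octets, marker bits of the first octet)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if value < limit:
            break
    octets = []
    for _ in range(length - 1):
        octets.append(0x80 | (value & 0x3F))   # continuation octet: 10xxxxxx
        value >>= 6
    octets.append(lead | value)                # first octet holds the remaining high bits
    return bytes(reversed(octets))


print(utf8_encode_rfc2279(0x0160).hex())
print(utf8_encode_rfc2279(0x7FFFFFFF).hex())
```

Python's own "utf-8" codec follows the later, stricter definition of UTF-8, which stops at four octets, so the five- and six-octet forms can only be produced by a hand-written routine such as this one.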

RFC 2277 states: “Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.”

Because of that, recent versions of all popular web browsers support UTF-8 encoding. Depending on the underlying operating system and fonts used, they will either display the right glyph as mapped by the Unicode standard, or they will show the closest glyph they can find.

Take, for example, the letter S with a caron. If UTF-8 encoded, it will either show up as an S with a caron, or just a plain S. Without UTF-8 all bets are off (for example, if using ISO-8859-2 encoding, it will show up either as an S with a caron or as a copyright symbol, or as something else dependent on the font used).

To see how your browser handles this particular example, here it is:

Š
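
The same comparison can be made outside the browser. The Python sketch below (an illustration only; Python's codecs stand in for the browser's character tables) encodes the S with a caron both ways, then reads the ISO-8859-2 byte with the ISO-8859-1 table.

```python
s_caron = "\u0160"                        # S with a caron

utf8 = s_caron.encode("utf-8")            # two octets
latin2 = s_caron.encode("iso8859_2")      # one octet

print("UTF-8:     ", utf8.hex())
print("ISO-8859-2:", latin2.hex())

# Read the ISO-8859-2 byte with the wrong table and the letter changes.
misread = latin2.decode("iso8859_1")
print("read as ISO-8859-1:", misread)
```

The single ISO-8859-2 byte carries no hint of which table it belongs to, whereas the UTF-8 sequence means the same thing everywhere.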

If you want to test your browser thoroughly, try this link.

Tools

Tools to convert text files into the UTF-8 format are necessary. Some have been developed, others still need to be made.

Note for FreeBSD users only: As of 1999-04-20, all of these tools are available from a single port. It is in /usr/ports/converters/i18ntools/. Two of the tools are also available as separate ports (see below).

  • libutf-8

    To facilitate the development of such tools, I have created libutf-8, a library of C routines for the conversion of Unicode to UTF-8 and back. The library can also be used to convert the 31-bit UCS-4 mappings to UTF-8 and back.

    A Unix package is available by ftp.

    Or, you can download it by http by right-clicking on libutf-8-1.0.tar.gz and saving it on your system.

    If you use FreeBSD, you can install it painlessly from the ports collection. Just type:

    % cd /usr/ports/converters/libutf-8
    % make install
    

    This will only work, however, if you installed or upgraded your ports distribution on or after 1999-04-17.

    And, of course, if you write software that uses libutf-8 and release it as a FreeBSD port, all you need to do to make sure libutf-8 is installed is add the following to your port's makefile:

    LIB_DEPENDS=   utf-8:${PORTSDIR}/converters/libutf-8
    

    Please read the man pages to learn how to use it in your programs:

    For use under Windows, download utf-8.zip. It contains the same source code (but with carriage returns inserted), a different makefile, plus utf-8.dll and utf-8.lib. Since Windows does not understand man pages, they are not included, but you can just click on the links above to read them online.

  • utrans

    This program can convert plain text written with any character encoding into UTF-8. It comes with 159 binary charmaps, plus a utility called mbm to create more.

    But utrans can also process plain text charmaps of two kinds:

    • The text charmaps available from Roman Czyborra’s ISO 8859 Alphabet Soup;
    • The text charmaps that come with most Linux, and all FreeBSD, distributions (and available by ftp if you do not have them).

    That means that, if needed, you can easily create your own charmaps.

    Please note that utrans uses the utf-8 library, so you need to download and install libutf-8 first.

    utrans for Unix is available from my ftp site, or just right-click on utrans-1.0.tar.gz and save it right now.

    A Windows version is also available as utrans.zip, in source code only (that means you need a C compiler to create the executable).

  • uhtrans

    When you need to combine text written in different character encodings within the same file, uhtrans is your next step. It converts UTF-8 encoded text into 7-bit ASCII. Any UTF-8 sequence that represents a Unicode value greater than 127 decimal is turned into an HTML-style representation of the type &#1234; (decimal notation).

    Note that uhtrans expects its input in the UTF-8 encoding; otherwise it fails. Typically, you will call it like this:

    tuc input | utrans | uhtrans -o output
    

    Because the output of uhtrans is plain ASCII, you can now edit the file using some completely different encoding, then convert it again.

    For example, suppose you want to write some of the text in ISO-8859-1, and some in ISO-8859-2, and convert the result to UTF-8. You could edit all the ISO-8859-1 text, leaving space for the rest of the text as needed. Then you would call:

    utrans -p ISO-8859-1 -i input | uhtrans -o interim
    

    Now you can switch your console and keyboard to the ISO-8859-2 mode, edit the interim file, save it, and run:

    utrans -p ISO-8859-2 -i interim | uhtrans -o output
    

    You can download uhtrans from my ftp site, or from here by right-clicking on uhtrans-1.0.tar.gz, and saving the file.

    You also need libutf-8, and should have utrans.

  • hutrans

    Finally, you will probably want to convert your file back to pure UTF-8, especially if it is not an HTML file. You can do this with hutrans, which does the exact opposite of uhtrans.

    In other words, hutrans copies a text file while replacing all occurrences of &#1234; with the appropriate UTF-8 sequence. It also understands a hexadecimal notation in the form of &#xABCD; to allow you to copy encodings directly from various maps.

    You can download hutrans from my ftp site, or by right-clicking on hutrans-1.0.tar.gz. Like the rest of the programs so far, hutrans needs libutf-8 for its operation.

  • ptrans

    Sometimes you need to convert a file from the UTF-8 format back to plain text using a certain character mapping.

    In that case, you need ptrans. It uses the same character maps as utrans above. In fact, ptrans functions as the exact opposite of utrans.

    As usual, you can get it by ftp or by right-clicking on ptrans-1.0.tar.gz. And you need libutf-8.

  • tuc

    Strictly speaking, tuc has nothing to do with i18n. But it may be useful as a secondary aid when processing text. It is an older program of mine (i.e., written before I started this i18n project).

    All tuc does is convert text files created by other operating systems to Unix format, or optionally to DOS format. So, if you need to convert text files between DOS and Unix, get tuc-1.10.tar.gz from here.

    Or, if you use FreeBSD, just type:

    % cd /usr/ports/textproc/tuc
    % make install
    

These tools allow you to create and edit UTF-8 files containing the entire span of the Unicode standard. They can also be called transparently by other programs, such as various X11 editors, or CGI scripts.

I used all these tools to convert this web page into UTF-8.
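
For readers who want to see the idea behind uhtrans and hutrans in code, here is a rough Python sketch of the two conversions. It is not the source of either program and does not try to match their options or error handling; the function names to_ascii_refs and from_ascii_refs are made up for the example.

```python
import re


def to_ascii_refs(utf8_bytes: bytes) -> str:
    """Replace every character above 127 with a decimal reference such as &#352;."""
    text = utf8_bytes.decode("utf-8")      # raises if the input is not UTF-8
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# Matches decimal (&#352;) and hexadecimal (&#x160;) references.
_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def from_ascii_refs(text: str) -> bytes:
    """Turn decimal and hexadecimal references back into UTF-8 sequences."""
    def expand(match):
        hex_digits, dec_digits = match.groups()
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))
    return _REF.sub(expand, text).encode("utf-8")


ascii_form = to_ascii_refs("\u0160koda".encode("utf-8"))
print(ascii_form)
print(from_ascii_refs(ascii_form).hex())
```

Because the intermediate form is plain ASCII, it survives being edited under any of the 8-bit encodings, which is what makes the two-pass editing described above possible.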

Putting it all together

The tools included here were designed to work together. They allow you to convert files from any encoding to any encoding, and to and from UTF-8.

Suppose you have an HTML file created under Windows 95 using its non-standard encoding. Simply pipe everything in the proper order, like this (placing it all on one line, which I have split up here for legibility):

tuc -i input.html | utrans -p CP1252 |
	uhtrans | hutrans |
	ptrans -p ISO-8859-1 -o output.html
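
For comparison, the same conversion can be approximated in Python. This is only a sketch of what the pipeline achieves, not a description of how the tools work internally. In particular, leaving characters that ISO-8859-1 cannot hold as numeric references is an assumption of this example, not documented behaviour of ptrans, and the function name windows_to_latin1 is made up.

```python
def windows_to_latin1(data: bytes) -> bytes:
    """Convert Windows-1252 text with DOS line endings to ISO-8859-1 with Unix ones."""
    # Rough counterpart of tuc: turn DOS line endings into Unix ones.
    data = data.replace(b"\r\n", b"\n")
    # Rough counterpart of utrans -p CP1252: read the bytes as code page 1252.
    # Python raises here on the few byte values that code page leaves undefined.
    text = data.decode("cp1252")
    # Rough counterpart of ptrans -p ISO-8859-1: write the text as ISO-8859-1.
    # Assumption: characters with no ISO-8859-1 code become &#N; references.
    return text.encode("iso8859_1", errors="xmlcharrefreplace")


sample = b"caf\xe9 \x93quoted\x94\r\n"
print(windows_to_latin1(sample))
```

The accented letter keeps its single byte, while the Windows-only quotation marks, which ISO-8859-1 does not contain, come out as references.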




Just how international is your browser?



Παντά ρει

affinity

Война и Мир