Web Localization in Perl

Autrijus Tang
OurInternet, Inc.
July 2002

Abstract

The practice of internationalization (i18n) enables applications to support multiple languages, date/currency formats and local customs (collectively known as locales); localization (L10n) then deals with the actual work of fitting the software to the needs of users in a particular locale. Today, Web applications are among the most heavily localized kinds of software, because their interfaces are represented in text-based formats.

In the Free Software world, many of the most flexible and widely-used technologies are built upon the Perl language, which has long been the language of choice for web application developers. This article presents the author's hands-on experience in localizing several Perl-based applications into Chinese, the detailed usage and pitfalls of common frameworks, as well as best-practice tips for managing a localization project.

Introduction

``There are a number of languages spoken by human beings in this world.''
-- Harald Tveit Alvestrand, in RFC 1766, ``Tags for the Identification of Languages''

Why should someone localize their websites or web applications?

Let us imagine this very question being debated on the Web, with convincing arguments and explanations raised by various parties, in different languages. As a participant in this discussion, you may hear the following points being made:

Figure 1: Reasons for Localization (before localization)
  • Иностранная валюта, формат даты, язык и обычаи могут казаться нам пугающими
  • Menschen sind produktiver wenn sie in ihrer gewohnten Umgebung arbeiten
  • Tas veicina daudz labāku saprašanu un mijiedarbību starp dažādām kultūrām
  • Un progetto con molti collaboratori internazionali si evolverà più in fretta e meglio
  • 地区化的过程, 有助於软件的模块化与可移植性

But, alas, it is not likely that all parties could understand all these points. This naturally leads to a language barrier: our field of meaningful discussion is often restricted to a few locale groups, that is, people who speak the same language and share the same culture with us.

However, that is truly sad since the arguments we missed are often valid ones, and usually offer new insights into our present condition. Therefore, it will be truly beneficial if the arguments, interfaces and other texts are translated for us:

Figure 2: Reasons for Localization (localized to English)
  • It is a distraction to have to deal with interfaces that use foreign languages, date formats, currencies and customs
  • People are more productive when they operate in their native environments
  • It fosters understanding and communication between cultures
  • Projects with more international contributors will evolve faster and better
  • Localization tends to improve the software's modularity and portability

As these arguments have pointed out, it is often neither possible nor desirable to just speak X, be it Latin, Interlingua, Esperanto, Lojban or, well, English. At such times, localization (L10n) is needed.

For proprietary applications, L10n was typically done as a prerequisite of competing in a foreign market. That implies that if the localization cost exceeds the estimated profit in a given locale, the company would not localize its application at all, and it would be difficult (and maybe illegal) for users to do it themselves without the source code. If the vendor did not design its software with a good i18n framework in mind -- well, then we're just out of luck.

Fortunately, the case is much simpler and more rewarding with open-source applications. As with proprietary ones, the first few versions are often designed with only one locale in mind; the difference is that anybody is allowed to internationalize them at any time. As Sean M. Burke put it:

The way that we can do better in open source is by writing our software with the goal that localization should be easy both for the programmers and maybe even for eager users. (After all, practically the definition of "open source" is that it lets anyone be a programmer, if they are interested enough and skilled enough.)

This article describes detailed techniques to make L10n easy for all parties involved. I will focus on web-based applications written in the Perl language, but the principle should also apply elsewhere.

Localizing Static Websites

``It Has To Work.''
-- First Networking Truth, RFC 1925

Web pages come in two different flavors: static pages provide the same content across many visits, until they are updated; dynamic pages may offer different information depending on various factors. They are commonly referred to as web documents and web applications, respectively.

However, being static does not mean that all visitors must see the same representation -- different people may prefer different languages, styles or media (e.g. auditory output instead of visual). Part of the Web's strength is its ability to let the client negotiate with the server, and determine the most preferred representation.

For a concrete example, let us consider the author's hypothetical homepage http://www.autrijus.org/index.html, written in Chinese:

Listing 1. A simple Chinese page
<html><head><title>唐宗漢 - 家</title></head>
<body>施工中, 請見諒</body></html>

One day, I decided to translate it for my English-speaking friends:

Listing 2. Translated page in English
<html><head><title>Autrijus.Home</title></head>
<body>Sorry, this page is under construction.</body></html>

At this point, many websites would decide to offer a language selection page to let visitors pick their favorite language. An example is shown in Figure 3:

Figure 3: A typical language selection page
Please choose your language:
Čeština Deutsch English Español
Français Hrvatski Italiano 日本語
한국어 Nederlands Polski Русский язык
Slovensky Slovensci Svenska 中文 (GB)
中文 (Big5)

For both non-technical users and automated programs, that page is confusing, redundant, and highly irritating. Besides demanding an extra search-and-click on each visit, it poses considerable difficulty for web agent programmers, as they now have to parse the page and follow the correct link, which is a highly error-prone thing to do.

MultiViews: The Easiest L10n Framework

Of course, it is much better if everybody could see their preferred language automatically. Thankfully, the Content Negotiation feature in HTTP 1.1 addressed this problem quite neatly.

Under this scheme, browsers will always send an Accept-Language header, which specifies one or more preferred language codes; for example, zh-tw, en-us, en would mean "Traditional Chinese, American English or English, in this order".
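To make the header format concrete, here is a minimal sketch of how a program might read an Accept-Language value and pick a page language from it. The function names preferred_languages and choose_language are made up for this illustration, and the matching rule (exact tag first, then the primary language) is a simplifying assumption; real servers such as Apache implement a more elaborate negotiation algorithm.

```perl
use strict;
use warnings;

# Split an Accept-Language value into language tags, most preferred first.
# Tags may carry an optional quality value, e.g. "de;q=0.8"; the default is 1.
sub preferred_languages {
    my ($header) = @_;
    my (@ranked, $position);
    $position = 0;
    for my $item (split /\s*,\s*/, $header) {
        next unless length $item;
        my ($tag, @params) = split /\s*;\s*/, $item;
        my $quality = 1;
        for my $param (@params) {
            $quality = $1 if $param =~ /^q=([0-9.]+)$/i;
        }
        push @ranked, [ lc $tag, $quality, $position++ ];
    }
    # Higher quality wins; ties keep the order given by the client.
    return map  { $_->[0] }
           sort { $b->[1] <=> $a->[1] or $a->[2] <=> $b->[2] } @ranked;
}

# Return the first preferred language we have a page for, trying the
# full tag (en-us) before its primary language (en).
sub choose_language {
    my ($header, @available) = @_;
    my %have = map { lc($_) => 1 } @available;
    for my $tag (preferred_languages($header)) {
        return $tag if $have{$tag};
        (my $primary = $tag) =~ s/-.*//;
        return $primary if $have{$primary};
    }
    return;    # no match: the caller falls back to its default page
}

print choose_language('zh-tw, en-us, en', qw(en fr he)), "\n";
```

With only English, French and Hebrew pages available, the client above ends up with the English page, because neither zh-tw nor en-us exists but en does.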

The web server, upon receiving this information, is responsible for presenting the requested content in the most preferred language. Different web servers may implement this process differently; under Apache (the most popular web server), a technique called MultiViews is widely used.

Using MultiViews, I will save the English version as index.html.en (note the extra file extension), then put this line into Apache's configuration file httpd.conf or .htaccess:

	Options +MultiViews

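After that, Apache will examine all requests to http://www.autrijus.org/index.html, and see if the client prefers 'en' in its Accept-Language header. Hence, people who prefer English would see the English page; everyone else gets the Chinese original. (One caveat: Apache negotiates only when no file matches the requested name exactly, so in practice the original is renamed as well, e.g. to index.html.zh-tw, with that extension mapped to its language by an AddLanguage directive.)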

This technique allows gradual introduction of new localized versions of the same documents, so my international friends can contribute more languages over time -- index.html.fr for French, index.html.he for Hebrew, and so on.

Since a large share of the online populace speaks only their native language and English, most of the contributed versions would be translated from English, not Chinese. But because both versions represent the same contents, that is not a problem.

... or is it? What if I go back to update the original, Chinese page?

The Difficulty of Keeping Up Translations

As I modify the original page, the first thing I'd notice is that it's impossible to get my French and Hebrew friends to translate from Chinese -- clearly, using English as the base version would be necessary. The same reasoning also applies to most Free Software projects, even if the principal developers do not speak English natively.

Moreover, even if it is merely a change to the background color (e.g. <body bgcolor=gold>), I still need to modify all translated pages, in order to keep the layout consistent.

Now, if both the layout and contents are changed, things quickly become very complicated. Since the old HTML tags are gone, my translator friends must work from scratch every time! Unless all of them are HTML wizards, errors and conflicts will surely arise. If there are 20 regularly updated pages in my personal site, then pretty soon I will run out of translators -- or even out of friends.

As you can see, we need a way to separate data and code (i.e. text and tags), and automate the process of generating localized pages.

Separate Data and Code with CGI.pm

Actually, the previous sentence pretty much summarizes the modern internationalization (i18n) process: to prepare a web application for localization, one must find a way to separate as much data from code as possible.

As the long-established Web development language of choice, Perl offers a bewildering array of modules and toolkits for website construction. The most popular one is probably CGI.pm, which has shipped with the core Perl distribution since 1997. Let us see a code snippet that uses it to automatically generate translated pages:

Listing 3. Localization with MultiViews and CGI.pm
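use strict;
use warnings;
use CGI ':standard';    # our templating system

my $language;           # the language currently being generated
sub _ { some_function($language, @_) }    # XXX: put L10n framework here

foreach my $lang (qw(zh_tw en de fr)) {
    $language = $lang;  # a named sub cannot safely close over the loop variable
    open my $out, '>', "index.html.$lang" or die $!;
    print $out start_html(-title => _("Autrijus.Home")),
               _("Sorry, this page is under construction."),
               end_html;
    close $out or die $!;
}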

Unlike the HTML pages, this program enforces data/code separation via CGI.pm's HTML-related routines. Tags (e.g. <html>) now become function calls (start_html()), and texts are turned into Perl strings. Therefore, when the localized version is written out to the corresponding static page (index.html.zh_tw, index.html.en, etc.), the HTML layout will always be identical for each of the four languages listed.

The sub _ function is responsible for localizing any text into the current $language, by passing the language and text strings to a hypothetical some_function(); the latter is known as our localization framework, and we will see three such frameworks in the following section.

After writing the snippet, it is a simple matter to grep for all strings inside _(...), extract them into a lexicon, and ask translators to fill it out. Note that here lexicon means a set of things that we know how to say in another language -- sometimes single words ("Cancel"), but usually whole phrases ("Do you want to overwrite?" or "5 files found."). Strings in a lexicon are like entries in travelers' pocket phrasebooks, sometimes with blanks to fill in, as demonstrated in Figure 4:

Figure 4: An English => Haitian lexicon
  English                    Haitian
  This costs ___ dollars.    Bagay la kute ___ dola yo.
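The "grep for all strings inside _(...)" step mentioned above can be sketched in a few lines of Perl. This is only an illustration under simplifying assumptions: extract_strings is a made-up helper name, it recognizes only double-quoted literals that directly follow _(, and it prints one entry per line rather than any particular framework's lexicon format.

```perl
use strict;
use warnings;

# Collect the double-quoted strings passed to _( ... ) in a chunk of
# source text, in order of first appearance and without duplicates.
sub extract_strings {
    my ($source) = @_;
    my (%seen, @found);
    while ($source =~ /\b_\(\s*"((?:[^"\\]|\\.)*)"/g) {
        push @found, $1 unless $seen{$1}++;
    }
    return @found;
}

my $source = <<'END';
print OUT start_html({ title => _("Autrijus.Home") }),
      _("Sorry, this page is under construction."),
      end_html;
END

# One line per entry; a translator (or a later script) fills in the rest.
print "$_\n" for extract_strings($source);
```

A production extractor would also have to cope with single-quoted strings, concatenation and comments, which is why the frameworks discussed below ship their own extraction tools.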

Ideally, the translator should focus solely on this lexicon, instead of peeking at HTML files or the source code. But there is the rub: different localization frameworks use different lexicon formats, so one has to choose the framework that suits the project best.

Localization Frameworks

``It is more complicated than you think.''
-- Eighth Networking Truth, RFC 1925

To implement the some_function() in Listing 3, one needs a library to manipulate lexicon files, look up the corresponding strings in them, and maybe incrementally extract new strings into the lexicon. Together, these make up a localization framework.

From my observation, frameworks mostly differ in their idea about how lexicons should be structured. Here, I will discuss the Perl interface for three such frameworks, starting with the venerable Msgcat.

Msgcat -- Lexicons are Arrays

As one of the earliest L10n frameworks and part of the XPG3/XPG4 standards, Msgcat enjoys ubiquity on all Un*x platforms. It represents the first-generation paradigm of lexicons: treat entries as numbered strings in an array (a.k.a. a message catalog). This approach is straightforward to implement, needs little memory, and is very fast to look up. The resource files used in Windows programming and on other platforms use basically the same idea.

For each page or source file, Msgcat requires us to make a lexicon file for each language, as shown below:

Listing 4. A Msgcat lexicon
$set 7 # $Id: nls/de/index.pl.m
1 Autrijus'.Haus
2 Wir bitten um Entschuldigung. Diese Seite ist im Aufbau.

The above file contains the German translation of each text string within index.html, which is represented by a unique set number, 7. Once we have finished building the lexicons for all pages, the gencat utility is used to generate the binary lexicon:

	% gencat nls/de.cat nls/de/*.m 

It is best to imagine the internals of the binary lexicon as a two-dimensional array, as shown in figure 5:

Figure 5: The content of nls/de.cat
              set_id
  msg_id      1 ... 6    7                                  8    9
  1           ...        Autrijus'.Haus                     ...  ...
  2           ...        Wir bitten um Entschuldigung...    ...  ...
  3           ...        ...                                ...  ...

To read from the lexicon file, we use the Perl module Locale::Msgcat, available from CPAN (the Comprehensive Perl Archive Network), and implement the earlier sub _() function like this:

Listing 5. Sample usage of Locale::Msgcat
use Locale::Msgcat;
my $cat = Locale::Msgcat->new;
$cat->catopen("nls/$language.cat", 1); # it's like a 2D array
sub _ { $cat->catgets(7, @_) } # 7 is the set_id for index.html
print _(1, "Autrijus.House");  # 1 is the msg_id for this text

Note that only the msg_id matters here; the string "Autrijus.House" is only used as an optional fall-back when the lookup fails, as well as to improve the program's readability.
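The lookup semantics described above (a two-dimensional table indexed by set_id and msg_id, with the default string as fall-back) can be mimicked in plain Perl. The sketch below does not use Locale::Msgcat or a real binary catalog; %catalog and catgets_sketch are stand-ins invented for this illustration.

```perl
use strict;
use warnings;

# A stand-in for the binary catalog: $catalog{set_id}{msg_id}.
my %catalog = (
    7 => {
        1 => "Autrijus'.Haus",
        2 => "Wir bitten um Entschuldigung. Diese Seite ist im Aufbau.",
    },
);

# Mimics the catgets() contract: return the stored string if the
# (set_id, msg_id) pair exists, otherwise the caller's default.
sub catgets_sketch {
    my ($set_id, $msg_id, $default) = @_;
    my $set = $catalog{$set_id} or return $default;
    return exists $set->{$msg_id} ? $set->{$msg_id} : $default;
}

print catgets_sketch(7, 1, "Autrijus.House"), "\n";       # translated
print catgets_sketch(7, 3, "Not translated yet."), "\n";  # falls back
```

Because the text itself never takes part in the lookup, renumbering an entry silently changes what every caller of that number displays.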

Because set_id and msg_id must both be unique and immutable, future revisions may only delete entries, and never reassign a number to represent another string. This characteristic makes revisions very costly, as observed by Drepper et al. in the GNU gettext manual:

Every time he comes to a translatable string he has to define a number (or a symbolic constant) which has also be defined in the message catalog file. He also has to take care for duplicate entries, duplicate message IDs etc. If he wants to have the same quality in the message catalog as the GNU gettext program provides he also has to put the descriptive comments for the strings and the location in all source code files in the message catalog. This is nearly a Mission: Impossible.

Therefore, one should consider using Msgcat only if the lexicon is very stable.

Another shortcoming that has plagued Msgcat-using programs is the plurality problem. Consider this code snippet:

Listing 6. Incorrect plural form handling
printf(_(8, "%d files were deleted."), $files);

This is obviously incorrect when $files == 1, and "%d file(s) were deleted" is grammatically invalid as well. Hence, programmers are often forced to use two entries:

Listing 7. English-specific plural form handling
printf(($files == 1) ? _(8, "%d file was deleted.")
		     : _(9, "%d files were deleted."), $files);

But even that is still bogus, because it is English-specific -- French uses the singular when ($files == 0), and Slavic languages have three or four plural forms! Trying to retrofit those languages to the Msgcat infrastructure is often a futile exercise.
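To see why a single ($files == 1) test cannot work everywhere, it helps to write the plural rules down per language. The sketch below uses the rules commonly documented for English, French and Russian; the %plural_index table and the plural_form helper are names invented for this example, not part of any framework discussed here.

```perl
use strict;
use warnings;

# For each language, a rule mapping a number to the index of the
# word form to use (0 = first form, 1 = second form, ...).
my %plural_index = (
    en => sub { $_[0] == 1 ? 0 : 1 },    # 1 file; 0, 2, 3... files
    fr => sub { $_[0] > 1  ? 1 : 0 },    # 0 and 1 are both singular
    ru => sub {                          # three forms
        my $n = shift;
        return 0 if $n % 10 == 1 && $n % 100 != 11;
        return 1 if $n % 10 >= 2 && $n % 10 <= 4
                 && ($n % 100 < 10 || $n % 100 >= 20);
        return 2;
    },
);

# Pick the right word form for $n in the given language.
sub plural_form {
    my ($lang, $n, @forms) = @_;
    return $forms[ $plural_index{$lang}->($n) ];
}

printf "%d %s\n", $_, plural_form('fr', $_, 'fichier', 'fichiers') for 0, 1, 2;
```

The point is that the rule belongs to the language, not to the program; a framework needs some place to keep such per-language logic, which numbered message catalogs do not offer.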

Gettext -- Lexicons are Hashes

Due to the various problems of Msgcat, the GNU Project developed its own implementation of the Uniforum Gettext interface in 1995, written by Ulrich Drepper. It has since become the de facto L10n framework for C-based free software projects, and has been widely adopted by C++, Tcl and Python programmers.

Instead of requiring one lexicon for each source file, Gettext maintains a single lexicon (called a PO file) for each language of the entire project. For example, the German lexicon de.po for the homepage above would look like this:

Listing 8. A Gettext lexicon
#: index.pl:4
msgid "Autrijus.Home"
msgstr "Autrijus'.Haus"

#: index.pl:5
msgid "Sorry, this site is under construction."
msgstr "Wir bitten um Entschuldigung. Diese Seite ist im Aufbau."

The #: lines are automatically generated from the source file by the program xgettext, which can extract strings inside invocations to gettext(), and sort them out into a lexicon.

Now, we may run msgfmt to compile the binary lexicon locale/de/LC_MESSAGES/web.mo from po/de.po:

	% msgfmt -o locale/de/LC_MESSAGES/web.mo po/de.po

We can then access the binary lexicon using Locale::gettext from CPAN, as shown below:

Listing 9. Sample usage of Locale::gettext
use POSIX;
use Locale::gettext;
POSIX::setlocale(LC_MESSAGES, $language); # Set target language
bindtextdomain("web", "locale"); # Look for web.mo under ./locale
textdomain("web"); # Usually the same as the application's name
sub _ { gettext(@_) } # it's just a shorthand for gettext()
print _("Sorry, this site is under construction.");

Recent versions of gettext (glibc 2.2+) have also introduced the ngettext("%d file", "%d files", $files) syntax. Unfortunately, Locale::gettext does not support that interface yet.

Also, gettext lexicons support multi-line strings, as well as reordering via printf and sprintf:

Listing 10. A multi-line entry with numbered arguments
msgid ""
"This is a multiline string"
"with %1$s and %2$s as arguments"
msgstr ""
"これは多線ひも変数として"
"%2$s%1$s のである"

Finally, GNU gettext comes with a very complete tool chain (msgattrib, msgcmp, msgconv, msgexec, msgfmt, msgcat, msgcomm...), which greatly simplifies the process of merging, updating and managing lexicon files.
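The numbered-argument reordering used in Listing 10 can be tried directly with Perl's own sprintf, which accepts explicit argument indexes such as %1$s (perl 5.8 and later). The two format strings below are invented for this illustration; note that they must be single-quoted, or Perl would try to interpolate $s.

```perl
use strict;
use warnings;

# The same two arguments, placed in a different order by each "translation".
my %format = (
    en => 'Copied %1$s to %2$s.',
    de => 'Nach %2$s wurde %1$s kopiert.',
);

for my $lang (sort keys %format) {
    print sprintf($format{$lang}, 'index.html', '/tmp'), "\n";
}
```

Since the order is decided inside the lexicon entry, the calling code passes its arguments the same way for every language.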

Locale::Maketext -- Lexicons are Dispatch Tables!

First written in 1998 by Sean M. Burke, the Locale::Maketext module was revamped in May 2001 and included in Perl 5.8 core.

Unlike the function-based interface of Msgcat and Gettext, its basic design is object-oriented, with Locale::Maketext as an abstract base class, from which a project class is derived. The project class (with a name like MyApp::L10N) is in turn the base class for all the language classes in the project (with names like MyApp::L10N::it, MyApp::L10N::fr, etc.).

A language class is really a perl module containing a %Lexicon hash as class data, which contains strings in the native language (usually English) as keys, and localized strings as values. The language class may also contain some methods that are of use in interpreting phrases in the lexicon, or otherwise dealing with text in that language.

Here is an example:

Listing 11. A Locale::Maketext lexicon and its usage
package MyApp::L10N;
use base 'Locale::Maketext';

package MyApp::L10N::de;
use base 'MyApp::L10N';
our %Lexicon = (
    "[quant,_1,camel was,camels were] released." =>
    "[quant,_1,Kamel wurde,Kamele wurden] freigegeben.",
);

package main;
my $lh = MyApp::L10N->get_handle('de');
print $lh->maketext("[quant,_1,camel was,camels were] released.", 5);

Under its square bracket notation, translators can make use of various linguistic functions inside their translated strings. The example above highlights its built-in support for plurals and quantifiers; for languages with other kinds of plural-form characteristics, it is a simple matter of implementing a corresponding quant() function. Ordinal numbers and time formats are easy to add, too.
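As a sketch of what "implementing a corresponding quant() function" might look like, the French language class below overrides quant so that 0 counts as singular. The MyApp::L10N class names follow Listing 11; the lexicon entry is invented for this example, and the override is deliberately minimal (it ignores the optional third "zero" form and number formatting that the built-in quant handles).

```perl
use strict;
use warnings;

package MyApp::L10N;
use base 'Locale::Maketext';

package MyApp::L10N::fr;
use base 'MyApp::L10N';
our %Lexicon = (
    "[quant,_1,file,files] pending." =>
    "[quant,_1,fichier,fichiers] en attente.",
);

# French treats both 0 and 1 as singular, unlike the English default.
sub quant {
    my ($handle, $number, $singular, $plural) = @_;
    return "$number " . ($number < 2 ? $singular : $plural);
}

package main;
my $lh = MyApp::L10N->get_handle('fr') or die "No French handle";
print $lh->maketext("[quant,_1,file,files] pending.", $_), "\n" for 0, 1, 2;
```

The calling code stays the same for every language; only the language class knows that "0 fichier" takes the singular.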

Each language class may also implement an ->encoding method to describe the encoding of its lexicons, which may be linked with Encode for transcoding purposes. Language families are also inheritable and subclassable: missing entries in fr_ca.pm (Canadian French) would fall back to fr.pm (Generic French).

The handy built-in method ->get_handle() with no arguments magically detects HTTP, POSIX and Win32 locale settings in CGI, mod_perl or command line; it spares the programmer from parsing those settings manually.

However, Locale::Maketext is not without problems. The most serious issue is its lack of a toolchain like GNU Gettext's, due to the extreme flexibility of lexicon classes. For the same reason, there is also less support in text editors (e.g. the PO Mode in Emacs).

Finally, since different projects may use different styles to write the language class, the translator must know some basic Perl syntax -- or somebody has to type the entries in for them.

Locale::Maketext::Lexicon -- The Best of Both Worlds

Irritated by the irregularity of Locale::Maketext lexicons, I implemented a home-brew lexicon format for my company's internal use in May 2002, and asked the perl-i18n mailing list for ideas and feedback. Jesse Vincent suggested: "Why not simply standardize on Gettext's PO File format?", so I implemented it to accept lexicons in various formats, handled by different lexicon backend modules. Thus, Locale::Maketext::Lexicon was born.

The design goal was to combine the flexibility of Locale::Maketext lexicon's expression with standard formats supported by utilities designed for Gettext and Msgcat. It also supports the Tie interface, which comes in handy for accessing lexicons stored in relational databases or DBM files.

The following program demonstrates a typical application using Locale::Maketext::Lexicon and the extended PO File syntax supported by the Gettext backend:

Listing 12. A sample application using Locale::Maketext::Lexicon
 1  use CGI ':standard';
 2  use base 'Locale::Maketext';      # inherits get_handle()
 3
 4  # Various lexicon formats and sources
 5  use Locale::Maketext::Lexicon {
 6      en => ['Auto'],              fr    => ['Tie' => 'DB_File', 'fr.db'],
 7      de => ['Gettext' => \*DATA], zh_tw => ['Gettext' => 'zh_tw.mo'],
 8  };
 9
10  # Ordinate functions for each subclasses of 'main'
11  use Lingua::EN::Numbers::Ordinate; use Lingua::FR::Numbers::Ordinate;
12  sub en::ord { ordinate($_[1]) } sub fr::ord { ordinate_fr($_[1]) }
13  sub de::ord { "$_[1]." }        sub zh_tw::ord { "第 $_[1] 個" }
14
15  my $lh = __PACKAGE__->get_handle; # magically gets the current locale
16  sub _ { $lh->maketext(@_) }       # may also convert encodings if needed
17
18  print header, start_html,         # [*,...] is a shorthand for [quant,...]
19  	_("You are my [ord,_1] guest in [*,_2,day].", $hits, $days), end_html;
20
21  __DATA__
22  # The German lexicon, in extended PO File format
23  msgid "You are my %ord(%1) guest in %*(%2,day)."
24  msgstr "Innerhalb %*(%2,Tages,Tagen), sie sind mein %ord(%1) Gast."

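Line 2 tells the current package main to inherit from Locale::Maketext, so it can acquire the get_handle method. Lines 5-8 build four language classes using a variety of lexicon formats and sources:
  • en uses the Auto backend, which needs no lexicon file: every string passed to maketext serves as its own entry
  • fr reads its entries through the Tie backend, from the DB_File database fr.db
  • de uses the Gettext backend to read a PO file from the DATA filehandle (lines 22-24)
  • zh_tw uses the Gettext backend to read the compiled MO file zh_tw.mo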

Lines 11-13 implement the ord method for each language subclass of the package main, which converts its argument to an ordinal number (1st, 2nd, 3rd...) in that language. Two CPAN modules are used to handle English and French, while German and Chinese only need straightforward string interpolation.

Line 15 gets a language handle object for the current package. Because we did not specify the language argument, it automatically guesses the current locale by probing the HTTP_ACCEPT_LANGUAGE environment variable, POSIX setlocale() settings, or via Win32::Locale on Windows. Line 16 sets up a simple wrapper function that passes all arguments to the handle's maketext method.

Finally, lines 18-19 print a message containing one string to be localized. The first argument $hits will be passed to the ord method, and the second argument $days will be passed to the built-in quant method -- the [*,...] notation is a shorthand for the previously discussed [quant,...].

Lines 22-24 are a sample lexicon, in extended PO file format. In addition to ordered arguments via %1 and %2, it also supports %function(args...) in entries, which will be transformed to [function,args...]. Any %1, %2... sequences inside the args will have their percent signs (%) replaced by underscores (_).
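The transformation just described can be approximated with two substitutions. The sketch below is not the module's actual implementation; po_to_maketext is a name invented here, and it assumes no nested parentheses and no literal brackets or tildes that would need escaping in Locale::Maketext notation.

```perl
use strict;
use warnings;

# Convert the extended PO syntax into Locale::Maketext bracket notation.
sub po_to_maketext {
    my ($string) = @_;

    # %function(args) becomes [function,args]; inside args, %N becomes _N.
    $string =~ s{%([A-Za-z_]\w*|\*)\(([^()]*)\)}{
        my ($function, $args) = ($1, $2);
        $args =~ s/%(\d+)/_$1/g;
        "[$function,$args]";
    }ge;

    # Any remaining bare %N becomes [_N].
    $string =~ s/%(\d+)/[_$1]/g;

    return $string;
}

print po_to_maketext('You are my %ord(%1) guest in %*(%2,day).'), "\n";
print po_to_maketext('%1 %2 changed to %3'), "\n";
```

Applied to the msgid on line 23, this yields exactly the string passed to _() on line 19, which is how the PO entry and the call site are matched up.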

Case Studies

``One size never fits all.''
-- Tenth Networking Truth, RFC 1925

Armed with an understanding of localization frameworks, let us see how they fit into real-world applications and technologies.

For web applications, the place to implement an L10n framework is almost inevitably the representation system, also known as the templating system, because that layer determines the extent of an application's data/code separation. For example, the Template Toolkit encourages a clean 3-tier data/code/template model, while the equally popular Mason framework lets you easily mix Perl code into a template. In this section, we will survey L10n strategies for those two different frameworks; the general principle should also apply to AxKit, HTML::Embperl, and other templating systems.

Request Tracker (Mason)

The Request Tracker is the first application that uses Locale::Maketext::Lexicon as its L10n framework. The base language class is RT::I18N, with subclasses reading *.po files stored in the same directory.

Additionally, its ->maketext method was overridden to use Encode (or, in pre-5.8 versions of perl, my Encode::compat) to return UTF-8 data on the fly. For example, a Chinese translator may submit lexicons encoded in Big5, but the system will always handle them natively as Unicode strings.

In the application's Perl code, all objects use the $self->loc method, inherited from RT::Base:

Listing 13. RT's L10n implementation
sub RT::Base::loc
    { my $self = shift; $self->CurrentUser->loc(@_) }
sub RT::CurrentUser::loc
    { my $self = shift; $self->LanguageHandle->maketext(@_) }
sub RT::CurrentUser::LanguageHandle
    { my $self = shift; $self->{'LangHandle'} ||= RT::I18N->get_handle(@_) }

As you can see, the current user's language setting is used, so different users can use the application in different languages simultaneously. For Mason templates, two styles were devised:

Listing 14. Two ways to mark strings in Mason templates
% $m->print(loc("Another line of text", $args...));
<&|/l, $args...&>Single line of text</&>

The first style, used in embedded perl chunks and <%PERL> sections, is made possible by exporting a global loc() function to the Mason interpreter; it automatically calls the current user's ->loc method described above.

The second style uses the filter component feature in HTML::Mason, which takes the enclosed Single line of text, passes it to the /l component (possibly with arguments), and displays the returned string. Here is the implementation of that component:

Listing 15. Implementation of the html/l filter component
% my $hand = $session{'CurrentUser'}->LanguageHandle;
% $m->print($hand->maketext($m->content, @_));

With these constructs, it is a simple matter to extract messages out of existing templates, comment them, and send them to the translators. The initial extraction for 700+ entries took one week; the whole i18n/L10n process took less than two months.

Slash (Template Toolkit)

Slash -- Slashdot Like Automated Storytelling Homepage -- is the code that runs Slashdot. More than that, however, Slash is an architecture for putting together web sites, built upon Andy Wardley's Template Toolkit module.

Due to the clean design of TT2, Slash features a careful separation of code and text, unlike RT/Mason. This largely eliminated the need to localize inside Perl source code.

Prior to this article's writing, various whole-template localizations based on the theme system had been attempted, including Chinese, Japanese, and Hebrew versions. However, merging with a new version was very difficult (not to mention plugins), and translations tended to lag far behind.

Now, let us consider a better approach: An auto-extraction layer above the template provider, based on HTML::Parser and Template::Parser. Its function would be like this:

Listing 16. Input and output of the TT2 extraction layer
Input
<B>from the [% story.dept %] dept.</B>
Output
<B>[%|loc( story.dept )%]from the [_1] dept.[%END%]</B>
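The rewriting shown in Listing 16 can be approximated for a single run of text as follows. This is a rough sketch only: wrap_for_loc is a made-up name, the HTML parsing that finds the text between tags (the job of HTML::Parser in the proposal above) is left out, and only simple [% variable %] directives are handled.

```perl
use strict;
use warnings;

# Replace each [% expr %] directive with a numbered placeholder and
# wrap the whole text in a loc() filter that receives those expressions.
sub wrap_for_loc {
    my ($text) = @_;
    my @args;
    (my $body = $text) =~ s{\[%\s*(.+?)\s*%\]}{
        push @args, $1;
        '[_' . scalar(@args) . ']';
    }ge;
    my $arglist = @args ? ' ' . join(', ', @args) . ' ' : '';
    return "[%|loc($arglist)%]" . $body . '[%END%]';
}

print '<B>', wrap_for_loc('from the [% story.dept %] dept.'), '</B>', "\n";
```

The placeholder text "from the [_1] dept." is what ends up in the lexicon, so translators never see Template Toolkit syntax.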

The astute reader will point out that this layer suffers from the same linguistic problems as Msgcat does -- what if we want to make ordinals from [% story.dept %], or expand the dept. to department / departments? The same problem has occurred in RT's web interface, where it had to localize messages returned by external modules, which may already contain interpolated variables, e.g. "Successfully deleted 7 ticket(s) in 'c:\temp'.".

My solution to this problem is to introduce a fuzzy match layer with the module Locale::Maketext::Fuzzy, which matches the interpolated string against the list of candidate entries in the current lexicon, to find one that can possibly yield the string (e.g. "Successfully deleted [*,_1,ticket] in '[_2]'."). If two or more candidates are found -- after all, "Successfully [_1]." also matches the same string -- tie-breaking heuristics are used to determine the most likely candidate.
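The idea can be illustrated without the module itself. In the sketch below, every bracketed placeholder in a lexicon entry is treated as a wildcard, and ties are broken in favour of the entry with the most literal text. Both fuzzy_candidates and literal_length are names invented here, and the tie-breaking rule is an assumption made for illustration, not necessarily the heuristic Locale::Maketext::Fuzzy uses.

```perl
use strict;
use warnings;

# Length of an entry's literal text, i.e. without its [...] placeholders.
sub literal_length {
    my ($entry) = @_;
    (my $literal = $entry) =~ s/\[[^\]]*\]//g;
    return length $literal;
}

# Return the entries that could have produced $string, best guess first.
sub fuzzy_candidates {
    my ($string, @entries) = @_;
    my @hits;
    for my $entry (@entries) {
        # Keep the literal pieces as-is and join them with wildcards.
        my $pattern = join '.*',
                      map { quotemeta($_) } split /\[[^\]]*\]/, $entry, -1;
        push @hits, $entry if $string =~ /\A$pattern\z/s;
    }
    return sort { literal_length($b) <=> literal_length($a) } @hits;
}

my @lexicon = (
    q{Successfully [_1].},
    q{Successfully deleted [*,_1,ticket] in '[_2]'.},
    q{Failed to delete [_1].},
);
my ($best) = fuzzy_candidates(
    q{Successfully deleted 7 ticket(s) in 'c:\temp'.}, @lexicon);
print "$best\n";
```

Once the best entry is known, the wildcard captures can be fed back to maketext as arguments, so the translated entry is rendered with the original values.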

Combined with xgettext.pl, developers can supply compendium lexicons along with each plugin/theme, and the Slash system would employ a multi-layer lookup mechanism: try plugin-specific entries first; then the theme's; then fall back to the global lexicon.
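The multi-layer lookup amounts to searching a list of lexicons from most to least specific. The sketch below uses plain hashes with invented entries; layered_lookup is a hypothetical helper, not part of Slash or of any module mentioned in this article.

```perl
use strict;
use warnings;

# Hypothetical lexicons; real ones would be loaded from the compendium
# files shipped with the plugin, the theme and the site.
my %plugin_lexicon = ( 'Submit' => 'Abschicken' );
my %theme_lexicon  = ( 'Submit' => 'Senden', 'Search' => 'Suchen' );
my %global_lexicon = ( 'Search' => 'Suche',  'Login'  => 'Anmelden' );

# Return the first translation found, searching the layers in order.
sub layered_lookup {
    my ($key, @layers) = @_;
    for my $lexicon (@layers) {
        return $lexicon->{$key} if exists $lexicon->{$key};
    }
    return $key;    # nothing found: show the original string
}

my @layers = (\%plugin_lexicon, \%theme_lexicon, \%global_lexicon);
print layered_lookup($_, @layers), "\n" for qw(Submit Search Login Logout);
```

A plugin author thus only needs to ship the entries that differ from, or are missing in, the theme and global lexicons.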

Summary

``...perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.''
-- Twelfth Networking Truth, RFC 1925

From the two case studies above, it is quite easy to see an emergent pattern in how such efforts are carried out. This section presents a 9-step guide to localizing existing web applications, as well as tips on how to implement them with minimal hassle.

The Localization Process

We can summarize the localization process as several steps, each depending on previous ones:
  1. Assess the website's templating system
  2. Choose a localization framework and hook it up
  3. Write a program to locate text strings in templates, and put filters around them
  4. Extract a test lexicon; fix obvious problems manually
  5. Locate text strings in the source code by hand; replace them with _(...) calls
  6. Extract another test lexicon and machine-translate it
  7. Try the localized version out; fix any remaining problems
  8. Extract the beta lexicon and mail it to your translator teams for review; fix the problems they report, then extract the official lexicon and mail it out!
  9. Periodically notify translators of new lexicon entries before each release

Following these steps, one can manage an L10n project fairly easily, keep the translations up to date, and minimize errors.

Localization Tips

Finally, here are some tips for localizing Web applications, and other software in general:

Listing 17. Fragmented vs. complete sentences
_("Found ") . $files . _(" file(s).");   # Fragmented sentence - wrong!
sprintf(_("Found %s file(s)."), $files); # Complete (with sprintf)
_("Found [*,_1,file].", $files);         # Complete (Locale::Maketext)

Listing 18. Comments in lexicons
#: lib/RT/Transaction_Overlay.pm:579
#. ($field, $self->OldValue, $self->NewValue)
# Note that 'changed to' here means 'has been modified to...'.
msgid "%1 %2 changed to %3"
msgstr "%1 %2 cambiado a %3"

Using the xgettext.pl utility provided in the Locale::Maketext::Lexicon package, the source file, line number (marked by #:) and variables (marked by #.) can be deduced automatically and incrementally. It is also very helpful to clarify the meaning of short or ambiguous entries with normal comments (marked by #), as shown in Listing 18 above.

Conclusion

For countries whose languages are dissimilar to English, localization efforts are often the prerequisite for people to participate in other Free Software projects. In Taiwan, L10n projects like the CLE (Chinese Linux Environment), Debian-Chinese and FreeBSD-Chinese were (and still are) the principal places where community contributions are made. However, such efforts have also historically been time-consuming, error-prone jobs, partly because of the English-specific frameworks and rigid coding practices used by existing applications. The entry barrier for translators was unnecessarily high.

On the other hand, the ever-increasing internationalization of the Web makes it increasingly likely that the interface to a Web-based dynamic content service will be localized to two or more languages. For example, Sean M. Burke led enthusiastic users to localize the popular Apache::MP3 module, which powers home-grown Internet jukeboxes everywhere, to dozens of languages in 2002. The module's author, Lincoln D. Stein, was not involved with the project at all -- all he needed to do was to integrate the i18n patches and lexicons into the next release.

Free Software projects are not abstractions filled with code, but rather depend on people caring enough to share code, as well as sharing useful feedback in order to improve each other's code. Hence, it is my sincere hope that the techniques presented in this article will encourage programmers and eager users to actively internationalize existing applications, instead of passively translating for the relatively few applications with established i18n frameworks.

Acknowledgments

Thanks to Jesse Vincent for suggesting Locale::Maketext::Lexicon to be written, and for allowing me to work with him on RT's L10n model. Thanks also to Sean M. Burke for coming up with Locale::Maketext, and encouraging me to experiment with alternative Lexicon syntaxes.

Thanks also go to my brilliant colleagues at OurInternet, Inc. for the hard work they did on localizing web applications: Hsin-Chan Chien, Chia-Liang Kao, Whiteg Weng and Jedi Lin. Also thanks to my fellow translators of the Llama book (Learning Perl), who showed me the power of distributed translation teamwork.

I would also like to thank Nick Ing-Simmons, Dan Kogai and Jarkko Hietaniemi for teaching me how to use the Encode module, Bruno Haible for his kind permission for me to use his excellent work on GNU libiconv, and Tatsuhiko Miyagawa for proofreading early versions of my Locale::Maketext::Lexicon module. Thanks!

Finally, if you decide to follow the steps in this article and participate in software internationalization and localization, then you have my utmost gratitude; let's make the Web a truly World Wide place.

Bibliography

Alvestrand, Harald Tveit. 1995. RFC 1766: Tags for the Identification of Languages. ftp://ftp.isi.edu/in-notes/rfc1766.txt

Callon, Ross, editor. 1996. RFC 1925: The Twelve Networking Truths. ftp://ftp.isi.edu/in-notes/rfc1925.txt

Drepper, Ulrich, Peter Miller, and François Pinard. 1995-2001. GNU gettext. Available in ftp://prep.ai.mit.edu/pub/gnu/, with extensive documents in the distribution package.

Burke, Sean M., and Jordan Lachler. 1999. Localization and Perl: gettext breaks, Maketext fixes. First published in The Perl Journal, issue 13.

Burke, Sean M. 2002. Localizing Open-Source Software. First published in The Perl Journal, Fall 2002 edition.

W3C internationalization activity statement, 2001, http://www.w3.org/International/Activity.html

Mozilla i18n & L10n guidelines, 1999, http://www.mozilla.org/docs/refList/i18n/