Web-Application Internationalization
Puneet Sachdev
December 2007
Summary: The key to Web-application internationalization is to know both your application and its future audience. This article
aims to answer many of the critical questions that must be addressed before an application is internationalized. (9 printed pages)
Contents
Introduction
Scenario: A. Datum Corporation
Database-Layer Issues
Business-Layer Issues
Presentation-Layer Issues
Client-Layer Issues
Process
Conclusion
Critical-Thinking Questions
Sources
Glossary
Introduction
When well-known, multinational companies want to sell in China, they make sure that their users understand their content. At
the time of this writing, one such company of which I know has 26 localized versions of its online-shopping portal. They and
other global corporations have realized that, although you can buy in any language, you must use the customer's language to
sell [Morgan et al., 2001]. Challenges in transforming a Web site or a product, so that it can cater to a global audience, can be
immense.
This article attempts to highlight the main issues of which a practicing architect would have to be aware upon attempting
internationalization of a Web application with both static and dynamic content. Every platform on which an application is built
provides a detailed, platform-specific internationalization guide, which should be carefully read and understood.
The article does not attempt to provide solutions for the various issues that it highlights in every layer of the application.
However, it does recommend an approach through the choices that are made in our fictitious scenario.
Scenario: A. Datum Corporation
Your company, A. Datum Corporation, is celebrating its fourth consecutive year as the leading Web-based mall-management
ASP in the U.S. Your CEO, Scott Bishop, is addressing shareholders with current-year numbers, when someone asks a question
that he had anticipated: "With the U.S. market captured and saturated, how do we plan to maintain explosive growth in revenue
and profits?"
Scott explains, "By selling to the Chinese, Indians, and Russians. Seventy percent of the shopping malls to be built in the next 10
years will be in these countries. We want to be there when they build them." This is when it all starts. This is when A. Datum
embarks on its journey to become a global corporation.
You are the architect of the special task force that is created by Scott and is responsible for achieving this important milestone of
releasing the product in China in six months. You know that the effort will affect every part of the existing application and will not
https://msdn.microsoft.com/en-us/library/cc168605(d=printer).aspx 1/9
1/19/2018 Web-Application Internationalization
be limited only to the presentation layer, as many in the company have been suggesting. From experience, you also know that
globalizing an application has two parts, both of which are orthogonal to the application layers (client, presentation, business,
and integration): internationalization (i18n) and localization (L10n).
· Reduce risk, as well as make it easier to isolate and resolve issues quickly and effectively.
· Enable regression-testing to happen easily, because development of the application is happening in parallel.
Part of the team will work on automating certain parts of the effort by using commercial off-the-shelf (COTS) or homegrown
tools. The following are general areas in which you think that this might help:
· Scanning code, and identifying use of locale-specific functions, routines, and methods
Following your analysis, you define a phased process that is punctuated by regression-testing efforts (see the section titled
"Process").
Database-Layer Issues
You realize that, although the immediate goal is to release in China, that market is not going to be the only one in which you will
eventually deploy.
Character Set
· The database must support the character set of the data that is coming in. The choice of character set must consider future
language requirements. The ISO 8859 character set supports English and most of the western European languages, but it does
not support Chinese. Consequently, you choose Unicode, which provides unrestricted multilingual support.
· The application should be refactored appropriately, so that localization in various languages does not result in code change and
recompilation. After this has been done, the time-to-market in new geographies will be reduced tremendously.
· The external interfaces of the application should be able to handle data in every character set.
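The character-set choice above can be checked with a few lines of Java. This is only a minimal sketch: the class name CharsetCoverage and the sample strings are made up for illustration, and it tests the Java-side charsets rather than any particular database.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCoverage {
    // True if every character in the text has a mapping in the given charset.
    static boolean canRepresent(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String english = "Mall";
        String chinese = "\u4E2D\u6587"; // two CJK ideographs (the word for "Chinese")

        System.out.println(canRepresent(StandardCharsets.ISO_8859_1, english)); // true
        System.out.println(canRepresent(StandardCharsets.ISO_8859_1, chinese)); // false
        System.out.println(canRepresent(StandardCharsets.UTF_8, chinese));      // true
    }
}
```

A check of this kind could be run against sample data for each planned market before committing to a character set.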
Data Migration
You determine that, because existing data is encoded in ASCII, and ASCII is a subset of Unicode, its migration into a database with
Unicode support is not an issue. Should a conversion be needed, tools are available from database vendors for exporting and
importing data to switch its character set.
Character Widths
Every character in English is encoded within a single byte. Hence, a database column of width CHAR (10) holds 10
characters in English or in any language that is encoded in ISO 8859. In a Unicode encoding such as UTF-8, however, a Russian
character spans two bytes, and a Chinese character typically spans three. Consequently, you give the go-ahead to increase the size
of all character fields in the database to at least three times their current size, to accommodate Asian and other multibyte languages.
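The sizing rule can be made concrete by counting bytes. The sketch below assumes that the database stores text as UTF-8 (the scenario only says "Unicode"); the class name Utf8Width and the sample characters are illustrative.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {
    // Number of bytes that the text occupies when it is encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("A"));      // 1 (Latin letter)
        System.out.println(utf8Bytes("\u042F")); // 2 (Cyrillic letter)
        System.out.println(utf8Bytes("\u4E2D")); // 3 (CJK ideograph)
    }
}
```

Under that assumption, a column that is sized in bytes for 10 English characters holds only three such ideographs, which is why the factor of three is a floor and not a generous margin.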
Business Logic
The database contains business logic inside of procedures and functions. Modifications to column sizes can result in this code
being refactored. However, specific database vendors might provide certain features that can limit the amount of modifications
that are needed to accommodate these changes (for example, NLS_LENGTH_SEMANTICS in Oracle 9i [Oracle, 2005]).
Business-Layer Issues
This is the layer in which most of your application code lies. It broadly encompasses the application
server/middleware/MOM/ESB/processing engines.
Character Set
Because your application is J2EE-based, it is inherently Unicode-compliant. Most of the application development platforms that
are available today support Unicode.
Locale Negotiation
The application currently assumes an en_US locale (U.S. English, as indicated in Java properties); every user in the application
defaults to this. In a global scenario, locale negotiation would determine the user's locale. There are various ways in which this
can be done for a Web application:
· Store the locale as a user preference, and service all requests to the user based on that preference.
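One possible shape for locale negotiation is sketched below. It checks the stored user preference first and then, as an assumption that goes beyond the option listed above, falls back to the browser's Accept-Language header and finally to en_US. The class name, the list of supported locales, and the ordering are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LocaleNegotiator {
    static final Locale DEFAULT_LOCALE = Locale.US;

    // Hypothetical set of locales for which localized resources exist.
    static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.US, Locale.SIMPLIFIED_CHINESE, Locale.forLanguageTag("ru-RU"));

    static Locale negotiate(Locale storedPreference, String acceptLanguage) {
        // 1. A stored user preference wins when it is supported.
        if (storedPreference != null && SUPPORTED.contains(storedPreference)) {
            return storedPreference;
        }
        // 2. Otherwise, try the Accept-Language header (RFC 4647 lookup).
        if (acceptLanguage != null && !acceptLanguage.trim().isEmpty()) {
            try {
                List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
                Locale match = Locale.lookup(ranges, SUPPORTED);
                if (match != null) {
                    return match;
                }
            } catch (IllegalArgumentException malformedHeader) {
                // Ignore an ill-formed header and fall through to the default.
            }
        }
        // 3. Fall back to the locale that the application assumes today.
        return DEFAULT_LOCALE;
    }
}
```

Note that Locale.lookup matches a range such as "zh-CN" against the supported tag "zh-CN", but a bare "zh" does not match it; a production version would probably need a more forgiving matching policy.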
Business Logic
Multilingual data has created new requirements/validations in your code. These requirements relate to locale-specific validations,
currency conversions, and the exporting and importing of data. For instance, a CSV file import/export will have problems for a
French locale, because the decimal separator is a comma and not a period (that is, 4.5 becomes 4,5), and the comma also delimits
the CSV fields. You realize that fixes to these problems would have to be determined on a case-by-case basis, and that these are
more design issues than coding issues.
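The CSV problem is easy to reproduce. In a French locale the decimal separator is a comma, so a value that is formatted for display collides with the field delimiter. The sketch below uses a hypothetical class and a deliberately naive row builder to show the effect.

```java
import java.text.NumberFormat;
import java.util.Locale;

class CsvDecimalPitfall {
    // Formats a number the way a user in the given locale expects to read it.
    static String formatFor(Locale locale, double value) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Naive export: joins a label and a locale-formatted value with a comma.
    static String naiveCsvRow(Locale locale, String label, double value) {
        return label + "," + formatFor(locale, value);
    }

    public static void main(String[] args) {
        System.out.println(naiveCsvRow(Locale.US, "rate", 4.5));     // rate,4.5
        System.out.println(naiveCsvRow(Locale.FRANCE, "rate", 4.5)); // rate,4,5
    }
}
```

The French row now parses as three fields instead of two. Possible remedies, each to be judged case by case, include writing numbers in a locale-neutral form, quoting fields, or choosing a different delimiter.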
Presentation-Layer Issues
The most important aspect of any Web application is its presentation; this is what the users interact with, and it has the maximum
impact on their perception of the application. Your application presents two types of content to the user.
Static Content
This includes help files, pages about terms and conditions, images, and so on. The best approach for these is to maintain
separate copies of them per language, and have the application pick the appropriate version depending upon the user's locale.
This content best resides inside of a content-management system (CMS), and is best served through HTTP servers like Apache or
Microsoft Internet Information Services (IIS). The main task here is to decide on a directory structure that keeps static content for
each locale separate and easily maintainable.
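One way to picture such a directory structure is a lookup that tries the most specific locale directory first and then falls back. The layout below (a directory per language and country, a directory per language, and a "default" directory) is an assumption made for illustration and is not prescribed by the scenario.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class StaticContentPaths {
    // Returns the paths to try, most specific first, for a static file.
    static List<String> candidatePaths(String root, String fileName, Locale locale) {
        List<String> paths = new ArrayList<>();
        String language = locale.getLanguage();
        String country = locale.getCountry();
        if (!language.isEmpty() && !country.isEmpty()) {
            paths.add(root + "/" + language + "_" + country + "/" + fileName);
        }
        if (!language.isEmpty()) {
            paths.add(root + "/" + language + "/" + fileName);
        }
        paths.add(root + "/default/" + fileName); // hypothetical fallback directory
        return paths;
    }

    public static void main(String[] args) {
        // Prints [/static/zh_CN/help.html, /static/zh/help.html, /static/default/help.html]
        System.out.println(candidatePaths("/static", "help.html", Locale.SIMPLIFIED_CHINESE));
    }
}
```

The application would serve the first candidate that exists, so a locale with no translated help file still receives the default copy.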
Dynamic Content
Supporting multiple locales for dynamic content means that the internationalization architecture must:
· Automatically render entities, such as numeric and monetary values, according to locale.
· Allow groups of templates to be treated as a unit, to support different page designs for different locales.
You list the following areas as those in which most of the work will lie, in the presentation layer.
Textual Content
The internationalized application must treat text and images (images with text) as dynamically generated data. Existing textual
content has to be extracted into resource bundles. Your team has developed certain in-house tools that can scan the existing
code base and perform the extraction. You suggest enhancing the tool to replace the extracted occurrence with the result of a
call to the resource bundle by using a generated key. These resource bundles will be the targets of localization efforts in the
various user languages.
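The result of such an extraction might look like the following sketch, in which a page asks for a label by key instead of embedding the text. The class LabelCatalog, the keys, and the Russian string are illustrative assumptions; the bundles are built in memory only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class LabelCatalog {
    private final Map<String, ResourceBundle> bundlesByLanguage = new HashMap<>();
    private final ResourceBundle fallback;

    LabelCatalog(ResourceBundle fallback) {
        this.fallback = fallback;
    }

    void register(String language, ResourceBundle bundle) {
        bundlesByLanguage.put(language, bundle);
    }

    // Looks up a key for the locale's language; falls back to the base bundle.
    String label(Locale locale, String key) {
        ResourceBundle bundle = bundlesByLanguage.get(locale.getLanguage());
        if (bundle != null && bundle.containsKey(key)) {
            return bundle.getString(key);
        }
        return fallback.getString(key);
    }

    // Parses properties-file syntax from a string.
    static ResourceBundle fromProperties(String text) {
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static LabelCatalog demo() {
        LabelCatalog catalog = new LabelCatalog(fromProperties(
                "po.number=Purchase Order Number\ncart.checkout=Checkout\n"));
        // Placeholder Russian text (roughly "order number"); real text comes from localizers.
        catalog.register("ru", fromProperties(
                "po.number=\u041D\u043E\u043C\u0435\u0440 \u0437\u0430\u043A\u0430\u0437\u0430\n"));
        return catalog;
    }
}
```

In a deployed application, the bundles would more likely be .properties files on the classpath that are loaded through ResourceBundle.getBundle, which already performs a similar fallback from the specific locale to the base bundle.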
Screen Layout
Each written language has different characters, and they take up a different amount of real estate on the screen. Hence, it is
possible, after translating to Russian, that "Purchase Order Number" will not fit in the current 100-pixel width that is defined for
its label. You need a way, then, to externalize (or parameterize) the screen layout per locale. You make an informed decision to
use HTML DIV-based layouts in the Web pages. This allows you to control the layout completely by using CSS. The idea is to
have a separate style sheet for every language that is supported. This enables UI designers to work on screen layouts by
using only style sheets and not complex dynamic JSP pages.
· Currency formats—For an amount of 10000.00, display "$10,000.00" for the en_US locale and "10 000,00 €" for the fr_FR locale.
· Date formats—Locales vary in date and calendar-format displays. "DD/MM/YYYY," "MM/DD/YYYY," and "MMM DD, YYYY" are
some of the common formats that are used. The names of the days and months also need localization. Figure 2 shows the
current date lookup in your application—internationalized, and then localized in Russian:
· Address/Phone number—Addresses also vary from country to country. There are differences in the list of states and in postal-code
formats (for example, "XXXXX-XXXX" for a ZIP+4 code in the United States, or the six-digit "XXXXXX" PIN code in India).
· Validation—Differing formats for numbers, dates, addresses, ZIP codes, and phone numbers lead to the related problem of
locale-specific validation. You come up with two ways to solve this problem:
· Locale-sensitive validations that can be accomplished completely in JavaScript can be implemented on a per-locale basis, in
separate JavaScript files, which will be dynamically included in the pages depending on locale.
· Certain validations might require some server-side support. The J2SE platform provides extensive locale support for currency,
date, and numbers. Your company has contracted with a third-party Web service for validation of international addresses. You
decide to implement such complex validations by using AJAX.
· Text truncation—The length of a phrase with the same meaning might vary in different languages. Because a Web page has
finite space, you decide on implementing a truncation scheme, in which noncritical data is truncated. The user has an option of
drilling down to see the full content.
· HTTP encoding—The server can set a charset parameter in the Content-Type HTTP header to specify the character encoding of the
response. Because the application is going to support multiple languages, your recommendation is to use a Unicode encoding such
as UTF-8. Hence, you instruct the team to put the following line in every JSP in the application:
<code>
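Returning to the currency-format and date-format items in the list above, the following sketch shows how the J2SE formatting classes could render the same values per locale. The class and method names are hypothetical, and the java.time API that is used for dates is newer than the platform version of the scenario.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

class LocaleFormatting {
    // Renders an amount with the currency symbol and separators of the locale.
    static String money(Locale locale, double amount) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    // Renders a date in the locale's long style, with localized month names.
    static String longDate(Locale locale, LocalDate date) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2007, 12, 1);
        System.out.println(money(Locale.US, 10000.00)); // $10,000.00
        System.out.println(longDate(Locale.US, date));  // December 1, 2007
        // French and Russian output depends on the locale data of the JDK in use.
        System.out.println(money(Locale.FRANCE, 10000.00));
        System.out.println(longDate(Locale.forLanguageTag("ru-RU"), date));
    }
}
```

The exact French output varies slightly between JDK releases (for example, in which non-breaking space character is used for grouping), so tests should not compare such strings byte for byte.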
Client-Layer Issues
The client for the application is a browser. Your application supports Microsoft Internet Explorer version 5.5 and later. The
following issues are those with which your team has to deal, in this layer:
Fonts
For a browser to show data in a particular character set, it needs a font, which will map the code points of that character set to
appropriate visual representations. Certain fonts are part of the basic installation of the operating system. For instance, a
Japanese version of Windows will have fonts installed for the Japanese language. However, if the user wants to view Japanese
data on a machine that has the Windows-1252 code page installed, a compatible font is required. There are general-purpose
fonts available, too, which support all of the languages. As an example, Arial Unicode MS is one such font and is part of the
Microsoft Office distribution. Your application will specify the appropriate font as part of style sheets, and the expectation would
be that the appropriate font is installed on the user's machine.
JavaScript
An important aspect to consider on the client side is JavaScript. Your application uses a lot of JavaScript for performing validation
and showing alert messages to the user. These messages will be pre-read from the resource bundles in the browser and shown
to the user by using JavaScript alerts; therefore, they must be localized.
Third-Party Interfaces
Every third-party interface to the application has to be looked upon in the light of data in multiple languages passing through it.
· XML interaction—Interactions with external Web services that use XML encoded in UTF-8 are capable of handling multilingual
content.
· Flat file—The application allows certain flat-file downloads and uploads. Your team ensures correct encoding of the files that
are downloaded and uploaded, so that no information is lost. Every export/import that involves CSV files must be looked at, as to
whether such a method is still viable. Some situations might warrant a switch to a different format, such as XML.
· Third-party tools—Each third-party tool/API that is used in the application must support multibyte character sets and Unicode
compliance; for example, a PDF driver is used to generate reports as a PDF document.
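For the flat-file item above, the sketch below shows what "no information is lost" means in practice: the charset that is used to read a file must match the charset that was used to write it. The class name and the temporary file are illustrative assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FlatFileEncoding {
    // Writes text to a temporary file in one charset and reads it back in another.
    static String writeThenRead(String text, Charset writeAs, Charset readAs) {
        try {
            Path file = Files.createTempFile("export", ".csv");
            try {
                Files.write(file, text.getBytes(writeAs));
                return new String(Files.readAllBytes(file), readAs);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587"; // two CJK ideographs
        // Matching charsets: the text survives the round trip.
        System.out.println(chinese.equals(
                writeThenRead(chinese, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));      // true
        // Mismatched charsets: the reader sees garbled characters.
        System.out.println(chinese.equals(
                writeThenRead(chinese, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))); // false
        // A charset that cannot represent the text: characters are replaced on write.
        System.out.println(
                writeThenRead(chinese, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1)); // ??
    }
}
```

The last case is the most dangerous one, because the original characters are replaced by question marks at write time and cannot be recovered afterward.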
Process
The process diagram that is shown in Figure 3 depicts the phased process for this effort:
Figure 3. A process diagram for internationalization
Conclusion
In this article, we have answered many of the critical questions that must be addressed before an application is internationalized.
The answers to these questions are critical for the effort to both succeed and realize the expected return on investment (ROI).
· The number of languages that an application might have to support in the future determines the character set that is to be used.
When in doubt, use Unicode.
· It is also critical to carefully list the elements of the application that will be affected by multilingual data. This exercise is
important to scope the effort.
· Internationalization will affect every part of the application, including third-party interfaces. This also implies a significant testing
effort. It always helps if the application has an existing automated test suite, which can be used for regression-testing.
· Internationalization can lead to a lot of work in many source files. An example is extracting static text out of JSP pages and into
resource bundles. It is always better to look for tools (COTS or homegrown) that can ease some of this pain.
Critical-Thinking Questions
· What is the current state of my application, from an i18n perspective?
· Can my persistence handle data in multiple character sets? What should be the character set of the database?
· What areas of the presentation layer are affected by the user's locale?
Sources
· [Morgan et al., 2001] Morgan, Terri, Carol Luttrell, and Yuzeng Liu. "Designing Multilingual Web Sites: Applied Authoring
Techniques." 2001. ACM Digital Library.
· [Oracle, 2005] Hardman, Ron. "Globalization: Going Global." Oracle Magazine, 2005.
Glossary
ASCII (American Standard Code for Information Interchange)—The most common character set that is used to represent
American English. Code points in 7-bit ASCII (called US-ASCII) range from 0 to 127. ASCII contains uppercase and lowercase
Roman alphabets, European numerals, punctuation, a set of control codes (nongraphical code points from decimal 0 to 31), and
a few miscellaneous symbols. Many early Internet protocols were based on 7-bit ASCII, which greatly complicated Web-
application support of languages other than American English.
Character set—A set of graphical, textual symbols, each of which is mapped to an integer (for example, ASCII and ISO 8859).
Collation—The process of ordering text by using language-specific or locale-specific rules, instead of by using binary comparison.
Encoding—A way of mapping the code points of a character set to units of specific width, and defining byte serialization and
ordering rules. Unicode has UTF-8 and UTF-16 encodings.
Internationalization (i18n)—The process of designing an application to make it adaptable to different languages and regions,
without requiring engineering changes.
ISO 8859—A character-set series that was created to overcome some of the limitations of ASCII. Each ISO 8859 character set
may have up to 256 characters. ISO 8859–1 ("Latin–1") comprises the ASCII character set, plus characters with diacritics (accents,
diereses, cedillas, circumflexes, and the like), and additional symbols. The ISO 8859 series defines 15 character sets (ISO 8859–1
through ISO 8859–11, and ISO 8859–13 through ISO 8859–16) that can represent text in dozens of languages.
Locale—A set of political, cultural, and region-specific elements that are represented in an application. In Java, a locale is
identified by a combination of a language (an ISO 639 code), a country (an ISO 3166 code), and an optional variant (for example,
en_US and en_GB).
Localization (L10n)—The process of adapting software for a specific region or language, by adding locale-specific components
and translating text.
Unicode—A character set, kept synchronized with ISO/IEC 10646, whose code points fit in 21 bits. Unicode aims to cover the
characters of all of the writing systems in the world. The Java programming language internally represents character and string
objects as 16-bit UTF-16 code units. Hence, programs that are written in Java can process data in multiple languages.
Windows-1252—Character encoding of the Latin alphabet in Microsoft Windows. Windows-1252 is a superset of ISO 8859–1 that
assigns printable characters to the range that ISO 8859–1 reserves for control codes.
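As a footnote to the Unicode entry, the gap between 21-bit code points and 16-bit Java characters shows up in string lengths. The sketch below is illustrative only (the class name is made up); it uses U+20000, an ideograph outside the 16-bit range, which Java stores as a pair of 16-bit units.

```java
class CodePoints {
    // Number of 16-bit units that Java uses to store the string.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (user-visible characters, roughly) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String rareIdeograph = new String(Character.toChars(0x20000));
        System.out.println(utf16Units(rareIdeograph)); // 2
        System.out.println(codePoints(rareIdeograph)); // 1
    }
}
```

Code that truncates or validates text by length should therefore count code points and not 16-bit units, or it risks cutting such a character in half.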
This article was published in Skyscrapr, an online resource provided by Microsoft. To learn more about architecture and the
architectural perspective, please visit skyscrapr.net.
© 2018 Microsoft