Internationalization i18n Localization l10n presentation

1
Internationalization (i18n)
Localization (l10n)
2
Objectives

The need to do it.
The difficulties involved.
How to do it with JSPs and Java.
Database issues.

3
Think Globally

92% of the world speaks little or no English
20 main Asian languages
According to ethnologue.com:
6,809 living languages
Living: a child is learning it as their only language

4
Primary Languages Spoken
Rank  Language    Script      Speakers (millions)
1     Mandarin    Chinese     885
2     Hindi       Devanagari  375
3     Spanish     Latin       358
4     English     Latin       347
5     Arabic      Arabic      211
6     Bengali     Bengali     210
7     Portuguese  Latin       178
8     Russian     Cyrillic    165
9     Japanese    Japanese    125
10    German      Latin       100
5
English usage is shrinking
Source: http://global-reach.biz/globstats/evol.htm
English Internet users: 2000: 58%, 2005: < 35%
Non-English online traffic: 2000: 40%, 2005: 70%
6
Why i18n is hard

Lots of different character sets
A character set maps characters to numbers
ASCII (1968 revision) uses 7 bits
That gives you 128 possible characters:
26 lower case
26 upper case
10 digits
10 symbols above the digits
20 commonly used punctuation marks
Leaves you 36 more code points (ASCII spends most of them on control codes)
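
To make "a character set maps characters to numbers" concrete, here is a minimal Java sketch. The class and method names (AsciiCodes, codeOf, fitsInAscii) are made up for illustration; the counts are the ones from the slide.

```java
public class AsciiCodes {
    // Returns the number that a character maps to.
    static int codeOf(char c) {
        return (int) c;
    }

    // True if the character fits in 7 bits, i.e. lies in the ASCII range 0..127.
    static boolean fitsInAscii(char c) {
        return c < 128;
    }

    // Code points left over after the categories counted on the slide.
    static int leftOver() {
        int counted = 26 + 26 + 10 + 10 + 20; // lower, upper, digits, symbols, punctuation
        return (1 << 7) - counted;            // 7 bits give 128 code points in total
    }

    public static void main(String[] args) {
        System.out.println("A -> " + codeOf('A'));
        System.out.println("a -> " + codeOf('a'));
        System.out.println("Left over: " + leftOver());
        // U+00E9 (e with acute accent) is outside the 7-bit range.
        System.out.println("U+00E9 fits in ASCII? " + fitsInAscii('\u00E9'));
    }
}
```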

7
Why i18n is hard

Only Latin, English, Hawaiian, and Swahili fit into 7 bits
The 8th bit was used for i18n, but no single standard was set.
Even with Western European Latin, there are many, many mappings:
ISO-8859-1 (based on old DEC VT220 terminals)
IBM 850
cp1252 (Microsoft's contribution)
http://czyborra.com/charsets/ is a HUGE list of character sets
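
A short Java sketch of why the competing 8-bit mappings matter: the same byte decodes to different characters depending on the mapping. The class and method names are hypothetical. ISO-8859-1 is always available in Java; windows-1252 (Java's name for cp1252) is assumed to be present on most JDKs, so the code checks before using it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EighthBitDemo {
    // Decodes a single byte under the given charset and returns the resulting code point.
    static int decodeByte(int b, Charset cs) {
        String s = new String(new byte[] { (byte) b }, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x80 }; // 'A', an accented letter, a disputed slot
        for (int b : samples) {
            System.out.printf("byte 0x%02X as ISO-8859-1:   U+%04X%n",
                    b, decodeByte(b, StandardCharsets.ISO_8859_1));
            if (Charset.isSupported("windows-1252")) {
                System.out.printf("byte 0x%02X as windows-1252: U+%04X%n",
                        b, decodeByte(b, Charset.forName("windows-1252")));
            }
        }
    }
}
```

The two mappings agree on 0x41 and 0xE9 but disagree on 0x80, which is a control code in ISO-8859-1 and the euro sign in windows-1252.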

8
Characters Need More Space

Eight bits gives you room for 256 characters
Still not enough.
Unicode 3.2 has OVER 95,000 characters!!!
!!! THAT'S A LOT OF CHARACTERS !!!
Since 8 bits can only hold 256 characters, that
led to... A LOT OF STANDARDS!!!!!
There are so many standards, nobody can keep them
straight
IANA maintains the list of standard names:
http://www.iana.org/assignments/character-sets

9
Unicode: The Solution (sort of)

A consortium started in 1991 to come up with a
standard character set
Characters are stored in more than one byte.
Two bytes give you 2^16 = 65,536 characters, which covers
nearly every character of the currently used languages;
the rest of Unicode needs more than two bytes.
Java stores characters in Unicode (as 16-bit units), which makes it
very good for internationalization.
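
A minimal sketch of what "Java stores characters in Unicode" means in practice: a Java char is a 16-bit unit, so a character beyond the first 65,536 takes two of them. The class and helper names are illustrative only.

```java
public class JavaCharsDemo {
    // Number of 16-bit char units in the string.
    static int units(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String accented = "\u00E9";                                 // fits in one 16-bit unit
        String beyond16Bits = new String(Character.toChars(0x1F600)); // needs two units

        System.out.println("bits per char: " + Character.SIZE);
        System.out.println("U+00E9:  units=" + units(accented)
                + " characters=" + codePoints(accented));
        System.out.println("U+1F600: units=" + units(beyond16Bits)
                + " characters=" + codePoints(beyond16Bits));
    }
}
```

When counting or slicing user-visible characters, codePointCount is the safer choice over length().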

10
Problems with Unicode

Unicode is a character set. It simply attaches
numbers to characters; it doesn't dictate how
they're translated into bits.
The simplest character encoding for Unicode is
called UCS-2: just use two bytes for each
character. A is 0x00 0x41
If you're not careful, you'll end up sending
bytes that look like \0 (the string terminator in C)
to UNIX, which messes it up.
UTF-8 is an encoding of Unicode which works
across all known platforms: ASCII text stays as it is,
and no zero bytes appear inside other characters.
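
The difference between the two encodings can be seen by dumping the bytes. This is a sketch with made-up names; it uses Java's UTF-16BE as a stand-in for UCS-2, which produces the same bytes for the characters shown here.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Formats bytes as space-separated hex, e.g. "0x00 0x41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "A";
        String eAcute = "\u00E9";

        // Two bytes per character: note the zero byte in front of 'A'.
        System.out.println("A      two-byte: " + hex(a.getBytes(StandardCharsets.UTF_16BE)));
        // UTF-8 keeps ASCII as a single byte with no zero byte.
        System.out.println("A      UTF-8:    " + hex(a.getBytes(StandardCharsets.UTF_8)));
        // Non-ASCII characters become multi-byte sequences in UTF-8.
        System.out.println("U+00E9 UTF-8:    " + hex(eAcute.getBytes(StandardCharsets.UTF_8)));
    }
}
```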

11
Where i18n happens

In a database-driven website, you have to worry
about i18n in 3 places:
Your database has to store some kind of Unicode
The web browser the person uses has to know
Unicode
Your application has to be i18n:
  Has to read Unicode from the database
  Has to know how to write to the browser

12
I18n in databases

Unicode support in databases is fairly new.
Recent versions of all major databases support
UTF-8 in some way.
Some, however, require you to use special data
formats.
Oracle, for example, will let you either declare
your entire database as Unicode encoded, or add
Unicode columns to tables in a non-Unicode database
using the NCHAR and NVARCHAR2 data types.

13
Database Details
Database        UTF-8   Other Unicode?   Special treatment?
Oracle          Yes     Yes              Depends
Sybase          Yes     No               No
Postgres        Yes     No               No
MS SQL Server   No      Yes              Yes
MySQL           Yes     Sort of          No
14
How Browsers Deal with Different Encodings
When a browser sends a request for a web page, it
tells the web server what kind of encoding it
understands:
    GET / HTTP/1.1
    HTTP_ACCEPT_LANGUAGE: en-us, hr
    HTTP_ACCEPT_CHARSET: ISO-8859-1, UTF-8
(These are the CGI-style names; on the wire the headers
are Accept-Language and Accept-Charset.)
This says that the browser prefers
documents to be sent to it in the ISO-8859-1
(Latin) encoding, but will also take UTF-8.
Sadly, Internet Explorer doesn't send
ACCEPT_CHARSET. As of MSIE 6, it understands
UTF-8.
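
On the server side, picking an encoding from the browser's list could look roughly like this. It is a simplified sketch, not how any particular server does it: the names are invented, and it takes the charsets in the order listed and ignores q-values.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class AcceptCharsetDemo {
    // Splits a value such as "ISO-8859-1, UTF-8" into charset names, in listed order.
    // Simplification: any ";q=..." parameter is dropped rather than used for ranking.
    static List<String> parse(String header) {
        List<String> names = new ArrayList<>();
        if (header == null) {
            return names;
        }
        for (String part : header.split(",")) {
            int semi = part.indexOf(';');
            String name = (semi >= 0 ? part.substring(0, semi) : part).trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Returns the first listed charset this JVM supports, or the fallback
    // when the header is missing (as with browsers that do not send it).
    static Charset choose(String header, Charset fallback) {
        for (String name : parse(header)) {
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Not a legal charset name (for example "*"); skip it.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(choose("ISO-8859-1, UTF-8", StandardCharsets.UTF_8));
        System.out.println(choose(null, StandardCharsets.UTF_8));
    }
}
```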
15
How Browsers Deal with Different Encodings
When a server gets a request, it retrieves pages,
runs programs, then responds. Unless otherwise
specified, it uses the same encoding as the request. If
no charset was sent with the request, it uses a
default. Here's a typical HTTPD response:
    HTTP/1.1 200 OK
    Content-Type: text/html; charset=ISO-8859-1

    <html><head><title>hi!</title></head> ...
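
The charset in the response header tells the receiver how to turn the body bytes back into characters. The sketch below, with hypothetical names, reads the charset parameter from a Content-Type value, falls back to ISO-8859-1 when none is given, and shows what happens when UTF-8 bytes are read with the wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeDemo {
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Extracts the charset parameter from a value like "text/html; charset=UTF-8".
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String value = p.substring(8).trim();
                    // Strip optional surrounding double quotes.
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    if (!value.isEmpty()) {
                        return value;
                    }
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    // Decodes the body bytes using the charset named in the header.
    static String decode(byte[] body, String contentType) {
        return new String(body, Charset.forName(charsetOf(contentType)));
    }

    public static void main(String[] args) {
        byte[] body = "\u00E9".getBytes(StandardCharsets.UTF_8); // one accented letter
        System.out.println(decode(body, "text/html; charset=UTF-8").length()); // one character
        System.out.println(decode(body, "text/html").length());                // two wrong ones
    }
}
```

Decoding with the default instead of the real encoding turns one accented letter into two unrelated characters, which is the familiar garbled-text symptom.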
16
Setting the Character Set of a JSP Response
    <%@ page contentType="text/html;charset=UTF-8" %>
sets the character set to UTF-8. The
default is ISO-8859-1.
Some HTML pages have this sort of thing:
    <META http-equiv="Content-type" content="text/html;charset=UTF-8">
It gets overridden by the contentType set by the page
directive.
    <?xml version="1.0" encoding="UTF-8"?>
also gets overridden.
17
Internationalization and Localization in your Application
In Java, different languages are handled by
different resource files. For each language
(locale) you have, a separate resource file
contains the text.
    Greetings.properties          Greetings_fr.properties
    greetings = Hello             greetings = Bonjour
    farewell = Goodbye            farewell = Au revoir
    inquiry = How are you?        inquiry = Comment allez-vous?
These files go in your classes directory (under
WEB-INF). Your code will determine the locale and
use the appropriate file for the text. Each file
is called a resource bundle. Files which contain
the same messages share a common basename.
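
In a real application the files are found on the classpath with ResourceBundle.getBundle("Greetings", locale). To keep the following sketch self-contained, it loads the same key/value text from strings instead of files and does the per-language lookup by hand; the class and method names are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class GreetingsDemo {
    // Contents of Greetings.properties and Greetings_fr.properties, inlined.
    static final String DEFAULT_PROPS =
            "greetings=Hello\nfarewell=Goodbye\ninquiry=How are you?\n";
    static final String FRENCH_PROPS =
            "greetings=Bonjour\nfarewell=Au revoir\ninquiry=Comment allez-vous?\n";

    // Builds one bundle per language; the empty key is the default bundle.
    static Map<String, ResourceBundle> loadBundles() {
        try {
            Map<String, ResourceBundle> bundles = new HashMap<>();
            bundles.put("", new PropertyResourceBundle(new StringReader(DEFAULT_PROPS)));
            bundles.put("fr", new PropertyResourceBundle(new StringReader(FRENCH_PROPS)));
            return bundles;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up a message for the locale's language, falling back to the default bundle.
    static String message(Map<String, ResourceBundle> bundles, Locale locale, String key) {
        ResourceBundle bundle = bundles.get(locale.getLanguage());
        if (bundle == null) {
            bundle = bundles.get("");
        }
        return bundle.getString(key);
    }

    public static void main(String[] args) {
        Map<String, ResourceBundle> bundles = loadBundles();
        System.out.println(message(bundles, Locale.FRENCH, "greetings"));
        System.out.println(message(bundles, Locale.GERMAN, "greetings"));
    }
}
```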
18
Defining the Locale
A locale defines a country and language. In JSTL,
the locale can be set in two ways:
1. By hand:
    <fmt:setLocale value="fr_CA, fr_FR" />
This says you'd prefer Canadian French, but will accept
French French if there's no resource for Canadian
French. Also good, if you have a bean called
myBean with a property locale:
    <fmt:setLocale value="${myBean.locale}" />
2. You can let JSTL set the locale based on the browser's
HTTP_ACCEPT_LANGUAGE.
Don't forget to include the fmt tags:
    <%@ taglib prefix="fmt" uri="http://java.sun.com/jstl/fmt" %>
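
The second approach, choosing a locale from the browser's language list, can be approximated in plain Java (8 or later) with Locale.LanguageRange and Locale.lookup. This is only a sketch of the idea, not what JSTL does internally; the pick method and the list of available locales are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LocaleDemo {
    // Picks the best available locale for an Accept-Language style value,
    // e.g. "fr-CA, fr-FR". Falls back when nothing matches or the header is missing.
    static Locale pick(String acceptLanguage, List<Locale> available, Locale fallback) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return fallback;
        }
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        // lookup tries each range in priority order, shortening "fr-CA" to "fr" if needed.
        Locale match = Locale.lookup(ranges, available);
        return match != null ? match : fallback;
    }

    public static void main(String[] args) {
        // Assumption: the application ships French and English resources only.
        List<Locale> available = Arrays.asList(Locale.FRENCH, Locale.ENGLISH);

        System.out.println(pick("fr-CA, fr-FR", available, Locale.ENGLISH));
        System.out.println(pick("en-us, hr", available, Locale.ENGLISH));
        System.out.println(pick("hr", available, Locale.ENGLISH));
    }
}
```

Note that Java language tags use a hyphen (fr-CA) where the JSTL attribute above uses an underscore (fr_CA).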
19
Using the Locale and Resources
Once the locale is set (either automatically or
by hand), use the <fmt:bundle> and <fmt:message>
tags to get the internationalized message:
    <fmt:bundle basename="greetings">
    Hello <fmt:message key="hello" />
    Goodbye <fmt:message key="goodbye" />
    </fmt:bundle>
The fmt tags will internationalize dates too:
    <fmt:formatDate value="${myBean.date}" dateStyle="long" />
If the locale is en_US, the
date will be month/day/year. If it's en_GB, the
date will be day/month/year.
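
The same locale-dependent date ordering can be seen in plain Java. This sketch uses java.time rather than the JSTL tag, and its names are illustrative; the exact text produced depends on the locale data shipped with the JDK, so only the ordering should be relied on.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

public class DateDemo {
    // Formats a date in the given style according to the locale's conventions.
    static String format(LocalDate date, FormatStyle style, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(style).withLocale(locale).format(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2003, 3, 25);

        // en_US puts the month first; en_GB puts the day first.
        System.out.println("en_US long:  " + format(date, FormatStyle.LONG, Locale.US));
        System.out.println("en_GB long:  " + format(date, FormatStyle.LONG, Locale.UK));
        System.out.println("en_US short: " + format(date, FormatStyle.SHORT, Locale.US));
        System.out.println("en_GB short: " + format(date, FormatStyle.SHORT, Locale.UK));
    }
}
```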
20
Setting a Default Locale for Your Application
In the web.xml file of your application:
    <context-param>
      <param-name>javax.servlet.jsp.jstl.fmt.fallbackLocale</param-name>
      <param-value>en</param-value>
    </context-param>
21
Resources
All the language codes: http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt
All the country codes: http://www.davros.org/misc/iso3166.html
The Unicode site: http://www.unicode.org/
