Internationalization i18n Localization l10n - PowerPoint PPT Presentation

Slides: 22
Provided by: csMi4
Transcript and Presenter's Notes

Title: Internationalization i18n Localization l10n


1
Internationalization (i18n) and Localization (l10n)
2
Objectives
  • The need to do it.
  • The difficulties involved.
  • How to do it with JSPs and Java.
  • Database issues.

3
Think Globally
  • 92% of the world speaks little or no English
  • 20 main Asian languages
  • According to ethnologue.com
  • 6,809 living languages
  • "Living" means a child is learning it as their only
    language

4
Primary Languages Spoken
Rank  Language    Script      Speakers (millions)
1     Mandarin    Chinese     885
2     Hindi       Devanagari  375
3     Spanish     Latin       358
4     English     Latin       347
5     Arabic      Arabic      211
6     Bengali     Bengali     210
7     Portuguese  Latin       178
8     Russian     Cyrillic    165
9     Japanese    Japanese    125
10    German      Latin       100
5
English usage is shrinking
Source: http://global-reach.biz/globstats/evol.htm
  • English-speaking Internet users: 58% in 2000, < 35% in 2005
  • Non-English online traffic: 40% in 2000, 70% in 2005
6
Why i18n is hard
  • Lots of different character sets
  • A character set maps characters to numbers
  • ASCII, in its 1968 revision, uses 7 bits
  • That gives you 128 possible characters
  • 26 lower case
  • 26 upper case
  • 10 digits
  • 10 symbols above the number keys
  • 20 commonly used punctuation marks
  • That leaves 36 code points, and ASCII spends nearly all
    of them on control characters (tab, newline, and so on)
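
The tally above can be checked with a small sketch. It is not part of the original slides: the class name AsciiBudget and its count helper are made up for illustration, and the categories are the ones java.lang.Character defines, which differ slightly from the slide's keyboard-based grouping.

```java
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

// Hypothetical helper (not from the slides): tallies the 128 ASCII code points by category.
class AsciiBudget {
    // Counts how many of the code points 0..127 satisfy the given test.
    static int count(IntPredicate category) {
        return (int) IntStream.range(0, 128).filter(category).count();
    }

    public static void main(String[] args) {
        int letters = count(Character::isLetter);      // A-Z and a-z
        int digits = count(Character::isDigit);        // 0-9
        int controls = count(Character::isISOControl); // 0x00-0x1F and 0x7F
        int rest = 128 - letters - digits - controls;  // space plus punctuation and symbols
        System.out.println("letters=" + letters + " digits=" + digits
                + " controls=" + controls + " space/punctuation=" + rest);
    }
}
```

Whichever way the characters are grouped, all 128 slots are already spoken for, which is why accented letters had nowhere to go.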

7
Why i18n is hard
  • Only Latin, English, Hawaiian, and Swahili fit
    into 7 bits
  • The 8th bit was used for i18n, but no single standard
    was set.
  • Even with Western European Latin, there are many,
    many mappings
  • ISO-8859-1 (based on old DEC VT220 terminals)
  • IBM 850
  • cp1252 (Microsoft's contribution)
  • http://czyborra.com/charsets/ - is a HUGE list of
    character sets
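
To make the "many mappings" point concrete, here is a minimal sketch that decodes the same byte under the three character sets named above. The class name CodePageDemo and the choice of byte 0x80 are illustrative assumptions. A JVM is only required to ship a handful of charsets, so the code checks availability before using windows-1252 and IBM850.

```java
import java.nio.charset.Charset;

// Hypothetical demo (not from the slides): one byte, several meanings.
class CodePageDemo {
    // Decodes a single byte value using the named character set.
    static String decode(int byteValue, String charsetName) {
        byte[] one = { (byte) byteValue };
        return new String(one, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = { "ISO-8859-1", "windows-1252", "IBM850" };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this JVM");
                continue;
            }
            int codePoint = decode(0x80, name).codePointAt(0);
            System.out.printf("%-12s 0x80 -> U+%04X%n", name, codePoint);
        }
    }
}
```

In ISO-8859-1 the byte 0x80 is a control character, while windows-1252 assigns it the euro sign, so text written under one mapping and read under another turns into the wrong characters.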

8
Characters Need More Space
  • Eight bits gives you room for 256 characters
  • Still not enough.
  • Unicode 3.2 has OVER 95,000 characters!!!
  • !!! THAT'S A LOT OF CHARACTERS !!!
  • Since 8 bits can only hold 256 characters, that
    led to... A LOT OF STANDARDS!!!!!
  • There are so many standards, nobody can keep them
    straight
  • IANA maintains the list of standards names
  • http://www.iana.org/assignments/character-sets

9
Unicode: The Solution (sort of)
  • A consortium started in 1991 to come up with a
    standard character set
  • Characters are stored in more than one byte.
  • Two bytes give you 2^16 (65,536) characters, which covers
    the characters of most currently used languages; newer
    versions of Unicode define more characters than that.
  • Java stores characters as 16-bit Unicode code units,
    which makes it very good for internationalization.
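
A short sketch of what "Java stores characters in Unicode" means in practice. The class name CodeUnits and the sample code points are chosen for illustration only. A Java char is 16 bits wide, so a character outside the first 65,536 code points occupies two chars.

```java
// Hypothetical demo (not from the slides): Java chars are 16-bit code units.
class CodeUnits {
    // Builds a String holding exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String latinA = fromCodePoint(0x41);    // LATIN CAPITAL LETTER A
        String clef = fromCodePoint(0x1D11E);   // MUSICAL SYMBOL G CLEF, beyond 16 bits
        System.out.println("bits per char: " + Character.SIZE);
        System.out.println("A: chars=" + latinA.length());
        System.out.println("clef: chars=" + clef.length()
                + " code points=" + clef.codePointCount(0, clef.length()));
    }
}
```

For most text String.length() equals the number of characters, but code that must handle every Unicode character should count code points rather than chars.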

10
Problems with Unicode
  • Unicode is a character set. It simply attaches
    numbers to characters; it doesn't dictate how
    they're translated into bytes.
  • The simplest character encoding for Unicode is
    called UCS-2: just use two bytes for each
    character, so A is 0x00 0x41
  • If you're not careful, you'll end up sending
    bytes that look like \0 (the C string terminator)
    to UNIX programs, which breaks them.
  • UTF-8 is an encoding of Unicode that leaves ASCII
    text unchanged and never produces a zero byte for
    other characters, so it works across all known
    platforms.
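
The difference between the two encodings is easy to see by printing the bytes. This is a sketch with a made-up class name, EncodingBytes. Java has no charset literally named UCS-2, so UTF-16BE stands in for it here; the two agree for characters that fit in 16 bits.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo (not from the slides): the same text as two-byte units and as UTF-8.
class EncodingBytes {
    // Renders bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = { "A", "\u00e9" }; // "A" and e-acute
        for (String s : samples) {
            System.out.println("U+" + String.format("%04X", s.codePointAt(0))
                    + "  two-byte: " + hex(s.getBytes(StandardCharsets.UTF_16BE))
                    + "  UTF-8: " + hex(s.getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

The two-byte form of "A" starts with a zero byte, which is exactly what trips up C-style string handling, while the UTF-8 form is the single byte 0x41.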

11
Where i18n happens
  • In a database-driven website, you have to worry
    about i18n in 3 places:
  • Your database has to store some kind of Unicode
  • The web browser the person uses has to know
    Unicode
  • Your application has to be i18n
    • Has to read Unicode from the database
    • Has to know how to write to the browser

12
I18n in databases
  • Unicode support in databases is fairly new.
  • Recent versions of all major databases support
    UTF-8 in some way.
  • Some, however, require you to use special data
    formats.
  • Oracle, for example, lets you either declare
    your entire database as Unicode-encoded or add
    Unicode columns to tables in a non-Unicode database
    using the NCHAR and NVARCHAR2 data types.

13
Database Details
Database       UTF-8   Other Unicode?   Special treatment?
Oracle         Yes     Yes              Depends
Sybase         Yes     No               No
Postgres       Yes     No               No
MS SQL Server  No      Yes              Yes
MySQL          Yes     Sort of          No
14
How Browsers Deal with Different Encodings
When a browser sends a request for a web page, it
tells the web server what kind of encoding it
understands:
    GET / HTTP/1.1
    Accept-Language: en-us, hr
    Accept-Charset: ISO-8859-1, UTF-8
(CGI programs see these headers as HTTP_ACCEPT_LANGUAGE
and HTTP_ACCEPT_CHARSET.) This says that the browser
prefers documents to be sent to it in the ISO-8859-1
(Latin) encoding, but will also take UTF-8.
Sadly, Internet Explorer doesn't send
Accept-Charset. As of MSIE 6, it understands
UTF-8.
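
One way a server-side program might act on that header is sketched below. The class AcceptCharsetPicker and its pick method are hypothetical, not part of any servlet API. For simplicity the sketch ignores quality values (the ";q=0.8" suffixes) and takes the first listed charset the JVM supports.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the slides): choose a response charset from Accept-Charset.
class AcceptCharsetPicker {
    // Returns the first charset in the header that this JVM supports, else the fallback.
    static Charset pick(String headerValue, Charset fallback) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return fallback; // e.g. a browser that sends no Accept-Charset at all
        }
        for (String part : headerValue.split(",")) {
            int semi = part.indexOf(';');
            String name = (semi >= 0 ? part.substring(0, semi) : part).trim();
            if (name.isEmpty() || name.equals("*")) {
                continue;
            }
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Not a legal charset name; try the next entry.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(pick("ISO-8859-1, UTF-8", StandardCharsets.UTF_8));
        System.out.println(pick(null, StandardCharsets.UTF_8));
    }
}
```

Passing UTF-8 as the fallback covers browsers that omit the header, which matches the Internet Explorer case described above.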
15
How Browsers Deal with Different Encodings
When a server gets a request, it retrieves pages,
runs programs, then responds. Unless otherwise
specified, it uses the same encoding as the request. If
no charset was sent with the request, it uses a
default. Here's a typical HTTP response:
    HTTP/1.1 200 OK
    Content-Type: text/html; charset=ISO-8859-1

    <html><head><title>hi!</title></head> ...
16
Setting the Character Set of a JSP Response
    <%@ page contentType="text/html; charset=UTF-8" %>
sets the character set to UTF-8. The
default is ISO-8859-1. Some HTML pages have this
sort of thing:
    <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
It gets overridden by the contentType set by the page
directive.
    <?xml version="1.0" encoding="UTF-8"?>
also gets overridden.
17
Internationalization and Localization in your
Application
In Java, different languages are handled by
different resource files. For each language
(locale) you have, a separate resource file
contains the text.

Greetings.properties:
    greetings = Hello
    farewell = Goodbye
    inquiry = How are you?

Greetings_fr.properties:
    greetings = Bonjour
    farewell = Au revoir
    inquiry = Comment allez-vous?

These files go in your classes directory (under
WEB-INF). Your code will determine the locale and
use the appropriate file for the text. Each file
is called a resource bundle. Files which contain
the same messages share a common basename.
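
The lookup can be sketched in plain Java. In a real application ResourceBundle.getBundle("Greetings", locale) would find the .properties files on the classpath. To stay self-contained, the sketch below keeps the two files' contents in strings and picks between them by hand, so the class GreetingsDemo and its methods are illustrative assumptions rather than the usual API call.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical demo (not from the slides): per-locale message lookup.
class GreetingsDemo {
    // Stand-ins for Greetings.properties and Greetings_fr.properties.
    static final String BASE =
            "greetings = Hello\nfarewell = Goodbye\ninquiry = How are you?\n";
    static final String FRENCH =
            "greetings = Bonjour\nfarewell = Au revoir\ninquiry = Comment allez-vous?\n";

    // Picks the French bundle for any French locale, otherwise the base bundle.
    static ResourceBundle bundleFor(Locale locale) {
        String text = "fr".equals(locale.getLanguage()) ? FRENCH : BASE;
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String message(Locale locale, String key) {
        return bundleFor(locale).getString(key);
    }

    public static void main(String[] args) {
        System.out.println(message(Locale.US, "greetings"));
        System.out.println(message(Locale.CANADA_FRENCH, "greetings"));
    }
}
```

The application code asks for a key such as "greetings" and never mentions a language. Adding a new language means adding a file, not changing code.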
18
Defining the Locale
A locale defines a country and language. In JSTL,
the locale can be set in two ways:
1. By hand:
    <fmt:setLocale value="fr_CA, fr_FR" />
This says you'd prefer Canadian French, but will accept
French (France) if there's no resource for Canadian
French. Also good, if you have a bean called
myBean with a property locale:
    <fmt:setLocale value="${myBean.locale}" />
2. You can let JSTL set the locale based on the browser's
HTTP_ACCEPT_LANGUAGE.
Don't forget to include the fmt tags:
    <%@ taglib prefix="fmt" uri="http://java.sun.com/jstl/fmt" %>
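
The "prefer this, but accept that" matching can be sketched with the standard library (Java 8 or later). The class LocaleNegotiation and its choose method are hypothetical names, and this is not how JSTL itself is implemented. The preference string uses hyphenated language tags (fr-CA rather than fr_CA), as in the Accept-Language header.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not from the slides): match preferred locales to available ones.
class LocaleNegotiation {
    // Returns the best available locale for a comma-separated preference list.
    static Locale choose(String preferences, List<Locale> available, Locale fallback) {
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(preferences);
        Locale match = Locale.lookup(ranges, available);
        return match != null ? match : fallback;
    }

    public static void main(String[] args) {
        // Assume the application ships English and French (France) resources only.
        List<Locale> available = Arrays.asList(Locale.ENGLISH, Locale.FRANCE);
        System.out.println(choose("fr-CA,fr-FR", available, Locale.ENGLISH));
        System.out.println(choose("hr,en-US", available, Locale.ENGLISH));
    }
}
```

With only English and French (France) resources available, a request for Canadian French falls through to fr-FR, and a browser asking for Croatian then US English ends up with plain English.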
19
Using the Locale and Resources
Once the locale is set (either automatically or
by hand), use the <fmt:bundle> and <fmt:message>
tags to get the internationalized message:
    <fmt:bundle basename="Greetings">
      Hello: <fmt:message key="greetings" />
      Goodbye: <fmt:message key="farewell" />
    </fmt:bundle>
The fmt tags will internationalize dates too:
    <fmt:formatDate value="${myBean.date}" dateStyle="long" />
If the locale is en_US, the date will be in
month/day/year order. If it's en_GB, the
date will be in day/month/year order.
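
The same locale-dependent ordering can be seen in plain Java. This sketch uses java.time rather than the JSTL tag, and the short style rather than the long one so the numeric order is visible. The class name DateOrder is made up for illustration, and the exact digits printed (two- or four-digit year, for example) depend on the JDK's locale data.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical demo (not from the slides): one date, two locale conventions.
class DateOrder {
    // Formats a date in the short style of the given locale.
    static String shortDate(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2004, 1, 31);
        System.out.println("en_US: " + shortDate(date, Locale.US)); // month first
        System.out.println("en_GB: " + shortDate(date, Locale.UK)); // day first
    }
}
```

A day number above 12, as here, makes the ordering unambiguous when reading the output.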
20
Setting a Default Locale for Your Application
In the web.xml file of your application:
    <context-param>
      <param-name>javax.servlet.jsp.jstl.fmt.fallbackLocale</param-name>
      <param-value>en</param-value>
    </context-param>
21
Resources
  • All the language codes:
    http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt
  • All the country codes:
    http://www.davros.org/misc/iso3166.html
  • The Unicode site: http://www.unicode.org/