Unicode - PowerPoint PPT Presentation

Title: jB5 Author: Arun Tanksali Last modified by: SUNIL KUMAR Created Date: 2/10/2006 4:07:02 AM Document presentation format: On-screen Show Company – PowerPoint PPT presentation

Slides: 26
Provided by: Arun59

Transcript and Presenter's Notes

1
Unicode & W3C - Jataayu Software
  • C. Kumar
  • January 2007

2
Agenda
  • About Jataayu
  • Unicode Encoding
  • W3C Specification for multi-lingual authoring
  • Multilingual WEB Address
  • Indian WEB Sites: an Overview
  • W3C Activity

3
About Jataayu
  • Jataayu was formed with a clear focus on
    delivering solutions for wireless data services
  • Over 60% of the data traffic in Indian mobile
    networks for WAP, Mobile Web and MMS is handled by
    Jataayu products
  • Mobile Device Solution Division focuses on
    wireless data applications such as WAP, MMS,
    SyncML, IMPS, Email, Web Browsing and Download
  • Active participant in OMA, W3C and MWI
  • Over 350 people strong; headquartered in India,
    with a major development center in Bangalore and
    offices in the UK, Singapore, Korea, Taiwan and
    the US

4
Localization - Internationalization
  • Localization (l10n): adaptation of content to meet
    the language, cultural and other requirements of a
    specific target market
  • Internationalization (i18n): design and
    development of content that enables easy
    localization for target audiences that vary in
    culture, region or language
  • The mission of the W3C i18n Activity is to ensure
    that W3C's formats and protocols are usable
    worldwide in all languages and in all writing
    systems

5
Need for Unicode
  • Early character sets were based on 7 bits, giving
    2^7 (i.e. 128) possible characters
  • Adding the 8th bit gave a total of 256 possible
    characters, still not enough for all the European
    languages
  • The code page mechanism helped a little by
    changing the meaning of the upper cells (0xA0 to
    0xFF), but was very complex
  • Other languages need thousands of ideographic
    characters to be available at the same time
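
The code page problem can be made concrete with a small sketch. It decodes the same byte, 0xE9, under three 8-bit code pages; the choice of byte and of the three code pages is an illustrative assumption, not something taken from the slides.

```python
import unicodedata

# One byte value from the "upper cells" of an 8-bit character set.
raw = bytes([0xE9])

# The same byte means a different character under each code page.
as_western = raw.decode("latin-1")     # ISO 8859-1 (Western European)
as_greek = raw.decode("iso8859_7")     # ISO 8859-7 (Greek)
as_cyrillic = raw.decode("cp1251")     # Windows-1251 (Cyrillic)

for codec, ch in [("latin-1", as_western),
                  ("iso8859_7", as_greek),
                  ("cp1251", as_cyrillic)]:
    # Print code point and character name (ASCII-only output).
    print(f"{codec:10s} 0xE9 -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Without knowing which code page was intended, a reader of the byte stream cannot tell which of the three characters the author meant.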

6
Unicode Encoding
  • Unicode is a universal character set that contains
    the characters needed for writing the majority of
    living languages in use on computers
  • Allows for simple display and storage of
    multilingual content
  • An encoding refers to the way that characters
    (code points) in the character set are mapped to
    actual byte values
  • Different encodings yield different byte sequences

7
Unicode Encoding
  • UTF-8 (Unicode Transformation Format)
  • Variable length 8-bit character encoding for
    Unicode
  • Able to represent any universal character in the
    Unicode Standard
  • Uses one to four bytes to encode a Unicode symbol
  • Only one byte is needed to encode the US-ASCII
    characters
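
A minimal sketch of the one-to-four-byte scheme described above, written by hand so the bit layout is visible. The function name `utf8_encode` is hypothetical; in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp):
    """Encode a single Unicode code point as UTF-8 (illustrative only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                      # 1 byte:  0xxxxxxx (US-ASCII)
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes: 11110xxx + 3 continuation
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x0041, 0x05D0, 0x597D, 0x233B4):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ').upper()}")
```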

8
Unicode Encoding
  • UTF-16 (16-bit Unicode Transformation Format)
  • Variable length 16-bit character encoding for
    Unicode
  • Uses a two- or four-byte sequence to encode a
    Unicode symbol
  • Two bytes are required to encode a US-ASCII
    character
  • UCS-2 (2-byte Universal Character Set)
  • Fixed length encoding that always encodes
    characters into a single 16-bit value
  • It can encode characters in the range 0x0000 to
    0xFFFF
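
The difference between UTF-16 and UCS-2 can be sketched as follows: code points up to 0xFFFF fit in one 16-bit unit in both, while UTF-16 splits anything larger into a surrogate pair. The function name `utf16_units` is hypothetical, and the sketch does not validate its input.

```python
def utf16_units(cp):
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if cp <= 0xFFFF:
        # Fits in a single unit; this is also all that UCS-2 can express.
        return [cp]
    v = cp - 0x10000                   # 20 bits remain
    high = 0xD800 | (v >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)         # low 10 bits -> low surrogate
    return [high, low]


for cp in (0x0041, 0x597D, 0x233B4):
    units = utf16_units(cp)
    print(f"U+{cp:04X} -> {' '.join(f'{u:04X}' for u in units)}"
          f" (representable in UCS-2: {len(units) == 1})")
```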

9
Unicode Encoding
  • UCS-4 / UTF-32 (32-bit Unicode Transformation
    Format)
  • Fixed length 32-bit character encoding for
    Unicode
  • Every character uses 4 bytes, which makes it very
    space inefficient
  • Little used in practice, with UTF-8 and UTF-16
    being the normal ways of encoding Unicode text
  • http://www.unicode.org/
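
To put numbers on the space cost, the sketch below measures the same text in each encoding. Both sample strings are assumptions chosen for illustration: an ASCII word and a six-code-point Devanagari word written with escapes.

```python
samples = {
    "ASCII": "Unicode",
    # Six Devanagari code points (U+0939 U+093F U+0928 U+094D U+0926 U+0940).
    "Devanagari": "\u0939\u093f\u0928\u094d\u0926\u0940",
}

sizes = {}
for label, text in samples.items():
    # The "-be" variants are used so that no byte order mark is added.
    sizes[label] = {enc: len(text.encode(enc))
                    for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(label, len(text), "code points ->", sizes[label])
```

For ASCII text UTF-32 is four times the size of UTF-8; for Devanagari text UTF-16 is the most compact of the three.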

10
Unicode Encoding
  • Devanagari (0x0900-0x097F)
  • Bengali (0x0980-0x09FF)
  • Tamil (0x0B80-0x0BFF)
  • Kannada (0x0C80-0x0CFF)

Code Point   U+0041        U+05D0        U+597D        U+233B4
UTF-8        41            D7 90         E5 A5 BD      F0 A3 8E B4
UTF-16       00 41         05 D0         59 7D         D8 4C DF B4
UTF-32       00 00 00 41   00 00 05 D0   00 00 59 7D   00 02 33 B4
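
The byte values in the table can be reproduced with a short sketch using Python's built-in codecs. The big-endian codec variants are an assumption made here to match the byte order shown in the table and to avoid a byte order mark.

```python
code_points = [0x0041, 0x05D0, 0x597D, 0x233B4]


def hex_bytes(cp, encoding):
    """Hex dump of one code point in the given encoding."""
    return chr(cp).encode(encoding).hex(" ").upper()


for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    row = " | ".join(hex_bytes(cp, encoding) for cp in code_points)
    print(f"{encoding:10s} {row}")
```
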
11
Unicode Encoding
  • An alternate way to represent a character is to
    use an escape value (e.g. &#x05D0;)
  • Not all documents have to be encoded as Unicode
  • But documents can only contain characters defined
    by the Unicode Standard
  • Any encoding can be used as long as it is
    properly declared and its character repertoire is
    a subset of Unicode
  • Unicode encodings also allow many more languages
    to be mixed on a single page
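
As a rough illustration of escapes, the sketch below resolves a hexadecimal numeric character reference for U+05D0 and then goes the other way, producing an escape for a document that is stored as plain ASCII. It relies only on Python's standard `html` module and the `xmlcharrefreplace` error handler.

```python
import html

# Resolve a hexadecimal character reference to the real character.
escaped = "&#x05D0;"
ch = html.unescape(escaped)
print(f"{escaped} -> U+{ord(ch):04X}")

# Going the other way: keep the document in ASCII and let the codec
# emit a (decimal) character reference for anything outside ASCII.
back = ch.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(f"U+{ord(ch):04X} -> {back}")
```

The decimal form produced in the second step is equivalent to the hexadecimal form, since 0x05D0 is 1488.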

12
Other Encoding formats
  • Shift_JIS (SJIS), character encoding for the
    Japanese language
  • Single-byte encoding for the lower-ASCII
    characters (0x00-0x7F)
  • Double-byte encoding for characters introduced by
    an upper (above 0x7F) lead byte
  • GB2312, character encoding for simplified Chinese
    characters
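
A small sketch of the single-byte versus double-byte behaviour, using Python's `shift_jis` and `gb2312` codecs. The sample characters (a Latin letter, one hiragana letter, one Chinese ideograph) are assumptions chosen for illustration.

```python
# LATIN CAPITAL LETTER A followed by HIRAGANA LETTER A (U+3042).
japanese = "A\u3042"
sjis = japanese.encode("shift_jis")
print("Shift_JIS:", sjis.hex(" ").upper(), f"({len(sjis)} bytes)")

# One simplified Chinese ideograph (U+4E2D).
chinese = "\u4e2d"
gb = chinese.encode("gb2312")
print("GB2312:   ", gb.hex(" ").upper(), f"({len(gb)} bytes)")

# The ASCII letter keeps its one-byte value; the lead byte of the
# double-byte character is above 0x7F.
print("ASCII byte unchanged:", sjis[0] == ord("A"))
print("lead byte above 0x7F:", sjis[1] > 0x7F)
```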

13
W3C Specification - Encoding
  • W3C specification for multi-lingual authoring
  • The encoding of the document needs to be declared,
    so that the consuming application can interpret
    it
  • Meta tag
  • <meta http-equiv="Content-type"
    content="text/html; charset=UTF-8" />
  • XML
  • <?xml version="1.0" encoding="UTF-8"?>
  • The Content-Type header returned from the Web
    server should also contain the character encoding
    of the document
  • Content-Type: text/html; charset=utf-8
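
On the consuming side, the declared encoding has to be read before the body bytes can be decoded. The sketch below pulls the charset out of a Content-Type header with the standard library's `email.message.Message`. The helper name `charset_from_content_type`, the fallback of ISO-8859-1 and the sample body are assumptions for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


# Body bytes as a server might send them (six Devanagari code points).
body = "\u0939\u093f\u0928\u094d\u0926\u0940".encode("utf-8")

charset = charset_from_content_type("text/html; charset=utf-8")
text = body.decode(charset)
print(charset, len(body), "bytes ->", len(text), "characters")
```

Decoding the same bytes with the fallback charset would succeed without an error but yield 18 unrelated Latin-1 characters, which is why the declaration matters.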

14
W3C Specification - Language
  • The author needs to specify the language of the
    document (web page content)
  • The browser can choose an appropriate font using
    the lang attribute
  • Search engines can group or filter results based
    on the user's linguistic preferences (using meta)
  • Translation tools use it to recognize sections of
    text in a particular language

15
W3C Specification - Language
  • HTTP Content-Language header
  • Content-Language: hi
  • Language attribute on the html tag
  • <html lang="hi">
  • <html xml:lang="hi">
  • Content language in meta tag
  • <meta http-equiv="Content-Language" content="hi"
    />
  • Language attribute on embedded content
  • <div lang="en" xml:lang="en"> Some English
    Content </div>
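
How a tool might pick up these language declarations can be sketched with Python's standard `html.parser`. The class name `LangCollector` and the sample page are hypothetical; the markup mirrors the attributes listed above.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record every element that declares a language."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Prefer lang, fall back to xml:lang when only that is present.
        lang = attributes.get("lang") or attributes.get("xml:lang")
        if lang:
            self.langs.append((tag, lang))


page = ('<html lang="hi"><body>'
        '<div lang="en" xml:lang="en">Some English Content</div>'
        '</body></html>')

collector = LangCollector()
collector.feed(page)
print(collector.langs)
```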

16
What value to use for lang?
  • IANA (Internet Assigned Numbers Authority)
  • Provides a unique value for each language
  • It is available as the Subtag value in the new
    IANA Language Subtag Registry
  • http://www.iana.org/assignments/language-subtag-registry
  • Hindi: hi, Kannada: kn, Tamil: ta

17
Bi-directional text
  • Information in addition to the language attribute
    is required to support right-to-left scripts
    (like Arabic, Hebrew, Urdu)
  • In HTML, the dir attribute is used to specify the
    direction of the text
  • The title says <span dir="rtl">? ? ? ? ? ? ? ?
    ? ? ? ? ? , W3C</span> in Hebrew.
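
Direction is a property of the characters themselves, which is why the language attribute alone is not enough. The sketch below looks up the bidirectional class of a few characters with the standard `unicodedata` module and guesses a base direction from the first strongly directional character. The helper `first_strong_direction` is a simplified, hypothetical heuristic, not the full Unicode bidirectional algorithm.

```python
import unicodedata


def first_strong_direction(text):
    """Guess 'ltr' or 'rtl' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None   # only digits, punctuation or other neutral characters


# Latin A, Hebrew alef (U+05D0), Arabic alef (U+0627), digit one.
for ch in ("A", "\u05d0", "\u0627", "1"):
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)}")

print(first_strong_direction("\u05d0\u05d1, W3C"))
print(first_strong_direction("W3C"))
```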

18
Multilingual WEB Address
  • A Web address is used to point to a resource on
    the Web
  • Web addresses are typically expressed using URIs
    (Uniform Resource Identifiers)
  • URIs are restricted to a small number of
    characters (upper and lower case letters of the
    English alphabet, numerals and a few symbols)
  • Users' expectations and use of the Internet have
    challenged this restriction
  • There is a growing need to use characters from any
    language in Web addresses

19
Multilingual WEB Address
  • A Web address in your own language and alphabet
    is easier to create, memorize, interpret and
    relate to (e.g. http://???.com)
  • Punycode is a way of representing Unicode code
    points using only ASCII characters (e.g.
    http://xn--21bm4l.com)
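
The Punycode label from the slide can be decoded with Python's built-in `punycode` codec, as sketched below. The code prints the decoded code points instead of assuming what the original label was, then encodes them again to confirm the round trip.

```python
ascii_label = "xn--21bm4l"   # the host label from the slide, without ".com"
prefix = "xn--"              # marks a label as Punycode-encoded

# Strip the prefix and decode the remainder with the punycode codec.
encoded_part = ascii_label[len(prefix):]
unicode_label = encoded_part.encode("ascii").decode("punycode")
print([f"U+{ord(ch):04X}" for ch in unicode_label])

# Encoding the Unicode label again gives back the ASCII form.
round_trip = prefix + unicode_label.encode("punycode").decode("ascii")
print(round_trip)
```

The ASCII form is what travels through DNS; a browser that supports internationalized domain names shows the Unicode form to the user.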

20
Indian Content: an Overview
  • Most Indian Web sites are not using Unicode
  • Content is generated within the ASCII range and
    displayed with proprietary fonts that map the
    ASCII character set to Indian-language glyphs
  • Visually it looks fine, but no other software
    will be able to interpret it
  • For each site, the user may need to download the
    proprietary fonts, which is not user friendly
  • Search engines will not be able to interpret the
    content as the author intended, because it does
    not follow a standard encoding
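
Why such content is opaque to software can be sketched by checking which code points a text really contains. The function `devanagari_ratio` is hypothetical, and the `legacy_text` string is a made-up placeholder standing in for font-encoded content; it is not the output of any particular proprietary font.

```python
def devanagari_ratio(text):
    """Share of non-space characters inside the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(0x0900 <= ord(ch) <= 0x097F for ch in chars)
    return hits / len(chars)


# Real Unicode Devanagari (six code points in the 0x0900-0x097F block).
unicode_text = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Placeholder for font-encoded content: to software it is just Latin
# letters, whatever glyphs the proprietary font draws for them.
legacy_text = "fgUnh"

print(devanagari_ratio(unicode_text))
print(devanagari_ratio(legacy_text))
```

A search engine or translation tool sees only the code points, so the second text looks like meaningless Latin letters regardless of how it is displayed.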

21
Indian Content: an Overview

22
Unicode & W3C Importance
  • The Web is also moving towards mobile
  • W3C Mobile Web Initiative (MWI) defines the best
    practices for mobile browsing
  • Required fonts cannot be installed at run-time on
    a mobile device, as is done on the desktop
  • If Unicode characters are used, the required font
    may already be available on the device

23
Firefox
  • Firefox (http://www.getfirefox.com)
  • Provides extensive support for Unicode and
    related fonts
  • Provides add-ons to type in Indian languages in
    web pages on Linux (such tools are already
    available to Windows XP users through the
    language packs)
  • https://addons.mozilla.org/firefox/5484/author/

24
W3C i18n activity
  • Core Working Group
  • Enable universal access to the World Wide Web by
    providing adequate support to other W3C Working
    Groups
  • GEO (Guidelines, Education & Outreach)
  • Make the internationalization aspects of W3C
    technology better understood and more widely and
    consistently used
  • ITS (Internationalization Tag Set)
  • Develop a set of elements and attributes that can
    be used with new DTDs/Schemas to support the
    internationalization and localization of documents

25
Thanks
  • kumarc_at_jataayusoft.com