Unicode and entities - PowerPoint PPT Presentation

Title: Unicode and entities


1
Unicode and entities
  • Jill R. Sommer, Institute for Applied Linguistics
  • Kent State University

2
Unicode and entities
  • Initially, the application of HTML on the World
    Wide Web was seriously restricted by its reliance
    on the ISO-8859-1 coded character set, which is
    appropriate only for Western European languages.
    Despite this restriction, HTML has been widely
    used with other languages, using other coded
    character sets or character encodings, through
    various ad hoc extensions to the language.

3
Once upon a time, there was ASCII
  • If you type some text in a word processor and
    save it, your computer has to convert your
    keystrokes to numbers. The big question is which
    numbers the computer will use.
  • Several decades ago the American Standard Code
    for Information Interchange, or ASCII, emerged as
    a standard. Originally only 7 bits, or 128
    numbers, were used to encode A-Z, a-z, 0-9,
    punctuation marks, some control characters and
    other signs. The computer industry felt this was
    not enough number positions to encode all
    necessary characters, so an extra bit was used.
    Unfortunately, many completely different 8-bit
    extensions of ASCII and other encoding schemes
    were developed, and the confusion remains to this
    day.
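
The 7-bit limit and the later 8-bit confusion can be sketched in a few lines of Python. This is an illustration only: the helper names are made up for this example, and the three code pages are just a sample of the many 8-bit schemes that existed.

```python
# Sketch: 7-bit ASCII covers only code values 0-127; above that, the
# meaning of a byte depends on which 8-bit code page is assumed.
# Both helper names are hypothetical, chosen for this illustration.

def fits_in_ascii(text: str) -> bool:
    """Return True if every character has a code value below 128."""
    return all(ord(ch) < 128 for ch in text)


def byte_in_code_pages(value: int, code_pages: list[str]) -> dict[str, str]:
    """Decode one byte value under each of the given 8-bit code pages."""
    return {name: bytes([value]).decode(name) for name in code_pages}


if __name__ == "__main__":
    print(fits_in_ascii("Hello, world"))   # plain English text fits
    print(fits_in_ascii("Müller"))         # a u-umlaut does not
    # The same byte, three different characters:
    print(byte_in_code_pages(0xE9, ["latin-1", "cp437", "mac_roman"]))
```

The byte 0xE9 comes out as a different character under each code page, which is the confusion the slide describes.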

4
Diacritics and such
  • Why should you worry about encodings? Most people
    write simple texts for which the 7-bit ASCII
    encoding gives excellent results. But many people
    would like to write texts in their native tongue,
    including characters with diacritics (u-umlaut,
    e-acute, n-tilde, etc.), or they would like to
    use special symbols, even as simple as currency
    symbols (yen, florin, peseta, etc.).

5
Diacritics and such
  • Using any special symbol beyond A-Z, a-z, 0-9 or
    some punctuation marks causes horrible problems.
    If you write an email on a Macintosh computer and
    send it to a Windows user, the recipient will
    holler that your stuff is not very readable. If a
    Windows user lays out a beautiful webpage with
    diacritics in the text, the Macintosh reader
    whines that the text is filled with funny
    characters. Matters grow worse if you try to use
    really special characters like the Symbol font in
    mathematical equations or Russian characters.
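
The Macintosh/Windows mismatch can be reproduced with a short Python sketch. It assumes the classic Mac OS Roman code page on one side and Windows code page 1252 on the other; the function name `misread` is invented for this example.

```python
# Sketch: text written under one 8-bit code page and read under another.
# Assumes Mac OS Roman ("mac_roman") and Windows-1252 ("cp1252") as the
# two platform encodings; `misread` is a hypothetical helper name.

def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one code page, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    # A Mac user writes "café"; a Windows user sees a wrong last letter.
    print(misread("café", written_as="mac_roman", read_as="cp1252"))
    # The reverse direction goes wrong too, just differently.
    print(misread("café", written_as="cp1252", read_as="mac_roman"))
```

Only the accented letter is damaged; the plain ASCII letters survive because both code pages agree on the first 128 positions.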

6
Unicode to the rescue
  • Unicode was invented to put an end to all
    encoding difficulties. However, some people think
    Unicode is still not complete and other people
    are not interested in using Unicode by the rules.
  • Matters would be very simple if every font maker
    provided one big font file covering all 65,536
    code positions of Unicode 1.0 - and even more in
    Unicode 3 - which would suffice to display all
    possible characters and symbols in all languages
    of the world.

7
Unicode to the rescue
  • How to encode Unicode characters in a web page?
    Look up the decimal, not the hexadecimal, value
    of the character you would like to encode. Put it
    in the HTML page as follows: &#8364; if you would
    like to encode the Euro sign. Do not forget the
    semicolon - an often made mistake.
  • Between the <head> and </head> tags in the web
    page a charset meta instruction has to be placed:
    <meta http-equiv="Content-Type"
    content="text/html; charset=utf-8">
    Use only 1 charset meta per page and put it in
    the head section.
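
As a rough illustration of the two steps above, the following Python sketch builds the decimal reference for a character and drops it into a page whose head carries a single charset meta. The function name, the `PAGE` template and its sample text are assumptions made up for this example.

```python
# Sketch: build a decimal numeric character reference and place it in a
# page with one charset meta in the head. `decimal_reference` and PAGE
# are hypothetical names used only for this illustration.
import html

PAGE = """<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Price list</title>
</head>
<body><p>Price: {price} 10</p></body>
</html>"""


def decimal_reference(ch: str) -> str:
    """Return the decimal reference for one character, semicolon included."""
    return f"&#{ord(ch)};"


if __name__ == "__main__":
    euro = decimal_reference("€")
    print(euro)                      # the reference to type into the page
    print(html.unescape(euro))       # what a browser would display
    print(PAGE.format(price=euro))
```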

8
Unicode to the rescue
  • Of course every piece of software would have to
    be able to use these Unicode fonts. In that case
    every Macintosh, Windows, Linux, whatever
    computer user would find that the Greek delta is
    at the same place in his font, so he could
    communicate to every other user without any
    shadow of a doubt. Why the computer industry is
    not able or not willing to apply this scheme is
    not clear. Reality is that we are still stuck
    with many incompatible situations.
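
The "same place" idea can be checked in Python: the Greek small letter delta has one Unicode number, but sits at different byte positions in older encodings. The choice of ISO-8859-7 and code page 437 as legacy examples, and the helper name, are assumptions for this sketch.

```python
# Sketch: one Unicode number for delta, several legacy byte positions.
# `positions` is a hypothetical helper; the encodings are sample choices.
import unicodedata


def positions(ch: str, encodings: list[str]) -> dict[str, str]:
    """Return the bytes (as hex) that each encoding uses for the character."""
    return {name: ch.encode(name).hex() for name in encodings}


if __name__ == "__main__":
    delta = "δ"
    print(unicodedata.name(delta), hex(ord(delta)))
    print(positions(delta, ["iso8859_7", "cp437", "utf-8"]))
```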

9
Unicode and Entities
  • HTML is an application of ISO Standard 8879:1986,
    Information Processing - Text and Office Systems -
    Standard Generalized Markup Language (SGML)
    [ISO-8879]. The HTML Document Type Definition
    (DTD) is a formal definition of the HTML syntax
    in terms of SGML. This specification amends the
    DTD of HTML 2.0 in order to make it applicable to
    documents encompassing a character repertoire
    much larger than that of ISO-8859-1, while still
    remaining SGML conformant.

10
Unicode and Entities
  • SGML views the characters of a document as a
    single set (called a "character repertoire"),
    together with a "code set" that assigns an
    integer number (known as the "character number")
    to each character in the repertoire.
  • HTML, as an application of SGML, does not
    directly address the question of the external
    character encoding. This is deferred to
    mechanisms external to HTML, such as MIME as used
    by the HTTP protocol or by electronic mail.

11
Unicode and Entities
  • For the HTTP protocol (RFC 2068), the external
    character encoding is indicated by the "charset"
    parameter of the "Content-Type" field of the
    header of an HTTP response. For example, to
    indicate that the transmitted document is encoded
    in the "JUNET" encoding of Japanese (RFC 1468),
    the header will contain the following line:
    Content-Type: text/html; charset=ISO-2022-JP
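
A receiving program might pick the charset out of that header roughly as follows. This Python sketch uses the standard library's `email.message.Message` to parse the parameter; the function names and the ISO-8859-1 fallback are assumptions for the example, not something prescribed by the slides.

```python
# Sketch: read the charset parameter of a Content-Type value and use it
# to decode the body. `charset_of` and `decode_body` are hypothetical
# helpers; the "iso-8859-1" fallback is an assumption of this example.
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: str,
                default: str = "iso-8859-1") -> str:
    """Decode body bytes with the declared charset, else the default."""
    return body.decode(charset_of(content_type) or default)


if __name__ == "__main__":
    header = "text/html; charset=ISO-2022-JP"
    body = "日本語".encode("iso2022_jp")
    print(charset_of(header))
    print(decode_body(body, header))
```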

12
Unicode and entities
  • The term "charset" in MIME is used to designate a
    character encoding, rather than merely a coded
    character set as the term may suggest. A
    character encoding is a mapping (possibly
    many-to-one) of sequences of octets to sequences
    of characters taken from one or more character
    repertoires.
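
To make "sequences of octets to sequences of characters" concrete, this Python sketch counts how many octets each character occupies under a given encoding. UTF-8 is used as the sample encoding, and the helper name is invented for the illustration.

```python
# Sketch: a character encoding maps octet sequences to characters, and
# one character may need several octets. `octets_per_character` is a
# hypothetical helper; UTF-8 and Latin-1 are sample encodings.

def octets_per_character(text: str, encoding: str) -> list[tuple[str, int]]:
    """Pair each character with the number of octets that encode it."""
    return [(ch, len(ch.encode(encoding))) for ch in text]


if __name__ == "__main__":
    print(octets_per_character("aé€", "utf-8"))
    print("€".encode("utf-8"))            # three octets, one character
    print(octets_per_character("aé", "latin-1"))
```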

13
Unicode and entities
  • The HTTP protocol also defines a mechanism for
    the client to specify the character encodings it
    can accept.
  • Similarly, if HTML documents are transferred by
    electronic mail, the external character encoding
    is defined by the "charset" parameter of the
    "Content-Type" MIME header field (RFC 2045), and
    defaults to US-ASCII in its absence.

14
Displaying Non-Western European languages in HTML
  • Languages that use characters that are not
    included in the ISO-8859-1 Latin-1 character set
    cannot be dealt with properly using plain HTML
    2.0 (which was described in RFC-1866). However,
    there are several methods of handling such
    languages that are "standards-compliant", in that
    they follow the "Internationalized" HTML 2.0
    specification in RFC-2070 and the HTML 4.0
    specification.

15
Displaying Non-Western European Languages in HTML
  • Suppose that one wished to present material
    written in an Eastern European language that
    needs characters from the ISO-8859-2 Latin 2
    character set. Here are the two main approaches:

16
Displaying Non-Western European Languages in HTML
  • Send the following line among the HTTP headers of
    the document sent by the HTTP server (but not as
    part of the document itself):
    Content-Type: text/html; charset=ISO-8859-2
    You can then use the raw 8-bit Latin 2 characters
    in the range 160-255 in your document. However,
    because ISO-8859-2 is technically not the SGML
    "Document Character Set", the apparently
    corresponding numeric entities "&#160;" - "&#255;"
    will not be interpreted as Latin 2 characters, but
    rather with their normal Latin 1 meanings, as
    shown in the chart at the top of this file.
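
The difference between a raw byte and the numeric entity with the same number can be sketched in Python. The example treats `html.unescape` as a stand-in for a browser's entity handling, which is an assumption; the helper name is also made up.

```python
# Sketch: raw byte 163 under ISO-8859-2 versus the numeric entity &#163;.
# `raw_vs_entity` is a hypothetical helper, and html.unescape stands in
# for how a browser resolves the entity.
import html


def raw_vs_entity(value: int) -> tuple[str, str]:
    """Return (raw byte read as Latin 2, numeric entity with that number)."""
    raw = bytes([value]).decode("iso8859_2")
    entity = html.unescape(f"&#{value};")
    return raw, entity


if __name__ == "__main__":
    print(raw_vs_entity(163))          # the two meanings differ
    print(html.unescape("&#321;"))     # the entity that does give L-stroke
```

The raw byte is read through the declared charset, while the entity number is looked up in the document character set, so only the first is a Latin 2 character.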

17
Displaying Non-Western European Languages in HTML
  • A more "multilingual" approach is to use the
    Unicode character set, in which non-ISO-8859-1
    characters occur in positions higher than 255.
    The most radical method of doing this would be to
    use high Unicode characters directly, as part of
    a non-single-byte transfer encoding (such as
    UTF-8 or raw double-byte), accompanied by the
    appropriate HTTP headers. But it should also be
    possible to use numeric entities greater than
    255, even within a simple ASCII or ISO-8859-1
    file, in order to refer to Unicode characters
    that are not included in ISO-8859-1 Latin-1.
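
Python's `xmlcharrefreplace` error handler gives a quick way to try the second method: any character the target encoding cannot hold is written as a decimal numeric entity. The function name and the sample word are assumptions for this sketch.

```python
# Sketch: keep the file in plain ASCII (or ISO-8859-1) and fall back to
# numeric entities for everything else. `to_entities` is a hypothetical
# helper name; the Polish place name is just sample text.
import html


def to_entities(text: str, encoding: str = "ascii") -> bytes:
    """Encode text, replacing unencodable characters with &#NNN; entities."""
    return text.encode(encoding, "xmlcharrefreplace")


if __name__ == "__main__":
    word = "Łódź"
    print(to_entities(word))                   # pure ASCII output
    print(to_entities(word, "latin-1"))        # o-acute stays a raw byte
    print(html.unescape(to_entities(word).decode("ascii")))
```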

18
Numeric ampersand entities
  • Entities begin with & and end with ;
  • Numeric ampersand entities affect your browser.
  • To use these characters in your own HTML files,
    put the appropriate number into "&#__;" (e.g.
    "&#163;" for the British pound (currency) sign),
    or, for the 8-bit alphabetic characters, use the
    alternative standard HTML 2.0 entity.
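
The two spellings of an entity can be looked up with Python's `html.entities` table. Note the assumption: that table follows the HTML 4 entity list, which is larger than the HTML 2.0 set the slide refers to, and the helper name is invented for this example.

```python
# Sketch: numeric and named forms of the same character. The lookup
# table is Python's HTML 4 entity list (an assumption; larger than the
# HTML 2.0 set), and `references_for` is a hypothetical helper.
from html.entities import codepoint2name
from typing import Optional


def references_for(ch: str) -> tuple[str, Optional[str]]:
    """Return (numeric entity, named entity or None) for one character."""
    code = ord(ch)
    name = codepoint2name.get(code)
    return f"&#{code};", (f"&{name};" if name else None)


if __name__ == "__main__":
    for ch in "£éŁ":
        print(ch, references_for(ch))
```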

19
Numeric ampersand entities
  • The named entities "&amp;", "&lt;", and "&gt;"
    should be used to escape the characters "&", "<",
    and ">" in an HTML file, and "&quot;" to escape a
    double-quote character in an attribute value.
  • "&nbsp;" is a non-breaking space and can be quite
    useful.
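
In Python, the standard `html` module performs this escaping. The sketch below is one possible way to apply it; the `attribute` helper and the sample strings are made up for illustration.

```python
# Sketch: escaping markup characters in text and in attribute values
# with the standard html module. `attribute` is a hypothetical helper.
import html


def attribute(name: str, value: str) -> str:
    """Format one HTML attribute, escaping quotes inside the value."""
    return f'{name}="{html.escape(value, quote=True)}"'


if __name__ == "__main__":
    # Text content: only the ampersand and angle brackets need escaping.
    print(html.escape("AT&T <b>", quote=False))
    # Attribute value: the double quote must be escaped as well.
    print(attribute("title", 'say "hi"'))
    # The non-breaking space entity resolves to U+00A0.
    print(repr(html.unescape("10&nbsp;km")))
```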