Unicode and entities - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Unicode and entities

1
Unicode and entities

Jill R. Sommer
Institute for Applied Linguistics
Kent State University

2
Unicode and entities

Initially, the application of HTML on the World
Wide Web was seriously restricted by its reliance
on the ISO-8859-1 coded character set, which is
appropriate only for Western European languages.
Despite this restriction, HTML has been widely
used with other languages, using other coded
character sets or character encodings, through
various ad hoc extensions to the language.

3
Once upon a time, there was ASCII

If you type some text in a word processor and
save it, your computer has to convert your
keystrokes to numbers. A big problem is which
numbers the computer will use. Several decades
ago the American Standard Code for Information
Interchange, or ASCII, emerged as a standard.
Originally only 7 bits, or 128 numbers, were used
to encode A-Z, a-z, 0-9, punctuation marks, some
control characters and other signs. The computer
industry felt these were not enough number
positions to encode all necessary characters, so
an extra bit was used. Unfortunately, many
completely different 8-bit extensions of ASCII
and other encoding schemes were developed, and
the confusion remains to this day.
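
As a small illustration of that confusion, here is a sketch in Python (not part of the original slides). The codec names are Python's own, and the four encodings are picked only as examples. It decodes one and the same byte value under several 8-bit schemes and compares that with a plain 7-bit character.

```python
# One byte value, several meanings: the 7-bit range is shared,
# but the values added by the eighth bit are not.
SAMPLE_BYTE = bytes([0xE9])  # 233, outside the 7-bit range 0-127

# Illustrative selection only; these are Python codec names.
ENCODINGS = ["latin-1", "mac_roman", "cp437", "iso-8859-5"]


def decode_everywhere(raw: bytes) -> dict:
    """Return the text each encoding produces for the same raw bytes."""
    return {name: raw.decode(name) for name in ENCODINGS}


results = decode_everywhere(SAMPLE_BYTE)
for name, text in results.items():
    print(f"{name:12} -> {text!r}")

# A 7-bit ASCII byte, by contrast, decodes identically in all of them.
ascii_results = decode_everywhere(b"A")
print(ascii_results)
```

The byte that a Western European (Latin-1) system shows as e-acute comes out as a different character under each of the other schemes, while the letter A stays the same everywhere.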

4
Diacritics and such

Why should you worry about encodings? Most people
write simple texts for which the 7-bit ASCII
encoding gives excellent results. But many people
would like to write texts in their native tongue,
including characters with diacritics (u-umlaut,
e-acute, n-tilde, etc.), or they would like to
use special symbols, even as simple as currency
symbols (yen, florin, peseta, etc.).

5
Diacritics and such

Using any special symbol beyond A-Z, a-z, 0-9 or
some punctuation marks gives horrible problems.
If you write an email on a Macintosh computer and
send it to a Windows user, the recipient will
holler that your stuff is not very readable. If a
Windows user lays out a beautiful webpage with
diacritics in the text, the Macintosh reader
whines that the text is filled with funny
characters. Matters grow worse if you try to use
really special characters like the Symbol font in
mathematical equations or Russian characters.
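
The Macintosh-to-Windows case can be reproduced in a few lines. This is only a sketch: it assumes the sender's program stored the text in the classic Mac OS Roman encoding and the recipient's program read it as Windows code page 1252. Both are Python codec names chosen for illustration; the slides do not name specific encodings.

```python
# Text written on a classic Macintosh, opened on Windows.
original = "M\u00fcller"  # "Mueller" spelled with u-umlaut

# Bytes as a Mac program might store them (assumed: Mac OS Roman).
raw = original.encode("mac_roman")

# The Windows program assumes its own code page (assumed: cp1252).
as_seen_on_windows = raw.decode("cp1252")

print(original, "->", as_seen_on_windows)

# Pure 7-bit ASCII text survives the same trip unchanged.
plain = "Muller"
plain_round_trip = plain.encode("mac_roman").decode("cp1252")
print(plain, "->", plain_round_trip)
```

Only the character with the diacritic is damaged. The ASCII letters around it arrive intact, which is why such messages look "almost right" but are filled with funny characters.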

6
Unicode to the rescue

Unicode was invented to put an end to all
encoding difficulties. However, some people think
Unicode is still not complete and other people
are not interested in using Unicode by the rules.
Matters would be very simple if every font maker
provided a big font file with all 65,536
possible Unicode 1.0 characters - and even more
in Unicode 3, which would suffice to display all
possible characters and symbols in all languages
of the world.

7
Unicode to the rescue

How to encode Unicode characters in a web page?
Look up the decimal, not the hexadecimal, value
of the character you would like to encode. Put
it in the HTML page as follows: &#8364; if you
would like to encode the Euro sign. Do not forget
the semicolon - an often made mistake.
Between the <head> and </head> tags in the web
page a charset meta instruction has to be placed:
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8">
Use only one charset meta per page and put it in
the head section.
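
Looking up the decimal value by hand can be replaced by a short script. The following Python sketch is not from the original slides, and the helper name decimal_reference is made up for this example. It builds the numeric reference for a character and checks that it decodes back to the same character.

```python
import html


def decimal_reference(char: str) -> str:
    """Build a decimal numeric character reference for one character."""
    # ord() gives the decimal Unicode value; the trailing semicolon
    # is part of the reference and must not be left out.
    return f"&#{ord(char)};"


euro_ref = decimal_reference("\u20ac")  # the Euro sign
print(euro_ref)  # &#8364;

# html.unescape() turns the reference back into the character.
decoded = html.unescape(euro_ref)
print(decoded)
```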

8
Unicode to the rescue

Of course every piece of software would have to
be able to use these Unicode fonts. In that case
every Macintosh, Windows, Linux, whatever
computer user would find that the Greek delta is
at the same place in his font, so he could
communicate with every other user without any
shadow of a doubt. Why the computer industry is
not able or not willing to apply this scheme is
not clear. Reality is that we are still stuck
with many incompatible situations.

9
Unicode and Entities

HTML is an application of ISO Standard 8879:1986,
Information Processing - Text and Office Systems -
Standard Generalized Markup Language (SGML)
[ISO-8879]. The HTML Document Type Definition
(DTD) is a formal definition of the HTML syntax
in terms of SGML. RFC 2070 amends the
DTD of HTML 2.0 in order to make it applicable to
documents encompassing a character repertoire
much larger than that of ISO-8859-1, while still
remaining SGML conformant.

10
Unicode and Entities

SGML views the characters as a single set (called
a "character repertoire"), and a "code set" that
assigns an integer number (known as "character
number") to each character in the repertoire.
HTML, as an application of SGML, does not
directly address the question of the external
character encoding. This is deferred to
mechanisms external to HTML, such as MIME as used
by the HTTP protocol or by electronic mail.

11
Unicode and Entities

For the HTTP protocol [RFC2068], the external
character encoding is indicated by the "charset"
parameter of the "Content-Type" field of the
header of an HTTP response. For example, to
indicate that the transmitted document is encoded
in the "JUNET" encoding of Japanese [RFC1468],
the header will contain the following line:
Content-Type: text/html; charset=ISO-2022-JP
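
A receiving program has to pull the charset parameter out of that header before it can turn the octets into text. The Python sketch below is one way to do this with the standard library's MIME header parser. The function name and the ISO-8859-1 fallback are assumptions made for this example; the slides do not prescribe a parsing method.

```python
from email.message import Message


def charset_from_content_type(header_value: str,
                              default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value.

    The default is an assumption for this sketch; it is returned
    when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter in lower case.
    return msg.get_content_charset(default)


header = "text/html; charset=ISO-2022-JP"
charset = charset_from_content_type(header)
print(charset)

# Simulate the transmitted octets and decode them as announced.
body_bytes = "\u65e5\u672c\u8a9e".encode("iso-2022-jp")  # "Japanese language"
body_text = body_bytes.decode(charset)
print(body_text)
```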

12
Unicode and entities

The term "charset" in MIME is used to designate a
character encoding, rather than merely a coded
character set as the term may suggest. A
character encoding is a mapping (possibly
many-to-one) of sequences of octets to sequences
of characters taken from one or more character
repertoires.
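
The "possibly many-to-one" part of that definition can be seen with UTF-16, used here purely as an illustration because it is not an encoding the slide mentions. In this Python sketch, two different octet sequences decode to the same character, and a single character may take more than one octet.

```python
# Two different octet sequences labelled with the same "charset" ...
little_endian = b"\xff\xfeA\x00"  # byte order mark, then "A" low byte first
big_endian = b"\xfe\xff\x00A"     # byte order mark, then "A" high byte first

# ... map to one and the same sequence of characters.
text_le = little_endian.decode("utf-16")
text_be = big_endian.decode("utf-16")
print(text_le, text_be)

# Sequences of octets, not single octets: one character, two octets.
e_acute_octets = "\u00e9".encode("utf-8")
print(len(e_acute_octets))
```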

13
Unicode and entities

The HTTP protocol also defines a mechanism for
the client to specify the character encodings it
can accept.
Similarly, if HTML documents are transferred by
electronic mail, the external character encoding
is defined by the "charset" parameter of the
"Content-Type" MIME header field [RFC2045], and
defaults to US-ASCII in its absence.
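
The mail default can be sketched with Python's standard e-mail parser. The two messages below are invented for this example. The first carries no charset parameter, so the program falls back to US-ASCII; the second states its encoding explicitly.

```python
from email import message_from_string

# Invented sample messages; only the headers matter here.
unlabelled_mail = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>Hello</p>\n"
)
labelled_mail = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=ISO-8859-2\n"
    "\n"
    "<p>Hello</p>\n"
)


def mail_charset(raw_message: str) -> str:
    """Return the declared charset, or US-ASCII when none is given."""
    msg = message_from_string(raw_message)
    return msg.get_content_charset("us-ascii")


default_charset = mail_charset(unlabelled_mail)
declared_charset = mail_charset(labelled_mail)
print(default_charset, declared_charset)
```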

14
Displaying Non-Western European languages in HTML

Languages that use characters that are not
included in the ISO-8859-1 Latin-1 character set
cannot be dealt with properly using plain HTML
2.0 (which was described in RFC-1866). However,
there are several methods of handling such
languages that are "standards-compliant", in that
they follow the "Internationalized" HTML 2.0
specification in RFC-2070 and the HTML 4.0
specification.

15
Displaying Non-Western European Languages in HTML

Suppose that one wished to present material
written in an Eastern European language that
needs characters from the ISO-8859-2 Latin 2
character set. Here are the two main approaches:

16
Displaying Non-Western European Languages in HTML

Send the following line among the HTTP headers of
the document sent by the HTTP server (but not as
part of the document itself):
Content-Type: text/html; charset=ISO-8859-2
You can then use the raw 8-bit Latin 2 characters
in the range 160-255 in your document (but
because ISO-8859-2 is technically not the SGML
"Document Character Set", the apparently
corresponding numeric entities "&#160;" -
"&#255;" will not be properly interpreted as
Latin 2 characters, but rather with their normal
Latin 1 meanings, as shown in the chart at the
top of this file).
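
The difference between a raw byte and the numeric entity with the same number can be checked directly. In this Python sketch the value 177 is picked only as an example. The raw byte is read according to the announced encoding, while the entity keeps its fixed document-character-set meaning.

```python
import html

# Raw octet 177 in a document served as ISO-8859-2 (Latin 2).
raw_byte = bytes([177])
as_latin2 = raw_byte.decode("iso-8859-2")

# The numeric entity with the same number is resolved independently
# of the transfer encoding and keeps its Latin 1 meaning.
as_reference = html.unescape("&#177;")

# The reference that really names the Latin 2 letter uses its
# Unicode number instead.
correct_reference = html.unescape("&#261;")

print(repr(as_latin2), repr(as_reference), repr(correct_reference))
```

The raw byte yields a-ogonek, the entity with the same number yields the plus-minus sign, and only the reference with number 261 matches the raw Latin 2 character.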

17
Displaying Non-Western European Languages in HTML

A more "multilingual" approach is to use the
Unicode character set, in which non-ISO-8859-1
characters occur in positions higher than 255.
The most radical method of doing this would be to
use high Unicode characters directly, as part of
a non-single-byte transfer encoding (such as
UTF-8 or raw double-byte), accompanied by the
appropriate HTTP headers. But it should also be
possible to use numeric entities greater than
255, even within a simple ASCII or ISO-8859-1
file, in order to refer to Unicode characters
that are not included in ISO-8859-1 Latin-1.
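
Both routes can be tried side by side. The Python sketch below is not from the original slides, and the Polish sample sentence is just test data. It writes the same text once as UTF-8 octets and once as a pure ASCII file in which every non-ASCII character becomes a numeric entity.

```python
import html

text = "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"

# Route 1: send the characters directly in a multi-byte encoding.
utf8_bytes = text.encode("utf-8")

# Route 2: keep the file plain ASCII and let Python's
# "xmlcharrefreplace" error handler emit decimal numeric entities.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes.decode("ascii"))

# A browser-like decoding step restores the original text.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == text)
```

The ASCII route survives any transport that is safe for 7-bit text, at the cost of a larger and less readable source file.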

18
Numeric ampersand entities

Entities begin with "&" and end with ";".
Numeric ampersand entities affect your browser.
To use these characters in your own HTML files,
put the appropriate number into &#__; (e.g.
"&#163;" for the British pound (currency) sign),
or, for the 8-bit alphabetic characters, use the
alternative standard HTML 2.0 entity.
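
The numeric form and the named form can be compared in Python. This is a sketch using the standard library's entity table; the names pound and eacute are ordinary HTML entity names picked as examples.

```python
import html
from html.entities import name2codepoint

# Numeric and named references for the pound sign decode identically.
pound_numeric = html.unescape("&#163;")
pound_named = html.unescape("&pound;")
print(pound_numeric, pound_named)

# The entity table maps a name to the number used in the numeric form.
eacute_code = name2codepoint["eacute"]
eacute_reference = f"&#{eacute_code};"
print(eacute_reference)
```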

19
Numeric ampersand entities

Use "&amp;", "&lt;", and "&gt;" to escape the
characters &, <, and > in an HTML file,
and "&quot;" to escape a double-quote character
in an attribute value.
&nbsp; is a non-breaking space and can be quite
useful.
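
Escaping by hand is error-prone, so it is usually left to a library. The Python sketch below uses the standard html module. The sample strings are invented, and the img tag is only there to show an attribute value.

```python
import html

# Text content: escape &, < and > but leave quotes alone.
text_content = html.escape("if a < b && b > c", quote=False)
print(text_content)

# Attribute value: double quotes must be escaped as well.
attribute_value = html.escape('say "hi"', quote=True)
tag = f'<img alt="{attribute_value}">'
print(tag)

# &nbsp; decodes to the non-breaking space, character number 160.
nbsp = html.unescape("&nbsp;")
print(ord(nbsp))
```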
