Software Internationalization

1
15th International Unicode Conference, August/September 1999
Keys to Building a Multilingual Search Engine
Thierry Sourbier
2
Search Engine Overview
  • Client-Side (browser)
  • How to make the best use of the browsers when
    dealing with multiple languages
  • Server-Side
  • How to provide efficient multilingual information
    retrieval

[Diagram labels: Create index, HTTP, Submit query, Process query, Display results]
3
Overview of the Server-side
  • Index creation steps
  • Normalization
  • gives the pages a standard format
  • Segmentation
  • breaks the pages into units that will be stored in
    the index
  • Index building
  • Query processing steps
  • Normalization
  • makes sure that the query has the same format as
    the indexed pages
  • Segmentation
  • breaks the query into units that will be looked up
    in the index
  • Index search

Typically only Normalization and Segmentation are
language dependent. The goal is to reduce these
dependencies as much as possible.
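The two pipelines above share their first two steps, which can be sketched as follows. This is a minimal illustration, not the presenter's implementation: `normalize`, `segment`, `build_index`, and `search` are hypothetical names, and the placeholder normalization (lowercasing) and segmentation (whitespace splitting) stand in for the language-dependent steps.

```python
from collections import defaultdict

def normalize(text):
    # Placeholder normalization (assumption): lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def segment(text):
    # Placeholder segmentation (assumption): split on whitespace.
    return text.split()

def build_index(pages):
    """Index creation: map each unit to the ids of the pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for unit in segment(normalize(text)):
            index[unit].add(page_id)
    return index

def search(index, query):
    """Query processing: same normalization and segmentation, then lookup."""
    units = segment(normalize(query))
    if not units:
        return set()
    results = set(index.get(units[0], set()))
    for unit in units[1:]:
        results &= index.get(unit, set())
    return results

pages = {1: "Unicode  Conference", 2: "Search engine overview"}
index = build_index(pages)
print(search(index, "unicode CONFERENCE"))
```

Because the query goes through the same `normalize` and `segment` functions as the pages, swapping in a different segmentation changes both sides consistently.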
4
Multilingual Normalization
  • Normalizing the character encoding
  • One size fits all: Unicode
  • Removing the unnecessary
  • HTML tags, extra white spaces, etc.
  • Character normalization
  • Mapping together characters that have the same
    meaning
  • Locale dependent
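The three normalization steps might look like this in Python. It is a rough sketch under stated assumptions: the function name `normalize_page` is hypothetical, the tag stripping is a naive regular expression rather than a real HTML parser, and NFKC plus case folding is used as a locale-independent stand-in for the locale-dependent character mapping the slide refers to.

```python
import html
import re
import unicodedata

def normalize_page(raw_bytes, charset):
    # 1. Normalize the character encoding: decode everything to Unicode.
    text = raw_bytes.decode(charset, errors="replace")
    # 2. Remove the unnecessary: strip tags (naively), decode entities,
    #    and collapse runs of white space.
    text = re.sub(r"<[^>]*>", " ", text)
    text = html.unescape(text)
    text = " ".join(text.split())
    # 3. Character normalization (assumption): NFKC maps compatibility
    #    variants such as fullwidth letters together; casefold removes case.
    return unicodedata.normalize("NFKC", text).casefold()

print(normalize_page(b"<b>Caf\xe9</b>   &amp; bar", "iso-8859-1"))
print(normalize_page("<p>Ｕｎｉｃｏｄｅ   Conference</p>".encode("shift_jis"), "shift_jis"))
```

A production system would replace step 3 with per-locale mapping tables, since which characters "have the same meaning" differs between languages.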

5
Multilingual Segmentation
  • Linguistic features can't be used
  • Too complex and/or costly to implement
  • Relying on N-Grams
  • N-Gram: a sequence of N contiguous characters
  • N-Grams may overlap
  • example with N=4
  • "unicode conference" => "unic", "nico", "icod", "code",
    "de c", "e co", " con", ...
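Overlapping N-gram segmentation fits in a few lines. The sketch below assumes a window that slides one character at a time (so it also emits "ode " between "code" and "de c"); the function name `ngrams` is illustrative.

```python
def ngrams(text, n=4):
    """Return every overlapping run of n contiguous characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

for gram in ngrams("unicode conference")[:8]:
    print(repr(gram))
```

Spaces are kept as ordinary characters here, which is why some units span the word boundary.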

6
N-Grams Advantages
  • Advantages
  • Simple to implement
  • Increased tolerance for typos
  • Free morphology
  • Language independent
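The tolerance for typos can be illustrated by measuring how many of a query's N-grams still occur in a document. This is only a sketch: the `overlap` score is an assumed, simplistic measure chosen for illustration, not a ranking function from the talk.

```python
def ngram_set(text, n=4):
    """Set of overlapping n-character units in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap(query, document, n=4):
    """Fraction of the query's n-grams that also occur in the document."""
    query_grams = ngram_set(query, n)
    if not query_grams:
        return 0.0
    return len(query_grams & ngram_set(document, n)) / len(query_grams)

# A misspelled query still shares half of its 4-grams with the page.
print(overlap("confrence", "unicode conference"))
print(overlap("conference", "unicode conference"))
```

A word-based index would find nothing for the misspelled query, whereas the N-gram units "conf", "renc", and "ence" still match.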

7
N-Grams Disadvantages
  • Disadvantages
  • Index is bigger
  • Minimum query length is N characters
  • a shorter query will yield no results
  • May introduce noise
  • sometimes the system may be too tolerant (e.g. a
    query for "standing" may return pages
    containing "understand")
  • Not as good as a linguistics-based IR system
  • no explicit word normalization possible

8
What value should N have?
  • N is language dependent
  • Typically we use a value between 1 and 6
  • A larger N-gram size improves quality, but reduces
    tolerance and increases the minimum query size
  • Some languages may require more than one N-Gram
    size
  • Japanese example
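Indexing with more than one N-gram size could be sketched as below. The sizes 1 and 2 are an assumption picked for illustration (the slide does not say which sizes were used for Japanese), and `multi_ngrams` is a hypothetical name.

```python
def multi_ngrams(text, sizes=(1, 2)):
    """Emit overlapping n-grams for each requested size, smallest first."""
    grams = []
    for n in sizes:
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print(multi_ngrams("日本語"))
```

With several sizes in the index, the minimum query length drops to the smallest size, at the cost of a larger index.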

9
Client-side
  • Must be compatible with most browsers
  • We restrict ourselves to HTML
  • We use the standard encodings for each language
    for our pages
  • many people still use browsers that are not
    Unicode friendly
  • this makes content editing easier

10
Using a FORM
  • The parameters of the query are passed via the
    URL to a CGI script
  • e.g. http://www.my_site.com/my_script?query=%22San Jose%22
  • What is the charset of the data sent back from
    the client?
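For an ASCII-only query, building and parsing such a URL is unambiguous. The sketch below uses Python's standard `urllib.parse`; the host and script name are the placeholders from the slide, and note that `urlencode` writes the space as `+`, as form submissions conventionally do.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Build the query string the way a form submission would.
url = "http://www.my_site.com/my_script?" + urlencode({"query": '"San Jose"'})
print(url)

# On the server side, split the URL and decode the parameters.
params = parse_qs(urlsplit(url).query)
print(params["query"][0])
```

The percent-escapes only carry bytes. For non-ASCII input the script still has to know which charset those bytes are in, which is exactly the question the slide raises.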

11
URL Encoding Issues
  • Different browsers have different behaviors
  • Example: a Japanese query
  • Could be submitted to the server as
  • ...search.pl?Query=%93%FA%96%7B%8C%EA
  • Or by another browser as
  • ...search.pl?Query=%26%2326085%3B%26%2326412%3B%26%2335486%3B
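Both behaviors can be reproduced with `urllib.parse.quote`, assuming the query was 日本語: one browser percent-encodes the Shift_JIS bytes, the other first turns each character into a numeric character reference (`&#NNNNN;`) and then percent-encodes that ASCII text. The variable names are illustrative.

```python
from urllib.parse import quote

query = "日本語"

# Browser A: percent-encode the raw Shift_JIS bytes of the query.
as_shift_jis = quote(query, encoding="shift_jis")

# Browser B: replace each character with a decimal numeric character
# reference, then percent-encode the resulting "&", "#" and ";".
as_ncr = quote("".join(f"&#{ord(ch)};" for ch in query), safe="")

print("...search.pl?Query=" + as_shift_jis)
print("...search.pl?Query=" + as_ncr)
```

The server receives two completely different byte sequences for the same user input, so it cannot decode the query without extra information.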

12
FORM and CGI
  • The server tells the client which encoding to use
    at the HTTP level
  • <HTML>
  • <HEAD>
  • <META HTTP-EQUIV="content-type"
    CONTENT="text/html; charset=...">
  • </HEAD>
  • ...
  • </HTML>
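On the server, declaring the charset might look like the sketch below. The locale-to-charset table and the `render` function are hypothetical; the point is only that the HTTP header, the META tag, and the bytes actually sent all agree.

```python
# Hypothetical mapping from page locale to the legacy charset used for it.
CHARSET_BY_LOCALE = {"ja": "shift_jis", "fr": "iso-8859-1"}

def render(locale, body_text):
    """Return (headers, body bytes) for a page in the locale's charset."""
    charset = CHARSET_BY_LOCALE.get(locale, "utf-8")
    page = (
        "<HTML>\n<HEAD>\n"
        f'<META HTTP-EQUIV="content-type" CONTENT="text/html; charset={charset}">\n'
        "</HEAD>\n<BODY>" + body_text + "</BODY>\n</HTML>\n"
    )
    headers = {"Content-Type": f"text/html; charset={charset}"}
    return headers, page.encode(charset)

headers, body = render("ja", "日本語")
print(headers["Content-Type"])
```

A browser that honors this declaration will typically submit form data in the same charset, though as the previous slide shows this cannot be relied on.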

13
FORM and CGI
  • The client returns the information to the script
    using the Private FORM/CGI Protocol
  • A hidden form field adds a parameter to the
    query which identifies the locale
  • <form>
  • ...
  • <input type=hidden name=Locale value=ja>
  • ...
  • </form>
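On the script side, the hidden Locale field can drive the decoding of the other parameters. This is a sketch under assumptions: the locale-to-charset table is hypothetical, the parameter names `Query` and `Locale` follow the slides, and the first pass uses Latin-1 only because it maps every byte to a character and so never fails.

```python
from urllib.parse import parse_qs

# Hypothetical mapping from the hidden Locale value to the page's charset.
CHARSET_BY_LOCALE = {"ja": "shift_jis", "fr": "iso-8859-1"}

def decode_query(query_string):
    """Return (locale, query text) decoded with the locale's charset."""
    # First pass: read the ASCII-only Locale field.
    raw = parse_qs(query_string, encoding="latin-1")
    locale = raw.get("Locale", [""])[0]
    charset = CHARSET_BY_LOCALE.get(locale, "utf-8")
    # Second pass: decode the percent-escaped bytes with the right charset.
    decoded = parse_qs(query_string, encoding=charset, errors="replace")
    return locale, decoded.get("Query", [""])[0]

print(decode_query("Query=%93%FA%96%7B%8C%EA&Locale=ja"))
```

The locale identifies which page the form came from, and therefore which charset the browser was told to use.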

14
Displaying the Results
  • Simple if only one code set per page is required
  • For multilingual content
  • use UTF-8
  • use multiple frames
  • Unexpected browser behavior
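Why a mixed-language result page needs UTF-8 can be checked directly: no single legacy charset covers all the results. The sample strings and the helper `can_encode` are assumptions made for this sketch.

```python
def can_encode(text, charset):
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# A result page mixing French, Japanese and Korean hits.
results = ["café", "日本語", "한국어"]
page = "\n".join(results)

for charset in ("iso-8859-1", "shift_jis", "utf-8"):
    print(charset, can_encode(page, charset))
```

When all hits share one language, the page can stay in that language's standard encoding, which is the simple case the slide mentions first.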

15
Conclusion
  • Solutions exist to provide a robust multilingual
    search engine
  • Code set issues on the client side can be a
    limitation, but this will soon disappear as more and
    more people use UTF-8 friendly browsers

16
Q&A
Thierry Sourbier, Software Developer, tsourbier_at_research.intl.com