S. T. Nandasara Lecturer - PowerPoint PPT Presentation

Description:

... Asian Languages on the web. To describe the state of multilingualism in Asian ... To create a network of qualified Asian partners to specify and support the ... – PowerPoint PPT presentation

Slides: 30
Provided by: shakranget
Learn more at: http://portal.unesco.org

Transcript and Presenter's Notes


1
Asian Languages on the Web
S. T. Nandasara, Lecturer, UCSC, University of Colombo, Sri Lanka
Ashu Marasinghe, Associate Professor, LOP, Nagaoka University of Technology, Japan
Yoshiki Mikami, Professor, Leader, LOP, Nagaoka University of Technology, Japan
2
Asian Languages on the Web
  • Introduction of Asian Languages
  • Survey Objectives and Methodology
  • Asian Language Presence on the Web
  • Multilingualism in the Asian Web
  • Script and Encoding Issues
  • Asian Language Resource Network (ALRN) Project

3
Survey Objectives
  • Give an overview of Asian languages on the web
  • Describe the state of multilingualism in Asian
    country domains
  • Multilingualism is defined at various levels, from
    a personal or document level to a societal level
  • Multiple language presence in each country domain
  • Give an overview of cross-border languages
  • Shed light on script and encoding issues of
    Asian languages
  • To what extent is UCS/Unicode employed for Asian
    languages?
  • What scripts are actually used to represent a
    specific language?
  • To what extent are locally developed encodings
    used?
  • Define a future agenda that can guide us in
    realizing the vision of creating an
    observation-collection instrument for Asian
    languages

4
Survey Methodology
  • A web crawler (UbiCrawler) was used
  • It traces links within pages and recursively
    crawls to gather the newly discovered pages
  • The collection of downloaded web pages was passed
    to the language identification engine
  • The language properties of the pages were
    identified
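
The link-tracing step above can be sketched as a breadth-first crawl. This is only an illustration, not UbiCrawler's actual implementation: the in-memory `TOY_WEB` link map, the `cctld` helper and the rule of following only links inside the allowed country domains are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> URLs linked from that page.
TOY_WEB = {
    "http://a.lk/": ["http://a.lk/news", "http://b.lk/", "http://example.com/"],
    "http://a.lk/news": ["http://a.lk/"],
    "http://b.lk/": ["http://c.np/"],
    "http://c.np/": [],
    "http://example.com/": ["http://d.lk/"],
}


def cctld(url):
    """Return the last label of the host name, e.g. 'lk' for http://a.lk/."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def crawl(seeds, allowed_cctlds, fetch_links):
    """Visit pages breadth-first, following only links in allowed ccTLDs."""
    seen = set()
    queue = deque()
    for url in seeds:
        if url not in seen and cctld(url) in allowed_cctlds:
            seen.add(url)
            queue.append(url)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real crawler would download the page here
        for link in fetch_links(url):
            if link not in seen and cctld(link) in allowed_cctlds:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl(["http://a.lk/"], {"lk", "np"}, lambda u: TOY_WEB.get(u, []))
print(pages)
```

In this toy run the page on example.com is never fetched, so the link it holds to d.lk is never discovered. A domain-restricted crawl therefore depends on a broad seed list.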

5
Web Pages Collected
  • Focused on web pages in 42 country domains in
    Asia.
  • The crawl was begun from a seed file containing
    13,286 URLs
  • The list of ccTLDs contains ae, af, az, bd, bh,
    bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw,
    kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk,
    ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn
    and ye.
  • The Asia crawl started on 5 July 2006 at
    11:00 and ended on 19 July 2006 at 19:03
  • Downloaded 107,141,679 web pages in total,
    652,710,237,381 bytes in size
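
A quick arithmetic check of the figures on this slide: the number of ccTLDs listed, the crawl duration, the average page size and the average download rate. The constants are copied from the slide; the variable names are ours, and the derived averages are not quoted in the presentation.

```python
from datetime import datetime

# Values taken from the slide.
CCTLDS = (
    "ae af az bd bh bn bt cy id il in iq ir jo kg kh kw kz la lb lk "
    "mm mn mv my np om ph pk ps qa sa sg sy th tj tm tp tr uz vn ye"
).split()
PAGES = 107_141_679
TOTAL_BYTES = 652_710_237_381
START = datetime(2006, 7, 5, 11, 0)
END = datetime(2006, 7, 19, 19, 3)

duration = END - START
avg_page_bytes = TOTAL_BYTES / PAGES
pages_per_second = PAGES / duration.total_seconds()

print("ccTLDs listed:", len(CCTLDS))
print("crawl duration:", duration)
print("average page size (bytes):", round(avg_page_bytes))
print("average pages per second:", round(pages_per_second, 1))
```

The list does contain 42 ccTLDs. The crawl ran for 14 days, 8 hours and 3 minutes, which works out to about 6,092 bytes per page and roughly 86.5 pages per second.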

6
Downloaded Pages by ccTLD Top 10
7
Downloaded Pages by ccTLD Least 10
8
Language Identification Process
  • The language identification engine LIM (Language
    Identification Module) was used
  • LIM consists of two components
  • The first is the training component
  • The training data are translations of the
    Universal Declaration of Human Rights (UDHR)
    provided by the Office of the United Nations High
    Commissioner for Human Rights
  • The second is the identification component
  • LIM can simultaneously detect the triplet of
    language, script and encoding scheme
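
The sketch below illustrates the two-component idea of training on reference text and then identifying a (language, script, encoding) triplet. It is a toy byte-trigram matcher, not LIM's actual algorithm. The class name `ToyIdentifier`, the overlap score and the one-sentence training samples (short sentences modeled on UDHR Article 1) are all assumptions made for illustration.

```python
from collections import Counter


def byte_ngrams(data, n=3):
    """Count overlapping byte n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


class ToyIdentifier:
    """Toy stand-in for a trainer plus identifier; not LIM itself."""

    def __init__(self, n=3):
        self.n = n
        self.profiles = {}  # (language, script, encoding) -> n-gram counts

    def train(self, language, script, encoding, text):
        """Training component: store the n-gram profile of the encoded text."""
        key = (language, script, encoding)
        self.profiles[key] = byte_ngrams(text.encode(encoding), self.n)

    def identify(self, data):
        """Identification component: return the best-overlapping triplet."""
        grams = byte_ngrams(data, self.n)

        def overlap(profile):
            return sum(min(count, profile[gram]) for gram, count in grams.items())

        return max(self.profiles, key=lambda key: overlap(self.profiles[key]))


# Illustrative training sentences (assumed samples, one per language).
ENGLISH = "All human beings are born free and equal in dignity and rights."
TURKISH = "Bütün insanlar hür, haysiyet ve haklar bakımından eşit doğarlar."

ident = ToyIdentifier()
ident.train("English", "Latin", "ascii", ENGLISH)
ident.train("Turkish", "Latin", "utf-8", TURKISH)
ident.train("Turkish", "Latin", "iso8859_9", TURKISH)

print(ident.identify("eşit doğarlar".encode("utf-8")))
print(ident.identify("eşit doğarlar".encode("iso8859_9")))
print(ident.identify(b"born free and equal"))
```

Matching on bytes rather than characters is what lets one profile per encoding distinguish the same language stored as UTF-8 from the same language stored in a legacy encoding.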

9
Discovered 55 Asian languages
  • Chinese, Japanese and Korean are excluded from
    the analysis
  • Hebrew, Thai, Turkish, Vietnamese, Arabic,
    Tatar, Farsi, Javanese, Indonesian, Malay,
    Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh,
    Madurese, Uighur, Kashmiri Pushtu, Balochi,
    Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese,
    Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan,
    Iloko, Bengali Assamese, Filipino, Waray,
    Buginese, Burmese, Kurdish, Tajiki, Azeri,
    Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan,
    Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto,
    Kannada, Urdu, Khmer, Hani

10
Number of web pages per 1,000 population
11
Number of pages by language Top 10
12
Number of pages by language Least 10
13
Multilingualism by Country Domain
  • The most recent version of Ethnologue lists close
    to seven thousand languages around the world.
  • More than 2,600 of them are spoken in the Asian
    region.
  • Large-scale linguistic diversity is observable in
    Asia. Among the 2,600, only around 51 languages
    are recognized by Asian governments as official
    or national languages.
  • Indonesia has the richest diversity of languages
    in the region.
  • It is interesting to note that there are
    significantly more pages in Javanese than in
    either Indonesian or Malay.
  • The major languages found in Indonesia, Malaysia,
    Brunei, Singapore, Southern Thailand and the
    Philippines can be grouped under a single root,
    Malay, spoken in different dialects.
  • Javanese has a dominant web presence in
    Indonesia.
  • The lesser Sundanese, Madurese, Achehnese and
    Buginese languages are found to be of great
    importance to Indonesia's local language
    diversity on the Internet.

14
Cross-Border Languages
  • Another aspect of multilingualism in the
    region is the overwhelming presence of
    cross-border languages on the web
  • Two categories of languages are defined
  • The first category is local languages, which are
    the officially recognized language(s) and the
    home languages of the state's speakers
  • The second category is cross-border languages,
    such as English, French, Russian and Arabic,
    which are used as languages of communication
    among the peoples of different nations
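
A minimal sketch of how the two categories could be applied to per-domain page counts. The page counts are made up for illustration and are not survey results. The `classify` and `cross_border_share` helpers are hypothetical, and so is the rule that a language listed as local for a state is never counted as cross-border there.

```python
CROSS_BORDER = {"English", "French", "Russian", "Arabic"}


def classify(language, local_languages):
    """Label a language as local, cross-border or other for one state."""
    if language in local_languages:
        return "local"
    if language in CROSS_BORDER:
        return "cross-border"
    return "other"


def cross_border_share(page_counts, local_languages):
    """Fraction of pages in a domain written in cross-border languages."""
    total = sum(page_counts.values())
    if total == 0:
        return 0.0
    cross = sum(
        pages
        for language, pages in page_counts.items()
        if classify(language, local_languages) == "cross-border"
    )
    return cross / total


# Made-up counts for two imaginary country domains.
domain_a = {"English": 700, "Sinhala": 200, "Tamil": 100}
domain_b = {"Arabic": 600, "English": 400}

print(cross_border_share(domain_a, {"Sinhala", "Tamil"}))
print(cross_border_share(domain_b, {"Arabic"}))
```

Passing the local languages per state matters for a language such as Arabic, which is a local language in some domains and a cross-border language in others.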

15
Cross-Border Language Presence
West Asia
16
Central Asia
17
South East Asia
18
South Asia
19
Script Diversity of Asia
20
Same Script Shared by Various Languages
Devanagari script is used by:
  • More than 480 million speakers: Hindi
  • More than 10 million speakers: Marathi, Nepali
  • More than 1 million speakers: Awadhi, Bhojpuri,
    Braj-Bhasha, Chhattisgarhi, Konkani, Kachchi,
    Marwari, Maithili, Magahi
  • Scholars' language: Sanskrit
  • Less than 1 million speakers: Garhwali, Mundari,
    Newari, Bagheli, Bhatneri, Bathi, Bateri, Bhili,
    Gondi, Jaipuri, Harauti, Ho, Kachchhi, Kanauji,
    Khadiya, Khortha, Kului, Kumaoni, Kurku, Kurukh,
    Kurmali, Palpa, Panchpargania, Santali, Nagpuri,
    Kankan, Limbu, Sherpa
21
UDHR Document by Major Script Grouping
Representation of the UDHR Document by Major
Script Grouping
Note 1: Cumulative speaker population based on
Ethnologue: Languages of the World, 15th ed.
(2005)
22
UTF-8 Encoding in Selected Languages
23
Asian Language Resources Agenda
ALRN Mission
24
ALRN Action Plan
  • The project will focus on South, South East,
    Central and West Asian languages
  • Act as an umbrella for Asian Language Resources
    (ALR)
  • Accommodate secure and sustainable UTF-based
    encoding
  • Take advantage of existing organizations such as
    the Language Observatory Project (LOP, TCL)
  • Corpus collection from the web using the LO's
    crawler/language identifier
  • Language resources originating from Japan, with
    parallel corpora available in other languages
    (UDHR, Oshin, One Straw Revolution, etc.)
  • Multilingual terminology dictionary
  • Information standards for language corpus building
  • Liaison with international organizations such as
    UNESCO, UDHR, etc.
  • Information resource sharing web site
    (www.language-resource.net)

Asian Academy of Languages?

25
Thank you, Danke schön, Merci, Gracias, Obrigado, Grazie,
Danke, Spaciba, Ευχαριστώ
26
Language Presence in Asian Countries
(The exact number of languages may never be
determined)
27
Language Diversity
(Half of the world's languages are spoken in only
eight countries)
28
Asian Language Recognition
29
Asian Language Resources Network - Agenda