Title: S. T. Nandasara Lecturer
1Asian Languages on the Web
S. T. Nandasara, Lecturer, UCSC, University of Colombo, Sri Lanka
Ashu Marasinghe, Associate Professor, LOP, Nagaoka University of Technology, Japan
Yoshiki Mikami, Professor, Leader, LOP, Nagaoka University of Technology, Japan
2Asian Languages on the Web
- Introduction of Asian Languages
- Survey Objectives and Methodology
- Asian Language Presence on the Web
- Multilingualism in the Asian Web
- Script and Encoding Issues
- Asian Language Resource Network (ALRN) Project
3Survey Objectives
- Give an overview of Asian languages on the web
- Describe the state of multilingualism in Asian country domains
  - Defined at various levels, from a personal or document level to a societal level
  - Multiple language presence in each country domain
- Give an overview of cross-border languages
- Shed light on script and encoding issues of Asian languages
  - To what extent is UCS/Unicode employed for Asian languages?
  - What scripts are actually used to represent a specific language?
  - To what extent are locally developed encodings used?
- Define a future agenda to guide us in realizing the vision of creating an observation-collection instrument for Asian languages
4Survey Methodology
- Used a web crawler (UbiCrawler)
- It traces links within pages and recursively crawls to gather the newly discovered pages
- The collection of downloaded web pages is passed to the language identification engine
- The language properties of the pages are then identified
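The link-tracing step above can be sketched as a breadth-first traversal. This is only an illustration and not UbiCrawler's implementation: the `crawl` function, the `fetch` callback and the in-memory `FAKE_WEB` pages are hypothetical stand-ins, chosen so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=1000):
    """Fetch pages breadth-first, queueing every newly discovered link."""
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # hypothetical callback; returns None on failure
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://a.example.lk/": '<a href="/news">news</a> <a href="http://b.example.lk/">b</a>',
    "http://a.example.lk/news": '<a href="/">home</a>',
    "http://b.example.lk/": "no links here",
}

collected = crawl(["http://a.example.lk/"], FAKE_WEB.get)
print(sorted(collected))
```

The `collected` dictionary plays the role of the downloaded page collection that is handed to the language identification engine.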
5Web Pages Collected
- Focused on web pages in 42 country domains in Asia.
- The crawl was begun from a seed file containing 13,286 URLs
- The list of ccTLDs contains ae, af, az, bd, bh, bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn and ye.
- The Asia crawl started on 5th July 2006 at 1100hrs and ended on 19th July 2006 at 1903hrs
- Downloaded 107,141,679 web pages in total, 652,710,237,381 bytes in size
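As a rough sketch, restricting a crawl to the listed country domains can be done by checking the last label of each host name, and the totals above yield two derived figures (crawl duration and average page size). The `in_scope` helper is a hypothetical name, and nothing here is claimed to be how the actual crawl was configured.

```python
from datetime import datetime
from urllib.parse import urlsplit

# The 42 ccTLDs listed on the slide.
CCTLDS = set(
    "ae af az bd bh bn bt cy id il in iq ir jo kg kh kw kz la lb lk "
    "mm mn mv my np om ph pk ps qa sa sg sy th tj tm tp tr uz vn ye".split()
)


def in_scope(url):
    """True if the URL's host ends in one of the surveyed ccTLDs."""
    host = urlsplit(url).hostname or ""
    return "." in host and host.rsplit(".", 1)[-1] in CCTLDS


# Figures derived from the totals reported on the slide.
crawl_time = datetime(2006, 7, 19, 19, 3) - datetime(2006, 7, 5, 11, 0)
avg_page_bytes = 652_710_237_381 / 107_141_679

print(len(CCTLDS), crawl_time, round(avg_page_bytes))
```

Under these figures the crawl ran for a little over 14 days, and the average downloaded page was roughly 6 KB.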
6Downloaded Pages by ccTLD Top 10
7Downloaded Pages by ccTLD Least 10
8Language Identification Process
- The language identification engine LIM (Language Identification Module) was used
- LIM consists of two components
  - The first is the training component
    - Training data is translations of the Universal Declaration of Human Rights (UDHR) provided by the Office of the United Nations High Commissioner for Human Rights
  - The second is the identification component
- LIM can simultaneously detect the triplet of language, script and encoding scheme
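One way to picture a two-component identifier that returns a (language, script, encoding) triplet is a byte n-gram matcher: training stores an n-gram profile per triplet, and identification picks the profile that best overlaps the page's bytes. This is a minimal sketch under that assumption and is not LIM's actual algorithm; the `TripletIdentifier` class and the short training sentences are illustrative choices.

```python
from collections import Counter


def byte_ngrams(data, n=3):
    """Count overlapping byte n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


class TripletIdentifier:
    """Toy identifier keyed by (language, script, encoding); hypothetical."""

    def __init__(self, n=3):
        self.n = n
        self.profiles = {}

    def train(self, language, script, encoding, text):
        # Training component: profile the text as encoded bytes.
        data = text.encode(encoding)
        self.profiles[(language, script, encoding)] = byte_ngrams(data, self.n)

    def identify(self, data):
        # Identification component: best n-gram overlap wins.
        target = byte_ngrams(data, self.n)
        total = max(1, sum(target.values()))

        def score(profile):
            return sum(min(c, profile[g]) for g, c in target.items()) / total

        return max(self.profiles, key=lambda key: score(self.profiles[key]))


english = "All human beings are born free and equal in dignity and rights."
turkish = "Bütün insanlar hür, haysiyet ve haklar bakımından eşit doğarlar."
hindi = "सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है।"

lim_sketch = TripletIdentifier()
lim_sketch.train("English", "Latin", "utf-8", english)
lim_sketch.train("Turkish", "Latin", "utf-8", turkish)
lim_sketch.train("Turkish", "Latin", "iso-8859-9", turkish)
lim_sketch.train("Hindi", "Devanagari", "utf-8", hindi)

print(lim_sketch.identify("haklar bakımından eşit".encode("iso-8859-9")))
```

Because profiles are built from encoded bytes, the same Turkish sentence trained under two encodings produces two distinct profiles, which is what lets a single lookup return language, script and encoding together.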
9Discovered 55 Asian languages
- Chinese, Japanese and Korean are excluded from the analysis
- Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh, Madurese, Uighur, Kashmiri Pushtu, Balochi, Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese, Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan, Iloko, Bengali Assamese, Filipino, Waray, Buginese, Burmese, Kurdish, Tajiki, Azeri, Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan, Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto, Kannada, Urdu, Khmer, Hani
10Number of web pages per 1,000 population
11Number of pages by language Top 10
12Number of pages by language Least 10
13Multilingualism by Country Domain
- The most recent version of Ethnologue lists close to seven thousand languages around the world.
- More than 2,600 of them are spoken in the Asian region.
- Large-scale linguistic diversity is observable in Asia. Among the 2,600, only around 51 languages are recognized by Asian governments as official or national language(s)
- Indonesia has the richest diversity of languages in the region
- It is interesting to note that there is a significantly larger number of pages in Javanese than in either Indonesian or Malay
- The major languages found in Indonesia, Malaysia, Brunei, Singapore, Southern Thailand and the Philippines can be categorized as dialects of a single root Malay language.
- Javanese has a dominating web presence in Indonesia.
- The lesser Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia's local language diversity on the Internet
14Cross-Border Languages
- Another aspect of multilingualism in the region is the overwhelming presence of cross-border languages on the web
- Two categories of languages are defined
  - The first category is local languages, which are the state's officially recognized language(s) and the languages its people speak at home
  - The second category is cross-border languages, such as English, French, Russian and Arabic, which are used as languages of communication among the peoples of different nations
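The two categories above can be turned into a per-domain measurement along the following lines. This is a sketch only: the `classify` and `presence` helpers are hypothetical, the rule that a language counts as local whenever it is in the country's own list (so Arabic is local in an Arabic-speaking state) is an assumption, and the page counts are made-up numbers, not survey results.

```python
# Cross-border languages named on the slide.
CROSS_BORDER = {"English", "French", "Russian", "Arabic"}


def classify(language, local_languages):
    """Label one language relative to one country domain."""
    if language in local_languages:
        return "local"
    if language in CROSS_BORDER:
        return "cross-border"
    return "other"


def presence(page_counts, local_languages):
    """Share of pages per category for one country domain."""
    totals = {"local": 0, "cross-border": 0, "other": 0}
    for language, pages in page_counts.items():
        totals[classify(language, local_languages)] += pages
    grand_total = sum(totals.values())
    return {k: (v / grand_total if grand_total else 0.0) for k, v in totals.items()}


# Made-up counts for a hypothetical country domain (illustration only).
sample_counts = {"English": 600, "Sinhala": 250, "Tamil": 100, "Hindi": 50}
shares = presence(sample_counts, local_languages={"Sinhala", "Tamil"})
print(shares)
```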
15Cross-Border Language Presence
West Asia
16Central Asia
17South East Asia
18South Asia
19Script Diversity of Asia
20Same Script Shared by Various Languages
Devanagari Script used by
More than 480 million speakers: Hindi
More than 10 million speakers: Marathi, Nepali
More than 1 million speakers: Awadhi, Bhojpuri, Braj-Bhasha, Chhattisgarhi, Konkani, Kachchi, Marwari, Maithili, Magahi
Scholars' language: Sanskrit
Less than 1 million speakers: Garhwali, Mundari, Newari, Bagheli, Bhatneri, Bathi, Bateri, Bhili, Gondi, Jaipuri, Harauti, Ho, Kachchhi, Kanauji, Khadiya, Khortha, Kului, Kumaoni, Kurku, Kurukh, Kurmali, Palpa, Panchpargania, Santali, Nagpuri, Kankan, Limbu, Sherpa
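Because one script serves many languages, detecting the script of a page does not settle its language. The sketch below illustrates this with Python's `unicodedata` module, using the first word of each character's Unicode name as a rough script label; that heuristic and the `dominant_script` helper are assumptions for illustration, not part of the survey tooling.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script label among letters and marks."""
    counts = Counter()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information.
        if unicodedata.category(ch)[0] in "LM":
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1  # e.g. "DEVANAGARI", "LATIN"
    return counts.most_common(1)[0][0] if counts else None


# Language names written in their own script.
samples = {
    "Hindi": "हिन्दी",
    "Marathi": "मराठी",
    "Nepali": "नेपाली",
    "English": "English",
}
scripts = {language: dominant_script(word) for language, word in samples.items()}
print(scripts)
```

Hindi, Marathi and Nepali all map to the same script label here, so a further language-level step is still needed to tell them apart.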
21UDHR Document by Major Script Grouping
Representation of the UDHR Document by Major
Script Grouping
1 Cumulated speaker population based on Ethnologue: Languages of the World, 15th ed. (2005)
22UTF-8 Encoding in Selected Languages
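A simple way to approximate UTF-8 usage over a set of pages is to test whether each page's bytes decode strictly as UTF-8. This is a hedged sketch, not the method used in the survey: the `utf8_share` helper and the sample pages are hypothetical, and strict decoding is only a necessary condition, since pure ASCII pages also pass.

```python
def is_valid_utf8(data):
    """True if the byte string decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf8_share(pages):
    """Fraction of pages whose bytes are valid UTF-8."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(is_valid_utf8(page) for page in pages) / len(pages)


# Hypothetical pages: two in UTF-8, two in legacy 8-bit encodings.
sample_pages = [
    "සිංහල".encode("utf-8"),
    "தமிழ்".encode("utf-8"),
    "eşit".encode("iso-8859-9"),
    "ไทย".encode("tis-620"),
]
print(utf8_share(sample_pages))
```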
23Asian Language Resources Agenda
ALRN Mission
24ALRN Action Plan
- The project will focus on South, South East, Central and West Asian languages
- Act as an umbrella for Asian Language Resources (ALR)
- Accommodate secure and sustainable UTF-based encoding
- Take advantage of existing organizations such as the Language Observatory Project (LOP, TCL)
- Corpus collection from the web using LO's crawler/language identifier
- Language resources originating from Japan, with parallel corpora available in other languages (UDHR, Oshin, One Straw Revolution, etc.)
- Multilingual Terminology Dictionary
- Information standards for language corpus building
- Liaison with international organizations such as UNESCO, UDHR, etc.
- Information resource sharing web site (www.language-resource.net)
Asian Academy of Languages?
25Thank you Danke schön Merci Gracias Obrigado Grazie Danke Spaciba
26Language Presence in Asian Countries
(The exact number of languages may never be determined)
27Language Diversity
(Half of the world's languages are spoken in only eight countries)
28Asian Language Recognition
29Asian Language Resources Network - Agenda