Title: The Development of E2T and T2E Active Reading via Web
1The Development of E2T and T2E Active Reading
via Web
- Asanee Kawtrakul and Teams
- Kasetsart University, Bangkok, Thailand
- ak_at_vivaldi.cpe.ku.ac.th
- Fifth Agricultural Ontology Service (AOS)
Workshop - 29 April 2004, Beijing, China
2Outline
- Motivation
- Objectives
- System Overview
- Methodologies
- Example
- Conclusion and Future work
3Acknowledgement
- KURDI
- Kasetsart University Research and Development
Institute
4Collaboration
- Library Institute of Kasetsart University
- Providing thesaurus and Agricultural Corpus
5Motivation
- Valuable data is scattered throughout the
organization in multiple languages
- Good information is collected by many individuals
in unstructured formats
- Digested information enables quicker
decision-making
6Proposed project
- Summarization
- From unstructured to structured format
- Only the gist of information
- Translation
- From English to Thai (E2T)
- Thai to English (T2E)
7Objectives
- To develop a system for summarizing and
translating agricultural information from
English to Thai using statistical and frame-based
approaches (E2T)
- To support the development of information
discovery and web-based information exchange in
the agricultural domain (T2E)
8E2T
9Summarization (Input)
- Let us focus on Canada's agricultural products.
In 1998, there were 1,216 registered commercial
egg producers in Canada. Ontario produced 39.8%
of all eggs in Canada, Quebec was second with
16.6%. The western provinces have a combined egg
production of 35.6% and the eastern provinces
have a combined production of 8.0%.
Courtesy of Agriculture and Agri-Food
Canada, http://www.agr.ca/cb
10Summarization (Cube)
11Other Output
12Some related works
- Frame
- Knowledge representation in the form of slots and
fillers
- Consisting of attributes and their values
Attributes
Values
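A frame as described here can be sketched as a small Python class. This is a minimal illustration, not the authors' implementation: the class name `Frame`, its methods, and the slot names are assumptions, and the value 39.8 is read as a percentage share.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame: a named set of slots (attributes), each holding a filler (value)."""
    name: str
    slots: dict = field(default_factory=dict)

    def fill(self, attribute, value):
        """Store a filler for one slot."""
        self.slots[attribute] = value

    def is_complete(self, required):
        """Return True when every required slot has a filler."""
        return all(attribute in self.slots for attribute in required)


# Slot names are hypothetical; the values come from the sample text on egg production.
frame = Frame("egg_production")
frame.fill("product", "egg")
frame.fill("region", "Ontario")
frame.fill("year", 1998)
frame.fill("share_percent", 39.8)
```

A frame that passes `is_complete` for the slots a template requires could then be handed on to later stages.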
13Methodologies
- Integration of NLP techniques and data cube
structure
- Gist of information extracted and summarized by
frames and then translated into the target
language
- Data cube structure supporting efficient data
access management and powerful decision making
- Focusing on the case of
- Agricultural summary articles that share a
similar structure
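The data cube idea can be illustrated with a small in-memory structure. This is a sketch under assumptions: the class `DataCube`, its method names, and the dimensions (product, region, year) are hypothetical, and the shares from the sample text are treated as percentages.

```python
class DataCube:
    """A tiny data cube: each value is addressed by one coordinate per dimension."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = {}

    def add(self, value, **coords):
        """Store a value at the cell given by one coordinate per dimension."""
        key = tuple(coords[d] for d in self.dimensions)
        self.cells[key] = value

    def select(self, **fixed):
        """Return the cells whose coordinates match all fixed dimensions."""
        hits = {}
        for key, value in self.cells.items():
            coords = dict(zip(self.dimensions, key))
            if all(coords[d] == v for d, v in fixed.items()):
                hits[key] = value
        return hits

    def roll_up(self, dimension):
        """Sum the values over every dimension except the given one."""
        idx = self.dimensions.index(dimension)
        totals = {}
        for key, value in self.cells.items():
            totals[key[idx]] = totals.get(key[idx], 0.0) + value
        return totals


cube = DataCube(["product", "region", "year"])
shares = [("Ontario", 39.8), ("Quebec", 16.6),
          ("Western provinces", 35.6), ("Eastern provinces", 8.0)]
for region, share in shares:
    cube.add(share, product="egg", region=region, year=1998)
```

Here `cube.select(region="Quebec")` returns a single cell, while `cube.roll_up("year")` aggregates the regional shares for 1998.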
14Why are NLP techniques needed?
- NP Analysis
- To extract the named entity for activating a frame
- To enhance the performance of indexing
- Word sense disambiguation
- Pound
- The basic monetary unit of the United Kingdom
- Unit of mass and weight
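The "pound" ambiguity can be illustrated with a simple cue-word overlap. This is a simplified stand-in chosen for illustration, not the method used in the system; the cue lists and the function name `disambiguate_pound` are assumptions.

```python
# Hypothetical cue words for each sense of "pound".
SENSES = {
    "currency": {"price", "cost", "sterling", "pay", "kingdom", "exchange"},
    "weight": {"weigh", "weighs", "weight", "mass", "ounces", "heavy"},
}


def disambiguate_pound(sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word is present. On a tie, the sense listed
    first in SENSES wins.
    """
    tokens = {token.strip(".,;:()").lower() for token in sentence.split()}
    scores = {sense: len(tokens & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, a sentence mentioning "price" and "sterling" resolves to the currency sense, while one mentioning "weight" resolves to the unit of mass.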
15System Overview
Graphical User Interface
16Gathering Module
17Indexing and Clustering Module
18Summarization Module
Sentence Structures
19Summarization (Input)
- Let us focus on Canada's agricultural products.
In 1998, there were 1,216 registered commercial
egg producers in Canada. Ontario produced 39.8%
of all eggs in Canada, Quebec was second with
16.6%. The western provinces have a combined egg
production of 35.6% and the eastern provinces
have a combined production of 8.0%.
Courtesy of Agriculture and Agri-Food
Canada, http://www.agr.ca/cb
20Summarization (Filtering)
Let us focus on Canada's agricultural products.
In 1998, there were 1,216 registered commercial
egg producers in Canada.
Ontario produced 39.8% of all eggs in Canada.
Quebec was second with 16.6%.
The western provinces have a combined egg
production of 35.6%.
The eastern provinces have a combined production
of 8.0%.
21Summarization (Templates)
22Summarization (Frames)
23Summarization (Cube)
24Translation Module
Visualization Tool
25Translation Result
26Web-based User Interface
- To make inquiries about the price history of
agricultural products, including chronological
and statistical data
27Output
28Current State of the E2T System
- Parser: shallow parsing
- English to Thai
- Summarization and translation: frame-based
- Text to relational database
29Parser
Big dog loves small cat.
สุนัข ใหญ่ รัก แมว เล็ก /sulnakh yail rakh määwm lekh/
30T2E
31Input and Output
- Input characteristics (SL)
- Web pages must be HTML files only
- Web pages displayed in Thai
- Output characteristics (TL)
- The system displays the output in English in a
new pop-up window
32Why Translate Only Tables?
- From the survey, agricultural web pages can
be divided into 3 types
- Full text
- Tables with contexts
- Tables only (approx. 50%)
33Table Characteristics (cnt.)
Unit
Heading (Outside Table)
Pure Texts
Numeric
34Table Characteristics (cnt.)
Unit outside table
Unit Inside table
35Input Format Example
Department of Internal Trade (DIT)
Office of Agriculture Economics (OAE)
36Tables only
Picture
Bullet
Agricultural Economics News
37System overview
38Input Webpage
HTML File
Web Robot
Internet
39Table Analysis
HTML File
Tag with position anchor
Text with position anchor
40Position Anchor (Table Analysis)
- Using letters to stand for the data's position in
each cell of a table
- T stands for table
- R stands for row
- C stands for column
41Keyword Definition Example (Table Analysis)
The result will be:
T1R1C1 ????
T1R1C2 1999
T1R1C3 2000
T1R2C1 ?????????
T1R2C2 24,245
T1R2C3 28,356
T2R1C1 ???????
T2R1C2 1999
T2R2C1 ?????????
T2R2C2 2,172,000
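The position-anchor scheme (T for table, R for row, C for column) can be sketched with Python's standard `html.parser`. This is a minimal illustration rather than the system's actual table analyzer: the class `AnchorParser` and the function `anchor_cells` are hypothetical names, nested tables are not handled, and the English cell labels in the sample stand in for Thai text.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Label the text of every table cell as T<table>R<row>C<column>."""

    def __init__(self):
        super().__init__()
        self.table = self.row = self.col = 0
        self.in_cell = False
        self.cells = []  # list of [anchor, text] pairs

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table += 1
            self.row = 0
        elif tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            self.col += 1
            self.in_cell = True
            self.cells.append([f"T{self.table}R{self.row}C{self.col}", ""])

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is attached to the current anchor.
        if self.in_cell:
            self.cells[-1][1] += data.strip()


def anchor_cells(html):
    """Return (anchor, text) pairs for every cell in the HTML string."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    return [(anchor, text) for anchor, text in parser.cells]


# Placeholder labels ("Year", "Export") are assumptions for illustration.
sample = (
    "<table><tr><td>Year</td><td>1999</td><td>2000</td></tr>"
    "<tr><td>Export</td><td>24,245</td><td>28,356</td></tr></table>"
)
anchored = anchor_cells(sample)
```

Each anchor records where a piece of text came from, so a translated string can be written back into the same cell.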
42Chunk-level Translation
Translated File
Text with Keyword
43Phrase Chunker (cnt.) (Chunk-level Translation)
Rules: np → n vp; vp → aux? v n
Example: ราคา นำเข้า สินค้า
44Phrase Chunker (Chunk level Translation)
45Chunk level Translation (cnt.)
- Handle named entities (NE) with care!
- NEs cannot be translated word by word
- e.g. กองควบคุมพืชและวัสดุการเกษตร
- Chunker → AGRICULTURAL PLANT AND MATERIAL CONTROL
DIVISION
- NE Extraction → AGRICULTURAL REGULATORY DIVISION
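The point of the example above is that a named-entity lookup should take priority over word-by-word translation. A minimal sketch follows, assuming a longest-match scan over pre-segmented tokens; the function name `translate_tokens`, both dictionaries, and the Thai tokens with their glosses are illustrative assumptions, not the system's data.

```python
def translate_tokens(tokens, ne_dict, word_dict):
    """Translate a token list, preferring the longest named-entity match.

    ne_dict maps tuples of source tokens to a fixed rendering;
    word_dict maps single tokens to glosses. Unknown tokens pass through.
    """
    output = []
    longest = max((len(key) for key in ne_dict), default=0)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in ne_dict:
                output.append(ne_dict[span])
                i += n
                break
        else:
            output.append(word_dict.get(tokens[i], tokens[i]))
            i += 1
    return output


# Illustrative data only: tokens and glosses are assumptions.
tokens = ["กอง", "ควบคุม", "พืช", "และ", "วัสดุ", "การเกษตร"]
word_dict = {"กอง": "division", "ควบคุม": "control", "พืช": "plant",
             "และ": "and", "วัสดุ": "material", "การเกษตร": "agricultural"}
ne_dict = {tuple(tokens): "AGRICULTURAL REGULATORY DIVISION"}

with_ne = translate_tokens(tokens, ne_dict, word_dict)
without_ne = translate_tokens(tokens, {}, word_dict)
```

With the entity in `ne_dict` the whole span is rendered as one official name; without it the output is a literal gloss of each word in source order.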
46Table Characteristics (Unit Conversion)
Unit outside table
Unit inside table
47Unit conversion (cnt.)
48Sentence Generation
Rules: np → n vp; vp → aux? v n
Example: ราคา นำเข้า สินค้า
49Sentence Generation (cnt.)
NP ราคา vp นำเข้า สินค้า
Transfer rules:
Thai            English
np → n vp       np → adjp n
vp → v n        adjp → adj np
NP np สินค้า นำเข้า ราคา
NP np goods importing ราคา
NP np goods importing price
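The reordering performed by the transfer rules can be sketched as follows. This is an illustration under assumptions, not the system's generator: the function names are hypothetical, the input is assumed to be already chunked as a head noun plus a verb phrase, and the Thai words in the lexicon are chosen to match the English gloss "goods importing price".

```python
# Thai-to-English lexicon; the Thai words are assumed for illustration.
LEXICON = {"ราคา": "price", "นำเข้า": "importing", "สินค้า": "goods"}


def transfer_np(thai_np):
    """Reorder a Thai np (np -> n vp, vp -> v n) into English order.

    thai_np is (head_noun, (verb, object_noun)). Under the English pattern
    np -> adjp n, the modifier built from the Thai vp precedes the head noun.
    """
    head_noun, (verb, object_noun) = thai_np
    return [object_noun, verb, head_noun]


def generate(thai_np, lexicon=LEXICON):
    """Translate each reordered word and join the result into a phrase."""
    return " ".join(lexicon[word] for word in transfer_np(thai_np))


phrase = generate(("ราคา", ("นำเข้า", "สินค้า")))
```

The head noun moves from first position in the Thai phrase to last position in the English one, which mirrors the three generation steps listed above.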
50Result
Active Reading
51Available Web sites
- Department of Internal Trade
- http://www.dit.go.th/
- Office of the Rubber Replanting Aid Fund
- http://www.thailandrubber.thaigov.net/menu5.php
- http://www.talaadthai.com/pricebase/default.asp
- http://www.rubberthai.com/price/price_index.htm
- http://www.thaifruitnews.com/
52Multilinguality Extension
54Structure of ML-Dictionary (New version)
- Main language: English (vocabulary and POS)
- Separate table for each language
- Vocabulary entries that have the same meaning are
linked together by an ID attribute
- 10 languages supported
- Bahasa Indonesian, Chinese, English, French,
Italian, Japanese, Korean, Tagalog, Thai and
Vietnamese
- UTF-8 character encoding
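The dictionary layout described above (an English main table, one table per language, entries linked by ID) can be sketched with Python's built-in `sqlite3` module. The table and column names and the helper functions are assumptions for illustration; the actual schema is not given in the slides.

```python
import sqlite3


def build_dictionary(languages):
    """Create an English main table plus one table per other language."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE english (id INTEGER PRIMARY KEY, word TEXT NOT NULL, pos TEXT)"
    )
    for lang in languages:
        # Table names come from a fixed, trusted list, never from user input.
        conn.execute(
            f'CREATE TABLE "{lang}" ('
            "id INTEGER NOT NULL REFERENCES english(id), word TEXT NOT NULL)"
        )
    return conn


def add_entry(conn, word, pos, translations):
    """Insert an English entry and link its translations by the same ID."""
    entry_id = conn.execute(
        "INSERT INTO english (word, pos) VALUES (?, ?)", (word, pos)
    ).lastrowid
    for lang, term in translations.items():
        conn.execute(f'INSERT INTO "{lang}" (id, word) VALUES (?, ?)', (entry_id, term))
    return entry_id


def lookup(conn, word, lang):
    """Return the translations of an English word in the given language."""
    rows = conn.execute(
        f'SELECT t.word FROM english e JOIN "{lang}" t ON t.id = e.id WHERE e.word = ?',
        (word,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_dictionary(["thai", "french"])
add_entry(conn, "rice", "n", {"thai": "ข้าว", "french": "riz"})
```

Because every language table shares the English entry's ID, adding a new language only requires a new table, and a language with no entries yet simply returns an empty result.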
55User Interface example.
- Adding new vocabulary user interface
56User Interface example. (cont)
- Query vocabulary user interface
57Current result based on FAO stat
- English: 23,207 vocabulary entries
- French: 1,482 vocabulary entries
- Thai: 23,097 vocabulary entries
- Vietnamese: 175 vocabulary entries
- Japanese: 108 vocabulary entries
- Bahasa Indonesian: 13 vocabulary entries
- Chinese, Italian, Korean and Tagalog: 0
vocabulary entries
58Future work
- Web-based Multilingual Active Reading System for
Information Exchange
- Language Configuration
- Active Reading assistant
- Table Translator with more multilingual dictionaries
59Thank you