Work at TACOLA Lab - PowerPoint PPT Presentation

Learn more at: https://www.infitt.org

1
Work at TACOLA Lab

Team Members
T.V.Geetha Ranjani Parthasarathi Madhan Karky
E.UmaMaheswari J.Balaji Subalalitha
Elanchezhiyan.K, Karthika, Thenmalar,
Radhakrishnan, Kandasamy, Padmavathi, Aruna,
Vijayavani

2
Tamil Language Processing

Tamil Language Processing
Morphological analyser
Normal Words, Compound Words, Colloquial Words
Parser
Simple, Complex and Compound Sentences
Semantic analysis based on UNL
Language Technology
Blog Mining
Ontology Based Information Extraction
Personalized Search
Parallelization for NLP Processing
Emotion detection from text
Carnatic Music Processing
Raga Modelling
Singer, Genre Identification
Music Emotion Recognition

Tamil Language Oriented Tools
Dictionary
Text Compaction
UNL Based Work
UNL for semantic representation
Nested UNL
Concept based Search
Bi-lingual Search
Event Processing
Discourse Analysis
Summarization
Question answering
Thirukural Search
Lyric Oriented Processing
Lyric Mining
Lyrics for Tunes
Pleasantness

3
Papers for TIC 2011

Tamil Language Oriented Tools
Agaraadhi: A Novel Online Dictionary Framework
An Efficient Tamil Text Compaction System
(Surukkupai)
Kuralagam: A Concept Relation Based Search
Framework for Thirukural
Popularity Based Scoring Model for Tamil Word
Games
Tamil Language Processing
Template Based Multilingual Summary Generation
On Emotion Detection from Tamil Text
Tamil Summary Generation for Cricket Match
Lyric Oriented Processing
Lyric Mining: Word, Rhyme and Concept
Co-occurrence Analysis
Special Indices for LaaLaLaa Lyric Analysis and
Generation Framework

4
AGARAADHI: A NOVEL ONLINE DICTIONARY FRAMEWORK

Elanchezhiyan.K
Karthikeyan.S
T.V.Geetha
Ranjani Parthasarathi
Madhan Karky

5
OBJECTIVES

Agaraadhi, a dictionary framework for indexing
and retrieving Tamil words, their meaning,
analysis and related information.
Framework to incorporate various unique features
- designed to provide additional information to
the user regarding the word that they query
about.

6
INTRODUCTION

The Agaraadhi dictionary has more than 3 lakh
(300,000) words in various domains such as
General,
Literature,
Medical,
Engineering,
Computer Science,
Bird Names and more.
Agaraadhi is a Tamil-English bilingual
dictionary.

7
INTRODUCTION CONT

Agaraadhi is a Tamil-English bilingual
dictionary with 20 features, such as
morphological analysis,
morphological generation,
word usage statistics,
word pleasantness analysis,
spell checking,
similar word finder,
word usage in literature,
picture dictionary,
number to text conversion,
phonetic transliteration,
live usage analysis from micro blogs and more

8
AGARAADHI FRAMEWORK CONT
9
AGARAADHI FEATURES

Morphological Analyser
Gives the morphological features of the query
word, such as root word, part of speech, gender,
tense and count.
For example, for the query word padithaan, the
Morphological Analyser returns padi as the root,
masculine gender, past tense, and so on.
Morphological Generator
The Tamil morphological generator handles
different syntactic categories such as nouns,
verbs, postpositions, adjectives and adverbs.
The generator is used to generate the possible
morphological variations of the query word.
Spell Checker
Checks the spelling of Tamil words and provides
alternative suggestions for wrongly spelt words.
If the root word is not in the dictionary, it
generates all possible suggestions with minimum
variation from the given word.
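
The suggestion step of the spell checker can be illustrated with a minimal Python sketch. It assumes a tiny hypothetical lexicon of romanized words and uses Levenshtein edit distance as the measure of "minimum variation"; the slides do not state which distance measure or lexicon Agaraadhi actually uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_distance=2):
    """Return lexicon entries within max_distance edits, closest first."""
    if word in lexicon:
        return []  # correctly spelt, nothing to suggest
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for d, w in scored if d <= max_distance]


# Hypothetical romanized lexicon, for illustration only.
LEXICON = ["padi", "padam", "paadal", "mazhai", "pookkal"]
print(suggest("pado", LEXICON))
```

A real system would search a large dictionary with an index rather than scanning every entry, but the ordering principle (fewest edits first) is the same.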

10
AGARAADHI FEATURES

Word Suggestions
Gives a list of equivalent or related words for
the given query word.
Word Pleasantness
The score generator indicates how easy the word
is to pronounce.
Word Popularity Score
Shows the usage of the word on the web, based on
the frequency distribution of the word across
popular blogs, news articles, social networks, etc.
Word Usage Statistics
Shows the usage of the word in social networks
over the past week.
Word Usage in Literature
Finds the usage of the word in popular literature
such as Thirukural, Bharathiyar Padalgal and Avvai
songs, and also in the lyrics of Tamil movie songs.

11
AGARAADHI FEATURES

Word of the Day
A rare word is chosen at random and displayed
on the opening page so that users can learn
a new word every day.
Number to Text Converter
Converts a number to its Tamil word equivalent as
well as to English text. For example, in Tamil
oru Arpputham stands for 100 million, Kumbam for
10 billion, and so on up to Anniyan for one
zillion.
Picture Dictionary
Pictures, photos or line drawings that depict
popular words are included in the dictionary to
help children learn efficiently with this tool.
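
As a rough illustration of the number to text converter, the Python sketch below spells out non-negative integers in English and looks up the two Tamil names that the slide gives explicitly. The function names and the lookup table are hypothetical; the full Tamil conversion rules used by Agaraadhi are not described in the slides and are not reproduced here.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
SCALES = [(10**9, "billion"), (10**6, "million"),
          (10**3, "thousand"), (10**2, "hundred")]

# Only the two values stated on the slide, romanized as they appear there.
TAMIL_POWER_NAMES = {10**8: "arpputham", 10**10: "kumbam"}


def to_english(n: int) -> str:
    """Spell out a non-negative integer in English words."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    # n >= 100, so one of the scales below always applies.
    for value, name in SCALES:
        if n >= value:
            head, rest = divmod(n, value)
            text = to_english(head) + " " + name
            return text + (" " + to_english(rest) if rest else "")


def tamil_power_name(n: int):
    """Return the Tamil name for n if the slide lists one, else None."""
    return TAMIL_POWER_NAMES.get(n)


print(to_english(10**8), "=", tamil_power_name(10**8))
```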

12
RESULTS

Query word: pookkal (பூக்கள்)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AA%E0%AF%82%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%B3%E0%AF%8D&ln=ta&Submit.x=8&Submit.y=7
Query word: mazhai (மழை)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AE%E0%AE%B4%E0%AF%88&ln=ta&Submit.x=21&Submit.y=4
Query word: fruit
http://www.agaraadhi.com/dict/OD.jsp?w=fruit&ln=en
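
The result URLs above appear to carry the query word as percent-encoded UTF-8 in a `w` parameter and the language code in `ln`. The Python sketch below rebuilds such a query string with the standard library; the parameter names are inferred from the URLs on this slide and should be treated as an assumption, not as a documented API.

```python
from urllib.parse import quote, urlencode

BASE = "http://www.agaraadhi.com/dict/OD.jsp"


def lookup_url(word: str, lang: str) -> str:
    """Build a lookup URL; urlencode percent-encodes the word as UTF-8."""
    return BASE + "?" + urlencode({"w": word, "ln": lang})


print(lookup_url("மழை", "ta"))
print(lookup_url("fruit", "en"))
print(quote("பூக்கள்"))  # the encoded bytes of the first query word
```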

13
FUTURE WORK

Providing APIs for programmers and developing
mobile apps for the Agaraadhi framework will open
a good platform for the many researchers and
developers working in the Tamil computing area.

14
REFERENCE

Anandan, R. Parthasarathi, and Geetha,
Morphological Analyser for Tamil. ICON 2002,
2002.
Anandan, R. Parthasarathi, and Geetha,
Morphological Generator for Tamil. Tamil Inayam,
Malaysia, 2001.
J. Jai Hari Raju, P. IndhuReka, Dr. Madhan Karky,
Statistical Analysis and visualization of Tamil
Usage in Live Text Streams, Tamil Internet
Conference, Coimbatore, 2010.

15
An Efficient Tamil Text Compaction System

N.M.Revathi
G.P.Shanthi
Elanchezhiyan.K
T V Geetha
Ranjani Parthasarathi
Madhan Karky

16
OBJECTIVES

Why Compacting?
Message length is limited on blog sites, and
mobile phones have tiny user interfaces.
Compaction saves online storage space and hence
reduces cost.
The paper proposes
a text compaction system for Tamil, the first of
its kind for the language.
Idea of compaction
There is no fixed rule for obtaining the shortest
form of a word; the main aim is that it remains
understandable.
A compact form can be obtained by omitting
letters and by replacing prefixes and suffixes
with suitable symbols and numbers.

17
FRAMEWORK ARCHITECTURE

18
FRAMEWORK CONT..

Input Processing
The morphological analyzer removes the suffix (if
present) added to the word and delivers the root
word (RW).

19
FRAMEWORK CONT..

Identification of the category and extraction of
the compact word
Three categories of words: common Tamil words,
abbreviations/acronyms, and numbers.
Abbreviations/acronyms are identified by comparing
the word with the keys of the hashmap.
With the help of the hash key and a mapping
algorithm, the compact word is retrieved.
Otherwise the word is either a common Tamil word
or a number.
For numbers, a numerical analyser performs
text-to-number conversion.
Output Processing
The Tamil Morphological Generator adds the
suitable suffix to satisfy the rules of the
language.
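
The three stages (input processing, category identification with compact-word extraction, output processing) can be sketched in Python as follows. Everything concrete here is hypothetical: the romanized word tables, the toy analyser output, and the vowel-dropping rule for common words are stand-ins chosen for illustration, since the slides do not list the actual mappings.

```python
# Toy stand-in for the morphological analyser: word -> (root, suffix).
ANALYSES = {"maduraiyil": ("madurai", "yil")}
# Hypothetical hashmap of abbreviations/acronyms.
ABBREVIATIONS = {"madurai": "mdu"}
# Hypothetical number words for text-to-number conversion.
NUMBER_WORDS = {"ondru": "1", "irandu": "2", "moondru": "3"}
VOWELS = set("aeiou")


def analyse(word):
    """Input processing: split off the suffix, if the toy table knows one."""
    return ANALYSES.get(word, (word, ""))


def compact_root(root):
    """Identify the category of the root word and return its compact form."""
    if root in ABBREVIATIONS:      # abbreviation/acronym category
        return ABBREVIATIONS[root]
    if root in NUMBER_WORDS:       # number category
        return NUMBER_WORDS[root]
    # Common word: keep the first letter, omit the later vowels.
    return root[0] + "".join(ch for ch in root[1:] if ch not in VOWELS)


def compact(text):
    """Compact each word, then re-attach its suffix (output processing)."""
    out = []
    for word in text.lower().split():
        root, suffix = analyse(word)
        out.append(compact_root(root) + suffix)
    return " ".join(out)


print(compact("Maduraiyil irandu pookkal"))
```

In the system described on the slides the suffix is re-attached by the morphological generator according to the rules of the language; plain concatenation is used here only to keep the sketch short.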

20
RESULT AND ANALYSIS

Tested with over 10,000 words.
The compacted output is reduced to 40% of the
original text.

21
REFERENCES

Anandan, R. Parthasarathi, and Geetha,
Morphological Analyser for Tamil. ICON 2002,
2002.
Fung, L. M. (2005). SMS short form identification
and codec. Unpublished masters thesis, National
University of Singapore, Singapore .
Acrophile (LSLarkey, P Ogilvie, MA Price, B
Tamilio, 2000) a system that automatically
searches acronym expansion pairs.
Short Message Service (SMS) Texting Symbols A
Functional Analysis of 10,000 Cellular Phone Text
Messages by Robert E. Beasley,Franklin College.

22
Kuralagam - Concept Relation based Search Engine
for Thirukkural

Elanchezhiyan.K
T.V.Geetha
Ranjani Parthasarathi
Madhan Karky

23
Objectives

Kuralagam is a conceptual search framework for
Thirukkural based on the UNL framework.
Searching with keywords in kurals and
interpretations
Concept based search using CoReX conceptual
indexing, which is based on UNL
Bilingual search: English and Tamil
Showing relationships between the concepts.

24
Kuralagam Framework
25
Offline Processing

Web Crawler
A Thirukkural statistics crawler
crawls news and blog documents to find the
usage of each individual Thirukkural.
The recorded usage is used to measure the
popularity score of each Thirukkural.
Enconversion based on UNL
Indexing based on the CoReX framework

26
UNL Enconversion

UNL is an intermediate language that
processes knowledge across language barriers.
It captures semantics by converting the natural
language terms present in a document to concepts.
Concepts are connected to other concepts
through UNL relations. There are 46 UNL relations,
such as plf (place from), plt (place to),
tmf (time from) and tmt (time to).
The process of converting natural language text to
a UNL graph is known as Enconversion;
the reverse process is known as Deconversion.

27
An Example speaks more...

Ex: John was playing in the garden
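
The graph shown on this slide is not preserved in the transcript. As a hedged illustration, the Python sketch below writes one plausible UNL encoding of the sentence as relation triples, using the standard `agt` (agent) and `plc` (place) relations; the exact constraints and attributes are an assumption and may differ from the original figure.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    label: str   # UNL relation name
    source: str  # universal word the relation starts from
    target: str  # universal word the relation points to


PLAY = "play(icl>do).@entry.@past.@progress"

# "John was playing in the garden" as two binary relations.
GRAPH = [
    Relation("agt", PLAY, "John(iof>person)"),   # John is the agent of play
    Relation("plc", PLAY, "garden(icl>place)"),  # the garden is the place
]


def concepts(graph):
    """List the headwords of all nodes, in order of first appearance."""
    seen = []
    for rel in graph:
        for node in (rel.source, rel.target):
            head = node.split("(")[0]
            if head not in seen:
                seen.append(head)
    return seen


print(concepts(GRAPH))
```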

28
Indexer

The Kuralagam Indexer is designed based on CoReX
Techniques.
The Indexer stores and manages the UNL graphs in
two different indices.
Concept only index (C index), and
Concept-Relation-Concept index (CRC index)
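
A minimal Python sketch of the two indices, assuming each document's UNL graph is available as a list of (concept, relation, concept) triples. The document ids and triples are made up for illustration, and the real CoReX index layout is not described on the slide.

```python
from collections import defaultdict


def build_indices(docs):
    """docs maps a document id to its list of (concept, relation, concept)."""
    c_index = defaultdict(set)    # concept -> ids of documents containing it
    crc_index = defaultdict(set)  # full triple -> ids of documents containing it
    for doc_id, triples in docs.items():
        for source, relation, target in triples:
            c_index[source].add(doc_id)
            c_index[target].add(doc_id)
            crc_index[(source, relation, target)].add(doc_id)
    return c_index, crc_index


# Hypothetical documents, for illustration only.
DOCS = {
    "doc1": [("play", "agt", "John"), ("play", "plc", "garden")],
    "doc2": [("sing", "plc", "garden")],
}
C_INDEX, CRC_INDEX = build_indices(DOCS)
print(sorted(C_INDEX["garden"]))
print(sorted(CRC_INDEX[("play", "plc", "garden")]))
```

The C index answers "which documents mention this concept", while the CRC index answers the stricter question "which documents relate these two concepts in this way".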

29
Online Processing

Query Translation and Expansion
Converts the user query to a UNL graph.
Uses the CRC (Concept-Relation-Concept) CoReX
indices to fetch the similarity thesaurus and
co-occurrence list that populate the multi-list
data structure.
Search and Ranking
Fetches the Thirukkural number and its details.
Thirukkurals for a given query are fetched using
the two types of concept relation indices, namely
CRC and C.
The query concept is expanded using related CRC
indices pointing to the query concept.
This helps in retrieving many Thirukkurals that
are conceptually related to the query, which is
not possible with keyword-based Thirukkural
search engines.
The ranking is based on
priority of the indices in the order CRC > C
usage score
frequency of occurrence of the query concept
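
The three ranking criteria can be combined in several ways; the slide does not say how. The Python sketch below assumes a simple lexicographic ordering: number of CRC matches first, then usage score, then number of concept (C) matches. The index contents, kural numbers and usage scores are invented for the example.

```python
def rank(query_triples, c_index, crc_index, usage):
    """Order document ids by CRC matches, then usage score, then C matches."""
    scores = {}  # doc id -> [crc_matches, c_matches]
    for source, relation, target in query_triples:
        for doc in crc_index.get((source, relation, target), ()):
            scores.setdefault(doc, [0, 0])[0] += 1
        for concept in (source, target):
            for doc in c_index.get(concept, ()):
                scores.setdefault(doc, [0, 0])[1] += 1
    return sorted(
        scores,
        key=lambda d: (-scores[d][0], -usage.get(d, 0), -scores[d][1], d),
    )


# Hypothetical indices and usage scores keyed by kural number.
C_INDEX = {"rain": {1, 2, 3}, "world": {1, 3}}
CRC_INDEX = {("rain", "obj", "world"): {3}}
USAGE = {1: 5, 2: 9, 3: 1}

print(rank([("rain", "obj", "world")], C_INDEX, CRC_INDEX, USAGE))
```

Kural 3 comes first because it is the only one matching the full concept-relation-concept triple, even though its usage score is the lowest.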

30
Tab Layout
31
Performance Evaluation

The accuracy of the Thirukkural search engine was
measured using average precision and mean
average precision.
Concept based search and keyword based search
were compared using the average precision
methodology.
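
For reference, average precision and mean average precision in their standard information retrieval form can be computed as in the Python sketch below. The ranked lists and relevance judgements are made-up examples, not the evaluation data of the paper.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries.
RUNS = [
    ([3, 7, 1, 9], {3, 1}),  # relevant items at ranks 1 and 3
    ([2, 5], {5}),           # relevant item at rank 2
]
print(average_precision(*RUNS[0]))
print(mean_average_precision(RUNS))
```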

32
Average Precision
33
Reference

1. Subalalitha, T V Geetha, Ranjani Parthasarathi
and Madhan Karky Vairamuthu. CoReX: A Concept
Based Semantic Indexing Technique. In SWM-08,
2008, India.
2. UNDL Foundation. The Universal Networking
Language (UNL) Specifications, Version 3,
Edition 3. UNL Center, UNDL Foundation,
December 2004.
3. Anandan, R. Parthasarathi, and Geetha,
Morphological Analyser for Tamil. ICON 2002,
2002.
4. T. Dhanabalan, K. Saravanan, and T. V. Geetha.
Tamil to UNL Enconverter. ICUKL, Goa, India, 2002.
5. A. Turpin and F. Scholer. User performance
versus precision measures for simple search
tasks. In 29th Annual International ACM SIGIR
Conference on Research and Development in
Information Retrieval, 2006, Seattle, Washington,
USA.

34
Template Based MultiLingual Summary Generation

Subalalitha C.N
E.Umamaheswari
T V Geetha
Ranjani Parthasarathi
Madhan Karky

35
Aim

To generate a multilingual summary based on the
Universal Networking Language (UNL) framework

36
The Architecture
37
Multi Lingual Summary Generation using UNL

Template based Information Extraction
Seven tourism specific templates have been
designed and used
Templates filled using semantic information
inherent in UNL input graphs
Template information is language independent and
can be used with any desired language.

38
Example Templates for Tourism Domain

Template             Semantics inherited from UNL
God                  iof>god, iof>goddess, icl>god
Food                 icl>food, icl>fruit
Flora and Fauna      icl>animal, icl>reptile, icl>mammal, icl>plant
Boarding facility    icl>facility
Transport facility   icl>transport
Place                icl>place, iof>place, iof>city, iof>country
Distance             icl>unit, icl>number
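
Template filling from UNL constraints can be sketched in Python as below. The constraint sets are taken from the table above, written with `>` as in standard UNL notation; the input nodes and the matching rule (exact match on a single constraint) are assumptions made for illustration.

```python
TEMPLATES = {
    "God": {"iof>god", "iof>goddess", "icl>god"},
    "Food": {"icl>food", "icl>fruit"},
    "Flora and Fauna": {"icl>animal", "icl>reptile", "icl>mammal", "icl>plant"},
    "Boarding facility": {"icl>facility"},
    "Transport facility": {"icl>transport"},
    "Place": {"icl>place", "iof>place", "iof>city", "iof>country"},
    "Distance": {"icl>unit", "icl>number"},
}


def fill_templates(nodes):
    """Assign each node like 'mango(icl>fruit)' to the templates it matches."""
    filled = {}
    for node in nodes:
        head, _, rest = node.partition("(")
        constraint = rest.rstrip(")")
        for name, constraints in TEMPLATES.items():
            if constraint in constraints:
                filled.setdefault(name, []).append(head)
    return filled


# Hypothetical nodes from an enconverted tourism document.
NODES = ["Madurai(iof>city)", "Meenakshi(iof>goddess)",
         "mango(icl>fruit)", "visit(icl>do)"]
print(fill_templates(NODES))
```

Because the filled templates hold only concepts, not surface words, the same result can be rendered in any target language, which is the language independence the slides refer to.
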
39
Summary Generation

The template information is converted to the
target language using the respective UNL-target
language dictionaries.
UNL-target language dictionaries contain root
words.
The natural language term is obtained from the
root word using target language information such
as case suffixes, and language technology tools
such as a morphological generator.
When this converted template information is
fitted into target language specific dynamic
sentence patterns, a summary is generated.

40
Performance Evaluation

Tested with 33,000 Tamil and English text
documents enconverted to UNL graphs.
The performance of the methodology proposed has
been evaluated using human judgement.
The accuracy of the generated summaries reached
90%.

Further Enhancements
Query specific summary
Comparing the performance with human generated
summaries.

41
References

1. Elanchezhiyan K, T V Geetha, Ranjani
Parthasarathi and Madhan Karky, CoRe: Concept
Based Query Expansion, Tamil Internet Conference,
Coimbatore, 2010.
2. Alkesh Patel, Tanveer Siddiqui, U. S.
Tiwary, A language independent approach to
multilingual text summarization, Conference
RIAO 2007, Pittsburgh PA, U.S.A., May 30-June
1, 2007.
3. David Kirk Evans, Identifying Similarity in
Text: Multi-Lingual Analysis for Summarization,
Doctor of Philosophy thesis, Graduate School of
Arts and Sciences, Columbia University, 2005.
4. Radev, Allison, Blair-Goldensohn et al.
(2004), MEAD: a platform for multidocument
multilingual text summarization.
5. The Universal Networking Language (UNL)
Specifications, Version 3, Edition 3, UNL Center,
UNDL Foundation, December 2004.
6. Jagadeesh J, Prasad Pingali, Vasudeva Varma,
Sentence Extraction Based Single Document
Summarization, Workshop on Document
Summarization, March 2005, IIIT Allahabad.
7. Naresh Kumar Nagwani, Dr. Shrish Verma, A
Frequent Term and Semantic Similarity based
Single Document Text Summarization Algorithm,
International Journal of Computer Applications
(0975-8887), Volume 17, No. 2, March 2011.
8. Prof. R. Nedunchelian, Centroid Based
Summarization of Multiple Documents Implemented
using Timestamps, First International Conference
on Emerging Trends in Engineering and Technology,
IEEE, 2008.