Application of Automatic Language and Subject Identification for Universal Digital Library using Sparse Data


1
Application of Automatic Language and Subject
Identification for Universal Digital Library
using Sparse Data

Lavanya Prahallad, Suryakanth V Gangashetty,
Kishore Prahallad and
Raj Reddy
IIIT Hyderabad, India
School of Computer Science
Carnegie Mellon University (CMU)
Pittsburgh, PA 15213, USA
Email: lavanyap_at_andrew.cmu.edu, svg_at_cs.cmu.edu,
skishore_at_cs.cmu.edu, rr_at_cmu.edu

2
Need for Language / Subject Identification

Language and subject are essential parts of the meta-data of a book
Meta-data helps to search for and locate a book
Books in the UDL have often been tagged with incorrect language and subject information
Data entry operators did not know the language / subject, e.g. an Arabic book being scanned in China / India
Poor decision making from the title
Title: Learning Hindi
Language: English / Hindi
Subject: Literature / Education
Automatic methods are required to identify the language / subject of a book from its title (sparse data)
There is not enough data to build/train statistical models, hence knowledge-based systems are more appropriate

3
Function Words for Language Identification

Content words
Meaning is best described in a dictionary
Belong to open classes, so new ones can freely be added to the language
Examples: nouns, verbs, adjectives and adverbs; new words may be formed readily
Function words
Words with little inherent meaning but with important roles in the grammar of a language
Closed-class words
Have little lexical meaning or have ambiguous meaning
Serve to express grammatical relationships with other words within a sentence
Examples: prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles and particles
New function words are very rarely added to a language over time

4
Approach for Language Identification

Training: manually collect 25-50 pages for each language from Wikipedia
Select the most frequently occurring function words in each language
Create a dictionary of function words for each language
Remove words that occur in more than one language
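
The training steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The toy corpora, the function names, the `top_n` cutoff, the use of the most frequent words as a stand-in for hand-selected function words, and the title-scoring rule (count the title words found in each dictionary) are all assumptions, since the slides do not specify them.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then split on runs of non-letter characters.
    return [t for t in re.split(r"[\W\d_]+", text.lower()) if t]

def build_dictionaries(corpora, top_n=50):
    """corpora maps a language name to its training text."""
    # Keep the top_n most frequent words per language. Treating these
    # as function words is an assumption made for this sketch.
    frequent = {
        lang: {w for w, _ in Counter(tokenize(text)).most_common(top_n)}
        for lang, text in corpora.items()
    }
    # Remove every word that occurs in more than one language.
    seen = Counter(w for words in frequent.values() for w in words)
    return {
        lang: {w for w in words if seen[w] == 1}
        for lang, words in frequent.items()
    }

def identify_language(title, dictionaries):
    # Assumed scoring rule: count title words found in each dictionary.
    tokens = tokenize(title)
    scores = {
        lang: sum(t in words for t in tokens)
        for lang, words in dictionaries.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Toy training text (hypothetical); real training would use 25-50 pages.
corpora = {
    "english": "the life of the king and no one in the city",
    "spanish": "la vida de el rey y no hay nadie en la ciudad",
}
dicts = build_dictionaries(corpora)
print(identify_language("The Life of Mahatma Gandhi", dicts))  # english
```

The word "no" occurs in both toy corpora, so it is removed from both dictionaries and cannot vote for either language.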

5
Approach for Language Identification

6
Studies on Language Identification

Number of books: 1535764
Number of books belonging to 5 languages: 365518

8
Issues in Subject Identification

List of subjects
From the Library of Congress, confined to a set of 48 subjects
Database of keywords related to each subject
Not readily available
Obtain 25-50 pages for each subject from Wikipedia
Stems (roots) of keywords might be a better choice for search, but a morphological analyzer is required
Use the first 8 characters as approximate stems
Identification from sparse data (i.e. the title)
A statistical model needs more data for training
Knowledge-based systems might be a better choice
Keywords are not sufficient for subject identification
Ex: The Life of Mahatma Gandhi
Hence patterns are used: "Life of" -> Biography, "civil war" -> History
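
The 8-character approximation to stemming can be illustrated with a short Python sketch. The function name `approx_stem` and the lowercasing step are assumptions made for illustration.

```python
def approx_stem(word, length=8):
    # Lowercase, then keep at most the first `length` characters.
    return word.lower()[:length]

print(approx_stem("Biography"), approx_stem("biographies"))  # biograph biograph
print(approx_stem("history"), approx_stem("historical"))     # history historic
```

The second line shows the limit of the approximation: related words shorter than 8 characters do not collapse to the same stem, which is where a real morphological analyzer would help.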

9
Approach for Subject Identification

1. Consider only the first 8 characters as stems for subject names and the corresponding keywords of each subject.
2. Similarly, consider only the first 8 characters as stems of each word in the title.
3. Match the title stems with the patterns of each subject (a sample list of standard patterns for subjects is given in the table). If no subject is found, proceed to step 4.
4. Match the title stems with the stems of the subject names. If no subject is found, proceed to step 5.
5. Match the title stems with the stems of the keywords of each subject. If no subject is found, proceed to step 6.
6. Match the title stems with the stems of subject "General". If no subject is found, consider the title as undetermined.
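
The cascade of matching steps above can be sketched in Python as shown below. This is a hedged illustration only. The two patterns come from the slides, while the subject list, the keyword lists, the "General" keywords, and all function names are hypothetical placeholders for the knowledge base the slides describe.

```python
import re

def stem(word, length=8):
    # Steps 1 and 2: the first 8 characters act as an approximate stem.
    return word.lower()[:length]

def stems(text):
    return [stem(w) for w in re.findall(r"\w+", text)]

def contains(seq, sub):
    # True if `sub` occurs as a contiguous run inside `seq`.
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))

# Only these two patterns appear in the slides.
PATTERNS = {"life of": "Biography", "civil war": "History"}
# Hypothetical subjects and keywords, for illustration only.
SUBJECT_KEYWORDS = {
    "Biography": ["memoir", "autobiography"],
    "History": ["empire", "dynasty"],
    "Mathematics": ["algebra", "geometry"],
}
GENERAL_KEYWORDS = ["encyclopedia", "almanac"]

def identify_subject(title):
    t = stems(title)
    # Step 3: match against multi-word patterns.
    for pattern, subject in PATTERNS.items():
        if contains(t, stems(pattern)):
            return subject
    # Step 4: match against stems of the subject names.
    for subject in SUBJECT_KEYWORDS:
        if stem(subject) in t:
            return subject
    # Step 5: match against stems of each subject's keywords.
    for subject, keywords in SUBJECT_KEYWORDS.items():
        if any(stem(k) in t for k in keywords):
            return subject
    # Step 6: fall back to the "General" subject.
    if any(stem(k) in t for k in GENERAL_KEYWORDS):
        return "General"
    return None  # undetermined

print(identify_subject("The Life of Mahatma Gandhi"))  # Biography
print(identify_subject("Mathematical Tables"))         # Mathematics
```

In the second call no pattern matches, but "Mathematical" and "Mathematics" share the stem "mathemat", so the title is resolved at step 4.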

13
Summary and Conclusions

Proposed automatic language and subject identification for books
Useful for the meta-data in the Universal Digital Library
Better performance in subject identification is still needed
We continue to work on the automatic subject and language identification task for other languages as well

14
Thank You
