Application of Automatic Language and Subject Identification for Universal Digital Library using Spa - PowerPoint PPT Presentation

1
Application of Automatic Language and Subject
Identification for Universal Digital Library
using Sparse Data
  • Lavanya Prahallad, Suryakanth V Gangashetty,
    Kishore Prahallad and Raj Reddy
  • IIIT Hyderabad, India
  • School of Computer Science,
    Carnegie Mellon University (CMU),
    Pittsburgh, PA 15213, USA
  • Email: lavanyap_at_andrew.cmu.edu, svg_at_cs.cmu.edu,
    skishore_at_cs.cmu.edu, rr_at_cmu.edu

2
Need for Language / Subject Identification
  • Language and subject are essential parts of
    the meta-data of a book
  • Meta-data helps to search for and locate a book
  • Books in UDL have often been tagged with
    incorrect language and subject information
  • Data entry operators did not know the language /
    subject, e.g. an Arabic book being scanned in China /
    India
  • Poor decision making from the title
  • Title: Learning Hindi
  • Language: English / Hindi
  • Subject: Literature / Education
  • Automatic methods are required to identify the
    language / subject from the title (sparse data) of
    a book
  • Not enough data to build/train statistical
    models, hence knowledge-based systems are more
    appropriate

3
Function Words for Language Identification
  • Content words
  • Meaning is best described in a dictionary
  • Belong to open sets, so new ones can freely
    be added to the language
  • Examples: nouns, verbs, adjectives and adverbs;
    new words may be formed readily
  • Function words
  • Words with little inherent meaning but with
    important roles in the grammar of a language
  • Closed-class words
  • Have little lexical meaning or have ambiguous
    meaning
  • Serve to express grammatical relationships with
    other words within a sentence. Examples:
    prepositions, pronouns, auxiliary verbs,
    conjunctions, grammatical articles and particles
  • New function words are very rarely added to
    a language over time

4
Approach for Language Identification
  • Training: manually collect 25-50 pages for each
    language from Wikipedia
  • Select the most frequently occurring function words
    in each language
  • Create a dictionary of function words for each
    language
  • Remove words that occur in more than one
    language
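
The training steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `top_n` cutoff, the whitespace tokenizer, and the use of raw word frequency as a stand-in for "function word" are all assumptions. The title-matching step (`identify_language`) is likewise an assumed way of applying the dictionaries, since the slides do not spell out the test phase.

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip ASCII punctuation."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]


def build_function_word_dicts(corpora, top_n=50):
    """corpora maps language -> training text (e.g. the collected pages).

    Assumption: the top_n most frequent words approximate the function words.
    """
    top_words = {
        lang: {w for w, _ in Counter(tokenize(text)).most_common(top_n)}
        for lang, text in corpora.items()
    }
    # Count in how many languages each candidate word appears.
    seen_in = Counter(w for words in top_words.values() for w in words)
    # Keep only the words that are unique to a single language.
    return {
        lang: {w for w in words if seen_in[w] == 1}
        for lang, words in top_words.items()
    }


def identify_language(title, dictionaries):
    """Return the language whose dictionary matches the most title words.

    Returns None when no title word is found in any dictionary.
    """
    tokens = tokenize(title)
    scores = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in dictionaries.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Removing shared words keeps each dictionary discriminative: a title word then votes for at most one language, which matters when a title contains only a handful of words.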

5
Approach for Language Identification

6
Studies on Language Identification

7
  • Number of books: 1,535,764; number of books belonging
    to the 5 languages: 365,518

8
Issues in Subject Identification
  • List of subjects
  • From Library of Congress, confined to a set of 48
    subjects
  • Database of keywords related to each subject
  • Not readily available
  • Obtain 25-50 pages for each subject from
    Wikipedia
  • Stems (roots) of keywords might be a better
    choice for search, but a morphological analyzer is
    required
  • Use the first 8 characters as approximate stems
  • Identification from sparse data (i.e. title)
  • A statistical model needs more data for training
  • Knowledge-based systems might be a better choice
  • Keywords are not sufficient for subject
    identification
  • Ex: The Life of Mahatma Gandhi
  • Hence patterns are used: "Life of" -> Biography,
    "civil war" -> History

9
Approach for Subject Identification
  1. Consider only the first 8 characters as stems for
     subject names and the corresponding keywords of
     each subject.
  2. Similarly, consider only the first 8 characters as
     stems of each word in the title.
  3. Match the title stems with the patterns of each
     subject (a sample list of standard patterns for
     subjects is given in Table). If no subject is
     found, proceed to step 4.
  4. Match the title stems with the stems of the
     subject names. If no subject is found, proceed to
     step 5.
  5. Match the title stems with the stems of the
     keywords of each subject. If no subject is found,
     proceed to step 6.
  6. Match the title stems with the stems of subject
     "General". If no subject is found, then consider the
     title as undetermined.
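
The matching cascade described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the tiny `SUBJECTS`, `PATTERNS`, `KEYWORDS` and `GENERAL_KEYWORDS` tables are hypothetical stand-ins for the 48-subject list and the keyword database, and returning the first matching subject (in table order) is an assumed tie-breaking rule.

```python
import string

STEM_LENGTH = 8  # first 8 characters serve as an approximate stem


def stems_of(text):
    """Lowercase, split on whitespace, strip punctuation, truncate each word."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w[:STEM_LENGTH] for w in words if w]


def contains_sequence(haystack, needle):
    """True if the list `needle` occurs as consecutive items in `haystack`."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


# Hypothetical toy knowledge base (assumption, for illustration only).
SUBJECTS = ["Biography", "History", "Mathematics"]
PATTERNS = {"Biography": ["life of"], "History": ["civil war"]}
KEYWORDS = {
    "History": ["empire", "dynasty"],
    "Mathematics": ["algebra", "geometry", "calculus"],
}
GENERAL_KEYWORDS = ["encyclopedia", "dictionary"]


def identify_subject(title):
    title_stems = stems_of(title)

    # Step 3: multi-word patterns of each subject.
    for subject, patterns in PATTERNS.items():
        if any(contains_sequence(title_stems, stems_of(p)) for p in patterns):
            return subject

    # Step 4: stems of the subject names themselves.
    for subject in SUBJECTS:
        if contains_sequence(title_stems, stems_of(subject)):
            return subject

    # Step 5: stems of the keywords of each subject.
    for subject, keywords in KEYWORDS.items():
        keyword_stems = {s for k in keywords for s in stems_of(k)}
        if any(s in keyword_stems for s in title_stems):
            return subject

    # Step 6: stems associated with the catch-all subject "General".
    general_stems = {s for k in GENERAL_KEYWORDS for s in stems_of(k)}
    if any(s in general_stems for s in title_stems):
        return "General"

    return "Undetermined"
```

With these toy tables, `identify_subject("The Life of Mahatma Gandhi")` resolves at the pattern step, while a title such as "Mathematical Puzzles" falls through to the subject-name step because "mathematical" and "mathematics" share the same 8-character stem.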

10

11

12

13
Summary and Conclusions
  • Proposed a method for automatic language and subject
    identification of books
  • Useful for the meta-data in the universal digital
    library
  • Better performance in subject identification
    is still needed
  • We continue to work on the automatic subject and
    language identification task for other languages
    as well

14
Thank You