Slovak morphological analyser

1
Slovak morphological analyser
  • Marek Grác
  • FI MUNI, Brno
  • Czech republic

2
Content
  • Morphological analysis and the analyser
  • Logical structure of data
  • Conversion process
  • State of the art and future work

3
Morphological analysis
  • Lemmatization: children → child, wrote → write
  • Assigning grammatical categories: children → noun, plural
  • Methods are based on a dictionary, rules written by linguists, or machine learning
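
As a rough illustration of the dictionary-based method, the sketch below maps a word form to its lemma and grammatical categories with a plain lookup table. The table `TOY_DICTIONARY` and the function `analyse` are hypothetical names, and the entries are toy English data taken from the slide examples (the categories for "wrote" are an assumption). This is not how ajka stores its data.

```python
# Hypothetical toy dictionary: word form -> (lemma, grammatical categories).
# Entries follow the English examples on the slide; they are not ajka data.
TOY_DICTIONARY = {
    "child": ("child", ("noun", "singular")),
    "children": ("child", ("noun", "plural")),
    "wrote": ("write", ("verb", "past")),  # categories assumed for illustration
}


def analyse(word):
    """Return (lemma, categories) for a known word form, or None if unknown."""
    return TOY_DICTIONARY.get(word.lower())
```

A purely dictionary-based lookup returns nothing for forms that are not listed, which is why coverage of the source data matters so much for this kind of analyser.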

4
Morphological analyser ajka
  • Laboratory of NLP, FI MUNI, Brno
  • Dictionary-based analyser
  • Free license (GNU/GPL)
  • Support for several platforms: Linux, Windows
  • Libraries for several programming languages: C, Perl, SWI-Prolog

5
Morphological analyser ajka
  • Grammar categories are fully configurable
  • Uses finite-state automata (FSA) and a trie structure (very fast)
  • Processes only single words: New York → New, York
  • Language independent, but
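
To make the trie idea and the single-word restriction concrete, here is a minimal sketch of a character trie with per-word lookup. The classes `TrieNode` and `Trie` and the function `analyse_text` are hypothetical and much simpler than whatever ajka uses internally; the point is only that lookup cost depends on the word length, and that a text such as "New York" is analysed as two separate tokens.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.analyses = []  # analyses stored for the word ending at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, analysis):
        """Store one analysis for a word, creating nodes as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.analyses.append(analysis)

    def lookup(self, word):
        """Return the analyses of a word; an empty list if it is unknown."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.analyses)


def analyse_text(trie, text):
    """Analyse each whitespace-separated token on its own (single words only)."""
    return [(token, trie.lookup(token)) for token in text.split()]
```

With this design a multi-word expression has no entry of its own: each token is looked up independently.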

6
Ajka - segmentation
  • A word is divided into four parts:
  • Prefix: negative, superlative
  • Stem base: wom-an, wom-en
  • Intersegment: wr-i-te, wr-o-te
  • Suffix: child-0, child-ren
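
The four-part segmentation can be pictured as a small record whose parts concatenate back into the word form. The sketch below is only an illustration under that assumption; the type `Segmentation` and its field names are hypothetical, and the sample values follow the English examples on the slide.

```python
from typing import NamedTuple


class Segmentation(NamedTuple):
    prefix: str        # e.g. a negative or superlative prefix; may be empty
    stem_base: str     # the part shared by all forms, e.g. "wr" or "child"
    intersegment: str  # the alternating part, e.g. "i" / "o" in write / wrote
    suffix: str        # the ending, e.g. "" / "ren" in child / children

    def word(self):
        """Concatenate the four parts back into the full word form."""
        return self.prefix + self.stem_base + self.intersegment + self.suffix
```

For example, `Segmentation("", "wr", "o", "te").word()` rebuilds "wrote", and `Segmentation("", "child", "", "ren").word()` rebuilds "children".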

7
Problems in conversion process
  • Source data collected by linguists
  • Human-readable (documents)
  • Basic knowledge of language
  • Different definitions of patterns
  • Different treatment of exceptions
  • Different grammatical categories

8
Comparison of data structures
  • Source data:
    • Exceptions are specified in the word declaration
    • Two-part segmentation: stem base + intersegment, suffix
    • Support for multi-word expressions: New York
  • ajka:
    • Exceptions are specified in the pattern
    • Four-part segmentation

9
Conversion process
  • Transform to XML
  • Create word list (lemma # word # tag)
  • Create declension patterns
  • Transform to format required by ajka

10
Conversion: Create word list
  • Consists of lines: lemma # word # tag
  • Heuristics for some naive declension processes
  • child # child # k1nS (noun, singular)
  • child # children # k1nP (noun, plural)
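
A word-list line could be read as follows, assuming the three fields are separated by " # " as in "child # children # k1nP". The names `Entry` and `parse_line` are hypothetical; this is a sketch of the intermediate format only, not of the actual conversion scripts.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    lemma: str
    word: str
    tag: str


def parse_line(line):
    """Parse one 'lemma # word # tag' line into an Entry.

    Assumes '#' is the field separator and that all three fields are present.
    """
    parts = [part.strip() for part in line.split("#")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'lemma # word # tag', got {line!r}")
    return Entry(*parts)
```

Rejecting malformed lines early is a deliberate choice here, since the source data was written for human readers and may not be uniform.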

11
Conversion: Create patterns
  • Fully automatic creation of patterns
  • Algorithm:
  • For each lemma L do
  • S := the intersegment and suffix of every word form of L
  • If S is not equal to a known pattern, then add S to the
    patterns, with L as the name of the pattern
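
The algorithm above can be sketched roughly as follows. Two points are assumptions rather than something the slides state: the stem base is taken to be the longest common prefix of all forms of a lemma, and each ending (intersegment plus suffix together) is paired with its tag. The function name `create_patterns` is hypothetical.

```python
import os
from collections import defaultdict


def create_patterns(entries):
    """Derive patterns from (lemma, word, tag) triples.

    Returns (patterns, lemma_to_pattern), where patterns maps a pattern name
    to a frozenset of (ending, tag) pairs and lemma_to_pattern maps each
    lemma to (stem, pattern name).
    """
    forms = defaultdict(list)
    for lemma, word, tag in entries:
        forms[lemma].append((word, tag))

    patterns = {}          # pattern name -> frozenset of (ending, tag)
    name_by_endings = {}   # frozenset of (ending, tag) -> pattern name
    lemma_to_pattern = {}  # lemma -> (stem, pattern name)
    for lemma, word_tags in forms.items():
        # Assumption: the stem is the longest common prefix of all forms.
        stem = os.path.commonprefix([word for word, _ in word_tags])
        endings = frozenset((word[len(stem):], tag) for word, tag in word_tags)
        if endings not in name_by_endings:
            # Unknown set of endings: the lemma gives the new pattern its name.
            name_by_endings[endings] = lemma
            patterns[lemma] = endings
        lemma_to_pattern[lemma] = (stem, name_by_endings[endings])
    return patterns, lemma_to_pattern
```

With toy English data, "cat" and "dog" (both with forms ending in "" and "s") would share one pattern named "cat", while "child" would get a pattern of its own.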

12
State of the art
  • Dictionary contains nouns, adjectives, pronouns
    and verbs
  • Special case: numbers
  • Usable, but

13
Future
  • Data cleaning
  • Grammar for numbers
  • Localization of other morphological tools
  • Integration with applications for ordinary users