Compression of Small Text Files using Syllables

Jan Lánský, Michal Žemlička
Charles University in Prague, Faculty of Mathematics and Physics, Czech Republic
zizelevak_at_gmail.com, michal.zemlicka_at_mff.cuni.cz
  • Syllables
      • Sequences of sounds.
      • In some languages they can also be recognized easily in
        written text.
      • For our purposes, a syllable is a group of vowels optionally
        surrounded by consonants.
  • Decomposing words into syllables
      • A correct decomposition often requires knowing the origin of
        the word.
      • For compression purposes an approximation of the correct
        decomposition is sufficient, provided that it splits the text
        into sufficiently frequent parts and allows the original text
        to be reconstructed from these parts.
  • Examples of decomposition
      • English: trans for ma tion
      • Czech: nej ne ob hos po dá řo va tel něj ší mi
  • Compression Methods
      • HuffSyllable (HS): a slightly improved HuffWord that uses
        syllables (5 adaptive Huffman trees) instead of words. It
        distinguishes 5 types of syllables, whereas word-based methods
        use only 2 types of compression units: words and non-words. In
        each step of the algorithm, the expected type of the syllable
        currently being processed is calculated.
      • LZWL: LZW applied to syllables instead of characters.
  • Syllable-based compression is suitable for languages
      • in which the mapping between letters and sounds is accurate
        enough,
      • in which words have many grammatical forms, such as Czech,
        German, or Russian.
  • Results
      • Our experimental results show that syllable-based compression
          • is less resource consuming than word-based compression and
            more resource consuming than character-based compression;
          • outperforms (as expected) character-based compression;
          • for English, is always outperformed by word-based
            compression using the same algorithm;
          • for Czech, is at least comparable with (and typically
            outperforms) word-based compression using the same
            algorithm.
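
The decomposition and the LZWL idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the splitting rule (the last consonant of a cluster between two vowel groups opens the next syllable), the ASCII-only vowel set with 'y' counted as a vowel, and the function names split_word, syllabify, lzwl_encode and lzwl_decode are all hypothetical. The initial dictionary is simply the list of distinct units in the input, which a decoder would have to receive alongside the codes; the transcript does not say how the original LZWL handles this.

```python
import re


def split_word(word):
    """Approximate syllables of one ASCII-letter word (illustrative rule only)."""
    # Start offsets of maximal vowel groups; 'y' counts as a vowel (assumption).
    starts = [m.start() for m in re.finditer(r"[aeiouyAEIOUY]+", word)]
    syllables, begin = [], 0
    for next_start in starts[1:]:
        # Vowel groups are maximal, so at least one consonant separates them;
        # the last consonant of that cluster opens the next syllable.
        cut = next_start - 1
        syllables.append(word[begin:cut])
        begin = cut
    syllables.append(word[begin:])
    return syllables


def syllabify(text):
    """Split text into units whose concatenation is exactly the original text."""
    units = []
    for m in re.finditer(r"([A-Za-z]+)|[^A-Za-z]+", text):
        if m.group(1):
            units.extend(split_word(m.group(1)))
        else:
            units.append(m.group(0))  # non-letter runs are kept as single units
    return units


def lzwl_encode(units):
    """LZW over syllable units instead of characters (simplified sketch)."""
    alphabet = sorted(set(units))  # assumed to be shared with the decoder
    table = {(u,): i for i, u in enumerate(alphabet)}
    codes, phrase = [], ()
    for u in units:
        candidate = phrase + (u,)
        if candidate in table:
            phrase = candidate
        else:
            codes.append(table[phrase])
            table[candidate] = len(table)  # learn the new phrase
            phrase = (u,)
    if phrase:
        codes.append(table[phrase])
    return alphabet, codes


def lzwl_decode(alphabet, codes):
    """Inverse of lzwl_encode; rebuilds the phrase table while decoding."""
    if not codes:
        return []
    table = [(u,) for u in alphabet]
    prev = table[codes[0]]
    out = list(prev)
    for code in codes[1:]:
        # A code equal to len(table) refers to the phrase being defined right now.
        entry = table[code] if code < len(table) else prev + (prev[0],)
        out.extend(entry)
        table.append(prev + (entry[0],))
        prev = entry
    return out


text = "transformation of information"
units = syllabify(text)
alphabet, codes = lzwl_encode(units)
assert "".join(lzwl_decode(alphabet, codes)) == text  # lossless round trip
print(units[:4])  # ['trans', 'for', 'ma', 'tion']
```

Because syllabify never drops or reorders characters, joining the decoded units reproduces the input exactly, which is the reconstruction property required of an approximate decomposition. Accented vowels (as in Czech) are not handled by this ASCII-only sketch.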

[Figure: Classification of Symbols. Symbols are divided into Letters (Capital, Small; Vowels, Consonants) and Non-Letters (Spec. characters, Digits).]
[Figure: Classification of Words / Syllables. Words are divided into Letter (Capital, e.g. HELLO; Small, e.g. hello; Mixed, e.g. Hello) and Non-letter (Special, e.g. ?; Numeric, e.g. 1982).]
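
The classification of words and syllables shown in the figure can be approximated with a few string tests. This is a minimal sketch, assuming the five leaf classes (Capital, Small, Mixed, Numeric, Special) and the examples given in the figure; the function name classify_unit and the treatment of units that mix letters with other characters are assumptions, not something the transcript specifies. These five classes plausibly correspond to the 5 syllable types used by HuffSyllable, although the transcript does not state this explicitly.

```python
def classify_unit(unit: str) -> str:
    """Assign a word or syllable to one of five classes (hypothetical helper)."""
    if unit.isalpha():
        if unit.islower():
            return "small"    # e.g. hello
        if unit.isupper():
            return "capital"  # e.g. HELLO
        return "mixed"        # e.g. Hello
    if unit.isdigit():
        return "numeric"      # e.g. 1982
    # Assumption: everything else, including units that mix letters
    # with digits or punctuation, is treated as special.
    return "special"          # e.g. ?


for example in ["hello", "HELLO", "Hello", "1982", "?"]:
    print(example, classify_unit(example))
```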