Coping with Babel - PowerPoint PPT Presentation

Provided by: gca
Transcript and Presenter's Notes

1
Coping with Babel
  • How to Localize XML

2
Designing for Localization
  • L10N - adapting material for target markets
  • Document design can seriously impact the costs of
    translation and localization.
  • Other language rules can differ significantly
    from English.
  • There are clear dos and don'ts.
  • Overriding principle is good XML practice.

3
Entity references
  • Do not use entity references for word
    substitution
  • <para>Use a &tool; to release the catch.</para>
  • Cause problems for inflected languages
  • Cause problems for parsing/translation tools
  • Use boilerplate text instead

4
Translatable attributes
  • Avoid using translatable attributes
  • <para>Use a <tool id="a1098" name="claw hammer"/>
    to release the CPU retention catch.</para>
  • Cause problems for inflected languages
  • Cause extra burden for translators
  • More to go wrong

5
CDATA sections
  • Avoid using CDATA sections that may contain
    translatable text
  • <tmpl><![CDATA[<p>Please refer to the <em>index
    page</em> for further information</p>]]></tmpl>
  • Lose syntactical control
  • Segmentation problems
  • How are translation tools to cope?
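
The loss of syntactical control can be seen with any XML parser. The following is a minimal sketch using Python's standard ElementTree module (the choice of tool is an assumption, not something the slides name): the same sentence is parsed once inside a CDATA section and once as real markup.

```python
import xml.etree.ElementTree as ET

# The CDATA example from the slide: <p> and <em> are character data here, not elements.
cdata_doc = ("<tmpl><![CDATA[<p>Please refer to the <em>index page</em> "
             "for further information</p>]]></tmpl>")
# The same content written as real markup.
markup_doc = ("<tmpl><p>Please refer to the <em>index page</em> "
              "for further information</p></tmpl>")

cdata_root = ET.fromstring(cdata_doc)
markup_root = ET.fromstring(markup_doc)

# Count the descendant elements that a translation tool could see.
cdata_children = len(list(cdata_root.iter())) - 1
markup_children = len(list(markup_root.iter())) - 1

print("elements inside CDATA version: ", cdata_children)
print("elements inside markup version:", markup_children)
print("CDATA text as seen by the parser:", repr(cdata_root.text))
```

In the CDATA version the parser reports one opaque string that still contains the angle brackets, so a tool has no inline elements to protect or segment on.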

6
Processing instructions
  • Avoid Processing Instructions in translatable
    text
  • <para>Use a <?tool name="claw hammer"?> to
    release the CPU retention catch.</para>
  • Syntactically weak
  • Confuse translation memory operations

7
Infinite Naming Schemes
  • Avoid the use of infinite naming schemes
  • <resources xml:lang="en">
      <err001>Cannot open file 1.</err001>
      <hint001>Hint: does file 1 exist?</hint001>
      <err002>Incorrect value.</err002>
      <hint002>Hint: Must be between 1 and 2.</hint002>
      <err003>Connection timeout.</err003>
      ...
    </resources>
  • No clear element definitions
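
One common alternative, which the slides do not spell out and is assumed here, is a fixed set of element names with the running number moved into an attribute. The sketch below converts the numbered names into that form; the `id` attribute and the `normalize` helper are illustrative choices, not part of any standard.

```python
import re
import xml.etree.ElementTree as ET

SOURCE = """<resources xml:lang="en">
<err001>Cannot open file 1.</err001>
<hint001>Hint: does file 1 exist?</hint001>
<err002>Incorrect value.</err002>
<hint002>Hint: Must be between 1 and 2.</hint002>
</resources>"""

# Splits a numbered element name such as "err001" into ("err", "001").
NAME_PATTERN = re.compile(r"^([a-z]+)(\d+)$")

def normalize(resources):
    """Return a copy in which <err001> becomes <err id="001">, and so on."""
    result = ET.Element(resources.tag, resources.attrib)
    for child in resources:
        match = NAME_PATTERN.match(child.tag)
        if match is None:
            raise ValueError(f"unexpected element name: {child.tag}")
        kind, number = match.groups()
        fixed = ET.SubElement(result, kind, {"id": number})
        fixed.text = child.text
    return result

normalized = normalize(ET.fromstring(SOURCE))
print(ET.tostring(normalized, encoding="unicode"))
```

After the conversion only two element names remain (`err` and `hint`), so a DTD or schema can define them once, however many messages are added.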

8
Typographical elements
  • Avoid the use of "typographical" elements
  • <para><b>Do not use</b> <br/> type elements.</para>
  • Bad XML practice.
  • Causes problems for translators.
  • Target language text may be in the opposite order.

9
Do not break sentences
  • Never break a linguistically complete text unit
    over more than one non-inline element
  • <para>
      <line>This text should not be</line>
      <line>broken this way: the translated text may
      well be in a different order.</line>
    </para>

10
XML Translation Standards
  • LISA - Localization Industry Standards
    Association http://www.lisa.org
  • OASIS - Organization for the Advancement of
    Structured Information Standards
    http://www.oasis-open.org
  • W3C - World Wide Web Consortium
    http://www.w3c.org
  • OLIF Consortium http://www.olif.net

11
LISA Standards
  • TMX - Translation Memory Exchange format
    http://www.lisa.org/tmx
  • TBX - Termbase Exchange format
    http://www.lisa.org/tbx
  • SRX - Segmentation Rules Exchange format
    http://www.lisa.org/srx
  • GMX - GILT Metrics Exchange format
    http://www.lisa.org/gmx

12
OASIS L10N Standards
  • XLIFF - XML Localization Interchange File Format
    http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff
  • TransWS - Translation Web Services
    http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=trans-ws

13
W3C and OLIF
  • W3C to start on Localization Directives standard.
  • OLIF - Open Lexicon Interchange Format
    http://www.olif.net

14
xml:tm
  • XML Text Memory
  • A radical new approach to translating XML
    documents

15
Computational Linguistic Methodologies
  • Machine Translation
  • Translation Memory
  • Hybrid Linguistic Inferencing Engines
  • Terminology

16
Translation memory
  • Advent in early 1980s
  • Intermediate format
  • Alignment
  • Storage
  • Leveraged memory
  • Fuzzy matching: statistical
  • Advantages: cost reduction, consistency
  • Drawbacks: proofreading, managing memories
  • No significant advances in technology
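
As a rough illustration of statistical fuzzy matching, the sketch below scores a new segment against stored source segments by character similarity. It uses Python's `difflib.SequenceMatcher` as a stand-in; real translation memory tools use their own scoring, and the memory contents, the placeholder targets and the 0.75 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def best_fuzzy_match(segment, memory, threshold=0.75):
    """Return (source, target, score) for the closest stored segment, or None."""
    best = None
    best_score = 0.0
    for source, target in memory.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best, best_score = (source, target), score
    if best is None or best_score < threshold:
        return None
    return best[0], best[1], best_score

# Hypothetical memory: source segment -> stored translation (placeholders).
memory = {
    "Use a claw hammer to release the catch.": "[stored translation 1]",
    "Incorrect value.": "[stored translation 2]",
}

hit = best_fuzzy_match("Use a claw hammer to release the CPU catch.", memory)
miss = best_fuzzy_match("Connection timeout.", memory)
print(hit)
print(miss)
```

A fuzzy hit only suggests a translation. The translator still has to adapt and proofread it, which is the drawback the slide lists.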

17
XML namespace
  • Major new feature of XML compared to SGML
  • Allows the mapping of different ontological
    entities onto the same representation
  • Allows different ways to look at the same data
  • Namespaces can be made transparent

18
xml:tm namespace
  • Text Memory namespace
  • Can be mapped onto any XML document
  • Vertical view of document in terms of text
    segments
  • Can be totally transparent

19
xml:tm namespace
Example of the use of the namespace in an XML document:
<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
      ...
20
xml:tm namespace
[Diagram: the original document view (doc, title, section, para, text) shown beside the tm namespace view (tm, te, tu). Each text-bearing element, such as title or para, corresponds to one te, and each te is divided into sentence-level tu units.]
21
xml:tm namespace
[Diagram: in the original document view a text element holds two sentences; in the tm namespace view the same text is one te containing two tu units, one per sentence.]
22
xml:tm namespace
Original document view (text):
<para>
Namespace is very simple. It is easy to use.
</para>
tm namespace view (one te, one tu per sentence):
<para>
  <tm:te id="e1">
    <tm:tu id="u1.1">Namespace is very simple.</tm:tu>
    <tm:tu id="u1.2">It is easy to use.</tm:tu>
  </tm:te>
</para>
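
The mapping from a paragraph to te and tu elements can be sketched as follows. This is an illustration only: the namespace URI and the id pattern are taken from the slide examples, while the sentence-splitting rule and the `add_text_memory` helper are simplifying assumptions (a real implementation would follow proper segmentation rules and handle inline elements).

```python
import re
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"  # namespace URI as written in the slide example
ET.register_namespace("tm", TM_NS)

# Naive rule (assumption): a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

def add_text_memory(para, element_number):
    """Wrap the text of a text-only <para> in one tm:te with one tm:tu per sentence."""
    sentences = [s for s in SENTENCE_BREAK.split((para.text or "").strip()) if s]
    para.text = None
    te = ET.SubElement(para, f"{{{TM_NS}}}te", {"id": f"e{element_number}"})
    for index, sentence in enumerate(sentences, start=1):
        tu = ET.SubElement(te, f"{{{TM_NS}}}tu",
                           {"id": f"u{element_number}.{index}"})
        tu.text = sentence
    return te

para = ET.fromstring("<para>Namespace is very simple. It is easy to use.</para>")
te = add_text_memory(para, 1)
print(ET.tostring(para, encoding="unicode"))
```

Because the added elements live in their own namespace, a consumer that ignores the tm namespace still sees the original paragraph structure.
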
23
xml:tm Text Memory
  • Author memory
      • Maintain memory of source text
      • Authoring statistics
      • Authoring tool input
  • Translation memory
      • Automatic alignment
      • Maintain perfect link of source and target text
      • Reduce translation costs

24
xml:tm DOM differencing
  Source Document    Updated Source Document
  tu id=1            tu id=1
  tu id=2            (deleted)
  tu id=3            tu id=3
  tu id=4            tu id=4
  tu id=5            tu id=7, origid=5 (modified)
  tu id=6            tu id=6
                     tu id=8 (new)

25
xml:tm Author Memory
  • Namespace aware differencing
  • Identify changes from the previous version
  • Unique text unit identifiers are maintained
  • Modification history
  • Text units can be loaded into a database
  • Authoring environment integration
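
The identifier handling during differencing can be sketched as below. This is not the actual xml:tm differencing algorithm, which works on the DOM. It is a simplified text-level illustration in which unchanged text keeps its id, a close variant gets a new id plus an origid, and everything else is new. The function name, the similarity threshold and the sample sentences are assumptions.

```python
from difflib import SequenceMatcher

def diff_text_units(old_units, new_texts, next_id, threshold=0.6):
    """old_units: {id: text}; new_texts: texts of the updated document, in order.

    Returns (units, deleted_ids), where units is a list of dicts in document order.
    """
    remaining = dict(old_units)           # old ids not yet accounted for
    result = [None] * len(new_texts)

    # Pass 1: unchanged text keeps its identifier.
    for pos, text in enumerate(new_texts):
        for old_id, old_text in remaining.items():
            if old_text == text:
                result[pos] = {"id": old_id, "status": "unchanged"}
                del remaining[old_id]
                break

    # Pass 2: other text gets a fresh id; a close match records where it came from.
    for pos, text in enumerate(new_texts):
        if result[pos] is not None:
            continue
        scored = [(SequenceMatcher(None, text, old_text).ratio(), old_id)
                  for old_id, old_text in remaining.items()]
        score, old_id = max(scored, default=(0.0, None))
        if old_id is not None and score >= threshold:
            result[pos] = {"id": next_id, "status": "modified", "origid": old_id}
            del remaining[old_id]
        else:
            result[pos] = {"id": next_id, "status": "new"}
        next_id += 1
    return result, sorted(remaining)

old = {1: "Remove the cover.", 2: "Check the fan.", 3: "Release the catch.",
       4: "Lift out the CPU.", 5: "Insert the new CPU into the socket.",
       6: "Replace the cover."}
updated = ["Remove the cover.", "Release the catch.", "Lift out the CPU.",
           "Insert the new CPU carefully into the socket.", "Replace the cover.",
           "Switch the power on."]

units, deleted = diff_text_units(old, updated, next_id=7)
for unit in units:
    print(unit)
print("deleted:", deleted)
```

With this sample data the result mirrors the differencing slide: ids 1, 3, 4 and 6 survive, id 2 is deleted, the modified unit becomes id 7 with origid 5, and the added unit becomes id 8.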

26
xml:tm Translation Memory
  • The tm namespace can be used to create XLIFF
    files
  • Automatic alignment of source and target
    languages
  • Allows for more focused translation matching
  • Perfect matching
  • Leveraged matching from document - identical text
  • Leveraged matching from database
  • Modified text unit matching
  • Linguistically enhanced fuzzy matching
  • Non translatable text unit identification

27
xml:tm translation
  Source Document    XLIFF Document       Translated Document
  tu id=1            trans-unit id=1      tu id=1
  tu id=2            trans-unit id=2      tu id=2
  tu id=3            trans-unit id=3      tu id=3
  tu id=4            trans-unit id=4      tu id=4
  tu id=5            trans-unit id=5      tu id=5
  tu id=6            trans-unit id=6      tu id=6
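
The one-to-one mapping between tu elements and XLIFF trans-unit elements can be sketched as follows. This is a minimal illustration assuming XLIFF 1.2 and a source document that already carries tm:tu elements with ids; the `to_xliff` helper and the file name are hypothetical, and inline markup inside a tu is flattened to plain text for brevity.

```python
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"                           # as in the slide example
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"  # XLIFF 1.2 namespace
ET.register_namespace("", XLIFF_NS)

SOURCE = """<document xmlns:tm="urn:xml-Intl-tm"><tm:tm><section><para>
<tm:te id="e1"><tm:tu id="u1.1">Namespace is very simple.</tm:tu>
<tm:tu id="u1.2">It is easy to use.</tm:tu></tm:te>
</para></section></tm:tm></document>"""

def to_xliff(document, original, source_lang):
    """Create one XLIFF trans-unit per tm:tu, reusing the tu id."""
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(xliff, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for tu in document.iter(f"{{{TM_NS}}}tu"):
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": tu.get("id")})
        source = ET.SubElement(unit, f"{{{XLIFF_NS}}}source")
        source.text = "".join(tu.itertext())  # simplification: drops inline markup
    return xliff

xliff = to_xliff(ET.fromstring(SOURCE), original="manual.xml", source_lang="en")
print(ET.tostring(xliff, encoding="unicode"))
```

Because each trans-unit carries the id of its tu, the translated text can be merged back into the right place without a separate alignment step.
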
28
xml:tm translated document
[Diagram: the translated document view beside the translated tm namespace view. The structure is the same as in the source document (doc, title, section, para with te and tu units), with the text and sentence labels shown in the target language as "tekst" and "zdanie".]
29
xml:tm perfect alignment
  Source Document    Translated Document
  tu id=1            tu id=1
  tu id=2            tu id=2
  tu id=3            tu id=3
  tu id=4            tu id=4
  tu id=5            tu id=5
  tu id=6            tu id=6
  Perfect alignment: each source tu is linked to the translated tu with the same id.

30
xml:tm perfect matching
  Perfect Matching
  Updated Source Document    Matched Target Document    Status
  tu id=1                    tu id=1
  tu id=2 (deleted)
  tu id=3                    tu id=3
  tu id=4                    tu id=4
  tu id=7 (modified)         tu id=7                    requires translation
  tu id=6                    tu id=6
  tu id=8 (new)              tu id=8                    requires translation
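
Perfect matching can be sketched as a lookup by identifier: if a text unit keeps both its id and its text from the previous version, the previous translation is reused as is. The function name and the data structures below are assumptions for illustration, and the target strings are placeholders rather than real translations.

```python
def perfect_match(updated_units, previous_source, previous_target):
    """updated_units: list of (id, text); previous_*: {id: text}.

    Returns a list of (id, target_text_or_None, status) in document order.
    """
    matched = []
    for unit_id, text in updated_units:
        unchanged = previous_source.get(unit_id) == text
        if unchanged and unit_id in previous_target:
            matched.append((unit_id, previous_target[unit_id], "perfect match"))
        else:
            matched.append((unit_id, None, "requires translation"))
    return matched

previous_source = {1: "Remove the cover.", 2: "Check the fan.",
                   3: "Release the catch.", 4: "Lift out the CPU.",
                   5: "Insert the new CPU into the socket.",
                   6: "Replace the cover."}
# Placeholder targets standing in for the previous translation.
previous_target = {i: f"[target text {i}]" for i in previous_source}

updated_units = [(1, "Remove the cover."), (3, "Release the catch."),
                 (4, "Lift out the CPU."),
                 (7, "Insert the new CPU carefully into the socket."),
                 (6, "Replace the cover."), (8, "Switch the power on.")]

matches = perfect_match(updated_units, previous_source, previous_target)
for row in matches:
    print(row)
```

Only the modified unit (id 7) and the new unit (id 8) are left for translation, as in the slide.
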
31
xml:tm contextual memory
  Source Document    Translated Document
  tu id=1            tu id=1
  tu id=2            tu id=2
  tu id=3            tu id=3
  tu id=4            tu id=4
  tu id=5            tu id=5
  tu id=6            tu id=6
  Perfect alignment: each source tu is linked to the translated tu with the same id.

32
xml:tm leveraged DB memory
  Source Document    Translated Document
  tu id=1            tu id=1
  tu id=2            tu id=2
  tu id=3            tu id=3
  tu id=4            tu id=4
  tu id=5            tu id=5
  tu id=6            tu id=6
  Perfect alignment. The diagram also shows a database (DB).

33
xml:tm in-document leveraged matching
  Perfect Matching
  Updated Source Document        Matched Target Document    Status
  tu id=1                        tu id=1
  tu id=2 (deleted)
  tu id=3                        tu id=3
  tu id=4                        tu id=4
  tu id=7 (modified)             tu id=7                    requires translation
  tu id=6                        tu id=6
  tu id=8 (new, same as id=3)    tu id=8                    leveraged match, requires proofing

34
xml:tm in-document fuzzy matching
  Perfect Matching
  Updated Source Document        Matched Target Document    Status
  tu id=1                        tu id=1
  tu id=2 (deleted)
  tu id=3                        tu id=3
  tu id=4                        tu id=4
  tu id=7 (modified, origid=5)   tu id=7                    fuzzy match, requires translation
  tu id=6                        tu id=6
  tu id=8 (new, same text)       tu id=8                    leveraged match, requires proofing

35
xml:tm DB leveraged matching
  Perfect Matching
  Updated Source Document        Matched Target Document    Status
  tu id=1                        tu id=1
  tu id=2 (deleted)
  tu id=3                        tu id=3
  tu id=4                        tu id=4
  tu id=7 (modified, origid=5)   tu id=7                    fuzzy match, requires translation
  tu id=6                        tu id=6
  tu id=8 (new, same text)       tu id=8                    doc leveraged match, requires proofing
  tu id=9                        tu id=9                    DB leveraged match (from DB), requires proofing

36
xml:tm non-translatable text
  Perfect Matching
  Updated Source Document        Matched Target Document    Status
  tu id=1                        tu id=1
  tu id=2 (non-translatable)     tu id=2                    requires no translation
  tu id=3                        tu id=3
  tu id=4                        tu id=4
  tu id=7                        tu id=7                    fuzzy match, requires translation
  tu id=6                        tu id=6
  tu id=8 (new, same text)       tu id=8                    doc leveraged match, requires proofing
  tu id=9                        tu id=9                    DB leveraged match (from DB), requires proofing

37
Traditional Translation Scenario
[Workflow diagram with two areas, Publishing and Translation. Labels: source text, extracted text, tm process, prepared text, translate, QA, translated text.]

38
xml:tm Translation Scenario
[Workflow diagram. Labels: Publishing, xml source text, extracted text, leveraged matching, tm process, prepared text (automatic process), web interface, translator, translate, QA, xml target text (automatic process).]

39
xml:tm matching
  • Perfect Matching driven by Author Memory
  • Leveraged Matching
      • 100% same text
      • In-document Leveraged Matching
      • Database Leveraged Matching
  • Fuzzy Matching
      • Modified Matching
      • Linguistically aware Fuzzy Matching
  • Non-translatable element identification
      • Alphanumeric
      • Numeric
      • Measurements
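
The matching types above can be read as a cascade that is tried in order for each text unit. The sketch below is one possible arrangement, not the xml:tm implementation: the order of the checks, the regular expressions for non-translatable text, the list of measurement units, the 0.75 threshold and the use of `difflib` for fuzzy scoring are all assumptions, and the target strings are placeholders.

```python
import re
from difflib import SequenceMatcher

# Assumed patterns for text that needs no translation.
NON_TRANSLATABLE = [
    re.compile(r"(?=.*\d)[\d\s.,]+"),                    # numeric, e.g. "1,024"
    re.compile(r"(?=.*\d)[A-Za-z0-9-]+"),                # alphanumeric, e.g. "A1098"
    re.compile(r"\d+(?:[.,]\d+)?\s?(?:mm|cm|m|kg|g)"),   # measurements, e.g. "12 mm"
]

def classify(unit_id, text, previous, doc_memory, db_memory, threshold=0.75):
    """previous: {id: (source, target)}; *_memory: {source_text: target_text}."""
    if any(p.fullmatch(text.strip()) for p in NON_TRANSLATABLE):
        return "non translatable", None
    if unit_id in previous and previous[unit_id][0] == text:
        return "perfect match", previous[unit_id][1]
    if text in doc_memory:
        return "doc leveraged match", doc_memory[text]
    if text in db_memory:
        return "DB leveraged match", db_memory[text]
    candidates = {**db_memory, **doc_memory}
    best = max(candidates,
               key=lambda s: SequenceMatcher(None, text, s).ratio(),
               default=None)
    if best is not None and SequenceMatcher(None, text, best).ratio() >= threshold:
        return "fuzzy match", candidates[best]
    return "requires translation", None

previous = {1: ("Remove the cover.", "[target 1]"), 2: ("A1098", "A1098"),
            3: ("Release the catch.", "[target 3]"),
            4: ("Lift out the CPU.", "[target 4]"),
            5: ("Insert the new CPU into the socket.", "[target 5]"),
            6: ("Replace the cover.", "[target 6]")}
doc_memory = {source: target for source, target in previous.values()}
db_memory = {"Switch the power on.": "[target from DB]"}

updated = [(1, "Remove the cover."), (2, "A1098"), (3, "Release the catch."),
           (4, "Lift out the CPU."),
           (7, "Insert the new CPU carefully into the socket."),
           (6, "Replace the cover."), (8, "Release the catch."),
           (9, "Switch the power on."), (10, "Dispose of the old thermal paste.")]

statuses = [classify(i, t, previous, doc_memory, db_memory)[0] for i, t in updated]
for (unit_id, _), status in zip(updated, statuses):
    print(unit_id, status)
```

Checking for non-translatable text first is a design choice made here because it is the cheapest test. A unit that is only leveraged or fuzzy matched would still go to a translator for proofing or editing.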

40
xml:tm benefits
  • Enterprise level scalability
  • Totally integrated within the XML framework
  • Source text is automatically extracted and
    matched
  • Word counts are controlled by the customer
  • Text can be presented for translation via the web
  • Online composition
  • The most up to date translation is held by the
    customer
  • Data is merged automatically at end of
    translation cycle
  • All memory operations are totally automated
  • Can be used transparently for relay translations
  • Much cheaper to implement and run
  • More accurate, better matching

41
xml:tm summary
  • Can be used to build consistent authoring systems
  • Can be used to produce automatic authoring
    statistics
  • Translation Memory generation and alignment is
    totally automatic
  • Memory is held within the documents themselves
  • Extraction and merging for translation are
    automatic
  • The system provides much more efficient matching
    mechanisms
  • Structure of the XML document is protected during
    translation

42
xml:tm
  • Fully specified XML based standard
  • http://www.xml-intl.com/docs/specification/xml-tm.html
  • Maintained by xml-intl.com
  • http://www.xml-intl.com/dtd/tm.dtd
  • http://www.xml-intl.com/dtd/tm.xsd
  • Detailed article on www.xml.com
  • Offered for consideration as a LISA standard

43
  • Any questions?