Title: Coping with Babel
1 Coping with Babel
2 Designing for Localization
- L10N: adapting material for target markets
- Document design can seriously impact the costs of translation and localization.
- Other language rules can differ significantly from English.
- There are clear dos and don'ts.
- The overriding principle is good XML practice.
3 Entity references
- Do not use entity references for word substitution
  <para>Use a &tool; to release the catch.</para>
- Cause problems for inflected languages
- Cause problems for parsing/translation tools
- Use boilerplate text instead
4 Translatable attributes
- Avoid using translatable attributes
  <para>Use a <tool id="a1098" name="claw hammer"/> to release the CPU retention catch.</para>
- Cause problems for inflected languages
- Cause extra burden for translators
- More to go wrong
5 CDATA sections
- Avoid using CDATA sections that may contain translatable text
  <tmpl><![CDATA[<p>Please refer to the <em>index page</em> for further information</p>]]></tmpl>
- Lose syntactical control
- Segmentation problems
- How are translation tools to cope?
6 Processing instructions
- Avoid processing instructions in translatable text
  <para>Use a <?tool name="claw hammer"?> to release the CPU retention catch.</para>
- Syntactically weak
- Confuse translation memory operations
7 Infinite Naming Schemes
- Avoid the use of infinite naming schemes
  <resources xml:lang="en">
    <err001>Cannot open file 1.</err001>
    <hint001>Hint: does file 1 exist.</hint001>
    <err002>Incorrect value.</err002>
    <hint002>Hint: Must be between 1 and 2.</hint002>
    <err003>Connection timeout.</err003>
    ...
  </resources>
- No clear element definitions
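One way to avoid open-ended names such as err001, hint001 and so on is to keep a fixed set of element names and move the running number into an attribute. The sketch below is an illustration only, not part of the original slides: the `message` element and its `type` and `id` attributes are hypothetical names, and the code relies on Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

SOURCE = """<resources xml:lang="en">
  <err001>Cannot open file 1.</err001>
  <hint001>Hint: does file 1 exist.</hint001>
  <err002>Incorrect value.</err002>
</resources>"""

# Element names such as err001 or hint002: a kind followed by a running number.
NAME_PATTERN = re.compile(r"^([a-z]+)(\d+)$")


def to_finite_scheme(xml_text):
    """Rewrite <err001>..</err001> style elements as <message type="err" id="001">."""
    root = ET.fromstring(xml_text)
    for child in root:
        match = NAME_PATTERN.match(child.tag)
        if match is None:
            continue  # leave elements that do not follow the numbered pattern
        kind, number = match.groups()
        child.tag = "message"  # hypothetical fixed element name
        child.set("type", kind)
        child.set("id", number)
    return root


root = to_finite_scheme(SOURCE)
for message in root:
    print(message.get("type"), message.get("id"), message.text)
```

With a fixed element name, a DTD or schema can declare `message` once, however many messages the resource file contains.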
8 Typographical elements
- Avoid the use of "typographical" elements
  <para><b>Do not use</b> <br/> type elements.</para>
- Bad XML practice.
- Causes problems for translators.
- Target language text may be in the opposite order.
9 Do not break sentences
- Never break a linguistically complete text unit over more than one non-inline element
  <para>
    <line>This text should not be</line>
    <line>broken this way: the translated text may well be in a different order.</line>
  </para>
10 XML Translation Standards
- LISA - Localization Industry Standards Association http://www.lisa.org
- OASIS - Organization for the Advancement of Structured Information Standards http://www.oasis-open.org
- W3C - World Wide Web Consortium http://www.w3c.org
- OLIF Consortium http://www.olif.net
11 LISA Standards
- TMX - Translation Memory Exchange format http://www.lisa.org/tmx
- TBX - Termbase Exchange format http://www.lisa.org/tbx
- SRX - Segmentation Rules Exchange format http://www.lisa.org/srx
- GMX - GILT Metrics Exchange format http://www.lisa.org/gmx
12 OASIS L10N Standards
- XLIFF - XML Localization Interchange File Format http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff
- TransWS - Translation Web Services http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=trans-ws
13 W3C and OLIF
- W3C to start on Localization Directives standard.
- OLIF - Open Lexicon Interchange Format http://www.olif.net
14 xmltm
- XML Text Memory
- A radical new approach to translating XML documents
15 Computational Linguistic Methodologies
- Machine Translation
- Translation Memory
- Hybrid Linguistic Inferencing Engines
- Terminology
16 Translation memory
- Advent in early 1980s
- Intermediate format
- Alignment
- Storage
- Leveraged memory
- Fuzzy matching: statistical
- Advantages: cost reduction, consistency
- Drawbacks: proofreading, managing memories
- No significant advances in technology
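Statistical fuzzy matching against a translation memory can be sketched as a similarity score between a new segment and each stored source segment. The example below is a minimal illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever measure a real translation memory tool applies, and the memory contents and the 0.75 threshold are invented for the example.

```python
from difflib import SequenceMatcher

# Hypothetical memory of previously translated segments (English -> French).
MEMORY = {
    "Open the file.": "Ouvrez le fichier.",
    "Close the file.": "Fermez le fichier.",
}


def best_match(segment, memory, threshold=0.75):
    """Return (score, source, target) for the closest stored segment, or None."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None


exact = best_match("Open the file.", MEMORY)        # identical text, score 1.0
fuzzy = best_match("Open the files.", MEMORY)       # close to a stored segment
miss = best_match("Connection timeout.", MEMORY)    # nothing similar enough
print(exact, fuzzy, miss)
```

A fuzzy hit still has to be proofread, which is the drawback the slide lists: the score says the strings are similar, not that the stored translation is still correct.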
17 XML namespace
- Major new feature of XML compared to SGML
- Allows the mapping of different ontological entities onto the same representation
- Allows different ways to look at the same data
- Namespaces can be made transparent
18 xmltm namespace
- Text Memory namespace
- Can be mapped onto any XML document
- Vertical view of document in terms of text segments
- Can be totally transparent
19 xmltm namespace
Example of the use of namespace in an XML document
<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>
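The two views of such a document can be read with any namespace-aware parser. The following sketch assumes Python's standard `xml.etree.ElementTree` and a namespace URI reconstructed from the slide example; the function names are hypothetical. One function lists the text units (the tm view) and the other returns paragraph text as if the tm elements were not there (the original document view).

```python
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"  # assumed: URI as reconstructed from the slide example

DOC = """<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>"""


def text_units(xml_text):
    """tm namespace view: the text of every tm:tu element in document order."""
    root = ET.fromstring(xml_text)
    return [(tu.text or "").strip() for tu in root.iter("{%s}tu" % TM_NS)]


def paragraph_text(xml_text):
    """Original document view: text of each para, ignoring the tm elements."""
    root = ET.fromstring(xml_text)
    return [" ".join("".join(para.itertext()).split()) for para in root.iter("para")]


print(text_units(DOC))
print(paragraph_text(DOC))
```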
20 xmltm namespace
[Diagram: one document shown as two parallel trees. Original document view: doc, title, section, para, text. tm namespace view: tm at the root, a te for each block of text, and a tu for each sentence within it.]
21 xmltm namespace
[Diagram: detail of a single text node. Original document view: text. tm namespace view: one te holding two tu elements, one per sentence.]
22 xmltm namespace
original document view
<para>
  Namespace is very simple. It is easy to use.
</para>
tm namespace view
<para>
  <tm:te id="e1">
    <tm:tu id="u1.1">Namespace is very simple.</tm:tu>
    <tm:tu id="u1.2">It is easy to use.</tm:tu>
  </tm:te>
</para>
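Going from the original view to the tm view amounts to segmenting the text and wrapping each segment. The sketch below is a simplified illustration, not the xmltm implementation: it assumes a paragraph that contains only text (no inline elements), a naive sentence splitter, the namespace URI used in the slide example, and a hypothetical helper name `add_text_memory`.

```python
import re
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"  # assumed namespace URI
ET.register_namespace("tm", TM_NS)

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def add_text_memory(para, element_number):
    """Wrap the text of a simple <para> in one tm:te with one tm:tu per sentence."""
    sentences = [s for s in SENTENCE_END.split((para.text or "").strip()) if s]
    para.text = None
    te = ET.SubElement(para, "{%s}te" % TM_NS, {"id": "e%d" % element_number})
    for number, sentence in enumerate(sentences, start=1):
        tu = ET.SubElement(te, "{%s}tu" % TM_NS,
                           {"id": "u%d.%d" % (element_number, number)})
        tu.text = sentence
    return para


para = ET.fromstring("<para>Namespace is very simple. It is easy to use.</para>")
add_text_memory(para, 1)
print(ET.tostring(para, encoding="unicode"))
```

Real segmentation has to handle abbreviations, inline markup and language-specific rules, which is the problem SRX addresses.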
23 xmltm Text Memory
- Author memory
  - Maintain memory of source text
  - Authoring statistics
  - Authoring tool input
- Translation memory
  - Automatic alignment
  - Maintain perfect link of source and target text
  - Reduce translation costs
24 xmltm DOM differencing
Source Document    Updated Source Document
tu id=1            tu id=1
tu id=2            (deleted)
tu id=3            tu id=3
tu id=4            tu id=4
tu id=5            tu id=7, origid=5 (modified)
tu id=6            tu id=6
                   tu id=8 (new)
25 xmltm Author Memory
- Namespace aware differencing
- Identify changes from the previous version
- Unique text unit identifiers are maintained
- Modification history
- Text units can be loaded into a database
- Authoring environment integration
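The identifier bookkeeping behind author memory can be sketched as a comparison of two versions of the text unit list. This is a simplified stand-in for namespace-aware DOM differencing: it compares flat lists of text with `difflib.SequenceMatcher` rather than document trees, and the function name, the status labels and the sample sentences are all assumptions made for the illustration.

```python
from difflib import SequenceMatcher


def diff_text_units(old_units, new_texts):
    """Compare (id, text) pairs of the previous version with the updated texts.

    Returns (updated, deleted): updated is a list of dicts with id, text,
    status and, for modified units, origid; deleted lists ids that disappeared.
    """
    old_texts = [text for _, text in old_units]
    next_id = max((uid for uid, _ in old_units), default=0) + 1
    updated, deleted = [], []
    matcher = SequenceMatcher(None, old_texts, new_texts, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Unchanged text keeps its identifier.
            for uid, text in old_units[i1:i2]:
                updated.append({"id": uid, "text": text, "status": "unchanged"})
            continue
        old_part, new_part = old_units[i1:i2], new_texts[j1:j2]
        for index, text in enumerate(new_part):
            unit = {"id": next_id, "text": text, "status": "new"}
            if index < len(old_part):  # rewritten in place: remember the origin
                unit["status"] = "modified"
                unit["origid"] = old_part[index][0]
            updated.append(unit)
            next_id += 1
        # Old units with no counterpart in the new version are deleted.
        deleted.extend(uid for uid, _ in old_part[len(new_part):])
    return updated, deleted


OLD = [(1, "Open the cover."), (2, "Remove the screws."), (3, "Lift the fan."),
       (4, "Release the catch."), (5, "Remove the CPU."), (6, "Close the cover.")]
NEW = ["Open the cover.", "Lift the fan.", "Release the catch.",
       "Remove the CPU carefully.", "Close the cover.", "Dispose of the packaging."]

updated, deleted = diff_text_units(OLD, NEW)
for unit in updated:
    print(unit)
print("deleted:", deleted)
```

On this sample the result follows the pattern of the differencing slide: unit 2 is deleted, unit 5 comes back as unit 7 with origid 5, and the added sentence becomes unit 8.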
26 xmltm Translation Memory
- The tm namespace can be used to create XLIFF files
- Automatic alignment of source and target languages
- Allows for more focused translation matching
- Perfect matching
- Leveraged matching from document - identical text
- Leveraged matching from database
- Modified text unit matching
- Linguistically enhanced fuzzy matching
- Non translatable text unit identification
27 xmltm translation
Source Document    XLIFF Document      Translated Document
tu id=1            trans-unit id=1     tu id=1
tu id=2            trans-unit id=2     tu id=2
tu id=3            trans-unit id=3     tu id=3
tu id=4            trans-unit id=4     tu id=4
tu id=5            trans-unit id=5     tu id=5
tu id=6            trans-unit id=6     tu id=6
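The one-to-one mapping from text units to XLIFF trans-units can be sketched as follows. This is an illustration under assumptions, not the xmltm tooling: it targets the XLIFF 1.2 element layout (xliff, file, body, trans-unit, source), which may differ from the XLIFF version current when the slides were written, and the file name, language codes and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)  # serialize XLIFF as the default namespace


def build_xliff(text_units, source_lang="en", target_lang="pl",
                original="manual.xml"):
    """Create one trans-unit per (id, text) pair, reusing the tu id."""
    xliff = ET.Element("{%s}xliff" % XLIFF_NS, {"version": "1.2"})
    file_el = ET.SubElement(xliff, "{%s}file" % XLIFF_NS, {
        "original": original,  # hypothetical source file name
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, "{%s}body" % XLIFF_NS)
    for uid, text in text_units:
        unit = ET.SubElement(body, "{%s}trans-unit" % XLIFF_NS, {"id": str(uid)})
        ET.SubElement(unit, "{%s}source" % XLIFF_NS).text = text
    return xliff


xliff = build_xliff([(1, "Namespace is very simple."), (2, "It is easy to use.")])
print(ET.tostring(xliff, encoding="unicode"))
```

Because each trans-unit carries the id of its text unit, the translated text can be merged back into the document by id, with no separate alignment step.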
28 xmltm translated document
[Diagram: the translated document shown as the same two parallel trees. Translated document view: doc, title, section, para, with the text now in the target language (Polish on the slide: tekst, "text"). Translated tm namespace view: tm, te and tu unchanged, with each tu holding a translated sentence (zdanie, "sentence").]
29 xmltm perfect alignment
[Diagram: tu id=1 to id=6 in the Source Document line up one-to-one with tu id=1 to id=6 in the Translated Document (perfect alignment).]
30 xmltm perfect matching
Updated Source Document    Matched Target Document    Status
tu id=1                    tu id=1                    perfect match
(tu id=2 deleted)
tu id=3                    tu id=3                    perfect match
tu id=4                    tu id=4                    perfect match
tu id=7 (modified)         tu id=7                    requires translation
tu id=6                    tu id=6                    perfect match
tu id=8 (new)              tu id=8                    requires translation
31 xmltm contextual memory
[Diagram: tu id=1 to id=6 in the Source Document line up one-to-one with tu id=1 to id=6 in the Translated Document (perfect alignment).]
32 xmltm leveraged DB memory
[Diagram: tu id=1 to id=6 in the Source Document line up one-to-one with tu id=1 to id=6 in the Translated Document (perfect alignment), alongside a database (DB) of text units.]
33 xmltm in-document leveraged matching
Updated Source Document        Matched Target Document    Status
tu id=1                        tu id=1                    perfect match
(tu id=2 deleted)
tu id=3                        tu id=3                    perfect match
tu id=4                        tu id=4                    perfect match
tu id=7 (modified)             tu id=7                    requires translation
tu id=6                        tu id=6                    perfect match
tu id=8 (new, same as id=3)    tu id=8                    leveraged match, requires proofing
34 xmltm in-document fuzzy matching
Updated Source Document         Matched Target Document    Status
tu id=1                         tu id=1                    perfect match
(tu id=2 deleted)
tu id=3                         tu id=3                    perfect match
tu id=4                         tu id=4                    perfect match
tu id=7 (modified, origid=5)    tu id=7                    fuzzy match, requires translation
tu id=6                         tu id=6                    perfect match
tu id=8 (new, same text)        tu id=8                    leveraged match, requires proofing
35 xmltm db leveraged matching
Updated Source Document         Matched Target Document    Status
tu id=1                         tu id=1                    perfect match
(tu id=2 deleted)
tu id=3                         tu id=3                    perfect match
tu id=4                         tu id=4                    perfect match
tu id=7 (modified, origid=5)    tu id=7                    fuzzy match, requires translation
tu id=6                         tu id=6                    perfect match
tu id=8 (new, same text)        tu id=8                    doc leveraged match, requires proofing
tu id=9                         tu id=9                    DB leveraged match, requires proofing
DB: database that supplies the match for tu id=9
36 xmltm non translatable text
Updated Source Document         Matched Target Document    Status
tu id=1                         tu id=1                    perfect match
tu id=2 (non translatable)      tu id=2                    non translatable, requires no translation
tu id=3                         tu id=3                    perfect match
tu id=4                         tu id=4                    perfect match
tu id=7                         tu id=7                    fuzzy match, requires translation
tu id=6                         tu id=6                    perfect match
tu id=8 (new, same text)        tu id=8                    doc leveraged match, requires proofing
tu id=9                         tu id=9                    DB leveraged match, requires proofing
DB: database that supplies the match for tu id=9
37 Traditional Translation Scenario
[Workflow diagram spanning Publishing and Translation, with the stages: source text, extracted text, tm process, prepared text, translate, QA, translated text.]
38 xmltm Translation Scenario
[Workflow diagram on the Publishing side, with the stages: xml source text, extracted text, leveraged matching, tm process, prepared text, translate and QA by the translator through a web interface, xml target text. Two steps are labeled Automatic Process.]
39 xmltm matching
- Perfect Matching driven by Author Memory
- Leveraged Matching
  - 100% same text
  - In document Leveraged Matching
  - Database Leveraged Matching
- Fuzzy Matching
  - Modified Matching
  - Linguistically aware Fuzzy Matching
- Non translatable element identification
  - Alphanumeric
  - Numeric
  - Measurements
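The matching types listed above can be read as a cascade that tries the cheapest and most reliable match first. The sketch below is one possible ordering under stated assumptions, not the xmltm implementation: the regular expressions for non-translatable text, the 0.7 threshold, the use of `difflib.SequenceMatcher` for the modified-text comparison, the data layout and the placeholder "PL:" targets are all invented for the illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rules for text that needs no translation.
NUMERIC = re.compile(r"^[\d\s.,:/+-]+$")
MEASUREMENT = re.compile(r"^\d+(?:[.,]\d+)?\s?(?:mm|cm|m|kg|g|V|A|Hz)$")
ALPHANUMERIC = re.compile(r"^(?=.*\d)[A-Z0-9-]+$")


def is_non_translatable(text):
    text = text.strip()
    return any(p.match(text) for p in (NUMERIC, MEASUREMENT, ALPHANUMERIC))


def classify(unit, previous, database, threshold=0.7):
    """Return (match type, proposed target) for one updated text unit.

    unit: dict with id, text and optional origid.
    previous: id -> (source, target) from the previously translated version.
    database: source text -> target text from other documents.
    """
    text = unit["text"]
    if is_non_translatable(text):
        return "non translatable", text
    if unit["id"] in previous and previous[unit["id"]][0] == text:
        return "perfect match", previous[unit["id"]][1]
    in_document = {source: target for source, target in previous.values()}
    if text in in_document:
        return "in-document leveraged match", in_document[text]
    if text in database:
        return "database leveraged match", database[text]
    origid = unit.get("origid")
    if origid in previous:
        old_source, old_target = previous[origid]
        if SequenceMatcher(None, old_source, text).ratio() >= threshold:
            return "fuzzy match", old_target
    return "requires translation", None


PREVIOUS = {1: ("Open the cover.", "PL: Open the cover."),
            3: ("Lift the fan.", "PL: Lift the fan."),
            5: ("Remove the CPU.", "PL: Remove the CPU.")}
DATABASE = {"Close the cover.": "PL: Close the cover."}

UNITS = [{"id": 1, "text": "Open the cover."},
         {"id": 8, "text": "Lift the fan."},
         {"id": 7, "text": "Remove the CPU carefully.", "origid": 5},
         {"id": 9, "text": "Close the cover."},
         {"id": 10, "text": "230 V"},
         {"id": 11, "text": "Dispose of the packaging."}]

results = [classify(unit, PREVIOUS, DATABASE)[0] for unit in UNITS]
print(results)
```

Only the perfect match is tied to an unchanged identifier in an unchanged position; every other result is a proposal that still needs proofing or translation.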
40 xmltm benefits
- Enterprise level scalability
- Totally integrated within the XML framework
- Source text is automatically extracted and matched
- Word counts are controlled by the customer
- Text can be presented for translation via the web
- Online composition
- The most up to date translation is held by the customer
- Data is merged automatically at end of translation cycle
- All memory operations are totally automated
- Can be used transparently for relay translations
- Much cheaper to implement and run
- More accurate, better matching
41 xmltm summary
- Can be used to build consistent authoring systems
- Can be used to produce automatic authoring statistics
- Translation Memory generation and alignment is totally automatic
- Memory is held within the documents themselves
- Extraction and merging for translation are automatic
- The system provides much more efficient matching mechanisms
- Structure of the XML document is protected during translation
42 xmltm
- Fully specified XML based standard
  - http://www.xml-intl.com/docs/specification/xml-tm.html
- Maintained by xml-intl.com
  - http://www.xml-intl.com/dtd/tm.dtd
  - http://www.xml-intl.com/dtd/tm.xsd
- Detailed article on www.xml.com
- Offered for consideration as a LISA standard