Title: Building the Federal Multilingual Infrastructure in Unicode: Foreign Language Dictionary Tools
1. Building the Federal Multilingual Infrastructure in Unicode: Foreign Language Dictionary Tools
2. Project Goals
- Unite federal foreign language analysts in communities of interest by language to increase the speed and accuracy of multilingual work
- Outgrowth of NSA legacy individual foreign language dictionary tools
- Share Next Generation tool suite across the federal government in 90 languages
3. Foreign Language Work: 1970s
- Manual tools
  - Hardcopy dictionaries (2-10 per person)
  - 3x5 card files for specialized vocabulary
  - Pen and paper only
- Work environment
  - Career analysts, revered as subject matter experts, rule the workplace.
  - College graduates hired right out of school, some with military experience, enter the job.
4. Foreign Language Challenge I: The classic sparse data problem
- Never enough vocabulary
- Never enough grammar training
- Never enough cultural knowledge
5. Foreign Language Challenge II: Why it's a sparse data problem
- Communication is usually spontaneous between 2 or more people who share a great deal of special knowledge
- Ultimate goals often not explicit
- Ambiguity reigns for outsiders
- No simple rules for filling in the blanks
6. An example: ?? ? ?? ??? ?? ? ?? ?? ?? ?
- All glossed (4 min/chr, 17 chrs): meaning obscure. "Female people go hit knock bamboo curtains secret doctor come untie decide her ask issue."
- All phrases verified (longest string match: 9): clearer, but still uncertain. "A woman goes and knocks on the bamboo curtains secret doctor to come resolve her problem."
- Check for neologism: go to FBIS recent translations, look to clarify meaning of new term "knock bamboo curtain."
- "Knock on the bamboo curtain for a secret doctor": seek out an illegal quack
- "A woman (must) go seek out an illegal quack to resolve her problem."
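The "longest string match" step on this slide can be sketched as a greedy left-to-right lookup. This is a minimal illustration, not the tool's actual algorithm: the function name, the toy lexicon, and the reading of "9" as a nine-character cap on candidate length are all assumptions.

```python
# Hypothetical toy lexicon; the real tool's data is not shown in the slides.
TOY_LEXICON = {
    "北京": "Beijing",
    "大学": "university",
    "北京大学": "Peking University",
    "大学生": "university student",
    "生": "student",
}

def longest_match_gloss(text, lexicon, max_len=9):
    """Gloss text by always taking the longest lexicon entry at each position."""
    pos = 0
    result = []
    while pos < len(text):
        # Try the longest candidate first and shrink until the lexicon has it.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in lexicon:
                result.append((chunk, lexicon[chunk]))
                pos += size
                break
        else:
            # No entry covers this character: pass it through unglossed.
            result.append((text[pos], None))
            pos += 1
    return result

print(longest_match_gloss("北京大学生", TOY_LEXICON))
# [('北京大学', 'Peking University'), ('生', 'student')]
```

The greedy choice prefers "Peking University" + "student" over "Beijing" + "university student". That is the kind of residual ambiguity that still needs an analyst's judgment after phrase matching.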
7. People say, "What's the big deal with just an on-line dictionary?"
- "I never/seldom use a dictionary!"
  - Native speaker syndrome
  - Vast majority of people must use a dictionary in a second/third language
- "Hardcopy dictionaries are better."
  - Can't do wild-card searches by hand
  - Not engineered for 10 sec. avg. response
  - Humans tire; machines do not.
8. 1991: First Generation Dictionary DB Tool
- 200,000 entries from 3x5 cards collected over 20 years
- Wild card searchable
- Cross referenced 4 ways in accordance with user requirements
- Displayed in native script
- Can cut and paste queries/responses
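As a rough illustration of "wild card searchable", here is a sketch using Python's standard `fnmatch` module over a small in-memory list. The entry layout, the sample headwords, and the function name are hypothetical; the slides do not describe the actual query syntax or storage.

```python
from fnmatch import fnmatchcase

# Hypothetical entries; field names are illustrative only.
ENTRIES = [
    {"headword": "北京", "gloss": "Beijing"},
    {"headword": "南京", "gloss": "Nanjing"},
    {"headword": "北方", "gloss": "the north"},
]

def wildcard_search(entries, pattern):
    """Return entries whose headword matches a shell-style pattern.

    '?' stands for exactly one character and '*' for any run of characters.
    """
    return [e for e in entries if fnmatchcase(e["headword"], pattern)]

# Trailing wildcard: everything starting with 北.
print([e["gloss"] for e in wildcard_search(ENTRIES, "北*")])
# Leading wildcard: any single character followed by 京.
print([e["gloss"] for e in wildcard_search(ENTRIES, "?京")])
```

A linear scan like this is fine for a demonstration but would not meet a 10-second response target on a large collection without an index.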
9. Reactions to 1st Generation Tool
- Younger analysts used it, liked it, and made great suggestions to improve it
- Senior analysts usually would not use it
10. 1995: 2nd Generation Dictionary DB Tool
- Responses faster on queries with leading wild card
- GUI customized per user input
- Candidate entry system established
- Usership doubled!
- Senior analysts start to use it
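The slides do not say how leading-wildcard queries were made faster. One common technique, shown here only as an assumption, is to keep a second sorted index of reversed headwords so that a query such as `*京` becomes a prefix lookup. The class and method names are hypothetical.

```python
import bisect

class SuffixIndex:
    """Sorted list of reversed headwords, so suffix queries become prefix scans."""

    def __init__(self, headwords):
        self._reversed = sorted(word[::-1] for word in headwords)

    def ending_with(self, suffix):
        key = suffix[::-1]
        # Binary search to the first reversed headword that could start with key.
        start = bisect.bisect_left(self._reversed, key)
        matches = []
        for rev in self._reversed[start:]:
            if not rev.startswith(key):
                break  # Sorted order: no later item can match.
            matches.append(rev[::-1])
        return matches

index = SuffixIndex(["北京", "南京", "北方", "东京"])
print(sorted(index.ending_with("京")))
```

The binary search skips everything before the matching block, so a query reads only the matching entries. A scan without such an index has to test every headword. Simple string reversal is enough for this toy data; text with combining marks would need more careful handling.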
11. 1998: 3rd Generation Dictionary DB Tool
- Database re-encoded in UTF-8
- Simultaneous simplified and traditional Chinese display enabled
- Average 1,000-3,000 candidate entries approved annually, 1998-2002
- Usership again doubled!
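The link between UTF-8 re-encoding and showing simplified and traditional forms together can be illustrated as follows. The slides do not name the legacy encodings involved; GB2312 and Big5 are used here as assumed examples, and the record layout is hypothetical.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")

# Simulated legacy data: the same word stored in two incompatible encodings.
simplified_legacy = "汉语".encode("gb2312")   # assumed legacy simplified encoding
traditional_legacy = "漢語".encode("big5")    # assumed legacy traditional encoding

# After conversion both forms can sit side by side in one record.
record = {
    "simplified": reencode_to_utf8(simplified_legacy, "gb2312").decode("utf-8"),
    "traditional": reencode_to_utf8(traditional_legacy, "big5").decode("utf-8"),
}
print(record)
```

Before conversion, the two byte strings cannot be interpreted with a single codec. After conversion, one UTF-8 database field can hold either form.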
12. Today: Wordscape, the Next Generation Dictionary Tool
- Retains all Chinese capabilities
- Expands to all language collections
- Neologism newswire research tools
- Over 90 languages represented in one Unicode DB unified under one XML schema and one suite of tools
- Under LASER ACTD funding, extending all across the federal government!
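To make "one XML schema for many languages" concrete, the sketch below parses a small made-up lexicon in which every entry carries an `xml:lang` tag. The element names (`lexicon`, `entry`, `headword`, `gloss`) are invented for illustration and are not the Wordscape schema, which the slides do not show.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real schema is not given in the slides.
SAMPLE = """<lexicon>
  <entry xml:lang="zh">
    <headword>北京</headword>
    <gloss xml:lang="en">Beijing</gloss>
  </entry>
  <entry xml:lang="hi">
    <headword>पुस्तक</headword>
    <gloss xml:lang="en">book</gloss>
  </entry>
</lexicon>"""

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def entries_by_language(xml_text):
    """Group (headword, gloss) pairs by the entry's xml:lang value."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for entry in root.iter("entry"):
        lang = entry.get(XML_LANG)
        pair = (entry.findtext("headword"), entry.findtext("gloss"))
        grouped.setdefault(lang, []).append(pair)
    return grouped

print(entries_by_language(SAMPLE))
```

Because every language shares the same element structure and encoding, the same parsing code serves each collection.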
13. Technology and Standards
- New technology being used
  - Benefits of scale from use of UTF-8, XML
- Standards adopted: leading change
  - Participating in ISO standards group Technical Committee 37 on terminology and language resources (developing standardized formats for foreign language lexical resources and data exchange)
14. When do Unicode standards fail? When Unicode standards are not standard!
- 3rd World languages less commonly taught in the United States
  - Hindi (many different script rendering implementations)
  - Mongolian (no standardized spelling, many newswire web sites employ non-standard fonts)
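One narrow facet of this problem that software can handle is that the same Devanagari letter may arrive as different code point sequences. The sketch below compares strings after Unicode normalization using the standard `unicodedata` module. It is an illustration only: it does not address font-specific encodings or spelling variation, and the helper name is hypothetical.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u0958"      # DEVANAGARI LETTER QA as a single code point
sequence = "\u0915\u093C"   # DEVANAGARI LETTER KA followed by SIGN NUKTA

print(precomposed == sequence)           # raw comparison of code points
print(same_text(precomposed, sequence))  # comparison after normalization
```

Normalizing both the stored headwords and the incoming query before lookup keeps such variants from producing false "not found" results.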
15. Language Knowledge Services Team/Resources
- John L. George, Program Manager, (301) 688-9133
- Over 20 computer scientists/techs
- Currently deploying Beta version
- Learning from testing with earlier version instantiations at FBI and NSA
- On JWICS now, SIPRNet/NIPRNet next
16. Contact Information
- John J. Kovarik
- Senior Language Technology Authority
- NSA Representative to LASER ACTD
- National Security Agency
- 9800 Savage Road
- Suite 6486 S2
- Phone (301) 688-7198
- Kovarik_at_afterlife.ncsc.mil