The Basic Language Resources Kit BLARK

Description:

Define the minimal set of language resources that is necessary to do any ... For formant synthesis: same as above, with hand-labelled formant. Hamburg, 22-11-2004 ...

1
The Basic Language Resources Kit (BLARK)

Steven Krauwer
Utrecht Institute of Linguistics UiL OTS / ELSNET

2
Overview

The BLARK Enterprise
How to arrive at it
The Dutch Language Union approach
Refining the concept
Defining a BLARK
Main beneficiaries
References
Concluding remarks

3
The BLARK Enterprise

Define the minimal set of language resources that
is necessary to do any precompetitive R&D and
professional education at all for a language (the
Basic Language Resource Kit or BLARK)
Determine for each language which components are
already available
Make a priority plan to complete the BLARK for
each language
Ensure funding to get the work done

4
What are the components of a BLARK?

Lexicons (monolingual, multilingual, ...)
Corpora (language, speech; annotated,
unannotated; mono- and multilingual; mono- and
multimodal; ...)
Tools (annotation, exploration, ...)
Modules (lemmatizers, parsers, speech
recognizers, TTS, transcribers, translation, ...)

5
What makes the BLARK Enterprise special?

The idea is to make a common generic BLARK
definition, in principle applicable to all
languages
The common definition will be based on the
experience with different languages, and will
prevent reinvention of wheels
The common definition will ensure
interoperability and interconnectivity
(especially for multilingual or cross-lingual
applications)

6
Other benefits

Experience from other languages will help in
making cost estimates
Adoption of a BLARK common to all languages may
help in persuading funders to support the
creation of the BLARK
Adoption of a common BLARK may facilitate porting
of knowledge and expertise between languages

7
Words of caution

A BLARK definition will evolve over time, as new
applications, application environments and
technologies emerge
A BLARK definition should be seen as a template
rather than a dictate, as different languages may
have different specific requirements
BLARK completion priorities may differ from
language to language (on e.g. economic, social or
political grounds)

8
How to define a BLARK and assign priorities

Methodology proposed by the Dutch Language Union
(DLU) (Binnenpoorte et al., LREC 2002)
Identify a number of typical applications
Determine for each of them which technologies
(modules) are needed to make them (-, , , )
Identify for each module which resources it
requires (-, , , )
Assign the highest priority to the resources that
support most applications
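
The last step of this methodology can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the applications, modules and resources below are hypothetical and are not taken from the DLU study, and each dependency is treated as a plain yes/no link rather than a graded one.

```python
from collections import Counter

# Hypothetical example data: neither the applications nor the mappings
# come from the DLU study; they only illustrate the counting step.
APP_TO_MODULES = {
    "dictation": ["speech recognizer", "tokenizer"],
    "information extraction": ["tokenizer", "parser", "named entity recognizer"],
    "machine translation": ["tokenizer", "parser"],
}
MODULE_TO_RESOURCES = {
    "speech recognizer": ["speech corpus", "lexicon"],
    "tokenizer": ["lexicon"],
    "parser": ["lexicon", "treebank"],
    "named entity recognizer": ["lexicon", "annotated corpus"],
}


def rank_resources(app_to_modules, module_to_resources):
    """Rank resources by the number of applications that depend on them."""
    support = Counter()
    for modules in app_to_modules.values():
        # Count each resource once per application, however many of the
        # application's modules need it.
        needed = {r for m in modules for r in module_to_resources.get(m, [])}
        support.update(needed)
    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_resources(APP_TO_MODULES, MODULE_TO_RESOURCES)
for resource, n_apps in ranking:
    print(f"{resource}: supports {n_apps} application(s)")
```

With this toy data the lexicon comes out on top because every application reaches it through at least one module. A graded version would replace the set of needed resources with weighted links.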

9
Proposed DLU priorities for NLP

treebank
robust parsers
tokenisation and named entity recognition
semantic annotations for the treebank
translation equivalents
evaluation benchmarks

10
Proposed DLU priorities for speech

automatic speech recognition
application-specific speech corpora
multi-media speech corpora
tools for transcription of speech data
speech synthesis
benchmarks for evaluation

11
Next steps by DLU

Make a survey of what exists and to what extent
it is available (0-9 availability score)
Assign priorities (not just resources but also an
infrastructure for maintenance and distribution)
Secure funding from the Dutch and Flemish
governments for a national programme
Issue calls for proposals for collaborative
resources projects (1st call closed Nov 2 2004)

12
Refining the concept

Items not really covered by the DLU team:
definition vs specification
availability
quality
quantity
standards
support
Addressed in the NEMLAR project

13
Definition / specification

It is not enough to say "a written language
corpus"; what about:
size (types, tokens)
encoding
annotation
text types
representativity
domains
i.e. we need full specs

14
Availability

DLU: 0-9 scale, very impressionistic
Our proposal: 3 dimensions
accessibility
cost
modifiability
To each we assign a penalty score (0 is best)

15
Accessibility

3 classes, with associated penalties:
(3) existing, but only company-internal
(2) existing and freely usable for precompetitive
research
(1) existing and freely usable for all R&D

16
Cost

4 cost categories
(4) price over 10 keuro
(3) price between 1 and 10 keuro
(2) price between 100 and 1000 euro
(1) less than 100 euro

17
Modifiability

3 categories:
(3) black box: you get the resources as they are,
and cannot change or even inspect their internals
(2) glass box: you can't change them, but you can
see what is inside
(1) open: resources freely manipulable

18
Comments on availability

We can now express availability as a 3-digit
score (accessibility, cost, modifiability), which
should be rather easy to assign objectively
The lowest scores are the best
If the accessibility score is 3, the other scores
don't mean very much
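
As an illustration, the 3-digit availability score could be represented as follows. This is a minimal sketch, not part of the NEMLAR deliverables: the class and function names are invented here, the value ranges follow the classes listed on the three preceding slides, and the handling of prices that fall exactly on a category boundary is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Availability:
    """Penalty scores for one resource; lower is better."""
    accessibility: int  # 1 (all R&D) .. 3 (company-internal only)
    cost: int           # 1 (under 100 euro) .. 4 (over 10 keuro)
    modifiability: int  # 1 (open) .. 3 (black box)

    def __post_init__(self):
        # Upper limits taken from the classes on the slides.
        limits = {"accessibility": 3, "cost": 4, "modifiability": 3}
        for name, upper in limits.items():
            value = getattr(self, name)
            if not 1 <= value <= upper:
                raise ValueError(f"{name} must be in 1..{upper}, got {value}")

    def code(self) -> str:
        """The 3-digit score: accessibility, cost, modifiability."""
        return f"{self.accessibility}{self.cost}{self.modifiability}"

    def other_scores_meaningful(self) -> bool:
        # With accessibility 3 the resource is company-internal, so cost
        # and modifiability say little to an outside user.
        return self.accessibility < 3


def cost_penalty(price_eur: float) -> int:
    """Map a price in euro to a cost category.

    Assumption: a price exactly on a boundary goes to the lower-penalty
    side of the upper bound (1000 euro is category 2, 10 000 is 3).
    """
    if price_eur < 100:
        return 1
    if price_eur <= 1_000:
        return 2
    if price_eur <= 10_000:
        return 3
    return 4


free_open_lexicon = Availability(1, cost_penalty(0), 1)
print(free_open_lexicon.code())
```

Keeping the three digits as separate fields, and joining them only for display, makes it easy to filter on one dimension, for example to ignore every resource whose accessibility score is 3.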

19
Quality

We distinguish two types of quality: absolute
(i.e. an inherent property of the resource) and
relative (i.e. in relation to how you want to use
it)
Absolute: standard-compliance and soundness
Relative: task-relevance and environment-relevance

20
Standard-compliance

Criterion: to what extent is the resource based
on a common standard (formal or de facto)?
Possible values (penalty based):
(3) no standard
(2) standard, but not fully compliant
(1) standard and fully compliant

21
Soundness

Criterion: to what extent is the resource based
on well-defined specifications?
Values:
(3) no specifications provided
(2) specs provided, but not fully compliant
(1) specs provided, fully compliant

22
Task-relevance

Criterion (relative): to what extent is the
resource suited for a specific task X?
Values (3 binary values):
contains all information needed for X (yes/no)
has the proper size for X (yes/no)
based on a relevant selection of items for X
(yes/no)

23
Environment-relevance

Criterion: to what extent is the resource
interoperable with its environment (other
resources)?
Values (3 binary values):
information matches (yes/no)
size matches (yes/no)
selection matches (yes/no)

24
Comments on quality

We can now express absolute quality objectively
in terms of a pair of scores (standard-compliance,
soundness); this score can be assigned by the
provider
And relative quality (for our own purposes) in
terms of two triples of yes/no answers
(task-relevance, environment-relevance); this
score can only be assigned by the user
Other attributes may be added as long as they can
be objectively assigned
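
The pair of absolute scores and the two yes/no triples could be recorded like this. The sketch is purely illustrative: the type names, field names and output format are assumptions made here, not something the project prescribes.

```python
from typing import NamedTuple


class AbsoluteQuality(NamedTuple):
    """Assigned by the provider; penalty scores 1 (best) to 3."""
    standard_compliance: int
    soundness: int


class Relevance(NamedTuple):
    """Assigned by the user, for one task or one environment."""
    information: bool  # information is sufficient / matches
    size: bool         # size is adequate / matches
    selection: bool    # selection of items is relevant / matches


def yes_no(triple: Relevance) -> str:
    """Render a triple of booleans as yes/no answers."""
    return "/".join("yes" if value else "no" for value in triple)


def describe(absolute: AbsoluteQuality, task: Relevance, environment: Relevance) -> str:
    """One-line summary of absolute and relative quality."""
    return (
        f"absolute=({absolute.standard_compliance}, {absolute.soundness}) "
        f"task={yes_no(task)} environment={yes_no(environment)}"
    )


summary = describe(
    AbsoluteQuality(standard_compliance=1, soundness=2),
    task=Relevance(information=True, size=True, selection=False),
    environment=Relevance(information=True, size=False, selection=False),
)
print(summary)
```

Separating the provider-assigned pair from the user-assigned triples mirrors the point made on the slide: the first can be published once with the resource, the second has to be filled in again for every task and environment.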

25
Quantity

The DLU team did not try to formulate any
quantitative requirements
We have tried to do this in the context of the
NEMLAR project, see below for our tentative
figures
Statistical approaches can swallow any amount of
resources, and minimal figures are very hard to
find
Our figure-finding exercise has been very much
example-driven

26
Standards

Very few formal standards exist, although there
are some (cf. Romary & Ide at the LREC 2004
workshop; Monachini et al., 2003)
Evolving de facto standards include:
Bottom-up work by committees (TEI)
Top-down actions
Projects aiming at standards (e.g. EAGLES, ISLE)
Example-setting R&D projects (e.g. WordNet,
SpeechDat, Multext)
Our position: any standard is better than no
standard at all

27
Defining a BLARK

Work carried out in the context of the NEMLAR
project (www.nemlar.org), aimed at Arabic
resources
Work described here based on project deliverables
(see site), summarized in article by Maegaard,
Krauwer, Choukri, Damsgaard presented at NEMLAR
conference in Cairo (Sep 2004)

28
Approach adopted

Same strategy as the Dutch Language Union
(applications > modules > resources)
But with different results because of differences
in social/economic situation and in language
structure
Results follow, in terms of global definitions
and tentative size indications (no specs provided
at this stage, but project is still ongoing)
Feedback is welcome!!!!!!!!

29
Written resources (1)

Lexicon
For all components: 40 000 stems with POS and
morphology
For sentence boundary detection: list of
conjunctions and other sentence starters/stoppers
For named entity recognition: 50 000 human proper
names
For semantic analysis: same 40 000, with
subcategorization, shallow lexical semantic info,
possibly a WordNet

30
Written resources (2)

Bi-/Multilingual lexicon
Same size as monolingual
Thesauri, ontologies, wordnets
Thesaurus: subtree with ca. 200-300 nodes for each
domain
Ontologies and wordnets: ideally same size as
lexicon

31
Written resources (3)

Corpora
For term extraction: 100 million words,
unannotated
For small applications: 0.5 million words,
annotated
For statistical POS tagger: 1-3 million (ann)
Sentence boundary: 0.5-1.5 million (ann)
Named entity (stat based): 1.5 million (ann)
Term extraction: 100 million (ann)
Co-reference resolution: 1 million (ann)
WSD: 2-3 million (ann)

32
Written resources (4)

Multilingual corpora
For alignment: 0.5 million (tagged)
Multimodal corpora
For OCR (printed): ??
For OCR (hand-written): ??

33
Spoken resources (1)

Acoustic data
For dictation: 50-100 speakers, 20 min each,
fully transcribed, plus 10 speakers for testing
For telephony: 500 speakers uttering 50 different
sentences (based on SpeechDat, OrienTel)
For embedded speech recognition: data similar to
Speecon
For broadcast news transcription: 50-100 hours
well-annotated, plus 1000 hours of
non-transcribed data; should come with 300
million words of non-annotated written text

34
Spoken resources (2)

Acoustic data (cont'd)
For conversational speech: data similar to
CallHome/CallFriend from LDC
For speaker recognition: 500 speakers for
training, 3 minutes each, transcribed, plus 100
speakers for testing
For language/dialect identification: data similar
to CallFriend, or from Broadcast News (esp. for
variants of Arabic)
For speech synthesis: male and female speakers,
15 hours, using a read text, phonetically
balanced
For formant synthesis: same as above, with
hand-labelled formants

35
Spoken resources (3)

Multimodal corpora
For lip-movement reading: similar to M2VTS, with
some 50 faces
Written corpora for speech technologies
General: 300 million words, unannotated,
preferably broadcast news or other press and
media sources
For phonetic lexicon and language models: 1-5
million words, annotated
For Arabic: vowelized and non-vowelized corpus

36
What next? (1)

Check definition and quantification for
completeness and consistency, and correct them
where needed
Try to provide specs for every single item
Try to differentiate between general and Arabic
in definitions and specs

37
What next? (2)

For each language
Take the BLARK definition and specs
Adapt to local conditions
Make a survey of what exists and what has to be
made
Find the funds and build the BLARK for your
language
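
The survey step ("what exists and what has to be made") amounts to comparing required quantities with an inventory. The sketch below assumes a deliberately simple model in which every item is a single number; the required figures are taken from the tentative NEMLAR sizes given on the earlier slides, while the inventory is a made-up example.

```python
# Required minimum sizes, taken from the tentative figures on the
# "Written resources" slides (lower bound used where a range is given).
REQUIRED = {
    "lexicon stems": 40_000,
    "human proper names": 50_000,
    "POS-annotated corpus words": 1_000_000,  # lower bound of 1-3 million
}

# Hypothetical inventory for some language; these numbers are invented.
INVENTORY = {
    "lexicon stems": 55_000,
    "POS-annotated corpus words": 250_000,
}


def survey(required, inventory):
    """Return, per item, how much still has to be made (only items with a gap)."""
    gaps = {}
    for item, target in required.items():
        missing = target - inventory.get(item, 0)
        if missing > 0:
            gaps[item] = missing
    return gaps


gaps = survey(REQUIRED, INVENTORY)
for item, missing in gaps.items():
    print(f"{item}: {missing} still needed")
```

A real survey would also have to look at the availability and quality scores of what exists, since a resource that is company-internal or built on no standard may not count towards the BLARK at all.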

38
Prescriptive / descriptive

Prescriptive
the BLARK definition tells you which ingredients
you need
the specification tells you what they should look
like
Descriptive
a BLARK instantiation comes with a description of
its components

39
Main beneficiaries (1)

academic and industrial researchers: material to
try out ideas and conduct pilot studies
industrial developers: only for generic
activities, since specific applications require
more user and domain orientation
educators: material for experimental work by
students in labs

40
Main beneficiaries (2)

probably not the main languages in Europe (EN,
FR, GE) as they are pretty well covered anyway
mostly the languages that are not supported by a
strong market (because of small size or poor
economy)

41
References

Binnenpoorte et al. at LREC 2002 (see also
www.elsnet.org/dox/lrec2002-binnenpoorte.pdf)
ELRA Newsletter vol. 3, no. 2, 1998 (see also
www.elsnet.org/blark.html)
NEMLAR: see www.nemlar.org for
Arabic BLARK Report
NEMLAR presentation at Cairo conference
Romary & Ide at LREC 2004 (see also
www.elsnet.org/lrec2004-roadmap/Romary-Ide.ppt)

42
Concluding remarks

The BLARK aims at providing a common definition
of the notion "minimal set of resources"
It should help language communities to come
closer to the idea of creating an equal playing
field, in spite of market forces
It should facilitate porting of expertise
It is necessarily dynamic, as technologies evolve
rapidly
