Title: CSA Branding 101 October 2004 Presented By Giving Tree Group
1. Automated Indexing Implementation: Managing Expectations, by Craig Emerson, CSA
- The CSA landscape
- 6 offices, all using Cuadra's STAR production software
- 59 A&I databases in Social Sciences, Natural Science, Technology, Arts & Humanities
- 700,000 abstracts indexed each year
- 100 editorial staff (editors and editorial assistants)
- 18 thesauri
2.
- Focus on CSA's head office in Bethesda
- 300k records/yr; 6 thesauri; natural science
- journals, conferences, reports, books
- MAI program must assign specific vocabulary rather than general concepts that may or may not match our existing thesauri.
- Rule-based systems
- Text-analysis (lexical analysis)
- Marching orders: cost reduction rather than increased productivity
3. Objective: Use Machine-Assisted Indexing to increase Production Efficiency in Bethesda Office
- What does Production Efficiency really mean?
- if it just reflects cost savings, we'd outsource to India
- Focus must include:
- increasing the quality and utility of existing products (e.g. consistency of indexing, thesaurus maintenance, etc.)
- identification and development of new A&I products
4.
- But pragmatism dictates a short-term goal of more-for-less
- Setting my opportunity costs
- target freelancer costs: US$200k per year
- MAI packages are $40,000-$250,000 (often per server)
- annual maintenance costs between $5,000 and $20,000
- IT support and maintenance: ½ head for 6 thesauri and 3 major authority files
- whatever we invested, we wanted to make it back within 2-3 years
- reduce freelancer costs by about $35,000 each year
5.
- Savings are not uniform over 3 years
- You may actually need to spend more in the first year (time taken to rule-build)
- Rather than a 17.5% cost reduction per year, you may be faced with a 35% cost reduction in year 2
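As a worked check on the figures above (a sketch, not the presenter's own calculation): US$35,000 is 17.5% of the US$200k freelancer budget, and if the first year yields no saving because of rule-building, year 2 has to deliver double that to stay on track. The catch-up rule and the function names below are illustrative assumptions.

```python
from itertools import accumulate

# Figures taken from the slides.
FREELANCER_BUDGET = 200_000   # US$ per year
TARGET_SAVING = 35_000        # US$ per year

def reduction_pct(saving, budget=FREELANCER_BUDGET):
    """Saving expressed as a percentage of the freelancer budget."""
    return 100 * saving / budget

def catch_up_schedule(years=3, idle_years=1, target=TARGET_SAVING):
    """Yearly savings when the first `idle_years` save nothing (assumption)
    and the following year makes up the whole shortfall."""
    schedule = [0] * idle_years
    schedule.append(target * (idle_years + 1))
    schedule += [target] * (years - len(schedule))
    return schedule[:years]

uniform = [TARGET_SAVING] * 3
delayed = catch_up_schedule()

for label, plan in (("uniform", uniform), ("delayed", delayed)):
    pcts = [reduction_pct(s) for s in plan]
    print(label, "yearly %:", pcts, "cumulative $:", list(accumulate(plan)))
```

Both schedules reach the same cumulative saving by the end of year 2; the delayed one simply concentrates the pressure on that year.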
6. 35% isn't daunting given MAI propaganda: "Buy now! Productivity will increase x-fold!"
- Such claims are attainable but problematic to validate
- Time savings are database (thesaurus)-specific
- Rule building eats some of the time savings (first year)
- Even with high accuracy, indexing still requires checking
- Give editors more time and they'll spend it polishing abstracts
7.
- Set productivity goals regardless of theoretical constraints
- largest hurdle is staff management, not MAI technology
- initially, staff are worried database quality will suffer, and their jobs may disappear
- editors realize extra time allows them to do additional indexing, or other editorial jobs (mission creep)
- Whether you want time savings to translate into bigger and better databases, or a cost reduction, you must set your unambiguous expectations in stone: "By the end of the year, your indexing quota will double, and by the end of the following year, triple."
8.
Productivity(est) = f(thesaurus structure, rule-builder's ability, IT support, document type, quality of source text, howler acceptance factor)
- Are thesaurus terms close to natural language? If so, 70-80% accuracy within a year.
9.
- Rule-builder's ability
- Management constraints
- 1 editor per rule base (thesaurus, authority file)
- 20-50% of editor's time for rule-building
- Editor's limits
- Several databases have a single editor: the de facto rule builder. If a logic maven, 70-80% accuracy within a year.
10.
- IT Support
- Full-time IT person for 1-2 months (bases)
- ½-time person for maintenance through 12 months
- ¼-time person beyond 12 months
- Software constraints
- likely to push software beyond limits
- unique requirements probably not provided by software
- identification of software limitations unknown to provider, β version or not
11. Importance of document type
"We therefore conclude that narrow (less than 50 kilometres wide) compositional streaks, as well as the larger-scale bilateral zonation, are vertically continuous over tens to hundreds of kilometres within the plume."
"Melvin Anthony stormed to third with a posing routine that is setting new standards and may force his contemporaries to re-think their smooch-strut-and-jiggle approach."
12. Importance of text choice for MAI
- Article Title (Non-English)
- Abstract (Non-English)
- Author keywords
- Source Title
- Conference Title
- Special notes
Note: Full text isn't great for rule-based systems. It is better for concept-oriented MAI, where thesaurus matching isn't as important.
13.
- The Howler Factor
- Web translation tools are grist for the humour mill
- Do you need to eyeball every index term?
- Important for long-term estimates (1 year away)
- If yes, then you budget for at least 30-60 seconds per record
- If not, then the critical statistic is the percentage of records you're willing to release unvetted
- Consider that howlers will be out-of-context thesaurus terms
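To put the 30-60 seconds per record in perspective, the sketch below converts it into editor-hours per year. It combines that figure with the 300k records/yr quoted earlier for Bethesda; pairing the two numbers, the `vetted_fraction` parameter, and the 50% example value are illustrative assumptions, not figures from the talk.

```python
RECORDS_PER_YEAR = 300_000  # Bethesda volume quoted in the slides

def vetting_hours(records, seconds_per_record, vetted_fraction=1.0):
    """Editor-hours needed to eyeball index terms on a share of the records."""
    return records * vetted_fraction * seconds_per_record / 3600

# Every record vetted, at the low and high ends of the 30-60 s range.
low = vetting_hours(RECORDS_PER_YEAR, 30)
high = vetting_hours(RECORDS_PER_YEAR, 60)
print(f"vet everything: {low:.0f} to {high:.0f} hours per year")

# Hypothetical policy: release half of the records unvetted.
half = vetting_hours(RECORDS_PER_YEAR, 30, vetted_fraction=0.5)
print(f"vet 50% at 30 s each: {half:.0f} hours per year")
```

The share of records released unvetted is the lever: the hours scale linearly with it.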
14. (No Transcript)
15.
- The remaining issue is validation of results
- You'll achieve your efficiency goal: editors will double the output, but at what quality cost?
- Design some simple tests
- Editors should re-index material they've already indexed, again in a blind study
- Randomly choose 50 records indexed manually, and 50 indexed with the help of MAI. Rank the 100 records with respect to indexing in a single blind study. Any pattern?
- As above, but with MAI-only
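The 50/50 blind test could be set up along these lines. This is a minimal sketch: the record IDs, the group labels, and the mean-rank comparison used to look for "any pattern" are assumptions for illustration, not the procedure actually used at CSA.

```python
import random
from statistics import mean

def build_blind_sample(manual_ids, mai_ids, per_group=50, seed=None):
    """Draw per_group records from each pool and shuffle them together.

    Returns the shuffled review order (IDs only, so reviewers cannot see
    the indexing method) and the answer key, kept aside until ranking ends.
    """
    rng = random.Random(seed)
    chosen = [(rid, "manual") for rid in rng.sample(manual_ids, per_group)]
    chosen += [(rid, "mai_assisted") for rid in rng.sample(mai_ids, per_group)]
    rng.shuffle(chosen)
    review_order = [rid for rid, _ in chosen]
    answer_key = dict(chosen)
    return review_order, answer_key

def mean_rank_by_group(ranking, answer_key):
    """Mean rank position (1 = best) per method, computed after unblinding."""
    positions = {}
    for rank, rid in enumerate(ranking, start=1):
        positions.setdefault(answer_key[rid], []).append(rank)
    return {group: mean(ranks) for group, ranks in positions.items()}

# Hypothetical record IDs standing in for the two pools.
manual_pool = list(range(1000, 1400))
mai_pool = list(range(5000, 5400))
order, key = build_blind_sample(manual_pool, mai_pool, seed=2004)
print(len(order), "records to rank; first five:", order[:5])
```

Once reviewers return a ranking (a list of IDs from best to worst), `mean_rank_by_group(ranking, key)` gives one simple view of whether either method clusters toward the top. The same setup covers the MAI-only variant by swapping in a different pool.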
16.
- So what happened with the Bethesda implementation?
- started the process 10 months ago
- MAI accuracy: 25-75% (x̄ = 63%)
- increase in indexing rate: 20-300% (x̄ = 50%)
- no change in quality in records with MAI + manual
- significantly lower quality in MAI-only indexing
- not as a result of howlers
- caused by addition of more general terminology
17.
- Unanticipated problems
- software much more limited than anticipated
- "Craig: The requirements for Java changed with the new software, but they didn't tell us about it. How can we go about installing v.1.4.2_06 of the Java JRE in Apollo? Would it break anything else that relies on Java? Francis"
- internal IT support much less than anticipated
- Unanticipated benefits
- thesaurus maintenance has improved; quality of indexing has increased (editors focus on difficult terminology)
- editors have become invigorated: they live for arguments, and MAI rule-building provides a feast
- new projects have developed: reindexing backfiles; custom index for cluster databases; ability to index document types not previously considered because of volume
18.
- In Summary
- Be clear on goal: increased production or cost savings
- Lock in IT support
- You'll be able to integrate immediately
- Assume you'll need 1 year to realize a significant benefit and 2 years to optimize the benefit
- Conduct a pilot study to estimate productivity gains; otherwise assume a 2-3x increase in indexing rate (higher if thesaurus matching isn't an issue)
- Determine efficiencies on an ongoing basis and track editors' time as never before