Text Analytics Workshop Development - PowerPoint PPT Presentation

About This Presentation
Title:

Text Analytics Workshop Development

Description:

Title: Taxonomy Development Workshop Author: Tom Reamy Last modified by: Tom Reamy Created Date: 5/31/2002 6:24:58 PM Document presentation format – PowerPoint PPT presentation

Slides: 32
Provided by: TomR97
Transcript and Presenter's Notes

Title: Text Analytics Workshop Development


1
Text Analytics Workshop Development
  • Tom Reamy, Chief Knowledge Architect
  • KAPS Group
  • Knowledge Architecture Professional Services
  • http://www.kapsgroup.com

2
Agenda
  • Development - Foundation
  • Case Study 1: Internet News
  • Case Study 2: Tale of Two Taxonomies
  • Case Study 3: Software Evaluation and Beyond
  • BBN: Motivations
  • Amgen 2 clustering, auto-taxonomy
  • GAO: Taxonomy from terms to rules
  • Exercises

3
Text Analytics Platform: 4 Basic Contexts
  • Ideas: Content Structure
  • Language and Mind of your organization
  • Applications - exchange meaning, not data
  • People: Company Structure
  • Communities, Users
  • Central team - establish standards, facilitate
  • Activities: Business processes and procedures
  • Technology
  • CMS, Search, portals, taxonomy tools
  • Applications: BI, CI, Text Mining

4
Text Analytics Development Foundation
  • Articulated Information Management Strategy (K
    Map)
  • Content and Structures and Metadata
  • Search, ECM, applications - and how used in
    Enterprise
  • Community information needs and Text Analytics
    Team
  • POC establishes the preliminary foundation
  • Need to expand and deepen
  • Content: full range, basis for rules-training
  • Additional SMEs: content selection, refinement
  • Taxonomy: starting point for categorization /
    suitable?
  • Databases: starting point for entity catalogs

5
Knowledge Architecture Audit: Knowledge Map
  • Project Foundation - Meetings, work groups - General Outline
  • Contextual Interviews - Overview High Level Process Community - Broad Context
  • Information Interviews - Info behaviors of Business processes - Deep Details
  • App/Content Catalog - Technology and content - Deep Details
  • User Survey - All 4 dimensions - Complete Picture
  • Strategy Document - Meetings, work groups - New Foundation

6
Taxonomy Development Process: Progressive
Refinement
Taxonomy Model Information Interviews Content Analysis Refine Map Community Governance Plan
Buy/Find work groups Overview Info behaviors, Card Sorts Bottom Up Prototypes Interviews Evaluate Refine Interviews Develop, Refine
General Outline Preliminary Taxonomy Taxonomy 1.0 Taxonomy 1.0-1.9 Tax 2.0 Taxonomy
7
Text Analytics Development: Categorization Process
  • Starter Taxonomy
  • If no taxonomy, develop initial high level (see
    Chart)
  • Analysis of taxonomy: suitable for
    categorization
  • Structure: not too flat, not too large
  • Orthogonal categories
  • Content Selection
  • Map of all anticipated content
  • Selection of training sets if possible
  • Automated selection of training sets: taxonomy
    nodes as first categorization rules, apply and
    get content
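
The automated selection step above can be sketched roughly as follows. This is a minimal illustration, not the workshop's actual tooling: the taxonomy, seed terms, and documents are all hypothetical, and each node's seed terms stand in for a first categorization rule.

```python
import re
from collections import defaultdict

# Hypothetical starter taxonomy: node label -> seed terms used as a first rule.
TAXONOMY = {
    "Healthcare": ["healthcare", "hospital"],
    "Travel": ["travel", "airline"],
}

# Hypothetical content collection.
DOCS = {
    "d1": "Hospital mergers reshape regional healthcare.",
    "d2": "Airline fares rise ahead of summer travel.",
    "d3": "Quarterly earnings beat expectations.",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def select_training_sets(documents, taxonomy):
    """Assign each document to every node whose seed terms it mentions."""
    training = defaultdict(list)
    for doc_id, text in documents.items():
        tokens = set(tokenize(text))
        for node, terms in taxonomy.items():
            if tokens & set(terms):
                training[node].append(doc_id)
    return dict(training)


training_sets = select_training_sets(DOCS, TAXONOMY)
print(training_sets)
```

Documents that match no node (d3 here) stay unassigned and are candidates for manual review or for new taxonomy nodes.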

8
Text Analytics Development: Categorization Process
  • First Round of Categorization Rules
  • Term building from content: basic set of terms
    that appear often / important to content
  • Add terms to rule, apply to broader set of
    content
  • Repeat for more terms: get recall-precision
    scores
  • Repeat, refine, repeat, refine, repeat
  • Get SME feedback: formal process, scoring
  • Get SME feedback: human judgments
  • Test against more, new content
  • Repeat until done: 90%?
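
One refinement cycle from the list above might be scored like this. It is a sketch under stated assumptions: the rule is a plain set of terms, the labeled stories are invented, and the labels stand in for SME judgments.

```python
def matches(rule_terms, text):
    """A rule fires when any of its terms appears as a word in the text."""
    words = set(text.lower().split())
    return any(term in words for term in rule_terms)


def precision_recall(rule_terms, labeled_docs):
    """Score a rule against (text, is_relevant) pairs judged by SMEs."""
    tp = fp = fn = 0
    for text, relevant in labeled_docs:
        hit = matches(rule_terms, text)
        if hit and relevant:
            tp += 1
        elif hit:
            fp += 1
        elif relevant:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical test collection for a "Healthcare" category.
LABELED = [
    ("new vaccine trial begins", True),
    ("hospital opens new wing", True),
    ("clinic reports flu cases", True),
    ("trial of former executive begins", False),
    ("stocks rally on earnings", False),
]

round1 = precision_recall({"vaccine"}, LABELED)
round2 = precision_recall({"vaccine", "hospital", "trial"}, LABELED)
print(round1, round2)
```

In this toy data, adding terms raises recall from one in three to two in three, while the ambiguous term "trial" pulls in a false hit and lowers precision. That trade-off is what each repeat-and-refine round is meant to manage.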

9
Text Analytics Development: Entity Extraction
Process
  • Facet Design: from KA Audit, K Map
  • Find and Convert catalogs
  • Organization: internal resources
  • People: corporate yellow pages, HR
  • Include variants
  • Scripts to convert catalogs: programming
    resource
  • Build initial rules: follow categorization
    process
  • Differences: scale, score
  • Recall: find all entities
  • Precision: correct assignment to entity class
  • Issue: disambiguation - Ford company, person,
    car
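
A catalog-conversion script of the kind mentioned above could look something like this. The catalog rows, field names, and sample sentence are hypothetical, and the matching is a simple dictionary lookup rather than any particular vendor's extraction engine.

```python
import re

# Hypothetical rows exported from an HR directory or organization catalog.
CATALOG_ROWS = [
    {"name": "Jane Smith", "class": "Person", "variants": ["J. Smith"]},
    {"name": "Acme Corporation", "class": "Organization", "variants": ["Acme"]},
]


def build_lookup(rows):
    """Map every lowercase name or variant to (canonical name, entity class)."""
    lookup = {}
    for row in rows:
        for surface in [row["name"], *row["variants"]]:
            lookup[surface.lower()] = (row["name"], row["class"])
    return lookup


def extract_entities(text, lookup):
    """Return the sorted list of catalog entities mentioned in the text."""
    found = set()
    lowered = text.lower()
    for surface, entity in lookup.items():
        # Lookarounds keep a short variant from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(entity)
    return sorted(found)


lookup = build_lookup(CATALOG_ROWS)
entities = extract_entities("J. Smith joined Acme last year.", lookup)
print(entities)
```

A lookup like this addresses recall through variants. It does nothing for the disambiguation issue: a surface form such as "Ford" that belongs to several entity classes would still need context rules.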

10
Case Study - Background
  • Inxight Smart Discovery
  • Multiple Taxonomies
  • Healthcare: first target
  • Travel, Media, Education, Business, Consumer
    Goods,
  • Content: 800 Internet news sources
  • 5,000 stories a day
  • Application: Newsletters
  • Editors using categorized results
  • Easier than full automation

11
Case Study - Approach
  • Initial High Level Taxonomy
  • Auto generation: very strange, not usable
  • Editors: High Level sections of newsletters
  • Editors and Taxonomy Pros - Broad categories and
    refine
  • Develop Categorization Rules
  • Multiple Test collections
  • Good stories, bad stories: close misses - terms
  • Recall and Precision Cycles
  • Refine and test: taxonomists, many rounds
  • Review: editors, 2-3 rounds
  • Repeat: about 4 weeks

12-18
(No Transcript)

19
Case Study - Issues
  • Taxonomy Structure
  • Aggregate nodes vs. independent nodes
  • Children Nodes: subset - rare
  • Depth of taxonomy and complexity of rules
  • Trade-off: need to update and usefulness of
    categories
  • Multiple avenues - Facets: a source (New York
    Times) can be put into rules or made a facet to
    filter results
  • When to use filter or terms: experimental
  • Recall more important than precision: editors'
    role
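
The choice between putting a source into a rule and treating it as a facet can be illustrated with a small sketch. The stories, the "hospital" term, and the function names are all hypothetical; only the New York Times example comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Story:
    source: str
    text: str


# Hypothetical stories from two sources.
STORIES = [
    Story("New York Times", "Hospital funding debate continues"),
    Story("Other Wire", "Hospital funding cut announced"),
    Story("New York Times", "Airline adds new routes"),
]


def rule_with_source(story):
    """Option A: the source is written into the categorization rule itself."""
    return story.source == "New York Times" and "hospital" in story.text.lower()


def rule_terms_only(story):
    """Option B: the rule tests terms only; source stays a separate facet."""
    return "hospital" in story.text.lower()


def filter_by_facet(stories, source):
    """Narrow already-categorized stories by their source facet."""
    return [s for s in stories if s.source == source]


option_a = [s for s in STORIES if rule_with_source(s)]
categorized = [s for s in STORIES if rule_terms_only(s)]
option_b = filter_by_facet(categorized, "New York Times")
print(len(option_a), len(categorized), len(option_b))
```

Both options return the same New York Times story, but option B keeps the story from the other source in the category and lets an editor apply or drop the filter afterwards, which suits a workflow where recall matters more than precision.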

20
Case Study - Lessons Learned
  • Combination of SME and Taxonomy pros
  • Combination of Features: Entity extraction,
    terms, Boolean, filters, facts
  • Training sets and find similar are weakest
  • Somewhat useful during development for terms
  • No best answer: taxonomy structure, format of
    rules
  • Need custom development
  • Plan for ongoing refinement
  • This stuff actually works!

21
Enterprise Environment: Case Studies
  • A Tale of Two Taxonomies
  • It was the best of times, it was the worst of
    times
  • Basic Approach
  • Initial meetings: project planning
  • High level K map: content, people, technology
  • Contextual and Information Interviews
  • Content Analysis
  • Draft Taxonomy: validation interviews, refine
  • Integration and Governance Plans

22
Enterprise Environment Case One: Taxonomy, 7
facets
  • Taxonomy of Subjects / Disciplines
  • Science > Marine Science > Marine microbiology >
    Marine toxins
  • Facets
  • Organization > Division > Group
  • Clients > Federal > EPA
  • Instruments > Environmental Testing > Ocean
    Analysis > Vehicle
  • Facilities > Division > Location > Building X
  • Methods > Social > Population Study
  • Materials > Compounds > Chemicals
  • Content Type > Knowledge Asset > Proposals
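
One way to hold a subject path and facet values like those above is as delimited paths that expand into their broader levels. This is a sketch only: the document record and helper names are hypothetical, and only the path values are taken from the slide.

```python
def parse_path(path):
    """Split 'A > B > C' into ['A', 'B', 'C']."""
    return [part.strip() for part in path.split(">")]


def ancestors(path):
    """Return every prefix of a path, from the top level down to the path itself."""
    parts = parse_path(path)
    return [" > ".join(parts[:i]) for i in range(1, len(parts) + 1)]


# Hypothetical document record; facet names are keys, paths are values.
document = {
    "Subject": "Science > Marine Science > Marine microbiology > Marine toxins",
    "Clients": "Federal > EPA",
    "Content Type": "Knowledge Asset > Proposals",
}


def matches_facet(doc, facet, node):
    """True when the document is tagged at the node or anywhere beneath it."""
    return node in ancestors(doc[facet])


subject_levels = ancestors(document["Subject"])
print(subject_levels)
print(matches_facet(document, "Clients", "Federal"))
```

Expanding a leaf tag into its ancestors lets a document tagged "Marine toxins" also answer a browse or filter on "Science > Marine Science" without being tagged twice.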

23
Enterprise Environment Case One: Taxonomy, 7
facets
  • Project Owner: KM department, included RM,
    business process
  • Involvement of library - critical
  • Realistic budget, flexible project plan
  • Successful interviews: build on context
  • Overall information strategy: where taxonomy
    fits
  • Good Draft taxonomy and extended refinement
  • Software, process, team: train library staff
  • Good selection and number of facets
  • Final plans and hand off to client

24
Enterprise Environment Case Two: Taxonomy, 4
facets
  • Taxonomy of Subjects / Disciplines
  • Geology > Petrology
  • Facets
  • Organization > Division > Group
  • Process > Drill a Well > File Test Plan
  • Assets > Platforms > Platform A
  • Content Type > Communication > Presentations

25
Enterprise Environment Case Two: Taxonomy, 4
facets
  • Environment Issues
  • Value of taxonomy understood, but not the
    complexity and scope
  • Under budget, under staffed
  • Location: not KM, tied to RM and software
  • Solution looking for the right problem
  • Importance of an internal library staff
  • Difficulty of merging internal expertise and
    taxonomy

26
Enterprise Environment Case Two: Taxonomy, 4
facets
  • Project Issues
  • Project mind set: not infrastructure
  • Wrong kind of project management
  • Special needs of a taxonomy project
  • Importance of integration with team, company
  • Project plan more important than results
  • Rushing to meet deadlines doesn't work with
    semantics as well as software

27
Enterprise Environment Case Two: Taxonomy, 4
facets
  • Research Issues
  • Not enough research and wrong people
  • Interference of non-taxonomy communication
  • Misunderstanding of research: wanted tinker toy
    connections
  • Interview 1 implies conclusion A
  • Design Issues
  • Not enough facets
  • Wrong set of facets: business, not information
  • Ill-defined facets: too complex internal
    structure

28
Taxonomy Development Conclusion: Risk Factors
  • Political-Cultural-Semantic Environment
  • Not simple resistance - more subtle
  • Re-interpretation of specific conclusions and
    sequence of conclusions / relative importance of
    specific recommendations
  • Understanding project scope
  • Access to content and people
  • Enthusiastic access
  • Importance of a unified project team
  • Working communication as well as weekly meetings

29
Text Analytics Development Case Study 3 POC
Government Agency
  • Demo of SAS Teragram / Enterprise Content
    Categorization

30
Conclusion
  • Enterprise Context: strategic, self knowledge
  • Importance of a good foundation
  • Importance of Taxonomy Structure mapped to use
  • POC: a head start on development
  • Importance of Text Analytics Vision / Strategy
  • Infrastructure resource, not a project
  • Balance of expertise and local knowledge
  • Importance of Usability for refinement cycles
  • Difference of taxonomy and categorization
  • Concepts vs. text in documents

31
Questions?
  • Tom Reamy tomr_at_kapsgroup.com
  • KAPS Group
  • Knowledge Architecture Professional Services
  • http://www.kapsgroup.com