Dr. C. Lee Giles - PowerPoint PPT Presentation

About This Presentation
Title:

Dr. C. Lee Giles

Description:

IST 511 Information Management: Information and Technology Information extraction, data mining, metadata Dr. C. Lee Giles David Reese Professor, College of ... – PowerPoint PPT presentation

Slides: 89

Transcript and Presenter's Notes

Title: Dr. C. Lee Giles


1
IST 511 Information Management: Information and
Technology
Information extraction, data mining, metadata
  • Dr. C. Lee Giles
  • David Reese Professor, College of Information
    Sciences and Technology
  • The Pennsylvania State University, University
    Park, PA, USA
  • giles_at_ist.psu.edu
  • http://clgiles.ist.psu.edu

Special thanks to E. Agichtein, K. Borne, S.
Sarawagi, C. Lagoze,
2
Last time
  • What are probabilities
  • What is information theory
  • What is probabilistic reasoning
  • Definitions
  • Why important
  • How used in decision making
  • Decision trees
  • Impact on information science

3
Today
  • What is information extraction
  • What is data mining
  • Text mining as subfield
  • What is metadata
  • Impact on information science

4
Tomorrow
  • Topics used in IST
  • Digital libraries,
  • Scientometrics, bibliometrics
  • Digital humanities

5
Theories in Information Sciences
  • Enumerate some of these theories in this course.
  • Issues
  • Unified theory?
  • Domain of applicability
  • Conflicts
  • Theories here are
  • Very algorithmic
  • Some quantitative
  • Some qualitative
  • Quality of theories
  • Occam's razor
  • Subsumption of other theories (all can use
    machine learning)
  • Text mining is a special case of data mining
  • Natural language processing uses data mining
    methods
  • Theories
  • Natural language processing

6
Science Paradigms
  • A thousand years ago: science was empirical
  • describing natural phenomena
  • Last few hundred years: theoretical branch
  • using models, generalizations
  • Last few decades: a computational branch
  • simulating complex phenomena
  • Today: data science (eScience)
  • unify theory, experiment, and simulation
  • Data captured by instruments or generated by
    simulator
  • Processed by software
  • Information/Knowledge stored in computer
  • Scientist analyzes database / files using data
    management and statistics

7
Information extraction, data mining and natural
language processing
  • Natural language processing is the processing and
    understanding of human language by machines
  • Information Extraction can be considered a
    subclass
  • Also known as knowledge extraction
  • Data mining is the process of discovering new
    patterns from large data sets
  • Text mining is the data mining of text
  • Text analytics generally refers to the tools used
  • Information extraction is the process of
    extracting and labeling relevant data from large
    data sets, usually text
  • Large means unreasonable to process manually

8
The Value of Unstructured Text Data
  • Unstructured text data is the primary form of
    human-generated information
  • Business and government reports, blogs, web
    pages, news, scientific literature, online
    reviews,
  • Need to extract information and give it structure
    to effectively manage, search, mine, store and
    utilize this data
  • Information Extraction is a maturing and active
    research area
  • Software and companies exist
  • Intersection of Computational Linguistics,
    Machine Learning, Data mining, Databases, and
    Information Retrieval
  • Active crawling for text data

9
Example: Answering Queries Over Text
For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying ...
Query: SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
Result: Bill Gates, Bill Veghte
(from William Cohen's IE tutorial, 2003)
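Once the people have been extracted into a table, the query on this slide is an ordinary database query. A minimal sketch using Python's built-in sqlite3 module; the in-memory database and the column types are assumptions made for illustration, and the rows are copied from the slide (including the truncated organization in the last row):

```python
import sqlite3

# Rows as they appear in the PEOPLE table on the slide. The organization in
# the last row is truncated in the source ("Free Soft..") and is kept as-is.
rows = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Soft.."),
]

# Assumption: an in-memory SQLite database stands in for "XML or database".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
conn.executemany("INSERT INTO PEOPLE VALUES (?, ?, ?)", rows)

# The query from the slide, with the organization passed as a parameter.
cursor = conn.execute(
    "SELECT Name FROM PEOPLE WHERE Organization = ? ORDER BY rowid",
    ("Microsoft",),
)
names = [name for (name,) in cursor]
print(names)
```

The hard part of the task is filling the table from free text; the query itself is routine once the structure exists.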
10
Information extraction from text or PDFs
For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying ...
Query: SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
XML or database
For extraction of OAI metadata from academic
documents, see CiteSeerX (citeseerx.ist.psu.edu)
(William Cohen's IE tutorial, 2003)
11
Information Extraction Tasks
  • Extracting entities and relations (this talk)
  • Entities: named (e.g., Person) and generic (e.g.,
    disease name)
  • Relations: entities related in a predefined way
    (e.g., Location of a Disease outbreak, or a CEO
    of a Company)
  • Events: can be composed from multiple relation
    tuples
  • Common extraction subtasks
  • Preprocess: sentence chunking, syntactic parsing,
    morphological analysis
  • Create rules or extraction patterns: hand-coded,
    machine learning, and hybrid
  • Apply extraction patterns or rules to extract new
    information
  • Postprocess and integrate information
  • Co-reference resolution, deduplication,
    disambiguation

12
Entities
  • Wikipedia: An entity is something that has a
    distinct, separate existence, although it need
    not be a material existence.
  • Features
  • Permanent vs transient
  • Unique vs common
  • Animate vs inanimate
  • Small vs large
  • Mobile vs sessile
  • Place vs thing
  • Abstract vs real
  • Bio labels
  • Digital mention or reference

13
Example: Extracting Entities from Text
  • Useful for data warehousing, data cleaning, web
    data integration

Address: 4089 Whispering Pines Nobel Drive San Diego CA
92122
Citation: Ronald Fagin, Combining Fuzzy Information from
Multiple Systems, Proc. of ACM SIGMOD, 2002
Segment (si)  Sequence                                           Label (si)
S1            Ronald Fagin                                       Author
S2            Combining Fuzzy Information from Multiple Systems  Title
S3            Proc. of ACM SIGMOD                                Conference
S4            2002                                               Year
14
Entity Disambiguation
  • Task of clustering and linking similar entities
    in a document or between documents.
  • Labels, sometimes complex, are given to these
    entities
  • Sometimes includes task of extracting or finding
    those entities (information extraction, focused
    crawling, etc)

15
Hand-Coded Methods
  • Easy to construct in some cases
  • e.g., to recognize prices, phone numbers, zip
    codes, conference names, etc.
  • Intuitive to debug and maintain
  • Especially if written in a high-level language
  • Can incorporate domain knowledge
  • Scalability issues
  • Labor-intensive to create
  • Highly domain-specific
  • Often corpus-specific
  • Rule-matches can be expensive

IBM Avatar
16
Entity Disambiguation by some other name?
  • record linkage
  • merge/purge processing or list washing
  • data matching
  • object identity problem
  • named entity resolution
  • duplicate detection
  • record matching
  • instance identification
  • deduplication
  • coreference resolution
  • reference reconciliation
  • database hardening
  • Closely related to Natural Language Processing

17
Entity Disambiguation Applications
  • Speech understanding
  • Question/answering
  • Health records
  • Criminal activities
  • Finance records
  • Semantic web applications
  • Scientific discovery and search
  • Semantic search
  • Others?

18
Entity Tagging
  • Identifying mentions of entities (e.g., person
    names, locations, companies) in text
  • MUC (1997): Person, Location, Organization,
    Date/Time/Currency
  • ACE (2005): more than 100 more specific types
  • Hand-coded vs. Machine Learning approaches
  • Best approach depends on entity type and domain
  • Closed class (e.g., geographical locations,
    disease names, gene and protein names): hand-coded
    dictionaries
  • Syntactic (e.g., phone numbers, zip codes):
    regular expressions
  • Semantic (e.g., person and company names):
    mixture of context, syntactic features,
    dictionaries, heuristics, etc.
  • Almost solved for common/typical entity types
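
The two simplest approaches listed above, dictionaries for closed classes and regular expressions for syntactic entities, can be sketched in a few lines. This is a toy illustration under stated assumptions: the state list is a tiny hypothetical dictionary, the patterns cover only simple US-style formats, and the phone number in the sample text is made up.

```python
import re

# Hypothetical, deliberately tiny dictionary for a closed class.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}

# Simplified patterns for "syntactic" entities (US formats only, an assumption).
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def tag_entities(text):
    """Return a list of (entity_type, matched_text) pairs found in text."""
    found = []
    for match in PHONE_RE.finditer(text):
        found.append(("PHONE", match.group()))
    for match in ZIP_RE.finditer(text):
        found.append(("ZIP", match.group()))
    # Dictionary lookup: every alphabetic token is checked for membership.
    for token in re.findall(r"[A-Za-z]+", text):
        if token in US_STATES:
            found.append(("STATE", token))
    return found


sample = ("Call 814-555-0100 or write to University Park, PA 16802. "
          "Lincoln was born in Kentucky.")
tags = tag_entities(sample)
print(tags)
```

Semantic entities such as person and company names need context and learned features, which is where the machine learning methods on the following slides come in.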

19
Machine Learning Methods
  • Can work well when lots of training data is
    available and easy to construct
  • Can capture complex patterns that are hard to
    encode with hand-crafted rules
  • e.g., determine whether a review is positive or
    negative
  • extract long complex gene names
  • Non-local dependencies

20
Representation Models [Cohen and McCallum, 2003]
[Figure: six extraction models, each illustrated on the sentence
"Abraham Lincoln was born in Kentucky."]
  • Lexicons: is the candidate a member of a list (Alabama, Alaska,
    ..., Wisconsin, Wyoming)?
  • Classify Pre-segmented Candidates: a classifier decides which
    class each candidate belongs to
  • Sliding Window: a classifier decides which class each window
    belongs to; try alternate window sizes
  • Boundary Models: classifiers mark BEGIN and END positions
  • Finite State Machines: most likely state sequence?
  • Context Free Grammars: most likely parse? (NNP, V, P, NP, PP,
    VP, S)
  • ... and beyond
Any of these models can be used to capture words,
formatting or both.
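As a rough illustration of the sliding-window and lexicon ideas combined, the sketch below enumerates every window of one or two tokens over the example sentence and tests each window for membership in a lexicon. The lexicon contents and the helper name sliding_windows are hypothetical; a real system would replace the membership test with a trained classifier.

```python
def sliding_windows(tokens, max_size):
    """Yield (start, end, text) for every window of 1..max_size tokens."""
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            end = start + size
            yield start, end, " ".join(tokens[start:end])


# Hypothetical lexicon standing in for the "member?" test on the slide.
LEXICON = {"Abraham Lincoln": "PERSON", "Kentucky": "LOCATION"}

tokens = "Abraham Lincoln was born in Kentucky .".split()

# Trying alternate window sizes: here sizes 1 and 2.
matches = [
    (text, LEXICON[text])
    for _, _, text in sliding_windows(tokens, max_size=2)
    if text in LEXICON
]
print(matches)
```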
21
(Person) Name Disambiguation
  • Person name disambiguation
  • A person can be referred to in different ways
    with different attributes in multiple records;
    the goal of name disambiguation is to resolve
    such ambiguities, linking and merging all the
    records of the same entity together
  • Large numbers of mentions and entities
  • Consider three types of person name ambiguities
  • Aliases - one person with multiple aliases, name
    variations, or name changes
  • e.g., CL Giles / Lee Giles, Superman / Clark Kent
  • Common names - more than one person shares a
    common name
  • e.g., Jian Huang: 118 papers in DBLP
  • Typographical errors - resulting from human input
    or automatic extraction
  • Goal: disambiguate, cluster and link names in a
    large digital library or bibliographic resource
    such as Medline
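
A crude sketch of the alias case: mentions are grouped by surname and then merged when their given-name initials overlap. The functions parse_name and cluster_mentions and the matching rule are illustrative assumptions, not the method used in any particular digital library.

```python
from collections import defaultdict


def parse_name(mention):
    """Split a mention into (set of given-name initials, lowercase surname)."""
    parts = mention.replace(".", " ").split()
    surname = parts[-1].lower()
    initials = set()
    for given in parts[:-1]:
        if given.isupper():      # e.g. "CL" is read as the initials C and L
            initials.update(given)
        else:                    # e.g. "Lee" contributes the initial L
            initials.add(given[0].upper())
    return initials, surname


def cluster_mentions(mentions):
    """Group mentions that share a surname and at least one initial."""
    by_surname = defaultdict(list)
    for mention in mentions:
        by_surname[parse_name(mention)[1]].append(mention)

    clusters = []
    for group in by_surname.values():
        group_clusters = []      # list of (initials seen so far, members)
        for mention in group:
            initials, _ = parse_name(mention)
            for seen, members in group_clusters:
                if initials & seen:
                    seen.update(initials)
                    members.append(mention)
                    break
            else:
                group_clusters.append((set(initials), [mention]))
        clusters.extend(members for _, members in group_clusters)
    return clusters


clusters = cluster_mentions(
    ["CL Giles", "Lee Giles", "C. Lee Giles", "Jian Huang", "J. Huang"]
)
print(clusters)
```

A rule this simple merges every "Jian Huang" into one cluster, which is exactly the common-name problem; separating those authors needs additional attributes from the records.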

22
Popular Machine Learning Methods
For details, see [Feldman, 2006] and [Cohen, 2004]
  • Naive Bayes
  • SRV [Freitag 1998], Inductive Logic Programming
  • Rapier [Califf and Mooney 1997]
  • Hidden Markov Models [Leek 1997]
  • Maximum Entropy Markov Models [McCallum et al.
    2000]
  • Conditional Random Fields [Lafferty et al. 2001]
  • Scalability
  • Can be labor intensive to construct training data
  • At run time, complex features can be expensive to
    construct or process (batch algorithms can help
    [Chandel et al. 2006])

23
Data mining?
  • Process of semi-automatically analyzing large
    data sets and databases to find patterns that
    are
  • valid: hold on new data with some certainty
  • novel: non-obvious to the system
  • useful: should be possible to act on the item
  • understandable: humans should be able to
    interpret the pattern

24
Evolution of Data Mining
<http://www.thearling.com/text/dmwhite/dmwhite.htm>
25
Data Mining is Ready for Prime Time
  • Data mining is ready for general application
    because it engages three technologies that are
    now sufficiently mature
  • Massive data collection and delivery
  • Powerful multiprocessor computers
  • Sophisticated data mining algorithms

26
Organizational Reasons to use Data Mining
  • Most organizations already collect and refine
    massive quantities of data.
  • Their most important information is in their data
    warehouses.
  • Data mining moves beyond the analysis of past
    events to predicting future trends and
    behaviors that may be missed because they lie
    outside the experts' expectations.
  • Data mining tools can answer complex business
    questions that traditionally were too
    time-consuming to resolve.
  • Data mining tools can explore the intricate
    interdependencies within databases in order to
    discover hidden patterns and relationships.
  • Data mining allows decision-makers to make
    proactive, knowledge-driven decisions.

28
A Key Concept for Data Mining
  • Data Mining delivers actionable data
  • data that support decision-making
  • data that lead to knowledge and understanding
  • data with a purpose
  • i.e., Data do not exist for their own sake.
  • The Data Warehouse is a corporate asset (whether
    in business, marketing, banking, science,
    telecommunications, entertainment, computer
    security, or security).

29
Data Mining - the up side
  • Data mining is everywhere
  • Huge scientific databases (NASA, Human Genome, ...)
  • Corporate databases (OLAP)
  • Credit card usage histories (Capital One)
  • Loan applications (Credit Scoring)
  • Customer purchase records (CRM)
  • Web traffic analysis (Doubleclick)
  • Network security intrusion detection (Silent
    Runner)
  • The hunt for terrorists
  • The NBA!

30
Data Mining - the down side
  • Data mining is a pejorative in the business
    database community (data dredging)
  • They prefer to call it Knowledge Discovery, or
    Business Intelligence, or CRM (Customer
    Relationship Management), or Marketing, or OLAP
    (On-Line Analytical Processing)
  • Legal issues in many countries
  • The Data Mining Moratorium Act of 2003
  • debated within the U.S. Congress
  • privacy concerns
  • directed primarily against the DARPA TIA Program
    (Total Information Awareness)

31
The Information Age is Here!
  • "Data doubles about every year, but useful
    information seems to be decreasing."
  • Margaret Dunham, "Data Mining Techniques and
    Algorithms", 2002
  • "There is a growing gap between the generation of
    data and our understanding of it."
  • Witten and Frank, "Data Mining: Practical Machine
    Learning Tools", 1999
  • "The trouble with facts is that there are so many
    of them"
  • Samuel McChord Crothers, "The Gentle Reader",
    1973
  • "Get your facts first, and then you can distort
    them as much as you please."
  • Mark Twain

32
Characteristics of The Information Age
  • Data Avalanche
  • the flood of Terabytes of data is already
    happening, whether we like it or not
  • our present techniques of handling these data do
    not scale well with data volume
  • Distributed Digital Archives
  • will be the main access to data
  • will need to handle hundreds to thousands of
    queries per day
  • Systematic Data Exploration and Data Mining
  • will have a central role
  • statistical analysis of typical events
  • automated search for rare events

33
The Data Flood is Everywhere
  • Huge quantities of data are being generated in
    all business, government, and research domains
  • Banking, retail, marketing, telecommunications,
    other business transactions ...
  • Scientific data genomics, astronomy, biology,
    etc.
  • Web, text, and e-commerce

34
Data Growth Rate
[Figure: data growth in exabytes; 10-fold growth in 5 years!]
Data sources shown: DVD, RFID, digital TV, MP3 players, digital
cameras, camera phones, VoIP, medical imaging, laptops, data center
applications, games, satellite images, GPS, ATMs, scanners, sensors,
digital radio, DLP theaters, telematics, peer-to-peer, email, instant
messaging, videoconferencing, CAD/CAM, toys, industrial machines,
security systems, appliances
Source: IDC, 2008
35
What is Data Mining?
  • Data mining is defined as "an information
    extraction activity whose goal is to discover
    hidden facts contained in (large) databases."
  • Data mining is used to find patterns and
    relationships in data. (EDA = Exploratory Data
    Analysis)
  • Patterns can be analyzed via 2 types of models
  • Descriptive: Describe patterns and create
    meaningful subgroups or clusters.
  • Predictive: Forecast explicit values, based
    upon patterns in known results.
  • How does this become useful (not just bits of
    data)? ...
  • through KNOWLEDGE DISCOVERY
  • Data → Information → Knowledge →
    Understanding / Wisdom!

36
Historical Note: Many Names of Data Mining
  • Data Fishing, Data Dredging (1960-)
  • used by statisticians (as a bad name)
  • Data Mining (1990-)
  • used by DB and business communities
  • in 2003, bad image because of DARPA TIA
  • Knowledge Discovery in Databases (1989-)
  • used by AI and Machine Learning communities
  • also Data Archaeology, Information Harvesting,
    Information Discovery, Knowledge Extraction, ...

Currently, Data Mining and Knowledge Discovery
seem to be used interchangeably.
37
Relationship with other fields
  • Overlaps with machine learning, statistics,
    artificial intelligence, databases, visualization
    but more stress on
  • scalability of number of features and instances
  • stress on algorithms and architectures whereas
    foundations of methods and formulations provided
    by statistics and machine learning.
  • automation for handling large, heterogeneous data

38
Some basic operations
  • Predictive
  • Regression
  • Classification
  • Collaborative Filtering
  • Descriptive
  • Clustering / similarity matching
  • Association rules and variants
  • Deviation detection

39
Data Mining Examples
  • Classic Textbook Example of Data Mining
    (Legend?) Data mining of grocery store logs
    indicated that men who buy diapers also tend to
    buy beer at the same time.
  • Blockbuster Entertainment mines its video rental
    history database to recommend rentals to
    individual customers.
  • A financial institution discovered that credit
    applicants who used pencil on the form were much
    more likely to default on their debts than those
    who filled out the application using ink.
  • Credit card companies recommend products to
    cardholders based on analysis of their monthly
    expenditures.
  • Airline purchase transaction logs revealed that
    9-11 hijackers bought one-way airline tickets
    with the same credit card.
  • Astronomers examined objects with extreme colors
    in a huge database to discover the most distant
    Quasars ever seen.

41
Data Mining Application: Marketing
  • Sales Analysis
  • associations between product sales
  • beer and diapers
  • strawberry pop tarts and beer (and hurricanes)
  • Customer Profiling
  • data mining can tell you what types of customers
    buy what products
  • Identifying Customer Requirements
  • identify the best products for different
    customers
  • use prediction to find what factors will attract
    new customers

42
Data Mining Application: Fraud Detection
  • Auto Insurance Fraud
  • Association Rule Mining can detect a group of
    people who stage accidents to collect on
    insurance
  • Money Laundering
  • Since 1993, the US Treasury's Financial Crimes
    Enforcement Network agency has used a data-mining
    application to detect suspicious money
    transactions
  • Banking Loan Fraud
  • Security Pacific/Bank of America uses data mining
    to help with commercial lending decisions and to
    prevent fraud

43
The Necessity of Data Mining
  • Enormous interest in these data collections.
  • The environment to exploit these data does not
    exist!
  • 1 Terabyte at 100 Mbits/sec takes 1 day to
    transfer.
  • Hundreds to thousands of queries per day.
  • Data will reside at multiple locations, in many
    different formats.
  • Existing analysis tools do not scale to Terabyte
    data collections.
  • The need is acute! A solution will not just
    happen.
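
The transfer-time figure above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 Terabyte = 10^12 bytes, 100 Mbits/sec = 10^8 bits per second) and a link that runs at full speed with no protocol overhead.

```python
# Assumption: decimal units and a fully utilized link with no overhead.
terabyte_bits = 10**12 * 8            # 1 Terabyte expressed in bits
link_bits_per_second = 100 * 10**6    # 100 Mbits/sec

seconds = terabyte_bits / link_bits_per_second
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} h")
```

That comes to 80,000 seconds, or a little over 22 hours, which the slide rounds to 1 day.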

44
What is Knowledge Discovery?
  • Knowledge discovery refers to finding out new
    knowledge about an application domain using data
    on the domain usually stored in a database.
  • Application domains scientific, customer
    purchase records, computer network logs, web
    traffic logs, financial transactions, census
    data, basketball play-by-play histories, ...
  • Why are Data Mining and Knowledge Discovery such
    hot topics? --- because of the enormous interest
    in these huge databases and their potential for
    new discoveries.
  • In large databases, Data Mining and Knowledge
    Discovery come in two flavors
  • Event-based mining
  • Relationship-based mining

45
Event-Based Mining
  • (Event-based mining is based upon events or
    trends in data.)
  • Four distinct orthogonal categorizations
  • Known events / known models - use existing models
    (descriptive models) to locate known phenomena of
    interest either spatially or temporally within a
    large database.
  • Known events / unknown models - use clustering
    properties of data to discover new relationships
    and patterns among known phenomena.
  • Unknown events / known models - use known
    associations and relationships (predictive
    models) among parameters that describe a
    phenomenon to predict the presence of previously
    unseen examples of the same phenomenon within a
    large complex database.
  • Unknown events / unknown models - use thresholds
    or trends to identify transient or otherwise
    unique ("one-of-a-kind") events and therefore to
    discover new phenomena. → Serendipity!

46
Relationship-Based Data Mining (Based upon
associations and relationships among data items)
  • Spatial associations -- identify events or
    objects at the same physical spatial location, or
    at related locations (e.g., urban versus rural
    data).
  • Temporal associations -- identify events or
    transactions occurring during the same or related
    periods of time (e.g., periodically, or N days
    after event X).
  • Coincidence associations -- use clustering
    techniques to identify events that are co-located
    (that coincide) within a multi-dimensional
    parameter space.

47
User Requirements for a Data Mining System (What
features must a DM system have for users?)
  • Cross-Identification - refers to the classical
    problem of associating the objects listed in one
    database to the objects listed in another.
  • Cross-Correlation - refers to the search for
    correlations, tendencies, and trends between
    parameters in multi-dimensional data, usually
    across databases.
  • Nearest-Neighbor Identification - refers to the
    general application of clustering algorithms in
    multi-dimensional parameter space, usually within
    a single database.
  • Systematic Data Exploration - refers to the
    application of the broad range of event-based and
    relationship-based queries to one or more
    databases in the hope of making a serendipitous
    discovery of new events/objects or a new class of
    events/objects.

48
Representative Data Mining Architecture
<http://www.thearling.com/text/dmwhite/dmwhite.htm>
49
Data leads to Knowledge leads to Understanding
  • EXAMPLE
  • Data: 00100100111010100111100 (stored in
    database)
  • Information: ages and heights of children
    (metadata)
  • Knowledge: the older children tend to be taller
  • Understanding: children's bones grow as they get
    older

Data → Information → Knowledge → Understanding /
Wisdom!
50
Astronomy Example
Data:
(a) Imaging data (ones and zeroes)
(b) Spectral data (ones and zeroes)
  • Information (catalogs / databases)
  • Measure brightness of galaxies from image (e.g.,
    14.2 or 21.7)
  • Measure redshift of galaxies from spectrum (e.g.,
    0.0167 or 0.346)

Knowledge: Hubble Diagram → Redshift-Brightness
Correlation → Redshift = Distance
Understanding: the Universe is expanding!!
51
Goal of Data Mining
  • The end goal of data mining is not the data
    themselves, but the new knowledge and
    understanding that are revealed in the process,
    i.e., Business Intelligence (BI).
  • (Remember what we said about the business
    community's opinion of D.M.)
  • This is why the research field is usually
    referred to as KDD (Knowledge Discovery in
    Databases).

52
The Data Mining Process
The most important and time-consuming step is
Cleaning the Data.
53
Data Mining Methods and Some Examples
Methods:
  • Clustering
  • Classification
  • Associations
  • Neural Nets
  • Decision Trees
  • Pattern Recognition
  • Correlation/Trend Analysis
  • Principal Component Analysis
  • Regression Analysis
  • Outlier/Glitch Identification
  • Visualization
  • Autonomous Agents
  • Self-Organizing Maps (SOM)
  • Link (Affinity) Analysis
Examples:
  • Find all groups and classes of objects
    represented in the data
  • Classify new data items using the known classes
    and groups
  • Find associations and patterns among different
    data items
  • Organize information in the database based on
    relationships among key data descriptors
  • Identify linkages between data items based on
    features shared in common
54
Some Data Mining Techniques Graphically
Represented
Self-Organizing Map (SOM)
Clustering
Neural Network
Outlier (Anomaly) Detection
Link Analysis
Decision Tree
55
Remember what it is
Data Mining is an information extraction
activity whose goal is to discover hidden facts
contained in large databases.
56
Data Mining Technique: Clustering
In this case, three different groups
(classes) of items were found among all of the
items in the data set.
57
Data Mining Technique: Decision Tree
Classification
  • Question
  • Should I play tennis today?

Similar to the game 20 Questions
Same technique used by bank loan officers to
identify good potential customers versus poor
customers.
(I must really love tennis!)
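The tree on the slide is not in the transcript, so the sketch below assumes the widely used textbook "play tennis" tree, which splits first on outlook and then on humidity or wind. The attribute names and values are that assumption, not a reconstruction of the slide's figure. Each nested test plays the role of one question in the 20 Questions analogy.

```python
def play_tennis(outlook, humidity, wind):
    """Walk a small hand-written decision tree; return True to play."""
    if outlook == "sunny":
        return humidity == "normal"   # sunny days: play only if not humid
    if outlook == "overcast":
        return True                   # overcast days: always play
    if outlook == "rain":
        return wind == "weak"         # rainy days: play only in weak wind
    raise ValueError(f"unknown outlook: {outlook!r}")


decision = play_tennis("sunny", humidity="high", wind="weak")
print(decision)
```

In practice the tree is learned from labeled examples rather than written by hand; only the resulting structure is shown here.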
58
Data Mining Technique: Association Rule
Mining (Market Basket Analysis)
Sales records: transaction id, customer id,
products bought
  • Trend (Rule): Products p5, p8 often bought
    together
  • Trend (Rule): Customer 12 likes product p9
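
A rule such as "p5 and p8 are often bought together" is usually backed by two numbers: support (the share of transactions containing both products) and confidence (the share of transactions with one product that also contain the other). A minimal sketch on made-up sales records; the transactions and the 0.5 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales records: transaction id -> products bought.
transactions = {
    1: {"p5", "p8", "p2"},
    2: {"p5", "p8"},
    3: {"p1", "p9"},
    4: {"p5", "p8", "p9"},
    5: {"p2", "p9"},
}

item_counts = Counter()
pair_counts = Counter()
for basket in transactions.values():
    item_counts.update(basket)
    # Sorting makes each unordered pair map to a single key.
    pair_counts.update(combinations(sorted(basket), 2))


def support(pair):
    """Fraction of all transactions that contain both items of the pair."""
    return pair_counts[pair] / len(transactions)


def confidence(antecedent, consequent):
    """Fraction of transactions with antecedent that also hold consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


frequent_pairs = [pair for pair in pair_counts if support(pair) >= 0.5]
print(frequent_pairs, support(("p5", "p8")), confidence("p5", "p8"))
```

Counting every pair is only workable for small catalogs; algorithms such as Apriori prune the candidates first.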

59
Data Mining Algorithm: The SOM
Figure: The SOM (Self-Organizing Map) is one
technique for organizing information in a
database based upon links between concepts. It
can be used to find hidden relationships and
patterns in more complex data collections,
usually based on links between keywords or
metadata.
60
Data Mining Application: Outlier Detection
Figure: The clustering of data clouds (dc)
within a multidimensional parameter space
(p). Such a mapping can be used to search for
and identify clusters, voids, outliers,
one-of-a-kinds, relationships, and associations
among arbitrary parameters in a database (or
among various parameters in geographically
distributed databases).
61
Link Analysis for Terrorist SNA: Find all
connections and relationships among known
terrorists.
62
Data Mining Technology: Parallel Mining
Figure: Parallel Data Mining. The application of
parallel computing resources and parallel data
access (e.g., RAID) enables concurrent
drill-downs into large data collections.
63
Data Mining Methods Explained
  • Clustering: Group data items according to tight
    relationships.
  • Classification: Assign data items to
    predetermined groups.
  • Associations: Associate data with similar
    relationships. The beer-diaper example is an
    example of associative mining.
  • Artificial Neural Networks (ANN): Non-linear
    predictive models that learn through training and
    resemble biological neural networks in structure.
  • Decision Trees: Hierarchical sets of decisions,
    based upon rules, for rapid classification of a
    data collection.
  • Sequential Patterns: Identify or predict behavior
    patterns or trends.
  • Genetic Algorithms: Rapid optimization techniques
    that are based on the concepts of natural
    evolution.
  • Nearest Neighbor Method: Classify a data item
    according to its nearest neighbors (records that
    are most similar).
  • Rule induction: The extraction of useful if-then
    rules from data based on statistical
    significance.
  • Data visualization: The illustration and visual
    interpretation of complex relationships in
    multidimensional data using graphics tools.
  • Self-Organizing Map (SOM): Graphically organizes
    (in a 2-dimensional map) the information stored
    within a database based upon similarities and
    links between concepts. It can be used to find
    hidden relationships and patterns in more complex
    data collections.
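
Of the methods listed, the nearest neighbor method is the easiest to show in full. The sketch below labels a new item by majority vote among its k closest labeled records, using Euclidean distance. The training points, the choice of k = 3, and the function name knn_classify are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, labeled_points, k=3):
    """Label query by majority vote among its k nearest labeled points."""
    nearest = sorted(
        labeled_points, key=lambda item: math.dist(query, item[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D records belonging to two known classes.
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.3), "B"),
]

label = knn_classify((1.1, 1.0), training)
print(label)
```

Sorting all records for every query does not scale to the terabyte collections discussed earlier; spatial indexes or approximate search are the usual remedy.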

64
Data Mining Techniques: techniques are based on
Algorithms; techniques are used in Applications
65
KDnuggets
66
Tools used
67
Industries where data mining is used
68
http://www.kdnuggets.com/polls/2004/data_mining_applications_industries.htm
Poll of users (August 2004): Industries/fields where
you currently apply data mining? (216 votes total)
Industry / field              Votes  Percent
Banking                          29     13
Scientific data                  20      9
Direct Marketing/Fundraising     19      9
Fraud Detection                  19      9
Bioinformatics/Biotech           18      8
Insurance                        15      7
Medical/Pharma                   15      7
Telecommunications               12      6
eCommerce/Web                    12      6
Investment/Stocks                 9      4
Manufacturing                     9      4
Retail                            9      4
Security                          8      4
Travel                            2      1
Entertainment/News                1      0.5
Other                            19      9
69
Data Mining Summary
  • What? -- Data Mining is defined as "an
    information extraction activity whose goal is to
    discover hidden facts contained in (large)
    databases."
  • Why? -- To explore systematically and to make
    discoveries in huge databases.
  • How? -- Apply one of many techniques to find
    patterns, relationships, groupings, classes,
    trends, anomalies, rare events, unusual
    connections, and causal connections among items
    in a database.
  • Example -- The standard textbook example of data
    mining is the legendary trend found in grocery
    store logs that men who buy diapers also tend
    to buy beer at the same time.
  • Outcome -- Actionable information: make
    decisions based upon information discovered.
  • What is needed -- SIFTWARE: software that aids
    in isolating interesting and useful information
    by sifting through large databases.
  • Real world application -- Data → Information →
    Knowledge → Understanding / Wisdom!

70
The importance of metadata and their rules
  • So we have all this mined or extracted data:
    what is it?
  • Label some of it and call it metadata
  • You know what it is
  • Make it available to others (if you can)
  • Tim Berners-Lee
  • inventor of the world wide web
  • Founder of the W3C
  • Presentation at TED

71
Metadata is data about data
Metadata (and Markup languages)
Metadata often is written in XML
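As a small illustration of metadata written in XML, the sketch below builds a record with three Dublin Core elements using Python's standard xml.etree.ElementTree module. The wrapper element name "metadata" is an arbitrary choice, and the values reuse the citation from the entity extraction example earlier in this deck.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Assumption: "metadata" is just a convenient wrapper element name.
record = ET.Element("metadata")
for element, value in [
    ("title", "Combining Fuzzy Information from Multiple Systems"),
    ("creator", "Ronald Fagin"),
    ("date", "2002"),
]:
    child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
    child.text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because the element names come from a shared, published vocabulary, another system can interpret the record without knowing anything about the program that produced it.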
72
Metadata is semi-structured data conforming to
commonly agreed upon models, providing
operational interoperability in a heterogeneous
environment
73
What is metadata? Some simple definitions
  • Structured data about data.
  • Dublin Core Metadata Initiative FAQ, 2005
  • http://dublincore.org/resources/faq/
  • Machine-understandable information about Web
    resources or other things.
  • Tim Berners-Lee, W3C, 1997
  • http://www.w3.org/DesignIssues/Metadata

74
"Web resources or other things"
  • Metadata might be "about" anything!
  • HTML documents
  • digital images
  • databases
  • books
  • museum objects
  • archival records
  • metadata records
  • Web sites
  • collections
  • services
  • physical places
  • people
  • organizations
  • works
  • formats
  • concepts
  • events

75
What is metadata? Towards a "functional" view
  • Data associated with objects which relieves their
    potential users of having to have full advance
    knowledge of their existence or characteristics.
  • Lorcan Dempsey and Rachel Heery, "Metadata: a
    current view of practice and issues", 1998
  • http://www.ukoln.ac.uk/metadata/publications/jdmetadata/
  • Structured data about resources that can be used
    to help support a wide range of operations.
  • Michael Day, "Metadata in a Nutshell", 2001
  • http://www.ukoln.ac.uk/metadata/publications/nutshell/

76
What might metadata "say"?
  • What is this called?
  • What is this about?
  • Who made this?
  • When was this made?
  • Where do I get (a copy of) this?
  • When does this expire?
  • What format does this use?
  • Who is this intended for?
  • What does this cost?
  • Can I copy this?
  • Can I modify this?
  • What are the component parts of this?
  • What else refers to this?
  • What did "users" think of this?
  • (etc!)
77
What operations/functions?
  • resource disclosure and discovery
  • resource retrieval, use
  • resource management, including preservation
  • verification of authenticity
  • intellectual property rights management
  • commerce
  • content-rating
  • authentication and authorization
  • personalization and localization of services
  • (etc!)

78
What operations/functions?
  • Different functions, different metadata
  • Metadata (and metadata standards) sometimes
    classified according to function
  • Descriptive: primarily for discovery, retrieval
  • Administrative: primarily for management
  • Structural: relationships between component parts
    of resources
  • Contextual: relationships between resources
  • No one-size-fits-all solution!

79
Metadata importance
  • "Data about data" is about as good as the
    definition gets...
  • As a data resource grows, metadata becomes more
    important
  • Lack of metadata has different consequences
  • documentation: metadata can be regenerated
    automatically, or by hand
  • datasets, pictures: metadata, once lost, can be
    impossible to regenerate

80
Types of Metadata
See http://www.loc.gov/standards/metadata.html
  • Descriptive
  • Discovery / description of objects
  • Title, author, abstract, etc.
  • Structural
  • Storage and presentation of objects
  • 1 pdf file, 1 ppt file, 1 LaTeX file, etc.
  • Administrative
  • Management and preservation of objects
  • Access control lists, terms and conditions,
    format descriptions, meta-metadata
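
The three types above can be sketched as separate parts of one record. This is a minimal illustration in Python; the class names, field names, and sample values are assumptions made for the sketch, not a schema from the Library of Congress page or any other standard.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptive:
    """Discovery and description: title, author, abstract."""
    title: str
    author: str
    abstract: str = ""


@dataclass
class Structural:
    """Storage and presentation: the files that make up the object."""
    files: list = field(default_factory=list)


@dataclass
class Administrative:
    """Management and preservation: who may access it, under what terms."""
    access_control: list = field(default_factory=list)
    terms: str = ""


@dataclass
class ObjectMetadata:
    descriptive: Descriptive
    structural: Structural
    administrative: Administrative


def matches(record, query):
    """Discovery looks only at descriptive metadata (case-insensitive substring)."""
    q = query.lower()
    d = record.descriptive
    return q in d.title.lower() or q in d.author.lower() or q in d.abstract.lower()


# Hypothetical record for a lecture stored as two files.
lecture = ObjectMetadata(
    Descriptive("Information extraction, data mining, metadata", "C. Lee Giles"),
    Structural(["lecture.pdf", "lecture.ppt"]),
    Administrative(["students"], "course use only"),
)
print(matches(lecture, "metadata"))
```

Keeping the parts separate means a search service can read the descriptive part while a preservation service reads the administrative part, without either needing the other.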

81
Which View is Correct?
Figure 1 from http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
82
Approaches to Metadata
  • from Ng, Park and Burnett, 1997 (also JASIS,
    50(13)) http://www.scils.rutgers.edu/sypark/asis.html
  • library science: bibliographic control
  • organizing the physical containers of
    information, by means of bibliographical
    description, subject analysis, and classification
    notation construction, so that the container can
    be efficiently described, identified, located and
    retrieved
  • computer and information science: data management
  • not only to store, access and utilize data
    effectively, but also to provide data security,
    data sharing, and data integrity

83
Metadata and Cataloging
  • In library science, metadata issues are closely
    tied with cataloging issues
  • purpose of a catalog (Cutter, 1904)
  • enable a person to find a book
  • show what the library has
  • assist in the choice of a work
  • Does computer science have a cataloging analog
    coupled with metadata?

84
DL Metadata Issues
  • Who provides metadata?
  • author? publisher? professional cataloger?
    extracted from content?
  • Is metadata integrated with data?
  • related question: is metadata a first-class
    object?
  • Formats!
  • which ones?
  • extensible?
  • paradox: the more powerful the format, the less
    likely it will be used...

85
Metadata Formats and Implementation
  • Use markup languages
  • Interoperable
  • Extensible
  • Robust
  • Permits advanced search features
  • When online, the beginning of a semantic web!

86
Interesting Formats
  • Library science
  • Machine-Readable Cataloging (MARC): huge,
    extensive, all-purpose, one-size-fits-all format
  • pro: does everything
  • con: kids, don't try this at home!
  • Computer science
  • application-specific formats: refer, BibTeX,
    RFC-1807, etc.
  • Dublin Core - common ground?
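
One way to read "common ground" is as a crosswalk: fields from an application-specific format are mapped onto Dublin Core elements. The sketch below assumes a small, hand-picked mapping from BibTeX field names to Dublin Core element names; the table BIBTEX_TO_DC and the function crosswalk are hypothetical and not an official mapping.

```python
# Hypothetical, hand-picked mapping from BibTeX fields to Dublin Core elements.
BIBTEX_TO_DC = {
    "author": "creator",
    "title": "title",
    "year": "date",
    "url": "identifier",
}


def crosswalk(entry):
    """Split a BibTeX-style dict into (Dublin Core fields, unmapped fields)."""
    dc = {}
    unmapped = {}
    for key, value in entry.items():
        if key in BIBTEX_TO_DC:
            dc[BIBTEX_TO_DC[key]] = value
        else:
            unmapped[key] = value
    return dc, unmapped


# A reference cited earlier in these slides, written as BibTeX-style fields.
entry = {
    "author": "Lorcan Dempsey and Rachel Heery",
    "title": "Metadata: a current view of practice and issues",
    "year": "1998",
}
dc_fields, leftover = crosswalk(entry)
print(dc_fields)
print(leftover)
```

Anything that lands in the unmapped dict is what the simpler format cannot express, which is the trade-off behind the paradox on the previous slide.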

87
What we covered
  • Methods and tools for making sense of data
  • Assists reasoning, decision making
  • Data manipulation methods
  • Large data
  • How metadata helps

88
Basic assumptions of Web Information
Retrieval (Search engines)
  • Corpus: constantly changing; created by amateurs
    and professionals
  • Goal: retrieve summaries of relevant information
    quickly with links to the original site
  • High precision! Recall not important
  • Crawling important
  • Searcher: amateurs; no professional training and
    little or no concern about query quality
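
Precision and recall can be made concrete with a short sketch. It uses the standard set-based definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved); the function name and the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set-based definitions."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical first page of results: 4 documents shown, 3 of them relevant,
# out of 8 relevant documents that exist somewhere on the web.
shown = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9"]
p, r = precision_recall(shown, relevant)
print(p, r)  # 0.75 0.375
```

A web searcher who only looks at the first page cares that most of what is shown is relevant (precision), not that every relevant page on the web was found (recall).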

90
Importance of Data
  • Data is not only important to science but also
    to the humanities.
  • The sexy job in the next ten years will be ...
    to take data -- to be able to understand it, to
    process it, to extract value from it, to
    visualize it, to communicate it. -- Hal Varian
    (Economist, Berkeley and Google)
  • Elite American university students do not think
    big enough. That is exactly the complaint from
    some of the largest technology companies and the
    federal government. At the heart of this
    criticism is data. -- New York Times
  • Statistical agencies face increased demand for
    data products, and the questions asked by our
    society are becoming increasingly complex and
    hard to measure. Meeting these challenges
    requires innovation in cognitive research, and
    economic and statistical modeling. -- Roderick
    Little (Statistician, US Census and U Michigan)

91
Never too much Data
  • Companies that manage their data well are 5 to
    6 percent more productive. -- NYTimes

92
Building DBPedia
93
Words of wisdom
  • "We have confused information (of which there is
    too much) with ideas (of which there are too
    few)."
  • Paul Theroux
  • "The great Information Age is really an explosion
    of non-information; it is an explosion of data
    ... it is imperative to distinguish between the
    two; information is that which leads to
    understanding."
  • R.S. Wurman in his book Information Anxiety 2

94
Propositions
  • Data is valuable

95
Questions
  • Role in information science of
  • Information (knowledge) extraction
  • Data mining
  • Metadata
  • What next?