1These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn.
- Monte F. Hancock, Jr.
- Chief Scientist
- Celestech, Inc.
2Data Mining is the detection, characterization,
and exploitation of actionable patterns in data.
3Data Mining (DM)
- Data Mining (DM) is the principled detection, characterization, and exploitation of actionable patterns in data.
- It is performed by applying modern mathematical techniques to collected data in accordance with the scientific method.
- DM uses a combination of empirical and theoretical principles to Connect Structure to Meaning by
- Selecting and conditioning relevant data
- Identifying, characterizing, and classifying latent patterns
- Presenting useful representations and interpretations to users
- DM attempts to answer these questions
- What patterns are in the information?
- What are the characteristics of these patterns?
- Can meaning be ascribed to these patterns and/or their changes?
- Can these patterns be presented to users in a way that will facilitate their assessment, understanding, and exploitation?
- Can a machine learn these patterns and their relevant interpretations?
4DM for Decision Support
- Decision Support is all about
- enabling users to group information in familiar ways
- controlling complexity by layering results (e.g., drill-down)
- supporting users' changing priorities
- allowing intuition to be triggered ("I've seen this before!")
- preserving and automating perishable institutional knowledge
- providing objective, repeatable metrics (e.g., confidence factors)
- fusing and simplifying results
- automating alerts on important results ("It's happening again!")
- detecting emerging behaviors before they consummate ("Look!")
- delivering value (timely, relevant, accurate results)
- helping users make the best choices.
5DM Provides Intelligent Analytic Functions
- Automating pattern detection to characterize complex, distributed signatures that are worth human attention, and to recognize those that are not.
- Associating events that go together but are difficult for humans to correlate.
- Characterizing interesting processes, not just facts or simple events.
- Detecting actionable anomalies and explaining what makes them different AND interesting.
- Describing contexts from multiple perspectives with numbers, text, and graphics.
6DM Answers Questions Users are Asking
- Fusion Level 1: Who/What is Where/When in my space?
- Organize and present facts in domain context
- Fusion Level 2: What does it mean?
- Has this been seen before? What will happen next?
- Fusion Level 3: Do I care?
- Enterprise relevance? What action should be taken?
- Fusion Level 4: What can I do better next time?
- Adaptation by pattern updates and retraining
- How certain am I?
- Quantitative assessment of evidentiary pedigree
7Useful Data Applications
- Accurate identification and classification add value to raw data by tagging and annotation (e.g., fraud detection)
- Anomaly/normalcy and fusion characterize, quantify, and assess normalcy of patterns and trends (e.g., network intrusion detection)
- Emerging patterns and evidence evaluation: capturing institutional knowledge of how events arise and alerting when they emerge
- Behavior association: detection of actions that are distributed in time and space but synchronized by a common objective ("connecting the dots")
- Signature detection and association: detection and characterization of multivariate signals, symbols, and emissions (e.g., voice recognition)
- Concept tagging: reasoning about abstract relationships to tag and annotate media of all types (e.g., automated web bots)
- Software agents assisting analysts: small-footprint, fire-and-forget apps that facilitate search, collaboration, etc.
8 Some Good Data Mining Analytic Applications
- Help the user focus via unobtrusive automation
- Off-load burdensome labor (perform intelligent searches, smart winnowing)
- Post smart triggers/tripwires to data streams (e.g., anomaly detection)
- Help with mission triage ("Sort my in-basket!")
- Automate aspects of classification and detection
- Determine which sets of data hold the most information for a task
- Support construction of ad hoc, on-the-fly classifiers
- Provide automated constructs for merging decision engines (multi-level fusion)
- Detect and characterize domain drift ("the rules of the game are changing")
- Provide functionality to make best estimates of missing data
- Extract/characterize/employ knowledge
- Rule induction from data, develop signatures from data
- Implement reasoning for decision support
- High-dimensional visualization
- Embed decision-explanation capability into analytic applications
- Capture/automate/institutionalize best practice
- Make proven analytic processes available to all
- Capture rare, perishable human knowledge and put it everywhere
- Generate signature-ready prose reports
9Things that make hard problems VERY hard
- Events of interest occur relatively infrequently in very large datasets (population imbalance)
- Information is distributed in a complex way across many features (the feature selection problem)
- Collection is hard to task; data are difficult to prepare for analysis, and are never perfect (noise in the data, data gaps, coverage gaps)
- Target patterns are ambiguous/unknown, and squelch settings are brittle (e.g., hard to balance detection vs. false-alarm rates)
- Target patterns change/morph over time and across operational modes (domain drift; processing methods become stale)
10Some Key Principles of Information-Driven Data Mining
- Right People, Methods, Tools (in that order)
- Make no prior assumptions about the problem (agnostic)
- Begin with general techniques that let the data determine the direction of the analysis (Funnel Method)
- Don't jump to conclusions; perform process audits as needed
- Don't be a "one widget wonder": integrate multiple paradigms so the strengths of one compensate for the weaknesses of another
- Break the problem into the right pieces (Divide and Conquer)
- Work the data, not the tools, but automate when possible
- Be systematic, consistent, thorough; don't lose the forest for the trees.
- Document the work so that it is reproducible
- Collaborate to avoid surprises: team members, experts, customer
- Focus on the Goal: maximum value to the user within cost and schedule
11Select Appropriate Machine Reasoners
- 1.) Classifiers
- Classifiers ingest a list of attributes, and determine into which of finitely many categories the entity exhibiting these attributes falls. Automatic object recognition and next-event prediction are examples of this type of reasoning.
- 2.) Estimators
- Estimators ingest a list of attributes, and assign some numeric value to the entity exhibiting these attributes. The estimation of a probability or a "risk score" are examples of this type of reasoning.
- 3.) Semantic Mappers
- Semantic mappers ingest text (structured, unstructured, or both), and generate a data structure that gives the "meaning" of the text. Automatic gisting of documents is an example of this type of reasoning. Semantic mapping generally requires some kind of domain model.
- 4.) Planners
- Planners ingest a scenario description, and formulate an efficient sequence of feasible actions that will move the domain to the specified goal state.
- 5.) Associators
- Associators sample the entire corpus of domain data, and identify relationships among entities. Automatic clustering of data to identify coherent subpopulations is a simple example. A more sophisticated example is the forensic analysis of phone, flight, and financial records to infer the structure of terrorist networks.
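The five reasoner types above differ mainly in what they ingest and what they return. The following minimal Python sketch (not from the slides; the class names, attributes, and thresholds are illustrative stand-ins for trained models) shows the interface distinction between a classifier, which returns one of finitely many labels, and an estimator, which returns a number:

```python
from typing import Sequence

class ThreatClassifier:
    """Classifier: maps an attribute list to one of finitely many categories."""
    labels = ("benign", "suspicious", "hostile")

    def classify(self, attributes: Sequence[float]) -> str:
        score = sum(attributes)            # stand-in for a trained decision rule
        if score < 1.0:
            return self.labels[0]
        return self.labels[1] if score < 2.0 else self.labels[2]

class RiskEstimator:
    """Estimator: maps an attribute list to a numeric value (e.g., a risk score)."""
    def estimate(self, attributes: Sequence[float]) -> float:
        score = sum(attributes)
        return min(1.0, max(0.0, score / 3.0))   # crude score clipped to [0, 1]

if __name__ == "__main__":
    x = [0.4, 0.9, 0.3]
    print(ThreatClassifier().classify(x))   # one of finitely many categories
    print(RiskEstimator().estimate(x))      # a numeric value
```

Semantic mappers, planners, and associators follow the same pattern with richer inputs and outputs (text, scenario descriptions, or a whole corpus) and are not sketched here.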
12Overcoming Processing Challenges through
Intelligent Automation of Data Conditioning,
Feature Selection, and Source Conformation
- Data Quality
- Cleanliness, Consistency
- Comprehensiveness
- Completeness
- Correctness
- Information Quality
- Representative (ground truth)
- Timeliness
- Salience
- Independence
- Attributes of Enterprise Problems
- New trends
- New behavior/event schemes
- Non-stationarity
- Population imbalance
- Inability to act on findings
13Embedded Knowledge
- Principled, domain-savvy synthesis of circumstantial evidence
- Copes well with ambiguous, incomplete, or incorrect input
- Enables justification of results in terms domain experts use
- Facilitates good pedagogical aids
- Solves the problem the way the human expert does, and so is comprehensible to most domain experts.
- Degrades linearly in combinatorial domains
- Can grow in power with experience
- Preserves perishable expertise
- Allows efficient incremental upgrade/adjustment/repurposing
14Features
- A feature is the value assumed by some attribute of an entity in the domain
- (e.g., size, quality, age, color, etc.)
- Features can be numbers, symbols, or complex data objects
- Features are usually reduced to some simple form before modeling is performed.
- >>> Features are usually single numeric values or contiguous strings. <<<
15Feature Space
- Once the features have been designated, a feature space can be defined for a domain by placing the features into an ordered array in a systematic way.
- Each instance of an entity having the given features is then represented by a single point in n-dimensional Euclidean space: its feature vector.
- This Euclidean space, or feature space for the domain, has dimension equal to the number of features.
- Feature spaces can be one-dimensional, infinite-dimensional, or anywhere in between.
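A small Python sketch of the idea above, with an invented three-feature domain (the feature names, ordering, and color codes are illustrative assumptions, not from the slides):

```python
import numpy as np

# Fix a feature order so every entity maps to one point in 3-dimensional space.
FEATURE_ORDER = ("size_m", "age_years", "color_code")
COLOR_CODES = {"red": 0.0, "green": 1.0, "blue": 2.0}   # nominal -> numeric

def feature_vector(entity: dict) -> np.ndarray:
    """Place an entity's attribute values into the agreed feature order."""
    return np.array([entity["size_m"],
                     entity["age_years"],
                     COLOR_CODES[entity["color"]]])

v1 = feature_vector({"size_m": 4.2, "age_years": 3, "color": "red"})
v2 = feature_vector({"size_m": 4.5, "age_years": 7, "color": "blue"})
print(len(FEATURE_ORDER))          # dimension of the feature space: 3
print(np.linalg.norm(v1 - v2))     # Euclidean distance between the two feature vectors
```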
16How do classifiers work?
18Machines
- Data mining paradigms are characterized by
- A concept of operation (CONOP: component structure, I/O, training alg., operation)
- An architecture (component type, arrangement, semantics)
- A set of parameters (weights/coefficients/vigilance parameters)
- >>> It is assumed here that parameters are real numbers. <<<
- A machine is an instantiation of a data mining paradigm.
- Examples of parameter sets for various paradigms:
- Neural Networks: interconnect weights
- Belief Networks: conditional probability tables
- Kernel-based classifiers (SVM, RBF): regression coefficients
- Metric classifiers (K-means): cluster centroids
19A Spiral Methodology for the Data Mining Process
20The DM Discovery Phase Descriptive Modeling
- OLAP
- Visualization
- Unsupervised learning
- Link Analysis/Collaborative Filtering
- Rule Induction
21The DM Exploitation Phase Predictive Modeling
- Paradigm selection
- Test design
- Formulation of meta-schemes
- Model construction
- Model evaluation
- Model deployment
- Model maintenance
22A de facto standard DM Methodology
- CRISP-DM (cross-industry standard process for data mining)
- 1.) Business Understanding
- 2.) Data Understanding
- 3.) Data Preparation
- 4.) Modeling
- 5.) Evaluation
- 6.) Deployment
23Data Mining Paradigms: What does your solution look like?
- Conventional Decision Models: statistical inference, logistic regression, score cards
- Heuristic Models: human expert, knowledge-based expert systems, fuzzy logic, decision trees, belief nets
- Regression Models: neural networks (all sorts), radial basis functions, adaptive logic networks, decision trees, SVM
24Real-World DM Business Challenges
- Complex and conflicting goals
- Defining success
- Getting buy in
- Enterprise data is distributed
- Limited automation
- Unrealistic expectations
25Real-World DM Technical Challenges
- big data consume space and time
- efficiency vs. comprehensibility
- combinatorial explosion
- diluted information
- difficult to develop intuition
- algorithm roulette
26Data Mining Problems: What does your domain look like?
- How well is the problem understood?
- How "big" is the problem?
- What kind of data do we have?
- What question are we answering?
- How deeply buried in the data is the answer?
- How must the answer be presented to the user?
27. 1. Business Understanding
- How well is the problem understood?
28How well is the problem understood?
- Domain intuition: low/medium/high
- Experts available?
- Good documentation?
- DM team's prior experience?
- Prior art?
- What is the enterprise definition of success?
- What is the target environment?
- How skillful are the users?
- Where are the pitchforks?
29. 2. Data Understanding / 3. Preparing the Data
- How "big" is the problem?
- What kind of data do we have?
30DM Aspects of Data Preparation
- Data Selection
- Data Cleansing
- Data Representation
- Feature Extraction and Transformation
- Feature Enhancement
- Data Division
- Configuration Management
31How "big" is the problem?
- Number of exemplars (rows)
- Number of features (columns)
- Number of classes (ground truth)
- Cost/schedule/talent (dollars, days, dudes)
- Tools (own/make/buy, familiarity, scope)
32What kind of data do we have?
- Feature type nominal/numeric/complex
- Feature mix homo/heterogeneous by type
- Feature tempo
- Fresh/stale
- Periodic/sporadic
- Synchronous/asynchronous
- Feature data quality
- Low/high SNR
- Few/many gaps
- Easy/hard to access
- Objective/subjective
- Feature information quality
- Salience, correlation, localization, conditioning
- Comprehensive? Representative?
33How much data do I need?
- Many heuristics
- Monte's 6MN rule, and other similar rules of thumb (a worked reading follows this slide)
- Support vectors
- Segmentation requirements
- Comprehensive
- Representative
- Consider population imbalance
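The 6MN rule is cited but not defined on this slide. One common reading, assumed here, is that you want at least roughly 6 x M x N ground-truthed training vectors, where M is the number of classes and N the number of features; treat the sketch below as that assumption, not as Monte's definition:

```python
def six_mn_rule(num_classes: int, num_features: int) -> int:
    # Assumed reading of the heuristic: ~6 * M * N training vectors.
    return 6 * num_classes * num_features

# Example: a 5-class problem with 20 features suggests about 600 rows,
# before allowing extra data for population imbalance (rare classes need more).
print(six_mn_rule(5, 20))   # 600
```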
34Feature Saliency Tests
- Correlation/Independence
- Visualization to determine saliency
- Autoclustering to test for homogeneity
- KL-Principal Component Analysis
- Statistical Normalization (e.g., ZSCORE)
- Outliers, Gaps
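A minimal sketch of two of the tests listed above, statistical normalization (ZSCORE) and a correlation/independence screen, on synthetic data (the data and the 0.9 threshold are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 50.0, 0.5], scale=[2.0, 15.0, 0.1], size=(200, 3))

# ZSCORE: rescale each feature to zero mean and unit variance so that raw
# scale differences do not dominate distance-based methods.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Correlation/independence: flag feature pairs that carry largely redundant
# information.
corr = np.corrcoef(Z, rowvar=False)
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if abs(corr[i, j]) > 0.9:
            print(f"features {i} and {j} are highly correlated: {corr[i, j]:.2f}")
```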
35Making Feature Sets for Data Mining
- Converting Nominal Data to Numeric: Numeric Coding
- Converting Numeric Data to Nominal: Symbolic Coding
- Creating Ground-Truth
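Minimal sketches of the two codings named above; the category list, bin edges, and symbols are illustrative assumptions:

```python
import numpy as np

# Nominal -> numeric: one-of-N ("one-hot") coding of a symbolic feature.
categories = ["car", "truck", "bike"]
def one_hot(value: str) -> list:
    return [1 if value == c else 0 for c in categories]
print(one_hot("truck"))          # [0, 1, 0]

# Numeric -> nominal: quantize a numeric feature into symbolic bins.
bin_edges = [0.0, 18.0, 65.0]    # [0, 18) -> minor, [18, 65) -> adult, 65+ -> senior
symbols = ["minor", "adult", "senior"]
def symbolic_code(age: float) -> str:
    return symbols[int(np.digitize(age, bin_edges)) - 1]
print(symbolic_code(42.0))       # adult
```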
36Information can be Irretrievably Distributed
(e.g., the parity-N problem)
- 0010100110 → 1
- The best feature set is not necessarily the set
of best features.
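A short demonstration of why parity is a worst case for feature selection: over the full input space, no individual bit predicts the parity label better than chance, so the information truly lives in the combination (illustrative code, N = 4):

```python
from itertools import product

N = 4
patterns = list(product([0, 1], repeat=N))
labels = [sum(bits) % 2 for bits in patterns]      # parity of each bit pattern

for i in range(N):
    # How often does bit i alone agree with the parity label?
    agree = sum(1 for bits, y in zip(patterns, labels) if bits[i] == y)
    print(f"bit {i} matches the label on {agree}/{len(patterns)} patterns")  # always half
```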
37An example of a Feature Metric
- Salience: geometric mean of class precisions
- an objective measure of the ability of a feature to distinguish classes
- takes class proportion into account
- specific to a particular classifier and problem
- does not measure independence
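A sketch of how such a salience score could be computed from a classifier's confusion matrix: build a classifier using the candidate feature, then take the geometric mean of the per-class precisions. The 2x2 matrix below is illustrative, not from the slides:

```python
import numpy as np

# Rows = actual class, columns = predicted class (illustrative counts).
confusion = np.array([[45,  5],
                      [10, 40]])

def salience(conf: np.ndarray) -> float:
    precisions = np.diag(conf) / conf.sum(axis=0)   # per-class precision
    return float(np.prod(precisions) ** (1.0 / len(precisions)))

print(round(salience(confusion), 3))   # geometric mean of the class precisions
```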
38Nominal to Numeric Coding... one step at a time!
(Figure: Original Data, Step 1, Step 2)
39Numeric to Nominal Quantization
40Clusters Usually Mean Something
41How many objects are shown here? One, seen from various perspectives! This illustrates the danger of using ONE METHOD/TOOL/VISUALIZATION!
42Autoclustering
- Automatically find spatial patterns in complex data
- find patterns in data
- measure the complexity of the data
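The slide does not name an algorithm; k-means is one common autoclustering choice, sketched here on synthetic data with two latent subpopulations (all values illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
               rng.normal(loc=(3.0, 3.0), scale=0.5, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the discovered spatial pattern (cluster centroids)
print(km.inertia_)           # within-cluster scatter: one crude complexity measure
```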
43Differential Analysis
- Discover the Difference Drivers Between Groups
- Which combination of features accounts for the observed differences between groups?
- Focus research
44Sensitivity Analysis
- Measure the Influence of Individual Features on Outcomes
- Rank order features by salience and independence
- Estimate problem difficulty
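One way (not the only way) to measure feature influence as described above is permutation importance: shuffle one feature at a time and see how much the model degrades. The data and model below are illustrative; only feature 0 actually drives the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Rank-order the features by their influence on the outcome.
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.3f}")
```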
45Rule Induction
- Automatically find semantic patterns in complex data
- discover rules directly from data
- organize raw data into actionable knowledge
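The slides' own example uses data splits; as a hedged alternative, a shallow decision tree also induces readable IF/THEN rules directly from data (here the stock Iris dataset, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The fitted splits read directly as rules over the features.
print(export_text(tree, feature_names=list(iris.feature_names)))
```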
46A Rule Induction Example (using data splits)
47Rule Induction Example (Data Splits)
53. 4. Modeling
- What question are we answering?
- How deeply buried in the data is the answer?
- How must the answer be presented to the user?
55What question are we answering?
- Ground truth type
- Nominal
- Numeric
- Complex (e.g., interval estimate, plan, concept)
- Ground truth data quality
- Low/high SNR
- Few/many gaps
- Easy/hard to access
- Objective/subjective
- Ground truth predictability
- Correlation with features
- Population balance
- Class collisions
56How deeply buried in the data is the answer?
- Solvable by a 1-layer Multi-Layer Perceptron (easy)
- Linearly separable: any two classes can be separated by a hyperplane
- Solvable by a 2-layer Multi-Layer Perceptron (moderate)
- Convex hulls of classes overlap, but classes do not
- Solvable by a 3-layer Multi-Layer Perceptron (hard)
- Classes overlap but do not collide
- Intractable
- Data contain class collisions
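A small illustration of the easy/moderate boundary above: an XOR-like arrangement of clusters is not linearly separable, so a model with no hidden layer struggles while one hidden layer suffices. The data, layer size, and scores are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
centers = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
labels = np.array([0, 0, 1, 1])                      # XOR arrangement of clusters
X = np.vstack([c + rng.normal(scale=0.1, size=(100, 2)) for c in centers])
y = np.repeat(labels, 100)

linear = Perceptron(random_state=0).fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)

print("no hidden layer :", linear.score(X, y))   # typically near chance on this data
print("one hidden layer:", mlp.score(X, y))      # typically near 1.0 on this data
```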
57How must the answer be presented to the user?
- Forensics
- GUI, confidence factors, intervals, justification
- Integration
- Web-based, Web-enabled, DLL/SL, fully integrated
- Accuracy
- % correct, confusion matrix, lift chart
- Performance
- Throughput, ease of use, accuracy, reliability
59Text Book Neural Network
67Knowledge Acquisition
- What the Expert says
- KE: ...and, primates. What evidence makes you CERTAIN an animal is a primate?
- EX: Yeah, well, like... if it's a land animal that'll eat anything... but it bears live young and walks upright,...
- KE: Any obvious physical characteristics?
- EX: Uh... yes... and no feathers, of course, or wings, or any of that... Well, then... then, it's gotta be a primate... yeah.
- KE: So, ANY animal which is a land-dwelling, omnivorous, skin-covered, unwinged, featherless biped which bears live young is NECESSARILY a primate?
- EX: Yep.
- KE: Could such an animal be, say, a fish?
- EX: No... it couldn't be anything but a primate.
- EX No...it couldnt be anything but a primate.
68What the KE hears
- IF
- (f1,f2,f3,f4,f5) (land, omni, feathers,
wingless biped, born alive) - THEN
- PRIMATE and (not fish, not domestic, not bug, not
germ, not bird)
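The induced rule, written out as a small Python sketch; the argument names and boolean encoding are illustrative, and the rule is exactly as narrow as the expert stated it:

```python
def classify_animal(land: bool, omnivore: bool, feathers: bool,
                    wings: bool, biped: bool, live_young: bool) -> str:
    # The KE's rule: the full antecedent must hold before the conclusion fires.
    if land and omnivore and not feathers and not wings and biped and live_young:
        return "primate"      # and therefore not fish, bird, bug, germ, etc.
    return "unknown"          # the rule says nothing outside its antecedent

print(classify_animal(land=True, omnivore=True, feathers=False,
                      wings=False, biped=True, live_young=True))   # primate
```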
69Evaluation
- How must the answer be presented to the user?
70Model Evaluation
- Accuracy
- Classification accuracy, geometric accuracy
- precision/recall
- RMS
- Lift curve
- Confusion matrices
- ROI
- Speed, space, utility, other
71Classification Errors
- Type I - Accepting an item as a member of a class when it actually is not: a false positive.
- Type II - Rejecting an item as a member of a class when it actually is a member: a false negative.
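The same two error types, located in a 2x2 confusion matrix (labels: 1 = member of the class, 0 = not a member; the data are illustrative):

```python
import numpy as np

actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
predicted = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 1])

type_1 = int(np.sum((predicted == 1) & (actual == 0)))   # false positives
type_2 = int(np.sum((predicted == 0) & (actual == 1)))   # false negatives
tp     = int(np.sum((predicted == 1) & (actual == 1)))

print("Type I  (false positives):", type_1)
print("Type II (false negatives):", type_2)
print("precision:", tp / (tp + type_1))
print("recall   :", tp / (tp + type_2))
```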
72Model Maintenance
- Retraining, stationarity
- Generalization (e.g. heteroscedasticity)
- Changing the feature set (add/subtract)
- Conventional maintenance issues
73What do we give the user besides an application?
- Documentation
- Support
- Model retraining
- New model generation
74Using a Paradigm Taxonomy to Select a DM Algorithm
- Place paradigms into a taxonomy by specifying their attributes. This taxonomy can be used for algorithm selection.
- First, an example taxonomy.
75KBES (knowledge-based Expert System)
- required intuition: high
- vector count supported: high
- feature count supported: medium
- class count supported: medium
- cost to develop: high
- schedule to develop: high
- talent to develop: medium, high
- tools to develop: can be expensive to buy/make
- feature types supported: nominal/numeric/complex
- feature mix supported: homogeneous, heterogeneous
- feature data quality needed: need not fill "gaps"
- ground truth types supported: nominal, complex
- relative representational power: low
- relative performance: fast, intuitive, robust
- relative weaknesses: ad hoc; relatively simple class boundaries
- relative strengths: intuitive; easy to provide conclusion justification
76MLP (Multi-Layer Perceptron)
- required intuition: low
- vector count supported: high
- feature count supported: medium
- class count supported: medium
- cost to develop: low
- schedule to develop: medium
- talent to develop: medium
- tools to develop: easy to obtain inexpensively
- feature types supported: numeric
- feature mix supported: homogeneous
- feature data quality needed: must fill "gaps"
- ground truth types supported: nominal, numeric
- relative representational power: high
- relative performance: moderately fast
- relative weaknesses: inscrutable; uncontrolled regression
- relative strengths: easy to build
77RBF (Radial Basis Function)
- required intuition: low
- vector count supported: high
- feature count supported: medium
- class count supported: high
- cost to develop: low
- schedule to develop: medium
- talent to develop: medium
- tools to develop: easy to obtain inexpensively
- feature types supported: numeric
- feature mix supported: homogeneous
- feature data quality needed: need not fill "gaps"
- ground truth types supported: nominal, numeric
- relative representational power: high
- relative performance: moderately fast
- relative weaknesses: inscrutable; models tend to be large
- relative strengths: uncontrolled regression can be mitigated
78SVM (Support Vector Machines)
- required intuition: low
- vector count supported: high
- feature count supported: high
- class count supported: two
- cost to develop: medium
- schedule to develop: medium
- talent to develop: medium
- tools to develop: easy to obtain inexpensively
- feature types supported: numeric
- feature mix supported: homogeneous
- feature data quality needed: must fill "gaps"
- ground truth types supported: nominal, numeric
- relative representational power: high
- relative performance: moderately fast
- relative weaknesses: inscrutable; can be hard to train
- relative strengths: minimal need to enhance features
79Decision Trees (e.g., CART, BBNs)
- required intuition: low
- vector count supported: high
- feature count supported: medium
- class count supported: high
- cost to develop: low
- schedule to develop: medium
- talent to develop: medium
- tools to develop: easy to obtain inexpensively
- feature types supported: nominal, numeric
- feature mix supported: homogeneous, heterogeneous
- feature data quality needed: need not fill "gaps"
- ground truth types supported: nominal, numeric
- relative representational power: high
- relative performance: moderately fast
- relative weaknesses: many "low support" nodes or rules
- relative strengths: can provide insight into the domain
80The taxonomy can be used to match available
paradigms with the characteristics of the data
mining problem to be addressed
81IF
- the ground truth is discrete
- there aren't too many classes
- the class boundaries are simple
- the number of features is medium
- the data are heterogeneous
- no comprehensive, representative data set with GT
- the population is unbalanced by class
- the domain is well-understood by available experts
- conclusion justification is needed
THEN KBES
82ELSE IF
- the ground truth is numeric
- there is a medium number of classes
- the class boundaries are complex
- the number of features is medium
- the data are numeric
- comprehensive, representative data set tagged with GT
- the population is relatively balanced by class
- the domain is not well-understood by available experts
- conclusion justification is not needed
THEN MLP
83ELSE IF
- the ground truth is numeric or nominal
- there is a large number of classes
- the class boundaries are very complex
- the number of features is medium
- the data are numeric
- representative data set tagged with GT
- the population is unbalanced by class
- the domain is not well-understood by available experts
- conclusion justification is not needed
THEN RBF
84ELSE IF
- the ground truth is numeric or nominal
- the number of classes is two
- the class boundaries are very complex
- the number of features is very large
- the data are numeric
- comprehensive, representative data set tagged with GT
- the population is unbalanced by class
- the domain is not well-understood by available experts
- conclusion justification is not needed
THEN SVM
85ELSE IF
- the ground truth is numeric or nominal
- there is a medium number of classes
- the class boundaries are very complex
- the number of features is medium
- the data are numeric, nominal, or complex
- representative data set tagged with GT
- the population is unbalanced by class
- the domain is not well-understood by available experts
- conclusion justification is needed
THEN Decision Tree (CART, BBN, etc.)
END IF
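A runnable transcription of the selection chain above. The dictionary keys and string values are illustrative, and each branch keeps only a few of the distinguishing conditions so the sketch stays short; the complete conditions are the ones listed on slides 81-85:

```python
def select_paradigm(p: dict) -> str:
    gt   = p["ground_truth"]           # "discrete", "nominal", or "numeric"
    k    = p["num_classes"]            # "two", "few", "medium", or "large"
    sep  = p["class_boundaries"]       # "simple", "complex", or "very complex"
    just = p["justification_needed"]   # True / False

    if gt == "discrete" and sep == "simple" and just:
        return "KBES"
    elif gt == "numeric" and k == "medium" and sep == "complex" and not just:
        return "MLP"
    elif k == "large" and sep == "very complex" and not just:
        return "RBF"
    elif k == "two" and sep == "very complex" and not just:
        return "SVM"
    elif sep == "very complex" and just:
        return "Decision Tree (CART, BBN, etc.)"
    return "no clear match: revisit the taxonomy"

print(select_paradigm({"ground_truth": "numeric", "num_classes": "two",
                       "class_boundaries": "very complex",
                       "justification_needed": False}))   # SVM
```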
86Common Reasons Data Mining Projects Fail
87Mistakes can occur in each major element of data
mining practice!
- 1. Specification of Enterprise Objectives
- Defining success
- 2. Creation of the DM Environment
- Understanding and Preparing the Data
- 3. Data Mining Management
- 4a,b. Descriptive Modeling and Predictive Modeling
- Detecting and Characterizing Patterns
- Building Models
- Model Evaluation
- Model Deployment
- Model Maintenance
88. 1. Specification of Enterprise Objectives
- Define success
- Knowledge acquisition interviews (who, what, how)
- Objective measures of performance (enterprise specific)
- Assessment of enterprise process and data environment
- Specification of data mining objectives
89Specification Mistakes
- DM projects require careful management of user expectations. Choosing the wrong person as customer interface can guarantee user disappointment.
- (GIGOO: Garbage In, GOLD Out!)
- Since the default assessment of R&D-type efforts is "failure", not defining success unambiguously will guarantee failure.
90. 2. Creation of the DM Environment
- Data Warehouse/Data Mart /Database
- Meta data and schemas
- Data dependencies
- Access paths and mechanisms
91Environmental Mistakes
- Big data require bigger storage. DM efforts typically work against multiple copies of the data (try 2 or 3x).
- Unwillingness to invest in tools forces data miners to consume resources building inferior versions of what could have been purchased more cheaply.
- Get labs and network connections set up quickly.
92Understanding the Data
- Enterprise data survey
- Data as a process artifact
- Temporal Considerations
- Data Characterization
- Metadata
- Collection paths
- Data Metrics and Quality
- currency, completeness, correctness, correlation
93A List of Common Data Problems
- Conformation (e.g., a dozen ways to say lat/lon)
- Accessibility (distributed, sensitive)
- Ground Truth (missing, incorrect)
- Outliers (detect/process)
- Gaps (imputation scheme)
- Time (coverage, periodicity, trends, Nyquist)
- Consistency (intra/inter record)
- Class collisions (how to adjudicate)
- Class population imbalance (balancing)
- Coding/quantization
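Two of the problems above, gaps and outliers, with one simple (and by no means the only) handling scheme each; the readings and the 5-MAD threshold are illustrative:

```python
import numpy as np

x = np.array([12.1, 11.8, np.nan, 12.4, 98.0, 12.0, np.nan, 11.9])

# Gaps: impute missing values with the median of the observed values.
median = np.nanmedian(x)
filled = np.where(np.isnan(x), median, x)

# Outliers: flag points far from the median in robust (MAD) units.
mad = np.median(np.abs(filled - median))
outliers = np.abs(filled - median) > 5 * 1.4826 * mad
print(filled)
print(np.where(outliers)[0])   # index of the 98.0 reading
```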
94Data Understanding Mistakes
- Assuming that no understanding of the domain is needed for a successful DM effort
- Temporal infeasibility: assuming every type of data you find in the warehouse will actually be there when your fielded system needs it
- Ignoring the data conformation problem
95Data Preparation Mistakes
- Improper handling of missing data, outliers
- Improper conditioning of data
- "Trojan Horsing" ground truth into the feature set
- Having no plan for getting operational access to data
96. 3. Data Mining Management
- Data mining skill mix (who are the DM practitioners?)
- Data mining project planning (RAD vs. waterfall)
- Data mining project management
- Sample DM project cost/schedule
- Don't forget Configuration Management!
97DM Management Mistakes
- Appointing a domain expert as the technical lead on a DM project virtually guarantees that no new ground will be covered.
- Inadequate schedule and/or budget poisons the psychological atmosphere necessary for discovery.
- Failure to parallelize work
- Allowing planless tinkering
- Letting technical people "snow" you
- Failure to conduct process audits
98Configuration Management
- Nomenclature and naming conventions
- Documenting the workflow for reproducibility
- Modeling Process Automation
99Configuration Management Mistakes
- Not having a configuration management plan (files, directories, nomenclature, audit trail) virtually guarantees that any success you have will be unreproducible.
- Allowing each data miner to establish their own documentation and auditing procedures guarantees that no one will understand what anyone else has done.
- Failure to automate configuration management (e.g., putting annotated experiment scripts in a log) guarantees that your configuration management plan will not work.
100. 4a. Descriptive Modeling
- OLAP (on-line analytical processing)
- Visualization
- Unsupervised learning
- Link/Market Basket Analysis
- Collaborative Filtering
- Rule Induction Techniques
- Logistic Regression
101. 4b. Predictive Modeling
- Paradigms
- Test Design
- Meta-Schemes
- Model Construction
- Model Evaluation
- Model Deployment
- Model Maintenance
102Paradigms
- Know what they are
- Know when to use which
- Know how to instantiate them
- Know how to validate them
- Know how to maintain them
103 Model Construction
- Architecture (monolithic, hybrid)
- Formulation of Objective Function
- Training (e.g., NN)
- Construction (e.g., KBES)
- Meta Schemes
- Bagging
- Boosting
- Post-process model calibration
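A sketch of two of the meta-schemes named above, bagging and boosting, using scikit-learn's stock implementations on synthetic data (everything here is illustrative, not a recipe from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

bagged  = BaggingClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)   # bagging
boosted = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)  # boosting

print("bagging  accuracy:", bagged.score(X_te, y_te))
print("boosting accuracy:", boosted.score(X_te, y_te))
```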
104Modeling Mistakes
- The "Silver Bullet Syndrome": relying entirely on a single tool/method
- Expecting your tools to think for you
- Overreliance on visualization
- Using tools that you don't understand
- Not knowing when to quit ("maybe this is just dirt")
- Quitting too soon ("I haven't dug deep enough")
- Picking the wrong modeling paradigm
- Ignoring population imbalance
- Overtraining
- Ignoring feature correlation
105. 5. Model Evaluation
- Blind Testing
- N-fold Cross-Validation
- Generalization and Overtraining
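A sketch of this evaluation regime: escrow a blind holdback set first, run N-fold cross-validation on the remainder, and score the holdback only once at the end. The data set, model, and fold count are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_dev, X_blind, y_dev, y_blind = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)      # 5-fold cross-validation
print("cross-validation accuracy:", cv_scores.mean())

model.fit(X_dev, y_dev)
print("blind holdback accuracy  :", model.score(X_blind, y_blind))
```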
106Model Evaluation Mistakes
- Not validating the model
- Validating the model on the training data
- Not escrowing a holdback set
107. 6. Model Deployment
- ASP (applications service provider)
- API (application program interface)
- Other
- plug-ins
- linked objects
- file interface, etc.
108Model Deployment Mistakes
- Not considering the fielded architecture
- No user training
- Not having any operational performance
requirements (except accuracy)
109. 7. Model Maintenance
- Retraining
- Poor generalization
- Heteroscedasticity
- Non-stationarity
- Overtraining
- Changing the problem architecture
- Adding/subtracting features
- Modifying ground truth
- Other
110Model Maintenance Mistakes
- Not having a mechanism, method, and criteria for tracking performance of the fielded model
- Not providing a model retraining capability
- No documentation, no support
111Published by Digital Press, 2001. ISBN 1-55558-231-1