
1
Data Mining Solutions (Discussion 2) (Westphal &
Blaxton, 1998)
  • Dr. K. Palaniappan
  • Dept. of Computer Engineering & Computer Science,
    UMC

2
Data Mining Tasks (review)
  • (1) Classification or identification -
    automatically label input records
  • (2) Estimation or regression - predict magnitude
    of response or other missing field given input
    records
  • (3) Segmentation or clustering - group the input
    records into meaningful sub-populations
  • (4) Description or visualization - looking for
    gems and diamonds among pebbles

3
Data Mining Tasks/Algorithms (handout)
  • Classification - supervised induction
  • Analyze historical data to create a model that
    can predict future behavior
  • Common tools - neural networks, decision trees,
    if-then-else rules
  • Clustering - partition database into similar
    groups or segments
  • Expert interpretation and modification of
    clusters needed
  • Association - establish relationships about items
    occurring together
  • Sequence Discovery - identification of
    associations over time
  • Visualization - understanding relationships using
    visual methods
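
As a rough illustration of the association task listed above, the sketch below counts how often pairs of items occur together and reports each pair's support (the fraction of records that contain both items). The basket data and the names `baskets` and `pair_support` are invented for illustration and do not come from the handout.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both items}."""
    counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(basket), 2)  # sorted so pairs are canonical
    )
    return {pair: n / len(baskets) for pair, n in counts.items()}

# Hypothetical transactions: each set is the items seen together in one record.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
]
support = pair_support(baskets)
for pair, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, value)
```

Pairs with high support are candidates for "items occurring together"; real association tools also compute measures such as confidence, which this sketch omits.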

4
Data Mining Tasks/Algorithms (handout)
  • Cluster analysis
  • Linkage analysis
  • Time series analysis
  • Categorization analysis
  • Visualization
  • Algorithms/Technologies
  • Neural networks
  • Decision trees
  • Time series (?)
  • Genetic algorithms
  • Hybrid approaches
  • Fuzzy logic
  • Statistics

5
Data Modeling (elaboration)
  • Data abstraction
  • Grouping, binning, categorization, histogramming
    of data useful for summarization
  • 1-D vs 2-D vs higher dimensional data
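
A minimal sketch of binning and histogramming for 1-D data, assuming equal-width bins; the sample ages and the function name `bin_values` are hypothetical.

```python
from collections import Counter

def bin_values(values, bin_width):
    """Count values per equal-width bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

ages = [23, 27, 31, 35, 38, 42, 44, 47, 51, 67]  # hypothetical sample
histogram = bin_values(ages, 10)

# Text histogram: one '#' per record in the bin.
for lower, count in histogram.items():
    print(f"{lower}-{lower + 9}: {'#' * count}")
```

The same idea extends to 2-D data by binning each record on a pair of fields and counting per (bin_x, bin_y) cell.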

6
Data Modeling (elaboration)
  • Descriptive data
  • State-based knowledge
  • Set of attributes used to describe discrete
    objects
  • Declarative information - organization
    structures, credit reports, vendor profiles
  • Transactional data
  • Episodic info about time and place of events
  • Links between object classes to represent traits
    or conditions

7
Problem Definition
  • Knowledge representation using hierarchical
    frameworks
  • Objects -> Relationships -> Networks ->
    Applications -> Systems
  • Procedural vs declarative knowledge
  • Episodic data tagged with temporal and spatial
    information (sequence, knowing how)
  • Semantic data more commonly analyzed (factual,
    knowing that)
  • Metaknowledge

8
Data Preparation & Analysis
  • Define data mining goals
  • Planning questions
  • Ready access to all data sources
  • Data format
  • Integration of data from multiple sources and
    data bases
  • Visual or nonvisual analytical methods
  • Visualization for display
  • Important patterns

9
Sample Datasets
  • http://www.kdnuggets.com/datasets.html
    (Fedstats, Statlog, UC Irvine, KDD)
  • ftp://208.144.240.175/kddcup
  • ftp://www.epsilon.com/kddcup (fund raising
    mailing response dataset)

10
Accessing and Preparing Data
  • Capitalization, concatenation, representation
    format, augmentation, abstraction, unit
    conversion, exclusion
  • Limiting scope - select appropriate dimensions to
    extract
  • Structuring extractions - number of records and
    time
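
The preparation steps named on this slide (capitalization, concatenation, unit conversion, exclusion) might look like the following on a single record. The field names `first`, `last`, and `weight_lb` are assumptions made for the example.

```python
LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms

def prepare_record(raw):
    """Return a cleaned copy of one record, or None if it should be excluded."""
    if raw.get("weight_lb") is None:
        return None  # exclusion: drop records missing a required field
    return {
        # capitalization + concatenation into one canonical name field
        "name": f"{raw['last'].strip().upper()}, {raw['first'].strip().upper()}",
        # unit conversion: pounds to kilograms
        "weight_kg": round(raw["weight_lb"] * LB_TO_KG, 1),
    }

raw_records = [
    {"first": "ann ", "last": "Lee", "weight_lb": 150},
    {"first": "Bob", "last": "smith", "weight_lb": None},
]
prepared = [r for r in map(prepare_record, raw_records) if r is not None]
print(prepared)
```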

11
Accessing and Preparing Data
  • Extraction using data sampling vs report
    generation (examining the entire dataset)
  • Maintaining consistency and integrity - keep
    track of processing history, data keys, query
    generation code
  • Data sources and preprocessing - databases (SAP,
    Oracle, PeopleSoft, Access, FoxPro, Lotus Notes,
    etc.), word processors, spreadsheets
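
One way to combine sampling-based extraction with the advice to keep track of processing history is to record the sampling parameters next to the sample, so the same extract can be regenerated later. This is a sketch only; `sample_extract` and the layout of the history entry are hypothetical.

```python
import random

def sample_extract(records, k, seed):
    """Draw k records at random and return them with a processing-history entry."""
    rng = random.Random(seed)  # fixed seed so the extraction can be repeated
    sample = rng.sample(records, k)
    history = {
        "method": "random sample",
        "k": k,
        "seed": seed,
        "source_size": len(records),
    }
    return sample, history

records = list(range(1000))  # stand-in for record keys in a hypothetical table
sample, history = sample_extract(records, k=50, seed=1998)
repeat, _ = sample_extract(records, k=50, seed=1998)
print(history, sample == repeat)
```

Report generation, by contrast, would iterate over all of `records` rather than a subset.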

12
Accessing and Preparing Data
  • Data integration
  • Multisource
  • Multiformat
  • Multiplatform
  • Multisecurity
  • Multimedia
  • Multiaccess
  • Converting data
  • Long and short data structures

13
Accessing and Preparing Data
  • Data cleanup
  • Up to 80% of time in data mining process
  • Errors - data entry (mistyping, incomplete
    screens), missing data, incompatible formats,
    tampering/improper coding
  • Disambiguation
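
Two of the error types above, missing data and incompatible formats, can be checked mechanically before any mining starts. The sketch below assumes a small set of date formats and a `date` field; both are illustrative choices, not part of the slides.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")  # assumed input formats

def parse_date(text):
    """Return the date as an ISO string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def audit(records, required):
    """List (record index, field, problem) for missing or unparseable values."""
    issues = []
    for i, rec in enumerate(records):
        for field in required:
            if rec.get(field) in (None, ""):
                issues.append((i, field, "missing"))
        if rec.get("date") and parse_date(rec["date"]) is None:
            issues.append((i, "date", "unparseable"))
    return issues

records = [
    {"id": 1, "date": "1998-03-15"},
    {"id": 2, "date": ""},
    {"id": 3, "date": "yesterday"},
]
issues = audit(records, required=("id", "date"))
print(issues)
```

Note that a value such as 03/04/1998 is genuinely ambiguous between formats; the order of `DATE_FORMATS` silently decides it, which is exactly the kind of case that needs explicit disambiguation.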

14
Visual Methods for Analyzing Data
  • Discover overall trends
  • Discover smaller hidden patterns
  • Make unbiased observations/descriptions about
    data
  • Cognitive limitations
  • Short-term memory - attentional limitation
    (absorbing multiple pages of tabular or
    text-based output)
  • Long-term memory - reliance on associations not
    being presented

15
Cognitive Strengths
  • Linkage analysis - e.g. telephone calling
    patterns
  • Scheme-based visualization
  • Positioning algorithms - reveal object
    clustering, hierarchical relationships,
    organizational networks, geopositional or
    landscape displays

16
Cognitive Strengths
  • Manipulating display characteristics of objects
    or records
  • Source data -> Data object -> Object attributes
    -> Visualization
  • Attributes - color, shape, size, x-pos, y-pos,
    elevation, intensity, alignment, label, image,
    orientation, link
  • Coding attribute information - up to 20 or more
    dimensions can be displayed
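
The pipeline from source data to object attributes to visualization can be sketched as a function that maps record fields onto display attributes. The record fields (`name`, `longitude`, `latitude`, `amount`, `flagged`) and the specific encodings are assumptions chosen for the example.

```python
def to_glyph(record, max_amount):
    """Map one data object's attributes onto display characteristics."""
    return {
        "label": record["name"],
        "x": record["longitude"],                         # x-pos
        "y": record["latitude"],                          # y-pos
        "size": 5 + 20 * record["amount"] / max_amount,   # size encodes amount
        "color": "red" if record["flagged"] else "gray",  # color encodes a flag
    }

records = [
    {"name": "acct-1", "longitude": -92.3, "latitude": 38.9,
     "amount": 500, "flagged": False},
    {"name": "acct-2", "longitude": -90.2, "latitude": 38.6,
     "amount": 1000, "flagged": True},
]
max_amount = max(r["amount"] for r in records)
glyphs = [to_glyph(r, max_amount) for r in records]
print(glyphs)
```

Each additional display attribute (shape, elevation, orientation, and so on) carries one more data dimension in the same way.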

17
Analyzing Structural Features
  • Out-of-bounds values - e.g. landscape display or
    scatter diagram of trauma patients
  • Missing data - e.g. clustering of cellular
    communications data
  • Anomalous data - e.g. unusual airline flights
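
Out-of-bounds values can also be flagged without a display by checking each field against a plausible range. The fields and ranges below are illustrative assumptions for a patient-style dataset, not values from the slides.

```python
# Assumed plausible ranges per field; adjust to the dataset at hand.
BOUNDS = {"age": (0, 120), "systolic_bp": (40, 300)}

def out_of_bounds(records, bounds=BOUNDS):
    """Return (record index, field, value) for every value outside its range."""
    flagged = []
    for i, rec in enumerate(records):
        for field, (low, high) in bounds.items():
            value = rec.get(field)
            if value is not None and not low <= value <= high:
                flagged.append((i, field, value))
    return flagged

patients = [
    {"age": 34, "systolic_bp": 120},
    {"age": 340, "systolic_bp": 110},  # likely a data-entry error
    {"age": 51, "systolic_bp": 20},
]
flagged = out_of_bounds(patients)
print(flagged)
```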

18
Analyzing Network Structures
  • Object - link - network
  • Interconnectivity
  • Articulation points - data objects that connect
    two or more subnetworks, e.g. detecting
    bottlenecks
  • Identification of subnetworks or discrete
    networks
  • Missing connections - entities detached from main
    network
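
The ideas of subnetworks, detached entities, and articulation points can be made concrete on a small undirected graph. The sketch below uses a brute-force test (remove each object in turn and see whether the number of subnetworks grows) rather than the faster depth-first-search algorithm usually used for large networks; the graph itself is hypothetical.

```python
def components(graph, removed=None):
    """Return the connected components of an undirected graph, ignoring `removed`."""
    seen, result = set(), []
    for start in graph:
        if start == removed or start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(n for n in graph[node] if n != removed and n not in comp)
        seen |= comp
        result.append(comp)
    return result

def articulation_points(graph):
    """Objects whose removal splits the network into more subnetworks."""
    base = len(components(graph))
    return {n for n in graph if len(components(graph, removed=n)) > base}

# Hypothetical network: A-B-C, with D and E attached to C; F has no links.
graph = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D", "E"},
    "D": {"C"}, "E": {"C"}, "F": set(),
}
subnetworks = components(graph)
detached = [c for c in subnetworks if len(c) == 1]
cut_points = articulation_points(graph)
print(subnetworks, detached, cut_points)
```

Here B and C are the articulation points (potential bottlenecks), and F is an entity detached from the main network.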

19
Analyzing Network Structures
  • Strong/Weak linkages - strength of relationships
    within the network
  • Fan-Out frequency - degree of connectivity, good
    indicator of unusual behavior
  • Pathway analysis - connectivity of objects across
    a series of linkages
  • Commonality linkages - e.g. fraud detection,
    reducing marketing costs, minimizing
    transportation and delivery costs
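
Link strength, fan-out, and pathway analysis can each be computed from a plain list of links. The sketch below treats links as undirected and measures strength as the number of repeated links between the same pair; both are simplifying assumptions, and the call records are invented.

```python
from collections import Counter, deque

def neighbors(links):
    """Build an undirected adjacency map from (a, b) link pairs."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def link_strength(links):
    """Count repeated links per unordered pair (strong vs weak linkages)."""
    return Counter(tuple(sorted(pair)) for pair in links)

def fan_out(links):
    """Number of distinct objects each object is linked to."""
    return {node: len(adj) for node, adj in neighbors(links).items()}

def shortest_path(links, src, dst):
    """Breadth-first search for a shortest chain of links from src to dst."""
    adj = neighbors(links)
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # no pathway connects the two objects

# Hypothetical call records (caller, callee).
calls = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("A", "B")]
strength = link_strength(calls)
degree = fan_out(calls)
path = shortest_path(calls, "B", "E")
print(strength.most_common(1), degree, path)
```

An object whose fan-out is far above the rest (A in this example) is the kind of case the slide describes as an indicator of unusual behavior.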

20
Analyzing Network Structures
  • Emergent patterns of connectivity
  • Groups, liaisons, attached isolates, detached
    isolates

21
Analyzing Temporal Patterns
  • Trend
  • Cycle
  • Seasonal
  • Irregular
  • Absolute time cycle of events
  • Contiguous time cycle event
  • Visualizing temporal patterns
  • Anacapa presentation methods
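
The trend, seasonal, and irregular components listed above can be separated with a simple additive decomposition: estimate the trend with a centered moving average, average the detrended values by position in the cycle, and treat what is left as irregular. This is a minimal sketch that assumes an odd window equal to the period; the series is synthetic.

```python
def centered_moving_average(series, window):
    """Trend estimate; window must be odd. Points without a full window are None."""
    half = window // 2
    out = [None] * len(series)
    for i in range(half, len(series) - half):
        out[i] = sum(series[i - half:i + half + 1]) / window
    return out

def seasonal_means(series, trend, period):
    """Average detrended value for each position within the cycle."""
    sums, counts = [0.0] * period, [0] * period
    for i, (y, t) in enumerate(zip(series, trend)):
        if t is not None:
            sums[i % period] += y - t
            counts[i % period] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Synthetic series: linear trend 10 + i plus a repeating pattern (+2, 0, -2).
series = [12, 11, 10, 15, 14, 13, 18, 17, 16]
period = 3
trend = centered_moving_average(series, window=period)
seasonal = seasonal_means(series, trend, period)
irregular = [
    None if t is None else y - t - seasonal[i % period]
    for i, (y, t) in enumerate(zip(series, trend))
]
print(trend, seasonal, irregular)
```

On real data the irregular component will not be zero, and large residuals mark the events worth inspecting.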