1
Database Technologies for E-Commerce
  • Rakesh Agrawal
  • IBM Almaden Research Center

2
Outline
  • Overview of the Pangea B2B marketplace
  • Technology dives
  • Catalog integration
  • Searching specification documents
  • Interactive parametric search
  • Storage and retrieval of e-commerce data
  • Research opportunities

3
Pangea Architecture
4
Product Selection
5
Brand and Loyalty Creation
6
Features (1)
  • Eureka
  • Interactive parametric search
  • Searches based on example products
  • Similarity search for product substitution
  • Dynamic Content generation/presentation (on the
    fly from database)
  • Catalog Hierarchy pages with product counts
  • Side-by-Side Product Comparison
  • Product Listings/Details
  • Shopper groups through preference layer

7
Features (2)
  • Server side Searching
  • Functional description search using
    classification
  • Keyword & Part Number search
  • Restriction of the search to sub-trees of the
    category hierarchy (pushed down to the database)
  • Real time Price and Availability Engine
  • Ability to crawl, interrogate, and extract price/availability data
    from various suppliers/distributors in real time and present it in
    a side-by-side comparison format
  • Easily add new distributors/suppliers
  • Completely client-side implementation to prevent blocking

8
Features (3)
  • Design Worksheet (Generalized shopping cart)
  • List of items; an item could be a specific part, an alternative
    set of parts, specifications, or other design worksheets (nesting)
  • Completely integrated with all relevant
    components (search, content, price, etc.)
  • Aggregate constraints (e.g. total price)
  • Multiple projects saved on server
  • Share projects with other (authorized) users
  • Mining for suggestions

9
Features (4)
  • Design data warehouse creation and maintenance
  • Crawling technology for extracting part information from the
    websites of various manufacturers/distributors/suppliers
  • Data extraction from Manufacturer Spec Sheets
    (pdf files)
  • Classification technology to build, merge, manage
    and populate category hierarchies.

10
Features (5)
  • Soft content creation and maintenance
  • Crawling to acquire articles/news/postings from
    various web news sources, usenet newsgroups, etc.
  • Agents to cluster, classify, extract concepts,
    identify associations.
  • Personalized news/data presentation (based on
    user defined profile, channels, etc.)
  • Complete interleaving and integration with part
    content.
  • Automatic generation of summary pages for
    manufacturers (e.g. related news articles, URLs ,
    product categories).

11
Prototype
  • Data for nearly 2 million parts, in 2000
    categories
  • Focused crawling of more than 175 manufacturers
    for datasheets/ application notes/manuals
    (160,000)
  • 1/4 Terabyte database

12
Catalog Integration
  • R. Agrawal, R. Srikant WWW-10

13
Catalog Integration Problem
  • Integrate products from new catalog into master
    catalog.

14
The Problem (cont.)
  • After integration

15
Desired Solution
  • Automatically integrate products
  • Little or no effort on the part of the user.
  • Domain independent.
  • Problem size
  • Millions of products
  • Thousands of categories

16
Model
  • Product descriptions consist of words
  • Products live in the leaf-level categories

17
Basic Algorithm
  • Build classification model using product
    descriptions in master catalog.
  • Use classification model to predict categories
    for products in the new catalog.

18
National Semiconductor Files
19
National Semiconductor Files with Categories
20
Accuracy on Pangea Data
  • B2B Portal for electronic components
  • 1200 categories, 40K training documents.
  • 500 categories with < 5 documents.
  • Accuracy
  • 72% for top choice.
  • 99.7% for top 5 choices.

21
Enhanced Algorithm Intuition
  • Use affinity information in the catalog to be
    integrated (new catalog)
  • Products in same category are similar.
  • Bias the classifier to incorporate this
    information.
  • Accuracy boost depends on quality of new catalog
  • Use tuning set to determine amount of bias.

22
Algorithm
  • Extension of the Naive-Bayes classification to
    incorporate affinity information

23
Naive Bayes Classifier
  • Pr(Ci|d) = Pr(Ci) Pr(d|Ci) / Pr(d)   // Bayes rule
  • Pr(d): same for all categories (ignore)
  • Pr(Ci) = |docs in Ci| / |total docs|
  • Pr(d|Ci) = ∏w∈d Pr(w|Ci)
  • Words occur independently (unigram model)
  • Pr(w|Ci) = (n(Ci,w) + λ) / (n(Ci) + λ|V|)
  • Maximum likelihood estimate smoothed with Lidstone's law of
    succession
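
A minimal sketch of this classifier in Python (illustrative only; the
data layout and the Lidstone parameter lam are assumptions, not from
the talk):

    import math
    from collections import defaultdict

    def train(docs_by_category, lam=0.5):
        # docs_by_category: {category: [word list per document]}
        vocab, word_counts, total_words, priors = set(), {}, {}, {}
        total_docs = sum(len(d) for d in docs_by_category.values())
        for c, docs in docs_by_category.items():
            counts = defaultdict(int)                   # n(Ci, w)
            for doc in docs:
                for w in doc:
                    counts[w] += 1
                    vocab.add(w)
            word_counts[c] = counts
            total_words[c] = sum(counts.values())       # n(Ci)
            priors[c] = len(docs) / total_docs          # Pr(Ci)
        return vocab, word_counts, total_words, priors, lam

    def classify(doc, model):
        vocab, word_counts, total_words, priors, lam = model
        best, best_score = None, float("-inf")
        for c in priors:
            # log Pr(Ci) + sum of log Pr(w|Ci), Lidstone-smoothed
            score = math.log(priors[c])
            for w in doc:
                score += math.log((word_counts[c][w] + lam) /
                                  (total_words[c] + lam * len(vocab)))
            if score > best_score:
                best, best_score = c, score
        return best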

24
Enhanced Algorithm
  • Pr(Ci|d,S)   // d came from category S of the new catalog
  • = Pr(Ci,d,S) / Pr(d,S)   (since Pr(Ci,d,S) = Pr(d,S) Pr(Ci|d,S))
  • = Pr(Ci) Pr(S,d|Ci) / Pr(d,S)
  • = Pr(Ci) Pr(S|Ci) Pr(d|Ci) / Pr(S,d)   (assuming d, S independent
    given Ci)
  • = Pr(S) Pr(Ci|S) Pr(d|Ci) / Pr(S,d)   (since Pr(S|Ci) Pr(Ci) =
    Pr(Ci|S) Pr(S))
  • = Pr(Ci|S) Pr(d|Ci) / Pr(d|S)   (since Pr(S,d) = Pr(S) Pr(d|S))
  • Same as NB except Pr(Ci|S) instead of Pr(Ci)
  • Ignore Pr(d|S) as it is the same for all classes

25
Computing Pr(Ci|S)
  • Pr(Ci|S) = |Ci| · (# docs in S predicted to be in Ci)^w /
    Σj=1..n |Cj| · (# docs in S predicted to be in Cj)^w
  • |Ci| = number of docs in Ci in the master catalog
  • w determines weight of the new catalog
  • Use a tune set of documents in the new catalog
    for which the correct categorization in the
    master catalog is known
  • Choose one weight for the entire new catalog or
    different weights for different sections
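
A small sketch of this estimate in Python (names are illustrative;
preds_in_S would come from running the basic classifier over the docs
in new-catalog category S):

    from collections import Counter

    def category_given_section(preds_in_S, master_sizes, w):
        # master_sizes: {Ci: # docs in Ci in the master catalog}
        counts = Counter(preds_in_S)   # docs in S predicted to be in Ci
        scores = {c: master_sizes[c] * (counts[c] ** w)
                  for c in master_sizes}
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}   # Pr(Ci|S)

With w = 0 this reduces to the master-catalog priors (0 ** 0 is 1 in
Python), i.e. the basic classifier; larger w places more trust in the
new catalog's groupings.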

26
Superiority of the Enhanced Algorithm
  • Theorem: The highest possible accuracy achievable with the
    enhanced algorithm is no worse than what can be achieved with the
    basic algorithm.
  • Catch: The optimum value of the weight for which enhanced achieves
    the highest accuracy is data dependent.
  • The tune set method attempts to select a good
    value for weight, but there is no guarantee of
    success.

27
Empirical Evaluation
  • Start with a real catalog M
  • Remove n products from M to form the new catalog
    N
  • In the new catalog N
  • Assign f·n products to the same category as in M
  • Assign the rest to other categories as per some distribution (but
    remember their true category)
  • Accuracy = fraction of products in N assigned to their true
    categories

28
Improvement in Accuracy (Pangea)
29
Improvement in Accuracy (Reuters)
30
Improvement in Accuracy (Google.Outdoors)
31
Tune Set Size (Pangea)
32
Empirical Results
33
Summary
  • Classification accuracy can be improved by
    factoring in the affinity information implicit in
    the data to be categorized.
  • How to apply these ideas to other types of
    classifiers?

34
Searching Specification Documents
  • R. Agrawal, R. Srikant. WWW-2002

35
Specification Documents
  • Consist of <attribute name, value> pairs, embedded in text
  • Examples
  • Data sheets for electronic parts
  • Classifieds

36
Sources of Problems
  • Synonyms for attribute names and units.
  • "lb" and "pounds", but no "lbs" or "pound".
  • Attribute names are often missing.
  • No "Speed", just "MHz Pentium III"
  • No "Memory", just "MB SDRAM"
  • Accurate data extraction is hard, e.g. partial
    datasheet for Cypress CY7C225A PROM

37
An end run!
  • Use a simple regular expression extractor to get
    numbers
  • Do simple data extraction to get hints, e.g.
  • Hint for unit: the word following the number.
  • Hint for attribute name: the k words following the number.
  • Use only numbers in the queries
  • Treat any attribute name in the query also as a hint
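
A sketch of such an extractor (a deliberately simple pattern; the
regex and k are illustrative, not the paper's):

    import re

    NUM = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z%]+)?")

    def extract(text, k=3):
        # returns (number, unit hint, attribute-name hints) per match
        hits = []
        for m in NUM.finditer(text):
            unit = m.group(2)                   # word after the number
            tail = text[m.end():].split()[:k]   # k following words
            hits.append((float(m.group(1)), unit, tail))
        return hits

    print(extract("Access time 20 ns, power 500 mW"))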

38
Documents and Queries
39
Can it work?
  • Yes!!!!!
  • Provided the data is non-reflective
  • Reflectivity can be computed a priori for a given data set and
    provides an estimate of the expected accuracy.

40
Reflectivity
41
Non-reflectivity in real life
  • Non-overlapping attributes
  • Memory: 64 - 512 MB, Disk: 10 - 40 GB
  • Correlations
  • Memory: 64 - 512 MB with Disk: 10 - 100 GB is still fine.
  • Clusters

42
Basic Idea
  • Consider co-ordinates of a point
  • If there is no data point at the permutations of its co-ordinates,
    this point is non-reflective
  • Only a few data points at the permutations of its co-ordinates ⇒
    the point is largely non-reflective
  • Compute reflectivity as this ratio summed over all the points
  • Consider the neighborhood of a point in the above calculation

43
Reflectivity
  • D: set of m-dimensional points
  • n: coordinates of point x
  • h(n): number of points within distance r of n
  • Reflections(x): permutations of n
  • q(n): number of points in D that have at least one reflection
    within distance r of n
  • Reflectivity(m,r) = 1 - (1/|D|) Σx∈D h(n)/q(n)
  • Non-reflectivity = 1 - Reflectivity
  • See the paper for reflectivity of D over k-dimensional subspaces
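
A brute-force sketch of this definition in Python (for illustration
only: an L-infinity neighborhood is assumed, and enumerating all
permutations is exponential in m):

    from itertools import permutations

    def dist(a, b):
        return max(abs(x - y) for x, y in zip(a, b))   # L-infinity

    def reflectivity(D, r):
        # D: list of m-dimensional tuples
        total = 0.0
        for x in D:
            h = sum(1 for y in D if dist(x, y) <= r)
            refl = set(permutations(x))         # reflections of x
            q = sum(1 for y in D
                    if any(dist(p, y) <= r for p in refl))
            total += h / q      # q >= h >= 1, so this is safe
        return 1.0 - total / len(D)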

44
Algorithms
  • How to compute match score (rank) of a document
    for a given query?
  • How to limit the number of documents for which
    the match score is computed?

45
Match Score of a Document
  • Select k numbers from D yielding the minimum distance between Q
    and D
  • F(Q,D) = ( Σi=1..k w(qi, nji)^p )^(1/p)
  • Map the problem to bipartite matching in a graph G
  • k source nodes corresponding to query numbers
  • m target nodes corresponding to document numbers
  • An edge from each source to its k nearest targets.
  • Assign weight w(qi, nj)^p to the edge (qi, nj).

46
Bipartite Matching
  • The optimum solution to the minimum weight
    bipartite graph matching problem matches each
    number in Q with a distinct number in D such that
    the distance score F(Q,D) is minimized.
  • The minimum score gives the rank of the document
    D for the Query Q.
  • Assuming F to be L1 and w(qi, nj) = |qi - nj| / (|qi| + ε)
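
A sketch of this scoring with the Hungarian algorithm standing in for
the matching step (the scipy usage is mine; the paper restricts edges
to the k nearest targets, while this builds the full weight matrix):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_score(Q, D, eps=1e-6):
        # Q: k query numbers; D: m >= k document numbers
        # w(qi, nj) = |qi - nj| / (|qi| + eps); F is the L1 (p = 1)
        # sum over the minimum-weight assignment to distinct numbers
        W = np.array([[abs(q - n) / (abs(q) + eps) for n in D]
                      for q in Q])
        rows, cols = linear_sum_assignment(W)
        return W[rows, cols].sum()

    print(match_score([20, 500], [19.5, 480, 3.3]))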

47
Limiting the Set of Documents
  • Similar to the score aggregation problem [Fagin, PODS 96]
  • Proposed algorithm is an adaptation of the TA algorithm in
    [Fagin-Lotem-Naor, PODS 01]

48
Limiting the Set of Documents
  • Make k conceptual sorted lists, one for each query term, using the
    index on document numbers
  • Do a round-robin access to the lists. For each document found,
    compute its distance F(D,Q)
  • Let ni = the number last looked at for query term qi, and let
    t = ( Σi=1..k w(qi, ni)^p )^(1/p)
  • Halt when enough documents have been found whose distance ≤ t
  • t is a lower bound on the distance of unseen documents
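
A sketch of this round-robin threshold scan (the data layout and
stopping count are assumptions for illustration):

    import heapq

    def top_t(sorted_lists, full_score, t):
        # sorted_lists[i]: (w(qi, n), doc_id) pairs for query term qi,
        # ascending by weight; full_score(doc): exact distance F(D, Q)
        seen, heap = set(), []             # heap holds (-distance, doc)
        last = [0.0] * len(sorted_lists)   # last weight seen per list
        for depth in range(max(len(l) for l in sorted_lists)):
            for i, lst in enumerate(sorted_lists):
                if depth >= len(lst):
                    continue
                w, doc = lst[depth]
                last[i] = w
                if doc not in seen:
                    seen.add(doc)
                    heapq.heappush(heap, (-full_score(doc), doc))
                    if len(heap) > t:
                        heapq.heappop(heap)   # keep the t closest
            tau = sum(last)        # bound on unseen docs (p = 1)
            if len(heap) == t and -heap[0][0] <= tau:
                break              # no unseen doc can enter the top t
        return sorted((-d, doc) for d, doc in heap)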

49
Evaluation Metric
  • Benchmark: set of answers when attribute names are precisely known
    in the document and query
  • What fraction of the top 10 "true" answers are
    present in the top 10 answers when attribute
    names are unknown in both document and query?

50
Accuracy Results
51
Reflectivity estimates accuracy
  • Non-reflectivity closely tracked accuracy for all
    nine data sets
  • Non-reflectivity arises due to clustering and correlations in real
    data ("Randomized" = non-reflectivity value obtained after
    permuting the values in the columns)

52
Incorporating Hints
  • L1 score: Σ [ w(qi, ni) + B · v(Ai, Hi) ]
  • v(Ai, Hi): distance between the attribute name (or unit) for qi
    and the set of hints for ni
  • B: relative importance of number match vs. name/unit match
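
One simple way to realize v and the combined score (the 0/1 hint
distance is my simplification; the paper's v can be softer):

    def hint_distance(query_name, hints):
        # v(Ai, Hi): 0 if the query's attribute name or unit appears
        # among the hint words extracted next to the number, else 1
        names = {h.lower() for h in hints}
        return 0.0 if query_name.lower() in names else 1.0

    def combined_score(matched, B):
        # matched: (w(qi, nji), v(Ai, Hi)) per matched pair; L1, p = 1
        return sum(w + B * v for w, v in matched)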

53
Balance between Number Match & Hint Match
  • The weight given to hints should depend on the accuracy of the
    hints
  • Use a tune set to determine B on a per-dataset basis

54
Effectiveness of Hints
  • Improvement in accuracy depends on how good the hints are

55
Effectiveness of Indexing
  • 1 million docs
  • 1 sec for query size 5
  • 0.03 sec for query size 1

56
Summary
  • Allows querying using only numbers, or numbers + hints.
  • End run around data extraction.
  • Use a simple extractor to generate hints.
  • Can ascertain a priori when the technique will be effective

57
Future Work
  • Integration with classic IR (keyword search)
  • e.g. PROM, speed = 20 ns, power = 500 mW
  • Extension to non-numeric values

58
Parametric Search of E-Commerce Data
  • J. Shafer, R. Agrawal WWW-9

59
Conventional Wisdom
  • User enters search criteria into HTML form
  • User's query is received by the web server and submitted to a
    server-side database (usually as SQL)
  • Result set is returned to user as an HTML page
    (or a series of pages)

60
Search Problem in E-Commerce
  • Users often don't know exactly what they are looking for
  • Search is a process, not a single operation
  • An individual query is simply a means to an end
  • Too many/few results ⇒ try a different query
  • No universal ranking function

61
Essential Ideas of Eureka
  • Incrementally fetch superset of tuples of
    interest in the client (e.g. all products in the
    product category of interest)
  • User-interface and fast indexing structures for
    interactive exploration of the cached tuples

62
  • Restriction by example
  • Ranking
  • Fuzzy restrictions
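
A rough sketch of how restriction/ranking by example might work over
the cached tuples (the scheme and normalization are guesses for
illustration; the talk does not give Eureka's actual formula):

    def rank_by_example(rows, example, ranges):
        # rows: cached tuples as {attribute: value} dicts
        # example: the product the user pointed at
        # ranges[a]: spread of attribute a, to normalize contributions
        def distance(row):
            return sum(abs(row[a] - example[a]) / ranges[a]
                       for a in example)
        return sorted(rows, key=distance)   # closest products first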

63
Architecture Overview
[Diagram: client-side Eureka (ListRenderer, DataPump, DataGroup,
DataColumn 1 .. DataColumn N) communicating over HTTP with a
server-side Servlet that reaches the Database via JDBC]
64
Comments
  • Used Eureka in several situations with very
    positive feedback
  • Used Eureka on datasets with 100K records with no
    visible deterioration in performance
  • Performance is excellent, even in Java

65
Storage and Retrieval of E-Commerce Data
  • R. Agrawal, A. Somani, Y. Xu VLDB-2001

66
Typical E-Commerce Data Characteristics
An experimental e-marketplace for computer components
  • Constantly evolving schema
  • Sparsely populated data (about 50-100
    attributes/component)
  • Nearly 2 Million components
  • More than 2000 leaf-level categories
  • Large number of Attributes (5000)

67
Conventional horizontal representation (n-ary relation)
  • DB catalogs do not support thousands of columns (DB2/Oracle
    limit: 1012 columns)
  • Storage overhead of NULL values: nulls increase the index size,
    and they sort high in the DB2 B-tree index
  • Hard to load/update
  • Schema evolution is expensive
  • Querying is straightforward

68
Binary Representation (N 2-ary relations)
  • Decomposition Storage Model [Copeland et al., SIGMOD 85;
    Khoshafian et al., ICDE 87]
  • Monet Binary Attribute Tables [Boncz et al., VLDB Journal 99]
  • Attribute approach for storing XML data [Florescu et al., INRIA
    Tech Report 99]
  • Dense representation
  • Manageability is hard because of large number of
    tables
  • Schema evolution expensive

69
Vertical representation (One 3-ary relation)
  • Objects can have a large number of attributes
  • Handles sparseness well
  • Schema evolution is easy
  • Implementation of SchemaSQL [LSS 99]
  • Edge approach for storing XML data [FK 99]

70
Querying over Vertical Representation
  • A simple query on a horizontal scheme
  • SELECT Monitor FROM H WHERE Output = 'Digital'
  • becomes quite complex:
  • SELECT v1.Val
  • FROM vtable v1, vtable v2
  • WHERE v1.Key = 'Monitor'
  • AND v2.Key = 'Output'
  • AND v2.Val = 'Digital'
  • AND v1.Oid = v2.Oid

Writing applications becomes much harder. What can we do?
71
Solution: Query Mapping
  • A translation layer maps relational algebraic operations on H to
    operations on V

72
Key Issues
  • Can we define a translation algebra?
  • Standard database view mechanism does not work
  • attribute values become column names (need higher-order views)
  • Can the vertical representation support fast querying?
  • What are the new requirements from the database engines?

73
Transformation Algebra
  • Defined an algebra for transforming expressions
    over horizontal views into expressions over the
    vertical representation.
  • Two key operators
  • v2h (Ω) operation: convert from vertical to horizontal
  • Ωk(V) = πOid(V) ⟕i=1..k πOid,Val(σKey='Ai'(V))
  • h2v (Ψ) operation: convert from horizontal to vertical
  • Ψk(H) = ∪i=1..k πOid,'Ai',Ai(σAi≠⊥(H))
  • Similar to Unfold/Fold [LSS 99] and Gather/Scatter [STA 98]
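
A sketch of how a translation layer might emit the v2h expansion as
plain SQL-92 over the vertical table (the table and column names
vtable(Oid, Key, Val) are assumed from the query on the earlier
slide):

    def v2h_sql(attrs, vtable="vtable"):
        # Omega_k(V): one left outer join per requested attribute
        parts = ["SELECT o.Oid"]
        parts += [f", t{i}.Val AS {a}" for i, a in enumerate(attrs)]
        parts.append(f"FROM (SELECT DISTINCT Oid FROM {vtable}) o")
        for i, a in enumerate(attrs):
            parts.append(f"LEFT OUTER JOIN {vtable} t{i} "
                         f"ON t{i}.Oid = o.Oid AND t{i}.Key = '{a}'")
        return "\n".join(parts)

    print(v2h_sql(["Monitor", "Output"]))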

74
Implementation Strategies
  • VerticalSQL
  • Uses only SQL-92 level capabilities
  • VerticalUDF
  • Exploits User Defined Functions and Table
    Functions
  • Binary
  • 2-ary representation with one relation per
    attribute (using only SQL-92 transforms)
  • SchemaSQL
  • Addresses a more general problem
  • Performed 2-3X worse than Vertical
    representation because of temporary tables

75
Performance Experimental setup
  • 600 MHz dual-processor Intel Pentium machine
  • 512 MB RAM
  • Windows NT 4.0
  • Database IBM DB2 UDB 7.1
  • Two 30GB IDE Drives
  • Buffer pool size: 50 MB
  • All numbers reported are cold numbers
  • Synthetic data
  • 200X100K r10 ⇒ 200 columns, 100K rows, non-null density 10%

76
Clustering by Key outperformed clustering by Oid
(Chart: Join)
77
VerticalSQL comparable to Binary and outperforms
Horizontal
(Chart: projection of 10 columns)
78
VerticalUDF is the best approach
(Chart: projection of 10 columns, density 10%)
79
Wish List from Database Engines
  • VerticalUDF approach showed the need for
  • Enhanced table functions
  • First class treatment of table functions
  • Native support for v2h and h2v operations
  • Partial indices

80
Summary
[Table: Vertical (w/ Mapping), Binary (w/ Mapping), and Horizontal
representations rated (+/-) on Manageability, Flexibility, Querying,
and Performance]
81
Opportunities
82
Semiautomatic Catalog Creation
83
Architecture
84
Improving the Catalog Hierarchy
  • Identify sets of nodes (or single "kitchen sink"
    node) for re-organization.
  • Cluster the set of nodes, and compare the
    resulting hierarchy with the original.
  • Identify nodes that fit better elsewhere in the
    hierarchy.
  • Metrics: classification accuracy, tightness of clusters.

85
Integration of Real Time Operational Data
87
Privacy Preserving Data Mining
  • Insight: Preserve privacy at the individual level, while still
    building accurate data mining models at the aggregate level.
  • Add random noise to individual values to protect privacy.
  • Can dramatically change the distribution of values.
  • EM algorithm to estimate the original distribution of values,
    given the randomized values + the randomization function.
  • Estimate is only accurate over thousands of values ⇒ preserves
    privacy.
  • Algorithms for building classification models and discovering
    association rules on top of privacy-preserved data, with only a
    small loss of accuracy.

[Diagram: individual (age, salary) records such as (50, 40K) and
(30, 70K) pass through client-side randomizers (e.g. 30 becomes 65 =
30 + 35), yielding (65, 20K) and (25, 60K); the server reconstructs
the distributions of Age and Salary from the randomized values and
feeds them to data mining algorithms to build the model]
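
A minimal sketch of the randomize/reconstruct pipeline (uniform
additive noise and a discretized domain are my assumptions; bin
centers and parameters are illustrative):

    import random

    def randomize(x, spread=35):
        # add uniform noise before the value leaves the client,
        # e.g. age 30 can become 65 (30 + 35)
        return x + random.uniform(-spread, spread)

    def reconstruct(z, bins, spread, iters=100):
        # EM-style estimate of the original distribution over bin
        # centers `bins`, given only the randomized values z
        def noise(x, y):   # density of noise carrying x to y
            return 1.0 / (2 * spread) if abs(y - x) <= spread else 0.0
        f = [1.0 / len(bins)] * len(bins)      # start from uniform
        for _ in range(iters):
            g = [0.0] * len(bins)
            for y in z:
                denom = sum(f[j] * noise(bins[j], y)
                            for j in range(len(bins)))
                if denom == 0.0:
                    continue
                for j in range(len(bins)):
                    g[j] += f[j] * noise(bins[j], y) / denom
            f = [v / len(z) for v in g]
        return f   # estimated probability mass per bin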
88
Seems to work well!
89
Accuracy vs. Privacy
90
Summary
  • E-Commerce is a rich application domain for
    database technology
  • Research opportunities abound