Title: Using Text Mining to Infer Semantic Attributes for Retail Data Mining
1 Using Text Mining to Infer Semantic Attributes for Retail Data Mining
- Authors: Rayid Ghani, Andrew E. Fano
- Presenter: Vishal Mahajan
- INFS795
2 Agenda
- Drawbacks in Current Data Mining Techniques.
- Purpose.
- Assumptions and Constraints.
- Methodology or Approach.
- Extraction of Feature Set.
- Labeling.
- Classification Techniques.
  - Naïve Bayes
  - EM
- Experimental Results.
- Recommender System.
3 Drawbacks in Current Data Mining Techniques
- Semantic features not automatically considered.
- Transactional data analyzed without analyzing the customer.
- Trending is partial.
- Retail items treated as objects with no associated semantics.
- Data mining techniques (association rules, decision trees, neural networks) ignore the meaning of items and the semantics associated with them.
4 Purpose of the Presentation
- Describe a system that extracts semantic features.
- Populate the knowledge base with the semantic features.
- Use of text mining in retailing to extract semantic features from the websites of retailers.
- How profiles of customers or groups of customers can be built using text mining.
5 Assumptions and Constraints
- Focus on the apparel retail segment only.
- Results focus on extracting those semantic features that are deemed important by CRM or retail experts.
- Data extracted from retailers' websites.
- Models generated can be extended beyond the apparel retail segment.
6 Approach
- Collect information about products.
- Define the set of features to be extracted.
- Label the data with values of the features.
- Train a classifier/extractor on the labeled training data to extract features from unseen data.
- Extract semantic features from new products by using the trained classifier.
- Populate a knowledge base with the products and their corresponding features.
7 Data Collection Methodology
- Use of a web crawler to extract the following from large retailers' websites:
  - Names
  - URLs
  - Descriptions
  - Prices
  - Categories of all products available
- Use of wrappers.
- Extracted information stored in a database and a subset chosen.
8 Extraction of Feature Set
- Feature selection based on Expert Systems.
- Use of extensive domain knowledge.
- Feature selection done with the retail apparel segment in mind.
- Features selected for the project:
  - Age Group
  - Functionality
  - Price
  - Formality
  - Degree of Conservativeness
  - Degree of Sportiness
  - Degree of Trendiness
  - Degree of Brand Appeal
9 Labeling Training Data
- Database created with data collected from retailers' websites.
- Subset of 600 products chosen and labeled.
- Labeling guidelines provided.
10 Details of Features Extracted from Each Product Description
11 Verifying Training Data
- Disjoint datasets, as labeling was done by different individuals.
- Association rules (between features) used to obtain consistency in the labeled data.
  - Apriori algorithm
- Apriori algorithm implemented with single- and two-feature antecedents and consequents.
- Desired consistency in labeling achieved by applying association rules.
12 Apriori Algorithm
- Find the frequent itemsets: the sets of items that have minimum support.
  - A subset of a frequent itemset must also be a frequent itemset.
  - I.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
- Use the frequent itemsets to generate association rules.
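
The two steps above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the function names `apriori` and `association_rules`, the `Feature=Value` item encoding, and the four labeled products are assumptions made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items and keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join frequent (k-1)-itemsets into candidate k-itemsets, pruning any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Scan the data again to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself in `frequent`.
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical labels for four products, one item per feature value.
labeled_products = [
    {"Formality=Informal", "Sportiness=High", "AgeGroup=Teens"},
    {"Formality=Informal", "Sportiness=High"},
    {"Formality=Formal", "Sportiness=Low"},
    {"Formality=Informal", "Sportiness=High", "AgeGroup=Teens"},
]
frequent = apriori(labeled_products, min_support=2)
rules = association_rules(frequent, min_confidence=0.9)
```

A high-confidence rule between two feature values could then be used to flag labeled products that contradict it, which is one plausible way to check labeling consistency.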
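13 The Apriori Algorithm Example
[Figure: worked Apriori example. Database D is scanned to count candidate set C1 and obtain frequent set L1; candidate set C2 is generated from L1 and counted with another scan of D to obtain L2; candidate set C3 is generated from L2 and counted with a further scan of D to obtain L3.]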
14 Training from Labeled Data
- Learning problem treated as a text classification problem.
- One text classifier is trained for each semantic feature.
- E.g., the price of a product is classified as discount, average, or luxury.
- Age group is classified as Juniors, Teens, GenX, Mature, or All Ages.
- Classification was performed using naïve Bayes.
15 Sample Association Rules
16 Naïve Bayes
- Simple but effective text classification method.
- Generative model: a class is first selected according to the class prior probabilities.
- The model assumes each word in a document is generated independently of the other words, given the class.
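
A common way to write the word-probability estimate for this model is shown here as an assumption, based on the standard Laplace-smoothed multinomial naïve Bayes formulation. $|V|$ (the vocabulary size) and $|D|$ (the number of training documents) are notation introduced for this sketch.

```latex
\Pr(w_t \mid c_j) =
  \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\,\Pr(c_j \mid d_i)}
       {|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\,\Pr(c_j \mid d_i)}
```
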
In the parameter estimate (Equation 1), $N(w_t, d_i)$ is the number of times word $w_t$ occurs in document $d_i$, and $\Pr(c_j \mid d_i) \in \{0, 1\}$ for labeled documents.
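
The following is a minimal sketch of a Laplace-smoothed multinomial naïve Bayes classifier, trained once per semantic feature. The function names, the tokenized descriptions, and the label values (including the Formality values) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Fit class priors and per-class word probabilities from hard labels."""
    classes = sorted(set(labels))
    vocab = {w for tokens in docs for w in tokens}
    word_counts = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)

    model = {"priors": {}, "word_probs": {}}
    for c in classes:
        # Laplace-smoothed class prior and word probabilities.
        model["priors"][c] = (1 + labels.count(c)) / (len(classes) + len(docs))
        total = sum(word_counts[c].values())
        model["word_probs"][c] = {
            w: (1 + word_counts[c][w]) / (len(vocab) + total) for w in vocab
        }
    return model


def classify(model, tokens):
    """Return the most probable class; out-of-vocabulary words are skipped."""
    def log_score(c):
        probs = model["word_probs"][c]
        return math.log(model["priors"][c]) + sum(
            math.log(probs[w]) for w in tokens if w in probs
        )
    return max(model["priors"], key=log_score)


# Hypothetical tokenized product descriptions and their labels per feature.
descriptions = [
    ["cashmere", "silk", "designer"],
    ["value", "pack", "basic"],
    ["cotton", "classic", "shirt"],
]
feature_labels = {
    "Price": ["luxury", "discount", "average"],
    "Formality": ["formal", "informal", "informal"],
}

# One independent classifier per semantic feature.
classifiers = {
    feature: train_naive_bayes(descriptions, labels)
    for feature, labels in feature_labels.items()
}
new_item = ["silk", "designer", "shirt"]
predicted = {feature: classify(m, new_item) for feature, m in classifiers.items()}
```

Scores are summed in log space to avoid underflow on longer descriptions.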
17 Incorporating Unlabeled Data
- Initial sample was for 600 products only.
- Need to take care of unlabeled products to make any meaningful predictions.
- Use of semi-supervised learning algorithms.
- These algorithms have been shown to reduce the classification error considerably.
- Use of the Expectation-Maximization (EM) algorithm as the semi-supervised technique.
18 Expectation-Maximization (EM) Method
- EM is an iterative statistical technique for maximum likelihood estimation with incomplete data.
- In the retail classification problem, unlabeled data is considered as incomplete data.
- EM:
  - Locally maximizes the likelihood of the parameters.
  - Gives estimates for missing values.
19 Expectation-Maximization (EM) Method (cont.)
- The EM method is a 2-step process.
- Initial parameters are set using naïve Bayes from just the labeled documents.
- Subsequent iterations of E- and M-steps.
- E-step
  - Calculates probabilistically weighted class labels $\Pr(c_j \mid d_i)$ for every unlabeled document.
- M-step
  - Estimates new classifier parameters using all documents (Equation 1).
- E- and M-steps are iterated until the classifier converges.
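
The loop described above can be sketched as follows, assuming a Laplace-smoothed multinomial naïve Bayes model. The names `m_step`, `e_step`, and `em`, the convergence test on the change in soft labels, and the toy documents are assumptions for illustration, not details taken from the slides.

```python
import math
from collections import Counter


def m_step(docs, soft_labels, classes, vocab):
    """Re-estimate priors and word probabilities from (fractional) labels."""
    priors, word_probs = {}, {}
    for c in classes:
        mass = sum(p[c] for p in soft_labels)
        priors[c] = (1 + mass) / (len(classes) + len(docs))
        counts = Counter()
        for tokens, p in zip(docs, soft_labels):
            for w in tokens:
                counts[w] += p[c]  # each word occurrence is weighted by Pr(c | d)
        total = sum(counts.values())
        word_probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, word_probs


def e_step(tokens, priors, word_probs):
    """Return {class: Pr(class | tokens)} under the current parameters."""
    logs = {
        c: math.log(priors[c])
        + sum(math.log(word_probs[c][w]) for w in tokens if w in word_probs[c])
        for c in priors
    }
    top = max(logs.values())
    exps = {c: math.exp(v - top) for c, v in logs.items()}
    norm = sum(exps.values())
    return {c: v / norm for c, v in exps.items()}


def em(labeled, unlabeled, classes, max_iter=20, tol=1e-6):
    lab_docs = [tokens for tokens, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    vocab = sorted({w for d in lab_docs + unlabeled for w in d})

    # Initial parameters come from the labeled documents only.
    priors, word_probs = m_step(lab_docs, hard, classes, vocab)
    previous = None
    for _ in range(max_iter):
        # E-step: soft class labels for every unlabeled document.
        soft = [e_step(d, priors, word_probs) for d in unlabeled]
        # M-step: re-estimate the parameters from all documents.
        priors, word_probs = m_step(
            lab_docs + unlabeled, hard + soft, classes, vocab
        )
        flat = [p[c] for p in soft for c in classes]
        if previous is not None and max(
            (abs(a - b) for a, b in zip(flat, previous)), default=0.0
        ) < tol:
            break
        previous = flat
    return priors, word_probs


# Hypothetical data: two labeled and three unlabeled product descriptions.
classes = ["luxury", "discount"]
labeled = [(["silk", "designer"], "luxury"), (["value", "basic"], "discount")]
unlabeled = [["silk", "cashmere"], ["basic", "multipack"], ["cashmere", "designer"]]
priors, word_probs = em(labeled, unlabeled, classes)
cashmere_posterior = e_step(["cashmere"], priors, word_probs)
```

In this toy data the word "cashmere" never appears in a labeled document, so the labeled-only model cannot place it. After EM it leans toward "luxury" because it co-occurs with "silk" and "designer" in the unlabeled descriptions.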
20 Experimental Results
21 Experimental Results
22 Results on new data set
- The subset of data that was used earlier was from a single retailer.
- Another sample of data was collected from a variety of retailers. The results are as follows.
- Results are consistently better.
23 Recommender System
- Creation of customer profiles (real time) is feasible by analyzing the text associated with products and by mapping it to pre-defined semantic features.
- Identity of the customer is not known and prior transaction history is unknown.
- Semantic features are inferred from the browsing pattern of the customer.
- Helps in suggesting new products to the customers.
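24 Recommender System
- Mathematically:
  - $P(A_{i,j} \mid \text{Product})$
  - where $A_{i,j}$ is the $j$-th value of the $i$-th attribute
  - $i$ indexes semantic attributes, $j$ indexes possible values
- User profile is constructed as follows:
  - $\Pr(U_{i,j} \mid \text{Past } N \text{ Items}) = \frac{1}{N} \sum_{k=1}^{N} P(A_{i,j} \mid \text{Product}_k)$ is calculated for each $i, j$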
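
The profile construction can be sketched as a per-attribute average over the last N items viewed. The function name `build_profile`, the nested-dictionary layout, and the probabilities below are illustrative assumptions. In practice each P(A_ij | Product) would presumably come from the trained per-feature classifiers.

```python
def build_profile(item_distributions):
    """Average P(A_ij | item) over the items a user has browsed.

    item_distributions: list of {attribute: {value: probability}}, one entry
    per browsed item (hypothetical layout).
    """
    n = len(item_distributions)
    profile = {}
    for distribution in item_distributions:
        for attribute, values in distribution.items():
            slot = profile.setdefault(attribute, {})
            for value, p in values.items():
                # Each item contributes 1/N of its probability mass.
                slot[value] = slot.get(value, 0.0) + p / n
    return profile


# Hypothetical classifier outputs for the last two items a visitor viewed.
browsed = [
    {
        "Formality": {"Informal": 0.9, "Formal": 0.1},
        "Price": {"discount": 0.2, "average": 0.8},
    },
    {
        "Formality": {"Informal": 0.7, "Formal": 0.3},
        "Price": {"discount": 0.6, "average": 0.4},
    },
]
profile = build_profile(browsed)
```

Because the profile depends only on the items viewed in the current session, it can be built without knowing who the customer is.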
25 Types of Recommender Systems
- Two types of recommender systems.
- Collaborative Filtering
  - Collect user feedback in terms of ratings.
  - Exploit similarities and differences of customers to recommend items.
  - Issues
    - Sparsity problem.
    - New items.
- Content Filtering
  - Compares the contents.
  - Issues
    - Narrow in scope.
    - Recommends similar products only.
26 Conclusions
- The system learns from the use of supervised and semi-supervised techniques.
- Major assumption: products accurately convey the semantic attributes (questionable).
- Small sample of data used to infer results. Practical applications not verified.
- System bootstrapped from a small number of labeled training examples.
- Interesting application which could be evolved to generate trends for retail marketers.