Funded by: European Commission 6th Framework.

1
Using Text Mining and Link Analysis for Software Mining
  • Miha Grčar, Marko Grobelnik, Dunja Mladenić
  • Jožef Stefan Institute

2
Outline of the Presentation
  • Purpose of this work
  • Software mining
  • Document networks
  • Transforming document networks into feature
    vectors
  • Why feature vectors?
  • Data visualization
  • Clustering, classification
  • What was done so far
  • Future work

3
Purpose of This Work
  • To define a methodology and to implement a set of tools for
    facilitating the construction of ontologies and taxonomies out of
    software
  • Funded by the European Commission under the project TAO
    (Transitioning Applications to Ontologies), http://www.tao-project.eu/

4
Software Mining
  • Software data sources
      • Structured data: networks (link analysis)
      • Unstructured data: textual documents (text mining)
  • Document network: a set of interlinked documents; each link has a
    type and a weight
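
The definition of a document network above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Link` and `DocumentNetwork`, the method names, and the link type strings ("extends", "implements", "field_type") are hypothetical and do not come from the presentation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A directed link between two documents with a type and a weight."""
    source: str
    target: str
    link_type: str   # hypothetical type names, e.g. "extends"
    weight: float


class DocumentNetwork:
    """A set of interlinked documents; each link has a type and a weight."""

    def __init__(self):
        self.documents = {}            # document id -> text content
        self._out = defaultdict(list)  # document id -> outgoing links

    def add_document(self, doc_id, text=None):
        # Create the document if it is new, or update its text if one is given.
        if text is not None or doc_id not in self.documents:
            self.documents[doc_id] = text or ""

    def add_link(self, source, target, link_type, weight=1.0):
        # Documents are created on demand so that links can be added first.
        self.add_document(source)
        self.add_document(target)
        self._out[source].append(Link(source, target, link_type, weight))

    def links_from(self, doc_id, link_type=None):
        """Outgoing links of a document, optionally filtered by link type."""
        return [link for link in self._out.get(doc_id, [])
                if link_type is None or link.link_type == link_type]


network = DocumentNetwork()
network.add_document("DocumentFormat", "The format of Documents.")
network.add_link("DocumentFormat", "AbstractLanguageResource", "extends")
network.add_link("DocumentFormat", "LanguageResource", "implements")
network.add_link("DocumentFormat", "MimeType", "field_type", weight=0.5)
```

Keeping the type on every link makes it possible to weight or filter link types separately before any feature vectors are computed.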
5
Software Data Sources
  • Structured
      • Code samples
      • Web service usage logs
      • Source code
      • Reference manual (function declarations)
      • WSDL
  • Unstructured
      • Web pages
      • User's manual
      • Tutorials, lectures, forums, newsgroups, etc.
      • Reference manual (textual descriptions)
      • Source code comments

6
GATE Software Library
  • Software library for natural language processing
    (NLP)
  • 600 Java classes
      • Language resources: data
      • Processing resources: algorithms
      • Graphical user interfaces (GUI)
  • Developed at University of Sheffield
  • Freely available at http://gate.ac.uk/download/

7
A Typical Java Class
Elements annotated on the slide: class comment, class name, super-class
(base class), implemented interface, field comment, field type, field
name, method comment, comment references, return type, method name.

    /** The format of Documents. Subclasses of DocumentFormat know about
      * particular MIME types and how to unpack the information in any
      * markup or formatting they contain into GATE annotations. Each MIME
      * type has its own subclass of DocumentFormat, e.g.
      * XmlDocumentFormat, RtfDocumentFormat, MpegDocumentFormat. These
      * classes register themselves with a static index residing here when
      * they are constructed. Static getDocumentFormat methods can then be
      * used to get the appropriate format class for a particular document.
      */
    public abstract class DocumentFormat extends AbstractLanguageResource
                                         implements LanguageResource {

      /** The MIME type of this format. */
      private MimeType mimeType = null;

      /** Find a DocumentFormat implementation that deals with a
        * particular MIME type, given that type.
        * @param aGateDocument this document will receive as a feature
        *   the associated Mime Type. The name of the feature is MimeType
        *   and its value is in the format type/subtype
        * @param mimeType the mime type that is given as input
        */
      static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,
                                                     MimeType mimeType) {
        ...
      } // getDocumentFormat(aGateDocument, MimeType)

    } // class DocumentFormat
8
Creating a Document Network
[Figure: the class DocumentFormat (DocumentFormat.class) as a node of the document network]
9
Creating a Document Network
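[Figure: document network around DocumentFormat.class with the nodes
LanguageResource, MimeType, RtfDocumentFormat, DocumentFormat,
AbstractLanguageResource, Document, XmlDocumentFormat and
MpegDocumentFormat; the number 2 appears as a label in the figure]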
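
As a rough illustration of how links such as those on the "A Typical Java Class" slide could be collected, the sketch below scans a shortened copy of the DocumentFormat source with regular expressions. It is not the authors' tool: the function `extract_links`, the link type names, the `KNOWN_CLASSES` set, and the use of occurrence counts as link weights are all assumptions made for this example, and a real implementation would more likely use a proper Java parser.

```python
import re
from collections import Counter

# Shortened version of the class shown on the slide.
JAVA_SOURCE = '''
/** The format of Documents. Each MIME type has its own subclass of
  * DocumentFormat, e.g. XmlDocumentFormat, RtfDocumentFormat. */
public abstract class DocumentFormat extends AbstractLanguageResource
    implements LanguageResource {
  /** The MIME type of this format. */
  private MimeType mimeType = null;
}
'''

# Hypothetical list of class names that exist as documents in the network.
KNOWN_CLASSES = {
    "DocumentFormat", "AbstractLanguageResource", "LanguageResource",
    "MimeType", "XmlDocumentFormat", "RtfDocumentFormat",
}


def extract_links(source, known_classes):
    """Map (source class, target class, link type) to an occurrence count."""
    comments = " ".join(re.findall(r"/\*.*?\*/", source, flags=re.S))
    code = re.sub(r"/\*.*?\*/", "", source, flags=re.S)

    header = re.search(
        r"\bclass\s+(\w+)"
        r"(?:\s+extends\s+(\w+))?"
        r"(?:\s+implements\s+([\w\s,]+?))?\s*\{",
        code)
    if header is None:
        return Counter()
    name, base, interfaces = header.groups()

    links = Counter()
    if base:
        links[(name, base, "super_class")] += 1
    for iface in re.findall(r"\w+", interfaces or ""):
        links[(name, iface, "implemented_interface")] += 1
    # Field declarations: modifier(s), type, name, then "=" or ";".
    for field_type in re.findall(
            r"\b(?:private|protected|public)\s+(?:static\s+)?(?:final\s+)?"
            r"(\w+)\s+\w+\s*[=;]", code):
        if field_type in known_classes:
            links[(name, field_type, "field_type")] += 1
    # Comment references: known class names mentioned inside comments.
    for word in re.findall(r"\w+", comments):
        if word in known_classes and word != name:
            links[(name, word, "comment_reference")] += 1
    return links


links = extract_links(JAVA_SOURCE, KNOWN_CLASSES)
```

Each key of `links` corresponds to one typed link from DocumentFormat to another class, and the count could serve as the initial link weight.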
10
GATE Comment Reference Network
See next slide
11
GATE Comment Reference Network
12
Transforming Networks into Feature Vectors
[Figure: example network with nodes numbered 0 to 11; the values 0.25 and 0.5 appear in the figure]
13
Transforming Networks into Feature Vectors
[Figure: the example network with nodes numbered 0 to 11 (continued)]
14
Transforming Networks into Feature Vectors
  • Adjacency matrix
      • Too sparse
  • Maximum flow
      • Between $O(V^{3\frac{1}{3}})$ and $O(V^4)$ for sparse graphs:
        too expensive!
  • Shortest paths
      • $O(V^2 \log V)$ for sparse graphs
  • Force-based graph layout
      • Between $O(V)$ and $O(V^2)$ for sparse graphs
  • Some other methods
      • Random walk
      • ScentTrails
      • PageRank
      • Belief propagation
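
The shortest-paths option from the list above can be sketched as follows. The slide only names the technique and its cost; running Dijkstra's algorithm once per node is consistent with the quoted bound for sparse graphs, but the rest is assumed for illustration: link length is taken as 1 / weight, and each distance d is turned into a feature value 1 / (1 + d), with 0 for unreachable nodes. The function names are hypothetical.

```python
import heapq


def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm; graph maps node -> {neighbor: edge length}."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, length in graph.get(node, {}).items():
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist


def structure_vectors(weights, nodes):
    """One feature vector per node, with one component per node."""
    # Assumption: a stronger link means a shorter distance (1 / weight).
    graph = {u: {v: 1.0 / w for v, w in nbrs.items() if w > 0}
             for u, nbrs in weights.items()}
    vectors = {}
    for u in nodes:
        dist = shortest_path_lengths(graph, u)
        # Assumption: proximity 1 / (1 + distance); unreachable nodes get 0.
        vectors[u] = [1.0 / (1.0 + dist[v]) if v in dist else 0.0
                      for v in nodes]
    return vectors


nodes = ["A", "B", "C", "D"]
weights = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 0.5},
    "C": {"B": 0.5},
    "D": {},
}
vectors = structure_vectors(weights, nodes)
# vectors["A"] is [1.0, 0.5, 0.25, 0.0]: B is one step away, C is reached
# through B, and D is unreachable.
```

Unlike a row of the adjacency matrix, such a vector is non-zero for every reachable node, which addresses the sparsity problem named in the first bullet.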

15
Combining Feature Vectors
  • Feature vectors can be combined (a) horizontally or (b) vertically
  • Structure feature vectors and the content feature vector are
    combined into a single combined feature vector
  • Content feature vector
      • Stop-words
      • Stemming
      • n-grams
      • Normalized TF-IDF
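
A minimal sketch of building a content feature vector and combining it with a structure feature vector is given below. Several points are assumptions, since the slide does not spell them out: the combination is done by concatenating the two L2-normalized vectors with a mixing weight, the stop-word list is a tiny hypothetical one, and stemming and n-grams are left out to keep the example short. The function names and the example structure vectors are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "this"}


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def tfidf_vectors(texts):
    """Normalized TF-IDF vectors over a shared, sorted vocabulary."""
    docs = [Counter(w for w in re.findall(r"[a-z]+", t.lower())
                    if w not in STOP_WORDS) for t in texts]
    vocabulary = sorted(set().union(*docs))
    n = len(docs)
    df = {term: sum(1 for d in docs if term in d) for term in vocabulary}
    vectors = [l2_normalize([d[term] * math.log(n / df[term])
                             for term in vocabulary]) for d in docs]
    return vocabulary, vectors


def combine(content, structure, structure_weight=0.5):
    """Concatenate the normalized content and structure vectors."""
    # Assumption: square-root scaling so that the squared norms of the two
    # parts add up to 1 when both parts are non-zero.
    a = math.sqrt(1.0 - structure_weight)
    b = math.sqrt(structure_weight)
    return ([a * x for x in l2_normalize(content)] +
            [b * x for x in l2_normalize(structure)])


texts = [
    "The format of documents",
    "The MIME type of this format",
    "XML documents",
]
# Made-up structure vectors, one per document.
structure = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]

vocabulary, content = tfidf_vectors(texts)
combined = [combine(c, s) for c, s in zip(content, structure)]
```

The `structure_weight` parameter plays the role of a user-set weight: 0 keeps only the content part and 1 keeps only the structure part.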
16
Why Feature Vectors?
Software data sources
Processing
Feature vectors
Classification
17
OntoGen
  • A system for data-driven semi-automatic ontology
    construction
  • Developed in the European project SEKT
    (http://sekt-project.org)
  • Freely available at http://ontogen.ijs.si
  • Underlying technologies
      • Supervised and unsupervised learning
      • Active learning
      • Data visualization

18
Set of feature vectors
Clustering
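
To make the step from a set of feature vectors to clusters concrete, here is a small k-means sketch using cosine similarity. The presentation does not say which clustering algorithm is applied at this point, so k-means is a stand-in chosen for brevity; the function names, the deterministic choice of the first k vectors as initial centroids, and the example vectors are assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors, k, iterations=20):
    """Cluster by cosine similarity; returns one cluster index per vector."""
    # Assumption: the first k vectors are the initial centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment


vectors = [
    [1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0],
    [0.9, 0.2, 0.0],
    [0.1, 0.0, 0.9],
]
labels = k_means(vectors, k=2)
# labels is [0, 1, 0, 1]: the first and third vectors share a cluster.
```

Applying the same routine again inside each cluster would give a simple top-down hierarchy.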
20
Visualize this cluster
22
Different semantic space ⇒ different clusters ⇒ different taxonomies
⇒ different views of the data
23
Done So Far and Future Work
  • Done so far
      • First implementation in the form of a Web service, up and
        running and available to the consortium
      • Managing document networks
      • Weight propagation
      • Feature vector computation
      • OntoGen-compatible
  • Future work
      • Visualization tool that helps users set the weights
      • Machine learning algorithms
          • Hierarchical clustering
          • Classification
          • Active learning
      • Evaluation (not trivial to do)
          • Gold standard taxonomy vs. hierarchical clustering
          • Classification into taxonomy
          • User study

24
Done?
  • Thank you for your attention
  • Questions?