Developer Identification Methods for Integrated Data from Various Sources, by Gregorio Robles and Jesus M. Gonzalez-Barahona

Description:

Developer Identification Methods for Integrated Data from Various Sources, by Gregorio Robles and Jesus M. Gonzalez-Barahona. Presented by Brian Chan, Cisc 864.

1
Developer Identification Methods for Integrated Data from Various Sources
Gregorio Robles, Jesus M. Gonzalez-Barahona
  • Presented by
  • Brian Chan
  • Cisc 864

2
Table of Contents
  • Background Information
  • Problems Addressed
  • Motivation
  • Data Gathered
  • Conclusion
  • Personal Thoughts
  • Questions and Comments

3
Background Information
  • Data mining for a project usually draws on a single
    source of data
  • Results can be applied to Libre Software
  • Sources are looked at separately, e.g.
  • Mailing lists
  • Bug repositories

4
Background Information
  • Libre Software follows the Pareto law for commits
  • For each major artifact, 20% of developers are
    shown to contribute 80% of the activity in it.
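
The 20/80 claim can be checked for any single artifact by counting activity per identity. The sketch below is only an illustration: the function name `top_share` and the commit log are made up for this example and do not come from the paper.

```python
from collections import Counter

def top_share(commit_authors, top_fraction=0.2):
    """Fraction of all commits made by the most active `top_fraction` of authors."""
    counts = sorted(Counter(commit_authors).values(), reverse=True)
    if not counts:
        return 0.0
    # Always look at at least one author, even in tiny projects.
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / sum(counts)

# Hypothetical commit log: one entry per commit, naming its author.
log = ["a"] * 80 + ["b"] * 8 + ["c"] * 6 + ["d"] * 4 + ["e"] * 2
print(top_share(log))
```

Note that this counts identities, not people: if one developer commits under two identities, the share is understated, which is the problem the rest of the talk addresses.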

5
Problems Addressed
  • Are the people who commit the most in one
    artifact the same people in the other artifacts?
  • People use different identities in each artifact
  • Current mining techniques focus on one artifact,
    so they cannot tell who is who

6
Motivation
  • To gain insight into the social network and
    structure of libre software projects
  • To find all the identities that correspond to one
    person
  • Focus more on data analysis rather than the
    extraction process

7
Data Gathered
  • Actor has access to artifacts
  • Alternate rules for each artifact
  • Figure 1.0

8
Data Gathered
  • Actor can post on more than one mailing list, e.g.
  • bylchan_at_ca.ibm.com
  • briancha_at_ca.ibm.com
  • Source files can appear with many identities, e.g.
  • Brian Chan
  • Brian
  • bchan
  • Interaction with the versioning repository occurs
    through an account on the server machine
  • Bug tracking systems, e.g. Bugzilla, require an
    email address

9
Data Gathered
  • Primary identities
  • Required information
  • Secondary identities
  • Not required for the transaction, e.g. name in email
  • Figure 2.0

10
Data Gathered (contd)
  • An automated process extracts the data into a data
    repository
  • Figure 3.0

11
Data Gathered
  • Sources Table
  • Lists where id information was originally
    extracted, e.g. file1.C, bugreport230
  • Identification Table
  • Identity
  • Id key to Sources table
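
The relationship between the two tables can be pictured with a small in-memory model. This is only a sketch: the class and field names (`Source`, `Identity`, `source_id`, `location`) are assumptions made for illustration, and which identity came from which source is invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    source_id: int
    location: str   # where the identity was extracted from

@dataclass(frozen=True)
class Identity:
    identity: str   # the string exactly as it appeared in the source
    source_id: int  # key into the sources table

# Hypothetical rows, reusing the example names from the slide.
sources = {1: Source(1, "file1.C"), 2: Source(2, "bugreport230")}
identities = [Identity("bchan", 1), Identity("bylchan_at_ca.ibm.com", 2)]

def where_found(identity):
    """Follow the id key from an identity back to its source."""
    return sources[identity.source_id].location

for row in identities:
    print(row.identity, "->", where_found(row))
```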

12
Data Gathered
  • Persons
  • Gender, Nationality, Hash
  • Identifications
  • Pseudo identity, e.g. bchan
  • Match number with another identity
  • Matches
  • Tells which two identities belong to the same
    actor
  • Table 1.0

    1   bchan        bylchan_at_ca.ibm.com   Deduction    80
    1   Brian Chan   bylchan_at_ca.ibm.com   Same Email   90

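
Since each match says that two identities belong to the same actor, the full set of identities per person can be recovered by merging matched pairs. The sketch below uses a union-find structure for this; that choice is an assumption of this example, not something the slides state. Reading the Table 1.0 rows as (identity, identity, evidence, confidence) is also an assumption, and `group_identities` and `min_confidence` are hypothetical names.

```python
def group_identities(matches, min_confidence=0):
    """Merge matched identities into groups, one group per presumed actor."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for left, right, _evidence, confidence in matches:
        root_left, root_right = find(left), find(right)  # registers both identities
        if confidence >= min_confidence:
            parent[root_left] = root_right

    groups = {}
    for identity in list(parent):
        groups.setdefault(find(identity), set()).add(identity)
    return list(groups.values())

# Rows from Table 1.0, under the column reading assumed above.
matches = [
    ("bchan", "bylchan_at_ca.ibm.com", "Deduction", 80),
    ("Brian Chan", "bylchan_at_ca.ibm.com", "Same Email", 90),
]
print(group_identities(matches))
print(group_identities(matches, min_confidence=85))
```

Raising `min_confidence` leaves weaker matches unmerged, which is one way to decide what goes to human verification.
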
13
Data Gathered
  • Matching happens during the automated data gathering process
  • Inference
  • Automatic Heuristics
  • Human Verification

14
Data Gathered
  • Rule 1
  • Primary identities may contain part of the real
    name
  • Example: User <username_at_example.com>
  • Rule 2
  • Identities can be built from another one
  • nsurname_at_example.com, name.surname_at_example.com
  • name surname_at_example.com
  • Rule 3
  • Some projects or repositories have the foresight
    to keep lists of information that can be used for
    matching
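
Rules 1 and 2 can be sketched as two small heuristics. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, only the `nsurname` and `name.surname` patterns from the slide are generated, and real `@` addresses are used because the `_at_` form in this transcript would not parse as an email address.

```python
from email.utils import parseaddr

def split_identity(raw):
    """Rule 1: split 'Real Name <user@host>' into (real name, address)."""
    return parseaddr(raw)

def candidate_usernames(real_name):
    """Rule 2: usernames that could have been built from a real name."""
    parts = real_name.lower().split()
    if len(parts) < 2:
        return set()
    name, surname = parts[0], parts[-1]
    return {name[0] + surname, f"{name}.{surname}"}

def could_match(real_name, address):
    """True if the local part of the address looks built from the real name."""
    local_part = address.partition("@")[0].lower()
    return local_part in candidate_usernames(real_name)

name, address = split_identity("Brian Chan <bchan@example.com>")
print(name, address, could_match(name, address))
```

A positive result is only a candidate match; as the next slide notes, cleaning and verification are still needed.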

15
Data Gathered
  • The matching algorithms still produce errors, but
    for statistical purposes an error that is small
    enough can be ignored.
  • Cleaning and verification are still used.

16
Data Gathered
  • Privacy Issues
  • Hash value (1st firewall): used to reference
    information, so Identifications cannot be
    referenced directly
  • Person ID (2nd firewall): given in such a way that
    the real identity cannot be inferred without
    direct access to the Identifications table
  • Given to a unique person so hackers cannot find a
    specific id
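
The first firewall can be illustrated by replacing each identity with a hash before the data is shared. The slides do not say which hash is used; the keyed SHA-256 (HMAC) below, the `pseudonym` function and the secret key are all assumptions of this sketch.

```python
import hashlib
import hmac

def pseudonym(identity, secret):
    """Return a stable hash that stands in for an identity in shared data."""
    normalized = identity.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()

# Hypothetical key, kept alongside the private Identifications table.
secret = b"keep-this-out-of-the-public-dataset"

token = pseudonym("bchan", secret)
print(token)
# The same identity always maps to the same token, so records can still be joined.
print(token == pseudonym(" BChan ", secret))
```

Using a key means that someone who guesses an identity cannot recompute its hash without also holding the secret.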

17
Conclusions
  • Actors in Libre Software may use many different
    identities for development
  • The paper presents a design for accounting for all
    the different identities and working out who is
    actually doing what
  • Discussed how privacy can be dealt with

18
Personal Thoughts
  • Good Points
  • Effective Solution
  • Good examination of all the different identities
    in business
  • Unique interpretation of data mining

19
Personal Thoughts
  • Points for improvement
  • No actual data to view results
  • References GNOME but never actually gives
    statistical information from it
  • Some interpretation is left to the reader

20
Questions and Comments