Title: Improving Automatic Abbreviation Expansion within Source Code to Aid in Program Search Tools


1
Improving Automatic Abbreviation Expansion within
Source Code to Aid in Program Search Tools
  • Zak Fry

2
Outline
  • Problem and Motivation
  • Automatically Identifying Abbreviation Expansions
  • A Scoped Approach
  • Analysis and Refinement - iScope
  • Evaluations
  • Conclusions

3
Maintenance Tasks
  • 60-90% of software lifecycle
  • Problem: identify where relevant code is and
    where changes need to be made
  • Code to perform a certain task can be very
    scattered
  • Causes difficulty for current maintenance search
    tools

4
Challenges - Coding Practices
  • Identifier names important for code documentation
    and understanding
  • Problem: Programmers' use of abbreviations in
    code
  • Frequency of occurrence
  • character, integer, string
  • Complex inheritance: long class names
  • SecureMessageServiceClientMessageImpl
  • Negates usefulness of identifier names and
    complicates program understanding

5
Abbreviations and Maintenance Tools
  • Problem: Search-based maintenance tools rely on
    natural language
  • Abbreviations change the natural language
  • Search Term: distributed hash
  • dht = (DHTPlugin)dht_pi.getPlugin();
    Thread t = new AEThread( "DHTTrackerPlugininit" ) {
      public void runSupport() {
        try {
          if ( dht.isEnabled() ) { log.log( "DDB Available" ); }
        } catch( Throwable e ) { log.log( "DDB Failed", e ); }
        ...

6
Automatically Identifying Abbreviation Expansions
  • First, how do we identify candidates for
    expansion?
  • Non-dictionary words
  • Abbreviation: short form
  • Expansion: long form
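
As a rough illustration of the first step, the sketch below splits an identifier into words and flags the ones missing from a dictionary. The class name CandidateFinder, the tiny word list, and the splitting rule are all assumptions made for this example; the slides do not say how the actual tool tokenizes identifiers or which dictionary it uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CandidateFinder {
    // Hypothetical toy dictionary; a real tool would load a full English word list.
    static final Set<String> DICTIONARY = new HashSet<>(
            Arrays.asList("get", "plugin", "is", "enabled", "event", "block"));

    // Split on underscores and on lower-to-upper camel-case boundaries (assumed rule).
    static List<String> splitIdentifier(String identifier) {
        List<String> words = new ArrayList<>();
        for (String part : identifier.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                words.add(part.toLowerCase());
            }
        }
        return words;
    }

    // Non-dictionary words are the candidates for expansion.
    static List<String> candidates(String identifier) {
        List<String> result = new ArrayList<>();
        for (String word : splitIdentifier(identifier)) {
            if (!DICTIONARY.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("dht_pi"));
        System.out.println(candidates("getPlugin"));
    }
}
```

With this toy dictionary, both parts of dht_pi are flagged as candidates, while getPlugin yields none.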

7
Types of Non-Dictionary Words
Abbreviation Category               Type                    Short Form            Long Form
Single Word                         Prefix                  int                   integer
Single Word                         Dropped Letter          evt                   event
Multiple Word                       Acronym                 FBI                   Federal Bureau of Investigation
Multiple Word                       Combination Multiword   recblk                receive block
Domain Keywords and Special Cases   ---                     parsetree, serialize  ---

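
The first three rows of the table can be read as simple string tests. The sketch below is one possible formulation, not the matching rules used in the presented tool: the class name, the stop-word list used for acronyms, and the subsequence rule for dropped letters are assumptions. Combination multiwords are left out because the slides do not describe how they are matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AbbreviationMatchers {
    // Hypothetical stop words that are skipped when matching acronyms.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("of", "the", "and"));

    // Prefix: the short form is a proper prefix of the long form (int -> integer).
    static boolean isPrefix(String shortForm, String longForm) {
        return longForm.length() > shortForm.length() && longForm.startsWith(shortForm);
    }

    // Dropped letter (assumed rule): same first letter, short form is an in-order
    // subsequence of the long form, and it is not already a plain prefix (evt -> event).
    static boolean isDroppedLetter(String shortForm, String longForm) {
        if (shortForm.isEmpty() || longForm.length() <= shortForm.length()
                || longForm.startsWith(shortForm)
                || shortForm.charAt(0) != longForm.charAt(0)) {
            return false;
        }
        int matched = 0;
        for (int i = 0; i < longForm.length() && matched < shortForm.length(); i++) {
            if (longForm.charAt(i) == shortForm.charAt(matched)) {
                matched++;
            }
        }
        return matched == shortForm.length();
    }

    // Acronym: one letter per non-stop word of the phrase, compared case-insensitively.
    static boolean isAcronym(String shortForm, String phrase) {
        List<String> words = new ArrayList<>();
        for (String word : phrase.toLowerCase().split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                words.add(word);
            }
        }
        if (shortForm.isEmpty() || shortForm.length() != words.size()) {
            return false;
        }
        String letters = shortForm.toLowerCase();
        for (int i = 0; i < words.size(); i++) {
            if (letters.charAt(i) != words.get(i).charAt(0)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefix("int", "integer"));
        System.out.println(isDroppedLetter("evt", "event"));
        System.out.println(isAcronym("FBI", "Federal Bureau of Investigation"));
    }
}
```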
8
State of the Art
  • Lawrie, Feild, and Binkley
  • Abbreviation Expansion
  • Problem
  • Lack of precision
  • No support for choosing between multiple matches

9
Scoped Approach
  • How to choose between multiple possible long
    forms
  • By manual inspection we found correct long forms
    are more likely to be found in certain locations
  • Also, correctly identifying the long forms for
    certain types of abbreviations is easier than for
    others

10
Order of Types
Abbreviation Type
  1. Acronym
  2. Prefix
  3. Dropped Letter
  4. Combo Multiword
  5. Most Common
11
Order of Program Context
Context
  1. Javadoc
  2. Type
  3. Method Name
  4. Statement
  5. Method
  6. Method Comments
  7. Class Comments
12
General Algorithm
  • Acronym: Javadoc, Type, Method Name
  • Prefix: Javadoc, Type, Method Name

13
Multiple matches
  • We assume one best candidate though multiple
    might be present at the same level of scope
  • If multiple matches
  • Examine frequencies
  • Stem long forms and reexamine frequencies
  • Broaden Scope and reexamine frequencies
  • Most frequent expansion
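
The first two tie-breaking steps can be sketched as a frequency count that is retried after stemming. This is only an illustration: crudeStem is a deliberately simple stand-in (it strips one plural "s") for whatever stemmer the tool actually uses, and the later steps (broadening the scope, falling back to the most frequent expansion) are left to the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class TieBreaker {
    // Returns the long form with the highest count, or empty when the top count is tied.
    static Optional<String> uniqueMostFrequent(List<String> longForms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String longForm : longForms) {
            counts.merge(longForm, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            int count = entry.getValue();
            if (count > bestCount) {
                best = entry.getKey();
                bestCount = count;
                tie = false;
            } else if (count == bestCount) {
                tie = true;
            }
        }
        if (best == null || tie) {
            return Optional.empty();
        }
        return Optional.of(best);
    }

    // Crude stand-in for a real stemmer (assumption): strips a single plural "s".
    static String crudeStem(String word) {
        boolean plural = word.length() > 3 && word.endsWith("s") && !word.endsWith("ss");
        return plural ? word.substring(0, word.length() - 1) : word;
    }

    // Step 1: raw frequencies. Step 2: stem the long forms and count again.
    static Optional<String> choose(List<String> longForms) {
        Optional<String> best = uniqueMostFrequent(longForms);
        if (best.isPresent()) {
            return best;
        }
        List<String> stemmed = new ArrayList<>();
        for (String longForm : longForms) {
            stemmed.add(crudeStem(longForm));
        }
        return uniqueMostFrequent(stemmed);
    }

    public static void main(String[] args) {
        List<String> found = new ArrayList<>();
        found.add("string");
        found.add("strings");
        found.add("stream");
        System.out.println(choose(found));
    }
}
```

An empty result here would be the signal to broaden the scope and count again.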

14
Most Frequent Expansion (MFE)
  • If still no ideal candidate is found
  • We mined long forms from 1.5 million LOC of Java
    5 code base
  • Return most frequent long form as last resort
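
A last-resort lookup of this kind can be pictured as a table of mined (short form, long form) counts. The sketch below assumes such pairs have already been extracted from a code base; the class and method names are made up for the example, and how ties between equally frequent long forms should be resolved is not specified in the slides, so the code leaves that arbitrary.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class MostFrequentExpansion {
    // shortForm -> (longForm -> number of times the pair was mined)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Record one mined occurrence of a short form together with its long form.
    void record(String shortForm, String longForm) {
        counts.computeIfAbsent(shortForm, key -> new HashMap<>())
              .merge(longForm, 1, Integer::sum);
    }

    // Return the most frequently mined long form, or empty if the short form is unknown.
    // Ties are broken arbitrarily in this sketch.
    Optional<String> lookup(String shortForm) {
        Map<String, Integer> longForms = counts.get(shortForm);
        if (longForms == null) {
            return Optional.empty();
        }
        return longForms.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        MostFrequentExpansion mfe = new MostFrequentExpansion();
        mfe.record("str", "string");
        mfe.record("str", "string");
        mfe.record("str", "stream");
        System.out.println(mfe.lookup("str"));
    }
}
```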

15
Evaluation of Scoped Approach
  • 250 abbreviations from 5 subject programs
  • Gold standard developed by human developer
    inspecting the code manually
  • Implemented LFB (the Lawrie, Feild, and Binkley
    approach) according to its description
  • Except combination words, due to missing
    database

[Chart: accuracy]
16
Analysis and Refinement - iScope
  • Analyzed results and found 3 major sources of
    problems
  • Developed iScope by addressing these 3 major
    problem areas

17
Order of Scoping
  • Problem
  • Scoped approach ordering: examine every context
    for an abbreviation type, then go to next type
  • Investigating broader contexts for one type
    before even the narrowest context for another
    type is likely to yield incorrect matches
  • Insight: Context is more sensitive than type
  • Solution: Check each type at each context level,
    then go to next context level (switch order)
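
The switched order is a nested loop with the context level on the outside and the abbreviation type on the inside. The sketch below is an assumption-laden illustration: the Finder interface is a hypothetical stand-in for the real per-context search, and the enum orders are taken from the type and context ordering slides.

```java
import java.util.Optional;

class IScopeSearch {
    enum AbbrevType { ACRONYM, PREFIX, DROPPED_LETTER, COMBO_MULTIWORD }
    enum Context { JAVADOC, TYPE, METHOD_NAME, STATEMENT, METHOD, METHOD_COMMENTS, CLASS_COMMENTS }

    // Hypothetical hook: looks for a long form of one type within one context.
    interface Finder {
        Optional<String> find(String shortForm, AbbrevType type, Context context);
    }

    // Context-first ordering: try every type at the narrowest context
    // before moving on to the next, broader context.
    static Optional<String> expand(String shortForm, Finder finder) {
        for (Context context : Context.values()) {
            for (AbbrevType type : AbbrevType.values()) {
                Optional<String> longForm = finder.find(shortForm, type, context);
                if (longForm.isPresent()) {
                    return longForm;
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A prefix match exists only in class comments, a dropped-letter match in Javadoc.
        Finder finder = (shortForm, type, context) -> {
            if (type == AbbrevType.PREFIX && context == Context.CLASS_COMMENTS) {
                return Optional.of("broad prefix match");
            }
            if (type == AbbrevType.DROPPED_LETTER && context == Context.JAVADOC) {
                return Optional.of("narrow dropped-letter match");
            }
            return Optional.empty();
        };
        System.out.println(expand("x", finder));
    }
}
```

In the toy finder above, the context-first loop returns the Javadoc match, whereas a type-first loop would have returned the class-comment prefix match because Prefix precedes Dropped Letter.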

18
Single Letter Abbreviations
  • Problem
  • Developers use single letter abbreviations
    differently than multiple letter abbreviations
  • A large subset are actually semantically
    meaningless
  • Single letters are very easily matched, especially
    because prefix matching is greedy
  • Reader r = new BufferedReader()
  • Insight: Based on manual inspection, we found
    that meaningful single letter short forms were
    identifiers whose long forms were also their type
    name
  • Solution: Limit contextual scope to type only
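
One way to picture the type-only restriction: a single letter is expanded only when it matches the first letter of the variable's declared type, and is otherwise left alone as meaningless. The sketch below is an assumption about how that check could look; the class name and the first-letter rule are illustrative, not taken from the tool.

```java
import java.util.Optional;

class SingleLetterExpander {
    // Expand a single-letter identifier only from its declared type name.
    static Optional<String> expand(String shortForm, String declaredType) {
        if (shortForm.length() != 1 || declaredType.isEmpty()) {
            return Optional.empty();
        }
        char letter = Character.toLowerCase(shortForm.charAt(0));
        char typeInitial = Character.toLowerCase(declaredType.charAt(0));
        if (letter == typeInitial) {
            return Optional.of(declaredType);
        }
        // No match in the type: treat the letter as having no meaningful long form.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(expand("r", "Reader"));
        System.out.println(expand("e", "Throwable"));
    }
}
```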

19
Hyper-Common Abbreviation
  • Problem: Some abbreviations are used so often in
    code that the long form rarely ever co-occurs,
    leading to incorrect expansion based on coincidence
  • Solution: Mine a small set of extremely common
    abbreviations and use as a preprocessing step

20
Mined list of hyper-common abbreviations
21
Evaluations
  • Is our method accurate enough to be useful?
  • Reevaluation of previous experiment
  • Does abbreviation expansion help maintenance
    tasks?
  • Simple Search
  • Concern Location Task

22
1. Reevaluation of Previous Test
  • Based on our previous experimental methodology
    and metrics, how much improvement was made from
    Scope to iScope?
  • Modified gold set based on new assumptions:
    single letter abbreviations

23
1. Reevaluation of Previous Test - Results
  • Compare LFB with Scope and iScope using
    non-combinational word (NCW) accuracy values
  • Compare JavaMFE, ProgMFE, Scope, and iScope using
    the total accuracy values

24
2. Simple Search Evaluation
  • When abbreviations are expanded in software, how
    many more search results are returned than
    without expansion?
  • Focus: Recall
  • Not missing important results: want as many
    potentially relevant results as possible
  • Metric: Percent increase (P.I.) in results
  • P.I. = (Raw returned results with expansion /
    Raw returned results without expansion) x 100 - 100
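
As a small worked check of the metric, the sketch below computes the percent increase from two raw result counts. The class and method names are invented for the example; the sample counts in main are the totals reported in the results table of this evaluation (240,752 without expansion, 284,160 with Scope).

```java
class PercentIncrease {
    // Percent increase = (results with expansion / results without expansion) * 100 - 100
    static double percentIncrease(long withExpansion, long withoutExpansion) {
        if (withoutExpansion <= 0) {
            throw new IllegalArgumentException("baseline result count must be positive");
        }
        return (double) withExpansion / withoutExpansion * 100.0 - 100.0;
    }

    public static void main(String[] args) {
        // Totals taken from the results table: No Expansion vs. Scope.
        System.out.printf("%.2f%n", percentIncrease(284_160L, 240_752L));
    }
}
```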

25
2. Simple Search Evaluation (cont)
  • Subjects: 215 concerns (Eaddy et al.) annotated by
    3 people each, for a total of 645 queries
  • Developed independent of the idea of abbreviation
    expansion: many queries might not be affected by
    abbreviation expansion at all
  • Match: if any word in the query matches any
    word in the method, it is considered a match and
    returned as a result
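
The match rule is a set-intersection test between query words and method words. The sketch below assumes a simple lowercase tokenization on non-alphanumeric characters, which the slides do not specify, and uses an assumed expansion of dht to "distributed hash table" purely to show how expansion can turn a miss into a hit.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SimpleSearch {
    // Assumed tokenization: lowercase, split on anything that is not a letter or digit.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    // A method is a match if it shares at least one word with the query.
    static boolean isMatch(String query, String methodText) {
        return !Collections.disjoint(words(query), words(methodText));
    }

    // Raw number of returned results for one query.
    static int countResults(String query, List<String> methods) {
        int count = 0;
        for (String method : methods) {
            if (isMatch(query, method)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String query = "distributed hash";
        System.out.println(isMatch(query, "dht is enabled"));
        // Same method text after an assumed expansion of "dht".
        System.out.println(isMatch(query, "distributed hash table is enabled"));
    }
}
```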

26
2. Simple Search Evaluation - Results
Approach        Total Returned Results    Percent Increase
No Expansion    240,752                   ---
Scope           284,160                   18.03
iScope          282,489                   17.34
  • Less increase with iScope: single letter
    abbreviation false positives decrease
  • Ideally, this means quality is better
  • See experiment 3

27
3. Evaluation with Concern Location
  • Concern location task: identification of methods
    that are deemed to be relevant for the given
    search term
  • How much increase in effectiveness can be gained
    from expanding abbreviations in source code when
    performing concern location tasks?

28
3. Evaluation Methodology
  • Tools: Latent Semantic Indexing (LSI) and Log
    Entropy-based concern location
  • Goals: Attempt to calculate similarity values
    based on location and frequency of potential
    query matches
  • Subjects: same as previous experiment

29
3. Methodology (cont)
  • Metric: Mean Average Precision (MAP)
  • Precision = True positives / Total number of
    positives
  • MAP:
  • Collect precision values for every new true
    positive, going down the ranked returned results
  • Then take average of all results
  • Attempts to reward highly ranked true positives
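
The steps above can be sketched as follows, assuming each query's ranked results are given as a boolean array in which true marks a true positive. This follows the slide's description (average the precision values collected at each true positive); note that some definitions of average precision divide by the total number of relevant items instead, which is not what this sketch does.

```java
import java.util.ArrayList;
import java.util.List;

class MeanAveragePrecision {
    // Average of the precision values recorded at each true positive in the ranking.
    static double averagePrecision(boolean[] rankedRelevance) {
        int hits = 0;
        double precisionSum = 0.0;
        for (int rank = 1; rank <= rankedRelevance.length; rank++) {
            if (rankedRelevance[rank - 1]) {
                hits++;
                precisionSum += (double) hits / rank; // precision at this rank
            }
        }
        return hits == 0 ? 0.0 : precisionSum / hits;
    }

    // Mean of the per-query average precision values.
    static double meanAveragePrecision(List<boolean[]> queries) {
        if (queries.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (boolean[] rankedRelevance : queries) {
            total += averagePrecision(rankedRelevance);
        }
        return total / queries.size();
    }

    public static void main(String[] args) {
        List<boolean[]> queries = new ArrayList<>();
        queries.add(new boolean[] {true, false, true});
        queries.add(new boolean[] {false, true});
        System.out.println(meanAveragePrecision(queries));
    }
}
```

Because a true positive at rank 1 contributes a precision of 1.0 while the same hit at rank 2 contributes only 0.5, rankings that place relevant methods near the top score higher.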

30
3. Concern Location Tasks - Results
31
3. Concern Location Tasks - Results
32
Conclusions
  • Abbreviation expansion is proven to be helpful in
    maintenance tools and processes
  • iScope approach improves upon Scope and greatly
    upon state-of-the-art

33
Future Work
  • Further refinement of expansion process to
    achieve highest possible accuracy
  • Full integration into maintenance tool
  • Extension into other programming languages

34
Acknowledgments
  • Emily Hill and Haley Boyd
  • Dr. Vijay K. Shanker and Dr. Lori Pollock

35
Questions?
36
Inherent Inaccuracy
  • Problem: Additional errors in code not
    generalizable into solvable problems
  • Insight: There will always be inherent error when
    developing automatic systems for non-standard
    input