Developing an Application in Support of Counter Terrorism, by William Leannah, Qin Wang and Akshat Kapoor

Provided by: akshat
Learn more at: http://www.mscs.mu.edu
1
Developing an Application in Support of Counter Terrorism
By William Leannah, Qin Wang and Akshat Kapoor (Group 1)
2
Starting Point: Unintended Information Revelation (UIR) at SUNY Buffalo
  • The system permits users to find the best trail
    of evidence through many documents that connects
    two or more apparently unrelated concepts.

A → B → C Query Definitions: The system tries to
find how Article A might be linked to Article C
through Article B.
3
Development Tools
  • Python 2.4
  • NLTK 1.4.1
  • Microsoft Access 2003
  • Graphviz

4
Data Sets
  • Development set (pieces A, B and C of the 9/11
    Commission Report)
  • Training set (pieces D, E and F of the 9/11
    Commission Report)
  • Test set (we will select three different
    articles)

5
System Architecture
[Figure: system architecture flowchart]
  • Input: text documents with no apparent common
    link (Documents A, B and C)
  • Check stop words
  • Tag all text using Brown Corpus
  • Chunking to identify noun phrases
  • Article comparison
  • Create word index
  • Compute strength of link
  • Output: Concept Chain Graph
Numeric labels shown in the figure: 0.101, 0.088, 0.013, 0.239, 0.124, 0.123, 0.01, 0.1065, 0.54, 0.2324
6
1. Stop Words Eliminator: stopWordsCheck.py
  • This program, when run, identifies and eliminates
    pre-specified stop words in any given text
    document (command line argument)
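The stop-word step can be sketched in a few lines of Python. This is only an illustration, not the authors' stopWordsCheck.py: the stop list `STOP_WORDS`, the whitespace tokenizer and the function names are all assumptions. Stop words are replaced with an empty string instead of being dropped, which matches the sample "clean stream" output in this presentation and keeps token positions stable.

```python
# Hypothetical stop list; the presentation does not publish the one it used.
STOP_WORDS = {"from", "the", "to", "a", "of", "at", "into", "were", "on", "and"}


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()


def eliminate_stop_words(tokens, stop_words=STOP_WORDS):
    """Replace every stop word with '' so the stream keeps its length."""
    return ["" if token.lower() in stop_words else token for token in tokens]


original_stream = tokenize("From Wikipedia the free encyclopedia")
clean_stream = eliminate_stop_words(original_stream)
print("Original stream:", original_stream)
print("Clean stream:", clean_stream)
```

Keeping the empty placeholders means a later position-based step can still refer to the original word offsets.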

7
stopWordsCheck.py
  • Output
  • pascal>python stopWordsCheck.py A.txt
  • Original stream
  • <<from>, <wikipedia>, <,>, <the>, <free>,
    <encyclopedia>, <.>, <jump>, <to>, <>,
    <navigation>, <,>, <search>, <a>, <sequential>,
    <look>, <at>, <united>, <flight>, <175>,
    <crashing>, <into>, <the>, <south>, <tower>,
    <of>, <the>, <world>, <trade>, <center>, <the>,
    <september>, <11>, <,>, <2001>, <attacks>,
    <were>, <a>, <series>, <of>, <suicide>,
    <attacks>, <against>, <the>, <united>, <states>,
    <conducted>, <on>, <tuesday>, <,>, <september>,
    <11>, <,>, <2001>, <.>, <according>, <to>, <the>,
    <official>, <9/11>, <commission>, <report>, <,>,
    <nineteen>, <men>, <affiliated>, <with>, <osama>,
    <bin>, <laden>, <and>, <al-qaeda>, <,>, <a>,
    <loose>, <network>, <of>, <sunni>, <islamist>,
    <terrorists>, <,>, <simultaneously>, <hijacked>,
    <four>, <u>, <.>, <s>, <.>, <domestic>,
    <commercial>, <airliners>, <.>, <two>, <were>,
    <crashed>, <into>, <the>, <world>, <trade>,
    <center>, <in>, <manhattan>, <,>, <new>, <york>,
    <city>, <\x97>, <one>, <into>, <each>, <of>,
    <the>, <two>, <tallest>, <towers>, <,>, <about>,
    <18>, <minutes>, <apart>, <\x97>, <shortly>,
    <after>, <which>, <both>, <towers>, <collapsed>,
    <.>, <the>, <third>, <aircraft>, <was>,
    <crashed>, <into>, <the>, <u>, <.>, <s>, <.>,
    <department>, <of>, <defense>, <headquarters>,
    <,>, <the>, <pentagon>, <,>, <in>, <arlington>,
    <county>, <,>, <virginia>, <.>, <the>, <fourth>,
    <plane>, <was>, <crashed>, <into>, <a>, <rural>,
    <field>, <in>, <somerset>, <county>, <,>,
    <pennsylvania>, <,>, <80>, <miles>, <(129>,
    <km)>, <east>, <of>, <pittsburgh>, <,>,
    <following>, <passenger>, <resistance>, <.>,
    <the>, <official>, <count>, <records>, <2,986>,
    <deaths>, <in>, <the>, <attacks>, <.>>

8
stopWordsCheck.py
  • Clean stream
  • '', <wikipedia>, <,>, '', <free>,
    <encyclopedia>, <.>, <jump>, '', <>,
    <navigation>, <,>, <search>, '', <sequential>,
    '', '', <united>, <flight>, <175>, <crashing>,
    '', '', <south>, <tower>, '', '', <world>,
    <trade>, <center>, '', <september>, <11>, <,>,
    <2001>, <attacks>, '', '', <series>, '',
    <suicide>, <attacks>, '', '', <united>, <states>,
    <conducted>, '', <tuesday>, <,>, <september>,
    <11>, <,>, <2001>, <.>, '', '', '', <official>,
    <9/11>, <commission>, <report>, <,>, <nineteen>,
    <men>, <affiliated>, '', <osama>, <bin>, <laden>,
    '', <al-qaeda>, <,>, '', <loose>, <network>, '',
    <sunni>, <islamist>, <terrorists>, <,>,
    <simultaneously>, <hijacked>, '', '', <.>, '',
    <.>, <domestic>, <commercial>, <airliners>, <.>,
    '', '', <crashed>, '', '', <world>, <trade>,
    <center>, '', <manhattan>, <,>, '', <york>,
    <city>, <\x97>, '', '', '', '', '', '',
    <tallest>, <towers>, <,>, '', <18>, <minutes>,
    '', <\x97>, <shortly>, '', '', '', <towers>,
    <collapsed>, <.>, '', '', <aircraft>, '',
    <crashed>, '', '', '', <.>, '', <.>,
    <department>, '', <defense>, <headquarters>, <,>,
    '', <pentagon>, <,>, '', <arlington>, <county>,
    <,>, <virginia>, <.>, '', <fourth>, <plane>, '',
    <crashed>, '', '', <rural>, <field>, '',
    <somerset>, <county>, <,>, <pennsylvania>, <,>,
    <80>, <miles>, <(129>, <km)>, <east>, '',
    <pittsburgh>, <,>, '', <passenger>, <resistance>,
    <.>, '', <official>, <count>, <records>, <2,986>,
    <deaths>, '', '', <attacks>, <.>

9
2. Tagging
  • run_article_tagger.py
  • Use the Brown Corpus to tag the three articles A, B and C
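The slides do not say which tagging model run_article_tagger.py uses, so the following is only a sketch of one simple possibility: a lookup tagger that assigns each word the tag it carried most often in tagged training text. The tiny `TRAINING` sample, the default tag and the function names are hypothetical; the tags themselves (`at`, `nn`, `nn-tl`, `vbd`) follow the Brown Corpus tag set.

```python
from collections import Counter, defaultdict

# Hypothetical miniature training sample in Brown-style (word, tag) pairs.
TRAINING = [
    [("World", "nn-tl"), ("Trade", "nn"), ("Center", "nn")],
    [("the", "at"), ("tower", "nn"), ("collapsed", "vbd")],
]


def train_lookup_tagger(tagged_sentences):
    """Map each lowercased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default_tag="nn"):
    """Tag tokens by lookup; unknown words get default_tag (an assumption)."""
    return [(token, model.get(token.lower(), default_tag)) for token in tokens]


model = train_lookup_tagger(TRAINING)
tagged = tag_tokens(["The", "World", "Trade", "Center", "collapsed"], model)
print(tagged)
```

In practice the training data would be the full tagged Brown Corpus instead of two hand-written sentences.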

11
3. Chunking
  • Chunking is performed on all three
    documents, in order to identify noun phrases.
  • For example, we need to identify <World Trade
    Center>, instead of three separate tokens of
    <World>, <Trade> and <Center>.
  • Similarly, we must identify <September 11, 2001>
    instead of <September> <11,> <2001>.

12
run_article_chunk.py
Output (Chunked_B_rule_nn-tl_nn_nn.txt)
(NP: <War/nn-tl> <Games/nn>)
(NP: <Field/nn-tl> <Training/nn> <Exercise/nn>)
(NP: <Senator/nn-tl> <Max/nn> <Cleland/nn>)
(NP: <Lt./nn-tl> <Col./nn> <Dawne/nn>)
(NP: <Field/nn-tl> <Training/nn> <Exercise/nn>)
(NP: <House/nn-tl> <counter-terrorism/nn> <official/nn>)
(NP: <Commissioner/nn-tl> <Jamie/nn> <Gorelick/nn>)
(NP: <World/nn-tl> <Trade/nn> <Center/nn>)
(NP: <World/nn-tl> <Trade/nn> <Center/nn>)
(NP: <World/nn-tl> <Trade/nn> <Center/nn>)
(NP: <Commissioner/nn-tl> <Jamie/nn> <Gorelick/nn>)
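The output file name and the chunks listed above suggest a rule of the form "an nn-tl token followed by nn tokens". A minimal sketch of such a rule is given below. It is a guess at the pattern, not the authors' run_article_chunk.py: the function name, the `max_following` limit of two and the sample input are assumptions.

```python
def chunk_noun_phrases(tagged_tokens, max_following=2):
    """Group an 'nn-tl' token with the 1..max_following 'nn' tokens after it."""
    chunks = []
    i = 0
    while i < len(tagged_tokens):
        if tagged_tokens[i][1] == "nn-tl":
            j = i + 1
            # Extend the chunk while the next token is a plain noun.
            while (j < len(tagged_tokens)
                   and j - i <= max_following
                   and tagged_tokens[j][1] == "nn"):
                j += 1
            if j > i + 1:  # at least one 'nn' followed the 'nn-tl' token
                chunks.append(" ".join(word for word, _ in tagged_tokens[i:j]))
                i = j
                continue
        i += 1
    return chunks


sample = [
    ("the", "at"), ("World", "nn-tl"), ("Trade", "nn"), ("Center", "nn"),
    ("collapsed", "vbd"), ("War", "nn-tl"), ("Games", "nn"),
]
noun_phrases = chunk_noun_phrases(sample)
print(noun_phrases)
```

Each chunk is returned as one string, so later steps can compare whole phrases such as "World Trade Center" instead of single tokens.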
13
Chunking
We used a trigram model to perform chunking. This
was essential in order to minimize false
positives. For example:
  • Richard B. Meyers (exists in Article B)
  • Richard Stanley (exists in Article C)
<Richard B. Meyers> is obviously different from
<Richard Stanley>. A unigram model would see only
<Richard> in both articles, so it could not tell
the two apart and would have produced a false
positive by creating a link between the articles.
14
4. Article Comparison: run_article_compare.py
  • We compare the three articles with each other,
    pair by pair, to identify words that they have
    in common:
  • A to B
  • B to C
  • A to C

15
  • Example
  • A contains World Trade Center
  • B contains World Trade Center
  • B contains Sec. Donald Rumsfeld
  • C contains Sec. Donald Rumsfeld
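  • We have thus identified a chain of common links:
    World Trade Center connects Article A with
    Article B, and Sec. Donald Rumsfeld connects
    Article B with Article C, so Article A is linked
    to Article C through Article B.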
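The A to B to C linking in this example can be sketched with set intersections over each article's noun phrases. This is an illustration under assumptions, not the authors' run_article_compare.py: the function names and the small phrase sets are hypothetical, and phrases are compared as whole strings.

```python
def common_phrases(phrases_x, phrases_y):
    """Noun phrases that appear in both articles."""
    return set(phrases_x) & set(phrases_y)


def find_chain(phrases_a, phrases_b, phrases_c):
    """Return (A-B links, B-C links), or None if either side has no link."""
    links_ab = common_phrases(phrases_a, phrases_b)
    links_bc = common_phrases(phrases_b, phrases_c)
    if links_ab and links_bc:
        return links_ab, links_bc
    return None


# Hypothetical phrase sets built from the examples on the slides.
article_a = {"World Trade Center", "Manhattan"}
article_b = {"World Trade Center", "Sec. Donald Rumsfeld", "Richard B. Meyers"}
article_c = {"Sec. Donald Rumsfeld", "Richard Stanley"}

chain = find_chain(article_a, article_b, article_c)
print(chain)
```

Because whole phrases are compared, "Richard B. Meyers" and "Richard Stanley" do not match each other, which avoids the false positive that single-token matching on "Richard" would create.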

16
run_article_compare.py
Output (Compare_A_to_B_np.txt)
Article A Word: Manhattan, In Article B: False, Article B Freq: 0.0
Article A Word: Virginia, In Article B: False, Article B Freq: 0.0
Article A Word: Pennsylvania, In Article B: True, Article B Freq: 0.00037397157816
Article A Word: Pennsylvania, In Article B: True, Article B Freq: 0.00037397157816
Article A Word: Pennsylvania, In Article B: True, Article B Freq: 0.00037397157816
Article A Word: Manhattan, In Article B: False, Article B Freq: 0.0
Article A Word: September, In Article B: True, Article B Freq: 0.00149588631264
Article A Word: Washington, In Article B: True, Article B Freq: 0.00112191473448
Article A Word: Pennsylvania, In Article B: True, Article B Freq: 0.00037397157816
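The slides do not state how "Article B Freq" is computed. The listed values are whole multiples of 0.00037397157816 (about 1/2674), which would be consistent with a relative frequency: occurrences of the word in Article B divided by the number of tokens in Article B. The sketch below assumes that reading; the function names and sample tokens are hypothetical.

```python
def relative_frequency(word, tokens):
    """Occurrences of word divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens)


def compare_articles(words_a, tokens_b):
    """For each word from Article A, report presence and frequency in B."""
    rows = []
    for word in words_a:
        frequency = relative_frequency(word, tokens_b)
        rows.append((word, frequency > 0, frequency))
    return rows


# Hypothetical token stream standing in for Article B.
tokens_b = ["pennsylvania", "field", "washington", "washington"]
rows = compare_articles(["manhattan", "pennsylvania"], tokens_b)
for word, found, frequency in rows:
    print("Article A Word:", word, "In Article B:", found,
          "Article B Freq:", frequency)
```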
17
5. Work in Progress
  • Strength of links
  • Concept Chain Graph
  • Baselines
  • Validation

18
5a. Strength of Links
  • On identifying a possible link between two words,
    it might also be useful to determine the strength
    (weight) of the link.
  • We have decided to calculate the strength of each
    link, based on the proximity of words to each
    other

19
Strength of Links
  • We are creating a word index that will hold the
    position of each word in each article.
  • For example,
  • World Trade Center is at position 373
  • Sec. Donald Rumsfeld is at position 2190
  • Then,
  • Strength = distance / word count
  • = (2190 - 373) / (number of words)
  • In case of multiple links, we will calculate
    strength by taking their average.
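A short sketch of this calculation follows. The positions 373 and 2190 come from the slide; the word count of 5000 is a made-up value for illustration, since the slide does not give one, and the function names are hypothetical.

```python
def link_strength(position_1, position_2, word_count):
    """Distance between two word positions divided by the article length."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return abs(position_2 - position_1) / word_count


def average_strength(position_pairs, word_count):
    """Average the strengths of several links, as proposed on the slide."""
    if not position_pairs:
        raise ValueError("at least one link is required")
    strengths = [link_strength(p, q, word_count) for p, q in position_pairs]
    return sum(strengths) / len(strengths)


WORD_COUNT = 5000  # hypothetical article length, not from the slides
single = link_strength(373, 2190, WORD_COUNT)
averaged = average_strength([(373, 2190), (100, 600)], WORD_COUNT)
print(single, averaged)
```

With these numbers the distance is 2190 - 373 = 1817 words, so the single-link value is 1817 / 5000.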

20
5b. Concept Chain Graph
After determining all possible common links among
the articles, we plan to use Graphviz to create a
Concept Chain Graph, which would look like this:
Source: http://www.buffalo.edu/news/fast-execute.cgi/article-page.html?article72910009
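One way to hand such a graph to Graphviz is to write it out in the DOT language and render it with the `dot` command. The sketch below only builds the DOT text. The function name, the edge list and the pairing of weights with edges are hypothetical; the weights reuse two of the numeric labels from the architecture slide purely as sample values.

```python
def to_dot(edges, graph_name="concept_chain"):
    """Build an undirected DOT graph from (source, target, label, weight)."""
    lines = [f"graph {graph_name} {{"]
    for source, target, label, weight in edges:
        # Names and labels are assumed to contain no double quotes.
        lines.append(
            f'    "{source}" -- "{target}" [label="{label} ({weight:.3f})"];'
        )
    lines.append("}")
    return "\n".join(lines)


edges = [
    ("Article A", "Article B", "World Trade Center", 0.101),
    ("Article B", "Article C", "Sec. Donald Rumsfeld", 0.54),
]
dot_text = to_dot(edges)
print(dot_text)
```

Saving the printed text as chain.dot and running `dot -Tpng chain.dot -o chain.png` would produce an image of the chain.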
21
5c. Baselines and Validation
  • For establishing a baseline, we will choose
    articles where there are no known links between A
    and C to determine if we can still link them
    through B.

22
Baselines and Validation
  • We are using Swanson's research to implement a
    solution that follows Swanson exactly, and will
    then compare its results with ours.
  • We are also in touch with Dr. Rohini at SUNY and
    are exploring collaboration possibilities. We are
    due to have a conference call on Tuesday. We hope
    to use this collaboration to further validate our
    process.

23
Thank You