Title: Developing an Application in Support of Counter Terrorism. By William Leannah, Qin Wang, Akshat Kapoor
Slide 1: Developing an Application in Support of Counter Terrorism
By William Leannah, Qin Wang, Akshat Kapoor (Group 1)
Slide 2: Starting Point
Unintended Information Revelation (UIR) at SUNY Buffalo
- The system permits users to find the best trail of evidence through many documents that connects two or more apparently unrelated concepts.
- A → B → C query definition: the system tries to find how Article A might be linked to Article C through Article B.
Slide 3: Development Tools
- Python 2.4
- NLTK 1.4.1
- Microsoft Access 2003
- Graphviz
Slide 4: Data Sets
- Development set: pieces of the 9/11 Commission Report (A, B, and C)
- Training set: pieces of the 9/11 Commission Report (D, E, and F)
- Test set: we will select three different articles
Slide 5: System Architecture
- Input: text documents with no apparent common link (Documents A, B, and C)
- Create word index
- Tag all text using the Brown Corpus
- Check stop words
- Chunking to identify noun phrases
- Article comparison
- Compute strength of each link
- Output: Concept Chain Graph (diagram with edge weights such as 0.101, 0.088, 0.013, 0.239, 0.124, 0.123, 0.01, 0.1065, 0.54, 0.2324)
Slide 6: 1. Stop Words Eliminator (stopWordsCheck.py)
- This program identifies and eliminates pre-specified stop words in any given text document (passed as a command-line argument).
Slide 7: stopWordsCheck.py
- Output
- pascal> python stopWordsCheck.py A.txt
- Original stream
- <<from>, <wikipedia>, <,>, <the>, <free>, <encyclopedia>, <.>, <jump>, <to>, <>,
<navigation>, <,>, <search>, <a>, <sequential>, <look>, <at>, <united>, <flight>, <175>,
<crashing>, <into>, <the>, <south>, <tower>, <of>, <the>, <world>, <trade>, <center>, <the>,
<september>, <11>, <,>, <2001>, <attacks>, <were>, <a>, <series>, <of>, <suicide>,
<attacks>, <against>, <the>, <united>, <states>, <conducted>, <on>, <tuesday>, <,>,
<september>, <11>, <,>, <2001>, <.>, <according>, <to>, <the>, <official>, <9/11>,
<commission>, <report>, <,>, <nineteen>, <men>, <affiliated>, <with>, <osama>, <bin>,
<laden>, <and>, <al-qaeda>, <,>, <a>, <loose>, <network>, <of>, <sunni>, <islamist>,
<terrorists>, <,>, <simultaneously>, <hijacked>, <four>, <u>, <.>, <s>, <.>, <domestic>,
<commercial>, <airliners>, <.>, <two>, <were>, <crashed>, <into>, <the>, <world>, <trade>,
<center>, <in>, <manhattan>, <,>, <new>, <york>, <city>, <\x97>, <one>, <into>, <each>, <of>,
<the>, <two>, <tallest>, <towers>, <,>, <about>, <18>, <minutes>, <apart>, <\x97>, <shortly>,
<after>, <which>, <both>, <towers>, <collapsed>, <.>, <the>, <third>, <aircraft>, <was>,
<crashed>, <into>, <the>, <u>, <.>, <s>, <.>, <department>, <of>, <defense>, <headquarters>,
<,>, <the>, <pentagon>, <,>, <in>, <arlington>, <county>, <,>, <virginia>, <.>, <the>, <fourth>,
<plane>, <was>, <crashed>, <into>, <a>, <rural>, <field>, <in>, <somerset>, <county>, <,>,
<pennsylvania>, <,>, <80>, <miles>, <(129>, <km)>, <east>, <of>, <pittsburgh>, <,>,
<following>, <passenger>, <resistance>, <.>, <the>, <official>, <count>, <records>, <2,986>,
<deaths>, <in>, <the>, <attacks>, <.>>
Slide 8: stopWordsCheck.py
- Clean stream
- '', <wikipedia>, <,>, '', <free>, <encyclopedia>, <.>, <jump>, '', <>,
<navigation>, <,>, <search>, '', <sequential>, '', '', <united>, <flight>, <175>, <crashing>,
'', '', <south>, <tower>, '', '', <world>, <trade>, <center>, '', <september>, <11>, <,>,
<2001>, <attacks>, '', '', <series>, '', <suicide>, <attacks>, '', '', <united>, <states>,
<conducted>, '', <tuesday>, <,>, <september>, <11>, <,>, <2001>, <.>, '', '', '', <official>,
<9/11>, <commission>, <report>, <,>, <nineteen>, <men>, <affiliated>, '', <osama>, <bin>,
<laden>, '', <al-qaeda>, <,>, '', <loose>, <network>, '', <sunni>, <islamist>, <terrorists>,
<,>, <simultaneously>, <hijacked>, '', '', <.>, '', <.>, <domestic>, <commercial>,
<airliners>, <.>, '', '', <crashed>, '', '', <world>, <trade>, <center>, '', <manhattan>, <,>,
'', <york>, <city>, <\x97>, '', '', '', '', '', '', <tallest>, <towers>, <,>, '', <18>,
<minutes>, '', <\x97>, <shortly>, '', '', '', <towers>, <collapsed>, <.>, '', '', <aircraft>,
'', <crashed>, '', '', '', <.>, '', <.>, <department>, '', <defense>, <headquarters>, <,>,
'', <pentagon>, <,>, '', <arlington>, <county>, <,>, <virginia>, <.>, '', <fourth>, <plane>,
'', <crashed>, '', '', <rural>, <field>, '', <somerset>, <county>, <,>, <pennsylvania>, <,>,
<80>, <miles>, <(129>, <km)>, <east>, '', <pittsburgh>, <,>, '', <passenger>, <resistance>,
<.>, '', <official>, <count>, <records>, <2,986>, <deaths>, '', '', <attacks>, <.>
Slide 9: 2. Tagging
- run_article_tagger.py
- Uses the Brown Corpus to tag the three articles A, B, and C.
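The tagging step can be sketched as a most-frequent-tag lookup. The deck trains on the real Brown Corpus via NLTK; to stay self-contained, this sketch builds the lookup from a tiny hand-made tagged sample instead, so the sample pairs and the default tag are assumptions.

```python
from collections import Counter, defaultdict

# Tiny hand-made (word, Brown-style tag) sample; illustrative only,
# standing in for the full Brown Corpus used in the project.
TAGGED_SAMPLE = [
    ("the", "at"), ("world", "nn-tl"), ("trade", "nn"), ("center", "nn"),
    ("attacks", "nns"), ("crashed", "vbd"), ("the", "at"), ("pentagon", "np"),
]

def build_tagger(tagged_pairs):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_pairs:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(tokens, lexicon, default="nn"):
    # Unknown words fall back to a default noun tag (an assumption).
    return [(tok, lexicon.get(tok, default)) for tok in tokens]

lexicon = build_tagger(TAGGED_SAMPLE)
print(tag(["the", "world", "trade", "center", "collapsed"], lexicon))
```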
Slide 10: (No Transcript)
Slide 11: 3. Chunking
- Chunking is performed on all three documents in order to identify noun phrases.
- For example, we need to identify <World Trade Center> instead of three separate tokens <World>, <Trade>, and <Center>.
- Similarly, we must identify <September 11, 2001> instead of <September> <11,> <2001>.
Slide 12: run_article_chunk.py Output (Chunked_B_rule_nn-tl_nn_nn.txt)
(NP <War/nn-tl> <Games/nn>)
(NP <Field/nn-tl> <Training/nn> <Exercise/nn>)
(NP <Senator/nn-tl> <Max/nn> <Cleland/nn>)
(NP <Lt./nn-tl> <Col./nn> <Dawne/nn>)
(NP <Field/nn-tl> <Training/nn> <Exercise/nn>)
(NP <House/nn-tl> <counter-terrorism/nn> <official/nn>)
(NP <Commissioner/nn-tl> <Jamie/nn> <Gorelick/nn>)
(NP <World/nn-tl> <Trade/nn> <Center/nn>)
(NP <World/nn-tl> <Trade/nn> <Center/nn>)
(NP <World/nn-tl> <Trade/nn> <Center/nn>)
(NP <Commissioner/nn-tl> <Jamie/nn> <Gorelick/nn>)
Slide 13: Chunking
We have used a trigram model to perform chunking. This was essential in order to minimize false positives. For example:
- Richard B. Meyers (exists in Article B)
- Richard Stanley (exists in Article C)
<Richard B. Meyers> is obviously different from <Richard Stanley>. We would never have known this using a unigram model <Richard>, and it would have produced a false positive by creating a link between the articles on finding <Richard> in both.
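The chunking rule suggested by the output file name (Chunked_B_rule_nn-tl_nn_nn.txt) can be sketched as: an nn-tl token followed by one or more nn tokens is grouped into a single NP chunk. The rule and tag set here are inferred from the slide's example output, not taken from the actual run_article_chunk.py.

```python
def chunk_noun_phrases(tagged):
    """Group an nn-tl token plus following nn tokens into one NP chunk,
    an assumed reading of the rule 'nn-tl nn nn' from the file name."""
    chunks, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "nn-tl":
            np = [(word, tag)]
            i += 1
            while i < len(tagged) and tagged[i][1] == "nn":
                np.append(tagged[i])
                i += 1
            if len(np) > 1:  # keep only multi-token phrases
                chunks.append(np)
        else:
            i += 1
    return chunks

tagged = [("World", "nn-tl"), ("Trade", "nn"),
          ("Center", "nn"), ("collapsed", "vbd")]
print(chunk_noun_phrases(tagged))
```

Matching whole multi-token phrases, rather than single tokens, is what keeps <Richard B. Meyers> and <Richard Stanley> from being conflated.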
Slide 14: 4. Article Comparison (run_article_compare.py)
- We compare each of the three articles with each of the others, to identify common words among the articles:
- A to B
- B to C
- A to C
Slide 15: Example
- A contains "World Trade Center"
- B contains "World Trade Center"
- B contains "Sec. Donald Rumsfeld"
- C contains "Sec. Donald Rumsfeld"
- We have thus identified a chain of common links ("World Trade Center", "Sec. Donald Rumsfeld") that connects Article A to Article C through Article B.
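The pairwise comparison can be sketched as: for each word (or noun phrase) in one article, check whether it appears in the other and report its relative frequency there, echoing the Compare_A_to_B_np.txt format on the next slide. The function name and token lists are illustrative, not the project's actual code.

```python
def compare(words_a, tokens_b):
    """For each word of article A, report (word, present-in-B,
    relative frequency of the word in B)."""
    total_b = len(tokens_b)
    report = []
    for w in words_a:
        freq = tokens_b.count(w) / total_b if total_b else 0.0
        report.append((w, w in tokens_b, freq))
    return report

tokens_b = ["pennsylvania", "september", "washington",
            "september", "pennsylvania", "attack"]
for word, present, freq in compare(["manhattan", "pennsylvania"], tokens_b):
    print("Article A Word:", word, "| In Article B:", present,
          "| Article B Freq:", round(freq, 6))
```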
Slide 16: run_article_compare.py Output (Compare_A_to_B_np.txt)
Article A Word: Manhattan | In Article B: False | Article B Freq: 0.0
Article A Word: Virginia | In Article B: False | Article B Freq: 0.0
Article A Word: Pennsylvania | In Article B: True | Article B Freq: 0.00037397157816
Article A Word: Pennsylvania | In Article B: True | Article B Freq: 0.00037397157816
Article A Word: Pennsylvania | In Article B: True | Article B Freq: 0.00037397157816
Article A Word: Manhattan | In Article B: False | Article B Freq: 0.0
Article A Word: September | In Article B: True | Article B Freq: 0.00149588631264
Article A Word: Washington | In Article B: True | Article B Freq: 0.00112191473448
Article A Word: Pennsylvania | In Article B: True | Article B Freq: 0.00037397157816
Slide 17: 5. Work in Progress
- Strength of links
- Concept Chain Graph
- Baselines
- Validation
Slide 18: 5a. Strength of Links
- On identifying a possible link between two words, it is also useful to determine the strength (weight) of the link.
- We have decided to calculate the strength of each link based on the proximity of the words to each other.
Slide 19: Strength of Links
- We are creating a word index that records the position of each word in each article.
- For example:
  - "World Trade Center" is at position 373
  - "Sec. Donald Rumsfeld" is at position 2190
- Then:
  - Strength = distance / word count
  - = (2190 - 373) / (number of words)
- In the case of multiple links, we will calculate strength by taking their average.
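The formula above can be sketched directly; the word count used here is an illustrative assumption, since the slides do not give the articles' actual lengths.

```python
def link_strength(positions_a, positions_b, word_count):
    """strength = |pos_b - pos_a| / word_count, averaged over all
    pairings when a concept occurs more than once (the slide's
    'multiple links' case)."""
    pairs = [(a, b) for a in positions_a for b in positions_b]
    return sum(abs(b - a) for a, b in pairs) / (len(pairs) * word_count)

# Single link, using the slide's example positions and an assumed
# article length of 10,000 words:
print(link_strength([373], [2190], word_count=10000))  # (2190 - 373) / 10000 = 0.1817
```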
Slide 20: 5b. Concept Chain Graph
After determining all possible common links among the articles, we plan to use Graphviz to create a Concept Chain Graph, which would look like this:
Source: http://www.buffalo.edu/news/fast-execute.cgi/article-page.html?article72910009
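Feeding Graphviz typically means emitting a DOT description of the graph. This sketch generates such a description; the node names and edge weights are illustrative, loosely based on the examples elsewhere in the deck, and are not the project's actual output.

```python
def to_dot(edges):
    """Render (source, target, weight) triples as a Graphviz DOT
    digraph, with the link strength as the edge label."""
    lines = ["digraph concept_chain {"]
    for src, dst, weight in edges:
        lines.append('  "%s" -> "%s" [label="%.3f"];' % (src, dst, weight))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical concept chain from Article A to Article C through B:
edges = [
    ("Article A", "World Trade Center", 0.101),
    ("World Trade Center", "Article B", 0.088),
    ("Article B", "Sec. Donald Rumsfeld", 0.124),
    ("Sec. Donald Rumsfeld", "Article C", 0.239),
]
print(to_dot(edges))
```

The resulting text can be rendered with Graphviz, e.g. `dot -Tpng chain.dot -o chain.png`.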
Slide 21: 5c. Baselines & Validation
- To establish a baseline, we will choose articles with no known links between A and C, to determine whether we can still link them through B.
Slide 22: Baselines & Validation
- We are using the Swanson research to implement a solution that follows Swanson exactly, and will then compare its results with ours.
- We are also in touch with Dr. Rohini at SUNY and are exploring collaboration possibilities. We are due to have a conference call on Tuesday. We hope to use this collaboration to further validate our process.
Slide 23: Thank You