Grouping Search-Engine Returned Citations for Person Name Queries - PowerPoint PPT Presentation

1
Grouping Search-Engine Returned Citations for
Person Name Queries
  • Reema Al-Kamha

Research Supported by NSF
2
The Problem
  • Search engines return too many citations.
  • Example: Kelly Flanagan.
  • Google returns around 685 citations.
  • Many people are named Kelly Flanagan.
  • It would help to group the citations by person.
  • How do we group them?

3
Kelly Flanagan Query to Google
4

Our Solution
  • A Multi-faceted approach
  • Attributes
  • Links
  • Page Similarity
  • Confidence matrix for each facet
  • Final confidence matrix
  • Grouping algorithm

5
A Multi-faceted Approach
  • Gather evidence from each of several different
    facets
  • Combine the evidence

6
Attributes
  • Phone number, email address, state, city, zip
    code.
  • Regular expression for each attribute.
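
The slides do not show the regular expressions themselves. As a minimal sketch, assuming US-style formats, attribute extraction and comparison could look like the following. The patterns and the names ATTRIBUTE_PATTERNS, extract_attributes, and shared_attributes are illustrative assumptions, and only phone, email, and zip code are covered.

```python
import re

# Hypothetical, simplified patterns (US-style formats assumed).
# The presentation does not give the actual expressions.
ATTRIBUTE_PATTERNS = {
    "phone": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def extract_attributes(text):
    """Return {attribute name: set of matched values} for one page."""
    return {
        name: {m.group(0).lower() for m in pattern.finditer(text)}
        for name, pattern in ATTRIBUTE_PATTERNS.items()
    }


def shared_attributes(text_a, text_b):
    """Names of attributes for which two pages share at least one value."""
    a, b = extract_attributes(text_a), extract_attributes(text_b)
    return {name for name in ATTRIBUTE_PATTERNS if a[name] & b[name]}
```

Two pages that both mention the same email address and zip code would then report {"email", "zip"} as shared evidence for the attribute facet.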

7
Links
  • People usually post information on only a few
    host servers.
  • Returned citations that have the same host.
  • People often link one page about a person to
    another page about the same person.
  • The URL of one citation has the same host as one
    of the URLs on the web page referenced by the
    other citation.
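
The two kinds of link evidence above can be sketched as follows. This assumes absolute URLs and that the outgoing links of each page have already been collected; the function names are hypothetical.

```python
from urllib.parse import urlparse


def host(url):
    """Lower-cased host part of an absolute URL ('' if there is none)."""
    return urlparse(url).netloc.lower()


def same_host(url_a, url_b):
    """Evidence 1: the two citation URLs are on the same host."""
    return host(url_a) != "" and host(url_a) == host(url_b)


def links_to_host(page_links, other_url):
    """Evidence 2: some link on one page has the host of the other citation."""
    target = host(other_url)
    return target != "" and any(host(link) == target for link in page_links)
```

The slides also distinguish popular from non-popular hosts; that check is not part of this sketch.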

8
Links (Cont)
9
Page Similarity
  • Adjacent cap-word pairs.
  • Cap-Word (Connector | Preposition (Article)? |
    (Capital-Letter Dot))? Cap-Word.

10
Page Similarity
  • The number of shared adjacent cap-word pairs (1,
    2, 3, or 4 or more).
  • Ignore adjacent cap-word pairs that often occur
    on web pages (Home Page and Privacy Policy) by
    constructing a stop-word list.
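
A rough sketch of the page similarity facet, under stated assumptions: the connector, preposition, and article word lists below are guesses, the stop list holds only the two pairs named on the slide, and the function names are hypothetical. A lookahead is used so that overlapping pairs (for example "Palm Desert" and "Desert Real") are both found.

```python
import re

CAP = r"[A-Z][A-Za-z'-]+"
# Optional middle element: a connector, a preposition with an optional
# article, or a capital letter followed by a dot. Word lists are assumed.
MIDDLE = r"(?:and|or|(?:of|for|in|at|on)(?:\s+(?:the|a|an))?|[A-Z]\.)"
# The lookahead keeps matches zero-width, so overlapping pairs are all found.
PAIR = re.compile(
    rf"(?=(?<![\w'-])({CAP}(?:\s+{MIDDLE})?\s+{CAP})(?![\w'-]))"
)

# Pairs that occur on many web pages and carry no evidence.
STOP_PAIRS = {"Home Page", "Privacy Policy"}


def cap_word_pairs(text):
    """Set of adjacent cap-word pairs on a page, minus the stop list."""
    found = {" ".join(m.group(1).split()) for m in PAIR.finditer(text)}
    return found - STOP_PAIRS


def shared_pair_bucket(text_a, text_b):
    """Number of shared pairs, capped at 4 ('4 or more'), plus the pairs."""
    shared = cap_word_pairs(text_a) & cap_word_pairs(text_b)
    return min(len(shared), 4), shared
```

The capped count corresponds to the evidence levels on the slide (1, 2, 3, or 4 or more shared pairs).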

11
Confidence Matrix Construction
  • For each facet we construct a confidence matrix.

     C1    C2    ..    Ci    ..    Cj    ..    Cn
C1   1     C12         C1i         C1j         C1n
C2         1           C2i         C2j         C2n
..
Ci                     1           Cij         Cin
..
Cj                                 1           Cjn
..
Cn                                             1

Cij = 0 if there is no evidence for facet f;
otherwise Cij = P(Ci and Cj refer to the same person | evidence for facet f).
  • Training set to compute the conditional
    probabilities.

12
Confidence Matrix Construction (Cont)
  • We select 9 person names.
  • For each name we collect the first 50 citations.
  • For 50 citations we have 1,225 comparison pairs.
  • The size of our training set is 11,025.
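
The training-set size follows from counting unordered pairs: 50 citations give 50 * 49 / 2 = 1,225 pairs per name, and 9 names give 9 * 1,225 = 11,025 pairs. The sketch below checks that arithmetic and shows one plausible way to estimate a conditional probability from labeled pairs, as a relative frequency. The slides do not state the estimator, so estimate_confidence is an assumption.

```python
from math import comb

NAMES = 9
CITATIONS_PER_NAME = 50

pairs_per_name = comb(CITATIONS_PER_NAME, 2)  # 50 * 49 / 2
training_size = NAMES * pairs_per_name


def estimate_confidence(labeled_pairs):
    """Estimate P(same person | evidence) as a relative frequency.

    labeled_pairs: iterable of (evidence_present, same_person) booleans,
    one entry per comparison pair in the training set.
    """
    outcomes = [same for evidence, same in labeled_pairs if evidence]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```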

13
Confidence Matrix Construction (Cont)
For the attribute facet: P(Same Person = Yes | Email = Yes), P(Same Person = Yes | City = Yes and State = Yes)
For the link facet: P(Same Person = Yes | Host1 = Yes and Host1 is non-popular)
For the page similarity facet: P(Same Person = Yes | Share2 = Yes)
14
Confidence Matrix for Attribute Facet
C1 and C2 have the same city, state, and zip code:
Provo, UT, and 84604. C1 and C8, and C2 and C8,
have the same city and state: Provo and UT.
C4 and C7 have the same city and state:
Palm Desert and California.
     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.99  0     0     0     0     0     0.96  0     0
C2         1     0     0     0     0     0     0.96  0     0
C3               1     0     0     0     0     0     0     0
C4                     1     0     0     0.96  0     0     0
C5                           1     0     0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1
15
Confidence Matrix for Link Facet
C1 and C2 have the same host name, and C1 refers
to the host of C2. C5 and C6 have the same host
name. C3 refers to the host of C5 and to the host
of C6.
     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.99  0     0     0     0     0     0     0     0
C2         1     0     0     0.99  0     0     0     0     0
C3               1     0     0     0.99  0     0     0     0
C4                     1     0     0     0     0     0     0
C5                           1     0.99  0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1
16
Confidence Matrix for Page Similarity Facet
C1 and C2 share Associate Professor, Brigham
Young, Performance Evaluation, Trace Collection,
Computer Organization, Computer Architecture.
C2 and C3 share Memory Hierarchy, Brent E.
Nelson, System-Assisted Disk, Simulation
Technique, Stochastic Disk, Winter Simulation,
Chordal Spoke, Interconnection Network,
Transaction Processing, Benchmarks Using,
Performance Studies, Incomplete Trace, Heng Zho.
C1 and C8, and C2 and C8, share Brigham Young.
C4 and C7 share Palm Desert, Real Estate, Desert
Real.
     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.95  0     0     0     0     0     0.78  0     0
C2         1     0.95  0     0     0     0     0.78  0     0
C3               1     0     0     0     0     0     0     0
C4                     1     0     0     0.92  0     0     0
C5                           1     0     0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1
17
Final Matrix
  • Combine the confidence matrices for the three
    facets using Stanford Certainty Measure.
  • For some observation B:
  • if CF(E1) is the certainty factor associated
    with E1, and
  • CF(E2) is the certainty factor
    associated with E2,
  • then the new certainty factor for B is
  • CF(E1) + CF(E2) - CF(E1) * CF(E2).

18
Final Matrix (Cont)
Confidence Matrix for Attributes
Confidence Matrix for Links
Confidence Matrix for Page Similarity
0.96 + 0 + 0.78 - 0.96 * 0 - 0.96 * 0.78 - 0.78 * 0 + 0.96 * 0 * 0.78 = 0.9912
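
The combination step can be sketched as below. It applies the two-factor rule CF(E1) + CF(E2) - CF(E1) * CF(E2) repeatedly, which for three facets expands to the expression shown above. The function names and the use of nested lists for the matrices are assumptions made for illustration.

```python
def combine_two(cf1, cf2):
    """Certainty factor for two pieces of supporting evidence."""
    return cf1 + cf2 - cf1 * cf2


def combine(cfs):
    """Fold the two-factor rule over any number of certainty factors."""
    total = 0.0
    for cf in cfs:
        total = combine_two(total, cf)
    return total


def combine_matrices(matrices):
    """Combine equally sized facet matrices cell by cell."""
    n = len(matrices[0])
    return [
        [combine([m[i][j] for m in matrices]) for j in range(n)]
        for i in range(n)
    ]
```

For the values 0.96 (attributes), 0 (links), and 0.78 (page similarity), combine returns 0.9912 up to floating-point rounding, matching the worked example.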
19
Final Confidence Matrix
     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.95  0     0     0     0     0     0.99  0     0
C2         1     0.95  0     0     0     0     0.99  0     0
C3               1     0     0.99  0.99  0     0     0     0
C4                     1     0     0     0.99  0     0     0
C5                           1     0     0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1
20
Grouping Algorithm
  • Input: the final confidence matrix.
  • Output: groups of search-engine returned
    citations, such that each group refers to the
    same person.
  • The idea:
  • if {Ci, Cj} and {Cj, Ck} are each highly
    confident pairs, then group {Ci, Cj, Ck}.
  • The threshold we use for "highly
    confident" is 0.8.
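
One way to realize this transitive grouping is a union-find pass over all pairs whose confidence reaches the threshold. This is a sketch, not necessarily the implementation behind the slides; group_citations is a hypothetical name, and treating a value equal to 0.8 as highly confident is an assumption.

```python
def group_citations(matrix, threshold=0.8):
    """Group citations linked by chains of highly confident pairs.

    matrix: n x n list of lists; only the upper triangle (i < j) is read.
    Returns sorted groups of 1-based citation numbers.
    """
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i + 1)
    return sorted(groups.values())


# Non-zero off-diagonal cells of the final confidence matrix in the slides.
confident = {(1, 2): 0.95, (1, 8): 0.99, (2, 3): 0.95, (2, 8): 0.99,
             (3, 5): 0.99, (3, 6): 0.99, (4, 7): 0.99}
final = [[1.0 if i == j else 0.0 for j in range(10)] for i in range(10)]
for (i, j), value in confident.items():
    final[i - 1][j - 1] = value

print(group_citations(final))
# [[1, 2, 3, 5, 6, 8], [4, 7], [9], [10]]
```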

21
Grouping Algorithm(Cont)
{C1, C2}, {C2, C3}, {C3, C5}, {C3, C6}, {C4, C7},
{C1, C8}, {C2, C8}
Group 1: {C1, C2, C3, C5, C6, C8}, Group 2:
{C4, C7}, Group 3: {C9}, Group 4: {C10}
     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.95  0     0     0     0     0     0.99  0     0
C2         1     0.95  0     0     0     0     0.99  0     0
C3               1     0     0.99  0.99  0     0     0     0
C4                     1     0     0     0.99  0     0     0
C5                           1     0     0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1
22
Experimental Results
  • Choose 10 arbitrary, distinct names.
  • For each name we get the first 50 returned
    citations.
  • The size of the test set is 500.
  • Use split and merge measures.
  • Consider 8 returned citations: C1, C2, C3, C4, C5,
    C6, C7, C8.
  • The correct grouping result:
  • Group 1: {C1, C2, C4, C6, C7}, Group 2: {C3,
    C8}, Group 3: {C5}.
  • The grouping result of our system:
  • Group 1: {C1, C2, C4}, Group 2: {C3, C6, C7},
    Group 3: {C5, C8}.
  • The number of splits is 0 + 1 + 1 = 2.
  • The total number of merges is 2.
  • Normalize the split and merge scores.
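
The slides do not define the two measures precisely. The sketch below uses one plausible reading that reproduces the totals in the example: a correct group spread over k system groups counts as k - 1 splits, and a system group drawing on k correct groups counts as k - 1 merges. Both definitions, and the function name, are assumptions. Normalization is not shown because its formula is not given.

```python
def count_splits_and_merges(correct, system):
    """Count splits and merges between two groupings of the same items.

    correct, system: lists of non-empty sets of citation numbers.
    """
    # A correct group that overlaps k system groups was split k - 1 times.
    splits = sum(sum(1 for s in system if s & c) - 1 for c in correct)
    # A system group that overlaps k correct groups merged k - 1 times.
    merges = sum(sum(1 for c in correct if c & s) - 1 for s in system)
    return splits, merges


correct = [{1, 2, 4, 6, 7}, {3, 8}, {5}]
system = [{1, 2, 4}, {3, 6, 7}, {5, 8}]
print(count_splits_and_merges(correct, system))
# (2, 2)
```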

23
Experimental Results (Cont)
Official College, Sports Network, Student
Advantage.
24
Cases that Caused Missing Merges--Attributes
Facet
  • No shared attributes.
  • 1030 pairs (out of 1036 pairs) in 41 groups in
    Larry Wild.
  • Only the value of attribute State is shared.
  • 6 pairs in 41 groups in Larry Wild.

25
Techniques Used to Judge in Case of No
Evidence or Weak Evidence
26
Conclusions
  • The multi-faceted approach is useful: it yields a
    low normalized split score (0.004) and a low
    normalized merge score (0.014).
  • No individual facet scored better than using all
    facets together.

27
Contributions
  • Grouped person-name queries by person.
  • Provided an additional tool for search engine
    queries.