GS - PowerPoint PPT Presentation

Slides: 185
Provided by: glott
Title: GS


1
(No Transcript)
2
Preface
The question: can a single statistical
methodology answer natural language processing
problems that share a common goal, namely the
selection among competing entities?
  • Examples
  • Competing documents in information retrieval
    (Information Retrieval).
  • Senses of a word in the context in which it
    appears (Word Sense Disambiguation).
  • Competition among words to form
    collocations.

3
Preface
  • Statistics: the field applied with the greatest
    success to Natural Language Processing
  • Examples
  • Information retrieval systems
    (Information Retrieval, IR)
  • Disambiguation of the sense of a word (Word Sense
    Disambiguation, WSD)
  • Formation of collocations (Collocation)
  • But also
  • Text categorization (Text Categorization)
  • Text simplification (Text Simplification)

4
Preface
  • Aim of the thesis
  • To demonstrate the application of a single
    statistical methodology to the research areas
    above
  • Specifically, the development of systems for
  • finding collocations in natural language
    texts,
  • retrieving information based on a user's
    query (information retrieval), and
  • disambiguating the sense of a word from its
    context (word sense disambiguation).

5
Preface
  • Information retrieval (Information Retrieval)
    is the branch of Natural Language Processing
    concerned with developing algorithms and
    models for retrieving information from
    various document collections (Internet, document
    depositories).
  • With the rise of quantitative methods in
    natural language processing, statistical
    methods became the dominant approach to building
    information retrieval systems.

6
Preface
  • Word Sense Disambiguation: the field
    concerned with disambiguating the sense of a
    word within its context
  • Statistical methods are regarded as the
    exclusive tool for developing word sense
    disambiguation systems.
  • Such systems are very useful and support
    machine translation and text understanding

7
Preface
  • Collocations
  • This is the finding of collocations,
    that is, words that very often appear
    together and form a semantic whole
    whose meaning differs from the
    meanings of its constituent parts.
  • For example, the Greek expression
  • "Γερό Ποτήρι" (literally "sturdy glass", an idiom for a heavy drinker)

8
Motivation
  • Natural Language Processing is undeniably a
    multifaceted scientific field.
  • All the problems mentioned above are extremely
    hard, and solving them is expected to
    substantially influence the applications of
    Computational Linguistics and, in particular,
    the field of Artificial Intelligence.
  • So far many methods and systems have been
    proposed in the international literature for
    such problems, but in a fragmentary way.
  • With the problems treated separately,
    different methods have been developed for
    each problem.
  • As a result, algorithms and techniques that work
    for one area of Natural Language Processing
    cannot be applied to others.

9
The Idea
  • Most natural language processing problems
    share a common characteristic: the selection
    among competing entities with respect to
    some specific goal.
  • Examples
  • Competing documents in information retrieval,
    which compete with respect to the goal of
    relevance to a user's query;
    competing senses in word sense
    disambiguation; or competing word
    pairs for forming collocations.
  • This thesis highlights this
    characteristic and responds with a single
    statistical methodology for solving the
    problems above, contributing to a better
    use of scientific knowledge.

10
The Methodology
  • Statistics offers very well-founded
    goodness-of-fit tests (goodness-of-fit
    statistical tests), which check how
    well the data fit an underlying
    theoretical hypothesis assumed to govern them.
  • The thesis uses the chi-square
    goodness-of-fit statistical test
    (Chi-square Goodness of Fit Statistical Test) to
    assess how relevant each competing entity
    is to the goal.
  • More specifically, a null hypothesis
    is formulated: the competing entities
    show no particular behaviour with respect
    to the goal beyond what chance would produce.
    This is the theoretical hypothesis made
    about the data.

11
The Methodology
  • The actual behaviour of each competing entity
    is recorded from the real data, which
    establishes a discrepancy
    between the actual behaviour and the
    behaviour implied by the theoretical hypothesis.
  • This discrepancy is quantified with the help
    of the χ² distribution, and the resulting value
    can be used as a measure of the
    relevance of the competing
    entity to the goal (ranking criterion).

12
What Follows
  • First, we present an introduction to the
    statistical models used in natural
    language processing, as well as the
    measures for evaluating the performance of
    such systems.
  • Next comes the application of statistical tests
    to information retrieval (Information
    Retrieval). Within the same statistical framework,
    we present a system for retrieving
    textual information from document repositories
    (document repositories) based on a user's
    query.
  • We then present statistical methods
    for discovering collocations in
    electronic texts (Collocations) and establish
    a way of applying statistical tests in
    this area.

13
What Follows
  • Finally, we apply statistical tests to the
    area of word sense disambiguation
    (Word Sense Disambiguation). A statistical
    system is developed that disambiguates the
    sense of a word from its context,
    using the electronic lexicon WordNet
    as a lexical resource.
  • The conclusion drawn from evaluating the
    methods we developed on experimental
    test data is that these statistical
    systems prove robust and can
    give better results than the
    classical methods

14
Introduction
  • Statistics is the branch of mathematics
    that has been used most widely in
    Natural Language Processing (NLP)
  • The rapid growth of computing in recent
    years and the availability of large volumes of text
    in digital form created the conditions for
    the rise of quantitative methods in NLP
  • With the rise of quantitative methods in
    natural language processing, statistical
    methods became the dominant approach to building
    information retrieval systems

15
Introduction
  • Statistical methods are regarded as the
    exclusive tool for developing systems
    for information retrieval (Information
    Retrieval), word sense disambiguation
    (Word Sense Disambiguation), text
    categorization, collocation finding, and so on
  • These problems are recognised as
    computationally complex problems in
    natural language processing, and solving them
    is expected to substantially influence the
    development of the field of Computational Linguistics
    (Computational Linguistics)

16
  • Statistical models in natural language
    processing
  • Research on statistical natural language
    processing systems is concerned with developing
    algorithms and systems for representing,
    storing, organising, processing and accessing
    items of information.
  • The first efforts at representing and
    retrieving information began with information
    retrieval systems. Although the field
    traditionally dealt only with text retrieval
    and document finding, today there is strong
    interest in other forms of information as well.
  • Representing information in a computable
    form plays a decisive role in developing
    natural language processing systems.

17
Information Representation Models
  • Depending on how a text is represented
    as a set of keywords (index terms),
    the most important information
    representation models fall into the following
    three categories
  • Binary models (Boolean models)
  • Vector models (Vector models)
  • Probabilistic models (probabilistic models)

18
Information Representation Models
  • Binary models
  • The binary model is the simplest model; it is
    based on set theory and
    Boolean algebra
  • Information is represented as a sequence
    of the digits 0 and 1. A 1 denotes the presence of a
    term and a 0 its absence
  • It suffers from several drawbacks, for example the
    difficulty users have in Information Retrieval
    when they must express a query as a Boolean
    expression

19
Information Representation Models
  • The vector model
  • The vector model [1, 2] was the first
    model ever applied to information
    retrieval.
  • Under the vector model, each term kj
    in a piece of textual information is assigned
    a positive, non-zero real number
    called a weight (weight), which expresses the
    importance of the term in determining the
    semantics of the text

20
The vector model in Information Retrieval
  • In Information Retrieval
  • We can represent a document dj as
    a vector (w1j, w2j, ..., wtj),
    where t is the number of terms
  • A query q is represented as (w1q, w2q, ..., wtq)

21
The vector model in Information Retrieval
  • We can then use the cosine
    of the angle (cosine) between the two vectors
    to measure the similarity between the two
    pieces of information
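
A minimal sketch of the cosine measure on this slide, assuming the document and query are already given as weight vectors of equal length t. The function name and the sample vectors are illustrative and do not come from the presentation.

```python
import math

def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(d * q for d, q in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(d * d for d in doc_weights))
    norm_q = math.sqrt(sum(q * q for q in query_weights))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

# Hypothetical weights (w1j, w2j, w3j) for a document and (w1q, w2q, w3q) for a query
similarity = cosine_similarity([0.5, 0.0, 1.2], [1.0, 0.0, 0.3])
```

Because the measure depends only on the angle, a long document and a short one with the same term proportions receive the same score.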

22
Weights in the semantics of the text
  • Two factors play a decisive role in
    determining the weight of a term
  • The frequency of the term in the text of the document
  • The number of documents in which the term
    occurs
  • These could be combined into a
    single weight

Tf-idf schemes
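
As an illustration of combining the two factors into a single weight, here is a sketch of one common tf-idf variant (raw term frequency times the log of the inverse document frequency). The transcript does not preserve the exact weighting formula used in the thesis, so this variant and all names below are assumptions.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs maps a document id to its list of tokens.

    Returns {doc_id: {term: weight}} with weight = tf * log(N / df),
    one textbook variant among many.
    """
    n_docs = len(docs)
    df = Counter()  # number of documents each term occurs in
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # frequency of each term inside this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical two-document collection
example = tf_idf_weights({"d1": ["okapi", "model", "okapi"], "d2": ["model"]})
```

A term that occurs in every document gets weight zero under this variant, which reflects the idea that such a term does not help to tell documents apart.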
23
Probabilistic Models
  • In probabilistic models the occurrence of a term
    is modelled as an event and is
    assigned a probability.
  • The higher the probability that a term
    occurs, the more important its role in
    determining the semantics of the information.

24
Probabilistic Models
  • Recently a new approach, language
    modeling (language Modeling), has been proposed alongside the
    traditional vector models and the other probabilistic
    models.
  • It has been applied successfully in Information
    Retrieval systems [8, 9, 10, 11].
  • A statistical language model is a
    probabilistic mechanism for generating text.

25
Probabilistic Models
  • The origin of language models goes back to
    the time of Shannon [12], who formulated
    his well-known theory in the field of
    communications (source channel perspective)
  • Shannon studied how well simple n-gram
    models (n-gram models) can predict
    natural text
  • They have been applied successfully in Speech
    Recognition (Speech Recognition)

26
Probabilistic Models
  • The language model was first applied to
    textual information processing by
    Ponte and Croft in 1998, in Information
    Retrieval [8].
  • In the classical probabilistic Information
    Retrieval models [3, 5, 13, 14], one
    must distribute a probability mass
    (Probability mass) over a huge number
    of possible values (outcomes) for each term
    (unigram language model)
  • This is extremely hard. Most of the time the
    only evidence available is the terms of the query

27
Probabilistic Models
  • Ponte and Croft [8] tackled the issue
    with an inverse approach. Using
    a smoothed version of the unigram language model,
    they proposed a method that assigns a
    likelihood score (likelihood score) from the document
    to the query.
  • This approach is known as the language
    modeling Approach
  • A language model is viewed as a noisy
    channel (noisy channel) or translation channel
    that maps documents to queries
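
To make the idea of a smoothed unigram likelihood score concrete, the sketch below scores a query against a document with linear interpolation (Jelinek-Mercer) smoothing. This smoothing scheme is swapped in for illustration only; it is not the estimator Ponte and Croft used, and the names and the value of `lam` are assumptions.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_tokens, collection_tokens, lam=0.5):
    """log P(query | document) under a unigram model smoothed with the collection.

    p(w | d) = (1 - lam) * tf(w, d) / |d| + lam * cf(w) / |C|
    """
    doc_tf = Counter(doc_tokens)
    coll_tf = Counter(collection_tokens)
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf[term] / len(doc_tokens) if doc_tokens else 0.0
        p_coll = coll_tf[term] / len(collection_tokens)
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection
        score += math.log(p)
    return score

# Hypothetical toy collection of two documents
d1, d2 = ["a", "b"], ["b", "c"]
collection = d1 + d2
score_d1 = query_log_likelihood(["a"], d1, collection)
score_d2 = query_log_likelihood(["a"], d2, collection)
```

Smoothing keeps a document from receiving zero probability just because one query term is missing from it, which is why `d2` still gets a finite score here.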

28
  • Evaluation Measures
  • Evaluation measures for Natural Language
    Processing systems

29
Evaluation Measures
  • We describe the evaluation measures that we
    will use in Information Retrieval and
    in Word Sense Disambiguation systems.
  • These measures also apply more generally to
    Natural Language Processing systems

30
Evaluation Measures for NLP Systems
  • Precision and Recall
  • We examine these concepts from the point of view
    of Information Retrieval and then generalise.
  • Suppose a query q is submitted to the
    Information Retrieval system.
  • Let R be the set of documents relevant to this
    query and A the set of documents
    returned by the system

31
Evaluation Measures for NLP Systems
  • Furthermore, let Ra be the number of documents in the
    intersection (Intersection) of R and A
  • Recall = Ra / |R|
  • Precision = Ra / |A|

32
Evaluation Measures for NLP Systems
  • That is, for any processing system
  • Precision is the fraction of successes among
    all the answers of the system
  • Recall is the fraction of successes among all
    the correct answers that exist.
  • We usually plot the
    Precision versus Recall curve
  • Typically at specific Recall levels
  • 0%, 10%, 20%, ..., 100%
  • We then speak of Precision Versus Recall at 11
    Recall Points
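
The two measures and the 11-point curve can be sketched as follows. The interpolation rule used here (precision at a recall level is the highest precision observed at that level or beyond) is the usual textbook convention and is an assumption, since the transcript does not say which rule the author applied. Document ids are hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a set R of relevant and a list A of retrieved ids."""
    hits = len(set(relevant) & set(retrieved))  # Ra
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point_precision(relevant, ranking):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for a ranked list."""
    relevant = set(relevant)
    points = []  # (recall, precision) measured after each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# Hypothetical judged query: two relevant documents, four returned
curve = eleven_point_precision({"d1", "d3"}, ["d1", "d2", "d3", "d4"])
```

Averaging such curves over all queries of a test set gives one precision value per recall level for each system being compared.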

33
Application of Statistical Tests to Information
Retrieval
34
Application of Statistical Tests to Information
Retrieval
The Basic Idea.
  • In most of the models used for
    Information Retrieval we want to
    estimate how well the document model
    (document model) fits the user's information
    need (query model).
  • Statistics, for its part, offers
    well-founded techniques for estimating
    how well one model fits another
    model

F?????? ???/??? Stat?st???? ??e???? st??
?pe?e??as?a F?s???? G??ssa?
35
The Basic Idea.
  • Goodness-of-fit statistical tests
    (Goodness of fit statistical tests) are very
    well-known methods for assessing the hypothesis of
    how well a theoretical model describes
    a set of data.
  • In the main body of the thesis we develop an
    Information Retrieval technique that
    relies on the chi-square goodness-of-fit
    test to estimate how well the
    document model fits the user's information
    need

36
Application of Statistical Tests to Information
Retrieval
  • Besides proving particularly effective,
    this technique is also flexible.
  • It can be adapted to different
    problems wherever the notion of
    assessing fit comes into play, for example
    word sense disambiguation.

37
Application of Statistical Tests to Information
Retrieval
Implementation
The rule is simple. We formulate a basic
hypothesis about the data, known as the null
hypothesis
Under this hypothesis we assume that there is no
particular relation or link between the query
(query) and a given document, other
than the fact that the query terms may
occur in that document purely by
chance
To assess this hypothesis we run
a chi-square statistical test (Goodness of
Fit Statistical Test) and use it to
estimate the relevance of the
document to the user's query.
38
The method was evaluated on the official TREC
data used to test the performance of
Information Retrieval systems
Its performance is consistently above that of the
classical tf-idf schemes and the OKAPI method
Advantages
  • A non-parametric method for Information Retrieval
  • It yields a simple retrieval formula
  • An alternative way of modelling documents and
    queries

39
Introduction to Statistical Language Models
  • Vector models (vector Space models)
  • Probabilistic models (Probabilistic models)
  • Language Modeling Approach

40
  • Vector model. Proposed by Salton
    [2] in 1972. It models documents and
    queries as vectors and uses
    vector metrics to estimate
    relevance. It is still in use today.
  • Probabilistic model. Proposed by Robertson
    and Sparck-Jones [3] in 1975. It uses the
    probability that a term occurs instead of the
    frequency used in the vector
    model, and estimates the relevance of the
    query to the document using
    distributions
  • Variants
  • Naïve Bayesian Networks [13]
  • Inquery Retrieval System [14]
  • OKAPI system

41
Language Modeling Approach
  • Proposed in 1998 by Ponte and Croft [8]
  • It uses statistical language models in
    the same way they are used in Speech
    Recognition; their origin goes back to the
    time of Shannon and his noisy
    channel model (noisy channel) [12].
  • These systems perform well, but they have the
    drawback of being parametric: they require
    parameter estimation on training data
  • Variants
  • Hidden Markov Models [48, 11]
  • Translation Models [10]

42
Our Own Approach: Goodness of Fit (GOF)
Retrieval
  • To rank the documents we
    rely on the chi-square statistical test
  • The chi-square test describes how well
    a hypothesis (the null hypothesis), which
    we assume underlies the data, fits
    the data
  • More specifically, we formulate the null
    hypothesis that all the query terms
    are distributed randomly over the documents
  • We measure the frequency of each term in the document
    (observed) and compare it with the frequency implied by
    the null hypothesis (expected).
  • If the difference is large, this is evidence of an
    association between the query and the document.

43
Goodness-of-Fit Statistical Tests
  • Statistical problems often reduce to a
    test that chooses between two alternative
    hypotheses: the null hypothesis (null Hypothesis) H0, which
    states that the sample follows the
    assumed underlying distribution, and the
    alternative H1, which states that it does
    not.
  • A statistical test is called powerful if the
    probability of accepting H0 is small when H0
    is false.

44
Chi-Square Test
  • The most important and best-known statistical
    test is the χ² test, proposed by Pearson
    [33] (Pearson's chi-squared test).
  • The statistic used to compute it
    is the following:

    $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \qquad (2.1)$$

where O_i is the observed frequency and E_i the
expected frequency under the null hypothesis.
The statistic of equation 2.1 follows
the χ² distribution with k-c degrees of freedom, where k
is the number of classes into which the data
are categorised and c is the number of parameters
estimated for the distribution assumed to
govern the data.
45
Chi-Square Test (continued)
  • Using a statistical package or
    tables of the χ² distribution, we compute the p-value
    (p-value) for the χ² value obtained from the
    preceding equation.
  • If the p-value is very small (typically below
    a chosen significance level) we reject the
    null hypothesis; otherwise we accept it.
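
A small worked example of the test, under stated assumptions: 100 hypothetical coin tosses split 60/40, a null hypothesis of a fair coin, and the standard table value 3.841 for the χ² distribution with one degree of freedom at significance level 0.05. Taking one degree of freedom for two classes is the usual convention and is an assumption about how the slide's k-c count is meant to be applied.

```python
def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum over classes of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Table value of the chi-square distribution, 1 degree of freedom, alpha = 0.05
CRITICAL_DF1_ALPHA_005 = 3.841

# Hypothetical data: 60 heads and 40 tails, against 50/50 under the null hypothesis
chi2 = chi_square_statistic([60, 40], [50, 50])
reject_null = chi2 > CRITICAL_DF1_ALPHA_005
```

Comparing the statistic with the table value is equivalent to comparing the p-value with the significance level, so no distribution function is needed for a fixed level.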

46
Information Retrieval Method Using the
χ² Statistical Test
  • The essence of the proposed method is to
    compare the observed frequencies of the query
    terms in the document with the frequencies expected
    under the assumed hypothesis of a random
    distribution.
  • With the help of the χ² statistical
    test, this comparison quantifies a discrepancy
    (discrepancy), which can then be used
    as a criterion for ranking the relevance of the
    query to the document.

47
Information Retrieval Method Using the
χ² Statistical Test (continued)
  • The null hypothesis is rejected when the
    χ² value computed from Pearson's equation 2.1
    is greater than the value
    read from the χ² distribution tables for
    a significance level α (usually α = 0.05, for
    95% confidence)
  • In other words, the larger the computed χ²
    value, the stronger the evidence for
    rejecting the null hypothesis and therefore for
    an association (relatedness) between
    query and document

48
Information Retrieval Method Using the
χ² Statistical Test (continued)
  • Therefore, as far as our technique for
    measuring the relevance between query and
    document is concerned, we can use the
    computed χ² value itself, without
    actually being interested in rejecting the
    null hypothesis
  • The documents with the largest χ² values
    are placed at the top of the returned
    ranked list of relevant documents
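
The ranking idea can be sketched as below. The transcript does not preserve the slides that define the expected frequencies, so this sketch assumes the simplest random model: a term with collection frequency cf is expected cf / N times in each of the N documents. All names and the toy collection are hypothetical, and this should not be read as the author's exact retrieval formula.

```python
from collections import Counter

def chi_square_score(query_terms, doc_tokens, collection_freq, num_docs):
    """Sum of (O - E)^2 / E over the distinct query terms for one document."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_terms):
        cf = collection_freq.get(term, 0)
        if cf == 0:
            continue  # term absent from the collection: E would be zero
        expected = cf / num_docs  # assumed uniform random model
        observed = tf[term]
        score += (observed - expected) ** 2 / expected
    return score

def rank_documents(query_terms, docs):
    """docs maps a document id to its token list; returns ids, highest score first."""
    collection_freq = Counter(t for tokens in docs.values() for t in tokens)
    scores = {doc_id: chi_square_score(query_terms, tokens,
                                       collection_freq, len(docs))
              for doc_id, tokens in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical three-document collection
toy_docs = {
    "d1": "okapi retrieval retrieval model".split(),
    "d2": "speech recognition model".split(),
    "d3": "language model".split(),
}
ranking = rank_documents(["retrieval"], toy_docs)
```

One limitation of this plain form is that a document containing fewer occurrences than expected also contributes a positive term, so a practical system would need to decide how to treat such negative deviations.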

49
Information Retrieval Method Using the
χ² Statistical Test (continued)
50
Information Retrieval Method Using the
χ² Statistical Test (continued)
51
Information Retrieval Method Using the
χ² Statistical Test (continued)
52
Information Retrieval Method Using the
χ² Statistical Test (continued)
  • Advantages
  • The main advantage is that the proposed
    method is non-parametric. In other methods,
    such as KL-Divergence, the resulting model
    requires the distribution parameters to be estimated
    on training data (training data)
  • It yields a simple retrieval formula (Retrieval
    formula)
  • We can try out many alternative
    retrieval formulas simply by changing our basic
    hypothesis about the data (that is, the
    randomness model)

53
The Comparison Models
  • We describe two popular Information
    Retrieval models against which we compare the
    proposed χ² GOF method:
  • the OKAPI method, from the well-known tf-idf schemes
  • KL-Divergence, from the Language Modeling
    approach to Information Retrieval

54
Tf-idf Schemes, OKAPI Retrieval Formula
  • The tf-idf schemes are also known as vector
    space models and were first proposed by
    Salton in 1971 [2].
  • Under this model, each term ki in a
    document dj is associated with a positive weight wij,
    which expresses how important the term is
    for determining the semantics of the document
    and hence its importance to the retrieval
    system
  • Each query term is likewise associated with
    a corresponding weight

55
Tf-idf, OKAPI Retrieval Formula (continued)
56
Tf-idf, OKAPI Retrieval Formula (continued)
57
Tf-idf, OKAPI Retrieval Formula (continued)
  • To make the scheme more competitive,
    we use a variant of the weights
    given in formula (2.8): the OKAPI-TF
    formula, also known as the BM25 formula for best
    matching (Best matching OKAPI retrieval formula
    [49]).
  • Although the OKAPI TF formula was designed for
    use with the OKAPI probabilistic model,
    it has been shown to give better retrieval
    results when used with the vector
    model [66]
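
For orientation, here is a sketch of a widely used textbook form of BM25. Formula (2.8) and the exact OKAPI-TF variant used in the presentation are not preserved in the transcript, so the idf form (with 1 added inside the logarithm to keep it positive) and the defaults `k1=1.2` and `b=0.75` are assumptions rather than the author's settings.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, docs, k1=1.2, b=0.75):
    """Textbook BM25 score of one document; docs is a list of token lists."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in docs if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalisation
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

# Hypothetical three-document collection
corpus = [["okapi", "okapi", "model"], ["model", "test"], ["speech"]]
score_with_term = bm25_score(["okapi"], corpus[0], corpus)
score_without_term = bm25_score(["okapi"], corpus[1], corpus)
```

The parameters `k1` and `b` are exactly the kind of tuning values that a non-parametric method avoids.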

58
Tf-idf, OKAPI Retrieval Formula (continued)
59
KL-Divergence
  • KL-Divergence [40] is a particularly
    effective method that extends the
    language modeling approach (language
    modeling approach) in the area of Information
    Retrieval
  • It is a parametric method. The basic idea
    is to estimate a language model for
    the document and a language model for the
    query, and to compare them using the
    Kullback-Leibler Divergence

60
KL-Divergence (continued)
  • Although KL-Divergence is not a true
    distance (it is not symmetric and does not satisfy the
    triangle inequality), it is a very good
    measure of the similarity between two distributions.
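
A short sketch of the Kullback-Leibler divergence between two discrete distributions, illustrating the asymmetry noted on the slide. Distributions are passed as dictionaries mapping events to probabilities; the function name and the example values are illustrative.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)), natural logarithm."""
    total = 0.0
    for event, p_x in p.items():
        if p_x == 0:
            continue  # 0 * log(0 / q) is taken as 0
        q_x = q.get(event, 0.0)
        if q_x == 0:
            return math.inf  # p puts mass where q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical distributions over two words
p_dist = {"a": 0.5, "b": 0.5}
q_dist = {"a": 0.9, "b": 0.1}
forward = kl_divergence(p_dist, q_dist)
backward = kl_divergence(q_dist, p_dist)
```

Here `forward` and `backward` differ, which is the lack of symmetry that keeps the divergence from being a true distance. The infinite value for unseen events is also why the document model needs to be smoothed in practice.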

61
KL-Divergence (continued)
The second term on the right-hand side is a constant
that depends on the query, or more precisely on the
entropy of the query model, and does not
depend on the document, so it can be
omitted.
In the same formula, the relevance of the document d with
respect to the query q depends on the estimate
of the query model p(w|q) and of the
document language model p(w|d)
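
The formula this slide refers to is not preserved in the transcript. A standard way of writing the KL-based ranking score, consistent with the description above and given here as a reconstruction under that assumption, is:

```latex
-D\bigl(p(\cdot \mid q)\,\|\,p(\cdot \mid d)\bigr)
  = \sum_{w} p(w \mid q)\,\log p(w \mid d)
  \;-\; \sum_{w} p(w \mid q)\,\log p(w \mid q)
```

The second sum involves only the query model (it equals the negative of the query model's entropy), so it is the same for every document and leaves the ranking unchanged when dropped.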
62
KL-Divergence (continued)
63
KL-Divergence (continued)
64
Evaluation of the χ² Information Retrieval System
  • In traditional Information Retrieval systems
    the documents in the collection stay fixed, while
    new queries are submitted to the system, which
    is asked to return the most relevant
    documents.
  • This is known as Ad-hoc Retrieval.
  • On this task we test the performance of the
    proposed χ²-GOF method and
    compare it with the OKAPI and KL-Divergence methods
    on the same problem

65
Description of the TREC Evaluation Data
  • A document collection that has been used for
    years to evaluate Information
    Retrieval systems is the TIPSTER/TREC
    collection [44]
  • Because of its large size it is regarded today as the
    standard reference test collection for the
    area of information Retrieval
  • The creation of the collection was started by Donna
    Harman, a researcher at the National Institute of
    Standards and technology (NIST), who had the
    idea of organising an annual
    competition for Information Retrieval systems, under the
    name TREC (Text Retrieval Conference)

66
Description of the TREC Data
67
Description of the TREC Data (continued)
  • Because these collections were created under the
    DARPA-funded TIPSTER programme, they are
    also referred to as the TIPSTER or TIPSTER/TREC test
    Collection
  • The TREC Collection grows steadily in size over
    time. Today it is distributed on 6 CD Rom
    Disks, each containing about 1
    gigabyte of compressed text
  • Sources of the texts

68
Description of the TREC Data (continued)
Sample Document from the Collection
  • All documents in the collection are tagged
    (tagged) with SGML for easy Parsing

69
Description of the TREC Data (continued)
Sample Document from the Collection (continued)
70
Description of the TREC Data (continued)
  • The TREC collection also includes a set of
    queries (queries): requests that
    express some information need, with which
    a new algorithm can be tested for
    its performance.
  • In TREC terminology such a query is called a
    topic
  • An example of a topic is the following

71
Description of the TREC Data (continued)
Sample topic
72
Performance Comparison with the tf-idf Schemes,
OKAPI Method
  • To get a better picture of the retrieval
    capability of the proposed χ²-GOF method,
    we chose to test its performance
    on 3 large sub-collections of the TREC collection,
    which are
  • We also compare the proposed method,
    on the same collection, with the OKAPI method, which
    is regarded as a classic for Information Retrieval

73
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
  • Table 2.1 shows the statistics
    of these collections

74
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
  • As queries we used topics 351-400
    (topics 351-400), which were used at the
    TREC-7 conference
  • We ran two experiments with these topics. In
    the first we used only the titles from the
    text of the query, and in the second we
    used a longer version of the
    queries
  • For these experiments we applied no
    preprocessing to the texts, such as
    tokenization or stemming, nor did we apply any
    list of excluded common words (stopword list),
    such as articles, conjunctions, adverbs, and so on.
    Instead we took into account every word, without
    exception, of all the documents in the collection

75
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
76
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
77
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
78
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
79
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
80
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
81
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
82
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
83
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
84
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
85
Performance Comparison with the tf-idf Schemes,
OKAPI Method (continued)
86
Comparison with KL-Divergence and OKAPI on the TREC
Collection
  • To evaluate the capabilities of the
    proposed χ²-GOF method more fully, we ran a
    larger experiment on the whole of TREC CDs
    4 and 5, this time also comparing with the
    KL-Divergence method
  • The statistics of the collection are shown
    below

87
(No Transcript)
88
(No Transcript)
89
Characteristics and Advantages of the
Proposed χ²-GOF Method
  • Although the proposed method uses only raw
    frequencies for retrieval, it
    consistently outperforms the OKAPI BM25 retrieval method
  • However, in both cases, TREC-7 and TREC-8,
    KL-Divergence has the best performance
  • That method, though, has the drawback of being
    parametric: it requires parameter
    estimation over the whole collection.

90
Characteristics and Advantages of the
Proposed χ²-GOF Method
  • Simplicity is one of the main advantages
    of the proposed method. The computed χ²-GOF
    value improves performance and allows the
    system to find documents that approximate the
    conditions of the query
  • Our method lets us decide whether
    there is a statistically significant relation between
    query and document
  • In addition, within the framework of the
    statistical test, it lets us try out alternative
    retrieval formulas simply by changing the basic
    hypothesis about the data

91
Statistical Assessment of the Performance of the
Compared Algorithms
  • The performance of the compared algorithms in these
    experiments appears to differ
  • To assess this more formally, we
    run a paired t-test
  • The paired t-test is used to
    check whether the population means of two
    samples are equal
  • In our case we use as samples
    the mean precision values obtained at the
    11 points in the experiments we ran
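
The test statistic can be sketched with the standard library alone. The sketch below returns the t statistic and its degrees of freedom; turning these into a p-value requires a Student t table or a statistics package, which is left out. The two samples of 11-point precision values are hypothetical and are not results from the presentation.

```python
import math
import statistics

def paired_t_statistic(sample_a, sample_b):
    """t statistic and degrees of freedom for a paired t-test.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise differences.
    """
    if len(sample_a) != len(sample_b) or len(sample_a) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Hypothetical mean precision at the 11 recall points for two systems
system_a = [0.72, 0.55, 0.47, 0.40, 0.35, 0.30, 0.24, 0.18, 0.13, 0.08, 0.03]
system_b = [0.68, 0.50, 0.41, 0.36, 0.30, 0.27, 0.20, 0.16, 0.10, 0.07, 0.02]
t_value, degrees_of_freedom = paired_t_statistic(system_a, system_b)
```

Pairing matters here because both systems are measured at the same recall points, so the test works on the per-point differences rather than on two independent samples.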

92
Statistical evaluation of the effectiveness of the
compared algorithms
  • Running the paired t-test for the X2-GOF and
    OKAPI models, the returned probability (p-value)
    is 0.0326 for topics 351-400 and 0.00010608 for
    topics 401-450
  • We therefore conclude that the mean precisions
    obtained for the X2-GOF and OKAPI models differ
    with confidence 96.74% and 99.98% for topics
    351-400 and 401-450 respectively
  • Similarly, comparing the X2-GOF and
    KL-Divergence models, we find returned
    probabilities of 0.0004 for topics 351-400 and
    0.0018 for topics 401-450

93
Changing the basic hypothesis about the data
  • The Divergence from Randomness work, Amati
    [67], proposes a basic randomness model for the
    distribution of terms across documents.
    According to it, the processes that distribute
    terms can be treated as random drawings from an
    urn containing the available terms.
  • Following this proposal, we changed the
    randomness model from that of the uniform
    distribution to the binomial model
  • According to this model, the occurrence of a
    single term i in a document d is regarded as a
    Bernoulli process with probability p = 1/N,
    where N is the number of documents
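As a small illustration of the binomial model named above: if each occurrence of a term is an independent Bernoulli trial that lands in a given document with probability p = 1/N, then the number of occurrences in that document follows a binomial distribution. The sketch below assumes this reading; the function name and the figures (200 occurrences, 1,000 documents) are hypothetical and do not come from the slides.

```python
from math import comb


def binomial_term_probability(tf, total_occurrences, num_documents):
    """Probability that a term lands exactly tf times in one document when each
    of its total_occurrences is a Bernoulli trial with p = 1 / num_documents."""
    p = 1.0 / num_documents
    return comb(total_occurrences, tf) * p**tf * (1.0 - p) ** (total_occurrences - tf)


# Hypothetical figures: a term occurring 200 times in a 1,000-document collection.
for tf in range(4):
    print(tf, binomial_term_probability(tf, 200, 1000))
```

Under this model a high within-document frequency is very improbable by chance, which is what makes it informative.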

94
Changing the basic hypothesis about the data
95
Changing the basic hypothesis about the data
  • To compare the effectiveness of the two
    different hypotheses, we ran a retrieval
    experiment on the FBIS collection from CD 5 of
    the TREC collection

96
(No Transcript)
97
Conclusions
  • We presented a method for applying the X2-GOF
    statistical test to information retrieval
  • The method proves robust and effective,
    performing well both for short queries and for
    more verbose ones
  • It has the advantage of letting us model
    documents and queries under many different
    hypotheses.
  • Some different hypotheses about the data, such
    as normality, Weibull, etc., may be good
    alternative hypotheses.
  • Also worth trying are other statistical tests,
    such as Kolmogorov-Smirnov and Anderson-Darling.

98
  • Statistical methods for finding collocations

99
Collocations
  • Words that co-occur together very frequently
  • They are a common feature of natural languages
    and can appear both in plain natural-language
    texts and in technical and scientific texts
  • A collocation may be a combination of words (or
    phrases) that occurs very frequently in the
    language in a way that reads as natural in
    meaning given the context, even though composing
    the individual meanings of its parts in
    isolation yields a meaning unrelated to the
    context

100
Collocations
  • Collocations in languages with a rich
    inflectional system, such as Greek, appear in
    two ways
  • Rigid
  • The words Χρηματιστήριο and αξία, as in
    Χρηματιστήριο Αξιών (stock exchange)
  • Loose
  • The words Στρώνομαι and δουλειά, as in
  • Στρώνομαι στην δουλειά (I settle down to work)
  • Η δουλειά μου στρώνει (my work is settling in)

101
Collocations
  • There are many definitions of collocations,
    since different researchers have focused on
    different characteristics
  • Firth [55]
  • Collocations of a given word are statements of
    the habitual or customary places of the word
  • Benson and Morton [50]
  • An arbitrary and recurrent word combination
  • "Recurrent" means that these combinations
    appear frequently for a given context

102
Collocations
  • Smadja [64]
  • Identifies 4 characteristics of collocations
    that are useful for computational applications
  • Collocations are arbitrary, meaning that they
    do not correspond to any syntactic or semantic
    variation
  • Collocations are domain-dependent, so handling
    texts in a given field requires clear knowledge
    of the terminology and of the domain-dependent
    collocations
  • Collocations are recurrent, as defined above
  • Collocations are cohesive lexical clusters,
    meaning that the occurrence of one or more of
    the words often implies the occurrence of the
    remaining words as well

103
Collocations
  • According to Manning and Schutze [60],
    collocations are characterized by limited
    compositionality
  • A natural-language expression is compositional
    if the meaning of the expression can be
    predicted from the composition of the meanings
    of its parts
  • ?a??de??µa ? ??f?as?
  • Ge?? p?t???
  • ?????
  • Another characteristic of collocations is the
    absence of valid synonyms [59, 60]
  • Example: the synonyms baggage and luggage
  • But: emotional, historical or psychological
    baggage

104
The usefulness of collocations
  • They are important for a significant number of
    applications, such as
  • Natural Language Generation: requires the
    correct combination of words
  • Machine Translation: it is difficult to
    translate collocations from one language to
    another, e.g. clear road -> Ελεύθερος δρόμος
  • Text Simplification: replacing difficult words
    with simple ones requires knowledge of
    collocations
  • Computational Lexicography: collocations are
    necessary to characterize many of the lexical
    entries

105
H ?????? t?? ??a????? Collocations
  • We present two methods for deriving
    collocations.
  • In the first case we apply the well-tested
    method of mean and variance
  • In the second method we establish the use of
    the X2 statistical test for extracting
    collocations

106
H ?????? t?? ??a????? Collocations
  • The traditional approach to collocation
    extraction is the lexicographic approach.
  • According to Benson and Morton [50], we cannot
    treat the parts that participate in a
    collocation (collocates) separately. Collocation
    extraction is therefore not predictable; it must
    first be done manually and the results then
    entered into dictionaries

107
H ?????? t?? ??a????? Collocations
  • Recently, statistics has been applied to
    collocation extraction
  • Choueka [52] tried to extract collocations
    using N-grams, combinations of 2 to 4 terms,
    with a very simple criterion: frequency of
    occurrence
  • Unfortunately, this choice does not always lead
    to the best results; e.g. in English the most
    frequent bigrams are "of the", "in the", "to the"

108
H ?????? t?? ??a????? Collocations
  • To overcome the above problem, Justeson and
    Katz [58] proposed selecting only those bigrams
    that form phrases.
  • They used part-of-speech filters
  • AN, NN, AAN, ANN, where A denotes an adjective
    and N a noun
  • Although the method is extremely simple, the
    authors reported a significant improvement in
    the results

109
H ?????? t?? ??a????? Collocations
  • The frequency-based method works very well with
    noun phrases. However, many collocations involve
    words with much more flexible relationships
    between them
  • The mean and variance method [64] overcomes
    this problem by computing the signed distances
    between the collocates and finding the spread of
    these signed distances

110
H ?????? t?? ??a????? Collocations
  • ? p??s????s? t?? µ?s?? ?a? t?? d?asp???? fa??eta?
    ?????? e??a? ap??. ??a??t??µe ?ata??µ?? µe µ????
    d?asp???
  • An alternative frequency-based method is mutual
    information [53].
  • The term mutual information originates in
    information theory and is essentially a measure
    of how much one word tells us about another
    word
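For collocation work, mutual information between two specific words is usually computed in its pointwise form, log2 of p(w1,w2) / (p(w1) p(w2)). The sketch below assumes that form with maximum likelihood probabilities; the function name and all counts are hypothetical.

```python
import math


def pointwise_mutual_information(pair_count, w1_count, w2_count, total):
    """PMI in bits: log2(p(w1,w2) / (p(w1) * p(w2))), probabilities by MLE."""
    p_pair = pair_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return math.log2(p_pair / (p_w1 * p_w2))


# Hypothetical counts: the pair occurs 20 times, the single words 100 and 50
# times, in a corpus of 100,000 bigrams.
print(pointwise_mutual_information(20, 100, 50, 100_000))
```

A value of zero means the pair occurs exactly as often as independence predicts; rare pairs can receive misleadingly high scores because the estimates rest on very few observations.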

111
The proposed method of the X2 statistical test
  • The frequency-based method has a weakness: it
    fails when there are extreme values, i.e.
    outliers (bigrams with very high frequency)
  • We will present an alternative approach based
    on the X2 statistical test.
  • We will also give an alternative formula for
    computing the X2 statistic for the case of
    extracting bigrams from the corpus

112
The proposed method of the X2 statistical test
  • X2 is a very well defined statistical approach
    that assesses whether an event is the result of
    chance
  • This is one of the most general problems in
    statistics and is usually formulated from the
    viewpoint of hypothesis testing
  • In our case we want to know whether two words
    appear together more often than they would by
    chance

113
The proposed method of the X2 statistical test
  • We formulate the null hypothesis (H0) that
    there is no association between the two words
    beyond their co-occurrence by chance.
  • We compute the probability (p0) that the event
    would have if H0 were true.
  • If p0 is small, typically below a predetermined
    significance level (p0 < 0.005 or p0 < 0.001),
    we reject H0; otherwise we accept it as true

114
The proposed method of the X2 statistical test
  • In statistics more generally, such
    probabilities for rejecting or not rejecting the
    null hypothesis are computed with the Student
    statistical test (t-statistic), which assumes
    normally distributed statistical samples
  • The reason we chose the X2 statistical test is
    that it does not assume that the data follow
    the normal distribution (it is distribution
    free), which is more appropriate in the case of
    words in texts

115
Application of the method and comparison with the
Mean and Variance method
  • In what follows in this section
  • We describe the two methods in more detail
  • We give experimental results from applying them
    to a corpus of Greek texts

116
The Mean and Variance method
117
Application of the method and comparison with the
Mean and Variance method
  • Let us look at an example of computing the mean
    and the deviation.
  • Consider the sentences from the Greek language
    for the words χτύπησε (knocked) and πόρτα (door).

118
The Mean and Variance method
  • We can compute the mean and the variance of the
    distances of the word χτύπησε relative to the
    word πόρτα

119
The Mean and Variance method
  • The mean and the variance help us find
    collocations by looking for pairs with the
    lowest spread
  • The lower the variance of the distances within
    a pair of words, the stronger the evidence that
    the pair forms a collocation
  • A sharply peaked distribution of distances is
    strong evidence. Let us explain this with two
    distributions of real data from the method's
    evaluation corpus
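The idea behind the mean and variance method can be sketched on toy data. The snippet collects the signed distances between two words inside a window and reports their mean and standard deviation. The English sentences, the window of 5 tokens and the function name are illustrative assumptions, not the corpus or settings used in the presentation.

```python
from statistics import mean, stdev


def signed_offsets(tokens, w1, w2, window=5):
    """Signed distances (position of w2 minus position of w1) for every
    co-occurrence of w1 and w2 within `window` tokens of each other."""
    offsets = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        offsets.extend(j - i for j in range(lo, hi) if j != i and tokens[j] == w2)
    return offsets


# Made-up sentences mirroring the "knocked ... door" example.
sentences = [
    "she knocked on the door",
    "they knocked at the door",
    "he knocked on his door",
    "a man knocked on the metal door",
]
offsets = [d for s in sentences for d in signed_offsets(s.split(), "knocked", "door")]
print(offsets, mean(offsets), stdev(offsets))
```

A small standard deviation means the second word tends to sit at a fixed distance from the first, which is the signal the method looks for.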

120
The Mean and Variance method
121
The Mean and Variance method
122
The chi-square method
  • In 1900 Karl Pearson proposed a statistic, the
    X2 statistic, which compares the observed with
    the expected counts when the possible outcomes
    of an experiment are divided into mutually
    exclusive categories

The Σ denotes the sum, which is computed over all
possible outcomes of the experiment
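The formula this slide refers to did not survive in the transcript. For reference, the standard form of Pearson's statistic, which is presumably what was shown, is given below; the symbols O_i (observed count in category i), E_i (expected count in category i) and k (number of categories) are notation chosen here, not taken from the slide.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```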
123
The chi-square method
  • The expected and the observed frequencies can
    be explained within the framework of hypothesis
    testing
  • If the data are divided into mutually exclusive
    categories and we formulate a null hypothesis
    about the data
  • Then
  • The expected value is the value for each
    category if the null hypothesis is true
  • The observed value for each category is
    obtained from the sample data

124
The chi-square method
  • To make the application of the above method
    easier to understand, we give an example
  • Suppose we have a linguistic corpus and are
    interested in extracting collocations
  • We define a collocational window of 10 words
    and count the frequency of occurrence of the
    pair of words ισχυρός (strong) and άνδρας (man)

125
The chi-square method
  • The following counts result.
  • 10 occurrences of the pair (ισχυρός, άνδρας)
    in the corpus
  • 1000 bigrams where the second word is άνδρας
    and the first is not ισχυρός
  • 500 bigrams where the first word is ισχυρός and
    the second is not άνδρας
  • 1,500,000 bigrams that contain neither of the
    two words, given the collocational window
126
The chi-square method
In this case it would be useful to use the
contingency table
127
The chi-square method
  • Using maximum likelihood estimates, we can
    compute the probability of occurrence of the
    pair that follows from the null hypothesis

The null hypothesis is that the occurrences of
ισχυρός and άνδρας are independent
128
The chi-square method
  • We then compute the X2 value from equation 3.7
  • From the tables of the X2 distribution we find
    the critical value for a significance level
    (usually α = 0.05)
  • If the computed X2 value is greater than the
    critical value, we can reject the null
    hypothesis that the words ισχυρός and άνδρας
    occur independently
  • Hence, for large values of the X2 statistic we
    have strong evidence for the formation of a
    collocation
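As a rough sketch of the test just described, the snippet below computes Pearson's statistic for the 2x2 counts quoted in the example (10, 500, 1000 and 1,500,000), deriving the expected counts from the row and column totals under independence. The function name, the table layout (rows: first word present or absent; columns: second word present or absent) and the hard-coded critical value 3.841 (one degree of freedom, significance level 0.05) are illustrative assumptions.

```python
def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table [[a11, a12], [a21, a22]], with
    expected counts computed from the marginals (independence hypothesis)."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Counts from the example: both words, first only, second only, neither.
observed = [[10, 500], [1000, 1_500_000]]
CRITICAL_DF1_ALPHA05 = 3.841  # chi-square critical value, 1 degree of freedom

chi2 = chi_square_2x2(observed)
print(f"chi2 = {chi2:.1f}, reject independence: {chi2 > CRITICAL_DF1_ALPHA05}")
```

For these counts the statistic comes out far above the critical value, so the independence hypothesis would be rejected.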

129
The chi-square method
  • For a 2x2 contingency table, we can use the
    following formula to compute the X2 statistic

where aij are the entries of the 2x2 contingency
table and N is the sum of these entries
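The formula itself is missing from the transcript. The standard closed form of the chi-square statistic for a 2x2 table, written with the symbols named on the slide (a_ij for the entries, N for their sum), is shown below as a likely reconstruction; it is algebraically equivalent to summing (observed - expected)^2 / expected over the four cells.

```latex
\chi^2 = \frac{N\,(a_{11}a_{22} - a_{12}a_{21})^2}
              {(a_{11}+a_{12})\,(a_{11}+a_{21})\,(a_{12}+a_{22})\,(a_{21}+a_{22})},
\qquad N = a_{11} + a_{12} + a_{21} + a_{22}
```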
130
Experimental results
  • Several text files in Modern Greek were
    available to us in electronic form from various
    sources
  • An initial morphological part-of-speech tagging
    process marked the part of speech and the lemma
    for each word of the corpus
  • Unfortunately, our preprocessing was not able
    to provide lemmas for verbs and adverbs.
131
Experimental results
The distribution of lemmas in the corpus is shown
in the table below
132
Experimental results
  • The only bigram combination we can test is
    (adjective, noun), since the other parts of
    speech are not included
  • We define a collocational window of 10 words,
    including punctuation marks

Variance analysis: from the corpus we compute the
distances and the standard deviation for all
combinations of (adjective, noun) bigrams
133
Variance analysis
134
Variance analysis
135
Variance analysis
136
Variance analysis
137
Variance analysis
138
Analysis of the X2 test
  • We formulate the null hypothesis of statistical
    independence between the two words that make up
    the sample
  • This means that the two words appear
    independently of one another within the sample,
    in which they are randomly distributed
  • We compute the X2 statistic in the way
    described above. The larger the value, the
    stronger the evidence for rejecting the null
    hypothesis

139
Analysis of the X2 test
140
Analysis of the X2 test
141
Conclusions
  • In this chapter we applied the X2 test to
    identify pairs of words that may form
    collocations.
  • This method outperforms classical variance
    analysis, which fails in the case of extreme
    values (outliers).
  • It also outperforms other methods that have
    been applied over time, such as the t-test, the
    likelihood (LL) ratio test and mutual
    information, because these methods have the
    disadvantage of assuming a parametric
    distribution of the data.

142
Conclusions
  • Moreover, the mutual information (MI) method
    compares the joint probability p(w1,w2) and
    requires the independent probabilities p(w1) and
    p(w2) to occur in some manner in the sample,
    which does not give a realistic picture in the
    case of low frequencies
  • ????? ????? bigrams ß?????a? st?? p??te? ??s??
    ßa?µ?????a? t?? µe??d?? t?? d?asp???? ?a? t?? X2
    e??????, se ???e pe??pt?s? ?µ?? e?ap??e?ta? st???
    e?d????? ???ss??????? ?a a???????s??? a?t? ta
    e???µata.

143
The X2 statistical test in word sense
disambiguation (Word Sense Disambiguation)
144
Sense disambiguation
  • The overwhelming majority of the words that
    appear in natural-language texts are
    polysemous, i.e. they appear with different
    meanings in different linguistic contexts.
  • E.g., the English word bank may in one context
    have the sense of a financial bank and in
    another the sense of a river bank.
  • Within the same framework of the X2 statistical
    test, and with the help of the electronic
    lexicon WordNet, we will develop a method for
    disambiguating the sense of a word that appears
    in a context.

145
Sense disambiguation
  • According to this method we work as follows.
  • We augment the context in which the word to be
    disambiguated appears with related senses
    (Related Synsets) from the electronic lexicon
    WordNet. We regard the augmented context as a
    statistical sample
  • We study the distribution of the related senses
    of each sense of the word to be disambiguated
    within this statistical sample

146
Sense disambiguation
  • We formulate the null hypothesis that all the
    related senses, i.e. the Related Synsets from
    WordNet, are normally distributed in the
    sample.
  • With the help of the X2 goodness-of-fit
    statistical test, we try to locate the sense
    whose related synsets deviate from this
    hypothesis.
  • We select this sense as the correct sense of
    the word to be disambiguated

147
Word disambiguation and WordNet
  • Assigning the correct sense to a word (target
    word) within the context formed by the
    surrounding words is the task of word
    disambiguation systems
  • It is recognized as one of the most difficult
    problems in natural language processing.
  • Many systems have been proposed over time, the
    majority of which rely on statistical
    processing methods

148
Word disambiguation and WordNet
  • The first systems were based on empirical rules
    and, using small lexicons (sense inventories),
    disambiguated a small number of cases
    [16, 17, 18]
  • Today, the availability of large electronic
    lexicons such as WordNet gives great impetus to
    the development of demanding word
    disambiguation applications [20, 21, 22].
  • Moreover, the fact that the various senses are
    linked to one another through a large number of
    semantic and lexical relations makes WordNet a
    valuable resource for representing the
    knowledge network

149
Word disambiguation and WordNet
  • Using definitions from WordNet and texts from
    the Internet, Mihalcea and Moldovan [23]
    gathered lexical information for disambiguating
    polysemous words
  • Montoyo and Palomar [24] presented a word
    disambiguation method that relies on the
    semantic classes of WordNet as well as on the
    definitions of the lexicon.
  • The work of Banerjee and Pedersen [25] presents
    an adaptation of the Lesk algorithm [16] that is
    based on the WordNet definitions.

150
Word disambiguation and WordNet
  • Besides the definitions, much work has also
    been carried out using the hierarchy relation
    of WordNet (Hypernymy/Hyponymy relation)
  • Resnik [21] performed disambiguation using the
    semantic similarity between two words,
    selecting the common ancestor with the greatest
    information content (the most informative
    subsumer), where he defined information content
    as a function of the number of subsumed terms
  • Leacock and Chodorow [26] proposed a measure
    for computing semantic similarity by counting
    the length of the path between the two nodes of
    the hierarchy

151
The statistical test approach
  • In the same spirit of applying statistical
    testing, we present an approach for word
    disambiguation systems based on a different
    view of how to estimate the measure of
    relatedness between the context of a word and
    each one of its senses separately
  • In WordNet each lexical entry represents a
    sense and is rendered by a set of synonymous
    words called a Synset.

152
The statistical test approach
  • The figure shows what entries look like in
    WordNet (in random order; the number at the
    start, in parentheses, is the serial number of
    the entry in WordNet).

153
The statistical test approach
  • So the entries in WordNet are called synsets.
    Each synset is unique, and we usually denote it
    with braces.
  • E.g. {city, metropolis, urban center},
    {man, adult male}, etc.
  • The major difference between WordNet and
    conventional dictionaries is the relations
    among the Synsets.
  • Each Synset is related to other Synsets through
    various relations. The figure below gives a
    picture of the situation in WordNet.

154
The statistical test approach
155
The statistical test approach
  • We call these related Synsets Related Synsets.
  • We encounter quite a few such relations in
    WordNet.
  • There are 32 such relations, distributed across
    the main parts of speech (a large proportion
    concern nouns and verbs, although there are of
    course also some for adjectives and adverbs)
  • (see the dissertation for a detailed
    presentation)

156
Using the Related Synsets for Word Sense
Disambiguation
  • The senses (Senses)
  • Each lexical form in a language appears with
    many senses in various sentences (contexts).
    E.g. the word bank may have, among others, the
    sense of a financial institution in one context
    and the sense of the bank of a river in
    another.
  • Since senses in WordNet are rendered by
    Synsets, the word bank simply appears in more
    than one Synset.
  • Below are the 10 Synsets in which the word bank
    appears, one for each of the 10 senses it has
    in WordNet.

157
1. (883) depository financial institution, bank,
   banking concern, banking company -- (a financial
   institution that accepts deposits and channels
   the money into lending activities; "he cashed a
   check at the bank"; "that bank holds the mortgage
   on my home")
2. (99) bank -- (sloping land (especially the slope
   beside a body of water); "they pulled the canoe
   up on the bank"; "he sat on the bank of the river
   and watched the currents")
3. (76) bank -- (a supply or stock held in reserve
   for future use (especially in emergencies))
4. (54) bank, bank building -- (a building in which
   commercial banking is transacted; "the bank is
   on the corner of Nassau and Witherspoon")
5. (7) bank -- (an arrangement of similar objects
   in a row or in tiers; "he operated a bank of
   switches")
6. (6) savings bank, coin bank, money box, bank --
   (a container (usually with a slot in the top)
   for keeping money at home; "the coin bank was
   empty")
7. (3) bank -- (a long ridge or pile; "a huge bank
   of earth")
8. (1) bank -- (the funds held by a gambling house
   or the dealer in some gambling games; "he tried
   to break the bank at Monte Carlo")
9. (1) bank, cant, camber -- (a slope in the turn
   of a road or track; the outside is higher than
   the inside in order to reduce the effects of
   centrifugal force)
10. bank -- (a flight maneuver; aircraft tips
    laterally about its longitudinal axis
    (especially in turning); "the plane w