Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics

1
Statistische Methoden in der Computerlinguistik
Statistical Methods in Computational Linguistics
  • 2a. Course projects
  • Jonas Kuhn
  • Universität Potsdam, 2007

2
Course requirements
  • Exercises (not graded)
  • 2-3 larger programming assignments (submissions
    are graded)
  • Participation in a project team (2-5 members
    each)
  • Connection to an overall course project (see
    below)
  • Research on a sub-topic (literature and/or
    available tools/resources)
  • (Short) presentation of results / possibly a small
    tutorial on techniques of general interest
  • Experiments with tools, or programming
  • Documentation of the project work (broken down by
    participant)

3
The Spock Challenge
  • The Entity Resolution Problem
  • A common problem that we face is that there are
    many people with the same name. Given that, how
    do we distinguish a document about Michael
    Jackson the singer from Michael Jackson the
    football player?
  • World-wide contest for a software solution
  • http://challenge.spock.com/
  • Winning team receives $50,000 prize
  • (NOTE RULES! Upon acceptance of the prize, the
    winning Software Submissions and all source code
    and algorithms related thereto becomes the sole
    and exclusive property of Spock.)

4
The Spock Challenge
  • With billions of documents and people on the web,
    we need to identify and cluster web documents
    accurately to the people they are related to.
  • Mapping these named entities from documents to
    the correct person is the essence of the Spock
    Challenge.

5
The Spock Challenge
  • Data set
  • The complete data-set is divided into training
    and test sets containing roughly 25,000 and
    75,000 documents, respectively.
  • Along with a set of documents we've included a
    set of target names. You can assume that each
    document contains only one of the target names
    (even though most documents contain many names).
  • The challenge is to partition all the documents
    relevant to a target name by their referent.
    Consider the following two documents with the
    target name "Michael Jackson": "Michael Jackson
    - The King of Pop or Wacko Jacko?" and "Michael
    Jackson statistics - pro-football-reference.com".
    The referents of these articles are the pop star
    and football player, respectively. We've included
    the ground truth for the training set so you have
    something to compare against.
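
A minimal sketch of what "partitioning by referent" could look like, assuming a plain bag-of-words representation, cosine similarity and a greedy one-pass clustering. None of this is prescribed by the challenge; the function names, the threshold value and the example documents are hypothetical. The target name is removed from every document because it occurs in all of them and therefore cannot separate referents.

```python
import math
import re
from collections import Counter


def bag_of_words(text, ignore=frozenset()):
    """Lowercased word counts, skipping the words in `ignore`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in ignore)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def partition_by_referent(docs, target_name, threshold=0.2):
    """Greedy one-pass clustering (an illustrative baseline, not the
    challenge's method): a document joins the first cluster that has a
    member at least `threshold` similar, otherwise it starts a new one."""
    ignore = frozenset(target_name.lower().split())
    vectors = [bag_of_words(d, ignore) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if any(cosine(vec, vectors[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical toy documents, all containing the target name.
docs = [
    "Michael Jackson the King of Pop released the album Thriller",
    "Michael Jackson statistics football receiver touchdowns season",
    "The pop album Thriller by singer Michael Jackson",
    "Football season statistics for receiver Michael Jackson",
]
print(partition_by_referent(docs, "Michael Jackson"))  # [[0, 2], [1, 3]]
```

On real web pages the raw word counts would probably need term weighting and a tuned threshold; the sketch only shows the shape of the task: documents in, groups of document indices out.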

6
The Spock Challenge
  • Test/Application
  • Once you're done training, you can run your
    algorithm on the test set and submit your results
    on this site. (http://challenge.spock.com/)
  • We will provide instant feedback in the form of a
    percentage rank score (using the F-measure). This
    way you can see how you stack up against the
    other teams. What good is a problem without a
    little competition?
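
The slide does not say which variant of the F-measure the contest site uses. As one common way to score a clustering against the ground truth, the sketch below computes a pairwise F-measure: precision and recall over pairs of documents placed in the same cluster. The function names and the toy labels are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """All unordered document pairs that share a cluster label.

    `labels` maps a document id to its cluster label."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_f_measure(predicted, gold, beta=1.0):
    """F-measure over same-cluster pairs (an assumed variant, not
    necessarily the one used for the contest ranking)."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical example: the system wrongly puts d3 with the singer.
gold = {"d1": "singer", "d2": "singer", "d3": "player", "d4": "player"}
predicted = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
print(round(pairwise_f_measure(predicted, gold), 3))  # 0.4
```

In the example, one of the three predicted pairs is correct (precision 1/3) and one of the two gold pairs is found (recall 1/2), which gives F = 0.4.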

7
Course projects inspired by Spock challenge
  • Experiment with various (mostly statistical) NLP
    techniques on the data set
  • Any Ideas?

8
Sub-tasks (we need a team for each)
  • State of the Art in Entity Resolution (a.k.a.
    deduplication, or merge-purge)
  • Clustering
  • Starting point: Manning/Schütze 1999, ch. 14
  • Information/Document Retrieval (?)
  • Starting point: Manning/Schütze 1999, ch. 15
  • Term weighting techniques
  • Possibly build additional data sets
  • Named Entity Detection
  • Coreference Resolution
  • Parsing, Semantic Role Labelling
  • Using WordNet (and other ontological resources)
  • Using Wikipedia (and other encyclopaedic
    resources)
  • Word Sense Disambiguation (possibly similar
    techniques)
  • Software Integration, Testing