Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics


1
Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics

2a. Course projects
Jonas Kuhn

Universität Potsdam, 2007

2
Course requirements

Exercises (not graded)
2-3 larger programming assignments (submissions are graded)
Participation in a project team (2-5 members each)
Connected to an overall course project (see below)
Research on a subtopic (literature and/or available tools/resources)
(Short) presentation of results / possibly a small tutorial on techniques of general interest
Experiments with tools, or programming
Documentation of the project work (broken down by participant)

3
The Spock Challenge

The Entity Resolution Problem
A common problem that we face is that there are
many people with the same name. Given that, how
do we distinguish a document about Michael
Jackson the singer from Michael Jackson the
football player?
Worldwide contest for a software solution
http://challenge.spock.com/
Winning team receives a $50,000 prize
(NOTE THE RULES! Upon acceptance of the prize, the
winning Software Submissions and all source code
and algorithms related thereto become the sole
and exclusive property of Spock.)

4
The Spock Challenge

With billions of documents and people on the web,
we need to identify and cluster web documents
accurately to the people they are related to.
Mapping these named entities from documents to
the correct person is the essence of the Spock
Challenge.

5
The Spock Challenge

Data set
The complete data-set is divided into training
and test sets containing roughly 25,000 and
75,000 documents, respectively.
Along with a set of documents we've included a
set of target names. You can assume that each
document contains only one of the target names
(even though most documents contain many names).
The challenge is to partition all the documents
relevant to a target name by their referent.
Consider the following two documents with the
target name "Michael Jackson":
Michael Jackson - The King of Pop or Wacko Jacko?
Michael Jackson statistics - pro-football-reference.com
The referents of these articles are the pop star
and football player, respectively. We've included
the ground truth for the training set so you have
something to compare against.
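
The partitioning task described on this slide can be illustrated with a minimal sketch. It is not the method used by Spock or prescribed in the course; it simply assumes a bag-of-words representation, cosine similarity, and a greedy one-pass clustering in which a document joins the first existing cluster that contains a sufficiently similar document. The function names, the threshold value, and the toy texts in the usage note are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a deliberately crude document representation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def partition_by_referent(docs, target_name, threshold=0.2):
    """Greedy one-pass clustering of the documents for one target name.

    docs maps a document id to its text. The words of the target name are
    dropped from the vectors, since they occur in every document and say
    nothing about the referent. The threshold is an arbitrary illustrative
    value (an assumption), not a tuned one.
    """
    name_words = set(bag_of_words(target_name))
    vectors = {
        doc_id: Counter(
            {w: c for w, c in bag_of_words(text).items() if w not in name_words}
        )
        for doc_id, text in docs.items()
    }

    clusters = []  # each cluster is a list of document ids
    for doc_id, vec in vectors.items():
        for cluster in clusters:
            # Join the first cluster holding at least one similar document.
            if any(cosine(vec, vectors[other]) >= threshold for other in cluster):
                cluster.append(doc_id)
                break
        else:
            # No similar cluster found: this document starts a new referent.
            clusters.append([doc_id])
    return clusters
```

Called with a dictionary of a few short texts about the singer and the football player and the target name "Michael Jackson", the function returns one list of document ids per hypothesised referent. A realistic system would need better features (for example term weighting or named entities from the document) and a clustering method that can merge clusters after the fact.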

6
The Spock Challenge

Test/Application
Once you're done training, you can run your
algorithm on the test set and submit your results
on this site (http://challenge.spock.com/).
We will provide instant feedback in the form of a
percentage rank score (using the F-measure). This
way you can see how you stack up against the
other teams. What good is a problem without a
little competition?
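
The slide does not say which variant of the F-measure the contest uses. As an illustration only, the sketch below assumes a pairwise F-measure, one common way to score a clustering against ground truth: precision and recall are computed over pairs of documents placed in the same cluster. The function names and the beta parameter are hypothetical choices for this example.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered pairs of document ids that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f_measure(predicted, gold, beta=1.0):
    """F-measure over same-cluster document pairs (assumed scoring variant).

    precision = correctly paired / all pairs the system put together
    recall    = correctly paired / all pairs that belong together
    """
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, if the ground truth is [["d1", "d3"], ["d2", "d4"]] and a system returns [["d1", "d3", "d2"], ["d4"]], one of the three predicted pairs is correct (precision 1/3) and one of the two gold pairs is found (recall 1/2), giving an F-measure of 0.4 with beta = 1.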

7
Course projects inspired by the Spock Challenge