The Web as a Parallel Corpus




The Web as a Parallel Corpus. Philip Resnik, Noah A. Smith, Computational Linguistics, 29(3), pp. 349-380, MIT Press, 2004. University of Maryland, Johns Hopkins ...


The Web as a Parallel Corpus
  • Philip Resnik, Noah A. Smith, Computational
    Linguistics, 29(3), pp. 349-380, MIT Press
  • University of Maryland, Johns Hopkins University

  • STRAND system for mining parallel text on the
    World Wide Web
  • reviewing the original algorithm and results
  • presenting a set of significant enhancements
  • use of supervised learning based on structural
    features of documents to improve classification
  • new content-based measure of translational
    equivalence
  • adaptation of the system to take advantage of the
    Internet Archive for mining parallel text from
    the Web on a large scale
  • construction of a significant parallel corpus for
    a low-density language pair

  • Parallel corpora, bitexts
  • for automatic lexical acquisition (Gale and
    Church 1991; Melamed 1997)
  • provide indispensable training data for
    statistical translation models (Brown et al.
    1990; Melamed 2000; Och and Ney 2002)
  • provide the connection between vocabularies in
    cross-language information retrieval (Davis and
    Dunning 1995; Landauer and Littman 1990; see also
    Oard 1997)

Recent work at UM and JHU
  • exploit parallel corpora in order to develop
    monolingual resources and tools, using a process
    of annotation, projection, and training
  • given a parallel corpus in English and a less
    resource-rich language
  • project English annotations across the parallel
    corpus to the second language
  • using word-level alignments as the bridge, and
    then use robust statistical techniques in
    learning from the resulting noisy annotations
  • (Cabezas, Dorr, and Resnik 2001; Diab and Resnik
    2002; Hwa et al. 2002; Lopez et al. 2002;
    Yarowsky, Ngai, and Wicentowski 2001; Yarowsky
    and Ngai 2001; Riloff, Schafer, and Yarowsky)

parallel corpora as a critical resource
  • not readily available in the necessary quantities
  • research has relied heavily on French-English
  • because the Canadian parliamentary proceedings
    (Hansards) in English and French were the only
    large bitext available
  • United Nations proceedings (LDC)
  • religious texts (Resnik, Olsen, and Diab 1999)
  • software manuals (Resnik and Melamed 1997
    Menezes and Richardson 2001)
  • tend to be unbalanced, representing primarily
    governmental or newswire-style texts

World Wide Web
  • People tend to see the Web as a reflection of
    their own way of viewing the world
  • a huge semantic network
  • an enormous historical archive
  • a grand social experiment
  • a great big body of text waiting to be mined
  • a huge fabric of linguistic data often interwoven
    with parallel threads

  • STRAND (Resnik 1998, 1999): Structural
    Translation Recognition, Acquiring Natural Data
  • identify pairs of Web pages that are mutual
    translations
  • Incorporating new work on content-based detection
    of translations (Smith 2001, 2002)
  • efficient exploitation of the Internet Archive

Finding parallel text on the Web
  • Location of pages that might have parallel
    translations
  • Generation of candidate pairs that might be
    translations
  • Structural filtering out of nontranslation
    candidate pairs

Locating Pages
  • AltaVista search engine's advanced search is used
    to search for two types of Web pages: parents and
    siblings
  • Ask AltaVista with query expressions such as
  • (anchor:"english" OR anchor:"anglais") AND
    (anchor:"french" OR anchor:"francais")
  • spider component
  • The results reported here do not make use of the
    spider

Generating Candidate Pairs
  • for example, given a page with URL
    http// en.html, one combination of
    substitutions might produce the URL
    http// ch.html
  • Another possible criterion for matching is the
    use of document lengths.
  • for text E in language 1 and text F in language 2,
    length(E) ≈ C · length(F), where C is a constant
    tuned for the language pair
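
The two criteria on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the substitution table, the example.org URLs, the function names, and the tolerance parameter are all hypothetical and do not come from the STRAND system itself.

```python
from itertools import product

# Hypothetical substitution table (an assumption for illustration):
# substring on the English side -> possible substrings on the Chinese side.
SUBSTITUTIONS = {
    "english": ["chinese"],
    "_en": ["_ch", "_zh"],
}

def candidate_urls(url, substitutions=SUBSTITUTIONS):
    """Return the URLs produced by every combination of substitutions."""
    present = [s for s in substitutions if s in url]
    # For each substring found, either keep it (None) or swap in a replacement.
    options = [[None] + substitutions[s] for s in present]
    results = set()
    for choice in product(*options):
        new_url = url
        for source, target in zip(present, choice):
            if target is not None:
                new_url = new_url.replace(source, target)
        if new_url != url:
            results.add(new_url)
    return results

def plausible_lengths(len_e, len_f, c=1.0, tolerance=0.3):
    """Check length(E) ≈ C · length(F); the tolerance is an assumed parameter."""
    expected = c * len_f
    return abs(len_e - expected) <= tolerance * expected
```

A generated URL only becomes a candidate pair if a page actually exists at that address; the length check is a cheap additional filter before any structural comparison.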

Structural Filtering
  • linearize the HTML structure and ignore the
    actual linguistic content of the documents
  • align the linearized sequences using a standard
    dynamic programming technique (Hunt and McIlroy)
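
As a rough sketch of the idea, the Python below linearizes two pages into structural tokens and aligns them. Several things here are assumptions: the token names, the use of difflib.SequenceMatcher as a stand-in for the dynamic programming alignment cited on the slide, and the definition of the difference percentage as the share of tokens left unaligned.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Reduce an HTML page to a flat sequence of structural tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag}]")

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Keep only the length of the text chunk, never its words.
            self.tokens.append(f"[CHUNK:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

def kind(token):
    """Collapse chunk tokens so chunks align regardless of their length."""
    return "[CHUNK]" if token.startswith("[CHUNK:") else token

def difference_percentage(html_a, html_b):
    """Percentage of tokens (both pages combined) left out of the alignment."""
    a = [kind(t) for t in linearize(html_a)]
    b = [kind(t) for t in linearize(html_b)]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 100.0 * (total - 2 * matched) / total if total else 0.0
```

Two pages with identical markup and different text score 0, while an extra list or table on one side raises the score. The chunk lengths kept by linearize could additionally be compared for aligned chunks, which this sketch leaves out.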

STRAND Results (1)
  • Recall in this setting is measured relative to
    the set of candidate pairs that was generated
  • Precision is assessed by asking human judges:
  • Was this pair of pages intended to provide the
    same content in the two different languages?
  • Asking the question in this way leads to high
    rates of interjudge agreement, as measured using
    Cohen's κ (kappa) measure
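
For reference, Cohen's kappa compares the observed agreement between two judges with the agreement expected by chance from each judge's label frequencies. The function below is a minimal sketch; the function name and the yes/no labels in the test data are illustrative and not taken from the evaluation itself.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the judges' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.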

STRAND Results (2)
  • Using the manually set thresholds for dp and n, we
    obtained 100% precision and 68.6% recall in
    an experiment using STRAND to find English-French
    Web pages (Resnik 1999)
  • using STRAND to obtain English-Chinese pairs, in a
    similar formal evaluation, we found that the
    resulting set had 98% precision and 61% recall

Assessing the STRAND Data
  • a translation lexicon automatically extracted
    from the French-English STRAND data could be
    combined productively with a bilingual
    French-English dictionary in order to improve
    retrieval results using a standard cross-language
    IR test collection (English queries against the
    CLEF-2000 French collection)
  • backing off from the dictionary to the STRAND
    translation lexicon accounted for over 8% of the
    lexicon matches (by token)
  • reducing the number of untranslatable terms by a
    third and producing a statistically significant
    12% relative improvement in mean average
    precision as compared to using the dictionary
    alone
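
The backoff strategy can be pictured as a two-level lookup: try the bilingual dictionary first, and consult the automatically extracted lexicon only when the dictionary has no entry. The sketch below assumes both resources are plain mappings from a query term to a list of translations; the function name and the toy entries are hypothetical.

```python
def translate_term(term, dictionary, strand_lexicon):
    """Look a query term up in the dictionary, backing off to the Web-derived lexicon."""
    if term in dictionary:
        return dictionary[term], "dictionary"
    if term in strand_lexicon:
        # Backoff: only reached when the dictionary has no entry.
        return strand_lexicon[term], "strand"
    return [], "untranslatable"
```

Counting how often the second branch fires, by token, gives the kind of backoff rate reported on this slide.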

  • 30 human-translated sentence pairs from the FBIS
    (Release 1) English-Chinese parallel corpus,
    sampled at random.
  • 30 Chinese sentences from the FBIS corpus,
    sampled at random, paired with their English
    machine translation output from AltaVista
  • 30 paired items from Chinese-English Web data,
    sampled at random from sentence-like aligned
    chunks as identified using the HTML-based chunk
    alignment process of STRAND

Optimizing Parameters Using Machine Learning
  • Using the English-French data, we constructed a
    ninefold cross-validation experiment using
    decision tree induction to predict the class
    assigned by the human judges
  • The decision tree software was the widely used

Content-Based Matching
  • As Ma and Liberman (1999) point out, not all
    translators create translated pages that look
    like the original page
  • structure-based matching is applicable only in
    corpora that include markup
  • there are other applications for translation
    detection
  • a generic score of translational similarity that
    is based upon any word-to-word translation
    lexicon
  • Define a cross-language similarity score, tsim

Quantifying Translational Similarity
  • ???????Doc?structure
  • ????????sim
  • ????lexicon
  • (Melamed's (2000) Method A)
  • let a link be a pair (x, y)
  • maximum-weighted bipartite matching problem
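
A simplified version of such a score can be sketched as follows. The assumptions are significant: every lexicon entry is given equal weight, so the maximum-weighted matching reduces to a maximum-cardinality bipartite matching (computed here with a plain augmenting-path search); words left unmatched are treated as linked to NULL; and the score is taken to be the share of two-word links among all links. The function names are illustrative.

```python
def max_matching(left, right, lexicon):
    """Size of a maximum-cardinality matching between token positions."""
    # Position i on the left may link to position j on the right
    # when the word pair is listed in the translation lexicon.
    adjacency = [
        [j for j, y in enumerate(right) if (x, y) in lexicon] for x in left
    ]
    match_right = [-1] * len(right)

    def try_augment(i, seen):
        for j in adjacency[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_right[j] == -1 or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(left)))

def tsim(left, right, lexicon):
    """Share of two-word links among all links (unmatched words link to NULL)."""
    two_word = max_matching(left, right, lexicon)
    null_links = (len(left) - two_word) + (len(right) - two_word)
    total = two_word + null_links
    return two_word / total if total else 0.0
```

With a weighted lexicon the same structure applies, but the matching step would need a weighted assignment algorithm instead of this unweighted search, and the recursive search here is only suited to short texts.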

Exploiting the Internet Archive
  • The Internet Archive (120 TB, 8 TB per month;
    http// ) is a
    nonprofit organization attempting to archive the
    entire publicly available Web, preserving the
    content and providing free access to researchers,
    historians, scholars, and the general public.
  • Wayback Machine
  • Temporal database, but not stored in temporal
    order
  • Need decompression
  • Almost all data are stored in compressed
    plain-text files

STRAND on the Archive
  • Extracting URLs from index files using simple
    pattern matching
  • Combining the results from step 1 into a single
    huge list
  • Grouping URLs into buckets by handles
  • Generating candidate pairs from buckets

URL similarity
  • We arrived at an algorithmically simple solution
    that avoids this problem but is still based on
    the idea of language-specific substrings (LSSs).
  • The idea is to identify a set of
    language-specific URL substrings that pertain to
    the two languages of interest (e.g., based on
    language names, countries, character codeset
    labels, abbreviations, etc.)
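
One way to picture the LSS idea together with the bucketing step: strip a language-specific substring from each URL to obtain a language-neutral handle, group URLs by handle, and pair URLs of different languages that share a bucket. The sketch below is an assumption-laden illustration, not the system's actual procedure; the LSS lists, the "*" placeholder, the delimiter rule, and the example.com URLs are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical language-specific substrings (LSSs) for each language.
LSS = {
    "en": ["english", "eng", "en"],
    "ar": ["arabic", "ara", "ar"],
}

def handle_for(url, substrings):
    """Replace the first delimited LSS with '*' to form a handle, or return None."""
    for s in sorted(substrings, key=len, reverse=True):
        # Require non-alphanumeric neighbors so "en" does not match inside "center".
        pattern = re.compile(
            rf"(?<![A-Za-z0-9]){re.escape(s)}(?![A-Za-z0-9])", re.IGNORECASE
        )
        if pattern.search(url):
            return pattern.sub("*", url, count=1)
    return None

def candidate_pairs(urls, lss=LSS, lang_a="en", lang_b="ar"):
    """Bucket URLs by handle and pair URLs of the two languages per bucket."""
    buckets = defaultdict(lambda: {lang_a: [], lang_b: []})
    for url in urls:
        for lang in (lang_a, lang_b):
            handle = handle_for(url, lss[lang])
            if handle is not None:
                buckets[handle][lang].append(url)
    pairs = []
    for bucket in buckets.values():
        for a in bucket[lang_a]:
            for b in bucket[lang_b]:
                pairs.append((a, b))
    return pairs
```

Because each URL is touched once and comparisons happen only inside a bucket, this avoids comparing every URL with every other URL in a very large list.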

Building an English-Arabic Corpus
  • Finding English-Arabic Candidate Pairs on the
    Internet Archive
  • 24 top-level national domains
  • Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw)
  • 21 .com

Future Work
  • Weights in the dictionary can be exploited
  • IR IDF scores for discerning
  • Bootstrapping
  • Seed to form high-precision initial classifiers

Search Engine Estimate
  • According to (Dec. 31, 2002) www.searchengineshowd
  • Google: 3,033 million
  • AltaVista: 1,689 million 11.795
  • estimates are based on exact counts obtained from
    AlltheWeb, multiplied by the percentage of Relative
    Size Showdown as compared to the number found by
  • Our estimate for Chinese page count
  • ASBC?? ???? ABC ??? ??? ??????5915?term???
    ta?relative count
    5021? 2181?google?altavista?2.3?