A DOM Tree Alignment Model for Mining Parallel Data from the Web

1
A DOM Tree Alignment Model for Mining Parallel
Data from the Web
Lei Shi, Cheng Niu, Ming Zhou, Jianfeng
Gao (COLING-ACL 2006)
  • Presented by Luo Weihua
  • 2006-11-20

2
Outline
  • Introduction
  • Related Work
  • A New Parallel Data Mining Scheme
  • Experimental Results
  • Conclusion

3
Outline
  • Introduction
  • Related Work
  • A New Parallel Data Mining Scheme
  • Experimental Results
  • Conclusion

4
Introduction
  • Parallel bilingual corpora
  • Critical resources for SMT, CLIR, etc.
  • Not readily available for most language pairs
  • Limited size
  • Limited domain
  • Government documents
  • Newswire texts
  • Web mining: a promising solution
  • 15,000 German-English bilingual web sites in the
    .de domain (Ma and Liberman, 1999)

5
Introduction
  • Existing parallel web document mining systems
  • Strategies
  • Use pre-defined URL patterns to discover
    candidate parallel documents within a site
  • Use content-based features to verify the
    translational equivalence of the candidate pairs
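
The URL-pattern strategy above can be sketched as follows. This is a minimal illustration, not any particular system's implementation: the marker pairs in `PATTERNS` and the function name `candidate_pairs` are hypothetical, and real systems rely on larger hand-built pattern lists.

```python
# Hypothetical language-marker pairs (English marker, Chinese marker).
PATTERNS = [("/en/", "/zh/"), ("_en.", "_cn."), ("english", "chinese")]

def candidate_pairs(urls):
    """Pair URLs that become identical once a language marker is swapped."""
    url_set = set(urls)
    pairs = []
    for url in urls:
        for en_marker, zh_marker in PATTERNS:
            if en_marker in url:
                guess = url.replace(en_marker, zh_marker)
                if guess in url_set:
                    pairs.append((url, guess))
    return pairs

urls = [
    "http://a.example/en/news.html",
    "http://a.example/zh/news.html",
    "http://a.example/about_en.html",
]
pairs = candidate_pairs(urls)
```

A page whose translation uses a naming scheme outside the list (here `about_en.html`) is simply never paired. Each surviving candidate pair would still need content-based verification.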

6
Introduction
  • Existing parallel web document mining systems
  • Drawbacks
  • Ineffective for bilingual websites with varied
    naming schemes
  • High bandwidth cost and slow download speed
  • Low sentence alignment accuracy

7
Outline
  • Introduction
  • Related Work
  • A New Parallel Data Mining Scheme
  • Experimental Results
  • Conclusion

8
Related Work
  • Cross-Language Information Retrieval based on
    Parallel Texts and Automatic Mining of Parallel
    Texts from the Web (Jian-Yun Nie, Michel
    Simard, et al., 1999)
  • Selection of candidate sites
  • Link anchor text as an indicator, e.g. "French
    Version"
  • Selection of candidate documents from candidate
    sites
  • Parallel texts usually have similar names

9
Related Work
  • BITS: A Method for Bilingual Text Search over the
    Web (Xiaoyi Ma, Mark Y. Liberman, 1999)
  • Candidate Websites Generation
  • Focus on some specific domains for specific
    language pairs
  • Content-based Translation Pairs Finder

10
Related Work
  • The Web as a Parallel Corpus (Philip Resnik and
    Noah A. Smith, 2003)
  • STRAND system (structural translation
    recognition, acquiring natural data)
  • Locating pages: keywords submitted to search
    engines
  • Generating Candidate Pairs
  • URL matching algorithm
  • Document length: len(E) ≈ f(len(F))
  • Structural Filtering
  • Linearize the HTML structure and ignore the
    actual linguistic content of the document
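
Structural filtering of this kind can be sketched with Python's standard `html.parser`. The token format (`[START:tag]`, `[END:tag]`, `[Chunk:length]`) is modeled on the STRAND description; the class and function names are assumptions made for this example.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Turn an HTML page into a flat token sequence, dropping the text itself."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag}]")

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Only the length of the text chunk is kept, not its content.
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

def markup_only(tokens):
    """Keep only the tag tokens, for a purely structural comparison."""
    return [t for t in tokens if not t.startswith("[Chunk")]

page_e = "<html><body><h1>News</h1><p>Hello world</p></body></html>"
page_f = "<html><body><h1>Nouvelles</h1><p>Bonjour</p></body></html>"
same_structure = markup_only(linearize(page_e)) == markup_only(linearize(page_f))
```

Two pages with different text but identical markup linearize to the same tag sequence, which is the property a structural filter exploits.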

11
Related Work
  • Discovering Parallel Text from the World Wide Web
    (Jisong Chen, Rowena Chau, et al., 2004)

12
Related Work
  • Automatic Acquisition of Chinese-English Parallel
    Corpus from the Web (Ying Zhang, Ke Wu, et al.,
    2006)
  • Web Parallel Data Extraction (WPDE)
  • Candidate sites selection and crawling
  • Anchor text, image ALT text
  • Extraction of candidate pairs of parallel texts
  • Pattern matching, edit-distance similarity
    measure
  • Parallel text pairs validation
  • KNN classifier
  • File length
  • File structure
  • Content translation

13
Outline
  • Introduction
  • Related Work
  • A New Parallel Data Mining Scheme
  • Experimental Results
  • Conclusion

14
Sub-outline
  • A New Parallel Data Mining Scheme
  • Profile
  • Candidate site determination
  • DOM Tree Alignment Model
  • Sentence Alignment Model
  • Web Document Pair Verification Model

15
Profile
16
Sub-outline
  • A New Parallel Data Mining Scheme
  • Profile
  • Candidate site determination
  • DOM Tree Alignment Model
  • Sentence Alignment Model
  • Web Document Pair Verification Model

17
Candidate site determination
  • Steps
  • Given a web site, the root page and web pages
    directly linked from it are downloaded
  • For each page, use predefined trigger strings to
    compare all its anchor texts
  • Triggers for English translation: English, English
    Version, ??, ???
  • Triggers for Chinese translation: Chinese, Chinese
    Version, ??, ???
  • If both categories of trigger words are found,
    the web site is considered bilingual
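
The steps above can be sketched as below. This is an illustrative sketch only: downloading is left out (the pages are passed in as HTML strings), the names `AnchorTextCollector` and `is_bilingual_site` are hypothetical, and treating the triggers as found "anywhere among the downloaded pages" is an assumption about how the two categories are combined.

```python
from html.parser import HTMLParser

# Only the Latin-script triggers named on the slide are listed; a real list
# would also hold the Chinese-script triggers.
EN_TRIGGERS = {"english", "english version"}
ZH_TRIGGERS = {"chinese", "chinese version"}

class AnchorTextCollector(HTMLParser):
    """Collect the visible text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.buffer = []
        self.anchor_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_anchor:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            self.anchor_texts.append("".join(self.buffer).strip().lower())
            self.in_anchor = False

def anchor_texts(html):
    collector = AnchorTextCollector()
    collector.feed(html)
    collector.close()
    return collector.anchor_texts

def is_bilingual_site(pages):
    """pages: HTML of the root page and of the pages directly linked from it."""
    found_en = found_zh = False
    for html in pages:
        for text in anchor_texts(html):
            found_en = found_en or text in EN_TRIGGERS
            found_zh = found_zh or text in ZH_TRIGGERS
    return found_en and found_zh
```

For example, a Chinese root page linking to "English Version" plus an English page linking back to "Chinese" would be flagged as bilingual.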

18
Sub-outline
  • A New Parallel Data Mining Scheme
  • Profile
  • Candidate site determination
  • DOM Tree Alignment Model
  • Sentence Alignment Model
  • Web Document Pair Verification Model

20
DOM Tree Alignment Model
  • Document Object Model (DOM)

The Document Object Model is a platform- and
language-neutral interface that will allow
programs and scripts to dynamically access and
update the content, structure and style of
documents. It defines the logical structure of
documents and the way a document is accessed and
manipulated. --W3C
21
DOM Tree Alignment Model
22
DOM Tree Alignment Model
23
DOM Tree Alignment Model
  • Minor modification
  • Only Element nodes and Text nodes are kept
  • ALT attribute is represented as Text node
  • The Text node and its parent Element node are
    combined into one node
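
The modifications above can be illustrated with a small tree builder on top of Python's standard `html.parser`. This is a sketch under assumptions: the `Node` class, the `VOID` tag set, and the tolerant end-tag handling are choices made for this example, not details taken from the slides.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID = {"img", "br", "hr", "meta", "link", "input"}

@dataclass
class Node:
    tag: str
    text: str = ""                       # text merged into its parent element
    children: list = field(default_factory=list)

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        alt = dict(attrs).get("alt")
        if alt:
            node.text = alt              # ALT attribute kept as the node's text
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore unmatched end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            parent = self.stack[-1]
            parent.text = (parent.text + " " + text).strip()

def build(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root.children[0]

tree = build(
    "<html><head><title>People</title></head><body><!-- note -->"
    '<img alt="Logo"><table><tr><td>Shady Grove</td><td>Aeolian</td></tr>'
    "</table></body></html>"
)
```

Comments and other node types are dropped because the parser's default handlers for them do nothing, so only element nodes (with their merged text) remain.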

25
DOM Tree Alignment Model
  • Comparison of synchronous tree substitution grammar
    (STSG) and the DOM tree alignment model (DTAM)
  • Similarities
  • Both support node deletion, insertion and substitution
  • Both define the alignment as a process that keeps
    the tree hierarchy invariant
  • Differences
  • DTAM requires the alignment to keep the sequential
    order invariant as well
  • DTAM searches for the best alignment given that
    both trees are known, while STSG works in the
    context of language generation

26
Definition
  • T^D: the DOM tree of document D
  • Example tree: <html> with children <head>
    (containing <title> "People") and <body>
    (containing <table>, <tbody>, two <tr> nodes, and
    <td> nodes "Shady Grove" and "Aeolian")
27
Definition
  • N^D_i: the i-th node of T^D
  • T^D_i: the subtree of T^D rooted at N^D_i
  • (Marked on the same example "People" DOM tree)
28
Definition
  • T^D[i,j]: the forest consisting of the subtrees
    T^D_i through T^D_j
  • (Marked on the same example "People" DOM tree)
29
Definition
  • N^D_k.C_j: the j-th child of node N^D_k
  • N^D_i.l: the HTML tag (label) of node N^D_i
  • N^D_i.t: the text of node N^D_i
  • (Marked on the same example "People" DOM tree)
30
Definition
  • N^D_k.C[m,n]: the consecutive children of N^D_k
    from C_m to C_n
  • (Marked on the same example "People" DOM tree)
31
Definition
  • N^D_k.TC_j: the subtree rooted at N^D_k.C_j
  • (Marked on the same example "People" DOM tree)
32
Definition
  • N^D_k.TC[m,n]: the forest of subtrees rooted at
    N^D_k.C[m,n]
  • (Marked on the same example "People" DOM tree)
33
Definition
  • Pr(T^F_m | T^E_i): probability of translating
    subtree T^E_i into subtree T^F_m
  • Pr(N^F_m | N^E_i): probability of translating node
    N^E_i into N^F_m
  • Pr(T^F[m,n] | T^E[i,j], A): probability of
    translating forest T^E[i,j] into T^F[m,n] based on
    the alignment A

34
DOM Tree Alignment Model
  • Problem
  • Search
  • A = argmax_A Pr(A | T^F, T^E), with
    Pr(A | T^F, T^E) ∝ Pr(T^F | T^E, A) Pr(A | T^E)
  • p_d is the probability of a node deletion occurring
    in A
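
The search criterion follows from Bayes' rule, because Pr(T^F | T^E) does not depend on the alignment. A short derivation, writing \hat{A} for the best alignment (a symbol introduced here for clarity):

```latex
\begin{aligned}
\hat{A} &= \arg\max_{A} \Pr(A \mid T^F, T^E) \\
        &= \arg\max_{A} \frac{\Pr(T^F, A \mid T^E)}{\Pr(T^F \mid T^E)} \\
        &= \arg\max_{A} \Pr(T^F \mid T^E, A)\,\Pr(A \mid T^E)
\end{aligned}
```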

35
DOM Tree Alignment Model
  • Alignment configuration of A (1)
  • N^F_l is aligned with N^E_i, and the children of
    N^F_l are aligned with the children of N^E_i

36
DOM Tree Alignment Model
  • Alignment configuration of A (2)
  • N^F_l is deleted, and the children of N^F_l are
    aligned with N^E_i

37
DOM Tree Alignment Model
  • Alignment configuration of A (3)
  • N^E_i is deleted, and N^F_l is aligned with the
    children of N^E_i

38
DOM Tree Alignment Model
  • Alignment configuration of A (4)
  • T^F_m is aligned with T^E_i, and T^F[m+1,n] is
    aligned with T^E[i+1,j]

39
DOM Tree Alignment Model
  • Alignment configuration of A (5)
  • T^F_m is deleted, and T^F[m+1,n] is aligned with
    T^E[i,j]

40
DOM Tree Alignment Model
  • Alignment configuration of A (6)
  • T^E_i is deleted, and T^F[m,n] is aligned with
    T^E[i+1,j]

41
DOM Tree Alignment Model
  • Alignment configuration of A (7)
  • N^F_m is deleted, and N^F_m's children N^F_m.C[1,K]
    are combined with T^F[m+1,n] to align with T^E[i,j]

42
DOM Tree Alignment Model
  • Alignment configuration of A (8)
  • N^E_i is deleted, and N^E_i's children N^E_i.C[1,K]
    are combined with T^E[i+1,j] to align with T^F[m,n]
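
Configurations (4) to (6) have the shape of an order-preserving sequence alignment. The sketch below is a deliberate simplification rather than the model itself: it aligns two flat sequences of leaf nodes, ignores the child-promotion cases (7) and (8), scores every deletion with a single assumed `p_d`, and takes a hypothetical `node_prob` function in place of the node translation probability.

```python
import math
from functools import lru_cache

def align_sequences(f_nodes, e_nodes, node_prob, p_d=0.1):
    """Best order-preserving alignment of two flat node sequences.

    Returns (log-probability, alignment), where alignment is a list of
    (f_index, e_index) pairs and None marks a deleted node.
    """
    log_pd = math.log(p_d)

    @lru_cache(maxsize=None)
    def best(m, i):
        if m == len(f_nodes) and i == len(e_nodes):
            return 0.0, ()
        options = []
        # Like (4): align the first remaining F node with the first E node.
        if m < len(f_nodes) and i < len(e_nodes):
            p = node_prob(f_nodes[m], e_nodes[i])
            if p > 0:
                score, rest = best(m + 1, i + 1)
                options.append((math.log(p) + score, ((m, i),) + rest))
        # Like (5): delete the first remaining F node.
        if m < len(f_nodes):
            score, rest = best(m + 1, i)
            options.append((log_pd + score, ((m, None),) + rest))
        # Like (6): delete the first remaining E node.
        if i < len(e_nodes):
            score, rest = best(m, i + 1)
            options.append((log_pd + score, ((None, i),) + rest))
        return max(options, key=lambda option: option[0])

    score, alignment = best(0, 0)
    return score, list(alignment)

# Toy node probability: high when the nodes "translate" to each other.
toy_prob = lambda f, e: 0.9 if f.upper() == e else 0.01
score, alignment = align_sequences(["a", "x", "b"], ["A", "B"], toy_prob)
```

In the toy run, the extra node `x` has no good counterpart, so deleting it and aligning the remaining nodes in order gives the highest score.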

43
DOM Tree Alignment Model
  • Node translation probability
  • Text translation probability
  • IBM model 1
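
For reference, IBM Model 1 scores a target word sequence against a source sequence by summing word translation probabilities over all source positions, including a NULL word. A minimal sketch, where the table `t`, the function name, and the toy entries are illustrative assumptions:

```python
def ibm1_prob(f_words, e_words, t, epsilon=1.0):
    """Pr(f | e) under IBM Model 1.

    t maps (f_word, e_word) pairs to word translation probabilities;
    missing pairs count as probability 0.
    """
    e_with_null = ["NULL"] + list(e_words)
    # Uniform alignment prior: each f word may link to any of the l + 1 e words.
    prob = epsilon / (len(e_with_null) ** len(f_words))
    for f in f_words:
        prob *= sum(t.get((f, e), 0.0) for e in e_with_null)
    return prob

# Toy translation table (hypothetical values).
t = {("maison", "house"): 0.8, ("la", "the"): 0.7, ("la", "NULL"): 0.1}
p = ibm1_prob(["la", "maison"], ["the", "house"], t)
```

Here the result is (1 / 3^2) × (0.1 + 0.7) × 0.8, since each of the two target words sums over NULL plus the two source words.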

44
DOM Tree Alignment Model
  • Parameter estimation
  • EM based
  • 3 categories of parameters
  • Text translation probability Pr(t^F | t^E)
  • Tag mapping probability Pr(l | l')
  • Node deletion probability p_d
  • Inside-outside algorithm

45
DOM Tree Alignment Model
  • Decoding

46
Sub-outline
  • A New Parallel Data Mining Scheme
  • Profile
  • Candidate site determination
  • DOM Tree Alignment Model
  • Sentence Alignment Model
  • Web Document Pair Verification Model

47
Sentence Alignment Model
  • Sentence Aligner with Tree Alignment Model
  • Text chunks associated with DOM tree nodes are
    aligned
  • For each pair of parallel text chunks
  • A model is used to align parallel sentences
    (Adaptive Parallel Sentences Mining from Web
    Bilingual News Collection. Bing Zhao, et al)
  • IBM Model 1 based
  • Sentence length based
  • Maximum likelihood criterion

48
Sub-outline
  • A New Parallel Data Mining Scheme
  • Profile
  • Candidate site determination
  • DOM Tree Alignment Model
  • Sentence Alignment Model
  • Web Document Pair Verification Model

49
Web Document Pair Verification Model
  • Binary maximum entropy based classifier
  • File length ratio (FLR)
  • HTML tag similarity (HTS)
  • Extract the HTML tags of a given web page and
    concatenate them into a string
  • Compute the minimum edit distance between the two
    tag strings
  • HTS = number of match operations / total number of
    operations
  • Sentence alignment score (SAS)
  • SAS = number of aligned sentences / total number of
    sentences in both files
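
The three features can be sketched as below. Several details are assumptions made for this example: tags are compared as a list rather than a concatenated string, the edit operations are unit-cost match, substitute, insert and delete, and FLR is taken as a plain ratio of the two lengths (the slide does not define it further). The maximum entropy classifier that consumes the features is not shown.

```python
def hts(tags_a, tags_b):
    """HTML tag similarity: matches / total operations on a minimum-cost edit path."""
    n, m = len(tags_a), len(tags_b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = dist[i - 1][j - 1] + (tags_a[i - 1] != tags_b[j - 1])
            dist[i][j] = min(diagonal, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Walk back along one minimum-cost path, counting operations and matches.
    i, j, matches, total = n, m, 0, 0
    while i > 0 or j > 0:
        total += 1
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (tags_a[i - 1] != tags_b[j - 1])):
            matches += tags_a[i - 1] == tags_b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return matches / total if total else 1.0

def sas(aligned_sentences, sentences_a, sentences_b):
    """Sentence alignment score, following the formula on the slide."""
    return aligned_sentences / (sentences_a + sentences_b)

def flr(length_a, length_b):
    """File length ratio (assumed here to be a simple quotient)."""
    return length_a / length_b

similarity = hts(["html", "body", "p", "p"], ["html", "body", "p"])
```

In the example, three tags match and one is deleted, so the similarity is 3 / 4.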

50
Outline
  • Introduction
  • Related Work
  • A New Parallel Data Mining Scheme
  • Experimental Results
  • Conclusion

51
Experimental Results
  • Contrast
  • URL pattern based mining system (following Nie
    1999, Ma 1999, Chen 2004)
  • Host crawling for URL collection
  • Candidate pair identification by pre-defined URL
    pattern matching
  • Candidate pair verification

52
Experimental Results
  • Measure
  • Quality of the mined data
  • Mining coverage
  • Mining efficiency

53
Experimental Results
  • Results
  • Original data: 300,000 URLs of Chinese websites from
    cn.yahoo.com, hk.yahoo.com, tw.yahoo.com
  • 11,000 E-C websites
  • 1,069,423 pairs of E-C parallel sentences

54
Experimental Results
  • Precision of Mined Parallel Documents
  • 3,000 pairs of E-C candidate documents randomly
    selected from the output of each system
  • Reviewed manually

55
Experimental Results
  • Accuracy of sentence alignment model
  • 150 E-C parallel document pairs randomly taken
    from the results
  • Cross-validation by 2 annotators

56
Experimental Results
  • Quality of the mined parallel sentences
  • 2,000 sentence pairs randomly taken from the results
  • 3 categories of quality: (1) exact parallel,
    (2) roughly parallel, (3) not parallel
  • Cross-validation by 2 annotators

57
Experimental Results
  • Mining efficiency
  • Comparison of mining coverage and efficiency
  • 100 E-C bilingual websites

58
Experimental Results
  • Mining results complementarity

59
Outline
  • Introduction
  • Related Work
  • A New Parallel Data Mining Scheme
  • Experimental Results
  • Conclusion

60
Conclusion
  • Mining parallel data from the web is a promising
    method to overcome the knowledge bottleneck of MT
  • The DOM tree alignment model achieves better
    results than URL-pattern-based systems
  • The new mining scheme reduces the bandwidth cost
    and increases mining throughput
  • The new mining scheme is more general and reliable

61
Thank you!