ATLAS Web Crawling for Data

Slides: 15
Provided by: ashwin4

Transcript and Presenter's Notes

1
ATLAS Web Crawling for Data
  • Ashwin Tengli
  • Language Technologies Institute
  • Carnegie Mellon University

2
Outline
  • Event Detection and Tracking: finding news on an
    interesting topic
  • How can Atlas help?
  • Collecting data and using it
  • Future Work

3
Finding News on a Topic
  • Manually monitor news sites
  • Use search engines to find news

Drawbacks
  • Time-consuming; limited number of sites covered
  • Need to know many languages
  • ...

4
Using ATLAS to solve the problem
  • ATLAS will monitor the web for documents on the
    topic of interest
  • New events are detected
  • Look for events in non-English documents too
  • Provide event tracking support

[Diagram labels: Select Event To Track; Event in English Found; Event Detection; Arabic and French Event Found]

5
Training Data for ATLAS
  • Needs topic-specific training data, in addition
    to general data
  • Needs data in multiple languages: parallel text

6
Crawling for Data: Topic-focused data collection
and filtering
  • Finding on-topic web pages
  • Filtering content
  • Using Named Entities to find event descriptions
  • Finding non-English documents

7
Topic-focused data collection and
filtering: Finding web pages
  • Using search engines
  • Traversing the web as a graph (crawling)
    • Follow links
  • Filtering out non-relevant pages using
    • Text Classification
    • Named Entities
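
As a rough illustration of crawling as graph traversal with relevance filtering, here is a minimal Python sketch. It is not the ATLAS crawler: `focused_crawl`, `toy_fetch`, `toy_is_relevant`, and the in-memory `TOY_WEB` are hypothetical names invented for this example, and the keyword test merely stands in for a text classifier or named-entity filter.

```python
from collections import deque

def focused_crawl(seeds, fetch, is_relevant, max_pages=100):
    """Breadth-first crawl that follows links only from on-topic pages.

    fetch(url) -> (text, links) and is_relevant(text) -> bool are
    hypothetical callables supplied by the caller.
    """
    queue = deque(seeds)
    seen = set(seeds)
    on_topic = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        text, links = fetch(url)
        if not is_relevant(text):
            continue  # drop the page and do not expand its links
        on_topic.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return on_topic

# Toy in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "a": ("anthrax outbreak report", ["b", "c"]),
    "b": ("football scores", ["d"]),
    "c": ("anthrax vaccine news", ["d", "a"]),
    "d": ("smallpox and anthrax stockpiles", []),
}

def toy_fetch(url):
    return TOY_WEB[url]

def toy_is_relevant(text):
    # Stand-in for a text classifier or named-entity filter.
    return "anthrax" in text

print(focused_crawl(["a"], toy_fetch, toy_is_relevant))
```

Not expanding links from off-topic pages keeps the crawl focused, at the cost of missing on-topic pages that are reachable only through off-topic ones.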

8
Topic-focused data collection and
filtering: Filtering Content
  • Extracting relevant content from HTML

Relevant Content
  • Use maximum entropy models/Hidden Markov Models
    to pinpoint relevant content
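
The slide proposes maximum entropy models or Hidden Markov Models for this step. The sketch below does not implement either. It substitutes a much simpler link-density heuristic, only to show what "pinpointing relevant content" in HTML can look like. The class `BlockCollector`, the function `relevant_blocks`, both thresholds, and the sample `PAGE` are assumptions made for illustration.

```python
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Collects text blocks and counts how many characters are link text."""
    BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (text, link_chars)
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

def relevant_blocks(html, min_chars=40, max_link_ratio=0.3):
    """Keep blocks that are long enough and are mostly not link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return [text for text, link_chars in collector.blocks
            if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio]

# Made-up page: a navigation bar, one article paragraph, a footer link.
PAGE = """<html><body>
<div><a href="/">Home</a> <a href="/world">World</a> <a href="/sports">Sports</a></div>
<p>Officials reported a new outbreak in the region on Monday, according to local news agencies.</p>
<div><a href="/privacy">Privacy policy and terms of use for this website</a></div>
</body></html>"""

print(relevant_blocks(PAGE))
```

A learned model would replace the two fixed thresholds with features and labels. The heuristic only marks where such a model would plug in.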

9
Topic-focused data collection and filtering: Using
Named Entities
  • Named entities signify relevance
  • Help identify events

[Slide graphic, highlighted named entities: Iraq Biological Warfare Agents Hussein
Kamal 1988 Smallpox Saddam Hussein 8,500 liters Anthrax 15,000 24,000 liters
botulinum toxin]

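
One simple way to use named entities as a relevance signal is to check how many of a topic's known entities appear in a page. The sketch below assumes the entity list is already available and uses plain substring matching. A real system would presumably run a named-entity recognizer instead. `entity_relevance` and the sample sentence are invented for this example.

```python
def entity_relevance(text, topic_entities):
    """Return (fraction of topic entities found in the text, list of hits).

    Matching is a case-insensitive substring test, a deliberately crude
    stand-in for a named-entity recognizer.
    """
    lowered = text.lower()
    hits = [entity for entity in topic_entities if entity.lower() in lowered]
    return len(hits) / len(topic_entities), hits

# Entity names taken from the slide; the sentence is a made-up test input.
TOPIC = ["Iraq", "anthrax", "smallpox", "botulinum toxin"]
score, found = entity_relevance(
    "A report mentions Iraq, anthrax and botulinum toxin.", TOPIC)
print(score, found)
```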
10
Topic-focused data collection and
filtering: Finding non-English documents
  • Non-English web pages carry relevant news
  • Many times they are the news-breakers
  • Need to identify language
    • Automatically
    • Using HTML meta-data
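
A minimal sketch of the HTML meta-data route: read the language a page declares about itself, either in the `lang` attribute of `<html>` or in a `Content-Language` meta tag. `LangSniffer` and `declared_language` are hypothetical names. Declared languages can be missing or wrong, so this would complement automatic identification from the text itself, which is not shown here.

```python
from html.parser import HTMLParser

class LangSniffer(HTMLParser):
    """Records the first language declaration found in the markup."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if self.lang:
            return
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.lang = attrs["lang"]
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            self.lang = attrs.get("content")

def declared_language(html):
    """Return the primary language subtag (e.g. 'fr'), or None if undeclared."""
    sniffer = LangSniffer()
    sniffer.feed(html)
    sniffer.close()
    if not sniffer.lang:
        return None
    # "fr-FR, en" -> "fr": first listed language, primary subtag only.
    return sniffer.lang.split(",")[0].strip().split("-")[0].lower()

print(declared_language('<html lang="ar"><head></head></html>'))
```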

11
Crawling for Data: Parallel Text
  • Parallel Text: web pages in different languages
    whose content is a translation of each other
  • Comparable Text: web pages in different languages
    whose content is on the same topic
  • This data is required for training ATLAS for
    event detection in non-English web pages

12
Parallel Text and Comparable Text
  • Use heuristics to detect parallel and comparable
    text
    • URL format
    • HTML structure similarity
    • Link structure of the website
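
Two of these heuristics can be sketched in a few lines of Python. The URL heuristic groups pages whose addresses differ only in a language code, and the structure heuristic compares the sequence of HTML tags on two pages. Everything here is an assumption for illustration: the language-code list (`en`, `fr`, `ar`), the function names, and the `example.org` URLs. The slides do not say how ATLAS implements either check.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Standalone language codes, e.g. "/en/" or "_fr."; the list is an assumption.
LANG_TOKEN = re.compile(r"(?<![a-z])(en|fr|ar)(?![a-z])")

def url_key(url):
    """Mask a standalone language code in the URL with a placeholder."""
    return LANG_TOKEN.sub("{lang}", url.lower())

def candidate_groups(urls):
    """Group URLs that become identical once the language code is masked."""
    groups = {}
    for url in urls:
        groups.setdefault(url_key(url), []).append(url)
    return [group for group in groups.values() if len(group) > 1]

class TagSequence(HTMLParser):
    """Records the order of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def structure_similarity(html_a, html_b):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    sequences = []
    for html in (html_a, html_b):
        parser = TagSequence()
        parser.feed(html)
        parser.close()
        sequences.append(parser.tags)
    return SequenceMatcher(None, sequences[0], sequences[1]).ratio()

# Made-up inputs for the sketch.
URLS = [
    "http://example.org/en/news/1.html",
    "http://example.org/fr/news/1.html",
    "http://example.org/en/sports/2.html",
    "http://example.org/ar/news/1.html",
]
EN = "<html><body><h1>Title</h1><p>Hello</p></body></html>"
FR = "<html><body><h1>Titre</h1><p>Bonjour</p></body></html>"
OTHER = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"

print(candidate_groups(URLS))
print(structure_similarity(EN, FR), structure_similarity(EN, OTHER))
```

URL matching is cheap and can run first to propose candidates. Structure similarity can then confirm or reject each candidate pair.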

13
Future Work
  • Learn relations among topics on the web
  • Use this to improve topic-focused data collection

[Diagram: graph of related topics, with nodes Topic Q, S, D, X, J, N, A, M, G, B]

14
Conclusions
  • Training data is crucial
  • We need to supplement general training data with
    topic-specific data
  • The Web
    • Good source of multilingual data
    • More realistic data