Information Extraction in Real Estate

Learn more at: https://nlp.stanford.edu

1
Information Extraction in Real Estate
  • Lawrence Wisne - CS224n

2
The Problem
  • We are looking to index real estate listings and
    extract structured data
  • Fields include bedrooms, bathrooms, agent name, etc.
  • The volume is too large for manual extraction
  • Hundreds of thousands of listings
  • Tens of thousands of sites
  • The current process involves writing regular
    expressions to handle each unique batch of sites
  • This results in an error-prone process that is not
    robust

3
Extraction Process
  • Token extraction using page differentiation
  • Reduces the number of false positives in
    classification of all tokens
  • Token transformation
  • Isolates the right answer for classification
  • Apply hand-coded regular expressions to raw HTML
  • Extracts a correct result for each label
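
One way to picture token extraction by page differentiation is to compare the text segments of two listing pages from the same site and keep only the segments that differ, since shared template text such as field titles is unlikely to be a listing value. The sketch below assumes that reading of the slide. The function names and the two sample pages are hypothetical, and splitting on tags with a regular expression is a simplification, not a real HTML parser.

```python
import re

def text_segments(html):
    """Split raw HTML on tags and return the non-empty text segments."""
    parts = re.split(r"<[^>]+>", html)
    cleaned = (p.replace("&nbsp;", " ").strip() for p in parts)
    return [p for p in cleaned if p]

def candidate_tokens(page, other_pages):
    """Keep segments of `page` that appear in no other page of the same site."""
    shared = set()
    for other in other_pages:
        shared.update(text_segments(other))
    return [seg for seg in text_segments(page) if seg not in shared]

# Two hypothetical listing pages that share one template.
page_a = ('<tr><td><span class="field-title">Living Area square feet</span>'
          '&nbsp;<span class="normal">2012</span></td></tr>')
page_b = ('<tr><td><span class="field-title">Living Area square feet</span>'
          '&nbsp;<span class="normal">1540</span></td></tr>')

# Only the value that differs between the pages survives as a candidate.
print(candidate_tokens(page_a, [page_b]))
```

Discarding the shared template text before classification is what keeps the classifier from seeing (and mislabeling) tokens that could never be an answer.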

4
Extraction Process (cont.)
  • Classify tokens using MaxEnt classifier
  • Feature extraction on a token finds relevant
    characteristics
  • Features include the previous and following words
    in the HTML, capitalization, number of words,
    numeric values, etc.
  • Test on held-out data
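
A MaxEnt classifier over binary features is equivalent to multinomial logistic regression, so the classification step can be sketched in a few lines of plain Python. The slides do not say how the classifier was implemented or trained; the training loop (stochastic gradient ascent), the label set, the feature strings, and the toy examples below are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def predict_proba(weights, feats, labels):
    """Softmax over the summed weights of the active features for each label."""
    scores = {lab: sum(weights.get((lab, f), 0.0) for f in feats) for lab in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {lab: math.exp(s - top) for lab, s in scores.items()}
    total = sum(exps.values())
    return {lab: e / total for lab, e in exps.items()}

def train_maxent(examples, labels, epochs=200, lr=0.5):
    """Fit weights[(label, feature)] by gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in examples:
            probs = predict_proba(weights, feats, labels)
            for lab in labels:
                # Gradient: observed indicator minus predicted probability.
                error = (1.0 if lab == gold else 0.0) - probs[lab]
                for f in feats:
                    weights[(lab, f)] += lr * error
    return weights

def classify(weights, feats, labels):
    probs = predict_proba(weights, feats, labels)
    return max(probs, key=probs.get)

# Hypothetical labels and training tokens, each a set of active features.
LABELS = ["square_feet", "bedrooms", "other"]
train = [
    ({"prev=Living Area square feet", "type=integer", "value>1000"}, "square_feet"),
    ({"prev=Square Footage", "type=integer", "value>1000"}, "square_feet"),
    ({"prev=Bedrooms", "type=integer", "value<=10"}, "bedrooms"),
    ({"prev=Beds", "type=integer", "value<=10"}, "bedrooms"),
    ({"prev=Contact", "type=text", "caps=true"}, "other"),
    ({"prev=Listed by", "type=text", "caps=true"}, "other"),
]
weights = train_maxent(train, LABELS)

# A held-out token whose preceding text was never seen in training.
held_out = {"prev=Approx. sq ft", "type=integer", "value>1000"}
print(classify(weights, held_out, LABELS))
```

The held-out token is still classified from its number features alone, which is the kind of robustness that a per-site regular expression cannot offer.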

5
Example
  • A real excerpt of HTML in the data set
  • <tr id="_ctl1__ctl4_trLivingArea">
  • <td><span class="field-title">Living Area square
    feet</span>&nbsp;
  • <span id="_ctl1__ctl4_lblHouseSqft"
    class="normal">2012</span></td>
  • This will differ from other pages in the set
  • Create token for classification with
  • Data=2012, Previous Text1=&nbsp;, Previous
    Text2=Living Area square feet,
    firstLetterCaps=false, numberType=integer,
    numberValue>1000, etc.
  • Find correct result via regular expression
  • For example,
  • square feet = Living Area square feet.?(\d)
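
The example on this slide can be sketched end to end: build the feature set for the candidate token, then find the correct answer in the raw HTML with a hand-written regular expression. The feature names follow the slide, but the helper function is hypothetical, and the pattern below is an assumed variant that explicitly skips tags and entities between the field title and the number; it is not the pattern shown on the slide.

```python
import re

# The HTML excerpt from the slide, joined into one string.
HTML = (
    '<tr id="_ctl1__ctl4_trLivingArea">'
    '<td><span class="field-title">Living Area square feet</span>&nbsp;'
    '<span id="_ctl1__ctl4_lblHouseSqft" class="normal">2012</span></td>'
)

def token_features(data, previous_texts):
    """Build a feature dict for one candidate token (hypothetical helper)."""
    feats = {"Data": data}
    for i, text in enumerate(previous_texts, start=1):
        feats[f"Previous Text{i}"] = text
    feats["firstLetterCaps"] = data[:1].isupper()
    if re.fullmatch(r"\d+", data):
        feats["numberType"] = "integer"
        feats["numberValue>1000"] = int(data) > 1000
    return feats

# Assumed label pattern: after the field title, skip any tags, &nbsp;
# entities, or whitespace, then capture the digits that follow.
SQUARE_FEET = re.compile(r"Living Area square feet(?:<[^>]*>|&nbsp;|\s)*(\d+)")

features = token_features("2012", ["&nbsp;", "Living Area square feet"])
match = SQUARE_FEET.search(HTML)

print(features)
print(match.group(1))
```

Skipping whole tags matters here because the element ids themselves contain digits (for example `_ctl1`), which a looser pattern could capture instead of the value.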

6
Results
  • Two ways of finding results
  • Choose a token which correlates to a correct
    answer, then test classification (like assignment
    2)
  • Choose any token, then test classification
  • Introduces extraneous noise that should be
    thrown out
  • The results focused on the first way
  • 88% accuracy in this case
  • Actual values even higher!
  • Further work remains on the second way;
    preliminary results are promising!
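
The difference between the two evaluation setups can be made concrete with a small accuracy calculation. The labels and predictions below are invented purely for illustration and are not results from the project; the "none" label for noise tokens is also an assumption.

```python
def accuracy(predictions, gold):
    """Fraction of tokens whose predicted label equals the gold label."""
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)

# Hypothetical held-out tokens; "none" marks noise with no correct answer.
gold = ["square_feet", "bedrooms", "none", "agent_name",
        "none", "none", "bathrooms", "none"]
pred = ["square_feet", "bedrooms", "bedrooms", "agent_name",
        "none", "none", "bedrooms", "agent_name"]

# First way: score only tokens that correspond to a correct answer.
labeled = [(p, g) for p, g in zip(pred, gold) if g != "none"]
way1 = accuracy([p for p, _ in labeled], [g for _, g in labeled])

# Second way: score every token, so noise must be recognized and thrown out.
way2 = accuracy(pred, gold)

print(way1, way2)
```

In the second setup a noise token labeled as a real field counts as an error, which is why it is the harder of the two.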