Web Scraping Using Nutch and Solr 1/3

Description:

A short presentation ( part 1 of 3 ) describing the use of the open source tools Nutch and Solr to crawl the web and process the data.


Transcript and Presenter's Notes



1
Web Scraping Using Nutch and Solr
  • A simple example of using open source code
  • Web Scrape a single web site - ours
  • Environment and code
  • Using CentOS 6.2 ( Linux )
  • Apache Nutch 1.6
  • Solr 4.2.1
  • Java 1.6
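A quick way to confirm the environment before starting ( these commands are added here for convenience, they are not from the original slides ):
    java -version             # prints the installed JVM version, should report 1.6.x
    cat /etc/redhat-release   # prints the CentOS release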

2
Nutch and Solr Architecture
  • Nutch processes URLs and feeds their content to Solr
  • Solr indexes content

3
Where to get source code
  • Nutch
  • http://nutch.apache.org
  • Solr
  • http://lucene.apache.org/solr
  • Java
  • http://java.com

4
Installing Source - Nutch
  • Nutch is delivered as
  • apache-nutch-1.6-bin.tar ( 64M )
  • apache-nutch-1.6-src.tar ( 20M )
  • Copy each tar file to your desired location
  • Install each tar file as
  • tar xvf <tar file>
  • Second tar file optional
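A worked example of the install, assuming the tar files were copied to /usr/local ( any directory will do ):
    cd /usr/local
    tar xvf apache-nutch-1.6-bin.tar
    tar xvf apache-nutch-1.6-src.tar    # optional, source code only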

5
Installing Source - Solr
  • Solr is delivered as
  • solr-4.2.1.zip ( 116M )
  • Copy file to your desired location
  • Install the zip file as
  • unzip <zip file>
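Continuing the example, assuming the zip file was also copied to /usr/local:
    cd /usr/local
    unzip solr-4.2.1.zip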

6
Configuring Nutch Part 1
  • Assuming we will crawl a single web site
  • Ensure that JAVA_HOME is set
  • cd apache-nutch-1.6
  • Edit agent name in conf/nutch-site.xml
  • <property>
  • <name>http.agent.name</name>
  • <value>Nutch Spider</value>
  • </property>
  • mkdir -p urls ; cd urls ; touch seed.txt
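JAVA_HOME must point at the JDK install; a typical setting ( the path below is only an example, adjust it to your system ):
    export JAVA_HOME=/usr/java/jdk1.6.0_45
    export PATH=$JAVA_HOME/bin:$PATH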

7
Configuring Nutch Part 2
  • Add the following URL ( ours ) to seed.txt
  • http://www.semtech-solutions.co.nz
  • Change URL filtering in conf/regex-urlfilter.txt,
    change the line
  • # accept anything else
  • +.
  • To be
  • +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/
  • This means that the URLs found will be filtered to
    only those from the local site

8
Configuring Solr Part 1
  • cd solr-4.2.1/example/solr/collection1/conf
  • Add some extra fields to schema.xml after
    _version_ field i.e.
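The field list itself is not reproduced in this transcript; a commonly used set for Nutch 1.6 integration, based on the schema.xml shipped with Nutch ( the names and attributes here are illustrative and may differ from the original slide ):
    <field name="host" type="string" stored="false" indexed="true"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="segment" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>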

9
Start Solr Server Part 1
  • Within solr-4.2.1/example
  • Run the following command
  • java -jar start.jar
  • Now try to access the Solr admin web page
  • http://localhost:8983/solr/admin
  • You should now see the admin web site
  • ( see next page )
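If the page does not load, the server can also be checked from the command line; a minimal check assuming the default example configuration, which ships with a ping handler for collection1:
    curl "http://localhost:8983/solr/collection1/admin/ping"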

10
Start Solr Server Part 2
  • Solr Admin web page

11
Run Nutch / Solr
  • We are ready to crawl our first web site
  • Go to apache-nutch-1.6 directory
  • Run the following commands
  • touch nutch_start.bash
  • chmod 755 nutch_start.bash
  • vi nutch_start.bash
  • Add the following text to the file
  • #!/bin/bash
  • bin/nutch crawl urls -solr http://localhost:8983/solr/ \
  • -dir crawl -depth 3 -topN 3

12
Run Nutch / Solr
  • Now run the nutch bash file
  • ./nutch_start.bash
  • Select the Logging option on the admin console
  • Monitor for errors in Logging console
  • The crawl should finish with no errors and print the
    line
  • crawl finished: crawl
  • in the terminal window where the crawl is running

13
Check Crawled Data
  • Now we check the data that we have crawled
  • In Admin Console window
  • Set Core Selector to collection1
  • Select the Query option
  • Click the Execute Query button
  • You should now see some of the data that you
    have crawled
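The same check can be made over HTTP; a minimal sketch querying the collection1 core used above ( the query parameters are illustrative ):
    curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&wt=json"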

14
Crawled Data
  • Crawled data in solr query

15
Crawled Data
  • That's your first simple crawl completed
  • Further reading at
  • http://nutch.apache.org
  • http://lucene.apache.org/solr
  • Now you can
  • Add more urls to your seed.txt
  • Increase the scope of your link search via the
    options
  • -depth ( how many link levels to follow from the seed URLs )
  • -topN ( the maximum number of pages fetched per level )
  • Modify your URL filtering

16
Contact Us
  • Feel free to contact us at
  • www.semtech-solutions.co.nz
  • info@semtech-solutions.co.nz
  • We offer IT project consultancy
  • We are happy to hear about your problems
  • You can pay for just those hours that you need to
    solve your problems