Python: Web crawling with Scrapy tutorial


Scrapy is an open-source web scraping framework for Python.

In this tutorial, the target website we are going to scrape is:
http://hudoc.echr.coe.int/sites/eng/Pages/search.aspx#{%22documentcollectionid2%22:[%22GRANDCHAMBER%22,%22CHAMBER%22]}

  1. Creating a new Scrapy project
  2. Defining the Items you will extract
  3. Writing a spider to crawl a site and extract Items
  4. Writing an Item Pipeline to store the extracted Items

To start a new Scrapy project, cd into a new directory and type:
 scrapy startproject tutorial #tutorial is the project's name

Running this command creates the following files:
  • scrapy.cfg: the project configuration file
  • tutorial/: the project’s Python module; you’ll later import your code from here.
  • tutorial/items.py: the project’s items file.
  • tutorial/pipelines.py: the project’s pipelines file.
  • tutorial/settings.py: the project’s settings file.
  • tutorial/spiders/: a directory where you’ll later put your spiders.

1. To begin with, the first task is to define our Items. Inside the items.py file:

# Define here the models for your scraped items

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    Title = Field()
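
Scrapy Items behave like dictionaries with a fixed set of allowed keys, so the class above can be used like this (a quick sketch; the field value here is made up, not taken from the site):

from scrapy.item import Item, Field

class TutorialItem(Item):
    Title = Field()

item = TutorialItem()
item['Title'] = "CASE OF EXAMPLE v. EXAMPLE"  # dict-style assignment
print(item['Title'])

Assigning to a key that was not declared as a Field() raises a KeyError, which catches typos early.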


2. Then, inside the spiders/ directory, create a test1.py file:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "HUDOC" #unique identifier
    allowed_domains = ["hudoc.echr.coe.int"] #domain only, no URL scheme
    start_urls = ["http://hudoc.echr.coe.int/sites/eng/Pages/search.aspx#{%22documentcollectionid2%22:[%22CASELAW%22]}"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
       
        for title in titles:
            Title = title.select("a/text()").extract()
            print Title
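
The XPath expression //p selects every paragraph node, and a/text() extracts the text of any anchor inside it. The same selection logic can be tried offline with the standard library on a tiny, well-formed fragment (made-up markup, not the actual HUDOC page, which would need a real HTML parser like Scrapy's):

import xml.etree.ElementTree as ET

# A minimal, well-formed stand-in for a results page
page = "<html><body><p><a>CASE OF A v. B</a></p><p><a>CASE OF C v. D</a></p></body></html>"

root = ET.fromstring(page)
# .//p/a is roughly "//p" followed by selecting each anchor, as the spider does
titles = [a.text for a in root.findall(".//p/a")]
print(titles)  # ['CASE OF A v. B', 'CASE OF C v. D']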



3. To launch the spider, cd to the project's root directory (tutorial) and type:
scrapy crawl HUDOC

where HUDOC is the value of the name variable (the spider's unique identifier) that we set in the test1.py file.
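
Step 4 of the outline (writing an Item Pipeline to store the extracted Items) is not shown above; as a hedged preview, a minimal pipeline that appends each item to a JSON-lines file might look like the sketch below. The class name and output filename are illustrative, not from the original post; the pipeline would be enabled through the ITEM_PIPELINES setting in settings.py.

import json

class JsonWriterPipeline(object):
    """Illustrative pipeline: writes each scraped item as one JSON line."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works because Scrapy Items are dict-like
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # pass the item on to any later pipeline

Scrapy calls process_item once per Item the spider yields, so this is the natural place for storage or cleanup logic.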



[cont.]




This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License