Python: Web crawling with Scrapy tutorial

Scrapy is an open source web scraping framework for Python.

In this tutorial, the target website we are going to scrape is: {%22documentcollectionid2%22:[%22GRANDCHAMBER%22,%22CHAMBER%22]}

The steps involved are:

  1. Creating a new Scrapy project
  2. Defining the Items you will extract
  3. Writing a spider to crawl a site and extract Items
  4. Writing an Item Pipeline to store the extracted Items

To start a new Scrapy project, cd into a new directory and type:
 scrapy startproject tutorial   # tutorial is the project's name

The following files and directories are created:
  • scrapy.cfg: the project configuration file
  • tutorial/: the project’s Python module; you’ll later import your code from here.
  • tutorial/items.py: the project’s items file.
  • tutorial/pipelines.py: the project’s pipelines file.
  • tutorial/settings.py: the project’s settings file.
  • tutorial/spiders/: a directory where you’ll later put your spiders.
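
To double-check that the project was generated as expected, the layout above can be verified with a few lines of Python. This is only a sketch: the helper name missing_entries is made up for illustration, and the three .py file names are assumed from Scrapy's standard project template.

```python
from pathlib import Path

# Paths relative to the project root (the directory that contains scrapy.cfg).
# The items/pipelines/settings file names are assumed from Scrapy's template.
EXPECTED = [
    "scrapy.cfg",
    "tutorial",
    "tutorial/items.py",
    "tutorial/pipelines.py",
    "tutorial/settings.py",
    "tutorial/spiders",
]

def missing_entries(project_root, expected=EXPECTED):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling missing_entries(".") from the project root should give an empty list when everything is in place.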

1. To begin with, the first task is to define our items. Inside the project’s items file:

# Define here the models for your scraped items

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    Title = Field()

2. Then, inside the spiders directory, create a new file for the spider:

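import scrapy

# scrapy.Spider is the current name of the base class formerly called BaseSpider
class MySpider(scrapy.Spider):
    name = "HUDOC"  # unique identifier
    allowed_domains = [""]
    start_urls = ["{%22documentcollectionid2%22:[%22CASELAW%22]}"]

    def parse(self, response):
        # response.xpath() queries the page directly, so no separate
        # HtmlXPathSelector wrapper is needed
        titles = response.xpath("//p")  # every <p> element on the page
        for title in titles:
            # text of the <a> children of this paragraph, as a list of strings
            link_text = title.xpath("a/text()").extract()
            print(link_text)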
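
The spider applies two XPath expressions in sequence: //p picks every paragraph, and a/text() is then evaluated relative to each of those paragraphs. To try that nesting without Scrapy or a network connection, here is a rough stand-in using the standard library's ElementTree on a made-up, well-formed snippet. ElementTree supports only a subset of XPath, so .//p and a take the place of the spider's expressions; the markup, the titles, and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
PAGE = """
<html><body>
  <p><a href="/case/1">First case</a></p>
  <p>No link in this paragraph</p>
  <p><a href="/case/2">Second case</a> <a href="/case/3">Third case</a></p>
</body></html>
"""

def link_texts_per_paragraph(markup):
    """Return one list of link texts for each <p> element, in document order."""
    root = ET.fromstring(markup.strip())
    results = []
    for paragraph in root.findall(".//p"):  # plays the role of //p
        # plays the role of a/text(): text of <a> elements directly under <p>
        results.append([a.text for a in paragraph.findall("a") if a.text])
    return results

for texts in link_texts_per_paragraph(PAGE):
    print(texts)
```

A paragraph with no direct <a> child gives an empty list here, so expect the spider to print some empty results as well if the page has paragraphs without links.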

3. To launch the spider, cd to the project’s root directory, which is tutorial, and type:
scrapy crawl HUDOC

where HUDOC is the value of the name variable, the unique identifier that we set in the spider file.



