Python: Web crawling with Scrapy tutorial


Scrapy is an open-source web scraping framework for Python.

In this tutorial, the target website we are going to scrape is:
http://hudoc.echr.coe.int/sites/eng/Pages/search.aspx#{%22documentcollectionid2%22:[%22GRANDCHAMBER%22,%22CHAMBER%22]}

  1. Creating a new Scrapy project
  2. Defining the Items you will extract
  3. Writing a spider to crawl a site and extract Items
  4. Writing an Item Pipeline to store the extracted Items

To start a new Scrapy project, cd into a new directory and type:
 scrapy startproject tutorial  # "tutorial" is the project's name

Running this creates the following files:
  • scrapy.cfg: the project configuration file
  • tutorial/: the project's Python module; you'll later import your code from here.
  • tutorial/items.py: the project’s items file.
  • tutorial/pipelines.py: the project’s pipelines file.
  • tutorial/settings.py: the project’s settings file.
  • tutorial/spiders/: a directory where you’ll later put your spiders.
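
Put together, the generated project layout looks roughly like this (the exact files can vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py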

1. To begin with, the first task is to define our items. Inside the items.py file:

# Define here the models for your scraped items

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    Title = Field()
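
Items behave like dictionaries. As a quick sanity check (this is not part of the generated files, and the title value below is made up for illustration), you can try the item in a Python shell started from the project's root directory:

from tutorial.items import TutorialItem

item = TutorialItem()
item['Title'] = 'CASE OF EXAMPLE v. EXAMPLE'  # hypothetical value for illustration
print item['Title']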


2. Then, inside the spiders/ directory, create a test1.py file:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "HUDOC" #unique identifier
    allowed_domains  = ["http://hudoc.echr.coe.int"]
    start_urls = ["http://hudoc.echr.coe.int/sites/eng/Pages/search.aspx#{%22documentcollectionid2%22:[%22CASELAW%22]}"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
       
        for title in titles:
            Title = title.select("a/text()").extract()
            print Title
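
To pass the scraped titles on to an Item Pipeline (step 4 above) instead of just printing them, parse should return items rather than print them. Here is a minimal sketch along the same lines, reusing the TutorialItem defined earlier (the spider name HUDOC_items is made up for this variant):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import TutorialItem

class MyItemSpider(BaseSpider):
    name = "HUDOC_items"  # a separate, made-up identifier for this variant
    allowed_domains = ["hudoc.echr.coe.int"]
    start_urls = ["http://hudoc.echr.coe.int/sites/eng/Pages/search.aspx#{%22documentcollectionid2%22:[%22CASELAW%22]}"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # yield one item per link found inside a <p> element
        for link in hxs.select("//p/a"):
            item = TutorialItem()
            item['Title'] = link.select("text()").extract()
            yield item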



3. To launch the spider, cd to the project's root directory (tutorial) and type:
scrapy crawl HUDOC

where HUDOC is the name variable, i.e. the unique identifier we set in the test1.py file.
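
If your spider yields items (as in the sketch after step 2), Scrapy can also write them straight to a file with its built-in feed exports, no pipeline required. For example (older Scrapy versions need the format flag; newer ones infer the format from the file extension):

scrapy crawl HUDOC_items -o titles.json -t json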



[cont.]




This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License