Force my scrapy spider to stop crawling

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider. The 0.14 release notes mention it: "Added CloseSpider exception to manually close spiders (r2691)". Example as per the docs:

    def parse_page(self, response):
        if 'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')

See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
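A fuller sketch of the same idea, assuming a recent Scrapy release where the exception lives in scrapy.exceptions and response.body is bytes (the "Bandwidth exceeded" check is just the docs' example condition):

    from scrapy import Spider
    from scrapy.exceptions import CloseSpider

    class BandwidthAwareSpider(Spider):
        name = "bandwidth_aware"
        start_urls = ["http://example.com"]

        def parse(self, response):
            # Stop the entire crawl as soon as the condition is hit;
            # the string argument becomes the spider's close reason.
            if b"Bandwidth exceeded" in response.body:
                raise CloseSpider("bandwidth_exceeded")
            # ... normal item and link extraction would continue here ...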

crawl site that has infinite scrolling using python

You can use Selenium to scrape infinite-scrolling websites such as Twitter or Facebook.

Step 1: Install Selenium using pip:

    pip install selenium

Step 2: Use the code below to automate the infinite scroll and extract the page source:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import … Read more
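The excerpt cuts off before the scrolling loop. A common pattern (not necessarily the exact code from the original answer) is to scroll with execute_script until the page height stops growing, then hand the page source to your parser; the target URL and pause length below are placeholders:

    import time
    from selenium import webdriver

    SCROLL_PAUSE = 2  # seconds to wait for new content to load; tune as needed

    driver = webdriver.Chrome()  # assumes a Chrome driver is available locally
    driver.get("https://example.com/infinite-feed")  # hypothetical target page

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll to the bottom and give the page time to load more items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUSE)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared, so we have reached the end
        last_height = new_height

    html = driver.page_source  # parse this with lxml, BeautifulSoup, or a Scrapy Selector
    driver.quit()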

how to filter duplicate requests based on url in scrapy

You can write a custom dupe filter for duplicate removal and register it in settings:

    import os
    from scrapy.dupefilter import RFPDupeFilter

    class CustomFilter(RFPDupeFilter):
        """A dupe filter that considers specific ids in the url"""

        def __getid(self, url):
            mm = url.split("&refer")[0]  # or something like that
            return mm

        def request_seen(self, request):
            fp = self.__getid(request.url)
            if fp in self.fingerprints:
                return True
            … Read more
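For reference, a complete filter along those lines might look like the sketch below, assuming a newer Scrapy where the base class lives in scrapy.dupefilters (it was scrapy.dupefilter in old releases); the module path in the settings line is a placeholder:

    from scrapy.dupefilters import RFPDupeFilter

    class CustomURLFilter(RFPDupeFilter):
        """Dupe filter that ignores everything after "&refer" in the URL."""

        def _get_id(self, url):
            # Strip the tracking suffix so URLs differing only in "&refer=..."
            # count as the same request; adjust the split for your site.
            return url.split("&refer")[0]

        def request_seen(self, request):
            fp = self._get_id(request.url)
            if fp in self.fingerprints:
                return True
            self.fingerprints.add(fp)
            if self.file:
                self.file.write(fp + "\n")
            return False

Then point Scrapy at it in settings.py:

    DUPEFILTER_CLASS = "myproject.customfilters.CustomURLFilter"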

Access django models inside of Scrapy

If anyone else is having the same problem, this is how I solved it. I added this to my Scrapy settings.py file:

    def setup_django_env(path):
        import imp, os
        from django.core.management import setup_environ

        f, filename, desc = imp.find_module('settings', [path])
        project = imp.load_module('settings', f, filename, desc)
        setup_environ(project)

    setup_django_env('/path/to/django/project/')

Note: the path above is to your Django project folder, … Read more
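setup_environ was removed in later Django releases, so on a modern stack the same effect is usually achieved with django.setup(); the project path and settings module name below are placeholders:

    # Put this near the top of Scrapy's settings.py (or any entry point that
    # runs before Django models are imported).
    import os
    import sys

    import django

    sys.path.append("/path/to/django/project")  # folder containing manage.py
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    django.setup()  # loads the app registry so models can be imported

    # After this, Django models are usable inside Scrapy pipelines and spiders:
    # from myapp.models import MyModel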

scrapy: Call a function when a spider quits

It looks like you can register a signal listener through dispatcher. I would try something like:

    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher

    class MySpider(CrawlSpider):
        def __init__(self):
            dispatcher.connect(self.spider_closed, signals.spider_closed)

        def spider_closed(self, spider):
            # second param is the instance of the spider about to be closed.

In newer versions of Scrapy, scrapy.xlib.pydispatch is deprecated; instead you … Read more
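In current Scrapy the usual replacement is to connect the handler through the crawler's signal manager in from_crawler; a minimal sketch:

    from scrapy import signals
    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        name = "my_spider"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # Connect via the crawler's signal manager instead of the
            # deprecated scrapy.xlib.pydispatch dispatcher.
            crawler.signals.connect(spider.spider_closed,
                                    signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            # Called once when the spider finishes; do any cleanup here.
            spider.logger.info("Spider closed: %s", spider.name)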

Scrapy image download how to use custom filename

This is just an update of the answer for Scrapy 0.24, where image_key() is deprecated:

    class MyImagesPipeline(ImagesPipeline):

        # Name for the full-size download
        def file_path(self, request, response=None, info=None):
            # item = request.meta['item']  # This way you can use anything from the item, not just the URL.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)

        # Name for the thumbnail
        def thumb_path(self, request, thumb_id, response=None, info=None):
            … Read more
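A self-contained version of that pipeline, written against a recent Scrapy where file_path and thumb_path also accept an item keyword argument (the pipeline module path and IMAGES_STORE value are placeholders):

    from scrapy.pipelines.images import ImagesPipeline

    class MyImagesPipeline(ImagesPipeline):

        def file_path(self, request, response=None, info=None, *, item=None):
            # Name the full-size image after the last segment of its URL
            # instead of the default SHA1 hash.
            image_guid = request.url.split("/")[-1]
            return "full/%s" % image_guid

        def thumb_path(self, request, thumb_id, response=None, info=None, *, item=None):
            image_guid = request.url.split("/")[-1]
            return "thumbs/%s/%s" % (thumb_id, image_guid)

Enable it in settings.py:

    ITEM_PIPELINES = {"myproject.pipelines.MyImagesPipeline": 1}
    IMAGES_STORE = "/path/to/images"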

How to give URL to scrapy for crawling?

I'm not really sure about a command-line option for this. However, you could write your spider like this:

    class MySpider(BaseSpider):
        name = "my_spider"

        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.start_urls = [kwargs.get('start_url')]

And start it like:

    scrapy crawl my_spider -a start_url="http://some_url"
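The same approach on a recent Scrapy release, with the argument taken explicitly as a keyword (the spider name and URL are just examples):

    from scrapy.spiders import Spider  # BaseSpider in very old releases

    class MySpider(Spider):
        name = "my_spider"

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Anything passed with -a on the command line arrives as a kwarg.
            self.start_urls = [start_url] if start_url else []

        def parse(self, response):
            self.logger.info("Crawled %s", response.url)

    # Run with:
    #   scrapy crawl my_spider -a start_url="http://some_url"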