Redis Sitemap Spider #102

Open · mirceachira opened this issue Jun 8, 2017 · 0 comments
mirceachira commented Jun 8, 2017

Hello,

I'm working on a project that needs both the memory efficiency of scrapy-redis and the features of Scrapy's SitemapSpider. I think scrapy-redis should offer a dedicated implementation for this, because sitemap spiders often keep a lot of data in memory.

Specifically (source):

    def _parse_sitemap(self, response):
        if response.url.endswith('/robots.txt'):
            for url in sitemap_urls_from_robots(response.text, base_url=response.url):
                yield Request(url, callback=self._parse_sitemap)
        else:
            body = self._get_sitemap_body(response)
            if body is None:
                logger.warning("Ignoring invalid sitemap: %(response)s",
                               {'response': response}, extra={'spider': self})
                return

            s = Sitemap(body)
            if s.type == 'sitemapindex':
                for loc in iterloc(s, self.sitemap_alternate_links):
                    if any(x.search(loc) for x in self._follow):
                        yield Request(loc, callback=self._parse_sitemap)
            elif s.type == 'urlset':
                for loc in iterloc(s):
                    for r, c in self._cbs:
                        if r.search(loc):
                            yield Request(loc, callback=c)
                            break

In this method a Sitemap object is created for every sitemap body and, as Sitemap.__init__ below shows, an lxml element tree is stored in self._root.

    def __init__(self, xmltext):
        xmlp = lxml.etree.XMLParser(recover=True, remove_comments=True, resolve_entities=False)
        self._root = lxml.etree.fromstring(xmltext, parser=xmlp)
        rt = self._root.tag
        self.type = self._root.tag.split('}', 1)[1] if '}' in rt else rt

Now, this is fine for most sitemaps, which have a dozen to a few hundred URLs. But if you have to deal with huge sitemaps (for example autobidmaster.com/sitemap_index.xml.gz), you'll soon find that even with scrapy-redis the heap keeps growing, because the entire Sitemap object is not garbage collected until all the requests have been yielded and handled: the suspended _parse_sitemap generator holds a reference to s for as long as requests are still being pulled from it.
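
(For context, here is a minimal, scrapy-free toy example of that behaviour: a local variable in a suspended generator frame stays alive until the generator is exhausted or dropped. The FakeSitemap and parse_like_generator names are made up for this illustration only.)

    import gc
    import weakref

    class FakeSitemap(object):
        """Stands in for the Sitemap object built in _parse_sitemap."""
        pass

    def parse_like_generator():
        s = FakeSitemap()        # plays the role of s = Sitemap(body)
        yield weakref.ref(s)     # _parse_sitemap yields Requests here instead
        yield 'another request'

    gen = parse_like_generator()
    ref = next(gen)
    gc.collect()
    print(ref() is None)  # False: the suspended generator frame still references s
    del gen
    gc.collect()
    print(ref() is None)  # True: only once the generator is gone can s be collected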

To make this more concrete, I'll show the heap memory, body size, Sitemap size, s._root size and response URL while crawling autobidmaster.com/robots.txt. I made a simple Scrapy project that uses scrapy-redis; here's the spider:

import os
import sys
import logging

from scrapy.spiders import SitemapSpider
from scrapy.spiders.sitemap import regex, iterloc
from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots
from scrapy.http import Request, XmlResponse


logger = logging.getLogger(__name__)

class TestSitemapSpider(SitemapSpider):

    name = 'test_sitemap_spider'

    sitemap_urls = [
        'http://www.autobidmaster.com/robots.txt?',
    ]

    def parse(self, response):
        # Read VmSize from /proc/<pid>/status and report it in MB (Linux only)
        _proc_status = '/proc/%d/status' % os.getpid()
        _scale = {'kB': 1024.0, 'mB': 1024.0 * 1024.0, 'KB': 1024.0, 'MB': 1024.0 * 1024.0}
        with open(_proc_status) as f:
            v = f.read()
        i = v.index('VmSize:')
        v = v[i:].split(None, 3)
        heap_size = float(v[1]) * _scale[v[2]] / 1000 / 1000

        print(heap_size, response.url)

    def _parse_sitemap(self, response):
        if response.url.endswith('/robots.txt'):
            for url in sitemap_urls_from_robots(response.text, base_url=response.url):
                yield Request(url, callback=self._parse_sitemap)
        else:
            body = self._get_sitemap_body(response)
            if body is None:
                logger.warning("Ignoring invalid sitemap: %(response)s",
                               {'response': response}, extra={'spider': self})
                return

            s = Sitemap(body)

            # Get heap memory: VmSize from /proc/<pid>/status, in MB (Linux only)
            _proc_status = '/proc/%d/status' % os.getpid()
            _scale = {'kB': 1024.0, 'mB': 1024.0 * 1024.0, 'KB': 1024.0, 'MB': 1024.0 * 1024.0}
            with open(_proc_status) as f:
                v = f.read()
            i = v.index('VmSize:')
            v = v[i:].split(None, 3)
            heap_size = float(v[1]) * _scale[v[2]] / 1000 / 1000

            # Get _root memory
            root_memory = 0
            for child in s._root.getchildren():
                for childs_child in child.getchildren():
                    root_memory += sys.getsizeof(childs_child)

            print(heap_size, sys.getsizeof(body), sys.getsizeof(s), root_memory, response.url)

            if s.type == 'sitemapindex':
                for loc in iterloc(s, self.sitemap_alternate_links):
                    if any(x.search(loc) for x in self._follow):
                        yield Request(loc, callback=self._parse_sitemap)
            elif s.type == 'urlset':
                for loc in iterloc(s):
                    for r, c in self._cbs:
                        if r.search(loc):
                            yield Request(loc, callback=c)
                            break

And here's the output (the relevant parts, with the heap size in MB and the other sizes in bytes):

(heap, body, s, s._root, response.url)
(208.36352, 1875, 64, 1440, 'http://www.autobidmaster.com/blog/sitemap-pt-post-2016-12.xml')
(208.36352, 2323, 64, 2016, 'http://www.autobidmaster.com/blog/sitemap-pt-post-2017-01.xml')
(208.36352, 2079, 64, 1728, 'http://www.autobidmaster.com/blog/sitemap-pt-post-2017-02.xml')
Here it reaches the huge sitemaps
(258.093056, 5008929, 64, 5808096, 'http://www.autobidmaster.com/sitemap_7.xml.gz')
(341.95046399999995, 8693305, 64, 10080000, 'http://www.autobidmaster.com/sitemap_2.xml.gz')
(444.493824, 8693814, 64, 10080000, 'http://www.autobidmaster.com/sitemap_3.xml.gz')
(529.514496, 8622267, 64, 10080000, 'http://www.autobidmaster.com/sitemap_1.xml.gz')
(615.108608, 8694015, 64, 10080000, 'http://www.autobidmaster.com/sitemap_4.xml.gz')
(701.52192, 8679427, 64, 10080000, 'http://www.autobidmaster.com/sitemap_5.xml.gz')
(786.2722560000001, 8688847, 64, 10080000, 'http://www.autobidmaster.com/sitemap_6.xml.gz')
Here it starts calling parse
(788.893696, 'http://www.autobidmaster.com/blog/2011/10/used-boat-parts/')
(788.893696, 'http://www.autobidmaster.com/blog/2011/10/used-truck-parts/')
(788.893696, 'http://www.autobidmaster.com/blog/2011/10/used-honda-parts/')

As you can see, the memory is not freed while scraping. I think this can be solved elegantly by storing the URLs in redis until all the sitemaps have been scraped, and only then starting on those URLs, which releases 's' from memory and frees quite a lot of heap.

This approach worked in my project and I think scrapy-redis should include an implementation of it. I'll make a branch as soon as possible, but I wanted to show you where the issue is first and get some feedback, since I have a few questions:

If I implement this as a RedisSitemapSpider (similar to the existing RedisSpider and RedisCrawlSpider), what would be the best way to store the requests in redis? I think a new key, "<spidername>:sitemap_urls", should be added, and the URLs extracted from the sitemaps should be stored either in "<spidername>:start_urls" or directly in "<spidername>:requests".
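
For illustration, seeding and filling those keys with plain redis-py could look something like this (the key names are only the convention I'm proposing here, nothing that exists in scrapy-redis yet):

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    # proposed new key: the sitemap entry points the spider should start from
    r.lpush('test_sitemap_spider:sitemap_urls',
            'http://www.autobidmaster.com/robots.txt')

    # urls extracted from each sitemap would then go into a key that
    # scrapy-redis already consumes, e.g. <spidername>:start_urls
    r.lpush('test_sitemap_spider:start_urls',
            'http://www.autobidmaster.com/blog/2011/10/used-boat-parts/')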

In other words, the RedisSitemapSpider would take a sitemap URL, extract all the site links and push them into redis under some key, then move on to the next sitemap and repeat the process, without keeping the previous sitemap in memory.
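
Here is a rough sketch of the idea, not working code: the extracted_urls_key name is a placeholder, the sitemap_rules/sitemap_follow filtering is left out, and wiring start_requests so that the sitemap URLs themselves are read from redis is exactly the open question above. It assumes self.server, the redis connection scrapy-redis sets up on its spiders.

    from scrapy.http import Request
    from scrapy.spiders import SitemapSpider
    from scrapy.spiders.sitemap import iterloc
    from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots
    from scrapy_redis.spiders import RedisSpider


    class RedisSitemapSpider(RedisSpider, SitemapSpider):

        name = 'redis_sitemap_spider'
        # placeholder key name; the naming is part of what I'm asking about
        extracted_urls_key = '%(name)s:start_urls'

        def _parse_sitemap(self, response):
            if response.url.endswith('/robots.txt'):
                for url in sitemap_urls_from_robots(response.text, base_url=response.url):
                    yield Request(url, callback=self._parse_sitemap)
            else:
                body = self._get_sitemap_body(response)
                if body is None:
                    return

                s = Sitemap(body)
                if s.type == 'sitemapindex':
                    # sitemap indexes are still followed as requests (they are small)
                    for loc in iterloc(s, self.sitemap_alternate_links):
                        yield Request(loc, callback=self._parse_sitemap)
                elif s.type == 'urlset':
                    # instead of yielding one Request per loc (which keeps s alive
                    # until the whole generator is drained), push the urls to redis
                    # and let them be scheduled later
                    key = self.extracted_urls_key % {'name': self.name}
                    for loc in iterloc(s):
                        self.server.lpush(key, loc)
                # once this branch finishes nothing references s anymore,
                # so the parsed tree can be garbage collected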

rmax self-assigned this Jun 8, 2017