
Bug in get_filename loses images #126

Open
ed2050 opened this issue May 13, 2024 · 1 comment

Comments


ed2050 commented May 13, 2024

I ran into a bug in get_filename that loses some images. The core problem seems to be that the download (dl) counter is not threadsafe. It should be a simple fix: add a thread lock around the counter.

Issue

Here's the observed behavior. I ran the crawler asking for 50 google images:

    crawler = icrawler.builtin.GoogleImageCrawler(downloader_threads=4, storage=storage)
    crawler.crawl(keyword='happy face', max_num=50)

The storage dir ends up with 48 images, and some numbers are missing: there's no 000003.jpg and no 000007.jpg.

Diagnosing

So I modified ImageDownloader.get_filename to save all url-filename pairs in a threadsafe dict. That gives the dict shown below under Output. There are several name collisions: four 1's, two 4's, and no 3's or 7's. The other numbers seem to be present.

ImageDownloader.get_filename uses an index number to assign a filename to each url. Index is generated with this line:

file_idx = self.fetched_num + self.file_idx_offset

file_idx_offset should be 0, since I didn't pass a value and the storage dir is empty.
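For reference, here's a minimal sketch of how an index would turn into one of the zero-padded names seen in the output below. The helper name and format string are my own guesses from the 000001.jpg pattern, not copied from icrawler:

```python
def make_filename(fetched_num, file_idx_offset=0, ext="jpg"):
    # hypothetical stand-in for get_filename's naming logic:
    # index -> six-digit zero-padded name, matching the 000001.jpg pattern
    file_idx = fetched_num + file_idx_offset
    return f"{file_idx:06d}.{ext}"

print(make_filename(1))   # 000001.jpg
print(make_filename(50))  # 000050.jpg
```

So any two threads that compute the same fetched_num produce the same filename, and one download overwrites the other.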

I didn't trace where self.fetched_num is generated, but it appears not to be threadsafe. It looks like a race condition on the counter that generates fetched_num, perhaps combined with an initialization error (four threads, four 1's: it seems like every thread starts at 1).
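To illustrate the suspected race with a hand-rolled sketch (this is not icrawler code): an unsynchronized read-modify-write on a shared counter lets every thread read the same stale value, which would explain four threads all producing index 1. The sleep just widens the race window so the lost updates are reproducible:

```python
import threading
import time

counter = 0  # shared counter with no lock (stand-in for the suspected fetched_num)

def unsafe_increment():
    global counter
    value = counter       # all four threads read 0 here...
    time.sleep(0.1)       # ...before any of them gets to write
    counter = value + 1   # each writes back a stale 0 + 1

threads = [threading.Thread(target=unsafe_increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # typically 1, not 4: three increments were lost
```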

Output

Here's the dict logged from modified get_filename.

{
    "https://wikimedia.com/contents/1909694/image.jpg": "000001.jpg",
    "https://wikimedia.com/contents/1774931/image.jpg": "000001.jpg",
    "https://wikimedia.com/1/1/image.jpg": "000001.jpg",
    "https://wikimedia.com/contents/1734139/image.jpg": "000001.jpg",
    "https://wikimedia.com/media/6765494/8.jpg": "000002.jpg",
    "https://wikimedia.com/10649496/1/0/image.jpg": "000004.jpg",
    "https://wikimedia.tv/contents/10738072/image.jpg": "000004.jpg",
    "https://wikimedia.com/260/8.jpg": "000005.jpg",
    "https://wikimedia.com/contents/9317509/image.jpg": "000006.jpg",
    "https://wikimedia.com/media/image.jpg": "000008.jpg",
    ...
    "https://www.wikimedia.com/contents/684404/image.jpg": "000049.jpg",
    "https://v.wikimedia.com/contents/99637/image.jpg": "000050.jpg",
}

Hope this helps.

@ed2050
Copy link
Author

ed2050 commented May 13, 2024

I found where fetched_num is set, and it does use a lock. I think the problem may be that a new lock instance is created for each threadpool object. See thread_pool.py:

    def __init__(self, thread_num, in_queue=None, out_queue=None, name=None):
        self.thread_num = thread_num
        self.in_queue = in_queue if in_queue else CachedQueue(5 * self.thread_num)
        self.out_queue = out_queue if out_queue else CachedQueue(5 * self.thread_num)
        self.name = name if name else __name__
        self.workers = []
        self.lock = Lock()  # <--- HERE

So each threadpool instance has its own separate lock object. Downloader subclasses ThreadPool and inherits this lock. That could be the problem if each thread gets a separate ThreadPool object. "Pool" usually implies one manager that handles all the threads, but maybe that's not how it works here. Every downloader thread needs to acquire the same lock object for the locking to be effective.
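The key point is that two Lock instances don't exclude each other: a lock only serializes threads that acquire the same object. A quick sketch (my own toy class, not icrawler's):

```python
import threading

class Pool:
    def __init__(self):
        self.lock = threading.Lock()  # per-instance lock, as in thread_pool.py

a = Pool()
b = Pool()

a.lock.acquire()
got = b.lock.acquire(blocking=False)  # b.lock is a different Lock object
print(got)  # True: holding a.lock does not block b.lock at all
b.lock.release()
a.lock.release()
```

If each downloader thread belongs to a different instance, everyone "holds the lock" simultaneously and the counter is effectively unprotected.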

I haven't traced through how Downloader objects are made. Is there only one Downloader object per crawl(), with multiple threads accessing that one Downloader? It doesn't seem so, because that would cause problems with other instance vars. But perhaps.

If one call to crawl() does spawn multiple Downloader objects (one per thread), then self.lock needs to be a shared object. For instance (the simple, dumb way): make a global var dl_lock = Lock() and have Downloader objects use that. If that's not fine-grained enough, you could create a new Lock() in each crawl() call and pass it to the Downloader objects in their constructors. Or whatever.
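As a sketch of the shared-object idea (all names here are hypothetical, not icrawler's actual API): one module-level lock guarding one shared counter gives every downloader instance, and hence every thread, the same mutual exclusion:

```python
import threading

dl_lock = threading.Lock()  # one lock shared by all Downloader instances

class Downloader:
    counter = 0  # class-level counter, shared across instances

    def next_index(self):
        with dl_lock:  # every instance serializes on the same Lock object
            Downloader.counter += 1
            return Downloader.counter

results = []

def work(dl):
    for _ in range(10):
        results.append(dl.next_index())

# four worker threads, each with its own Downloader instance
threads = [threading.Thread(target=work, args=(Downloader(),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results) == list(range(1, 41)))  # True: no duplicates, no gaps
```

With a per-instance lock in place of dl_lock, the same run can produce duplicate indices, which is exactly the collision pattern in the dict above.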

I haven't traced through how thread spawning works in icrawler so maybe I'm wrong. But that looks like it could be the problem.
