
Bug in get_filename loses images #126

Open
ed2050 opened this issue May 13, 2024 · 1 comment

Comments


ed2050 commented May 13, 2024

I ran into a bug in get_filename that loses some images. The core problem seems to be that the download (dl) counter is not threadsafe. It should be a simple fix: add a thread lock around the counter.

Issue

Here's the observed behavior. I ran the crawler asking for 50 google images:

    crawler = icrawler.builtin.GoogleImageCrawler(downloader_threads=4, storage=storage)
    crawler.crawl(keyword='happy face', max_num=50)

The storage dir ends up with 48 images, and some numbers are missing: there's no 000003.jpg and no 000007.jpg.

Diagnosing

So I modified ImageDownloader.get_filename to save all url-filename pairs in a threadsafe dict. That gives the dict shown below under Output. There are several name collisions: four 1's, two 4's, and no 3's or 7's. The other numbers seem to be present.

ImageDownloader.get_filename uses an index number to assign a filename to each url. Index is generated with this line:

file_idx = self.fetched_num + self.file_idx_offset

file_idx_offset should be 0, since I didn't pass a value and the storage dir is empty.
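For reference, here's a minimal sketch of how an index would turn into one of the zero-padded names seen in the output below. The helper name and format string are my own guesses from the 000001.jpg pattern, not copied from icrawler:

```python
def make_filename(fetched_num, file_idx_offset=0, ext="jpg"):
    # hypothetical stand-in for get_filename's naming logic:
    # index -> six-digit zero-padded name, matching the 000001.jpg pattern
    file_idx = fetched_num + file_idx_offset
    return f"{file_idx:06d}.{ext}"

print(make_filename(1))   # 000001.jpg
print(make_filename(50))  # 000050.jpg
```

So any two threads that compute the same fetched_num produce the same filename, and one download overwrites the other.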

I didn't trace where self.fetched_num is generated, but it appears not to be threadsafe. It looks like a race condition on the counter that generates fetched_num, perhaps combined with an initialization error (four threads, four 1's: it seems like every thread starts at 1).
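To illustrate the suspected race with a hand-rolled sketch (this is not icrawler code): an unsynchronized read-modify-write on a shared counter lets every thread read the same stale value, which would explain four threads all producing index 1. The sleep just widens the race window so the lost updates are reproducible:

```python
import threading
import time

counter = 0  # shared counter with no lock (stand-in for the suspected fetched_num)

def unsafe_increment():
    global counter
    value = counter       # all four threads read 0 here...
    time.sleep(0.1)       # ...before any of them gets to write
    counter = value + 1   # each writes back a stale 0 + 1

threads = [threading.Thread(target=unsafe_increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # typically 1, not 4: three increments were lost
```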

Output

Here's the dict logged from modified get_filename.

{
    "https://wikimedia.com/contents/1909694/image.jpg": "000001.jpg",
    "https://wikimedia.com/contents/1774931/image.jpg": "000001.jpg",
    "https://wikimedia.com/1/1/image.jpg": "000001.jpg",
    "https://wikimedia.com/contents/1734139/image.jpg": "000001.jpg",
    "https://wikimedia.com/media/6765494/8.jpg": "000002.jpg",
    "https://wikimedia.com/10649496/1/0/image.jpg": "000004.jpg",
    "https://wikimedia.tv/contents/10738072/image.jpg": "000004.jpg",
    "https://wikimedia.com/260/8.jpg": "000005.jpg",
    "https://wikimedia.com/contents/9317509/image.jpg": "000006.jpg",
    "https://wikimedia.com/media/image.jpg": "000008.jpg",
    ...
    "https://www.wikimedia.com/contents/684404/image.jpg": "000049.jpg",
    "https://v.wikimedia.com/contents/99637/image.jpg": "000050.jpg",
}

Hope this helps.

@ed2050
Copy link
Author

ed2050 commented May 13, 2024

I found where fetched_num is set, and it does use a lock. I think the problem may be that a new lock instance is created for each threadpool object. See thread_pool.py:

    def __init__(self, thread_num, in_queue=None, out_queue=None, name=None):
        self.thread_num = thread_num
        self.in_queue = in_queue if in_queue else CachedQueue(5 * self.thread_num)
        self.out_queue = out_queue if out_queue else CachedQueue(5 * self.thread_num)
        self.name = name if name else __name__
        self.workers = []
        self.lock = Lock()  # <--- HERE

So each threadpool instance has its own separate lock object. Downloader subclasses ThreadPool and inherits this lock. That could be the problem if each thread gets a separate ThreadPool object. "Pool" usually implies one manager that handles all the threads, but maybe that's not how it works here. Every downloader thread needs to acquire the same lock object for the locking to be effective.
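The key point is that two Lock instances don't exclude each other: a lock only serializes threads that acquire the same object. A quick sketch (my own toy class, not icrawler's):

```python
import threading

class Pool:
    def __init__(self):
        self.lock = threading.Lock()  # per-instance lock, as in thread_pool.py

a = Pool()
b = Pool()

a.lock.acquire()
got = b.lock.acquire(blocking=False)  # b.lock is a different Lock object
print(got)  # True: holding a.lock does not block b.lock at all
b.lock.release()
a.lock.release()
```

If each downloader thread belongs to a different instance, everyone "holds the lock" simultaneously and the counter is effectively unprotected.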

I haven't traced through how Downloader objects are made. Is there only one Downloader object per crawl(), with multiple threads accessing that one Downloader? It doesn't seem so, because that would cause problems with other instance vars. But perhaps.

If one call to crawl() does spawn multiple Downloader objects (one per thread), then self.lock needs to be a shared object. For instance (the simple, dumb way): make a global var dl_lock = Lock() and have Downloader objects use that. If that's not fine-grained enough, you could create a new Lock() in each crawl() call and pass it to the Downloader objects in their constructors. Or whatever.
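As a sketch of the shared-object idea (all names here are hypothetical, not icrawler's actual API): one module-level lock guarding one shared counter gives every downloader instance, and hence every thread, the same mutual exclusion:

```python
import threading

dl_lock = threading.Lock()  # one lock shared by all Downloader instances

class Downloader:
    counter = 0  # class-level counter, shared across instances

    def next_index(self):
        with dl_lock:  # every instance serializes on the same Lock object
            Downloader.counter += 1
            return Downloader.counter

results = []

def work(dl):
    for _ in range(10):
        results.append(dl.next_index())

# four worker threads, each with its own Downloader instance
threads = [threading.Thread(target=work, args=(Downloader(),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results) == list(range(1, 41)))  # True: no duplicates, no gaps
```

With a per-instance lock in place of dl_lock, the same run can produce duplicate indices, which is exactly the collision pattern in the dict above.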

I haven't traced through how thread spawning works in icrawler so maybe I'm wrong. But that looks like it could be the problem.
