Replace pySmartDL and ThreadPoolExecutor with aiohttp #279

spwoodcock · 2024-07-30T16:09:39Z

Is your feature request related to a problem? Please describe.

pySmartDL is very verbose.
Tiles are kb scale, so we don't really care about download progress.
Instead we just need to know how many tile downloads failed at the end.
Using aiohttp would be cleaner than ThreadPoolExecutor.

Describe the solution you'd like

Replace pySmartDL and ThreadPoolExecutor with an aiohttp implementation for tile downloads.

The text was updated successfully, but these errors were encountered:

joemaren · 2024-08-09T08:26:39Z

Will you require aiohttp to run multiple threads like pySmartDL and ThreadPoolExecutor or just asynchronously?

spwoodcock · 2024-08-09T10:09:44Z

Just asynchronously using the thread pool I think 👍

aiohttp has a limit for the connection pool size for downloads, with the default being 100

RonaldRonnie · 2024-08-21T20:43:32Z

Does it mean that the focus is not to relying on multi-threading??? also, I think the slicing logic that splits xforms into chunks based on the number of CPU cores can remain , having looked at the code. Right? Also i would like to work on this given the procedures to test results after. I request to be assigned to this @spwoodcock .

rsavoye · 2024-08-21T21:00:49Z

Yes, the number of threads is based on the number of cores. I think you'd want to keep the thread pool even if you use aiohttp. For FMTM you're downloading small amounts of tiles so don't see the performance problems. I'm commonly downloading imagery for a huge area. And often the server with the imagery is bandwidth limited, so the more concurrent threads the better. I used SmartDL cause it was easy, but I don't really see a problem with it. When downloading many, many tiles, but running from the command line, I do like to monitor the progress, which may run for hours... sometimes a few days, and it's the only way I can make sure it's still working. I don't have any issues if you want to replace it with aiohttp, most of what it is doing is network access. But please keep the thread pool unless aiohttp has something similar. A quick check, it appears to be single threaded, so I think more discussion is needed before starting on anything.

spwoodcock · 2024-08-22T07:25:22Z

When it comes to IO-bound (network) tasks like downloading files, asyncio is much more efficient than threading.

We can't download all the files at once, as this chokes the network, so the OS manages connections for us.

The download process is the same between the two but threading creates a blocking socket (that will sit idle until the OS returns a response), while asyncio creates a non-blocking socket.

For asyncio, when a coroutine initiates some IO, the given socket is added to the selector. Then when that selector notices the socket has some actionable change, the coroutine waiting for it is continued.

Asyncio is there to basically minimise this unnecessary idle time and only action tasks when the OS sees they are ready to be completed. It does all of this on a single thread and is more resource efficient.

It's a minor benefit when running standalone, but when running on an ASGI webserver like FastAPI, as we can pass this to a single background worker thread to run until complete (instead of using up threads that could be used by other async tasks).

spwoodcock · 2024-08-22T07:28:38Z

The logic to slice up the tasks and get the CPU cores can be removed entirely. This can be handled by the aoihttp connection pool automatically

joemaren · 2024-08-22T08:30:33Z

I feel the same way. Keeping thread pool defeats the purpose of using aiohttp. It creates more complexity. aiohttp also manages multiple connections with a pool size which defaults to 100.

RonaldRonnie · 2024-08-22T08:54:46Z

I believe transitioning to aiohttp without the thread pool is the better approach due to its simplicity and efficiency. It aligns well with modern async practices, especially in environments where async tasks need to be managed effectively. However, I recognize that long-term monitoring and handling bandwidth limitations might require additional considerations. These could include more extensive testing and fine-tuning aiohttp’s connection pool settings to meet those specific needs.

That said, I also acknowledge @rsavoye’s valid points regarding large-scale downloads. In cases where monitoring progress over extended periods is essential, or where maximizing bandwidth on limited servers is a concern, using multiple threads may still offer advantages. However, these edge cases can often be addressed by tuning aiohttp settings (e.g., adjusting connection pool size) and implementing custom progress logging.

In summary, with proper configuration and some minor custom additions, aiohttp can be optimized to handle even more demanding scenarios without needing to revert to threading.

joemaren · 2024-08-22T09:34:37Z

I think @spwoodcock mentioned that the tiles to be downloaded are in kilobytes, so I don't see the concern for large scale downloads. Failed downloads can be retried. Yes, custom progress logging can still be implemented

rsavoye · 2024-08-22T12:36:29Z

If aiohttp has support for it's own pool, I don't have a problem with the change. I just want to have multiple network connections to the remote server so I can download multiple tiles in parallel since I'm often downloading many, many thousands of tiles and don't want it to take forever... I agree that for python, threading is more for computational tasks (like conflation), and asyncio is more for I/O bound tasks. Also I still like the verbosity when running in a terminal, so that should be optional.

rsavoye · 2024-08-22T13:03:56Z

Just as a note, I wrote basemapperr originally to make large basemaps of imagery and topo maps for OsmAnd that cover an entire state, like all of Colorado. I use this for navigating in rural areas where often the OSM data isn't very good.

spwoodcock · 2024-08-22T14:22:34Z

There is no limit to the size of the area we attempt to download with either approach.

Updating to aoihttp simply makes the downloads more resource efficient.

It may even be marginally faster.

rsavoye · 2024-08-22T15:56:11Z

I can test it with a large AOI when there is a PR to review.

RonaldRonnie · 2024-08-25T13:58:47Z

Can i go ahead and work on this then we do testing after the PR

spwoodcock · 2024-08-25T18:06:53Z

For sure! Go ahead 😁🙏

RonaldRonnie · 2024-08-27T05:10:19Z

Am I allowed to use the Caching and Resume Support.

spwoodcock · 2024-08-27T07:32:50Z

I'm not sure Resume functionality will provide much benefit as the files are so small.

Caching isn't actually necessary either, as the files on disk are essentially our cache.

The flow should be like this:

Check check for existing directory.
If present, then check the subdirs for required files recursively.
Create a list of missing files / files to download.
Use the list for the aoihttp pooled downloading.

RonaldRonnie · 2024-08-28T06:09:33Z

Hi @spwoodcock , is it okay to implement asyncio.gather to run multiple asycnronous download tasks concurrrent??

joemaren · 2024-08-28T07:34:13Z

You don't need to use asyncio.gather()
Use asyncio.create_task() to create a task object and append to it.

spwoodcock · 2024-08-28T08:41:06Z

async def aiohttp_get(session: aiohttp.ClientSession, url: str):
    async with session.request("GET", url) as res:
        data = await res.read()
        return data

async with aiohttp.ClientSession() as session:
    await asyncio.gather(*(aiohttp_get(session, url) for url in urls))

I haven't tested the code above, but hopefully it gives an idea how to solve.

Using asyncio.gather is good 👍

Be sure to create the ClientSession once, outside of the async function that is pooled using gather (so the same connection pool is reused and not recreated each download)

spwoodcock added the enhancement New feature or request label Jul 30, 2024

spwoodcock added this to Field Mapping Tasking Manager (FMTM) Jul 30, 2024

github-project-automation bot moved this to Tasks in Field Mapping Tasking Manager (FMTM) Jul 30, 2024

spwoodcock assigned RonaldRonnie Aug 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace pySmartDL and ThreadPoolExecutor with aiohttp #279

Replace pySmartDL and ThreadPoolExecutor with aiohttp #279

spwoodcock commented Jul 30, 2024

joemaren commented Aug 9, 2024 •

edited

Loading

spwoodcock commented Aug 9, 2024

RonaldRonnie commented Aug 21, 2024 •

edited

Loading

rsavoye commented Aug 21, 2024

spwoodcock commented Aug 22, 2024 •

edited

Loading

spwoodcock commented Aug 22, 2024

joemaren commented Aug 22, 2024

RonaldRonnie commented Aug 22, 2024 •

edited

Loading

joemaren commented Aug 22, 2024

rsavoye commented Aug 22, 2024

rsavoye commented Aug 22, 2024

spwoodcock commented Aug 22, 2024

rsavoye commented Aug 22, 2024

RonaldRonnie commented Aug 25, 2024

spwoodcock commented Aug 25, 2024

RonaldRonnie commented Aug 27, 2024

spwoodcock commented Aug 27, 2024

RonaldRonnie commented Aug 28, 2024

joemaren commented Aug 28, 2024

spwoodcock commented Aug 28, 2024 •

edited

Loading

Replace pySmartDL and ThreadPoolExecutor with aiohttp #279

Replace pySmartDL and ThreadPoolExecutor with aiohttp #279

Comments

spwoodcock commented Jul 30, 2024

joemaren commented Aug 9, 2024 • edited Loading

spwoodcock commented Aug 9, 2024

RonaldRonnie commented Aug 21, 2024 • edited Loading

rsavoye commented Aug 21, 2024

spwoodcock commented Aug 22, 2024 • edited Loading

spwoodcock commented Aug 22, 2024

joemaren commented Aug 22, 2024

RonaldRonnie commented Aug 22, 2024 • edited Loading

joemaren commented Aug 22, 2024

rsavoye commented Aug 22, 2024

rsavoye commented Aug 22, 2024

spwoodcock commented Aug 22, 2024

rsavoye commented Aug 22, 2024

RonaldRonnie commented Aug 25, 2024

spwoodcock commented Aug 25, 2024

RonaldRonnie commented Aug 27, 2024

spwoodcock commented Aug 27, 2024

RonaldRonnie commented Aug 28, 2024

joemaren commented Aug 28, 2024

spwoodcock commented Aug 28, 2024 • edited Loading

joemaren commented Aug 9, 2024 •

edited

Loading

RonaldRonnie commented Aug 21, 2024 •

edited

Loading

spwoodcock commented Aug 22, 2024 •

edited

Loading

RonaldRonnie commented Aug 22, 2024 •

edited

Loading

spwoodcock commented Aug 28, 2024 •

edited

Loading