Description
Hi there,
I've been working with this library for a while, and I'm using it to query an endpoint that is capable of receiving multiple queries at once. The API returns responses in the same order as requested.
I've found that this feature is implemented by other libraries under the name of batching, and I've successfully implemented it in Python too. My question is: have you considered this already, and are you accepting PRs for this?
I'd be glad to contribute.
leszekhanusz commented on Sep 1, 2023
Something like transport-level batching? Yes, that could be useful.
You would probably need to create `execute_batch`, `execute_batch_sync` and `execute_batch_async` methods in client.py which have the same arguments as the `execute` methods, but taking as arguments a list of `documents`, `variable_values` and `operation_names` instead of a single element, and returning a list of answers.

Any help is greatly appreciated. Please check the CONTRIBUTING.md file. The PR needs to have 100% test code coverage to be accepted.
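A rough sketch of what those signatures might look like (an illustration of the suggestion above only, not gql's actual API; the typing details are assumptions):

```python
from typing import Any, Dict, List, Optional

from graphql import DocumentNode


class Client:  # illustrative stub, not the real gql.Client
    # Hypothetical batch counterparts to execute_sync()/execute_async():
    # same parameters, but each becomes a list, and a list of results comes back.

    def execute_batch_sync(
        self,
        documents: List[DocumentNode],
        variable_values: Optional[List[Optional[Dict[str, Any]]]] = None,
        operation_names: Optional[List[Optional[str]]] = None,
    ) -> List[Dict[str, Any]]:
        ...  # send all documents in one request; results come back in order

    async def execute_batch_async(
        self,
        documents: List[DocumentNode],
        variable_values: Optional[List[Optional[Dict[str, Any]]]] = None,
        operation_names: Optional[List[Optional[str]]] = None,
    ) -> List[Dict[str, Any]]:
        ...
```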
itolosa commented on Sep 1, 2023
I have one from a crypto exchange. We just need to generate a token, but it's pretty straightforward; I've managed to obtain one using a temporary email.
I'd create another transport that derives from the HTTP/requests one; in my case it's called BatchingTransport.
Actually, I have a working prototype of this in another public repo with 100% coverage, using the GitHub CI tool. I've used GitHub secrets to store the API token.
I've used threads and futures to implement this, but we could do better than that. As a prototype, though, it's OK.
You have to use the ExecutionResult instead of a dict to keep communication open with the transport processor.
Check out this:
https://github.com/itolosa/orionx-api-client/blob/master/orionx_api_client/transports/batch.py
The client should take care not to wait for the data from the execution result, so part of the validation should be split out of it into a helper function, in my case called `as_completed`.
itolosa commented on Sep 1, 2023
I'm sorry, I've realized I haven't explained myself well enough.
My implementation uses the same client you have: it receives a single document each time, but the clever thing is that if you execute it many times in a row within a short period of time, a little delay in the transport code allows it to pack the documents together before sending the entire batch to the server. So every time you receive an execution, the transport is waiting in another thread for extra documents in a queue. After this short wait, the batcher starts processing, and the cycle repeats.

Every result is returned to the client using a special ExecutionResult that has a future injected through the constructor in order to achieve lazy loading, so it fetches and waits for the result of the future if and only if you read a value from it. The transport fills in each of the responses in the list received from the server using the future: `.set_result` or `.set_exception`.
I hope I've explained myself well enough; if not, let me know. I'm glad to contribute.
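As a very rough illustration of that "special ExecutionResult" idea (a sketch under assumptions, not the code from the repository linked above):

```python
from concurrent.futures import Future
from typing import Any, Dict, List, Optional


class DelayedExecutionResult:
    """Result object returned immediately; its data arrives later via a Future."""

    def __init__(self, future: "Future[Dict[str, Any]]") -> None:
        # The batching transport keeps a reference to this future and calls
        # set_result()/set_exception() once the batched HTTP response arrives.
        self._future = future

    @property
    def data(self) -> Optional[Dict[str, Any]]:
        # Lazy: blocks only when the value is actually read.
        return self._future.result().get("data")

    @property
    def errors(self) -> Optional[List[Any]]:
        return self._future.result().get("errors")
```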
leszekhanusz commented on Sep 2, 2023
I see.
So you want to mix two different things: sending several queries in a single batch request, and automatically grouping queries that arrive close together in time (possibly from different threads) before sending them.
It's too bad you want to do that with a sync transport as that would have been much more straightforward with async...
Now, if I understand correctly, in your batch example it does not actually help to wait using a timer, as you have to first provide all the queries and then run `as_completed` to wait for the answers after the timer expires. But in that case the user knows all the queries he wants to do, as the code is sync, so if he could instead simply provide all the queries and run an `execute_batch` method manually, that would be better for him (the queries would be sent in a single batch request, but without needing to wait).

So that code with the timer only starts to be useful if the queries are provided from different threads; in that case you don't know when multiple different queries from different threads might be needed in the same time frame, and waiting might in some cases be useful. For that use case, we could call `future.result()` inside the execute method in order not to return a Future, the downside being that you need to use different threads to make different queries.

The reason I'm not really sure about returning a Future from the transport is that if you check the client.py file of gql, you'll see we can already return a Dict or an ExecutionResult from the execute methods, and for that we have ugly `@overload`s all over the place; it's getting quite complicated.

In any case we need to separate the concerns: adding only an `execute_batch` method in the current transports to send a batch request directly, while managing the batching should be done in client.py, to easily be able to add batching functionality to any transport.

We could split this into multiple PRs:
1. An `execute_batch` method in `RequestsHTTPTransport` and `execute_batch` methods in client.py
2. Batching managed in client.py, allowing to execute queries in batch by specifying `batch=True` in an execute method
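For illustration, here is how those two PRs might look from the user's side (a sketch only; `execute_batch` and the `batch=True` flag are the proposals above, not an existing gql API):

```python
from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

transport = RequestsHTTPTransport(url="https://example.com/graphql")
client = Client(transport=transport)

query1 = gql("query { user(id: 1) { name } }")
query2 = gql("query { user(id: 2) { name } }")

# PR 1: explicit batching -- all queries provided up front,
# sent to the server as a single batch request.
results = client.execute_batch([query1, query2])

# PR 2: automatic batching -- a normal execute() call opts in with
# batch=True and gets grouped with other pending queries made from
# other threads within a short time window.
result = client.execute(query1, batch=True)
```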
Note: I don't understand how your lazy loading could work. In your code the query will be sent after the timer expires, unless the user cancelled the future before. But if the user simply receives the future and doesn't bother to get the data, then the future is not cancelled and the query will still be sent to the server. Even if `future.result()` is never called, the query should have been sent.
itolosa commented on Sep 2, 2023
Right! I agree with your concerns. It's better if we split all the different variations of transports.

In this case, what I call "lazy-load" is not really the official meaning per se. What it actually means is that the result is not fetched at the exact moment you ask for it, but it will be in the future, so it's more like a "promise". In my case the requests never get cancelled; what the delay does is wait for a couple of milliseconds so that a bunch of documents can be enqueued by the client and gathered by the transport, so they can be sent to the server at once. It's a useful trick that I believe I've seen in other implementations of GraphQL clients. But I agree with you 100% that this kind of trick is not optimal. My reason for doing it this way is that it's more transparent to the user to simply use the client as usual and let them just set a `batching=True` parameter on the client. In that case the usage is the same from the outside, and the result of every execute will be an ExecutionResult (or actually a delayed ExecutionResult) that fills in its values from the internal future stored in itself, either when a property is accessed or when it's asked to do so explicitly via the `as_completed` utility. In that last case, when the future is asked for the data, it will block the flow of the program until the transport has filled in the data.

I think it's a good idea to start with the sync batch first, which receives a collection of documents and returns another collection with the results. I'll open a PR for it.
Thanks!
leszekhanusz commented on Sep 2, 2023
Alright. To be clear, this should update the existing RequestsHTTPTransport and not create another transport.
itolosa commented on Sep 2, 2023
Yes, the design would be to add a `Transport#execute_batch` and a `Client#execute_batch`. In the beginning, just for the sync transport, raising a NotImplementedError for the rest.
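A minimal sketch of that split, with assumed signatures (not gql's actual code): the base transport refuses batching by default, and only the requests-based transport overrides it at first.

```python
from typing import Any, List

from graphql import DocumentNode, ExecutionResult


class Transport:  # illustrative stub of the base transport
    def execute_batch(
        self, documents: List[DocumentNode], **kwargs: Any
    ) -> List[ExecutionResult]:
        # Default: batching is not supported by this transport.
        raise NotImplementedError(
            f"Batching is not implemented for {type(self).__name__}"
        )


class RequestsHTTPTransport(Transport):
    def execute_batch(
        self, documents: List[DocumentNode], **kwargs: Any
    ) -> List[ExecutionResult]:
        # Serialize all documents into a single JSON array, POST it once,
        # and map the response array back to ExecutionResults in order.
        ...
```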
One thing that I didn't understand: what did you mean by "that would have been much more straightforward with async"? Is it applicable in this case? I think you mean that instead of using threads and futures, a simple async/promise approach would have been more straightforward, right?
leszekhanusz commented on Sep 2, 2023
👍
Right
itolosa commented on Sep 2, 2023
Agreed. Actually, I wrote this code about six years ago and I didn't know about async 😄
I'm struggling a little bit with the signature of the `.execute_batch` method. Since `.execute` receives a single `document`, this new method would receive an iterable of them, but the same applies to `variable_values` and `operation_name`. I see a couple of solutions here:

1. `documents=[<document>, ...], variable_values=[<variable_name>, ...], operation_names=[<operation_name>, ...]`
2. `queries=[<document | variable_values | operation_name>, ...]`

The biggest problem with solution 1 is that it seems very dangerous to me, since the user may end up mixing the order of the params and send a query with the wrong correspondence between its `<document, variables, operation>` elements.

Solution 2 seems better, but it implies the user has to build a `Query` data structure or an ugly list that contains all the sub-queries, and it seems quite inconsistent with the `execute` method.

If I have to choose, I'd go with the `Query` option, and `execute_batch` would receive `queries=Iterable[Query]`.

Do you have any idea about it? Thanks
itolosa commented on Sep 2, 2023
Also, I believe some queries don't need to receive variables or an operation name, so if we decide to receive an iterable for every parameter, we'll end up with some `None` values in between.

With a data structure instead:
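A minimal sketch of such a data structure (the fields are assumptions; the class name follows the discussion below, where `GraphQLQuery` is the working name):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

from graphql import DocumentNode


@dataclass
class GraphQLQuery:
    document: DocumentNode
    variable_values: Optional[Dict[str, Any]] = None
    operation_name: Optional[str] = None
```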
leszekhanusz commented on Sep 2, 2023
Yes, using a dataclass seems to be a much better option indeed.
leszekhanusz commented on Sep 2, 2023
but I would use something like `Operation` instead of `GraphQLQuery` to better correspond to GraphQL terminology.
itolosa commented on Sep 3, 2023
The spec says: "Request".

Therefore: `GQLRequest` instead of `Request`, to avoid conflicts with the requests library.