Faster lookups for reuse leveraging IR #287
(Another optimization would be to look up only (id, name, property) when searching the current folder, though that may already be happening.)
For context, I think our standard compiled workflow now has 56 applets and 7 workflows per build. We are creating builds more and more frequently (over the past 6 months we have created 2.6K workflows and 8K applets, and that's ramping up), and we hope to keep using the same project indefinitely for builds in order to leverage reuse.
I dug into this yesterday, with the help of the platform team. There are two separate problems, both on the platform side, not in dxWDL.
The platform team has filed bugs for these and will work on them.
The dxWDL part of this issue has been fixed: optimizations have been implemented for the find-data-objects queries, so I am closing it here. The remaining problems are still an issue for the rest of our team.
I actually meant a slightly different point here. Currently, my understanding is that the reuse component of workflow compilation functions as follows:
Now that we have more than 2K applets in our project, step (1) is going to start causing issues. I was thinking of a different strategy for projectWideReuse that would potentially be more performant (or at least move less data across the wire) and would work, without pagination, regardless of the size of the project.
The good part is that this makes lookup time proportional to the size of the workflow rather than the size of the project, and (I believe) the lookups should be relatively quick because they do a query on the property first and only describe the found IDs afterwards (my understanding is that there is an index like (project, property, dxID)). That means each request should be (relatively!) small.
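A rough sketch of that two-phase lookup, in Python. The request shapes here (scope, properties, describe, the `$in` operator) are my guesses modelled on a findDataObjects-style API, not the exact platform interface:

```python
def reuse_lookup_requests(project_id, wanted_checksums, batch=100):
    """Phase 1: one find request that matches on the dxWDLChecksum
    property and returns only object IDs.  Phase 2: small describe
    requests for just the IDs that were found.

    All field names are illustrative, not the real platform API.
    """
    find_request = {
        "scope": {"project": project_id},
        "properties": {"dxWDLChecksum": {"$in": sorted(wanted_checksums)}},
        "describe": False,  # IDs only; describe later, in small batches
    }

    def describe_requests(found_ids):
        ids = sorted(found_ids)
        return [
            {"id": ids[i:i + batch], "fields": {"name": True, "properties": True}}
            for i in range(0, len(ids), batch)
        ]

    return find_request, describe_requests
```

The point is that the cost scales with the number of checksums the workflow needs, not with the total number of applets in the project.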
I asked the back-end team, and this is worth a try.
It turns out that this isn't so simple to do, because the checksum cannot be computed from the IR alone. It also covers referenced data objects, so it can only be calculated incrementally, while building a complex workflow from the bottom up. There are two ways I could think of to do this:
Both approaches are suboptimal, and I am not sure which is better. In the meantime, I limited the number of results returned by adding a constraint on the data-object name: it has to be one of the names we are generating, which are known after the IR phase. Let's see if 1.18 is sufficient.
@jtratner, is this version better?
What specifically can't be computed just from the IR? Any data objects should be resolvable prior to compilation, right? |
Right. But if you create a new applet, workflow, or data object, it has an unpredictable ID. Let's say that you have a workflow that compiles into applet B, which depends on applet A. B's checksum requires the ID for A. You have to first create A, and then create B.
I’m suggesting that checksums should not include the dependencies compiled from the workflow. You could walk the entire IR, looking at dependents first, and generate checksums based upon checksums, and so on. Then you could use those to define the dxWDLChecksum property. At each step, you’d check for the value of that property and create the object only if it does not exist. There is no reason to hash in the opaque identifier from DNAnexus.
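The bottom-up walk described above can be sketched as follows. `nodes` and `deps` are hypothetical stand-ins for the IR (assumed to be a DAG); the key property is that a checksum is derived only from source text and dependency checksums, never from platform-assigned IDs, so it is predictable before anything is created:

```python
import hashlib

def ir_checksums(nodes, deps):
    """Compute a content checksum for every IR node, bottom-up.

    nodes: dict name -> canonical source text of the applet/workflow
    deps:  dict name -> list of names it depends on (must form a DAG)

    A node's checksum covers its own source plus the checksums of its
    dependencies, so a change in a leaf propagates up to every
    workflow that (transitively) uses it.
    """
    out = {}

    def checksum(name):
        if name in out:
            return out[name]
        h = hashlib.sha256()
        h.update(nodes[name].encode())
        # Fold in dependency checksums in a stable order.
        for dep in sorted(deps.get(name, [])):
            h.update(checksum(dep).encode())
        out[name] = h.hexdigest()
        return out[name]

    for name in nodes:
        checksum(name)
    return out
```

At each step the compiler would then look up `dxWDLChecksum == ir_checksums(...)[name]` on the platform and create the object only on a miss.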
More recent versions of dxWDL, as well as dxCompiler, constrain the search by the applet names (which are deterministic). Hopefully this has sped up the query.
Right now I still notice that queries for existing applets and files take a varying, sometimes quite long, amount of time. My guess is that some of this has to do with system performance when describing workflows or applets with long specifications.
Would it make sense to use an $in query to limit the response to a smaller set of files? E.g., pseudocode-wise, rather than:

instead do:
I think the complex property filter will still cause the backend to search through all objects, but it should limit how many of the found objects get described, thus improving response time.
(this is based on my guesses of overall implementation)
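For illustration, here is roughly what I mean, written as Python dicts standing in for the two query bodies (the field names and the `$in` operator are guesses at the API, based on my understanding of the implementation):

```python
def property_only_query(project_id):
    # Broad: matches every object in the project carrying the checksum
    # property; the backend then has to describe all of them.
    return {
        "scope": {"project": project_id},
        "properties": {"dxWDLChecksum": True},
    }

def name_constrained_query(project_id, expected_names):
    # Narrower: additionally restrict matches to the names this
    # compilation would actually generate, so far fewer of the
    # found objects need to be described.
    query = property_only_query(project_id)
    query["name"] = {"$in": sorted(expected_names)}
    return query
```

The second form sends a slightly bigger request but should produce a much smaller result set when the project holds thousands of applets.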