Test with 100 activities with 10 results in each activity, 10 indicators in each result and 20 periods in each indicator #1557
Comments
@emmajclegg please have a look at this.
Thanks @PG-Momik - some initial responses to this:
My questions:
Average time taken for each major step in bulk publish
Most of the time consumed during bulk publish is spent validating against the IATI Validator. As mentioned in my previous reply, changing how we validate against the IATI Validator will reduce this time to less than half (I'll get back to you with the actual time improvements).
I'll look into both setting a maximum file size limit and chunking the dataset into 60MB pieces when validating with the Validator.
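For illustration only, here is a minimal sketch (in Python, rather than IATI Publisher's own codebase) of the kind of chunking described above: grouping activity XML snippets into batches that stay under a configurable size limit before sending them to the Validator. The 60MB constant and the function name are assumptions for this sketch, not existing code.

```python
# Sketch only: batch activity XML snippets so each batch stays under a size limit.
MAX_BATCH_BYTES = 60 * 1024 * 1024  # 60MB limit discussed in this thread


def chunk_activities(activity_xml_snippets, max_bytes=MAX_BATCH_BYTES):
    """Group activity XML strings into batches whose combined size stays under max_bytes."""
    batches, current, current_size = [], [], 0
    for xml in activity_xml_snippets:
        size = len(xml.encode("utf-8"))
        # Start a new batch when adding this activity would exceed the limit.
        # (A single activity larger than max_bytes still gets its own batch.)
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(xml)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch could then be wrapped in a single `iati-activities` element and submitted as one validation request, instead of one request per activity.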
@PG-Momik (cc' @praweshsth) - to summarise from our discussion this morning:
This is really useful and interesting - thank you @PG-Momik! 60MB is a very large amount of IATI data, so it's not surprising that it is taking a long time to go through the pipeline. We can see from the Dashboard that very few files are larger than 20MB, and those that are approaching the 60MB limit are from some very large organisations - such as UNICEF's files for their HQ budgets and activities in India.

The 60MB limit will become more important over time as we make all of IATI's tools behave in a more similar way. For example, files over the limit already don't appear in the Datastore, and may soon be excluded from d-portal as well. As I shared on the call, I would like to see if we can reduce the limit: it's easier for systems to process 3 x 20MB files than one 60MB file.

50 minutes seems like a very long time for a single file to take to validate on the IATI Validator. I've just tested a 50MB file that I downloaded (one of the UNICEF files) and it validated in ~2 minutes. I see above that you're validating activities one-by-one; I suspect that is the cause of the slow validation, as that's not how the Validator is intended to be used. It also misses out on some of the more advanced validation, such as activity ID lookups for related activities.

Whenever we talk about IATI Publisher's role in the IATI ecosystem we talk about it being "for organisations that have a small amount of IATI data to publish" - if someone has enough data to hit the IATI pipeline limit, then IATI Publisher probably isn't suitable for their use case.

If you're autogenerating the test file, then maybe a more appropriate test file might be to randomise the number of results, indicators and periods within each activity, with the maximums set at the current levels? I defer to @emmajclegg as to whether that's more realistic, however.
Hi @robredpath

Regarding the 60MB limit

We're on the same page. With this in mind, we've decided to keep track of the file size an organisation has published to date. When performing a publish, we'll check whether the cumulative file size exceeds 60MB and prevent the publish accordingly.

Regarding use of the Validator

We're currently implementing validating all activities in one go. One challenge we've come across is the errors and the line numbers each error is on. The IATI Validator validates the entire merged XML in one go and returns error line numbers for the merged XML. IATI Publisher shows the error messages received from the IATI Validator in two places for each activity:
Since the line numbers of error messages for the merged XML do not directly correspond to the line numbers of individual activities within their original XML files, it becomes challenging to accurately display the error messages in IATI Publisher. Would it be possible to update the IATI Validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file?

I'm currently working on a mapper to map error lines, but it's taking significantly longer than I initially anticipated. If changes to the IATI Validator are not feasible or would require a substantial amount of time to implement, I'll continue working on this approach. However, this is delaying the completion of this task beyond what I had estimated.
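As a rough illustration of the cumulative file size check described at the start of this comment (a sketch only; the function name and parameters are hypothetical, and 60MB is the limit discussed in this thread):

```python
# Sketch only: hypothetical check run before publishing, based on the cumulative
# file size tracking described above.
PUBLISH_LIMIT_BYTES = 60 * 1024 * 1024  # 60MB limit mentioned in this thread


def can_publish(already_published_bytes: int, new_file_bytes: int,
                limit: int = PUBLISH_LIMIT_BYTES) -> bool:
    """Return True if publishing the new file keeps the organisation's
    cumulative published size within the limit."""
    return already_published_bytes + new_file_bytes <= limit
```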
Hi @PG-Momik
This makes sense. As we discussed last week, it will likely be beneficial to set an even lower warning threshold for file size (20 or 30MB?). This doesn't need to prevent publishing, but should warn in cases where publishing is going to take a long time.
Case 1 (if you're referring to the error message below) has actually been on our snagging list for a while, and is something I was going to raise an issue about removing: we don't think many users use the XML download functionality, and the feedback above is not particularly readable or actionable. IATI Publisher's validator feedback within the activity detail page is much more useful, and users can run their XML files through the IATI Validator itself if they really need to. Therefore - if removing the XML download feedback pictured above makes this bulk publishing work simpler, by all means remove it.
I've asked within the team about this request for activity-level error lines - I don't know if it's easy to implement or something we see as valuable elsewhere. In the meantime, can you let me know roughly how much time you think will be required for the error mapping? (And how much will this impact the time bulk publishing takes?)
Lastly, to respond to @robredpath's suggestion above - yes, this sounds like a good way to get a better sense of average publishing times. It's still useful to know what IATI Publisher's limits are in terms of maximums, but most users will have significantly less data. Possible maximums to use for random experimentation are 50 transactions, and 10 results with 5 indicators each and 5 periods. Again, looking at 1-2 AidStream users with a lot of data as examples could serve as another source of maximums.
I'm hoping that we'll be able to carry out a substantial overhaul of Validator messaging in 2025 - the line/column references are a common cause of complaints about the Validator, as most of the tools that people use to create their XML don't work that way. But I wouldn't expect any changes around that before mid-2025, realistically. I've asked a colleague to see if we can provide per-activity line numbers (or, presumably, an offset from the start of the file would be ok?) but I think that's quite a substantial change so I wouldn't expect it to be straightforward. I'll let you know when I hear back.

Do you send multiple activities to be validated in parallel? I'm not sure quite how many parallel validation processes we can run, but I think that 3-5 would be fine. We can always look at additional capacity if necessary.

Alternatively, would it make sense for you to run your own instance of the Validator? All of the code is on GitHub (although it's all written for Azure). We could arrange a call with our developers if you wanted to discuss what that might involve.
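To make the parallelism question concrete, below is a minimal sketch of submitting several files to a validation endpoint with a small concurrency cap (3, within the 3-5 range suggested above). The endpoint URL and response handling are placeholders, not the real Validator API.

```python
import concurrent.futures

import requests

# Placeholder endpoint - not the real IATI Validator API.
VALIDATOR_URL = "https://validator.example.org/validate"
MAX_PARALLEL = 3  # within the 3-5 parallel requests suggested above


def validate_file(xml_bytes: bytes) -> dict:
    """Send one XML payload to the placeholder validation endpoint and return its JSON report."""
    response = requests.post(
        VALIDATOR_URL,
        data=xml_bytes,
        headers={"Content-Type": "application/xml"},
        timeout=600,
    )
    response.raise_for_status()
    return response.json()


def validate_in_parallel(files: list[bytes]) -> list[dict]:
    """Validate several files concurrently, never running more than MAX_PARALLEL at once."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        return list(pool.map(validate_file, files))
```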
Thank you @robredpath. I'm a bit late following up on this topic, but I've completed writing the logic to map error line numbers at the activity level based on the offset. It seems to be working well. You can disregard my comment. cc: @emmajclegg
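For anyone following along, here is a sketch of the general idea behind offset-based mapping: record the starting line of each activity inside the merged XML, then translate a Validator error line in the merged file into an activity index and a line within that activity. This is an illustration of the approach, not the actual IATI Publisher implementation; the header handling is an assumption.

```python
import bisect

# Sketch of offset-based error line mapping, not the actual IATI Publisher code.


def build_offsets(activity_line_counts: list[int], header_lines: int = 2) -> list[int]:
    """Compute the starting line of each activity in the merged XML.

    header_lines is an assumption covering the XML declaration and the
    opening iati-activities tag.
    """
    offsets, line = [], header_lines + 1
    for count in activity_line_counts:
        offsets.append(line)
        line += count
    return offsets


def map_error_line(merged_line: int, offsets: list[int]) -> tuple[int, int]:
    """Translate a line number in the merged file into (activity_index, line_within_activity)."""
    index = bisect.bisect_right(offsets, merged_line) - 1
    if index < 0:
        raise ValueError("Error line falls before the first activity (file header).")
    return index, merged_line - offsets[index] + 1
```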
@emmajclegg I feel like we've diverted a bit from our primary objective: increase bulk publish to 100.

1. Regarding the file size limit

As I stated in the discussion, we should address the import and publish size limits in a separate issue. We can decide there whether to show a warning or an error at a certain threshold.

2. Regarding the error message on download
If we're looking to decrease bulk publish time, removing this feedback from the XML download will not result in any performance gains for either bulk publish or download.
3. Regarding error mapping

My initial concern with error mapping was the number of hours it would take me. I was hoping there'd be a quicker solution on the Validator side, but I've managed to write a function to map errors. In terms of its impact on bulk publish time, it takes under 2 minutes to map errors (depending on the payload, even less).

From our initial stress test, we know that it is possible to publish 100 files. I think we ought to focus on approaches to reduce bulk publish time.
As @robredpath mentioned, it should take ~2 minutes to validate this payload, and yes, it does just take ~2 minutes to actually validate. Most of the time is taken up by XML generation and the upload to S3 (I think the S3 upload is related to other features; I'll leave a follow-up on this). I'll look further into ways to reduce bulk publish time.
I'll comment with the bulk publish time taken for 100 activities with the said payloads. cc: @BibhaT
Note
According to the previous discussion, for testing bulk publish of 100 activities, we used 50 transactions, 50 results, 10 indicators per result and 10 periods per indicator.
Context
This time, we used 1 to 3 transactions, with 10 results, 10 indicators per result and 20 periods per indicator for each activity. By significantly reducing the number of results and their children, we were able to reduce the size of each activity XML to less than 2MB. We created 100 similar activities for this test.
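For context on how such a payload can be produced, here is a simplified sketch of generating test activities with the counts described above (1-3 transactions, 10 results, 10 indicators per result, 20 periods per indicator). The dictionary structure and identifier scheme are stand-ins; the real test data would be full IATI activity XML produced by the publisher itself.

```python
import random

# Sketch only: simplified stand-in for the generated test payload described above.


def generate_test_activities(n_activities: int = 100) -> list[dict]:
    """Build activity descriptions with 1-3 transactions, 10 results,
    10 indicators per result and 20 periods per indicator."""
    activities = []
    for i in range(n_activities):
        activities.append({
            "identifier": f"TEST-ACTIVITY-{i:03d}",  # hypothetical identifier scheme
            "transactions": random.randint(1, 3),
            "results": [
                {"indicators": [{"periods": 20} for _ in range(10)]}
                for _ in range(10)
            ],
        })
    return activities
```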
Findings
Questions
Possible changes
We displayed progress text like `Validating activities (x/N)` in our previous design. Since our current design does not have a progress bar to display text like `Validating activities (x/N)`, we could opt to validate all 100 activities in one go if they are under 60MB (the IATI Validator doesn't accept XML files larger than 60MB). This will decrease our bulk publish time.