Performance issue #2063

Open
liquid36 opened this issue Jan 30, 2025 · 2 comments

Comments

@liquid36

Hi!
I'm working with a synthetic dataset in order to test the tool. I have 1,000 patients and 50,000 conditions.

I made this basic request, and it never completes.

POST http://localhost:9090/fhir/Patient/$aggregate

{
    "resourceType": "Parameters",
    "parameter": [
        {
            "name": "aggregation",
            "valueString": "count()"
        },
        {
            "name": "grouping",
            "valueString": "reverseResolve(Condition.subject).code.coding.where(subsumedBy(http://snomed.info/sct|73211009))"
        }
    ]
}

The problem is that Pathling makes one request per Condition to check whether it belongs to <<73211009, and that is not viable. How do you deal with this?
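For anyone who wants to reproduce the request above from a script, here is a minimal Python sketch. The endpoint and the Parameters body come from the request shown in this comment. The helper names (`build_aggregate_parameters`, `post_aggregate`), the 60-second timeout, and the `application/fhir+json` headers are assumptions made for illustration.

```python
import json
import urllib.request


def build_aggregate_parameters(aggregation: str, grouping: str) -> dict:
    """Build the FHIR Parameters body for the $aggregate request shown above."""
    return {
        "resourceType": "Parameters",
        "parameter": [
            {"name": "aggregation", "valueString": aggregation},
            {"name": "grouping", "valueString": grouping},
        ],
    }


def post_aggregate(base_url: str, params: dict, timeout: float = 60.0) -> dict:
    """POST the Parameters resource to [base]/Patient/$aggregate.

    The timeout is an assumption; it makes a request that "never ends"
    fail with an error instead of hanging indefinitely.
    """
    request = urllib.request.Request(
        f"{base_url}/Patient/$aggregate",
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Content-Type": "application/fhir+json",
            "Accept": "application/fhir+json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


params = build_aggregate_parameters(
    "count()",
    "reverseResolve(Condition.subject).code.coding"
    ".where(subsumedBy(http://snomed.info/sct|73211009))",
)
# To send it to a local server, as in the original request:
# result = post_aggregate("http://localhost:9090/fhir", params)
```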

@johngrimes
Member

Hi @liquid36,

Thanks for trying it out!

Which terminology server are you using? Pathling uses https://tx.ontoserver.csiro.au/fhir by default. Are you using something different?

There is a configuration option that might help diagnose the problem: pathling.terminology.verboseLogging (https://pathling.csiro.au/docs/server/configuration#terminology-service). Some logging with this option turned on might be helpful.

We have tried many different strategies for making terminology requests, and we found the individual request model to actually work fastest. This is because we can effectively parallelize the requests, cache the results and only make unique requests that we have not made before. Pathling has a client-side cache to facilitate this, and most terminology servers will also have a server-side cache in addition to this.

We have demonstrated that this works effectively on large datasets with tens of thousands of unique SNOMED CT codings. Ontoserver is the terminology server that we prefer to use, and it can service a subsumes request in less than 5 ms.
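The strategy described above (only make unique requests, cache the results, run the remaining requests in parallel) can be illustrated with a short Python sketch. This is not Pathling's implementation. The function `check_subsumed`, its parameters, and the injected `lookup` callable (standing in for a single call to the terminology server) are all hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical cache type: (ancestor code, candidate code) -> subsumed or not.
Cache = Dict[Tuple[str, str], bool]


def check_subsumed(
    codes: Iterable[str],
    ancestor: str,
    lookup: Callable[[str, str], bool],
    cache: Cache,
    max_workers: int = 8,
) -> Dict[str, bool]:
    """Return {code: is_subsumed_by_ancestor} for every distinct code.

    `lookup(ancestor, code)` stands in for one terminology request.
    It is called at most once per distinct code that is not cached.
    """
    # 1. De-duplicate: many Conditions typically share the same coding.
    unique = list(dict.fromkeys(codes))

    # 2. Skip anything already answered by the client-side cache.
    missing = [code for code in unique if (ancestor, code) not in cache]

    # 3. Issue the remaining requests in parallel and store the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda code: lookup(ancestor, code), missing)
        for code, result in zip(missing, results):
            cache[(ancestor, code)] = result

    return {code: cache[(ancestor, code)] for code in unique}
```

With this shape, the number of terminology requests scales with the number of distinct codings in the data rather than with the number of Condition resources, and repeated queries against a warm cache make no requests at all.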

@liquid36
Author

I'm using Snowstorm, but I just tried with the default Ontoserver and it worked better.

How could I parallelize the requests? By deploying a Spark cluster?

For the amount of data I mentioned before, I'm getting a response in 3/4 seconds with the cache enabled. Is that OK?
