run_sbc (and run_tarp) run time #1329

Closed
humnaawan opened this issue Dec 11, 2024 · 5 comments
Labels
question Further information is requested

Comments

@humnaawan

Hello, is there any documentation on how to effectively use num_workers and use_batched_sampling? I am running into very long run times with run_sbc and I am not sure what's going wrong. Here's how I'm calling the function:

    ranks, dap_samples = run_sbc(
        thetas=thetas,
        xs=xs,
        posterior=posterior,
        num_posterior_samples=nsamples,
        show_progress_bar=True,
        num_workers=ncpus,
    )

I have 1000 simulations and I set nsamples to 1000. When I toggle between use_batched_sampling=False and use_batched_sampling=True (the default) in the function call, the former at least gives me progress updates, although it still doesn't finish.

Looking through the code, I think the bottleneck might be max_sampling_batch_size, which is set to 10,000. The parameter is not exposed, though (at least when you build a posterior via inference.build_posterior). I did set simulation_batch_size in simulate_for_sbi (to int(nsims/ncpus)), but I don't think that gets communicated to the DirectPosterior object.
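
In case it helps clarify what I mean, this is roughly the workaround I was considering: constructing the DirectPosterior by hand so that I can pass a smaller max_sampling_batch_size. I haven't verified the exact constructor signature, so please treat this as a sketch rather than something I know works:

    # hypothetical workaround (argument names not verified against the current API):
    # build the DirectPosterior directly to control the sampling batch size
    from sbi.inference.posteriors import DirectPosterior

    posterior = DirectPosterior(
        posterior_estimator=density_estimator,
        prior=prior,
        max_sampling_batch_size=1_000,  # instead of the 10,000 default
    )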

I run into the same issue with run_tarp, which doesn't have use_batched_sampling exposed (although #1321 should enable that once it's merged).

I use cpus-per-task=35 in my sbatch script and have confirmed that 35 CPUs are indeed available. With the default use_batched_sampling, the run_sbc call seems to be stuck at 1/1000 even after 5 hours; with use_batched_sampling=False, it barely passes 100/1000 after 12 hours (even though the progress bar's time estimates suggest otherwise).

I'd really appreciate some help. I am starting to unpack run_sbc since I can't think of anything else, but I thought I'd inquire here in case I'm missing something. My understanding is that my call never makes it past get_posterior_samples_on_batch (which calls posterior.sample_batched).

Thank you!

@humnaawan humnaawan added the question Further information is requested label Dec 11, 2024
@janfb
Contributor

janfb commented Dec 11, 2024

Hi @humnaawan

Thanks for reporting this! Some context that might already help:

  • batched sampling is quite fast for "direct" posteriors like NPE because it's just a forward pass through the flow, and we can pass the entire batch of xs at once.
  • for MCMCPosteriors it is in principle much slower because we have to run MCMC for each x in xs separately. For the slice_np_vectorized MCMC method we implemented batched sampling, but it's still slower because it has to run MCMC and evaluate the flow for each element in the chain.
  • when you use batched sampling, it's one single call to the sample method with a big batch of xs, so we cannot parallelize it. Thus, num_workers only has an effect if use_batched_sampling=False.

To summarize:

  • with an NPE-based posterior it should be quite fast, and if not, something is off
  • with an MCMC-based posterior it will be slow anyway, but using slice_np_vectorized with batched sampling (the default) is probably the fastest way. Alternatively, you could try use_batched_sampling=False with many workers.
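
In pseudo-code, the two modes look roughly like this (a sketch of the idea, not the exact internals of run_sbc):

    # batched sampling: a single call conditioned on the whole batch of xs,
    # so there is nothing for num_workers to parallelize
    samples_batched = posterior.sample_batched((num_posterior_samples,), x=xs)

    # non-batched sampling: one sampling call per x, and these independent
    # calls are what num_workers can distribute across processes
    samples_per_x = [
        posterior.sample((num_posterior_samples,), x=x_i) for x_i in xs
    ]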

Note that num_posterior_samples is just the number of posterior samples used during sbc / tarp; it's different from num_sbc_samples, which is the number of xs and thetas. num_sbc_samples is the main bottleneck and should be on the order of hundreds, e.g., 100-500, to give reasonable results.

Does this help?

@humnaawan
Author

Hi @janfb, thanks so much! I am indeed working with a direct posterior (sorry for not including that detail in my first post), built via:

    # create inference object
    inference = NPE(prior=prior)
    # generate simulations
    theta, x = simulate_for_sbi(
        simulator=simulator,
        proposal=prior,
        num_simulations=nsims,
        seed=seed,
        show_progress_bar=True,
        num_workers=ncpus,
    )
    # pass sims to inference object
    inference = inference.append_simulations(theta=theta, x=x)
    # now train
    density_estimator = inference.train()
    # build posterior
    posterior = inference.build_posterior(density_estimator=density_estimator)

Thank you for reminding me about num_sbc_samples vs. num_posterior_samples. I'm currently setting both to 1000, so my xs and thetas are shaped (torch.Size([1000, 500]), torch.Size([1000, 2])) (my simulation produces a 500-point spectrum and I am attempting to constrain two parameters).

It is helpful to know that num_workers only plays a role without batch sampling. I'm not sure why either option is not working with my call to run_sbc though.

@janfb
Contributor

janfb commented Dec 11, 2024

I see, thanks for the details. So you're effectively evaluating the underlying density estimator with a batch size of 1000 thetas (posterior samples) and 1000 xs (sbc samples), which could be the bottleneck here.
Are you using some kind of embedding network to process the 500-D input dimension?

What I would try:

  • use fewer num_sbc_samples
  • set use_batched_sampling=False and pass num_workers=30 or so.
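
For example, something along these lines (just a sketch; adapt the names to your script):

    # use a subset of the sbc samples, e.g., 200 instead of 1000
    thetas_sbc, xs_sbc = thetas[:200], xs[:200]

    ranks, dap_samples = run_sbc(
        thetas=thetas_sbc,
        xs=xs_sbc,
        posterior=posterior,
        num_posterior_samples=nsamples,
        use_batched_sampling=False,  # so that num_workers takes effect
        num_workers=30,
        show_progress_bar=True,
    )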

@humnaawan
Author

Thank you! I am currently not using an embedding network, as I wanted to see how things work out of the box; it's certainly on my to-do list. I will try your two suggestions and hopefully the function call finishes.

I appreciate your quick feedback!

@janfb
Contributor

janfb commented Dec 13, 2024

I would recommend using at least a small embedding net, e.g., the standard MLP we have implemented here:
https://github.com/sbi-dev/sbi/blob/main/sbi/neural_nets/embedding_nets/fully_connected.py

and explained here:
https://github.com/sbi-dev/sbi/blob/main/tutorials/04_embedding_networks.ipynb

Otherwise it could be challenging for the flow-based density estimator to cope with the 500-D conditioning dimension.
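
A rough sketch of how to plug it in (please double-check the import paths and argument names against your sbi version; FCEmbedding and the posterior_nn factory are the pieces I have in mind):

    from sbi.inference import NPE
    from sbi.neural_nets import posterior_nn
    from sbi.neural_nets.embedding_nets import FCEmbedding

    # compress the 500-D spectrum into a low-dimensional summary before the flow
    embedding_net = FCEmbedding(input_dim=500, output_dim=20)
    density_estimator_build_fn = posterior_nn(model="maf", embedding_net=embedding_net)
    inference = NPE(prior=prior, density_estimator=density_estimator_build_fn)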

I am moving this issue to discussions and closing it here. Feel free to post updates on your case there.

@sbi-dev sbi-dev locked and limited conversation to collaborators Dec 13, 2024
@janfb janfb converted this issue into discussion #1332 Dec 13, 2024
