
Import a large glb file (778MB) which contains 800 models will crash the editor. #93587

Open
AllenDang opened this issue Jun 25, 2024 · 21 comments

Comments

@AllenDang

Tested versions

4.2 stable

System information

macOS 14.5 - forward+ - godot 4.2 stable

Issue description

Importing a large glb file (778 MB) containing 800 models crashes the editor.

Steps to reproduce

  1. Create a new project.
  2. Drag and drop the large glb file into the editor.

Minimal reproduction project (MRP)

Here is the glb file
https://drive.google.com/file/d/1f74-29422AmZQJohng74ySdELGJptgSA/view?usp=sharing

@fire
Member

fire commented Jun 25, 2024

Can you check 4.3? The CoW (copy-on-write) data size limit was increased to a larger value there.

@Sluggernot

Tried on latest from GitHub (4 or 5 days ago). It hangs on import. Restarting the editor automatically restarts and re-hangs the import.
For some reason my Attach to Process is being disconnected, and reattaching it doesn't show me the call stack. (Mind currently blown.)
Just pulled latest and recompiling.

@lvcivs

lvcivs commented Jun 25, 2024

I tried this on 4.3.beta2.official and although it was very slow, it did eventually load after about 6 minutes (during the whole time it appeared stuck at 0%).
[screenshot: import progress stuck at 0%]

Opening the scene took a couple more minutes:
[screenshot: scene opened in the editor]
This was on Ubuntu 24.04. Edit: Godot uses about 9 GB of RAM with this scene open.

@AllenDang
Author

AllenDang commented Jun 25, 2024

@lvcivs I created this file just for testing purposes, to see how Godot handles it :P

@JekSun97

After transferring the model to Godot 4.3 beta2, it still didn't load for me; I waited 28 minutes, then closed it.
I also tested this in Blender 3.6.2: after 3 minutes, Blender closed itself, which didn't happen with Godot.

Godot v4.3.beta2 - Windows 10.0.19045 - Vulkan (Mobile) - dedicated Radeon RX 560 Series (Advanced Micro Devices, Inc.; 31.0.14001.45012) - Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 Threads)

@fire
Member

fire commented Jun 26, 2024

The next step is to get profiles of the load.

My recommendation is to use either https://github.com/mstange/samply or https://superluminal.eu/

@Sluggernot

Yes, I have been able to load the file. I did some quick benchmarking with Visual Studio and have a couple of very small efficiency improvements made locally. I need to benchmark before and after once I get some really good changes made to this.
The main finding is that _parse_meshes is the main function loading this file. My changes are to GenerateSharedVerticesIndexList and one small one to static SVec3 GetPosition().
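As a sketch of the pass-by-reference change described above (SVec3 and the functions here are simplified, illustrative stand-ins for the tangent-space code being discussed, not the actual source):

```cpp
// Illustrative stand-in for the SVec3 struct in the tangent-space code;
// layout and names are assumptions, not Godot's actual code.
struct SVec3 {
    float x, y, z;
};

// Before: the 12-byte struct is copied on every call.
static float length_sq_by_value(SVec3 v) {
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

// After: a const reference avoids the copy. Individually tiny, but in a
// function like GenerateSharedVerticesIndexList that touches every
// vertex of a very large mesh, the copies add up.
static float length_sq_by_ref(const SVec3 &v) {
    return v.x * v.x + v.y * v.y + v.z * v.z;
}
```

For a struct this small, compilers can sometimes pass by value in registers just as cheaply, so measuring before and after (as done here) is the right call.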

@fire
Member

fire commented Jun 26, 2024

I will try to review any pull requests that can improve load times on the 777mb glb with nothing broken.

@Sluggernot

Oh... Nothing broken? Ah, never mind then.
Really, yes, my first challenge is proving that it is faster.
Thanks!

Sluggernot added a commit to Sluggernot/godot that referenced this issue Jun 26, 2024
Created in response to: godotengine#93587
Large glb file (778 MB) would hang, crash or load after extremely long time. This first set of optimizations focuses on sending SVec3 objects by reference instead of by value. Local tests in debug mode caused one iteration of GenerateSharedVerticesIndexList to go from 552ms to 470ms, on average. Unsure of the performance gains on release mode.
@Sluggernot

Ok, I didn't know GitHub would add these comments from my own fork because I referenced the issue in the description. I will avoid that in the future.

@zeux
Contributor

zeux commented Jun 28, 2024

Since I ended up looking into this a little bit, I'll share my findings in hopes that it will help.

Measured by clicking "Reimport" on the scene in an otherwise empty project, --verbose says import took 276 seconds (that's a little under 5 minutes).
Note that the scene has ~800 meshes that add up to ~39.3M triangles (~50k each, looks reasonably uniformly distributed). Overall I would have expected one mesh per scene here, but I'm not familiar with how Godot workflows work, and it's a good stress test regardless.

perf profile on Linux / editor build with default settings with -fno-omit-frame-pointer -- please note that timings add up to 45% (perf doesn't normalize them):

[screenshot: perf profile]

Renormalizing the percentages by dividing by 0.45, and focusing on significant underlying components, we get:

  • 5% scene save
  • 14% tangent space generation
  • 25% normal reprojection after LOD generation (raycasts)
  • 29% simplification (meshopt_simplify)
  • 24% the rest of generate_lods (it's inlined here so hard to see from the profile exactly)

In aggregate, LOD generation takes ~78% here, so definitely good to focus on that. When looking at something like a 5-minute import though, my expectations are usually that small gains are not terribly exciting, so something more significant needs to happen.

A note on the scale here: each mesh gets approximately 6 LOD levels generated. The work for meshopt_simplify scales with that; the work for normal reprojection scales with the total number of rays, which scales with the total number of triangles in all LODs, times the area factor - looks like we cast 16..64 rays which is a lot of rays :)
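To make that scale concrete, here is a minimal sketch of such a per-triangle ray budget. The 16..64 clamp matches the figures mentioned above; the area-based scaling factor is an illustrative assumption, not the importer's actual formula.

```cpp
// Illustrative sketch only: a per-triangle ray budget that grows with
// how much original surface area a LOD triangle covers, clamped to the
// 16..64 range mentioned above. Not Godot's actual code.
static unsigned int rays_for_triangle(float lod_triangle_area, float avg_source_triangle_area) {
    float scale = (avg_source_triangle_area > 0.0f)
            ? lod_triangle_area / avg_source_triangle_area
            : 1.0f;
    unsigned int rays = static_cast<unsigned int>(16.0f * scale);
    if (rays < 16) {
        rays = 16;
    }
    if (rays > 64) {
        rays = 64;
    }
    return rays;
}
```

Even at the minimum of 16 rays, the total ray count is proportional to the triangle count summed over all LOD levels, which is why the reprojection dominates the profile.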

If I were tackling this problem, I would entertain the following projects:

  1. For scenes with many large meshes like this, my first goal would be to process meshes in parallel. I'm not familiar with the details of ImporterMesh code but superficially nothing should prevent fully generating each mesh in parallel. Maybe that requires refactoring some of this code to actually be thread-safe. It would also require making sure that the dependent code is thread-safe internally - meshopt definitely is, I assume so is Embree, but some care would be required. That alone would probably get this to be under a minute on an 8-core system if we discount tangent space generation.

  2. I'm skeptical that tangent space generation is efficient here. For a sense of scale, meshopt_simplify does a fair bit more work per call, and it's called ~6 times per mesh here and still only takes twice as much time. I would assume tangent space generation has internal algorithmic inefficiencies and could be improved, but I haven't looked at that code myself.
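A minimal sketch of point (1), assuming a hypothetical generate_lods_for_mesh function and ignoring the thread-safety refactoring the comment says would be required:

```cpp
#include <future>
#include <vector>

// Hypothetical per-mesh data and LOD pass; names are illustrative, not
// the actual ImporterMesh API.
struct MeshData {
    int lods_generated = 0;
};

static void generate_lods_for_mesh(MeshData &mesh) {
    // Placeholder for simplification + tangent/normal processing.
    mesh.lods_generated = 6; // ~6 LOD levels per mesh, as in this scene
}

// Fan each mesh out to its own task; with ~800 independent meshes this
// workload is close to embarrassingly parallel.
static void generate_all_lods(std::vector<MeshData> &meshes) {
    std::vector<std::future<void>> tasks;
    tasks.reserve(meshes.size());
    for (MeshData &m : meshes) {
        tasks.push_back(std::async(std::launch::async, generate_lods_for_mesh, std::ref(m)));
    }
    for (std::future<void> &t : tasks) {
        t.get(); // wait for completion and propagate exceptions
    }
}
```

A real patch would bound concurrency with a worker pool (Godot has WorkerThreadPool for this) rather than launching one task per mesh, but the data dependency structure is the same.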

I would not advise trying to optimize the internals of meshopt_simplify (trust me...). Some small future performance improvements are planned here in meshoptimizer but largely speaking unless this runs into some edge case, which it doesn't look like it does to me, it should be very well tuned already. Same for Embree - I would assume it's impractical to optimize that to the degree that is relevant here. However:

  3. I would certainly think of, at the minimum, reducing the amount of requested work from both meshopt_simplify and Embree here. Notably, meshopt_simplify is called approximately 6 times per mesh here and is asked to generate larger and larger meshes. Because of this, it does more or less the same amount of work each time: simplifying the mesh 2x is almost the same effort as simplifying the mesh 10x (... well, not quite, but it gets there quickly). However, in LOD chain generation you can usually generate the LODs in the opposite direction: start by requesting a ~1.5x smaller mesh; if that target is reached, ask for a ~1.5x smaller mesh again, etc. I don't recall why the order here is reversed, but I would consider flipping it and simplifying from the last LOD. I don't think that's going to reduce the work here 6x, but I would expect something like a 3-4x improvement in the cost to call simplify.

  4. In a similar vein, casting 16-64 rays per triangle is a lot, especially for higher levels of detail. I would probably reduce this in general, or at least scale it down as the LOD levels get closer to the original mesh: in the limit, we're casting at least 16 rays per triangle here for something that only has 1.5x fewer triangles than the original mesh, and that just feels wasteful. This has a risk of reducing the quality of the resulting normals because there's a higher chance of missing the mesh or hitting a wrong triangle. Maybe ray casts here aren't the right fit, and averaging triangle normals from triangles that are in a bounding sphere of the generated triangle is better, but this brings me to my final point:

  5. We've already discussed this at some point in another issue, but overall I'm not 100% sure the current normal processing in the importer for LODs is generally beneficial. With the normal-aware simplifier and the recent fixes, generally speaking I'd expect decent normals to come out of the simplifier itself. Sometimes that's not the case, but I'm not sure the ray cast logic is perfect either, and it's just a lot of complexity to always keep in mind. I do think the reindexing that happens in this code is beneficial for some faceted meshes, though. So a good use of time would be to introduce an option for normal reprojection that disables the ray-cast-based normal recreation (I'd expect that alone cuts half of the overhead of LOD generation here), test the option in a release, then maybe default it to skip the normal recreation and see if this comes up.
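The reversed LOD chain idea can be sketched as follows. `simplify` here is a trivial stand-in for meshopt_simplify (the real API also takes vertex data, an error target, and options); only the call ordering and the shrinking targets are the point:

```cpp
#include <cstddef>
#include <vector>

using Indices = std::vector<unsigned int>;

// Trivial stand-in for meshopt_simplify: only models "give me a mesh
// with at most target_index_count indices".
static Indices simplify(const Indices &src, size_t target_index_count) {
    Indices dst = src;
    if (dst.size() > target_index_count) {
        dst.resize(target_index_count - target_index_count % 3); // whole triangles
    }
    return dst;
}

// Build the LOD chain forward: each level is simplified from the
// previous (already smaller) level with a ~1.5x reduction target,
// instead of re-simplifying the full-resolution mesh for every level.
static std::vector<Indices> build_lod_chain(const Indices &base, int levels) {
    std::vector<Indices> lods;
    const Indices *prev = &base;
    for (int i = 0; i < levels; i++) {
        size_t target = prev->size() * 2 / 3; // ~1.5x fewer indices per step
        target -= target % 3;
        if (target < 3) {
            break;
        }
        lods.push_back(simplify(*prev, target));
        prev = &lods.back();
    }
    return lods;
}
```

With a forward chain like this, each simplify call only has to remove about a third of the remaining triangles rather than processing the full-resolution mesh, which is where the estimated 3-4x cost reduction above would come from.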

Hopefully this is helpful :) I would be happy to discuss (3)/(5) further and/or maybe contribute a patch or two as I'm generally interested in making sure simplification integration is working well for Godot; I'll leave 1/2/4 to others if they are motivated to work on this.

@zeux
Contributor

zeux commented Jun 28, 2024

On "I'm not 100% sure the current normal processing in the importer for LODs is generally beneficial", I decided to do a quick comparison on the scene from this file. It turns out it's easy to disable the normal override; basically, one just needs to disable the ray caster creation (as mentioned earlier, I believe the current splitting logic to be generally beneficial for faceted meshes). I then looked at a few low LODs (where the risk of picking a bad normal due to ray casts is maximized) by tuning the LOD bias to a very small value.

On the left (yes, left, I double checked!) is the import without using the raycaster. On the right is current master (raycaster enabled). Both levels are at ~2200 triangles. I see somewhat similar issues on a few other models - this is not universal, this happened to be the first model I checked, and some models from this scene look about the same with or without the raycaster enabled. But this to me is strong evidence that raycaster should be optional, and probably opt-in.

[screenshots: import without the raycaster (left) vs. current master with the raycaster (right), both at ~2200 triangles]

I've switched to using a smaller version of the scene from the original post (that one has 800 meshes, but each mesh is duplicated 8 times; the deduplicated version has only 100 meshes, which is easier to work with and faster to reimport). Reimport takes 37 seconds on master and 22 seconds without the raycaster enabled.

@Sluggernot

Wow, well that is surprising. Are there any examples where the raycaster was better in visual fidelity? (I understand that's somewhat subjective, but your screenshot above feels fairly objective as to which is "better.")
I've been diving further into this section of code throughout the day, attempting to rally myself before trying multithreading. I really appreciate your write-up. This is absolutely great to see!

@fire
Member

fire commented Jun 28, 2024

As someone who works on this, I support changes that improve quality and performance. I can review and help test.

@Saul2022

After trying to import this glb file on an S23+ (mobile), it ends up crashing after some time, so this does not look to be the CoW's fault. I used #93064, as it is the fastest when loading big projects, along with the other PR; it still crashes on reimport.

@fire
Member

fire commented Aug 20, 2024

@Saul2022 does it also crash on your PC?

Edited:

I would expect something like 10-20 gigabytes of CPU RAM to be used, too.

@Saul2022

Saul2022 commented Aug 20, 2024

@Saul2022 does it also crash on your PC?

Can't test on PC, sorry; it's dead. Only a black screen, even though the power light works, so probably a screen issue.

Edit: Also tried without LODs or shadow mesh, with light baking enabled, by adjusting it in the import defaults, and it still crashes, so it's not the LODs.

@anderlli0053

anderlli0053 commented Aug 26, 2024

I've tried this with v4.3.stable.official [77dcf97] and this is the resulting Godot memory crash dump:

godot.exe.14296.zip

My specs:

specs

@fire
Member

fire commented Aug 26, 2024

I suspect that developers loading that 3D asset require more than 16 GB of RAM.

We can check how big the difference is. If the requirement is closer to 32 GB, then it's a lot harder to meet than something like 18 GB.

Godot Engine 4.3-stable

Edited:

I'll try to get a cpu usage chart via samply or https://superluminal.eu/ using a custom build of 4.3-stable

Edited:

  1. Download https://drive.google.com/file/d/1f74-29422AmZQJohng74ySdELGJptgSA/view?usp=sharing
  2. Apple M2 Pro with 32GB of ram.
  3. curl --proto '=https' --tlsv1.2 -LsSf https://github.com/mstange/samply/releases/download/samply-v0.12.0/samply-installer.sh | sh
  4. scons production=yes debug_symbols=yes @ https://github.com/godotengine/godot/releases/tag/4.3-stable
  5. ./bin/godot.macos.editor.arm64 #create a new-game-project
  6. rm -rf ~/Documents/new-game-project/.godot
  7. samply record ./bin/godot.macos.editor.arm64 -e --path ~/Documents/new-game-project/
  8. Drag asset gltf file.
  9. Open asset gltf file as a scene.
  10. Firefox Profiler with stack traces! https://share.firefox.dev/3AH8zLh
  11. I saw around 19 GB of max usage, but I don't have logging.

Godot Engine master

Edited:

  1. Download https://drive.google.com/file/d/1f74-29422AmZQJohng74ySdELGJptgSA/view?usp=sharing
  2. Apple M2 Pro with 32GB of ram.
  3. curl --proto '=https' --tlsv1.2 -LsSf https://github.com/mstange/samply/releases/download/samply-v0.12.0/samply-installer.sh | sh
  4. scons production=yes debug_symbols=yes @ db76de5
  5. ./bin/godot.macos.editor.arm64 #create a new-game-project
  6. rm -rf ~/Documents/new-game-project/.godot
  7. samply record ./bin/godot.macos.editor.arm64 -e --path ~/Documents/new-game-project/
  8. Drag asset gltf file.
  9. Open asset gltf file as a scene.
  10. Around 18 GB of max ram usage during import
  11. Around 9 GB when using the internal Godot Engine formats and loading the 3D asset in the editor.
  12. Firefox Profiler with stack traces! https://share.firefox.dev/3X4zE2k

Notes

  1. Disable normal raycaster for LOD generation by default #93727 is expected to reduce memory usage.
  2. We may be able to optimize import runtime by making GLTFDocument::_parse_image_save_image parallel @ db76de5

@fire
Member

fire commented Aug 26, 2024

What is the expected behaviour if we exceed the system's RAM (like 19 GB of usage on a 16 GB machine with 14 GB free)?

Edited:

Personally, I think requiring more RAM and crashing is expected on large datasets.

We can attempt to use less memory, but there will always be a dataset that exceeds a limit.

@Saul2022

We can attempt to use less memory, but there will always be a dataset that exceeds a limit.

Yeah, I guess. I tried with multithreaded import off, vsync disabled, and continuous update, but it still crashes. The image files did import, though, just not the glb scene. Maybe, to avoid crashing the engine, quit the import process before the crash happens and print an error message about there not being enough RAM to import the scene.
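One hedged way to implement that suggestion: estimate available memory up front and abort the import with an error instead of letting the process die. This sketch uses the Linux/glibc-specific sysconf(_SC_AVPHYS_PAGES); a real Godot patch would go through the OS abstraction, and the 25x-the-file-size factor is an assumption loosely based on the numbers reported in this thread (778 MB file, ~19 GB peak), not a measured constant.

```cpp
#include <cstdint>
#include <unistd.h>

// Estimate available physical memory. _SC_AVPHYS_PAGES is a glibc/Linux
// extension. Returns 0 if the value cannot be measured.
static uint64_t estimate_available_bytes() {
    long pages = sysconf(_SC_AVPHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    if (pages <= 0 || page_size <= 0) {
        return 0;
    }
    return static_cast<uint64_t>(pages) * static_cast<uint64_t>(page_size);
}

// Decide whether to even start the import. The default factor of 25 is
// an assumed heuristic, not a measured constant; dividing instead of
// multiplying avoids overflow for large files.
static bool can_afford_import(uint64_t file_size_bytes, uint64_t factor = 25) {
    uint64_t available = estimate_available_bytes();
    if (available == 0) {
        return true; // could not measure; don't block the user
    }
    return file_size_bytes <= available / factor;
}
```

The import could then print a clear "not enough RAM to import this scene" error and bail out early, rather than crashing partway through.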
