
TRT no results or totally wrong #3

Open
malfonsoNeoris opened this issue Aug 25, 2021 · 7 comments

Labels
bug (Something isn't working) · question (Further information is requested)

Comments

@malfonsoNeoris

Hi again.
After successfully training two models, mobilenet_256 and resnet18_256 (where 256 is the image size), I am now starting the process of validating and converting to ONNX and TRT.
Now I have two problems:

  • For resnet, I get the same mask (or almost the same) for all images. I will try to verify this is not a training problem; it doesn't happen with the mobilenet version. I will retrain both with the exact same dataset to verify.

If I continue the process:

  • Conversion to ONNX goes well. Same problem.
  • Modifying the ONNX for TRT goes well. I can't infer with this modified model; I don't know whether this is expected or not.
  • Conversion to TRT goes well; I have a .engine file.
    Now the second problem:
  • Inference with the .engine goes totally wrong. With mobilenet I get a completely wrong result, like a bbox over the whole image; the resnet model always returns an empty result.

To clarify, all 3 tests (tensorflow, onnx, and trt models) were done with the exact same images. The tf2 and onnx model results are the same.
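
For context, a minimal sketch of the kind of TF2-vs-ONNX comparison I mean (file names are placeholders, not the actual ones from my script):

    import numpy as np
    import onnxruntime as ort

    # Same preprocessed image batch that was fed to the TF2 model,
    # e.g. shape (1, 256, 256, 3), float32.
    batch = np.load("preprocessed_batch.npy")        # placeholder dump
    tf_detections = np.load("tf2_detections.npy")    # placeholder dump of the TF2 output

    sess = ort.InferenceSession("maskrcnn_resnet18_256.onnx")   # placeholder model path
    input_name = sess.get_inputs()[0].name
    onnx_outputs = sess.run(None, {input_name: batch})

    # If the export is fine, the ONNX detections match the TF2 ones closely.
    print("max abs diff:", np.abs(onnx_outputs[0] - tf_detections).max())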

Attached is a small script I created to test and convert (just copy-pasted from the ipynb, with some minor mods):
inference.zip

Can you give me some direction on where to look for these errors?
Thanks again!

@alexander-pv added the question (Further information is requested) and bug (Something isn't working) labels on Aug 25, 2021
@alexander-pv
Owner

Hi, @malfonsoNeoris,

Thank you for the code in the attachment. I'll study it a bit later and help you figure the problem out.
The modified .onnx graph is not valid for onnxruntime because its nodes are specially prepared for TensorRT.
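
If you want to see why, a quick sketch that lists the op types of the modified graph should show the TRT-specific nodes, which onnxruntime does not implement (the path below is a placeholder):

    from collections import Counter
    import onnx

    model = onnx.load("mrcnn_model_trt_mod.onnx")   # placeholder path to the modified graph
    print(Counter(node.op_type for node in model.graph.node))
    # Ops such as ProposalLayer_TRT / PyramidROIAlign_TRT / DetectionLayer_TRT
    # exist only as TensorRT plugins, so onnxruntime cannot run this graph.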

@malfonsoNeoris
Author

Hi Alexander, just an update.
For the first issue, I have retrained with the same dataset: mobilenet and resnet18/50, image size 256.
Mobilenet works like a charm; both resnets have the same problem: almost the same result for different images.

Would copying some image results help to understand the problem?

@xuatpham

Hi @alexander-pv , thanks for your effort.

I've successfully converted a trained tensorflow-model to ONNX and from ONNX to the modified_ONNX.

After that, converting from the modified_ONNX to TRT was successful as well.

But the TRT results seem very different from those of the original tensorflow_model.

Is that normal when converting to TRT?

Please advise or suggest how I can improve the TRT results, or point me to where I can look into and modify the modified_ONNX.

Hello @malfonsoNeoris, how are you doing? Were you able to get good results from TRT?

Once again, thanks all.

@alexander-pv
Owner

alexander-pv commented Sep 25, 2021

Hi, @malfonsoNeoris , @xuatpham

Sorry for the rather late answer.

I have trained several models with the balloon dataset and I can say that there is an error somewhere in the construction of the ONNX graph for TRT. Sometimes NaNs appear in the TensorRT model output.
At the moment, I have found and fixed an error in the data normalization and zero-padding configuration of the ONNX graph. The mAP increased a bit, but I continue to see periodic NaNs in the output of TRT models. I started to note repository changes here.

I plan to compare the subgraph outputs of the tensorflow/onnx models with the tensorrt-optimized version. It is highly likely that this will make it possible to find the location of the problem in the modified graph.

@xuatpham, you can open ./src/common/inference_optimize.py. This is where I gather all the functions for working with the ONNX graph; the modify_onnx_model function prepares the ONNX model for TensorRT. You can experiment with the graph modification function, or generate subgraphs, optimize them with TensorRT, and check for differences in the outputs against the original model.
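
As a rough sketch of the subgraph comparison I mean (paths and tensor names below are placeholders; the real ones can be read from the graph, e.g. in Netron):

    import numpy as np
    import onnx.utils
    import onnxruntime as ort

    # Cut a subgraph out of the original (unmodified) ONNX model between two tensors.
    onnx.utils.extract_model("mrcnn_model.onnx", "backbone_subgraph.onnx",
                             input_names=["input_image"],       # placeholder tensor name
                             output_names=["fpn_p2"])           # placeholder tensor name

    sess = ort.InferenceSession("backbone_subgraph.onnx")
    ref = sess.run(None, {"input_image": np.load("preprocessed_batch.npy")})[0]
    # Build a TensorRT engine from the same subgraph (e.g. with trtexec) and compare
    # its output with `ref` to localize where the TRT results start to diverge.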

Also, please do not forget to update nvinfer_plugin, since the default mrcnn_config.h header of proposalLayerPlugin may differ from your Python model config.

An interesting fact is that for the efficientnet and mobilenet backbones the mAP drop is quite small.

@xuatpham


Thank you Alex, I will have a look at that.
Yes, I saw many NaN values when converting to TRT.

From my experiments, besides the results being quite different from the original, it seems like all the masks have been shifted in the same direction, so I guess there is probably a problem with a resize function.
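
For what it's worth, the shift I mean can be measured roughly like this (a sketch, assuming the TF and TRT masks of the same detection have been dumped to disk; file names are placeholders):

    import numpy as np

    def mask_centroid(mask):
        # Center of mass (row, col) of a binary / soft mask.
        ys, xs = np.nonzero(mask > 0.5)
        return ys.mean(), xs.mean()

    tf_mask = np.load("tf_mask_0.npy")     # placeholder dumps of one detection's mask
    trt_mask = np.load("trt_mask_0.npy")
    dy = mask_centroid(trt_mask)[0] - mask_centroid(tf_mask)[0]
    dx = mask_centroid(trt_mask)[1] - mask_centroid(tf_mask)[1]
    # A similar (dy, dx) across many detections points to a resize / padding offset.
    print(f"mask offset: dy={dy:.1f}, dx={dx:.1f}")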

Anyway, don't forget to let us know if you manage to fix the NaN values when converting to TRT. Thanks a lot.

@dk-chun

dk-chun commented Jan 6, 2022

Hi @alexander-pv. First of all, many thanks for your hard work.

I have a question about the TRT results, which look different from the TF and ONNX Runtime ones:

  1. Detection scores are different.
  2. Masks have comparatively incomplete shapes (they look a bit fuzzy).
  3. Some detections are missing.

I roughly guess this comes from implementation differences between the TF code and the TRT plugins (ProposalLayer_TRT, PyramidROIAlign_TRT, DetectionLayer_TRT).

Is there a way to get the same results without loss? Please give a comment. Thank you.

@alexander-pv
Owner

Hi, @dk-chun,

I am glad that you find the repo useful.
AFAIK, TRT plugins were written based on the original matterport model implementation.
I believe there are two points that lead to the distorted results in TRT.

First, the ONNX graph modification for TRT porting that happens in the modify_onnx_model function may contain mistakes. I recently found wrong zero-padding node modifications and will push changes to the maskrcnn_tf2.5 develop branch after some tests, ASAP. The first experiments show results closer to the TF & ONNX models.
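
If you want to double-check this on your own model, a small sketch with onnx_graphsurgeon that prints the zero-padding nodes before/after running modify_onnx_model (the path is a placeholder):

    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("mrcnn_model.onnx"))   # placeholder path
    for node in graph.nodes:
        if node.op == "Pad":
            # Compare these settings between the original and the TRT-modified graph;
            # a wrong padding layout shifts feature maps and, in turn, boxes and masks.
            print(node.name, node.attrs, [t.name for t in node.inputs])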

Second, nvinfer_plugin should be recompiled according to the customized model config. Otherwise, the TRT plugins may indeed work incorrectly, or segmentation fault errors can occur.
