Baseline Low Performance #2

Open
yyuncong opened this issue May 28, 2024 · 11 comments

@yyuncong

yyuncong commented May 28, 2024

Hi,

Thank you for the great work and the tremendous effort of open-sourcing the baselines!

I tried the VLM baseline and noticed that the model performance is far lower than the reported results (closer to random selection). This is quite confusing because I did not modify the codebase (other than the data path).

Could you help me double-check whether the baselines are functioning properly? I am also actively investigating whether this is caused by my local environment. Thank you for your help!

@allenzren
Collaborator

allenzren commented May 30, 2024

Hi Yuncong, thanks for checking out our work!

This is my bad: when we ran the experiments, we were using a different version of the Prismatic VLM, since the official repo had not been released yet. It was a 13B checkpoint, and I am not sure whether that specific one was released in the end. If you would like to improve the performance, I would suggest trying a different checkpoint from their repo and also checking that the weighting parameters for the semantic values make sense (i.e., that the semantic values look reasonable). You could also look into other, newer VLMs.
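
For reference, loading a different checkpoint with the official prismatic-vlms package looks roughly like the sketch below. The model ID is just an example from their model zoo, not necessarily the checkpoint we used, and gated LM backbones (e.g., Llama-2) additionally need an HF access token.

```python
import torch
from PIL import Image
from prismatic import load

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Example model ID from the prismatic-vlms model zoo; swap in any released one.
vlm = load("prism-dinosiglip+13b")
vlm.to(device, dtype=torch.bfloat16)

image = Image.open("view.png").convert("RGB")

# Build a single-turn prompt and query the VLM.
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message="Is there a sofa in this room?")

answer = vlm.generate(
    image,
    prompt_builder.get_prompt(),
    do_sample=False,
    max_new_tokens=16,
)
print(answer)
```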

@allenzren
Collaborator

Do you notice whether the question answering is particularly bad, or whether the semantic exploration does not work?

@yusirhhh

yusirhhh commented Jun 13, 2024

The evaluation results are summarized below. Notably, without any adjustments to the codebase, the achieved metrics were substantially lower than those reported in the paper:

Total cases: 500
Successful cases (weighted): 118
Successful cases (max): 120
Success rate (weighted): 23.60%
Success rate (max): 24.00%

@allenzren @yyuncong

@yyuncong
Author

> The evaluation results are summarized below. Notably, without any adjustments to the codebase, the achieved metrics were substantially lower than those reported in the paper:
>
> Total cases: 500
> Successful cases (weighted): 118
> Successful cases (max): 120
> Success rate (weighted): 23.60%
> Success rate (max): 24.00%
>
> @allenzren @yyuncong

Thank you for summarizing the evaluation results! Given that the questions are all multiple choice with at most four options, random guessing would already score about 25%, so these results suggest that the current pipeline barely helps question answering?

@allenzren
Collaborator

Hi @yyuncong @yusirhhh, thanks for looking into this! I think something is off if the success rate is not even above 25%: even if the exploration is not working as well as in the original experiments, the question answering should not be that bad if the VLM is functioning. I can look into this this weekend if that helps.

@yusirhhh

@allenzren @yyuncong I am troubleshooting this issue and would appreciate it if you could provide the images corresponding to the questions from your experiments. This would help me investigate the VQA performance and determine whether the low accuracy is due to the VLM's VQA capabilities.
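
In the meantime, one way to isolate VQA capability from exploration is to run the VLM directly on saved views paired with the benchmark questions and score multiple-choice accuracy. A minimal sketch, assuming a CSV with hypothetical "image", "question", "choices", and "answer" columns, and any `answer_fn(image, prompt)` wrapper around the VLM's generate call:

```python
import ast
import csv

from PIL import Image

def build_mc_prompt(question, choices):
    """Format a question with lettered options for the VLM."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def evaluate_vqa(csv_path, answer_fn):
    """Score multiple-choice accuracy over (image, question) pairs.

    answer_fn(image, prompt) -> str is any wrapper around the VLM.
    """
    total = correct = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            image = Image.open(row["image"]).convert("RGB")
            choices = ast.literal_eval(row["choices"])  # list stored as a string
            pred = answer_fn(image, build_mc_prompt(row["question"], choices))
            correct += pred.strip().upper().startswith(row["answer"].strip().upper())
            total += 1
    return correct / max(total, 1)
```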

@yusirhhh

@allenzren When I sample views from the scene, I find that the "../hm3dsem/topdown" folder is missing. Could you please tell me how to generate the top-down files?

@allenzren
Collaborator

@yusirhhh I added the script that I used for getting the topdown views. I literally went to the HM3D website and downloaded the topdown views there (example); they are not very high resolution, and I did not use them to generate questions.
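
If you would rather generate navigable-area topdown maps locally, habitat-sim's pathfinder can rasterize one from the navmesh. A rough sketch, assuming the scene's navmesh is available; the scene path is a placeholder:

```python
import habitat_sim
import numpy as np
from PIL import Image

# Scene path is a placeholder; point it at an HM3D scene you have locally.
backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = "data/hm3d/example-scene.basis.glb"
agent_cfg = habitat_sim.agent.AgentConfiguration()
sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

# Slice the navmesh at the floor height and rasterize the navigable area.
floor_height = sim.pathfinder.get_bounds()[0][1]
topdown = sim.pathfinder.get_topdown_view(meters_per_pixel=0.05, height=floor_height)
Image.fromarray((topdown * 255).astype(np.uint8)).save("topdown.png")
```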

@yyuncong
Author

> Hi @yyuncong @yusirhhh, thanks for looking into this! I think something is off if the success rate is not even above 25%: even if the exploration is not working as well as in the original experiments, the question answering should not be that bad if the VLM is functioning. I can look into this this weekend if that helps.

Hi! I would like to follow up on the performance issue. Have there been any updates or progress on this matter? Thank you for your help!

@dunkegg

dunkegg commented Jul 1, 2024

@yusirhhh Hi. Did you use the stopping criterion from the paper when computing the success rate? I also found that the stopping criterion's value resulted in a low success rate.

@allenzren
Collaborator

Hi everyone, there was a bug in loading the questions from the CSV, and it is fixed in 18381da. The choices for each question were initially loaded as a single string, so they were not parsed into the four separate choices correctly.
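
Concretely, the failure mode was of this kind; a simplified sketch (not the literal diff, and the file/column names are illustrative):

```python
import ast
import pandas as pd

# File and column names are illustrative, not the literal ones in the repo.
df = pd.read_csv("questions.csv")
raw = df.loc[0, "choices"]  # e.g. the string "['sofa', 'bed', 'chair', 'table']"

# Buggy behavior: treating the raw string as the choices means indexing it
# yields single characters, so the prompt never contains the four options.
# Fix: parse the string back into an actual list of four choices.
choices = ast.literal_eval(raw)
assert isinstance(choices, list) and len(choices) == 4
```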
