Hi! I was trying to replicate the CLIP Text-Image Direction Similarity results from the paper.
Here is how I tried to do it:
1. I trained a NeRF on the bear example (resolution: 497*369):
ns-train nerfacto --data data/bear_resized/
2. Edited the NeRF using in2n:
ns-train in2n --data data/bear_resized/ --load-dir outputs/bear_resized/nerfacto/2024-03-07_111958/nerfstudio_models/ --pipeline.prompt "Turn the bear into a grizzly bear" --pipeline.guidance-scale 6.5 --pipeline.image-guidance-scale 1.5 --max-num-iterations 4000
3. Exported 172 views from each: 86 from training view angles and 86 from novel views. I did this by manually setting each training view as a keyframe, setting the FPS to 2 and the transition length to 1 s in ns-viewer, and generating camera paths from those keyframes. Then I ran render commands to produce 172 view images from both the unedited and the edited NeRF:
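Roughly the following (the config paths below are placeholders, and the exact ns-render flags may differ depending on the nerfstudio version):
ns-render camera-path --load-config outputs/bear_resized/nerfacto/<timestamp>/config.yml --camera-path-filename camera_path.json --output-path renders/bear_resized/ --output-format images
ns-render camera-path --load-config outputs/bear_resized/in2n/<timestamp>/config.yml --camera-path-filename camera_path.json --output-path renders/grizzly_bear/ --output-format images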
These exports are 1920*1080.
4. Using the ClipSimilarity module you provided, I compared each unedited view to its corresponding edited view. I used these captions: "a statue of a bear" and "a grizzly bear". Then I calculated the mean sim_direction over the 172 views. My code looks like this:
## clip_metrics.py code above
# imports used by the rest of the script
import os
from argparse import ArgumentParser
from pathlib import Path
import numpy as np
import torch
from PIL import Image

def get_file_names(directory, extension):
    """Fetch all file names with a specific extension from the given directory."""
    return [file for file in os.listdir(directory)
            if file.endswith(extension) and os.path.isfile(os.path.join(directory, file))]

def read_images(images_dir, extension="png"):
    """Reads image files from the given directory and converts them into tensors."""
    file_names = get_file_names(images_dir, extension)
    image_paths = [os.path.join(images_dir, file_name) for file_name in file_names]
    images = [Image.open(image_path).convert("RGB") for image_path in image_paths]
    # Changing the array shape from [h,w,c] to [1,c,w,h]
    images = [torch.Tensor(np.array(image).T[None, :, :, :]) for image in images]
    return images

def main():
    # Read and parse arguments
    parser = ArgumentParser()
    parser.add_argument("--original-dir", required=True, type=str)
    parser.add_argument("--edited-dir", required=True, type=str)
    parser.add_argument("--original-caption", required=True, type=str)
    parser.add_argument("--edited-caption", required=True, type=str)
    parser.add_argument("--seed", default=42, type=int)
    args = parser.parse_args()
    torch.manual_seed(args.seed)

    original_dir = Path(args.original_dir)
    edited_dir = Path(args.edited_dir)
    original_caption = args.original_caption
    edited_caption = args.edited_caption

    # Load original and edited views as tensors
    original_views = read_images(original_dir, "jpg")
    edited_views = read_images(edited_dir, "jpg")

    clip_similarity = ClipSimilarity()
    sim_dirs = []
    # Calculate CLIP Direction Similarity for each original/edited image pair
    for i in range(len(original_views)):
        sim_0, sim_1, sim_direction, sim_image = clip_similarity(
            original_views[i], edited_views[i], original_caption, edited_caption
        )
        print(float(sim_direction))
        sim_dirs.append(float(sim_direction))

    # Print mean directional similarity
    print(np.mean(sim_dirs))

if __name__ == "__main__":
    main()
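For reference, this is the quantity I believe sim_direction corresponds to (just a sketch of my understanding; the helper name and arguments below are made up for illustration, the actual computation is inside the ClipSimilarity module):

import torch.nn.functional as F

def direction_similarity(img_feat_orig, img_feat_edit, txt_feat_orig, txt_feat_edit):
    # Cosine similarity between the change in CLIP image embeddings
    # (original view -> edited view) and the change in CLIP text embeddings
    # ("a statue of a bear" -> "a grizzly bear"); inputs are [batch, dim] tensors.
    return F.cosine_similarity(img_feat_edit - img_feat_orig, txt_feat_edit - txt_feat_orig)

Please correct me if that's not the score reported in the paper.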
I ran the script like this:
python metrics/clip_metrics.py --original-dir renders/bear_resized/images/2024-03-08-15-56-50-extra/ --edited-dir renders/grizzly_bear/images/2024-03-08-15-56-50/ --original-caption "a statue of a bear" --edited-caption "a grizzly bear"
Result: 0.04, which is significantly lower than the 0.16 reported in the paper.
I then trained the edit for longer, up to 30k steps. Result: 0.0095, which is even lower. The edited NeRF also looks worse for some reason, so it makes sense that the mean CLIP Direction Similarity drops.
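(That was the same in2n command as in step 2, just with a longer schedule, something like:)
ns-train in2n --data data/bear_resized/ --load-dir outputs/bear_resized/nerfacto/2024-03-07_111958/nerfstudio_models/ --pipeline.prompt "Turn the bear into a grizzly bear" --pipeline.guidance-scale 6.5 --pipeline.image-guidance-scale 1.5 --max-num-iterations 30000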
I read in the paper that you made 10 edits across 2 scenes for the quantitative evaluation, so maybe the mean scores for the other edits were better. But before trying this on another scene with different edits, I wanted to ask whether I'm on the right track: is this how you calculated these scores, or am I doing something wrong?