
How to evaluate CLIP Text-Image Direction Similarity for edit results? #95

umutyazgan opened this issue Mar 9, 2024 · 0 comments

Hi! I was trying to replicate these CLIP Text-Image Direction Similarity results from the paper:
[screenshot: CLIP Text-Image Direction Similarity results table from the paper]
Here is how I tried to do it:

1. I trained a NeRF on the bear example (resolution: 497*369):
ns-train nerfacto --data data/bear_resized/
2. Edited the NeRF using in2n:
ns-train in2n --data data/bear_resized/ --load-dir outputs/bear_resized/nerfacto/2024-03-07_111958/nerfstudio_models/ --pipeline.prompt "Turn the bear into a grizzly bear" --pipeline.guidance-scale 6.5 --pipeline.image-guidance-scale 1.5 --max-num-iterations 4000
3. Exported 172 views from each NeRF: 86 from the training view angles and 86 from novel views. I did this by manually setting each training view as a keyframe in ns-viewer, with the FPS set to 2 and the transition length to 1 s, to generate the camera path. Then I ran these commands to render the 172 views from both the unedited and the edited NeRF:
ns-render camera-path --load-config outputs/bear_resized/nerfacto/2024-03-07_111958/config.yml --camera-path-filename data/bear_resized/camera_paths/2024-03-08-15-56-50.json --output-format images --output-path renders/bear_resized/images/2024-03-08-15-56-50-extra/
ns-render camera-path --load-config outputs/bear_resized/in2n/2024-03-07_114733/config.yml --camera-path-filename data/bear_resized/camera_paths/2024-03-08-15-56-50.json --output-format images --output-path renders/grizzly_bear/images/2024-03-08-15-56-50/

These exports are 1920*1080.
4. Using the ClipSimilarity module you provided, I compared each unedited view to its corresponding edited view, using the captions "a statue of a bear" and "a grizzly bear", and then calculated the mean sim_direction over the 172 views (my understanding of what sim_direction measures is sketched after the results below). My code looks like this:

## clip_metrics.py code above (ClipSimilarity etc.)

import os
from argparse import ArgumentParser
from pathlib import Path

import numpy as np
import torch
from PIL import Image

def get_file_names(directory, extension):
    """Fetch all file names with a specific extension from the given directory."""
    # Sort so that the original and edited renders pair up by frame index.
    return sorted(file for file in os.listdir(directory) if file.endswith(extension) and os.path.isfile(os.path.join(directory, file)))

def read_images(images_dir, extension="png"):
    """Read image files from the given directory and convert them into tensors."""
    file_names = get_file_names(images_dir, extension)
    image_paths = [os.path.join(images_dir, file_name) for file_name in file_names]
    images = [Image.open(image_path).convert("RGB") for image_path in image_paths]
    # Changing the array shape from [h,w,c] to [1,c,w,h]
    images = [torch.Tensor(np.array(image).T[None,:,:,:]) for image in images]
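    # (Note: the pixel values stay in the 0-255 range here, and H/W end up swapped relative
    # to the usual [1,c,h,w] layout; I'm assuming ClipSimilarity's own preprocessing is fine
    # with this, but flagging it in case it matters.)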
    return images

def main():
    # Read and parse arguments
    parser = ArgumentParser()
    parser.add_argument("--original-dir", required=True, type=str)
    parser.add_argument("--edited-dir", required=True, type=str)
    parser.add_argument("--original-caption", required=True, type=str)
    parser.add_argument("--edited-caption", required=True, type=str)
    parser.add_argument("--seed", default=42, type=int)
    args = parser.parse_args()
    torch.manual_seed(args.seed)
    original_dir = Path(args.original_dir)
    edited_dir = Path(args.edited_dir)
    original_caption = args.original_caption
    edited_caption = args.edited_caption
    # Load original and edited views as tensors
    original_views = read_images(original_dir, "jpg")
    edited_views = read_images(edited_dir, "jpg")
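    # The pairing below assumes both render directories contain identically named frames,
    # so that the sorted file lists line up original/edited views one-to-one.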
    clip_similarity = ClipSimilarity()
    sim_dirs = []
    # calculate CLIP Direction Similarity for each original/edited image pair
    for i in range(len(original_views)):
        sim_0, sim_1, sim_direction, sim_image = clip_similarity(
            original_views[i], edited_views[i], original_caption, edited_caption
        )
        print(float(sim_direction))
        sim_dirs.append(float(sim_direction))
    # Print mean directional similarity
    print(np.mean(sim_dirs))

if __name__ == "__main__":
    main()

I ran the script like this:

python metrics/clip_metrics.py --original-dir renders/bear_resized/images/2024-03-08-15-56-50-extra/ --edited-dir renders/grizzly_bear/images/2024-03-08-15-56-50/ --original-caption "a statue of a bear" --edited-caption "a grizzly bear"
5. Result: 0.04, which is significantly lower than the 0.16 reported in the paper.
6. I trained the edit for longer, up to 30k iterations. Result: 0.0095, even lower. The edited NeRF also looks worse for some reason, so it makes sense that the mean CLIP Direction Similarity is lower.
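
For reference, here is my understanding of what sim_direction measures, as a minimal sketch. I'm assuming the ClipSimilarity forward pass matches the instruct-pix2pix metrics code, i.e. the cosine similarity between the image-space edit direction and the text-space edit direction, computed on L2-normalized CLIP embeddings; the function and variable names below are just placeholders:

import torch
import torch.nn.functional as F

def directional_similarity(img_feat_orig, img_feat_edit, txt_feat_orig, txt_feat_edit):
    """Cosine similarity between the image edit direction and the text edit direction.

    All inputs are assumed to be L2-normalized CLIP embeddings of shape [1, d].
    """
    return F.cosine_similarity(img_feat_edit - img_feat_orig, txt_feat_edit - txt_feat_orig)

# Toy call with random unit vectors, just to show the shapes involved.
d = 512
feats = [F.normalize(torch.randn(1, d), dim=-1) for _ in range(4)]
print(float(directional_similarity(*feats)))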

I read in the paper that you made 10 edits across 2 scenes for the quantitative evaluation, so maybe the mean scores for the other examples were better. But before trying this on another scene with different edits, I wanted to ask whether I'm on the right track: is this how you calculated these scores, or am I doing something wrong?
