
How to evaluate CLIP Text-Image Direction Similarity for edit results? #95

umutyazgan opened this issue Mar 9, 2024 · 0 comments

Hi! I was trying to replicate these CLIP Text-Image Direction Similarity results from the paper:
[screenshot: CLIP Text-Image Direction Similarity results table from the paper]
Here is how I tried to do it:

1. I trained a NeRF on the bear example (resolution: 497*369):
ns-train nerfacto --data data/bear_resized/
2. Edited the NeRF using in2n:
ns-train in2n --data data/bear_resized/ --load-dir outputs/bear_resized/nerfacto/2024-03-07_111958/nerfstudio_models/ --pipeline.prompt "Turn the bear into a grizzly bear" --pipeline.guidance-scale 6.5 --pipeline.image-guidance-scale 1.5 --max-num-iterations 4000
3. Exported 172 views from each NeRF: 86 from the training view angles and 86 from novel views. I did this by manually setting each training view as a keyframe in ns-viewer, with the FPS set to 2 and the transition length to 1 s, to generate the camera path. Then I ran these commands to render the 172 views from both the unedited and the edited NeRF:
ns-render camera-path --load-config outputs/bear_resized/nerfacto/2024-03-07_111958/config.yml --camera-path-filename data/bear_resized/camera_paths/2024-03-08-15-56-50.json --output-format images --output-path renders/bear_resized/images/2024-03-08-15-56-50-extra/
ns-render camera-path --load-config outputs/bear_resized/in2n/2024-03-07_114733/config.yml --camera-path-filename data/bear_resized/camera_paths/2024-03-08-15-56-50.json --output-format images --output-path renders/grizzly_bear/images/2024-03-08-15-56-50/

These exports are 1920*1080.
4. Using the ClipSimilarity module you provided, I compared each unedited view to its corresponding edited view, using the captions "a statue of a bear" and "a grizzly bear", and then calculated the mean sim_direction over the 172 views (my understanding of what sim_direction measures is sketched after the results below). My code looks like this:

## clip_metrics.py code above (ClipSimilarity etc.)

import os
from argparse import ArgumentParser
from pathlib import Path

import numpy as np
import torch
from PIL import Image

def get_file_names(directory, extension):
    """Fetch all file names with a specific extension from the given directory."""
    # Sort so that the original and edited renders pair up by frame index.
    return sorted(file for file in os.listdir(directory) if file.endswith(extension) and os.path.isfile(os.path.join(directory, file)))

def read_images(images_dir, extension="png"):
    """Read image files from the given directory and convert them into tensors."""
    file_names = get_file_names(images_dir, extension)
    image_paths = [os.path.join(images_dir, file_name) for file_name in file_names]
    images = [Image.open(image_path).convert("RGB") for image_path in image_paths]
    # Changing the array shape from [h,w,c] to [1,c,w,h]
    images = [torch.Tensor(np.array(image).T[None,:,:,:]) for image in images]
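    # (Note: the pixel values stay in the 0-255 range here, and H/W end up swapped relative
    # to the usual [1,c,h,w] layout; I'm assuming ClipSimilarity's own preprocessing is fine
    # with this, but flagging it in case it matters.)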
    return images

def main():
    # Read and parse arguments
    parser = ArgumentParser()
    parser.add_argument("--original-dir", required=True, type=str)
    parser.add_argument("--edited-dir", required=True, type=str)
    parser.add_argument("--original-caption", required=True, type=str)
    parser.add_argument("--edited-caption", required=True, type=str)
    parser.add_argument("--seed", default=42, type=int)
    args = parser.parse_args()
    torch.manual_seed(args.seed)
    original_dir = Path(args.original_dir)
    edited_dir = Path(args.edited_dir)
    original_caption = args.original_caption
    edited_caption = args.edited_caption
    # Load original and edited views as tensors
    original_views = read_images(original_dir, "jpg")
    edited_views = read_images(edited_dir, "jpg")
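    # The pairing below assumes both render directories contain identically named frames,
    # so that the sorted file lists line up original/edited views one-to-one.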
    clip_similarity = ClipSimilarity()
    sim_dirs = []
    # calculate CLIP Direction Similarity for each original/edited image pair
    for i in range(len(original_views)):
        sim_0, sim_1, sim_direction, sim_image = clip_similarity(
            original_views[i], edited_views[i], original_caption, edited_caption
        )
        print(float(sim_direction))
        sim_dirs.append(float(sim_direction))
    # Print mean directional similarity
    print(np.mean(sim_dirs))

if __name__ == "__main__":
    main()

I ran the script like this:

python metrics/clip_metrics.py --original-dir renders/bear_resized/images/2024-03-08-15-56-50-extra/ --edited-dir renders/grizzly_bear/images/2024-03-08-15-56-50/ --original-caption "a statue of a bear" --edited-caption "a grizzly bear"
5. Result: 0.04, which is significantly lower than the 0.16 reported in the paper.
6. I trained the edit for longer, up to 30k iterations. Result: 0.0095, even lower. The edited NeRF also looks worse for some reason, so it makes sense that the mean CLIP Direction Similarity is lower.
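
For reference, here is my understanding of what sim_direction measures, as a minimal sketch. I'm assuming the ClipSimilarity forward pass matches the instruct-pix2pix metrics code, i.e. the cosine similarity between the image-space edit direction and the text-space edit direction, computed on L2-normalized CLIP embeddings; the function and variable names below are just placeholders:

import torch
import torch.nn.functional as F

def directional_similarity(img_feat_orig, img_feat_edit, txt_feat_orig, txt_feat_edit):
    """Cosine similarity between the image edit direction and the text edit direction.

    All inputs are assumed to be L2-normalized CLIP embeddings of shape [1, d].
    """
    return F.cosine_similarity(img_feat_edit - img_feat_orig, txt_feat_edit - txt_feat_orig)

# Toy call with random unit vectors, just to show the shapes involved.
d = 512
feats = [F.normalize(torch.randn(1, d), dim=-1) for _ in range(4)]
print(float(directional_similarity(*feats)))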

I read in the paper that you made 10 edits across 2 scenes for the quantitative evaluation, so maybe the mean scores for the other examples were better. But before trying this on another scene with different edits, I wanted to ask whether I'm on the right track: is this how you calculated these scores, or am I doing something wrong?
