Improve textual inversion compatibility #10949


Open · wants to merge 7 commits into main

Conversation

suzukimain
Contributor

What does this PR do?

Fixes #10373

This PR fixes the incompatibility of textual inversion embeddings across different SD versions, such as SD 1.5 and SD 2.1.

Example:

!pip install git+https://github.com/suzukimain/diffusers.git@textual_inversion

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import hf_hub_download

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16).to("cuda")

path = hf_hub_download(repo_id="gsdf/EasyNegative", filename="EasyNegative.safetensors", repo_type="dataset")

pipe.load_textual_inversion(path, token="EasyNegative")

Additionally, if you find any mistakes, please feel free to let me know.


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 DN6 requested a review from sayakpaul March 11, 2025 10:56
Comment on lines +405 to +409
for i, embedding in enumerate(embeddings):
    if embedding.shape[-1] != expected_emb_dim:
        linear = nn.Linear(embedding.shape[-1], expected_emb_dim)
        embeddings[i] = linear(embedding)
        logger.info(f"Changed embedding dimension from {embedding.shape[-1]} to {expected_emb_dim}")
Member

Do we want to add a test case to cover this?

Contributor Author

> Do we want to add a test case to cover this?

What should we do with this?

Contributor

Hi @suzukimain. The test would load an embedding into an incompatible model and check for the log "Changed embedding dimension...".
Also, do you have any example outputs to share?
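
A test along these lines might look like the following minimal sketch. This is a dependency-free stand-in, not diffusers' actual test helpers: `resize_embedding`, the logger name, and the pad-based resize are all hypothetical placeholders; only the pattern (capture logs, assert on the "Changed embedding dimension" message) mirrors the suggestion above.

```python
import logging
import unittest

logger = logging.getLogger("ti_resize_sketch")

def resize_embedding(embedding, expected_dim):
    # Hypothetical stand-in for the PR's resize step (the real code projects
    # through an nn.Linear); only the logged message matters for the test.
    if len(embedding) != expected_dim:
        logger.info("Changed embedding dimension from %d to %d", len(embedding), expected_dim)
        embedding = (embedding + [0.0] * expected_dim)[:expected_dim]
    return embedding

class ResizeLogTest(unittest.TestCase):
    def test_logs_dimension_change(self):
        # "Load" a 768-dim embedding into a model expecting 1024 dims and
        # assert that the log line appears.
        with self.assertLogs("ti_resize_sketch", level="INFO") as cm:
            out = resize_embedding([0.1] * 768, 1024)
        self.assertEqual(len(out), 1024)
        self.assertTrue(any("Changed embedding dimension from 768 to 1024" in m for m in cm.output))
```

In the real suite the same pattern would wrap `pipe.load_textual_inversion` with diffusers' log-capture utilities instead of this stub.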

Contributor Author

Hi @hlky, the following log is what I was able to get:

The loaded token: emb_params is overwritten by the passed token EasyNegative.
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Loaded textual inversion embedding for EasyNegative.
Loaded textual inversion embedding for EasyNegative_1.
Loaded textual inversion embedding for EasyNegative_2.
Loaded textual inversion embedding for EasyNegative_3.
Loaded textual inversion embedding for EasyNegative_4.
Loaded textual inversion embedding for EasyNegative_5.
Loaded textual inversion embedding for EasyNegative_6.
Loaded textual inversion embedding for EasyNegative_7.

Contributor Author

Hello. Do you need any other information?

Contributor

Hi @suzukimain, apologies for the delay, last week was the Diffusers team offsite.

> Changed embedding dimension from 768 to 1024

This text is what we would check for in the test, either just `Changed embedding dimension from` or including the original and new dimensions, depending on how the existing TI tests are set up. Would you like assistance adding the test? Happy to take over if needed.

Do you need any other information?

Example outputs from a model using an incompatible TI would be useful. cc @asomoza Is this something you've tested before?

Contributor Author

> Hi @suzukimain, apologies for the delay, last week was the Diffusers team offsite.
>
> > Changed embedding dimension from 768 to 1024
>
> This text is what we would check for in the test, either just `Changed embedding dimension from` or including the original + new dimensions depending on how existing TI tests are set up. Would you like assistance adding the test? happy to take over if needed.
>
> Do you need any other information?
>
> Example outputs from a model using an incompatible TI would be useful. cc @asomoza Is this something you've tested before?

Hello @hlky, if possible, could you please add a test?

@hlky
Contributor

hlky commented Apr 4, 2025

Hi @suzukimain, I've run some examples using two different v1 TIs on v2. IMO this isn't working as expected; can you confirm whether you have seen good results with this method?

gsdf/EasyNegative

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import hf_hub_download


pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")


image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    negative_prompt="EasyNegative",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v1.png")

path = hf_hub_download(
    repo_id="gsdf/EasyNegative",
    filename="EasyNegative.safetensors",
    repo_type="dataset",
)

pipe.load_textual_inversion(path, token="EasyNegative")

image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    negative_prompt="EasyNegative",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v1_easy_negative.png")


pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")


image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    negative_prompt="EasyNegative",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v2.png")

path = hf_hub_download(
    repo_id="gsdf/EasyNegative",
    filename="EasyNegative.safetensors",
    repo_type="dataset",
)

pipe.load_textual_inversion(path, token="EasyNegative")

image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    negative_prompt="EasyNegative",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v2_easy_negative.png")

v1: [images: v1.png (base), v1_easy_negative.png (with TI)]

v2: [images: v2.png (base), v2_easy_negative.png (with TI)]
sd-concepts-library/gta5-artwork

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import hf_hub_download


pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")


image = pipe(
    prompt="A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v1.png")

pipe.load_textual_inversion("sd-concepts-library/gta5-artwork")

image = pipe(
    prompt="A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v1_gta5.png")


pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")


image = pipe(
    prompt="A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v2.png")

pipe.load_textual_inversion("sd-concepts-library/gta5-artwork")

image = pipe(
    prompt="A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v2_gta5.png")

v1: [images: v1.png (base), v1_gta5.png (with TI)]

v2: [images: v2.png (base), v2_gta5.png (with TI)]

@suzukimain
Contributor Author

It certainly seems that the expected results have not been achieved.

gsdf/EasyNegative (test)
import os
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPProcessor, CLIPModel
from huggingface_hub import hf_hub_download
from PIL import Image


class Image_score:
    def __init__(self):
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def open_image(self, image_path):
        if isinstance(image_path, str):
            return Image.open(image_path)
        elif isinstance(image_path, Image.Image):
            return image_path
        else:
            raise ValueError("Invalid image path or type")

    def get_score(self, prompt, image):
        input_image = self.open_image(image)
        inputs = self.processor(text=[prompt], images=input_image, return_tensors="pt", padding=True)
        outputs = self.model(**inputs)
        return round(outputs.logits_per_image.item(), 3)


class Generate(Image_score):
    
    def __init__(self, prompt, textual_inversion_path, save_dir):
        super().__init__()
        self.num_images = 10
        self.textual_inversion_path = textual_inversion_path
        self.prompt = prompt
        self.save_dir = save_dir
        os.makedirs(save_dir, exist_ok=True)
        

    def test(self):
        score_list = []
        v1_score = self.v1_test()
        v2_score = self.v2_test()
        
        return {
            "v1":v1_score,
            "v2":v2_score
        }


    def v1_test(self):
        score_list = []

        pipe = StableDiffusionPipeline.from_pretrained(
            "stable-diffusion-v1-5/stable-diffusion-v1-5",
            torch_dtype=torch.float16,
            variant="fp16",
        ).to("cuda")

        image = pipe(
            prompt=self.prompt,
            negative_prompt="EasyNegative",
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]

        base_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v1_base.png")

        pipe.load_textual_inversion(self.textual_inversion_path, token="EasyNegative")

        image = pipe(
            prompt=self.prompt,
            negative_prompt="EasyNegative",
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]

        textual_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v1_textual.png")

        return {
            "base":base_score,
            "textual":textual_score,
        }
        

    def v2_test(self):
        score_list = []

        pipe = StableDiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2-1",
            torch_dtype=torch.float16,
            variant="fp16",
        ).to("cuda")

        image = pipe(
            prompt=self.prompt,
            negative_prompt="EasyNegative",
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        base_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v2_base.png")

        pipe.load_textual_inversion(self.textual_inversion_path, token="EasyNegative")

        image = pipe(
            prompt=self.prompt,
            negative_prompt="EasyNegative",
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        textual_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v2_textual.png")

        return {
            "base":base_score,
            "textual":textual_score,
        }


EasyNegative_path = hf_hub_download(repo_id="gsdf/EasyNegative", filename="EasyNegative.safetensors", repo_type="dataset")


test_prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

EasyNegative_test = Generate(
    prompt=test_prompt,
    textual_inversion_path=EasyNegative_path,
    save_dir="EasyNegative_test",
).test()

print(EasyNegative_test)

result

{'v1': {'base': 31.409, 'textual': 32.48}, 'v2': {'base': 39.412, 'textual': 36.865}}

| Model  | Base   | Textual inversion |
|--------|--------|-------------------|
| SD 1.5 | 31.409 | 32.48             |
| SD 2.1 | 39.412 | 36.865            |

[images: v1_base, v1_textual, v2_base, v2_textual]

sd-concepts-library/gta5-artwork (test)
import os
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPProcessor, CLIPModel
from huggingface_hub import hf_hub_download
from PIL import Image


class Image_score:
    def __init__(self):
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def open_image(self, image_path):
        if isinstance(image_path, str):
            return Image.open(image_path)
        elif isinstance(image_path, Image.Image):
            return image_path
        else:
            raise ValueError("Invalid image path or type")

    def get_score(self, prompt, image):
        input_image = self.open_image(image)
        inputs = self.processor(text=[prompt], images=input_image, return_tensors="pt", padding=True)
        outputs = self.model(**inputs)
        return round(outputs.logits_per_image.item(), 3)


class Generate(Image_score):
    
    def __init__(self, prompt, textual_inversion_path, save_dir):
        super().__init__()
        self.num_images = 10
        self.textual_inversion_path = textual_inversion_path
        self.prompt = prompt
        self.save_dir = save_dir
        os.makedirs(save_dir, exist_ok=True)
        

    def test(self):
        score_list = []
        v1_score = self.v1_test()
        v2_score = self.v2_test()
        
        return {
            "v1":v1_score,
            "v2":v2_score
        }


    def v1_test(self):
        score_list = []

        pipe = StableDiffusionPipeline.from_pretrained(
            "stable-diffusion-v1-5/stable-diffusion-v1-5",
            torch_dtype=torch.float16,
            variant="fp16",
        ).to("cuda")

        image = pipe(
            prompt=self.prompt,
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        base_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v1_base.png")

        pipe.load_textual_inversion(self.textual_inversion_path)

        image = pipe(
            prompt=self.prompt,
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        textual_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v1_textual.png")

        return {
            "base":base_score,
            "textual":textual_score,
        }
        

    def v2_test(self):
        score_list = []

        pipe = StableDiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2-1",
            torch_dtype=torch.float16,
            variant="fp16",
        ).to("cuda")

        image = pipe(
            prompt=self.prompt,
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        base_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v2_base.png")

        pipe.load_textual_inversion(self.textual_inversion_path)

        image = pipe(
            prompt=self.prompt,
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        textual_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v2_textual.png")

        return {
            "base":base_score,
            "textual":textual_score,
        }

gta5_artwork_path = hf_hub_download(repo_id="sd-concepts-library/gta5-artwork", filename="learned_embeds.bin")

test_prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style"

gta5_artwork_test = Generate(
    prompt=test_prompt,
    textual_inversion_path=gta5_artwork_path,
    save_dir="gta5_artwork",
).test()

print(gta5_artwork_test)

result

{'v1': {'base': 37.794, 'textual': 40.9}, 'v2': {'base': 39.672, 'textual': 36.811}}

| Model  | Base   | Textual inversion |
|--------|--------|-------------------|
| SD 1.5 | 37.794 | 40.9              |
| SD 2.1 | 39.672 | 36.811            |

[images: v2_base, v2_textual]
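
Summarizing the two CLIP-score tables above as deltas (textual minus base) makes the pattern explicit: the v1 embedding helps on the model it was trained for and hurts on v2, consistent with the images shared earlier. A quick sketch over the reported numbers:

```python
# CLIP scores reported above (base vs. after loading the textual inversion).
easy_negative = {'v1': {'base': 31.409, 'textual': 32.48}, 'v2': {'base': 39.412, 'textual': 36.865}}
gta5_artwork = {'v1': {'base': 37.794, 'textual': 40.9}, 'v2': {'base': 39.672, 'textual': 36.811}}

for name, scores in [("EasyNegative", easy_negative), ("gta5-artwork", gta5_artwork)]:
    # Positive delta = the TI improved the CLIP score; negative = it degraded it.
    deltas = {k: round(v['textual'] - v['base'], 3) for k, v in scores.items()}
    print(name, deltas)
# EasyNegative {'v1': 1.071, 'v2': -2.547}
# gta5-artwork {'v1': 3.106, 'v2': -2.861}
```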

@suzukimain
Contributor Author

Can anyone give me some advice?

@DN6
Collaborator

DN6 commented Apr 15, 2025

@suzukimain Is there an example of this approach working elsewhere? Looking at the code, it seems like it is a random projection through a linear layer of an embedding for SD1.5 CLIP to the dimension of SD2.1 CLIP?

I don't think this will work well since you're essentially just multiplying the SD1.5 embedding by a random matrix that isn't aligned with the SD 2.1 CLIP embedding space?
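
For intuition on why the random projection is unlikely to help: a freshly initialized `nn.Linear` is (up to bias) just a random matrix, and the image of a v1 embedding under it has no relationship to the v2 text encoder's embedding of the same concept. A dependency-free sketch with synthetic stand-in vectors (all values here are made up, not real embeddings):

```python
import math
import random

random.seed(0)

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1_emb = rand_vec(768)    # stand-in for an SD 1.5 token embedding
target = rand_vec(1024)   # stand-in for the concept's "true" SD 2.1 embedding

# An unseeded linear layer amounts to multiplying by a random 1024x768 matrix.
W = [rand_vec(768) for _ in range(1024)]
projected = [sum(w * x for w, x in zip(row, v1_emb)) for row in W]

# High-dimensional random vectors are nearly orthogonal, so the projected
# embedding carries no alignment with the target embedding space.
print(abs(cosine(projected, target)))  # small, on the order of 1/sqrt(1024)
```

A second issue is reproducibility: since the layer is re-initialized on every load, the same TI file produces a different resized embedding each time.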

@suzukimain
Contributor Author

> @suzukimain Is there an example of this approach working elsewhere? Looking at the code, it seems like it is a random projection through a linear layer of an embedding for SD1.5 CLIP to the dimension of SD2.1 CLIP?
>
> I don't think this will work well since you're essentially just multiplying the SD1.5 embedding by a random matrix that isn't aligned with the SD 2.1 CLIP embedding space?

Hello @DN6, thank you for your response. I am not very familiar with this field, so I may have taken an incorrect approach.

Contributor

github-actions bot commented May 9, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label May 9, 2025
Successfully merging this pull request may close these issues.

[Request] Compatibility of textual inversion between SD 1.5 and SD 2.1
4 participants