Improve textual inversion compatibility #10949


Open · wants to merge 7 commits into main

Conversation

suzukimain
Contributor

What does this PR do?

Fixes #10373

This PR fixes the incompatibility of textual inversion embeddings across different SD versions, such as SD 1.5 and SD 2.1.

Example:

!pip install git+https://github.com/suzukimain/diffusers.git@textual_inversion

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import hf_hub_download

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16).to("cuda")

path = hf_hub_download(repo_id="gsdf/EasyNegative", filename="EasyNegative.safetensors", repo_type="dataset")

pipe.load_textual_inversion(path, token="EasyNegative")

Additionally, if you find any mistakes, please feel free to let me know.


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 DN6 requested a review from sayakpaul March 11, 2025 10:56
Comment on lines +405 to +409
for i, embedding in enumerate(embeddings):
    if embedding.shape[-1] != expected_emb_dim:
        linear = nn.Linear(embedding.shape[-1], expected_emb_dim)
        embeddings[i] = linear(embedding)
        logger.info(f"Changed embedding dimension from {embedding.shape[-1]} to {expected_emb_dim}")
Member

Do we want to add a test case to cover this?

Contributor Author

> Do we want to add a test case to cover this?

What should we do with this?

Contributor

Hi @suzukimain. The test would load an embedding into an incompatible model and check for the log "Changed embedding dimension...".
Also, do you have any example outputs to share?
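
A test along these lines might look like the following minimal sketch. This is a dependency-free stand-in, not diffusers' actual test helpers: `resize_embedding`, the logger name, and the pad-based resize are all hypothetical placeholders; only the pattern (capture logs, assert on the "Changed embedding dimension" message) mirrors the suggestion above.

```python
import logging
import unittest

logger = logging.getLogger("ti_resize_sketch")

def resize_embedding(embedding, expected_dim):
    # Hypothetical stand-in for the PR's resize step (the real code projects
    # through an nn.Linear); only the logged message matters for the test.
    if len(embedding) != expected_dim:
        logger.info("Changed embedding dimension from %d to %d", len(embedding), expected_dim)
        embedding = (embedding + [0.0] * expected_dim)[:expected_dim]
    return embedding

class ResizeLogTest(unittest.TestCase):
    def test_logs_dimension_change(self):
        # "Load" a 768-dim embedding into a model expecting 1024 dims and
        # assert that the log line appears.
        with self.assertLogs("ti_resize_sketch", level="INFO") as cm:
            out = resize_embedding([0.1] * 768, 1024)
        self.assertEqual(len(out), 1024)
        self.assertTrue(any("Changed embedding dimension from 768 to 1024" in m for m in cm.output))
```

In the real suite the same pattern would wrap `pipe.load_textual_inversion` with diffusers' log-capture utilities instead of this stub.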

Contributor Author

Hi @hlky, the following log is what I was able to get:

The loaded token: emb_params is overwritten by the passed token EasyNegative.
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Changed embedding dimension from 768 to 1024
Loaded textual inversion embedding for EasyNegative.
Loaded textual inversion embedding for EasyNegative_1.
Loaded textual inversion embedding for EasyNegative_2.
Loaded textual inversion embedding for EasyNegative_3.
Loaded textual inversion embedding for EasyNegative_4.
Loaded textual inversion embedding for EasyNegative_5.
Loaded textual inversion embedding for EasyNegative_6.
Loaded textual inversion embedding for EasyNegative_7.

Contributor Author

Hello. Do you need any other information?

Contributor

Hi @suzukimain, apologies for the delay, last week was the Diffusers team offsite.

> Changed embedding dimension from 768 to 1024

This text is what we would check for in the test, either just `Changed embedding dimension from` or including the original and new dimensions, depending on how the existing TI tests are set up. Would you like assistance adding the test? Happy to take over if needed.

Do you need any other information?

Example outputs from a model using an incompatible TI would be useful. cc @asomoza Is this something you've tested before?

Contributor Author

> Hi @suzukimain, apologies for the delay, last week was the Diffusers team offsite.
>
> > Changed embedding dimension from 768 to 1024
>
> This text is what we would check for in the test, either just `Changed embedding dimension from` or including the original + new dimensions depending on how existing TI tests are set up. Would you like assistance adding the test? happy to take over if needed.
>
> Do you need any other information?
>
> Example outputs from a model using an incompatible TI would be useful. cc @asomoza Is this something you've tested before?

Hello @hlky, if possible, could you please add a test?

@hlky
Contributor

hlky commented Apr 4, 2025

Hi @suzukimain, I've run some examples using two different v1 TIs on v2. IMO this isn't working as expected; can you confirm whether you have seen good results with this method?

gsdf/EasyNegative

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import hf_hub_download


pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")


image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    negative_prompt="EasyNegative",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v1.png")

path = hf_hub_download(
    repo_id="gsdf/EasyNegative",
    filename="EasyNegative.safetensors",
    repo_type="dataset",
)

pipe.load_textual_inversion(path, token="EasyNegative")

image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    negative_prompt="EasyNegative",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v1_easy_negative.png")


pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")


image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    negative_prompt="EasyNegative",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v2.png")

path = hf_hub_download(
    repo_id="gsdf/EasyNegative",
    filename="EasyNegative.safetensors",
    repo_type="dataset",
)

pipe.load_textual_inversion(path, token="EasyNegative")

image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    negative_prompt="EasyNegative",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v2_easy_negative.png")

v1: [images: v1.png (base), v1_easy_negative.png (with TI)]

v2: [images: v2.png (base), v2_easy_negative.png (with TI)]
sd-concepts-library/gta5-artwork

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import hf_hub_download


pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")


image = pipe(
    prompt="A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v1.png")

pipe.load_textual_inversion("sd-concepts-library/gta5-artwork")

image = pipe(
    prompt="A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v1_gta5.png")


pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")


image = pipe(
    prompt="A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v2.png")

pipe.load_textual_inversion("sd-concepts-library/gta5-artwork")

image = pipe(
    prompt="A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("v2_gta5.png")

v1: [images: v1.png (base), v1_gta5.png (with TI)]

v2: [images: v2.png (base), v2_gta5.png (with TI)]

@suzukimain
Contributor Author

It certainly seems that the expected results have not been achieved.

gsdf/EasyNegative (test)
import os
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPProcessor, CLIPModel
from huggingface_hub import hf_hub_download
from PIL import Image


class Image_score:
    def __init__(self):
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def open_image(self, image_path):
        if isinstance(image_path, str):
            return Image.open(image_path)
        elif isinstance(image_path, Image.Image):
            return image_path
        else:
            raise ValueError("Invalid image path or type")

    def get_score(self, prompt, image):
        input_image = self.open_image(image)
        inputs = self.processor(text=[prompt], images=input_image, return_tensors="pt", padding=True)
        outputs = self.model(**inputs)
        return round(outputs.logits_per_image.item(), 3)


class Generate(Image_score):
    
    def __init__(self, prompt, textual_inversion_path, save_dir):
        super().__init__()
        self.num_images = 10
        self.textual_inversion_path = textual_inversion_path
        self.prompt = prompt
        self.save_dir = save_dir
        os.makedirs(save_dir, exist_ok=True)
        

    def test(self):
        score_list = []
        v1_score = self.v1_test()
        v2_score = self.v2_test()
        
        return {
            "v1":v1_score,
            "v2":v2_score
        }


    def v1_test(self):
        score_list = []

        pipe = StableDiffusionPipeline.from_pretrained(
            "stable-diffusion-v1-5/stable-diffusion-v1-5",
            torch_dtype=torch.float16,
            variant="fp16",
        ).to("cuda")

        image = pipe(
            prompt=self.prompt,
            negative_prompt="EasyNegative",
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]

        base_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v1_base.png")

        pipe.load_textual_inversion(self.textual_inversion_path, token="EasyNegative")

        image = pipe(
            prompt=self.prompt,
            negative_prompt="EasyNegative",
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]

        textual_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v1_textual.png")

        return {
            "base":base_score,
            "textual":textual_score,
        }
        

    def v2_test(self):
        score_list = []

        pipe = StableDiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2-1",
            torch_dtype=torch.float16,
            variant="fp16",
        ).to("cuda")

        image = pipe(
            prompt=self.prompt,
            negative_prompt="EasyNegative",
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        base_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v2_base.png")

        pipe.load_textual_inversion(self.textual_inversion_path, token="EasyNegative")

        image = pipe(
            prompt=self.prompt,
            negative_prompt="EasyNegative",
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        textual_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v2_textual.png")

        return {
            "base":base_score,
            "textual":textual_score,
        }


EasyNegative_path = hf_hub_download(repo_id="gsdf/EasyNegative", filename="EasyNegative.safetensors", repo_type="dataset")


test_prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

EasyNegative_test = Generate(
    prompt=test_prompt,
    textual_inversion_path=EasyNegative_path,
    save_dir="EasyNegative_test",
).test()

print(EasyNegative_test)

result

{'v1': {'base': 31.409, 'textual': 32.48}, 'v2': {'base': 39.412, 'textual': 36.865}}

| Model  | Base   | Textual inversion |
|--------|--------|-------------------|
| SD 1.5 | 31.409 | 32.48             |
| SD 2.1 | 39.412 | 36.865            |

[images: v1_base, v1_textual, v2_base, v2_textual]

sd-concepts-library/gta5-artwork (test)
import os
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPProcessor, CLIPModel
from huggingface_hub import hf_hub_download
from PIL import Image


class Image_score:
    def __init__(self):
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def open_image(self, image_path):
        if isinstance(image_path, str):
            return Image.open(image_path)
        elif isinstance(image_path, Image.Image):
            return image_path
        else:
            raise ValueError("Invalid image path or type")

    def get_score(self, prompt, image):
        input_image = self.open_image(image)
        inputs = self.processor(text=[prompt], images=input_image, return_tensors="pt", padding=True)
        outputs = self.model(**inputs)
        return round(outputs.logits_per_image.item(), 3)


class Generate(Image_score):
    
    def __init__(self, prompt, textual_inversion_path, save_dir):
        super().__init__()
        self.num_images = 10
        self.textual_inversion_path = textual_inversion_path
        self.prompt = prompt
        self.save_dir = save_dir
        os.makedirs(save_dir, exist_ok=True)
        

    def test(self):
        score_list = []
        v1_score = self.v1_test()
        v2_score = self.v2_test()
        
        return {
            "v1":v1_score,
            "v2":v2_score
        }


    def v1_test(self):
        score_list = []

        pipe = StableDiffusionPipeline.from_pretrained(
            "stable-diffusion-v1-5/stable-diffusion-v1-5",
            torch_dtype=torch.float16,
            variant="fp16",
        ).to("cuda")

        image = pipe(
            prompt=self.prompt,
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        base_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v1_base.png")

        pipe.load_textual_inversion(self.textual_inversion_path)

        image = pipe(
            prompt=self.prompt,
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        textual_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v1_textual.png")

        return {
            "base":base_score,
            "textual":textual_score,
        }
        

    def v2_test(self):
        score_list = []

        pipe = StableDiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2-1",
            torch_dtype=torch.float16,
            variant="fp16",
        ).to("cuda")

        image = pipe(
            prompt=self.prompt,
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        base_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v2_base.png")

        pipe.load_textual_inversion(self.textual_inversion_path)

        image = pipe(
            prompt=self.prompt,
            generator=torch.Generator("cuda").manual_seed(0),
        ).images[0]
        textual_score = self.get_score(self.prompt, image)
        image.save(f"{self.save_dir}/v2_textual.png")

        return {
            "base":base_score,
            "textual":textual_score,
        }

gta5_artwork_path = hf_hub_download(repo_id="sd-concepts-library/gta5-artwork", filename="learned_embeds.bin")

test_prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style"

gta5_artwork_test = Generate(
    prompt=test_prompt,
    textual_inversion_path=gta5_artwork_path,
    save_dir="gta5_artwork",
).test()

print(gta5_artwork_test)

result

{'v1': {'base': 37.794, 'textual': 40.9}, 'v2': {'base': 39.672, 'textual': 36.811}}

| Model  | Base   | Textual inversion |
|--------|--------|-------------------|
| SD 1.5 | 37.794 | 40.9              |
| SD 2.1 | 39.672 | 36.811            |

[images: v2_base, v2_textual]
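
Summarizing the two CLIP-score tables above as deltas (textual minus base) makes the pattern explicit: the v1 embedding helps on the model it was trained for and hurts on v2, consistent with the images shared earlier. A quick sketch over the reported numbers:

```python
# CLIP scores reported above (base vs. after loading the textual inversion).
easy_negative = {'v1': {'base': 31.409, 'textual': 32.48}, 'v2': {'base': 39.412, 'textual': 36.865}}
gta5_artwork = {'v1': {'base': 37.794, 'textual': 40.9}, 'v2': {'base': 39.672, 'textual': 36.811}}

for name, scores in [("EasyNegative", easy_negative), ("gta5-artwork", gta5_artwork)]:
    # Positive delta = the TI improved the CLIP score; negative = it degraded it.
    deltas = {k: round(v['textual'] - v['base'], 3) for k, v in scores.items()}
    print(name, deltas)
# EasyNegative {'v1': 1.071, 'v2': -2.547}
# gta5-artwork {'v1': 3.106, 'v2': -2.861}
```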

@suzukimain
Contributor Author

Can anyone give me some advice?

@DN6
Collaborator

DN6 commented Apr 15, 2025

@suzukimain Is there an example of this approach working elsewhere? Looking at the code, it seems like it is a random projection through a linear layer of an embedding for SD1.5 CLIP to the dimension of SD2.1 CLIP?

I don't think this will work well since you're essentially just multiplying the SD1.5 embedding by a random matrix that isn't aligned with the SD 2.1 CLIP embedding space?
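
For intuition on why the random projection is unlikely to help: a freshly initialized `nn.Linear` is (up to bias) just a random matrix, and the image of a v1 embedding under it has no relationship to the v2 text encoder's embedding of the same concept. A dependency-free sketch with synthetic stand-in vectors (all values here are made up, not real embeddings):

```python
import math
import random

random.seed(0)

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1_emb = rand_vec(768)    # stand-in for an SD 1.5 token embedding
target = rand_vec(1024)   # stand-in for the concept's "true" SD 2.1 embedding

# An unseeded linear layer amounts to multiplying by a random 1024x768 matrix.
W = [rand_vec(768) for _ in range(1024)]
projected = [sum(w * x for w, x in zip(row, v1_emb)) for row in W]

# High-dimensional random vectors are nearly orthogonal, so the projected
# embedding carries no alignment with the target embedding space.
print(abs(cosine(projected, target)))  # small, on the order of 1/sqrt(1024)
```

A second issue is reproducibility: since the layer is re-initialized on every load, the same TI file produces a different resized embedding each time.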

@suzukimain
Contributor Author

> @suzukimain Is there an example of this approach working elsewhere? Looking at the code, it seems like it is a random projection through a linear layer of an embedding for SD1.5 CLIP to the dimension of SD2.1 CLIP?
>
> I don't think this will work well since you're essentially just multiplying the SD1.5 embedding by a random matrix that isn't aligned with the SD 2.1 CLIP embedding space?

Hello @DN6, thank you for your response. I am not very familiar with this field, so I may have taken an incorrect approach.

Contributor

github-actions bot commented May 9, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label May 9, 2025
Successfully merging this pull request may close these issues.

[Request] Compatibility of textual inversion between SD 1.5 and SD 2.1
4 participants