Guided generation to classify Python projects #379
lmmx started this conversation in Show and tell
Wanted to share some examples of how I tried to use Outlines: what did and didn't work so well, and what didn't seem possible.
My use case was based around PyPI trove classifiers, the tags you put on a project. I wanted to experiment with whether I could select some of them automatically, building towards automated tag suggestion based on a project's description/README etc.
So first off I wrote an example program that picks an "Intended Audience" (program 0), and when that failed I tried another that picks a Development Status, thinking that'd be an easier task (program 1). I concluded that TinyLlama was too small a model, as I didn't manage to get good results with it.
I attach the source for these below, followed by their output. (I only noticed in the 4th script that I wasn't using the GPU 😅 I had to look up how the device gets passed through to the transformers code, but this was ultimately fine, and none of the bugs were this library's fault.)
Environment setup note
Note that for installation I added outlines as a dependency of the project this demo relies on, classipypi (specifically here). By using the newly added PDM per-dependency URL feature I was able to install PyTorch entirely via pip (I usually use conda). This means that to reproduce the following results you can run `pdm install` and the lockfile will give you the same environment. (By default it'll make a `.venv`, but I use a conda env, which I activate and then run `which python > .pdm-python`; that lets PDM pick up the env without me even needing to activate it.)
Code
`0_pick_audience_demo.py`
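The script itself is collapsed above, but the core pattern looked roughly like this; a minimal sketch assuming the `outlines.generate.choice` API, an assumed TinyLlama checkpoint name, and an illustrative subset of the "Intended Audience" classifiers (the real script may differ in details):

```python
import outlines

# Illustrative subset of the "Intended Audience" trove classifiers
AUDIENCES = [
    "Intended Audience :: Developers",
    "Intended Audience :: Information Technology",
    "Intended Audience :: Science/Research",
]

# TinyLlama chat checkpoint (assumed name; any small chat model would do)
model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompt = (
    "Project description: A library for working with PyPI package metadata.\n"
    "Which audience tag fits best?"
)

# Constrain decoding so the model can only emit one of the listed tags
generator = outlines.generate.choice(model, AUDIENCES)
print(generator(prompt))
```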
Result: it always chooses "Information Technology" (or otherwise fails).
Next I tried...
`1b_pick_devel_status_demo_tinyllama_1b_fixed_prompt_formatting.py`
The results here were essentially random, as if the model was just guessing rather than following the prompt.
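The "fixed prompt formatting" in that filename presumably refers to rendering the prompt in the model's expected chat format; a sketch of that kind of fix using the `transformers` chat-template API (the exact checkpoint and message content are assumptions):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "user", "content": "Pick the Development Status tag for this project: ..."},
]

# Render the conversation with the model's own chat template, so the prompt
# matches the format the chat model was fine-tuned on
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```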
Next I did some more reading around, recalled the Zephyr model (a Mistral finetune by Hugging Face), and got that to run, switching to the beta version.
`3_pick_devel_status_demo_zephyr_beta.py`
Result: it suddenly began to work really well, and to emphasise where it was/wasn't stable I added 3 attempts per prompt (so repeated answers indicate confidence). This suggested to me that the Zephyr model was the way to go and that I could do something with this library!
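Concretely, the repeated-attempts check can be done with a loop like the following; a sketch reusing the same hypothetical choice setup, but with the real `HuggingFaceH4/zephyr-7b-beta` checkpoint:

```python
import outlines

STATUSES = [
    "Development Status :: 3 - Alpha",
    "Development Status :: 4 - Beta",
    "Development Status :: 5 - Production/Stable",
]

model = outlines.models.transformers("HuggingFaceH4/zephyr-7b-beta", device="cuda")
generator = outlines.generate.choice(model, STATUSES)

prompt = "Project README: ...\nWhich development status fits best?"

# 3 attempts per prompt: identical answers across samples suggest the
# model is confident rather than guessing
answers = [generator(prompt) for _ in range(3)]
print(answers)
```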
Next I tried the audience one again, since now I had confidence it might be able to do the harder task... I also stopped chopping the full multi-part trove classifier tags apart, as I realised it didn't change the effectiveness.
`4_pick_audience_demo_zephyr_beta.py`
The results here were even stronger, essentially getting 100% accuracy.
Lastly I realised I didn't need to include the list of generation choices in the prompt (but I did need to include ICL demos in it). When I removed the examples from the prompt the performance dropped, but when I took the generation choices out there was no change. This is hardly new; I've read about these techniques before, and it's intuitive that guided generation doesn't need to be made aware of the options in advance (in the middle of the prompt).
`5_pick_audience_demo_zephyr_beta_no_tag_list_in_prompt.py`
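In other words, the few-shot (ICL) demos stay in the prompt while the candidate tags move entirely into the generation constraint. A sketch of the pattern (the demo texts here are made up):

```python
import outlines

AUDIENCES = [
    "Intended Audience :: Developers",
    "Intended Audience :: Science/Research",
    "Intended Audience :: System Administrators",
]

model = outlines.models.transformers("HuggingFaceH4/zephyr-7b-beta", device="cuda")

# The ICL demos stay in the prompt...
prompt = """\
Description: A pytest plugin for snapshot testing.
Audience: Intended Audience :: Developers

Description: A toolkit for analysing astronomical survey data.
Audience: Intended Audience :: Science/Research

Description: A CLI tool for batch-managing cron jobs on remote hosts.
Audience:"""

# ...but the candidate tags appear only here, as the constraint on decoding;
# the model never sees the list of options in the prompt itself
generator = outlines.generate.choice(model, AUDIENCES)
print(generator(prompt))
```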
Generation failure/slowdown when given many choices
I also made an attempt to pass in all of the trove classifier tags, and when I did this it just 'gummed up' (it froze; maybe it was going to complete, but I cancelled out). I then began to think about how to break the problem down into nested Pydantic models, or to do each category of trove classifier separately (see the sketch below).
(I don't have the code to hand for this, but it was essentially as above with all of the trove classifiers, i.e. running `list_tags` without `include` filters in the `ListingConfig`: the 839 trove classifiers you get on the command line from running `classipypi ls`.)
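A sketch of the per-category breakdown, using the real `trove-classifiers` package to group the full tag set by its top-level segment (the prompt wording is a placeholder):

```python
from collections import defaultdict

import outlines
from trove_classifiers import classifiers  # the full set of trove classifier strings

# Group tags by their top-level segment: "Development Status",
# "Intended Audience", "Topic", and so on
by_category: dict[str, list[str]] = defaultdict(list)
for tag in sorted(classifiers):
    by_category[tag.split(" :: ")[0]].append(tag)

model = outlines.models.transformers("HuggingFaceH4/zephyr-7b-beta", device="cuda")
readme = "..."  # the project description/README text goes here

chosen = {}
for category, tags in by_category.items():
    # One small choice-constrained generation per category, rather than a
    # single generation constrained over all 839 tags at once
    generator = outlines.generate.choice(model, tags)
    chosen[category] = generator(f"{readme}\n\nPick the best '{category}' tag:")
print(chosen)
```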
This was something of a weekend hack and I didn't get back to look at it again until @rlouf nudged me to send this user report. I hope it's helpful, and needless to say: bravo! 🙂
Replies: 1 comment

Thanks, that's really helpful. If only every user would send us detailed reports like this 🙏