⚡️ Speed up function `convert_to_coco` by 26% by codeflash-ai[bot] · Pull Request #267 · codeflash-ai/unstructured

codeflash-ai · 2026-01-24T08:20:49Z

📄 26% (0.26x) speedup for `convert_to_coco` in `unstructured/staging/base.py`

⏱️ Runtime : 29.6 milliseconds → 23.4 milliseconds (best of 7 runs)

📝 Explanation and details

The optimized code achieves a 26% speedup by eliminating redundant computations and replacing inefficient list scans with O(1) dictionary lookups.

Key optimizations:

1. **Single `datetime.now()` call**: The original code called `datetime.now()` three times to populate the "info" section (for the description string, the year, and `date_created`). The optimized version stores the result in a `now` variable and reuses it, avoiding two redundant clock reads.
2. **No per-item dictionary sorting for image deduplication**: The original code deduplicated images using `{tuple(sorted(d.items())): d for d in images}`, which sorts every dictionary's items, an O(k log k) operation per dictionary where k is the number of keys. The optimized code builds a tuple key directly from the relevant fields (`width`, `height`, `file_directory`, `file_name`, `page_number`) without sorting, reducing the deduplication cost from O(n·k log k) to O(n·k), which is effectively O(n) because the five fields are fixed.
3. **Dictionary mapping instead of O(n·m) category ID lookups**: The original code evaluated the list comprehension `[x["id"] for x in categories if x["name"] == el["type"]][0]` for every annotation, scanning all m categories for each of the n elements. The optimized version builds a `name_to_id` dictionary once and performs an average O(1) lookup per element, reducing the total from O(n·m) to O(n).
4. **Hoisted repeated metadata lookups**: The original code called `el["metadata"].get("coordinates")` up to 12 times per element when building annotations. The optimized version stores the result in a `coords` variable and reuses it, eliminating the redundant dictionary accesses.
5. **Explicit loops where intermediate values are reused**: List comprehensions were replaced with explicit loops where intermediate values (such as `coords` and the bbox components) are used several times, which reduces dictionary indexing overhead.
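The category lookup and the cached timestamp can be illustrated with a minimal sketch. It is not the code from the PR: the `categories` list, the helper names `category_id_scan`, `category_id_lookup`, and `build_info`, and the default description string are assumptions chosen only to mirror the shapes described above.

```python
from datetime import datetime

# Hypothetical category list: dicts with "id" and "name", ids assigned in sorted-name order.
categories = [
    {"id": i + 1, "name": name}
    for i, name in enumerate(sorted(["Title", "Text", "Caption"]))
]


def category_id_scan(type_name):
    """Original style: scan every category for each annotation, O(m) per lookup."""
    return [c["id"] for c in categories if c["name"] == type_name][0]


# Optimized style: build the mapping once, then each lookup is O(1) on average.
name_to_id = {c["name"]: c["id"] for c in categories}


def category_id_lookup(type_name):
    """Optimized style: a single dictionary access per annotation."""
    return name_to_id[type_name]


def build_info(description=None, version="1.0"):
    """Read the clock once and reuse the value for every date-derived field."""
    now = datetime.now()
    today = now.strftime("%Y-%m-%d")
    return {
        # The default description text is an assumption made for this sketch.
        "description": description or f"COCO dataset created on {today}",
        "version": version,
        "year": now.year,
        "date_created": today,
    }
```

Reading the clock once also guarantees that `year` and `date_created` agree, even if the call happens to straddle midnight on 31 December.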

Impact on workloads:

- The test results show the largest gains on **larger datasets** (the 300- and 500-element tests run 22-28% faster), where the O(1) category lookup and the reduced metadata access compound.
- For empty inputs, the fixed cost of creating the `name_to_id` dictionary shows up as a slight slowdown (5-10%), which is negligible in absolute terms (about 2-4 μs). Tests with 1-3 elements range from roughly neutral to about 12% faster.
- The deduplication improvement benefits inputs that build many image entries, as seen in `test_large_scale_many_annotations_with_mixed_metadata` (173% faster, 1.20 ms → 441 μs), where tuple-based deduplication significantly outperforms sorting-based deduplication.
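To check how the lookup change scales on inputs of a given size, a "best of 7 runs" measurement in the style of the figures above can be sketched with `timeit`. The data here is synthetic and the helper names (`scan_all`, `lookup_all`, `best_of`) are hypothetical; this is not the benchmark harness Codeflash uses, and the timings it prints will differ by machine.

```python
import timeit

# Synthetic input: 20 categories and 500 element types (assumed sizes, not project data).
categories = [{"id": i, "name": f"type_{i}"} for i in range(1, 21)]
element_types = [f"type_{(i % 20) + 1}" for i in range(500)]


def scan_all():
    """Original style: one list scan over all categories per element."""
    return [[c["id"] for c in categories if c["name"] == t][0] for t in element_types]


def lookup_all():
    """Optimized style: build the mapping once, then one dict access per element."""
    name_to_id = {c["name"]: c["id"] for c in categories}
    return [name_to_id[t] for t in element_types]


def best_of(func, runs=7, number=20):
    """Return the best per-call time in seconds over `runs` repetitions."""
    return min(timeit.repeat(func, repeat=runs, number=number)) / number


scan_time = best_of(scan_all)
lookup_time = best_of(lookup_all)
print(f"scan: {scan_time * 1e6:.1f} μs, lookup: {lookup_time * 1e6:.1f} μs")
```

Shrinking `element_types` to an empty list shows the opposite regime, where the only remaining cost is building the mapping.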

Behavioral preservation:

- The optimized code raises `IndexError` (via `except KeyError: raise IndexError`) when a category is not found, matching the original's exception behavior when indexing an empty list.
- Image deduplication logic preserves insertion order and updates with the last-seen value for each key, maintaining the original's `{...}.values()` semantics.
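Both preservation claims can be checked with a small sketch. It assumes every image dict carries exactly the five fields named in the description and that all values are hashable; the function names are hypothetical, and the message passed to `IndexError` is an addition for readability rather than something the PR states.

```python
# Field names as listed in the description; the rest of this sketch is illustrative.
IMAGE_FIELDS = ("width", "height", "file_directory", "file_name", "page_number")


def dedupe_sorted_items(images):
    """Original style: key each dict by its sorted (key, value) pairs."""
    return list({tuple(sorted(d.items())): d for d in images}.values())


def dedupe_tuple_key(images):
    """Optimized style: key each dict by a fixed-order tuple of field values."""
    unique = {}
    for d in images:
        # Re-assigning an existing key keeps its first-seen position but stores the
        # last-seen dict, which is how the dict comprehension above behaves too.
        unique[tuple(d[f] for f in IMAGE_FIELDS)] = d
    return list(unique.values())


def category_id(name_to_id, type_name):
    """Translate KeyError to IndexError, like `[...][0]` on an empty list would raise."""
    try:
        return name_to_id[type_name]
    except KeyError:
        raise IndexError(f"no category named {type_name!r}")
```

The two deduplication functions agree only while the image dicts contain exactly the fields in `IMAGE_FIELDS`; a dict with an extra key would be distinguished by the sorted-items key but not by the tuple key.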

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | ✅ 38 Passed |
| 🌀 Generated Regression Tests | ✅ 29 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | ✅ 1 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Click to see Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `staging/test_base.py::test_convert_to_coco` | 342μs | 337μs | 1.55%✅ |

🌀 Click to see Generated Regression Tests

import types
from datetime import datetime  # avoid shadowing in tests
from typing import Any

# function under test
from unstructured.staging.base import convert_to_coco

# Placeholder module object named after `unstructured.documents.elements`.
# It is never populated or registered in sys.modules, so convert_to_coco still imports
# TYPE_TO_TEXT_ELEMENT_MAP from the real module.
_elements_mod = types.ModuleType("unstructured.documents.elements")


# Helper: a lightweight element-like object with a to_dict() method matching expected shape.
class SimpleElement:
    """
    Minimal element-like object for testing:
    - must implement to_dict() returning a dict with keys:
      'type', 'element_id', 'text', 'metadata'
    - metadata may include 'coordinates' with 'points' (sequence of (x,y)), and layout_{width,height}
      as well as optional filename, file_directory, page_number
    """

    def __init__(
        self, element_id: str, type_name: str, text: str = "", metadata: dict | None = None
    ):
        self._element_id = element_id
        self._type = type_name
        self._text = text
        # metadata should be a dict; ensure keys exist; convert None -> {}
        self._metadata = metadata if metadata is not None else {}

    def to_dict(self) -> dict[str, Any]:
        # Return structure matching convert_to_coco expectations
        return {
            "type": self._type,
            "element_id": self._element_id,
            "text": self._text,
            "metadata": self._metadata,
        }


def test_basic_single_element_with_coordinates_and_metadata():
    # A basic test: one element with coordinates and file metadata.
    # Setup coordinates with points arranged so bbox and area are predictable.
    coords = {
        "points": [(1.0, 1.0), (1.0, 3.0), (4.0, 1.0)],  # x0=1,y0=1 ; x2=4 ; y1=3
        "layout_width": 800,
        "layout_height": 600,
    }
    metadata = {
        "coordinates": coords,
        "filename": "page1.png",
        "file_directory": "/tmp",
        "page_number": 1,
    }
    el = SimpleElement(element_id="el_1", type_name="Text", text="hello", metadata=metadata)

    codeflash_output = convert_to_coco(
        [el], dataset_description="My Dataset", dataset_version="2.5", contributors=("A", "B")
    )
    coco = codeflash_output  # 47.6μs -> 45.5μs (4.60% faster)
    # date_created is ISO date
    datetime.strptime(coco["info"]["date_created"], "%Y-%m-%d")
    img = coco["images"][0]

    # Categories: should reflect keys from TYPE_TO_TEXT_ELEMENT_MAP defined above
    cat_names = sorted([c["name"] for c in coco["categories"]])
    ann = coco["annotations"][0]
    # Find the expected category id for "Text"
    expected_cat_id = [c["id"] for c in coco["categories"] if c["name"] == "Text"][0]


def test_element_without_coordinates_yields_empty_bbox_and_none_area_and_defaults_for_missing_metadata():
    # Edge case: element without coordinates should produce empty bbox and None area
    metadata = {"file_directory": "/no_coords", "filename": "nocoord.png"}  # no 'coordinates' key
    el = SimpleElement(element_id="no_coords_1", type_name="Caption", text="", metadata=metadata)

    codeflash_output = convert_to_coco([el], dataset_description=None)
    coco = codeflash_output  # 53.4μs -> 51.5μs (3.57% faster)
    img = coco["images"][0]

    # Annotation bbox should be empty list and area None
    ann = coco["annotations"][0]

    # Info.description must be auto-generated (today's date present)
    today = datetime.now().strftime("%Y-%m-%d")


def test_deduplication_of_images_based_on_image_fields_and_consistent_image_ids():
    # Create multiple elements that share the same image metadata to test deduplication logic.
    shared_meta = {
        "coordinates": {
            "points": [(0, 0), (0, 1), (1, 0)],
            "layout_width": 100,
            "layout_height": 200,
        },
        "filename": "shared.png",
        "file_directory": "/shared",
        "page_number": 2,
    }
    # Two elements sharing image fields should result in a single image entry in coco["images"]
    el1 = SimpleElement(element_id="a1", type_name="Text", metadata=shared_meta)
    el2 = SimpleElement(element_id="a2", type_name="Title", metadata=shared_meta)

    codeflash_output = convert_to_coco([el1, el2])
    coco = codeflash_output  # 62.4μs -> 55.9μs (11.6% faster)
    # Annotation ids correspond to element ids (strings preserved)
    ids = {ann["id"] for ann in coco["annotations"]}


def test_large_scale_many_annotations_with_mixed_metadata():
    # Large-scale test within the allowed complexity limits:
    # Create 200 elements with alternating coordinates/no-coordinates and two distinct images to force deduplication.
    elements = []
    # Two unique file names to ensure at most two images in result.
    for i in range(200):
        if i % 2 == 0:
            # elements with coordinates and image A
            coords = {
                "points": [(i, i), (i, i + 2), (i + 3, i)],
                "layout_width": 10 + i,
                "layout_height": 20 + i,
            }
            metadata = {
                "coordinates": coords,
                "filename": "imageA.png",
                "file_directory": "/bulk",
                "page_number": 1,
            }
            el = SimpleElement(element_id=f"id_{i}", type_name="Text", metadata=metadata)
        else:
            # elements without coordinates and image B
            metadata = {"filename": "imageB.png", "file_directory": "/bulk", "page_number": 2}
            el = SimpleElement(element_id=f"id_{i}", type_name="Caption", metadata=metadata)
        elements.append(el)

    codeflash_output = convert_to_coco(elements, dataset_version="9.9")
    coco = codeflash_output  # 1.20ms -> 441μs (173% faster)
    # Ensure annotations for elements without coordinates have bbox=[] and area is None
    no_coord_anns = [
        ann for ann in coco["annotations"] if ann["id"].startswith("id_") and ann["bbox"] == []
    ]

    # Ensure that for those with coordinates, bbox values are floats and area equals width*height for each
    coord_anns = [ann for ann in coco["annotations"] if ann["bbox"] != []]
    for ann in coord_anns:
        bbox = ann["bbox"]
        for v in bbox:
            pass
        # calculate expected area from bbox width and height
        expected_area = bbox[2] * bbox[3]


def test_category_ids_are_consistent_and_sorted_by_name():
    # Validate that categories are sorted by name and ids assigned in that order
    # TYPE_TO_TEXT_ELEMENT_MAP keys in our fake module: {"Caption", "Text", "Title"}
    # The sorted order should be ['Caption', 'Text', 'Title']
    codeflash_output = convert_to_coco([])
    coco = codeflash_output  # 42.5μs -> 44.9μs (5.30% slower)
    names_in_order = [c["name"] for c in coco["categories"]]
    # Verify ids start at 1 and increase sequentially
    ids = [c["id"] for c in coco["categories"]]


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from datetime import datetime
from typing import Any

from unstructured.documents.elements import Element, ElementMetadata
from unstructured.staging.base import convert_to_coco

# ============================================================================
# Test Scenarios and Examples:
# ============================================================================
# 1. BASIC TEST CASES
#    - Empty elements list
#    - Single element with minimal metadata
#    - Single element with complete metadata
#    - Multiple elements with same type
#    - Multiple elements with different types
#
# 2. EDGE TEST CASES
#    - Elements with None coordinates
#    - Elements with missing metadata fields
#    - Elements with zero dimensions
#    - Elements with negative coordinates
#    - Elements with duplicate image metadata
#    - Elements with special characters in filenames
#    - Elements with very long file paths
#    - Empty contributors tuple
#    - None dataset_description
#    - Elements with overlapping bounding boxes
#
# 3. LARGE SCALE TEST CASES
#    - 500+ elements with full metadata
#    - 500+ unique images
#    - Elements with large coordinate values
#    - Multiple elements sharing same image metadata
# ============================================================================


# Helper class for testing - a concrete Element subclass
class TextElement(Element):
    """Concrete implementation of Element for testing."""

    category = "Text"

    def __init__(self, text: str = "", **kwargs):
        self.text = text
        super().__init__(**kwargs)

    def to_dict(self) -> dict[str, Any]:
        result = super().to_dict()
        result["type"] = "Text"
        return result


class Title(Element):
    """Concrete implementation of Element for testing."""

    category = "Title"

    def __init__(self, text: str = "", **kwargs):
        self.text = text
        super().__init__(**kwargs)

    def to_dict(self) -> dict[str, Any]:
        result = super().to_dict()
        result["type"] = "Title"
        return result


class NarrativeText(Element):
    """Concrete implementation of Element for testing."""

    category = "NarrativeText"

    def __init__(self, text: str = "", **kwargs):
        self.text = text
        super().__init__(**kwargs)

    def to_dict(self) -> dict[str, Any]:
        result = super().to_dict()
        result["type"] = "NarrativeText"
        return result


class ListItem(Element):
    """Concrete implementation of Element for testing."""

    category = "ListItem"

    def __init__(self, text: str = "", **kwargs):
        self.text = text
        super().__init__(**kwargs)

    def to_dict(self) -> dict[str, Any]:
        result = super().to_dict()
        result["type"] = "ListItem"
        return result


def test_empty_elements_list():
    """Test convert_to_coco with empty elements list."""
    codeflash_output = convert_to_coco([])
    result = codeflash_output  # 42.2μs -> 46.1μs (8.44% slower)


def test_basic_info_generation():
    """Test that info section is correctly generated with default parameters."""
    codeflash_output = convert_to_coco([])
    result = codeflash_output  # 42.1μs -> 45.4μs (7.24% slower)

    info = result["info"]


def test_custom_dataset_description():
    """Test that custom dataset description is used when provided."""
    custom_desc = "My Custom Dataset"
    codeflash_output = convert_to_coco([], dataset_description=custom_desc)
    result = codeflash_output  # 32.3μs -> 35.9μs (10.1% slower)


def test_custom_dataset_version():
    """Test that custom dataset version is used when provided."""
    codeflash_output = convert_to_coco([], dataset_version="2.5")
    result = codeflash_output  # 42.4μs -> 45.2μs (6.29% slower)


def test_custom_contributors():
    """Test that custom contributors are used when provided."""
    contributors = ("Alice", "Bob", "Charlie")
    codeflash_output = convert_to_coco([], contributors=contributors)
    result = codeflash_output  # 42.5μs -> 44.8μs (5.20% slower)


def test_single_element_minimal_metadata():
    """Test conversion of single element with minimal metadata."""
    element = TextElement(text="Hello World", element_id="elem_1")
    metadata = ElementMetadata()
    element.metadata = metadata

    codeflash_output = convert_to_coco([element])
    result = codeflash_output  # 74.5μs -> 73.7μs (0.978% faster)

    # Check annotation
    annotation = result["annotations"][0]


def test_multiple_elements_same_type():
    """Test conversion of multiple elements with the same type."""
    elements = [
        TextElement(text="Text 1", element_id="elem_1"),
        TextElement(text="Text 2", element_id="elem_2"),
        TextElement(text="Text 3", element_id="elem_3"),
    ]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 108μs -> 100μs (7.69% faster)
    for i, annotation in enumerate(result["annotations"]):
        pass


def test_multiple_elements_different_types():
    """Test conversion of multiple elements with different types."""
    elements = [
        Title(text="Document Title", element_id="title_1"),
        NarrativeText(text="Some narrative", element_id="narr_1"),
        ListItem(text="Item 1", element_id="list_1"),
    ]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 106μs -> 100μs (6.12% faster)

    # Verify category IDs are assigned
    for annotation in result["annotations"]:
        pass


def test_categories_are_unique_and_sorted():
    """Test that categories are unique and sorted."""
    elements = [
        TextElement(text="Text", element_id="e1"),
        Title(text="Title", element_id="e2"),
        TextElement(text="More Text", element_id="e3"),
        NarrativeText(text="Narrative", element_id="e4"),
    ]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 117μs -> 108μs (8.57% faster)

    categories = result["categories"]
    category_names = [cat["name"] for cat in categories]


def test_image_deduplication():
    """Test that duplicate image metadata is deduplicated."""
    metadata = ElementMetadata()
    metadata.file_directory = "/path/to/docs"
    metadata.filename = "document.pdf"
    metadata.page_number = 1

    elements = [
        TextElement(text="Text 1", element_id="e1", metadata=metadata),
        TextElement(text="Text 2", element_id="e2", metadata=metadata),
        TextElement(text="Text 3", element_id="e3", metadata=metadata),
    ]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 119μs -> 112μs (6.21% faster)


def test_elements_with_none_coordinates():
    """Test handling of elements with None coordinates."""
    element = TextElement(text="No Coordinates", element_id="e1")

    codeflash_output = convert_to_coco([element])
    result = codeflash_output  # 72.9μs -> 71.1μs (2.46% faster)

    annotation = result["annotations"][0]

    image = result["images"][0]


def test_elements_with_missing_metadata_fields():
    """Test handling of elements with missing metadata fields."""
    element = TextElement(text="Minimal Meta", element_id="e1")
    # ElementMetadata has defaults, so check they're applied

    codeflash_output = convert_to_coco([element])
    result = codeflash_output  # 72.6μs -> 70.9μs (2.45% faster)

    image = result["images"][0]


def test_special_characters_in_filenames():
    """Test handling of special characters in filenames."""
    metadata = ElementMetadata()
    metadata.filename = "document_with_special_chars_@#$%.pdf"
    metadata.file_directory = "/path/with spaces/and-dashes"

    element = TextElement(text="Special Chars", element_id="e1", metadata=metadata)

    codeflash_output = convert_to_coco([element])
    result = codeflash_output  # 80.9μs -> 81.1μs (0.309% slower)

    image = result["images"][0]


def test_very_long_file_paths():
    """Test handling of very long file paths."""
    long_path = "/very/long/file/path/" + "subfolder/" * 50 + "file.pdf"
    metadata = ElementMetadata()
    metadata.file_directory = long_path

    element = TextElement(text="Long Path", element_id="e1", metadata=metadata)

    codeflash_output = convert_to_coco([element])
    result = codeflash_output  # 78.2μs -> 75.8μs (3.16% faster)

    image = result["images"][0]


def test_empty_contributors_tuple():
    """Test with empty contributors tuple."""
    codeflash_output = convert_to_coco([], contributors=())
    result = codeflash_output  # 42.8μs -> 45.3μs (5.69% slower)


def test_none_dataset_description():
    """Test that None dataset_description uses default."""
    codeflash_output = convert_to_coco([], dataset_description=None)
    result = codeflash_output  # 42.5μs -> 46.1μs (7.94% slower)

    # Should use default description with today's date
    today = datetime.now().strftime("%Y-%m-%d")


def test_element_id_consistency():
    """Test that element IDs are preserved in annotations."""
    element_ids = ["custom_id_1", "custom_id_2", "custom_id_3"]
    elements = [TextElement(text=f"Text {i}", element_id=eid) for i, eid in enumerate(element_ids)]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 112μs -> 105μs (5.86% faster)

    for i, annotation in enumerate(result["annotations"]):
        pass


def test_image_id_assignment():
    """Test that image IDs are correctly assigned starting from 1."""
    metadata1 = ElementMetadata()
    metadata1.filename = "file1.pdf"
    metadata2 = ElementMetadata()
    metadata2.filename = "file2.pdf"

    elements = [
        TextElement(text="Text 1", element_id="e1", metadata=metadata1),
        TextElement(text="Text 2", element_id="e2", metadata=metadata2),
    ]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 100μs -> 94.6μs (6.32% faster)

    images = result["images"]
    for i, image in enumerate(images):
        pass


def test_category_id_assignment():
    """Test that category IDs are correctly assigned starting from 1."""
    codeflash_output = convert_to_coco([])
    result = codeflash_output  # 42.7μs -> 45.4μs (6.03% slower)

    categories = result["categories"]
    for i, category in enumerate(categories):
        pass


def test_large_number_of_elements():
    """Test conversion of 500+ elements."""
    elements = [TextElement(text=f"Text content {i}", element_id=f"elem_{i}") for i in range(500)]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 6.81ms -> 5.31ms (28.2% faster)


def test_large_number_of_unique_images():
    """Test conversion with 500+ unique images."""
    elements = []
    for i in range(500):
        metadata = ElementMetadata()
        metadata.filename = f"document_{i}.pdf"
        metadata.page_number = i
        element = TextElement(text=f"Text {i}", element_id=f"elem_{i}", metadata=metadata)
        elements.append(element)

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 8.44ms -> 6.91ms (22.1% faster)
    for i, image in enumerate(result["images"]):
        pass


def test_large_scale_many_elements_same_image():
    """Test many elements (500+) pointing to same image."""
    shared_metadata = ElementMetadata()
    shared_metadata.filename = "shared_document.pdf"

    elements = [
        TextElement(text=f"Text {i}", element_id=f"elem_{i}", metadata=shared_metadata)
        for i in range(500)
    ]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 7.05ms -> 5.54ms (27.2% faster)


def test_performance_info_section_large_contributors():
    """Test info section generation with many contributors."""
    contributors = tuple(f"Contributor_{i}" for i in range(100))

    codeflash_output = convert_to_coco([], contributors=contributors)
    result = codeflash_output  # 45.3μs -> 48.6μs (6.84% slower)

    expected_contributors = ",".join(contributors)


def test_large_scale_category_consistency():
    """Test that categories remain consistent across large dataset."""
    elements = [
        (
            Title(text=f"Title {i}", element_id=f"title_{i}")
            if i % 3 == 0
            else (
                NarrativeText(text=f"Narrative {i}", element_id=f"narr_{i}")
                if i % 3 == 1
                else ListItem(text=f"Item {i}", element_id=f"item_{i}")
            )
        )
        for i in range(300)
    ]

    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 4.12ms -> 3.25ms (26.9% faster)

    # All annotations should have valid category IDs
    category_ids = set(cat["id"] for cat in result["categories"])
    for ann in result["annotations"]:
        pass

from unstructured.documents.elements import DataSourceMetadata, ElementMetadata, ListItem
from unstructured.staging.base import convert_to_coco


def test_convert_to_coco():
    convert_to_coco(
        ListItem(
            "",
            element_id=None,
            coordinates=None,
            coordinate_system=None,
            metadata=ElementMetadata(
                attached_to_filename="",
                bcc_recipient=None,
                category_depth=None,
                cc_recipient=None,
                coordinates=None,
                data_source=DataSourceMetadata(
                    url=None,
                    version="",
                    record_locator={"": 0},
                    date_created="",
                    date_modified=None,
                    date_processed="",
                    permissions_data=[{}, {}],
                ),
                detection_class_prob=float("inf"),
                emphasized_text_contents=[""],
                emphasized_text_tags=[],
                file_directory=None,
                filename="/",
                filetype="",
                header_footer_type=None,
                image_base64="",
                image_mime_type="",
                image_url="",
                image_path=None,
                is_continuation=None,
                languages=["\x00"],
                last_modified=None,
                link_start_indexes=[0],
                link_texts=None,
                link_urls=[],
                links=None,
                email_message_id="",
                orig_elements=None,
                page_name="",
                page_number=None,
                parent_id=None,
                sent_from=None,
                sent_to=[],
                signature="",
                subject="",
                table_as_cells=None,
                text_as_html=None,
                url=None,
            ),
            detection_origin="",
            embeddings=[0.0],
        ),
        dataset_description=None,
        dataset_version="",
        contributors="",
    )


def test_convert_to_coco_2():
    convert_to_coco((), dataset_description="\x00", dataset_version="", contributors="")

🔎 Click to see Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `codeflash_concolic_xdo_puqm/tmpzl0yp0n_/test_concolic_coverage.py::test_convert_to_coco_2` | 35.4μs | 38.5μs | -8.12%⚠️ |

To edit these changes, `git checkout codeflash/optimize-convert_to_coco-mks1ib2s` and push.


codeflash-ai bot requested a review from aseembits93 January 24, 2026 08:20

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to Codeflash) labels Jan 24, 2026

codeflash-ai[bot] wants to merge 1 commit into `main` from `codeflash/optimize-convert_to_coco-mks1ib2s`
