@codeflash-ai codeflash-ai bot commented Oct 14, 2025

📄 24% (0.24x) speedup for get_unicode_from_response in src/requests/utils.py

⏱️ Runtime: 117 microseconds → 94.6 microseconds (best of 273 runs)

📝 Explanation and details

The optimization avoids expensive header parsing in most common cases. The key change is a fast path in get_encoding_from_headers, which get_unicode_from_response calls to determine the encoding.

What changed:

  • Added upfront case-insensitive string operations (content_type.lower()) to check for charset= directly in the header string
  • For simple headers with charset, extracts encoding using basic string splitting (split("charset=", 1)) instead of full header parsing
  • Only falls back to the expensive _parse_content_type_header() function for complex headers that the fast-path cannot handle
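
The PR's actual diff is not shown in this comment, so the following is only a minimal sketch of the fast-path idea described in the list above. The names charset_fast_path, parse_params_slow, and get_encoding are hypothetical stand-ins, not the requests API, and the sketch leaves out the default encodings that the real function applies to text and JSON content types.

```python
def parse_params_slow(content_type):
    """Hypothetical stand-in for a full parser: split on ';' and build a dict."""
    parts = content_type.split(";")
    strip_chars = "\"' "
    params = {}
    for part in parts[1:]:
        part = part.strip()
        if not part:
            continue  # tolerate empty segments such as 'text/plain;;;'
        key, sep, value = part.partition("=")
        params[key.strip(strip_chars).lower()] = value.strip(strip_chars) if sep else True
    return parts[0].strip(), params


def charset_fast_path(content_type):
    """Return the charset of a simple header, or None if the fast path does not apply."""
    lowered = content_type.lower()
    idx = lowered.find("charset=")
    # Bail out if there is no literal 'charset=' or if lower-casing changed the
    # string length (rare non-ASCII input), since the offset would then be unreliable.
    if idx == -1 or len(lowered) != len(content_type):
        return None
    # Slice the original string so the value keeps its original case.
    value = content_type[idx + len("charset="):].split(";", 1)[0]
    return value.strip("\"' ") or None


def get_encoding(content_type):
    """Try the cheap string check first and only then the full parse."""
    encoding = charset_fast_path(content_type)
    if encoding is not None:
        return encoding
    _, params = parse_params_slow(content_type)
    charset = params.get("charset")
    return charset if isinstance(charset, str) and charset else None


print(get_encoding("text/plain; foo=bar; charset=utf-8; baz=qux"))
print(get_encoding("text/plain; charset = utf-8"))  # handled by the fallback
```

One trade-off of a substring check like this is that it would also match a parameter whose name merely ends in "charset" (for example a hypothetical x-charset=...), so a production version would need to guard against that or defer such headers to the full parser.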

Why this is faster:
The original code always called _parse_content_type_header(), which performs full parameter parsing with multiple string operations, dictionary building, and edge-case handling. According to the line profiler, this call accounted for 73% of the total execution time in the original version.

With the optimization, that share drops to 9.7%, incurred only by the 3 cases that still need full parsing. The other 20 of the 23 test cases are handled with simple string operations that are much faster than full parsing.

Performance gains by test type:

  • Best gains (35-72% faster): Headers with charset parameters, especially those with quotes, extra parameters, or case variations
  • Good gains (20-35% faster): Standard text/JSON content with explicit or default charsets
  • Minimal impact: Binary content types without a charset, or a missing content-type header (roughly 1-9% slower in the tests below)

The optimization is particularly effective for typical HTTP responses where charset is explicitly specified or content-type follows standard patterns.
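
The percentages quoted in this comment appear to follow from the paired runtimes as original runtime divided by optimized runtime, minus one. That formula is an assumption inferred from the numbers, not something the comment states, and small differences from the reported per-test figures are expected because the displayed runtimes are rounded. A short sketch with a hypothetical helper named speedup:

```python
def speedup(original, optimized):
    """Relative speedup as a fraction: positive means faster, negative means slower."""
    return original / optimized - 1.0


# Overall figures from the summary line (microseconds).
overall = speedup(117, 94.6)
print(f"overall: {overall:.1%}")

# One of the slower cases from the generated tests (microseconds).
binary_case = speedup(6.62, 7.24)
print(f"binary content: {binary_case:.1%}")
```

Under this assumption the overall figure rounds to the 24% in the headline, and a negative result corresponds to the cases reported as slower.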

Correctness verification report:

Test | Status
⚙️ Existing Unit Tests | 🔘 None Found
🌀 Generated Regression Tests | 22 Passed
⏪ Replay Tests | 🔘 None Found
🔎 Concolic Coverage Tests | 🔘 None Found
📊 Tests Coverage | 100.0%
🌀 Generated Regression Tests and Runtime
import warnings

# imports
import pytest  # used for our unit tests
from requests.compat import str
from requests.utils import get_unicode_from_response

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.


# Helper: Minimal Response object for testing
class DummyResponse:
    def __init__(self, content, headers):
        self.content = content
        self.headers = headers

# ---- UNIT TESTS ----

# --- BASIC TEST CASES ---

def test_ascii_content_with_explicit_utf8_header():
    # ASCII bytes, UTF-8 header
    r = DummyResponse(b'hello world', {'content-type': 'text/plain; charset=utf-8'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 5.47μs -> 3.96μs (38.4% faster)

def test_utf8_content_with_utf8_header():
    # UTF-8 bytes, UTF-8 header
    text = 'café'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'text/plain; charset=utf-8'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 5.43μs -> 4.05μs (34.1% faster)

def test_latin1_content_with_latin1_header():
    # Latin-1 bytes, Latin-1 header
    text = 'café'
    r = DummyResponse(text.encode('latin1'), {'content-type': 'text/plain; charset=latin1'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 4.78μs -> 3.54μs (35.3% faster)

def test_json_content_without_charset():
    # JSON content, no charset -> should default to utf-8
    text = '{"message": "привет"}'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'application/json'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 4.15μs -> 3.32μs (25.2% faster)

def test_text_content_without_charset():
    # Text content, no charset -> should default to ISO-8859-1
    text = 'café'
    r = DummyResponse(text.encode('latin1'), {'content-type': 'text/plain'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 3.43μs -> 2.61μs (31.4% faster)

# --- EDGE TEST CASES ---

def test_no_content_type_header():
    # No content-type header: encoding is None, so str(r.content, None, errors='replace') raises TypeError and the raw bytes are returned
    text = 'naïve'
    r = DummyResponse(text.encode('utf-8'), {})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 3.31μs -> 3.34μs (0.808% slower)


def test_bytes_content_with_non_text_header():
    # Binary content, non-text header, no charset
    binary = b'\x89PNG\r\n\x1a\n'
    r = DummyResponse(binary, {'content-type': 'application/octet-stream'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 6.62μs -> 7.24μs (8.60% slower)

def test_content_type_with_quotes_in_charset():
    # Charset parameter has quotes
    text = 'café'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'text/plain; charset="utf-8"'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 6.40μs -> 4.79μs (33.4% faster)

def test_content_type_with_single_quotes_in_charset():
    # Charset parameter has single quotes
    text = 'café'
    r = DummyResponse(text.encode('utf-8'), {'content-type': "text/plain; charset='utf-8'"})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 5.18μs -> 3.95μs (31.2% faster)

def test_content_type_with_extra_parameters():
    # Extra parameters in content-type
    text = 'hello'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'text/plain; foo=bar; charset=utf-8; baz=qux'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 6.09μs -> 3.54μs (72.0% faster)

def test_content_type_with_uppercase_charset():
    # Charset parameter is uppercase
    text = 'café'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'text/plain; CHARSET=UTF-8'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 4.88μs -> 3.59μs (35.9% faster)

def test_content_type_with_spaces_around_equals():
    # Spaces around '=' in charset param
    text = 'café'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'text/plain; charset = utf-8'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 4.89μs -> 2.87μs (70.5% faster)

def test_content_type_with_no_charset_and_non_text_type():
    # No charset, non-text type, should not default to ISO-8859-1 or utf-8
    binary = b'\x00\x01\x02\x03'
    r = DummyResponse(binary, {'content-type': 'application/octet-stream'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 4.59μs -> 5.01μs (8.32% slower)

def test_content_type_with_semicolon_but_no_params():
    # Content-type with semicolon but no params
    text = 'hello'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'text/plain;'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 3.70μs -> 2.72μs (36.1% faster)

def test_content_type_with_multiple_semicolons():
    # Content-type with multiple semicolons and empty params
    text = 'hello'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'text/plain;;; ; ;'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 3.97μs -> 2.56μs (55.3% faster)

def test_content_type_with_charset_and_invalid_content():
    # Charset is utf-8, but content is not valid utf-8
    r = DummyResponse(b'\xff\xff\xff', {'content-type': 'text/plain; charset=utf-8'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 8.32μs -> 6.88μs (20.9% faster)

def test_content_type_with_charset_and_partial_valid_content():
    # Charset is utf-8, but content is partially valid
    valid = 'abc'
    invalid = b'\xff'
    r = DummyResponse(valid.encode('utf-8') + invalid, {'content-type': 'text/plain; charset=utf-8'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 6.77μs -> 5.59μs (21.0% faster)


def test_large_utf8_content():
    # Large UTF-8 content, should decode correctly
    text = '😀' * 1000  # 1000 emoji
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'text/plain; charset=utf-8'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 9.74μs -> 8.62μs (13.0% faster)

def test_large_latin1_content():
    # Large Latin-1 content, should decode correctly
    text = 'é' * 1000
    r = DummyResponse(text.encode('latin1'), {'content-type': 'text/plain; charset=latin1'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 5.54μs -> 4.22μs (31.3% faster)

def test_large_binary_content_with_no_charset():
    # Large binary content, no charset, non-text type
    binary = bytes(range(256)) * 4  # 1024 bytes
    r = DummyResponse(binary, {'content-type': 'application/octet-stream'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 5.13μs -> 5.46μs (6.01% slower)

def test_large_json_content_without_charset():
    # Large JSON content, no charset, should default to utf-8
    text = '{"data": "' + 'x' * 900 + '"}'
    r = DummyResponse(text.encode('utf-8'), {'content-type': 'application/json'})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 3.69μs -> 3.04μs (21.7% faster)

def test_many_headers_and_large_content():
    # Many headers, large content, ensure only content-type matters
    text = 'hello world' * 80
    headers = {f'X-Header-{i}': f'value{i}' for i in range(50)}
    headers['content-type'] = 'text/plain; charset=utf-8'
    r = DummyResponse(text.encode('utf-8'), headers)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = get_unicode_from_response(r); result = codeflash_output # 5.01μs -> 3.70μs (35.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-get_unicode_from_response-mgpvpwol, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 14, 2025 01:24
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 14, 2025