
[BUG] Frame-level view aggregations exceed MongoDB pipeline size limit, causing OperationFailure #5453

Open
Emelian opened this issue Jan 31, 2025 · 2 comments
Labels
bug Bug fixes

Comments

@Emelian

Emelian commented Jan 31, 2025

Describe the problem

The error consistently appears with datasets that have extensive frame data.

When using filters that require frame data (exists, match_frames, and others), FiftyOne attempts to aggregate all frame documents from the dataset. If there is a large amount of frame data, MongoDB throws an error about exceeding its 104857600-byte (100 MB) limit. This makes any operation on such a view (e.g., len(view)) fail, and the issue cannot be resolved at the user-code level.

Traceback (most recent call last):
  File "/builds/analytics/networks/datasets/src/annotator.py", line 90, in main
    logger.info(f"Annotating {len(samples)} videos without {label_field} field")
                              ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fiftyone/core/view.py", line 102, in __len__
    return self.count()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fiftyone/core/collections.py", line 7748, in count
    return self._make_and_aggregate(make, field_or_expr)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fiftyone/core/collections.py", line 10419, in _make_and_aggregate
    return self.aggregate(make(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fiftyone/core/collections.py", line 10112, in aggregate
    _results = foo.aggregate(self._dataset._sample_collection, pipelines)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fiftyone/core/odm/database.py", line 346, in aggregate
    result = collection.aggregate(pipelines[0], allowDiskUse=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/collection.py", line 2937, in aggregate
    return self._aggregate(
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/_csot.py", line 120, in csot_wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/collection.py", line 2845, in _aggregate
    return self._database.client._retryable_read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/mongo_client.py", line 1863, in _retryable_read
    return self._retry_internal(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/_csot.py", line 120, in csot_wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/mongo_client.py", line 1830, in _retry_internal
    ).run()
      ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/mongo_client.py", line 2554, in run
    return self._read() if self._is_read else self._write()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/mongo_client.py", line 2697, in _read
    return self._func(self._session, self._server, conn, read_pref)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/aggregation.py", line 164, in get_cursor
    result = conn.command(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/helpers.py", line 45, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/pool.py", line 538, in command
    return command(
           ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pymongo/synchronous/network.py", line 218, in command
    helpers_shared._check_command_response(
  File "/usr/local/lib/python3.12/dist-packages/pymongo/helpers_shared.py", line 247, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: PlanExecutor error during aggregation :: caused by :: Total size of documents in frames.samples.678b5d532a5565037d9de50e matching pipeline's $lookup stage exceeds 104857600 bytes, full error: {'ok': 0.0, 'errmsg': "PlanExecutor error during aggregation :: caused by :: Total size of documents in frames.samples.678b5d532a5565037d9de50e matching pipeline's $lookup stage exceeds 104857600 bytes", 'code': 4568, 'codeName': 'Location4568', '$clusterTime': {'clusterTime': Timestamp(1738175668, 25), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1738175668, 25)}

This is a simplified example; in practice, the filters may vary. The key factor is that the dataset has a large amount of frame data.
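As a rough sanity check, the dataset counts reported in the "Extra info" section below suggest how little per-frame budget the limit leaves. This is a back-of-envelope sketch only; exactly how MongoDB accounts the limit per $lookup invocation depends on the server version:

```python
# Back-of-envelope check using the dataset counts reported in "Extra info"
# (illustrative only; the limit applies to documents matched by a $lookup stage)
limit_bytes = 104_857_600  # the 100 MB limit from the error message
num_videos = 108
num_frames = 928_747

frames_per_video = num_frames / num_videos          # ~8,600 frames per video
budget_per_frame = limit_bytes / frames_per_video   # ~12 KB per frame document

print(round(frames_per_video), round(budget_per_frame))
```

With two label fields and nearly a million labels, frame documents can easily exceed a per-frame budget of that order once several samples' frames are looked up together.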

System Information

  • OS Platform and Distribution: Linux
  • Python version: 3.12
  • FiftyOne version: 1.3.0
  • FiftyOne installed from: pip

Extra info

Total count of videos in dataset: 108
Total count of frames in dataset: 928747
Total count of labels (in 2 label fields) in dataset: 950372

@Emelian Emelian added the bug Bug fixes label Jan 31, 2025
@Emelian Emelian changed the title [BUG] [BUG] Frame-level view aggregations exceed MongoDB pipeline size limit, causing OperationFailure Feb 1, 2025
@smartnet-club

smartnet-club commented Feb 1, 2025

I can confirm that this error is reproducible on video datasets with a sufficiently large number of detections.

The code:

samples = dataset.match_frames(F("yolov8x-world-test").is_null())
len(samples)

produces mongo request:

      "pipeline": [
        {
          "$lookup": {
            "from": "frames.samples.672b50348af9bde1a02499b0",
            "let": {"sample_id": "$_id"},
            "pipeline": [
              {
                "$match": {"$expr": {"$eq": ["$$sample_id", "$_sample_id"]}}
              },
              {"$sort": {"frame_number": 1}}
            ],
            "as": "frames"}
        },
        {
          "$addFields": {"frames": {"$filter": {"input": "$frames", "as": "this", "cond": {"$not": {"$gt": ["$$this.yolov8x-world-test"]}}}}}
        },
        {
          "$match": {"$expr": {"$gt": [{"$size": {"$ifNull": ["$frames"]}}]}}
        },
        {"$count": "count"}
      ]

The $lookup stage of the pipeline aggregates all frame documents from frames.samples.672b50348af9bde1a02499b0 into each sample document of samples.672b50348af9bde1a02499b0, which is what triggers the size limit.
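To make the pipeline's behavior concrete, here is a minimal in-memory sketch of what those stages do, using plain Python dicts in place of MongoDB documents (the field name "label" and the sample data are illustrative, not from the dataset above):

```python
# Plain-Python simulation of the aggregation pipeline's stages
samples = [{"_id": 1}, {"_id": 2}]
frames = [
    {"_sample_id": 1, "frame_number": 1, "label": None},
    {"_sample_id": 1, "frame_number": 2, "label": "car"},
    {"_sample_id": 2, "frame_number": 1, "label": "dog"},
]

# $lookup: embed every matching frame document into its parent sample,
# sorted by frame_number -- this is the step that can blow past the 100 MB limit
for sample in samples:
    sample["frames"] = sorted(
        (f for f in frames if f["_sample_id"] == sample["_id"]),
        key=lambda f: f["frame_number"],
    )

# $addFields + $filter: keep only frames where the label field is unset
for sample in samples:
    sample["frames"] = [f for f in sample["frames"] if f["label"] is None]

# $match + $count: count samples that have at least one matching frame
count = sum(1 for s in samples if len(s["frames"]) > 0)
print(count)
```

Note that the full frame documents are materialized into each sample before the filter runs, so the memory cost is paid even for frames that are ultimately discarded.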

@benjaminpkane
Contributor

Hi @smartnet-club. I am working on a proposal that aims to fix this. This week, hopefully.
