Job Manager crash on ApiResponseError #632

Open
JorisCod opened this issue Sep 25, 2024 · 3 comments

Comments

@JorisCod

JorisCod commented Sep 25, 2024

The job manager crashed on some ApiResponseError:

The error:

  File "/data/users/Private/joris.c/lcfm-production/notebooks/sentinel1-jm.py", line 92, in <module>
    job_manager.run_jobs(
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/extra/job_management.py", line 365, in run_jobs
    self._launch_job(start_job, df, i, backend_name)
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/extra/job_management.py", line 400, in _launch_job
    job = start_job(
          ^^^^^^^^^^
  File "/data/users/Private/joris.c/lcfm-production/src/sentinel1/pipeline.py", line 87, in start_job
    secondary_result = result_datacube.result_node()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 1764, in create_job
    response = self.post("/jobs", json=pg_with_metadata, expected_status=201)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 249, in post
    return self.request("post", path=path, json=json, allow_redirects=False, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 816, in request
    return _request()
           ^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 809, in _request
    return super(Connection, self).request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 187, in request
    self._raise_api_error(resp)
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 207, in _raise_api_error
    raise OpenEoApiError(
openeo.rest.OpenEoApiError: [500] Internal: Server error: EjrApiResponseError('EJR API error: 500 \'Internal Server Error\' on `POST \'https://jobregistry.vgt.vito.be/jobs\'`: {"statusCode":500,"message":"Could not store jobs in database: illegal_argument_exception - no write index is defined for alias [openeo-jobs-prod]. The write index may be explicitly disabled using is_write_index=false or the alias points to multiple indices without one being designated as a write index"}') (ref: r-24092514fc8a4f3c96a98bd6ec4230c2)


@soxofaan
Member

This is a back-end issue: creation of the job failed there.

What we can do client side in job manager:

  • given that this is a server-side "500" error, we could retry a couple of times (with some wait time in between)
  • but in the end there is no guarantee that it will eventually work, so I guess we should ultimately mark the job as failed and continue with the other jobs (though these might all fail as well)
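
To make the two options above concrete, here is a minimal sketch of "retry a few times, then mark as failed and move on". It is not the job manager's actual implementation: `ServerError`, `call_with_retry`, `launch_or_mark_failed` and the `"start_failed"` status label are hypothetical names used only for illustration. In the client, the exception to catch would presumably be the `openeo.rest.OpenEoApiError` seen in the traceback.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, status_code: int, message: str = ""):
        super().__init__(f"[{status_code}] {message}")
        self.status_code = status_code


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on 5xx ServerError with a linearly growing wait.

    Assumes max_attempts >= 1.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ServerError as e:
            # Only retry server-side (5xx) errors; give up on the last attempt.
            if not 500 <= e.status_code < 600 or attempt == max_attempts:
                raise
            sleep(wait_seconds * attempt)
    raise AssertionError("unreachable when max_attempts >= 1")


def launch_or_mark_failed(start_job: Callable[[], str], **retry_kwargs) -> dict:
    """Return a status record instead of letting one failure crash the whole run."""
    try:
        job_id = call_with_retry(start_job, **retry_kwargs)
        return {"status": "created", "id": job_id}
    except ServerError as e:
        # Retries exhausted (or non-retryable error): record it and carry on.
        return {"status": "start_failed", "id": None, "error": str(e)}
```

The `sleep` parameter is injectable so the wait can be skipped in tests. A caller would loop over its jobs, store each returned record, and continue with the next job regardless of the outcome.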

@JorisCod
Author

JorisCod commented Sep 26, 2024

I had another similar 500 error, this time an OidcException:

Traceback (most recent call last):
  File "/data/users/Private/joris.c/lcfm-production/notebooks/sentinel1-jm.py", line 88, in <module>
    job_manager.run_jobs(
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/extra/job_management.py", line 365, in run_jobs
    self._launch_job(start_job, df, i, backend_name)
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/extra/job_management.py", line 414, in _launch_job
    status = job.status()
             ^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/job.py", line 87, in status
    return self.describe().get("status", "N/A")
           ^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/job.py", line 77, in describe
    return self.connection.get(f"/jobs/{self.job_id}", expected_status=200).json()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 239, in get
    return self.request("get", path=path, stream=stream, auth=auth, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 816, in request
    return _request()
           ^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 809, in _request
    return super(Connection, self).request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 187, in request
    self._raise_api_error(resp)
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/openeo/rest/connection.py", line 207, in _raise_api_error
    raise OpenEoApiError(
openeo.rest.OpenEoApiError: [500] Internal: Server error: OidcException('Failed to retrieve access token at \'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token\': 500 \'Internal Server Error\' \'{"error":"unknown_error"}\'') (ref: r-240925ae6e434800b096226d2de87769)


@JorisCod
Author

JorisCod commented Oct 24, 2024

With v0.32.0, I just got this error:

2024-10-24 07:19:34.070 | INFO     | sentinel1.pipeline:start_job:120 - Starting Job: j-2410247e2d7f46239f21d65252bebc31 
{'executor_memory': '2G', 'executor_memoryOverhead': '1G', 'driver-memory': '3G', 'driver-memoryOverhead': '1G', 'python-memory': '16m', 'max-executors': 10, 'executor-memory': '1G', 'executor-memoryOverhead': '1G'}
urllib3.exceptions.ResponseError: too many 503 error responses

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/urllib3/connectionpool.py", line 944, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/urllib3/connectionpool.py", line 944, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/urllib3/connectionpool.py", line 944, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  [Previous line repeated 2 more times]
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/urllib3/connectionpool.py", line 934, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joris.c/mambaforge/envs/lcfm-production/lib/python3.11/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='openeo.dataspace.copernicus.eu', port=443): Max retries exceeded with url: /openeo/1.2/jobs/j-241024b74812415ca0e89a771dc4ca22 (Caused by ResponseError('too many 503 error responses'))
