Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diracx demo crashes #274

Open
martynia opened this issue Jul 9, 2024 · 27 comments
Open

Diracx demo crashes #274

martynia opened this issue Jul 9, 2024 · 27 comments

Comments

@martynia
Copy link

martynia commented Jul 9, 2024

Hello,
The demo crashes w/o any interactions from my client. A freshly cloned demo:

2024-07-09 08:09:37,364 - ERROR: Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 732, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/diracx_source/diracx/diracx-routers/src/diracx/routers/fastapi_classes.py", line 35, in lifespan
await asyncio.gather(
File "/opt/conda/lib/python3.11/contextlib.py", line 650, in enter_async_context
result = await _enter(cm)
^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/diracx_source/diracx/diracx-routers/src/diracx/routers/job_manager/sandboxes.py", line 62, in lifetime_function
if not await s3_bucket_exists(self._client, self.bucket_name):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/diracx_source/diracx/diracx-core/src/diracx/core/s3.py", line 29, in s3_bucket_exists
return await _s3_exists(s3_client.head_bucket, Bucket=bucket_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/diracx_source/diracx/diracx-core/src/diracx/core/s3.py", line 39, in _s3_exists
await method(**kwargs)
File "/opt/conda/lib/python3.11/site-packages/aiobotocore/client.py", line 411, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadBucket operation: Forbidden

@martynia
Copy link
Author

martynia commented Jul 9, 2024

@chrisburr @chaen.

@chaen
Copy link
Contributor

chaen commented Jul 9, 2024

It's expected upon creation, the pod should redeploy itself and everything should be fine then

@martynia
Copy link
Author

martynia commented Jul 9, 2024

OK, but shouldn't the success be visible in the logs ? kubectl get pods shows the demo pod ready 1/1
But the logs:
INFO: Stopping reloader process [1]

How long do I have to wait for redeployments ? 10 minutes have passed now, and the logs don't show any sign of life.
I'm pretty sure I waited longer that 10 minutes with no success.

@martynia
Copy link
Author

martynia commented Jul 9, 2024

I think it is properly dead. I has never restarted after 15:07 GMT

@chaen
Copy link
Contributor

chaen commented Jul 10, 2024

what does kubectl get pods show ?

@martynia
Copy link
Author

martynia commented Jul 10, 2024

martynia@Januszs-Laptop Dirac % kubectl get pods                               
NAME                                                   READY   STATUS    RESTARTS      AGE
diracx-demo-76dbcdd78f-hc8fj                           1/1     Running   1 (20h ago)   20h
diracx-demo-cert-manager-7469966f7c-2qrd5              1/1     Running   0             20h
diracx-demo-cert-manager-cainjector-6db6d7d979-87hw6   1/1     Running   0             20h
diracx-demo-cert-manager-webhook-c57d6d6b7-x6spc       1/1     Running   0             20h
diracx-demo-cli-5ff5cfd9bd-dwdmq                       1/1     Running   0             20h
diracx-demo-dex-8ccfd8564-xcqm7                        1/1     Running   0             20h
diracx-demo-minio-5674bcbd44-z4s6g                     1/1     Running   0             20h
diracx-demo-mysql-0                                    1/1     Running   0             20h
diracx-demo-rabbitmq-0                                 1/1     Running   0             20h
diracx-demo-web-6bfdcb8578-fq74k                       1/1     Running   0             20h
elasticsearch-master-0                                 1/1     Running   1 (18h ago)   20h
opensearch-cluster-master-0                            1/1     Running   0             20h

@chaen
Copy link
Contributor

chaen commented Jul 10, 2024

(don't forget triple backtick :-)

To me it looks fine. What is the problem ? Everything is running

@martynia
Copy link
Author

OK, I added an access policy to my router. As I understand from the docs the policy is activated by adding a line:
PilotLogsAccessPolicy = "diracx.routers.pilot_logging.access_policies:PilotLogsAccessPolicy"
into pyproject.toml, this is where I spotted a sandbox policy, so I followed. I', assuming that if I comment that line in the toml file above I deactivate the policy.

Now I'm getting 500 on the client side, w/o any json doc I could decode. The server does not print anything. So it is a blind walk. I managed to have 1 successfull response in the past, followed by 500 at the second attempt. That's why I was concerned about the error message from the server, which looked like dead. Any hints how I could trace that 500 on the server side ?

@chaen
Copy link
Contributor

chaen commented Jul 10, 2024

any change in pyproject.toml requires a reinstall of the package, so it is not really suited for run_demo.sh (you could rerun the deployment, but if you are unsure about what you are doing with kubernetes, better just restart the demo, as changing pyproject.toml while coding is relatively rare.

That being said, if you see error 500, there should be something in the logs.

kubectl logs diracx-demo-76dbcdd78f-hc8fj does not give anything ?

@martynia
Copy link
Author

It's a different pod now. It just gives the last message when it died (this is a tail of it, mind the time):

2024-07-10 12:08:27,113 - ERROR:    Application startup failed. Exiting.
INFO:     Stopping reloader process [1]

@chaen
Copy link
Contributor

chaen commented Jul 10, 2024

the --previous should give you logs of the previous pod.

@martynia
Copy link
Author

Confusions here. The current pod is with my pilot logger service, which used to work in Lyon. Now I'm not sure if my policy is active or not, is the change in the tool file taken into account ?

My client is not suitable for a fresh demo.

@martynia
Copy link
Author

Could we have a chat on zoom, it will be quicker.

@chaen
Copy link
Contributor

chaen commented Jul 10, 2024

Confusions here. The current pod is with my pilot logger service, which used to work in Lyon. Now I'm not sure if my policy is active or not, is the change in the tool file taken into account ?
My client is not suitable for a fresh demo.

I am awfully confused by this statement

Could we have a chat on zoom, it will be quicker.

Connect on mattermost

@martynia
Copy link
Author

Can't do lattermost, my CERN account has been revoked.

@martynia
Copy link
Author

you say:
but if you are unsure about what you are doing with kubernetes, better just restart the demo, as changing pyproject.toml while coding is relatively rare.

I suspect that adding a policy could be a problem. If I restart the demo with a different .toml, will it be taken into account ?

@chaen
Copy link
Contributor

chaen commented Jul 10, 2024

Can't do lattermost, my CERN account has been revoked.
You'll need that eventually ..

If I restart the demo with a different .toml, will it be taken into account ?

Restarting the demo will indeed reinstall diracx, so take the new pyproject.toml into account.

Alternatively, I think that kubectl rollout restart deployment diracx should also do and will be faster

@martynia
Copy link
Author

Success!, but only once. The second run returns 500 and no logs at the server side.

/Users/martynia/mambaforge/envs/diracx-dev/bin/python /Users/martynia/Dirac/FastApiTutorial/logger_client.py

Returned message: b'{"pilot_stamp":"78f39a90-2073-11ec-98d7-b496913c0cf4","lines":[{"line_no":0,"line":"2022-02-23 13:48:35.123456 UTC DEBUG [PilotParams] JSON file loaded: pilot.json"},{"line_no":1,"line":"2022-02-23 13:48:36.123456 UTC DEBUG [PilotParams] JSON file analysed: pilot.json"}],"vo":"diracPilot"}'
{'pilot_stamp': '78f39a90-2073-11ec-98d7-b496913c0cf4', 'lines': [{'line_no': 0, 'line': '2022-02-23 13:48:35.123456 UTC DEBUG [PilotParams] JSON file loaded: pilot.json'}, {'line_no': 1, 'line': '2022-02-23 13:48:36.123456 UTC DEBUG [PilotParams] JSON file analysed: pilot.json'}], 'vo': 'diracPilot'}

Process finished with exit code 0

@martynia
Copy link
Author

But as I said before, sometimes 500 comes at the first attempt.

@martynia
Copy link
Author

The router has some logs at the warning level, but nothing came out even when there was a successful connection .

@chaen
Copy link
Contributor

chaen commented Jul 10, 2024

I am afraid it is going to be very difficult to help you debug like that. You'll need to add a lot of printout to see where things crash.

@martynia
Copy link
Author

The problem is that there is no logs at the server side. My token expired, I have 401 now (expected), rather than 500, but the server logs show:
2024-07-10 14:50:01,147 - ERROR: Application startup failed. Exiting. INFO: Stopping reloader process [1]

And this is a message from yesterday. Are you not surprised that there are no newer messages at all ?

@chaen
Copy link
Contributor

chaen commented Jul 11, 2024

very much yes. Please get back your CERN account so we can do a bit of interactive debugging on mattermost and/or zoom

@martynia
Copy link
Author

I sent you an email, hopefully to the correct account.

@martynia
Copy link
Author

Getting the account back could take a week. Details in the email.

@chaen
Copy link
Contributor

chaen commented Jul 12, 2024

So, my findings:

With that, you should be good to go

@martynia
Copy link
Author

martynia commented Jul 12, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants