log settings in logging_on() not inherited by dask distributed tasks #209

Open

gerritholl opened this issue Aug 1, 2024 · 1 comment

Labels: bug (Something isn't working)

gerritholl (Member) commented Aug 1, 2024

Describe the bug

The log settings defined by logging_on(), and therefore used by any trollflow2 process, are not inherited by tasks scheduled with dask.distributed when logging_on() is called inside an if __name__ == "__main__" block, which is how trollflow2 is normally run. As a consequence, only messages of level WARNING or above are displayed, and without the configured log formatting.

To Reproduce

import logging
import dask
import dask.array as da
import numpy as np
from dask.distributed import Client
from trollflow2.logging import logging_on

logger = logging.getLogger(__name__)

@dask.delayed
def test():
    logger.warning("xilx xilx")
    logger.info("qral larq")
    logger.debug("coic cioc")
    return np.array([0])

def main():
    arr = da.from_delayed(test(), shape=(1,), dtype="uint16")
    print(arr.compute())

if __name__ == "__main__":
    with logging_on():
        with Client():
            main()

Expected behavior

I expect all log messages to be shown, as they are when using the default scheduler.

Actual results

Full output (stderr + stdout):

[DEBUG: 2024-08-01 12:50:08 : asyncio] Using selector: EpollSelector
xilx xilx
[0]

For reference, when leaving out the Client() context manager, the output is as expected:

[WARNING: 2024-08-01 12:50:38 : __main__] xilx xilx
[INFO: 2024-08-01 12:50:38 : __main__] qral larq
[DEBUG: 2024-08-01 12:50:38 : __main__] coic cioc
[0]

When I call logging_on() outside the if __name__ == "__main__" block, I do get the log messages (along with many others):

[DEBUG: 2024-08-01 12:56:41 : asyncio] Using selector: EpollSelector
[DEBUG: 2024-08-01 12:56:42 : asyncio] Using selector: EpollSelector
[DEBUG: 2024-08-01 12:56:42 : asyncio] Using selector: EpollSelector
[DEBUG: 2024-08-01 12:56:42 : asyncio] Using selector: EpollSelector
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Locking '/tmp/dask-scratch-space/worker-6flgbr8t.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Checking lock file '/tmp/dask-scratch-space/scheduler-75snhs1x.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Locking '/tmp/dask-scratch-space/worker-avvfalg_.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Checking lock file '/tmp/dask-scratch-space/worker-6flgbr8t.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Checking lock file '/tmp/dask-scratch-space/scheduler-75snhs1x.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Checking lock file '/tmp/dask-scratch-space/worker-avvfalg_.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Locking '/tmp/dask-scratch-space/worker-xwpb7xsk.dirlock'...
[INFO: 2024-08-01 12:56:42 : distributed.worker]       Start worker at:      tcp://127.0.0.1:36357
[INFO: 2024-08-01 12:56:42 : distributed.worker]          Listening to:      tcp://127.0.0.1:36357
[INFO: 2024-08-01 12:56:42 : distributed.worker]           Worker name:                          0
[INFO: 2024-08-01 12:56:42 : distributed.worker]          dashboard at:            127.0.0.1:43539
[INFO: 2024-08-01 12:56:42 : distributed.worker] Waiting to connect to:      tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker]       Start worker at:      tcp://127.0.0.1:46133
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.worker]          Listening to:      tcp://127.0.0.1:46133
[INFO: 2024-08-01 12:56:42 : distributed.worker]               Threads:                          2
[INFO: 2024-08-01 12:56:42 : distributed.worker]           Worker name:                          2
[INFO: 2024-08-01 12:56:42 : distributed.worker]                Memory:                   7.80 GiB
[INFO: 2024-08-01 12:56:42 : distributed.worker]       Local Directory: /tmp/dask-scratch-space/worker-avvfalg_
[INFO: 2024-08-01 12:56:42 : distributed.worker]          dashboard at:            127.0.0.1:34943
[INFO: 2024-08-01 12:56:42 : distributed.worker] Waiting to connect to:      tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.worker]               Threads:                          2
[INFO: 2024-08-01 12:56:42 : distributed.worker]                Memory:                   7.80 GiB
[INFO: 2024-08-01 12:56:42 : distributed.worker]       Local Directory: /tmp/dask-scratch-space/worker-6flgbr8t
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.core] Establishing connection to 127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.core] Establishing connection to 127.0.0.1:34505
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP keepalive: nprobes=10, idle=10, interval=2
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP user timeout: 30000 ms
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP keepalive: nprobes=10, idle=10, interval=2
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP user timeout: 30000 ms
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.core] Connection to 127.0.0.1:34505 established
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.core] Connection to 127.0.0.1:34505 established
[INFO: 2024-08-01 12:56:42 : distributed.worker]       Start worker at:      tcp://127.0.0.1:35289
[INFO: 2024-08-01 12:56:42 : distributed.worker]          Listening to:      tcp://127.0.0.1:35289
[INFO: 2024-08-01 12:56:42 : distributed.worker]           Worker name:                          3
[INFO: 2024-08-01 12:56:42 : distributed.worker]          dashboard at:            127.0.0.1:36137
[INFO: 2024-08-01 12:56:42 : distributed.worker] Waiting to connect to:      tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.worker]               Threads:                          2
[INFO: 2024-08-01 12:56:42 : distributed.worker]                Memory:                   7.80 GiB
[INFO: 2024-08-01 12:56:42 : distributed.worker]       Local Directory: /tmp/dask-scratch-space/worker-xwpb7xsk
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.core] Establishing connection to 127.0.0.1:34505
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP keepalive: nprobes=10, idle=10, interval=2
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP user timeout: 30000 ms
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.core] Connection to 127.0.0.1:34505 established
[DEBUG: 2024-08-01 12:56:42 : asyncio] Using selector: EpollSelector
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Checking lock file '/tmp/dask-scratch-space/worker-6flgbr8t.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Checking lock file '/tmp/dask-scratch-space/scheduler-75snhs1x.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Checking lock file '/tmp/dask-scratch-space/worker-avvfalg_.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Checking lock file '/tmp/dask-scratch-space/worker-xwpb7xsk.dirlock'...
[DEBUG: 2024-08-01 12:56:42 : distributed.diskutils] Locking '/tmp/dask-scratch-space/worker-j5jqbicd.dirlock'...
[INFO: 2024-08-01 12:56:42 : distributed.worker]       Start worker at:      tcp://127.0.0.1:43163
[INFO: 2024-08-01 12:56:42 : distributed.worker]          Listening to:      tcp://127.0.0.1:43163
[INFO: 2024-08-01 12:56:42 : distributed.worker]           Worker name:                          1
[INFO: 2024-08-01 12:56:42 : distributed.worker]          dashboard at:            127.0.0.1:37003
[INFO: 2024-08-01 12:56:42 : distributed.worker] Waiting to connect to:      tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.worker]               Threads:                          2
[INFO: 2024-08-01 12:56:42 : distributed.worker]                Memory:                   7.80 GiB
[INFO: 2024-08-01 12:56:42 : distributed.worker]       Local Directory: /tmp/dask-scratch-space/worker-j5jqbicd
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.core] Establishing connection to 127.0.0.1:34505
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP keepalive: nprobes=10, idle=10, interval=2
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP user timeout: 30000 ms
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.core] Connection to 127.0.0.1:34505 established
[INFO: 2024-08-01 12:56:42 : distributed.worker] Starting Worker plugin shuffle
[INFO: 2024-08-01 12:56:42 : distributed.worker]         Registered to:      tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.core] Starting established connection to tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] Starting Worker plugin shuffle
[INFO: 2024-08-01 12:56:42 : distributed.worker]         Registered to:      tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] Starting Worker plugin shuffle
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.core] Starting established connection to tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker]         Registered to:      tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.worker] Starting Worker plugin shuffle
[INFO: 2024-08-01 12:56:42 : distributed.worker]         Registered to:      tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.core] Starting established connection to tcp://127.0.0.1:34505
[INFO: 2024-08-01 12:56:42 : distributed.worker] -------------------------------------------------
[INFO: 2024-08-01 12:56:42 : distributed.core] Starting established connection to tcp://127.0.0.1:34505
[WARNING: 2024-08-01 12:56:42 : __main__] xilx xilx
[INFO: 2024-08-01 12:56:42 : __main__] qral larq
[DEBUG: 2024-08-01 12:56:42 : __main__] coic cioc
[DEBUG: 2024-08-01 12:56:42 : distributed.worker.state_machine] Data task <TaskState ('test-from-value-03edcc0a2b842080ed8e1e36abea720a', 0) memory> already known (stimulus_id=compute-task-1722509802.9544415)
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Incoming connection from 'tcp://127.0.0.1:42440' to 'tcp://127.0.0.1:35289'
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP keepalive: nprobes=10, idle=10, interval=2
[DEBUG: 2024-08-01 12:56:42 : distributed.comm.tcp] Setting TCP user timeout: 30000 ms
[DEBUG: 2024-08-01 12:56:42 : distributed.core] Connection from 'tcp://127.0.0.1:42440' to Worker
[DEBUG: 2024-08-01 12:56:42 : distributed.core] Message from 'tcp://127.0.0.1:42440': {'op': 'get_data', 'keys': (('from-value-03edcc0a2b842080ed8e1e36abea720a', 0),), 'who': None, 'reply': True}
[DEBUG: 2024-08-01 12:56:42 : distributed.core] Calling into handler get_data
[DEBUG: 2024-08-01 12:56:42 : distributed.worker.state_machine] Purge task: <TaskState ('test-from-value-03edcc0a2b842080ed8e1e36abea720a', 0) memory>
[DEBUG: 2024-08-01 12:56:42 : distributed.worker.state_machine] Purge task: <TaskState ('from-value-03edcc0a2b842080ed8e1e36abea720a', 0) memory>
[DEBUG: 2024-08-01 12:56:42 : distributed.worker.state_machine] Purge task: <TaskState ('from-value-03edcc0a2b842080ed8e1e36abea720a', 0) released>
[DEBUG: 2024-08-01 12:56:42 : distributed.worker.state_machine] Purge task: <TaskState ('test-from-value-03edcc0a2b842080ed8e1e36abea720a', 0) released>
[INFO: 2024-08-01 12:56:42 : distributed.worker] Stopping worker at tcp://127.0.0.1:46133. Reason: nanny-close
[INFO: 2024-08-01 12:56:42 : distributed.worker] Stopping worker at tcp://127.0.0.1:36357. Reason: nanny-close
[INFO: 2024-08-01 12:56:42 : distributed.worker] Stopping worker at tcp://127.0.0.1:43163. Reason: nanny-close
[INFO: 2024-08-01 12:56:42 : distributed.worker] Stopping worker at tcp://127.0.0.1:35289. Reason: nanny-close
[INFO: 2024-08-01 12:56:42 : distributed.core] Connection to tcp://127.0.0.1:34505 has been closed.
[DEBUG: 2024-08-01 12:56:42 : distributed.worker] Attempted to close worker that is already Status.closing. Reason: worker-handle-scheduler-connection-broken
[INFO: 2024-08-01 12:56:42 : distributed.core] Connection to tcp://127.0.0.1:34505 has been closed.
[DEBUG: 2024-08-01 12:56:42 : distributed.worker] Attempted to close worker that is already Status.closing. Reason: worker-handle-scheduler-connection-broken
[INFO: 2024-08-01 12:56:42 : distributed.core] Connection to tcp://127.0.0.1:34505 has been closed.
[DEBUG: 2024-08-01 12:56:42 : distributed.worker] Attempted to close worker that is already Status.closing. Reason: worker-handle-scheduler-connection-broken
[INFO: 2024-08-01 12:56:42 : distributed.core] Connection to tcp://127.0.0.1:34505 has been closed.
[DEBUG: 2024-08-01 12:56:42 : distributed.worker] Attempted to close worker that is already Status.closing. Reason: worker-handle-scheduler-connection-broken
[DEBUG: 2024-08-01 12:56:42 : distributed.worker] Attempted to close worker that is already Status.closed. Reason: worker-close
[DEBUG: 2024-08-01 12:56:42 : distributed.worker] Attempted to close worker that is already Status.closed. Reason: worker-close
[DEBUG: 2024-08-01 12:56:42 : distributed.worker] Attempted to close worker that is already Status.closed. Reason: worker-close
[DEBUG: 2024-08-01 12:56:42 : distributed.core] Lost connection to 'tcp://127.0.0.1:42440' while reading message: in <TCP (closed)  local=tcp://127.0.0.1:35289 remote=tcp://127.0.0.1:42440>: Stream is closed. Last operation: get_data
[DEBUG: 2024-08-01 12:56:42 : distributed.worker] Attempted to close worker that is already Status.closed. Reason: worker-close
[0]

Environment Info:

  • OS: [e.g. OSX, Windows, Linux]
  • Trollflow2 Version: main (739b1f9)

Additional context

From this Stack Overflow question:

As of version 1.15.0 we now fork workers from a clean process, so changes that you make to your process prior to calling Client() won't affect forked workers. For more information search for forkserver here: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

So, to solve this, logging_on() should be called after Client() has been created.
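For illustration, a minimal sketch of that reordering, adapting the reproducer above (only the __main__ block changes). Whether this alone is enough for the worker processes is refined in the follow-up comment below:

if __name__ == "__main__":
    with Client():           # start the distributed workers first ...
        with logging_on():   # ... then install the log configuration
            main()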

gerritholl added the bug label on Aug 1, 2024
gerritholl (Member, Author) commented Aug 1, 2024

The recommended solution would be to add something like

client.run(logging_on)

after creating the client, but that doesn't work, because logging_on is a context manager (actually a generator), which leads to dask.distributed failing with TypeError: cannot pickle 'generator' object. Calling satpy's debug_on instead seems to work, so it seems this needs a functional wrapper / functional version of logging_on.
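For illustration, a minimal sketch of such a functional version, using a plain (picklable) function that is run on every worker via client.run. The name setup_worker_logging, the format string, and the DEBUG level are assumptions for this sketch, not trollflow2's actual API:

import logging

import dask
import dask.array as da
import numpy as np
from dask.distributed import Client

logger = logging.getLogger(__name__)


def setup_worker_logging():
    # Hypothetical plain replacement for logging_on() (not trollflow2 API):
    # configure the root logger of whichever process this runs in so that
    # DEBUG and above are emitted with a similar format.
    logging.basicConfig(
        level=logging.DEBUG,
        format="[%(levelname)s: %(asctime)s : %(name)s] %(message)s",
        force=True,
    )


@dask.delayed
def test():
    logger.warning("xilx xilx")
    logger.info("qral larq")
    logger.debug("coic cioc")
    return np.array([0])


if __name__ == "__main__":
    setup_worker_logging()  # configure the main process
    with Client() as client:
        # client.run executes the plain function once per worker process;
        # a plain function pickles fine, unlike the logging_on generator.
        client.run(setup_worker_logging)
        arr = da.from_delayed(test(), shape=(1,), dtype="uint16")
        print(arr.compute())

Since client.run executes the function on each worker, every worker ends up with roughly the same handlers and level as the main process, which is approximately what a functional logging_on would have to do.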
