
One failing task seems to stop independent tasks being recorded as done #114

Open
tesujimath opened this issue Mar 12, 2025 · 2 comments

tesujimath commented Mar 12, 2025

I expected a failing task not to interfere with other independent succeeding tasks.

What I am seeing is that a slow but successful task does in fact complete, but it is left in state RUN when shown in redun console. I would expect to see that task as DONE.

Here's the redun console output some time after this has all completed (sorry about the screenshot):

[screenshot: redun console job list, with slow_and_steady still shown in state RUN]

Here's my code:

import time
from redun import task, File
from typing import List
import logging

redun_namespace = "redun.examples.exceptions"

logging.basicConfig(
    filename="exceptions.log",
    level=logging.DEBUG,
    format="%(asctime)s %(name)-12s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%d %H:%M",
)

for noisy_module in ["botocore", "urllib3"]:
    logging.getLogger(noisy_module).setLevel(logging.WARN)


logger = logging.getLogger(__name__)


class MyException(Exception):
    pass


@task
def slow_and_steady(out_path: str, content: str) -> File:
    # Succeeds, but slowly: sleep for 5 seconds, then write the output file.
    logger.info("slow_and_steady sleeping")
    time.sleep(5)
    logger.info("slow_and_steady awake")
    with open(out_path, "w") as out_f:
        out_f.write(content)
    logger.info(f"slow_and_steady wrote file {out_path}")
    return File(out_path)


@task()
def fast_fail(out_path: str, content: str) -> File:
    # Fails immediately.
    logger.info("fast fail failing")
    raise MyException("Oh no!")
    return File(out_path)  # unreachable; kept to show the intended return type


@task()
def main() -> List[File]:
    bad_file = fast_fail("out/freddy", "Hello Freddy\n")
    good_file = slow_and_steady("out/jimmy", "Hello Jimmy\n")
    logger.info("main task at end")
    return [bad_file, good_file]

And here's the log output showing that the slow and steady task does actually complete:

it23699> cat exceptions.log
2025-03-13 09:07 redun        INFO     Start Execution fcecf9a8-c4d5-42ac-8ca5-32ede9495fdb:  redun run --no-cache exceptions/main.py main
2025-03-13 09:07 redun        INFO     Run    Job ceb9e847:  redun.examples.exceptions.main() on default
2025-03-13 09:07 main         INFO     main task at end
2025-03-13 09:07 redun        INFO     Run    Job 578495b0:  redun.examples.exceptions.fast_fail(out_path='out/freddy', content='Hello Freddy\n') on default
2025-03-13 09:07 main         INFO     fast fail failing
2025-03-13 09:07 redun        INFO     Run    Job ddbc97b9:  redun.examples.exceptions.slow_and_steady(out_path='out/jimmy', content='Hello Jimmy\n') on default
2025-03-13 09:07 main         INFO     slow_and_steady sleeping
2025-03-13 09:07 redun        INFO     *** Reject Job 578495b0:  redun.examples.exceptions.fast_fail(out_path='out/freddy', content='Hello Freddy\n'):
2025-03-13 09:07 redun        INFO     *** Reject Job ceb9e847:  redun.examples.exceptions.main():
2025-03-13 09:07 redun        INFO
2025-03-13 09:07 redun        INFO     | JOB STATUS 2025/03/13 09:07:01
2025-03-13 09:07 redun        INFO     | TASK                                      PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
2025-03-13 09:07 redun        INFO     |
2025-03-13 09:07 redun        INFO     | ALL                                             0       1       2       0       0       3
2025-03-13 09:07 redun        INFO     | redun.examples.exceptions.fast_fail             0       0       1       0       0       1
2025-03-13 09:07 redun        INFO     | redun.examples.exceptions.main                  0       0       1       0       0       1
2025-03-13 09:07 redun        INFO     | redun.examples.exceptions.slow_and_steady       0       1       0       0       0       1
2025-03-13 09:07 redun        INFO
2025-03-13 09:07 redun        INFO
2025-03-13 09:07 redun        INFO     Execution duration: 0.05 seconds
2025-03-13 09:07 main         INFO     slow_and_steady awake
2025-03-13 09:07 main         INFO     slow_and_steady wrote file out/jimmy
2025-03-13 09:07 redun        INFO     *** Execution failed. Traceback (most recent task last):
2025-03-13 09:07 redun        INFO       Job ceb9e847: File "/home/guestsi/vc/playpen/redun-playpen/exceptions/main.py", line 45, in main
2025-03-13 09:07 redun        INFO         def main() -> List[File]:
2025-03-13 09:07 redun        INFO       Job 578495b0: File "/home/guestsi/vc/playpen/redun-playpen/exceptions/main.py", line 38, in fast_fail
2025-03-13 09:07 redun        INFO         def fast_fail(out_path: str, content: str) -> File:
2025-03-13 09:07 redun        INFO         content  = 'Hello Freddy\n'
2025-03-13 09:07 redun        INFO         out_path = 'out/freddy'
2025-03-13 09:07 redun        INFO       File "/home/guestsi/vc/playpen/redun-playpen/exceptions/main.py", line 40, in fast_fail
2025-03-13 09:07 redun        INFO         raise MyException("Oh no!")
2025-03-13 09:07 redun        INFO     MyException: Oh no!

Is there something I should be doing to get done tasks to be recorded as done in the database?

mattrasmus (Collaborator) commented Mar 14, 2025

Thanks for posting.

So the default redun behavior is that when multiple child tasks run in parallel (slow_and_steady and fast_fail) and one of them raises an exception, that exception propagates up eagerly to the higher-level tasks in the job tree (which is a parallel version of a call stack) to see if any higher-level task wants to catch it (see catch()). If none does, the whole workflow halts, as your logs show.
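
For reference, here is a minimal sketch of catching one level up with catch(), reusing your task definitions (the handle task and placeholder path are hypothetical, just for illustration):

from redun import catch

@task
def handle(error: Exception) -> File:
    # Hypothetical recovery: log the error and fall back to a placeholder file.
    logger.warning(f"fast_fail failed: {error}")
    with open("out/placeholder", "w") as out_f:
        out_f.write("placeholder\n")
    return File("out/placeholder")

@task
def main_with_catch() -> File:
    # If fast_fail raises MyException, the scheduler calls handle(error)
    # instead of propagating the exception further up the job tree.
    return catch(fast_fail("out/freddy", "Hello Freddy\n"), MyException, handle)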

To clarify what you're seeing in your logs, redun does not currently implement actively killing sibling jobs (slow_and_steady) in such a situation. Of course, if the sibling job is running in a thread, the thread will eventually be killed when the overall process running the redun scheduler is killed. Sometimes the thread gets lucky and is able to get to the end of the function ("slow_and_steady wrote file out/jimmy") before the overall process is shut down. In situations like this, the job has effectively completed its work, but it doesn't get a chance to report back to the scheduler and be marked DONE before the scheduler shuts down. If the slow_and_steady job ran for longer, the process shutdown would actively kill it before it completed.

If you would like to deliberately wait for all parallel tasks to finish before propagating the failure up the job tree, you can use the catch() task to catch errors and handle them. We even provide a version of catch() for this kind of parallel-job situation, called catch_all():

from redun.functools import catch_all

@task
def recover(values):
    # Here, values holds the result of each child task: a File for each
    # success and an Exception for each failure. We can re-raise the
    # exception, retry the failed job, or apply any number of recovery strategies.
    raise MyException("One of the child tasks failed")

@task
def main() -> List[File]:
    bad_file = fast_fail("out/freddy", "Hello Freddy\n")
    good_file = slow_and_steady("out/jimmy", "Hello Jimmy\n")
    logger.info("main task at end")
    # catch_all waits for all expressions in the list to finish before calling
    # recover if any of them raised MyException.
    return catch_all([bad_file, good_file], MyException, recover)
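
For example, a recover that salvages the successful results might look like this (just a sketch; it assumes values is the mixed list of Files and Exceptions described in the comment above):

@task
def recover_keep_successes(values) -> List[File]:
    # Sketch: split results into successes (Files) and failures (Exceptions).
    files = [value for value in values if isinstance(value, File)]
    errors = [value for value in values if isinstance(value, Exception)]
    for error in errors:
        logger.warning(f"child task failed: {error}")
    if not files:
        # Nothing succeeded, so propagate the failure.
        raise MyException("All child tasks failed")
    return files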

Let me know if that helps.

tesujimath (Author) commented Mar 17, 2025

That's perfect, thank you!

I wonder whether this should be added to the docs somewhere besides the API reference. Perhaps another heading in the Tasks section of the user guide?
