
One failing task seems to stop independent tasks being recorded as done #114

Open
tesujimath opened this issue Mar 12, 2025 · 2 comments

tesujimath commented Mar 12, 2025

I expected a failing task not to interfere with other independent succeeding tasks.

What I am seeing is that a slow but successful task does in fact complete, but it is left in state RUN when shown in redun console. I would expect to see that task as DONE.

Here's the redun console output some time after this has all completed (sorry about the screenshot):

[screenshot: redun console job list, with slow_and_steady still shown in state RUN]

Here's my code:

import time
from redun import task, File
from typing import List
import logging

redun_namespace = "redun.examples.exceptions"

logging.basicConfig(
    filename="exceptions.log",
    level=logging.DEBUG,
    format="%(asctime)s %(name)-12s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%d %H:%M",
)

for noisy_module in ["botocore", "urllib3"]:
    logging.getLogger(noisy_module).setLevel(logging.WARN)


logger = logging.getLogger(__name__)


class MyException(Exception):
    pass


@task
def slow_and_steady(out_path: str, content: str) -> File:
    # Succeeds, but slowly: sleep for 5 seconds, then write the output file.
    logger.info("slow_and_steady sleeping")
    time.sleep(5)
    logger.info("slow_and_steady awake")
    with open(out_path, "w") as out_f:
        out_f.write(content)
    logger.info(f"slow_and_steady wrote file {out_path}")
    return File(out_path)


@task()
def fast_fail(out_path: str, content: str) -> File:
    # Fails immediately.
    logger.info("fast fail failing")
    raise MyException("Oh no!")
    return File(out_path)  # unreachable; kept to show the intended return type


@task()
def main() -> List[File]:
    bad_file = fast_fail("out/freddy", "Hello Freddy\n")
    good_file = slow_and_steady("out/jimmy", "Hello Jimmy\n")
    logger.info("main task at end")
    return [bad_file, good_file]

And here's the log output showing that the slow and steady task does actually complete:

it23699> cat exceptions.log
2025-03-13 09:07 redun        INFO     Start Execution fcecf9a8-c4d5-42ac-8ca5-32ede9495fdb:  redun run --no-cache exceptions/main.py main
2025-03-13 09:07 redun        INFO     Run    Job ceb9e847:  redun.examples.exceptions.main() on default
2025-03-13 09:07 main         INFO     main task at end
2025-03-13 09:07 redun        INFO     Run    Job 578495b0:  redun.examples.exceptions.fast_fail(out_path='out/freddy', content='Hello Freddy\n') on default
2025-03-13 09:07 main         INFO     fast fail failing
2025-03-13 09:07 redun        INFO     Run    Job ddbc97b9:  redun.examples.exceptions.slow_and_steady(out_path='out/jimmy', content='Hello Jimmy\n') on default
2025-03-13 09:07 main         INFO     slow_and_steady sleeping
2025-03-13 09:07 redun        INFO     *** Reject Job 578495b0:  redun.examples.exceptions.fast_fail(out_path='out/freddy', content='Hello Freddy\n'):
2025-03-13 09:07 redun        INFO     *** Reject Job ceb9e847:  redun.examples.exceptions.main():
2025-03-13 09:07 redun        INFO
2025-03-13 09:07 redun        INFO     | JOB STATUS 2025/03/13 09:07:01
2025-03-13 09:07 redun        INFO     | TASK                                      PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
2025-03-13 09:07 redun        INFO     |
2025-03-13 09:07 redun        INFO     | ALL                                             0       1       2       0       0       3
2025-03-13 09:07 redun        INFO     | redun.examples.exceptions.fast_fail             0       0       1       0       0       1
2025-03-13 09:07 redun        INFO     | redun.examples.exceptions.main                  0       0       1       0       0       1
2025-03-13 09:07 redun        INFO     | redun.examples.exceptions.slow_and_steady       0       1       0       0       0       1
2025-03-13 09:07 redun        INFO
2025-03-13 09:07 redun        INFO
2025-03-13 09:07 redun        INFO     Execution duration: 0.05 seconds
2025-03-13 09:07 main         INFO     slow_and_steady awake
2025-03-13 09:07 main         INFO     slow_and_steady wrote file out/jimmy
2025-03-13 09:07 redun        INFO     *** Execution failed. Traceback (most recent task last):
2025-03-13 09:07 redun        INFO       Job ceb9e847: File "/home/guestsi/vc/playpen/redun-playpen/exceptions/main.py", line 45, in main
2025-03-13 09:07 redun        INFO         def main() -> List[File]:
2025-03-13 09:07 redun        INFO       Job 578495b0: File "/home/guestsi/vc/playpen/redun-playpen/exceptions/main.py", line 38, in fast_fail
2025-03-13 09:07 redun        INFO         def fast_fail(out_path: str, content: str) -> File:
2025-03-13 09:07 redun        INFO         content  = 'Hello Freddy\n'
2025-03-13 09:07 redun        INFO         out_path = 'out/freddy'
2025-03-13 09:07 redun        INFO       File "/home/guestsi/vc/playpen/redun-playpen/exceptions/main.py", line 40, in fast_fail
2025-03-13 09:07 redun        INFO         raise MyException("Oh no!")
2025-03-13 09:07 redun        INFO     MyException: Oh no!

Is there something I should be doing to get done tasks to be recorded as done in the database?

mattrasmus (Collaborator) commented Mar 14, 2025

Thanks for posting.

So the default redun behavior is that when multiple child tasks run in parallel (slow_and_steady and fast_fail) and one of them raises an exception, that exception propagates up eagerly to the higher-level tasks in the job tree (which is a parallel version of a call stack) to see if any higher-level task wants to catch it (see catch()). If none does, the whole workflow halts, as your logs show.
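
For reference, here is a minimal sketch of catching one level up with catch(), reusing your task definitions (the handle task and placeholder path are hypothetical, just for illustration):

from redun import catch

@task
def handle(error: Exception) -> File:
    # Hypothetical recovery: log the error and fall back to a placeholder file.
    logger.warning(f"fast_fail failed: {error}")
    with open("out/placeholder", "w") as out_f:
        out_f.write("placeholder\n")
    return File("out/placeholder")

@task
def main_with_catch() -> File:
    # If fast_fail raises MyException, the scheduler calls handle(error)
    # instead of propagating the exception further up the job tree.
    return catch(fast_fail("out/freddy", "Hello Freddy\n"), MyException, handle)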

To clarify what you're seeing in your logs, redun does not currently implement actively killing sibling jobs (slow_and_steady) in such a situation. Of course, if the sibling job is running in a thread, the thread will eventually be killed when the overall process running the redun scheduler is killed. Sometimes the thread gets lucky and is able to get to the end of the function ("slow_and_steady wrote file out/jimmy") before the overall process is shut down. In situations like this, the job has effectively completed its work, but it doesn't get a chance to report back to the scheduler and be marked DONE before the scheduler shuts down. If the slow_and_steady job ran for longer, the process shutdown would actively kill it before it completed.

If you would like to deliberately wait for all parallel tasks to finish before propagating the failure up the job tree, you can use the catch() task to catch errors and handle them. We even provide a version of catch() for this kind of parallel-job situation, called catch_all():

from redun.functools import catch_all

@task
def recover(values):
    # Here, values holds the result of each child task: a File for each
    # success and an Exception for each failure. We can re-raise the
    # exception, retry the failed job, or apply any number of recovery strategies.
    raise MyException("One of the child tasks failed")

@task
def main() -> List[File]:
    bad_file = fast_fail("out/freddy", "Hello Freddy\n")
    good_file = slow_and_steady("out/jimmy", "Hello Jimmy\n")
    logger.info("main task at end")
    # catch_all waits for all expressions in the list to finish before calling
    # recover if any of them raised MyException.
    return catch_all([bad_file, good_file], MyException, recover)
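
For example, a recover that salvages the successful results might look like this (just a sketch; it assumes values is the mixed list of Files and Exceptions described in the comment above):

@task
def recover_keep_successes(values) -> List[File]:
    # Sketch: split results into successes (Files) and failures (Exceptions).
    files = [value for value in values if isinstance(value, File)]
    errors = [value for value in values if isinstance(value, Exception)]
    for error in errors:
        logger.warning(f"child task failed: {error}")
    if not files:
        # Nothing succeeded, so propagate the failure.
        raise MyException("All child tasks failed")
    return files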

Let me know if that helps.

tesujimath (Author) commented Mar 17, 2025

That's perfect, thank you!

I wonder whether this should be added to the docs somewhere besides the API reference. Perhaps another heading in the Tasks section of the user guide?
