0.0.0+develop

flytekitplugins.kfpytorch.error_handling

Handle errors in elastic training jobs.

Directory

Methods

Method Description
create_recoverable_error_file() Create a file to signal to the agent process that an exception in the worker process is recoverable.
is_recoverable_worker_error() Check if the error in the worker process is recoverable.

Variables

Property Type Description
RECOVERABLE_ERROR_FILE_NAME str

Methods

create_recoverable_error_file()

def create_recoverable_error_file()

Create a file to signal to the agent process that an exception in the worker process is recoverable.

Torch’s elastic_launch gives the agent process access to exceptions raised in the worker processes only as strings in an error file. Instead of parsing this error file in the agent process for the string FlyteRecoverableException - which would not detect exceptions inheriting from FlyteRecoverableException - we create a file in the worker process to signal to the agent process that the exception is recoverable. The file is created in the directory where the default torch elastic error file is written.

Raises: ValueError: If the environment variable TORCHELASTIC_ERROR_FILE is not set.

is_recoverable_worker_error()

def is_recoverable_worker_error(
    failure,
) -> bool

Check if the error in the worker process is recoverable.

The error is considered recoverable if the directory containing the torch elastic error file contains a file named recoverable_error.

Parameter Type
failure