Workflow state transitions

The following state diagram gives a high-level view of the state transitions that a workflow with a single task and a single node goes through on a successful run.

    flowchart TD
    id1(( ))
    id1 --> Ready
    Ready --> Running
    subgraph Running
    id2(( ))
    id2 --> NodeQueued
    NodeQueued --> NodeRunning
    subgraph NodeRunning
    id3(( ))
    id3 --> TaskQueued
    TaskQueued --> TaskRunning
    TaskRunning --> TaskSuccess
    end
    TaskSuccess --> NodeSuccess
    end
    NodeSuccess --> Success
    

The following sections explain the various observable (and some hidden) states for workflow, node, and task state transitions.

Workflow States

    flowchart TD
    Queued -->|On system errors more than threshold| Aborted
    Queued --> Ready
    Ready--> |Write inputs to workflow| Running
    Running--> |On system error| Running
    Running--> |On all Nodes Success| Succeeding
    Succeeding--> |On successful event send to Admin| Succeeded
    Succeeding--> |On system error| Succeeding
    Ready--> |On precondition failure| Failing
    Running--> |On any Node Failure| Failing
    Ready--> |On user initiated abort| Aborting
    Running--> |On user initiated abort| Aborting
    Succeeding--> |On user initiated abort| Aborting
    Failing--> |If Failure node exists| HandleFailureNode
    Failing--> |On user initiated abort| Aborting
    HandleFailureNode--> |On completing failure node| Failed
    HandleFailureNode--> |On user initiated abort| Aborting
    Failing--> |On successful send of Failure node| Failed
    Aborting--> |On successful event send to Admin| Aborted
    

A workflow always starts in the Ready state and ends in the Failed, Succeeded, or Aborted state. Any system error within a state causes that state to be retried. These retries are capped by the system retry limit; if the failure persists past that limit, the workflow ends in the Aborted state.
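The workflow diagram and the retry rule above can be read as a transition table. The following is a minimal sketch, not Flyte's implementation: the edges are transcribed from the diagram, while the names `WorkflowPhase`, `TRANSITIONS`, `is_valid_transition`, `terminal_phases`, `phase_after_system_error`, and the `max_system_failures` parameter are hypothetical and exist only for illustration.

```python
from enum import Enum


class WorkflowPhase(Enum):
    QUEUED = "Queued"
    READY = "Ready"
    RUNNING = "Running"
    SUCCEEDING = "Succeeding"
    SUCCEEDED = "Succeeded"
    FAILING = "Failing"
    HANDLE_FAILURE_NODE = "HandleFailureNode"
    FAILED = "Failed"
    ABORTING = "Aborting"
    ABORTED = "Aborted"


P = WorkflowPhase

# Edges transcribed from the workflow diagram. Self-loops model the
# "retry on the same state" behavior after a system error.
TRANSITIONS = {
    P.QUEUED: {P.READY, P.ABORTED},
    P.READY: {P.RUNNING, P.FAILING, P.ABORTING},
    P.RUNNING: {P.RUNNING, P.SUCCEEDING, P.FAILING, P.ABORTING},
    P.SUCCEEDING: {P.SUCCEEDING, P.SUCCEEDED, P.ABORTING},
    P.FAILING: {P.HANDLE_FAILURE_NODE, P.FAILED, P.ABORTING},
    P.HANDLE_FAILURE_NODE: {P.FAILED, P.ABORTING},
    P.ABORTING: {P.ABORTED},
    P.SUCCEEDED: set(),
    P.FAILED: set(),
    P.ABORTED: set(),
}


def is_valid_transition(src, dst):
    """Return True if the diagram contains an edge from src to dst."""
    return dst in TRANSITIONS[src]


def terminal_phases():
    """Phases with no outgoing edges in the diagram."""
    return {phase for phase, targets in TRANSITIONS.items() if not targets}


def phase_after_system_error(phase, system_failures, max_system_failures):
    """Assumed reading of the retry cap: stay in the same phase until the
    number of system failures exceeds the threshold, then abort.
    The diagram only draws this edge from Queued; the prose applies the
    cap to any state."""
    if system_failures > max_system_failures:
        return P.ABORTED
    return phase
```

With this table, `terminal_phases()` yields exactly the three end states named above (Failed, Succeeded, Aborted), which is a quick way to check that the diagram and the prose agree.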

Every transition between states is recorded in FlyteAdmin using a WorkflowExecutionEvent.

The phases in the above state diagram are captured in the admin database as specified in workflowexecution.phase and are sent as part of the Execution event.

The state machine specification for the illustration can be found here.

Node States

    flowchart TD
    id1(( ))
    id1-->NotYetStarted
    id1-->|Will stop the node execution |Aborted
    NotYetStarted-->|If all upstream nodes are ready, i.e., inputs are ready | Queued
    NotYetStarted--> |If the branch was not taken |Skipped
    Queued-->|Start task execution- attempt 0 | Running
    Running-->|If task timeout has elapsed and retry_attempts >= max_retries|TimingOut
    Running-->|Internal state|Succeeding
    Running-->|For dynamic nodes generating workflows| DynamicRunning
    DynamicRunning-->TimingOut
    DynamicRunning-->RetryableFailure
    TimingOut-->|If total node timeout has elapsed|TimedOut
    DynamicRunning-->Succeeding
    Succeeding-->|User observes the task as succeeded| Succeeded
    Running-->|on retryable failure| RetryableFailure
    RetryableFailure-->|if retry_attempts < max_retries|Running
    RetryableFailure-->|retry_attempts >= max_retries|Failing
    Failing-->Failed
    Succeeded-->id2(( ))
    Failed-->id2(( ))
    

This state diagram illustrates how a node transitions through its various states. This is the core finite state machine for a node. From the user's perspective, a workflow simply consists of a sequence of tasks. Internally, however, Flyte creates a meta entity for each step of the workflow, known as a node.

Once a workflow enters the Running state, it triggers the phantom start node of the workflow. The start node is considered to be the entry node of any workflow. The start node begins by recursively executing all its child nodes using a modified depth-first search algorithm.
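The traversal described above can be sketched as follows. The source does not say how the depth-first search is modified, so this sketch assumes one plausible modification: a node is executed only after all of its upstream nodes have finished. The `traverse` function, the `downstream` adjacency mapping, and the node names are hypothetical, and "executing" a node is reduced to recording its name.

```python
def traverse(downstream, start="start-node"):
    """Visit nodes depth-first from the start node, deferring any node
    whose upstream nodes have not all finished (assumed modification)."""
    # Build the reverse mapping: node -> set of upstream nodes.
    upstream = {}
    for node, children in downstream.items():
        upstream.setdefault(node, set())
        for child in children:
            upstream.setdefault(child, set()).add(node)

    order = []        # execution order, for illustration
    finished = set()  # nodes that have already run

    def visit(node):
        if node in finished:
            return
        if not upstream[node] <= finished:
            # Not all inputs are ready; the node is reached again later
            # through its remaining upstream node(s).
            return
        order.append(node)
        finished.add(node)
        for child in downstream.get(node, []):
            visit(child)

    visit(start)
    return order


# A diamond-shaped workflow: "c" needs both "a" and "b".
example = {
    "start-node": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["end-node"],
}
print(traverse(example))
```

In the diamond example, the search reaches `c` through `a` first but defers it, because `b` has not run yet; `c` executes only when it is reached again through `b`.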

Nodes can be of different types as listed below, but all the nodes traverse through the same transitions:

  1. Start Node - Only exists during the execution and is not modeled in the core spec.
  2. Task Node
  3. Branch Node
  4. Workflow Node
  5. Dynamic Node - Just a task node that does not return output but constitutes a dynamic workflow. When the task runs, it remains in the RUNNING state. Once the task completes and Flyte starts executing the dynamic workflow, the overarching node that contains both the original task and the dynamic workflow enters DYNAMIC_RUNNING state.
  6. End Node - Only exists during the execution and is not modeled in the core spec.

Every transition between states is recorded in FlyteAdmin using a NodeExecutionEvent.

Every NodeExecutionEvent can have any nodeexecution.phase.

The state machine specification for the illustration can be found here.
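The retry edges in the node diagram above (RetryableFailure back to Running while `retry_attempts < max_retries`, otherwise on to Failing) can be sketched as a small simulation. This is an illustration under stated assumptions rather than Flyte's node handler: the functions `next_phase_after_retryable_failure` and `simulate_node` are hypothetical, timeouts and dynamic nodes are left out, and `retry_attempts` is assumed to count from the diagram's "attempt 0".

```python
def next_phase_after_retryable_failure(retry_attempts, max_retries):
    """Decision taken on the RetryableFailure state, per the diagram labels."""
    return "Running" if retry_attempts < max_retries else "Failing"


def simulate_node(attempt_outcomes, max_retries):
    """Return the phases a node passes through.

    attempt_outcomes: one bool per task attempt, True meaning success.
    """
    phases = ["NotYetStarted", "Queued", "Running"]
    retry_attempts = 0  # index of the current attempt (assumption)
    for succeeded in attempt_outcomes:
        if succeeded:
            phases += ["Succeeding", "Succeeded"]
            return phases
        phases.append("RetryableFailure")
        nxt = next_phase_after_retryable_failure(retry_attempts, max_retries)
        phases.append(nxt)
        if nxt == "Failing":
            phases.append("Failed")
            return phases
        retry_attempts += 1
    return phases  # outcomes exhausted while the node is still Running


print(simulate_node([False, True], max_retries=1))
print(simulate_node([False, False], max_retries=1))
```

The first call fails once, retries, and succeeds; the second exhausts its single retry and ends in Failed.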

Task States

    flowchart TD
    id1(( ))
    id1-->NotReady
    id1-->|Aborted by NodeHandler- timeouts, external abort, etc.| Aborted
    NotReady-->|Optional-Blocked on resource quota or resource pool | WaitingForResources
    WaitingForResources--> |Optional- Has been submitted, but hasn't started |Queued
    Queued-->|Optional- Prestart initialization | Initializing
    Initializing-->|Actual execution of user code has started|Running
    Running-->|Successful execution|Success
    Running-->|Failed with a retryable error|RetryableFailure
    Running-->|Unrecoverable failure, will stop all execution|PermanentFailure
    Success-->id2(( ))
    RetryableFailure-->id2(( ))
    PermanentFailure-->id2(( ))
    

The state diagram above illustrates the states through which a task transitions. This is the core finite state machine for a task.
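As a rough illustration of the task diagram, the sketch below builds the sequence of phases a task reports. It assumes that the phases labeled "Optional" in the diagram (WaitingForResources, Queued, Initializing) may be skipped, and that a task ends in Success, RetryableFailure, or PermanentFailure; the abort path is left out. The function `task_phase_path` and its `skipped` parameter are hypothetical.

```python
OPTIONAL_PHASES = ("WaitingForResources", "Queued", "Initializing")
FINAL_PHASES = ("Success", "RetryableFailure", "PermanentFailure")


def task_phase_path(outcome, skipped=()):
    """Return the ordered phases for a task ending in `outcome`.

    skipped: optional phases that are never reported for this task (assumption).
    """
    if outcome not in FINAL_PHASES:
        raise ValueError(f"unknown final phase: {outcome}")
    path = ["NotReady"]
    path += [phase for phase in OPTIONAL_PHASES if phase not in skipped]
    path += ["Running", outcome]
    return path


print(task_phase_path("Success"))
print(task_phase_path("PermanentFailure", skipped=("WaitingForResources", "Initializing")))
```

The first call walks every phase in the diagram; the second shows a task that goes from NotReady to Queued and then straight to Running before failing permanently.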

Every transition between states is recorded in FlyteAdmin using a TaskExecutionEvent.

Every TaskExecutionEvent can have any taskexecution.phase.

The state machine specification for the illustration can be found here.