Serving
Flyte enables you to implement serving in various contexts:
- High-throughput batch inference with NIMs, vLLM, and Actors.
- Low-latency online inference using frameworks like vLLM and SGLang.
- Web endpoints using frameworks like FastAPI and Flask.
- Interactive web apps built with your favorite Python-based front-end frameworks, such as Streamlit and Gradio.
- Edge inference using MLC-LLM.
This section walks through examples of implementing serving in each of these contexts using constructs like Union Actors, Serving Apps, and Artifacts.
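
To give a flavor of the batch-inference case, the following is a minimal sketch of offline batch inference using vLLM's public Python API. The model name and prompts are placeholders, and this does not show the Flyte task or Actor wiring that the later examples cover.

```python
# Minimal vLLM offline batch-inference sketch (illustrative only).
from vllm import LLM, SamplingParams

# Placeholder prompts; in a real workflow these would come from upstream data.
prompts = [
    "Summarize the benefits of batch inference.",
    "Explain what a serving endpoint is.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# The LLM object loads the model once and batches prompts internally,
# which is what makes high-throughput batch inference practical.
llm = LLM(model="facebook/opt-125m")  # placeholder model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```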
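
For the web-endpoint case, here is a minimal FastAPI sketch. The request/response models and the predict logic are stand-ins for whatever model you actually deploy; the later examples show how such an app is packaged and served.

```python
# Minimal FastAPI endpoint sketch (illustrative only).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    # Placeholder inference: replace with a real model call.
    label = "positive" if "good" in request.text.lower() else "negative"
    return PredictResponse(label=label)

# Run locally with: uvicorn app:app --reload
```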