Serving
Flyte enables you to implement serving in various contexts:
- High-throughput batch inference with NIMs, vLLM, and Actors.
- Low-latency online inference using frameworks like vLLM and SGLang.
- Web endpoints using frameworks like FastAPI and Flask.
- Interactive web apps built with your favorite Python-based front-end frameworks, such as Streamlit and Gradio.
- Edge inference using MLC-LLM.
This section walks through examples of implementing serving in each of these contexts using constructs like Union Actors, Serving Apps, and Artifacts.
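
To give a flavor of the batch-inference case, the following is a minimal sketch of offline batch inference using vLLM's public Python API. The model name and prompts are placeholders, and this does not show the Flyte task or Actor wiring that the later examples cover.

```python
# Minimal vLLM offline batch-inference sketch (illustrative only).
from vllm import LLM, SamplingParams

# Placeholder prompts; in a real workflow these would come from upstream data.
prompts = [
    "Summarize the benefits of batch inference.",
    "Explain what a serving endpoint is.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# The LLM object loads the model once and batches prompts internally,
# which is what makes high-throughput batch inference practical.
llm = LLM(model="facebook/opt-125m")  # placeholder model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```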
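
For the web-endpoint case, here is a minimal FastAPI sketch. The request/response models and the predict logic are stand-ins for whatever model you actually deploy; the later examples show how such an app is packaged and served.

```python
# Minimal FastAPI endpoint sketch (illustrative only).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    # Placeholder inference: replace with a real model call.
    label = "positive" if "good" in request.text.lower() else "negative"
    return PredictResponse(label=label)

# Run locally with: uvicorn app:app --reload
```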