The llm-d team released a new post focusing on intelligent inference serving and how LLM workloads differ from stateless web requests. Worth a read.
While digging a little deeper into what the right architecture for AI workloads in my homelab should look like, I came across llm-d. It was launched by CoreWeave, Google, IBM Research, NVIDIA, and Red Hat. Their mission statement really resonates:
The objective of llm-d is to create a well-lit path for anyone to adopt the leading distributed inference optimizations within their existing deployment framework - Kubernetes.
llm-d's building blocks are vLLM as the inference engine, Kubernetes as the core platform, and the Inference Gateway, which provides intelligent scheduling built for LLM-style workloads. I would highly recommend spending a bit more time reading through their announcement, which explains very nicely how LLM workloads differ from typical web workloads.
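From the client's point of view the stack stays simple, because vLLM (and the gateway in front of it) speaks an OpenAI-compatible API. Here is a minimal sketch of what a request against such a deployment could look like; the gateway hostname and model name are placeholders for whatever your own setup exposes, not values taken from the llm-d docs:

from openai import OpenAI

# Assumed: the llm-d gateway forwards OpenAI-compatible requests to the vLLM pods.
client = OpenAI(
    base_url="http://llm-d-gateway.example.local/v1",  # placeholder gateway address
    api_key="not-needed",  # local deployments typically don't check the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: whichever model the vLLM pods serve
    messages=[{"role": "user", "content": "Explain prefill vs. decode in one sentence."}],
)
print(response.choices[0].message.content)

The point is that the scheduling intelligence lives behind that endpoint, so existing OpenAI-style clients don't need to change.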
Even though the focus is on deploying large-scale inference on Kubernetes with large models (e.g. Llama-70B+, not Llama-8B) and longer input/output sequence lengths (e.g. 10k ISL | 1k OSL, not 200 ISL | 200 OSL), mostly tested on 8 or 16 NVIDIA H200 GPUs, there are parts like Intelligent Inference Scheduling that have been tested and run on a single GPU.