
LLM Inference Optimization on Kubernetes

3,384 stars, 531 forks, Shell
ai, cncf, distributed-inference, gpu, inference, intelligent-routing, kubernetes, llm, model-server
This infrastructure project optimizes large language model (LLM) inference performance on Kubernetes. It aims to keep modern GPU accelerators fully utilized through intelligent routing and distributed inference. When deploying large models to production, the hard part is not getting a single chat session to work but scheduling resources under high concurrency. The project is related to the CNCF ecosystem and is aimed at teams that need to build their own model-serving infrastructure.