⚡ LLM Inference Optimization on Kubernetes

3,384 stars · 531 forks · Shell

ai · cncf · distributed-inference · gpu · inference · intelligent-routing · kubernetes · llm · model-server

This infrastructure project focuses on optimizing large language model (LLM) inference performance on Kubernetes. It aims to make full use of modern GPU accelerators through intelligent routing and distributed inference. When deploying large models to production, the hard part is not getting a single chat request to work but managing resource scheduling under high concurrency. As a project related to the CNCF ecosystem, it offers a foundation for teams that need to build their own model-serving infrastructure.
