⚡ vMLX: Hardcore MLX Model Optimization

515 stars · 64 forks · Python

Tags: anthropic-api, kvcache-compression, kvcache-optimization, kvcache-reuse, llm, lmstudio, macbook, mcp-server, mlx, mlxllm, mlxstudio, omlx

This project is highly technical, focusing on low-level optimizations for the MLX framework on Apple Silicon. It adds an L1 paged cache aimed at very fast time-to-first-token (TTFT) and an L2 disk cache that survives restarts, alongside a hybrid SSM scheduler and continuous batching. In short, it targets the memory and performance bottlenecks of running large models locally. For developers experimenting with local LLM inference on MacBooks, this kind of low-level KV cache compression and reuse is worth a look. It acts less like a simple runner and more like a performance lab tailored to one specific hardware architecture.
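To make the two-tier idea concrete, here is a minimal sketch of prefix-based KV cache reuse with an in-memory L1 and an on-disk L2. It is not vMLX's implementation: the names `TwoTierCache`, `block_keys`, and `BLOCK` are hypothetical, the page size is tiny for readability, and placeholder strings stand in for real KV tensors. It assumes a write-through policy so that every page is on disk and a restarted process can still find it.

```python
import hashlib
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path

BLOCK = 4  # tokens per page (hypothetical; chosen small for the example)


def block_keys(tokens):
    """Chain-hash each full page so a key identifies the whole prefix up to it."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(h.hexdigest())
    return keys


class TwoTierCache:
    """L1: small in-memory LRU of pages. L2: one file per page on disk."""

    def __init__(self, disk_dir, l1_capacity=2):
        self.l1 = OrderedDict()
        self.l1_capacity = l1_capacity
        self.disk = Path(disk_dir)
        self.disk.mkdir(parents=True, exist_ok=True)

    def _admit(self, key, value):
        # Insert into L1 and evict least recently used pages beyond capacity.
        self.l1[key] = value
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)

    def put(self, key, value):
        # Write-through: the page is persisted before it can be evicted from L1.
        (self.disk / key).write_bytes(pickle.dumps(value))
        self._admit(key, value)

    def get(self, key):
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key]
        path = self.disk / key
        if path.exists():
            value = pickle.loads(path.read_bytes())
            self._admit(key, value)  # promote the page back into L1
            return value
        return None

    def longest_prefix(self, tokens):
        """Return how many leading tokens already have cached pages."""
        hit = 0
        for n, key in enumerate(block_keys(tokens), start=1):
            if self.get(key) is None:
                break
            hit = n * BLOCK
        return hit


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        cache = TwoTierCache(d, l1_capacity=1)
        prompt = [1, 2, 3, 4, 5, 6, 7, 8, 9]
        for i, key in enumerate(block_keys(prompt)):
            cache.put(key, f"kv-page-{i}")  # placeholder for real KV tensors

        # A fresh instance simulates a restart: L1 is empty, L2 is still on disk.
        restarted = TwoTierCache(d, l1_capacity=1)
        print(restarted.longest_prefix([1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]))
```

In this sketch, a new prompt that shares its first eight tokens with an earlier one only needs prefill for the tokens after the cached prefix, which is the general reason prefix reuse can shorten TTFT. Chaining the hash across pages means a page key matches only when every preceding page matches too.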
