🚀 New Features
⚡ Performance Improvements
- DeepSeek-R1 Q4 decoding @ 7.5 tokens/s
Measured on a single-socket Xeon + DDR5 4800 MT/s + A770 platform; enabling dual-NUMA delivers additional speedups.
- Easy benchmarking
Try it yourself with the local_chat script to see these gains firsthand.
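
For context on how a decoding figure like 7.5 tokens/s is typically derived (about 133 ms per generated token), here is a minimal timing sketch. It is not the project's benchmark code: `decode_tokens_per_second` and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for a real streaming decode loop.

```python
import time
from typing import Callable, Iterable, Iterator


def decode_tokens_per_second(generate: Callable[[], Iterable[str]]) -> float:
    """Consume a token stream and return decoded tokens per second."""
    start = time.perf_counter()
    count = sum(1 for _ in generate())  # drain the stream, counting tokens
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else 0.0


def fake_generate(n_tokens: int = 15, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model's decode loop (assumption, not a real API)."""
    for i in range(n_tokens):
        time.sleep(delay_s)  # simulate per-token decode latency
        yield f"tok{i}"


if __name__ == "__main__":
    rate = decode_tokens_per_second(fake_generate)
    print(f"{rate:.1f} tokens/s")
```

When comparing against a published number, time only the decode phase (exclude model loading and prompt prefill), since those are usually reported separately.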
🔜 What’s Next
- Balance_serve integration
We’re working to seamlessly merge Intel GPU operators into the balance_serve backend for end-to-end support and streamlined maintenance.