ncompass.sglang.llama.attn()
Up to 3.5x Higher Throughput
Drop-in replacement for torch.nn with model-specific optimizations
Zero Migration
Integrates with your existing inference engine stack
Change one line of code in your existing framework and watch it speed up
vllm/model_executor/models/llama.py
@@ -60,0 +61,1 @@
+import ncompass
@@ -254,1 +255,1 @@
-self.mlp = LlamaMLP(
+self.mlp = ncompass.vllm.llama.MLP(
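To make the "one line" concrete, here is a minimal sketch of how the swapped constructor sits inside an abbreviated decoder layer. The argument names are illustrative and assume the ncompass class keeps the stock LlamaMLP's signature, which is what makes it a drop-in replacement.

import torch.nn as nn
import ncompass

class LlamaDecoderLayer(nn.Module):
    """Abbreviated sketch of a vLLM-style decoder layer (not the full class)."""
    def __init__(self, config):
        super().__init__()
        # One-line swap: ncompass.vllm.llama.MLP is assumed to accept the same
        # arguments as the stock LlamaMLP, so nothing else in the layer changes.
        self.mlp = ncompass.vllm.llama.MLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            hidden_act=config.hidden_act,
        )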
20% Throughput Improvement
Products
License our optimized GPU kernels
Tap into our library of optimized kernels to accelerate your deployment.
- Works on any AI model
- Bring your own custom AI model architecture
- Works on any hardware backend
Deploy a dedicated instance
Let us manage your AI inference infrastructure
- Dedicated: Deploy on our H100 GPUs
- On-prem: Deploy on your infrastructure
- We provide autoscaling, observability and more...
Try us out on our API
Run AI models without any rate limits
- Limited set of optimized open source models
- 40s cold starts for 8B and 70B models
- OpenAI-compatible API (see the sketch below)
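Because the interface is OpenAI-compatible, the standard OpenAI Python client can simply be pointed at the endpoint. A minimal sketch, assuming the usual chat-completions shape; the base URL, API key, and model id below are placeholders, not documented values.

from openai import OpenAI

# Point the standard OpenAI client at the nCompass endpoint.
# base_url, api_key, and model are placeholders for illustration only.
client = OpenAI(
    base_url="https://api.ncompass.example/v1",
    api_key="YOUR_NCOMPASS_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

For code already written against the OpenAI client, the only change should be the base_url and api_key.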