What we do

At nCompass, we’re building AI inference serving software that can reduce the costs of serving AI models at scale by 50%.

When a high number of concurrent requests hits state-of-the-art serving systems such as vLLM, response times on a single GPU degrade dramatically. Currently, the only solution is to scale up the number of GPUs, which is expensive.

We’ve built custom AI inference serving software that maintains a high quality of service on fewer GPUs. This lets us provide you with an API for open source models that has no rate limits.

Improve AI model responsiveness by up to 4x

Compared to state-of-the-art serving engines such as vLLM, we improve time to first token (TTFT) by up to 4x at the same request concurrency.
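TTFT is easy to check for yourself: it is the time from sending a request to receiving the first streamed token. The sketch below is purely illustrative and uses a generic OpenAI-compatible streaming endpoint as a stand-in; the URL, API key, and model name are placeholders, not our actual API details.

```python
import time
import requests

# Placeholder values; substitute the real endpoint, key, and model name.
API_URL = "https://api.example.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

def measure_ttft(prompt: str, model: str = "llama-3.1-8b-instruct") -> float:
    """Return seconds from sending the request until the first streamed chunk arrives."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    start = time.perf_counter()
    with requests.post(API_URL, json=payload, headers=headers, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty server-sent-event line approximates the first token
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any tokens arrived")

if __name__ == "__main__":
    print(f"TTFT: {measure_ttft('Hello!'):.3f}s")
```

Running a function like this from many concurrent threads or processes against different serving stacks is how TTFT-versus-concurrency comparisons like the one above are typically made.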

Reduce AI GPU infrastructure bills by 50%

Our hardware-aware request scheduler and Kubernetes autoscaler enable us to maintain good quality-of-service metrics on 50% fewer GPUs than alternatives.
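The internals of our scheduler and autoscaler aren’t covered here, but the toy sketch below illustrates the general idea behind quality-of-service-driven scaling: add GPU replicas when a latency target is violated and reclaim them when there is headroom. It is a simplified illustration, not our actual implementation.

```python
# Illustrative only: a toy QoS-driven scaling rule, not the production scheduler or autoscaler.
def desired_replicas(current_replicas: int, p95_ttft_s: float, ttft_target_s: float = 1.0,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Scale GPU replicas up when p95 TTFT exceeds the target, down when there is ample headroom."""
    if p95_ttft_s > ttft_target_s:
        proposed = current_replicas + 1   # QoS violated: add a replica
    elif p95_ttft_s < 0.5 * ttft_target_s:
        proposed = current_replicas - 1   # plenty of headroom: reclaim a replica
    else:
        proposed = current_replicas       # within the target band: hold steady
    return max(min_replicas, min(max_replicas, proposed))
```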

How can you use us?

We currently expose our optimizations via our own API here! You get $100 of credit when you sign up to send requests. You can find a list of supported models here.

Our cold start times for both 8B and 70B models are about 40s. Because we keep the costs of hosting models low, we do not apply rate limits, so you get a reliable API for using open source models in production environments.
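The exact request format is documented at the links above. Purely as an illustration, a first request through an OpenAI-compatible client could look like the sketch below; the base URL, API key, and model identifier are placeholders.

```python
from openai import OpenAI

# Placeholder values; take the real base URL, key, and model name from the docs linked above.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what nCompass does in one sentence."}],
)
print(response.choices[0].message.content)
```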

Contact us if you would like to use our optimizations as an on-prem solution.