vLLM

Today I learned how to use vLLM and studied its architecture. vLLM is an AI inference engine that is faster than the Hugging Face Transformers library thanks to features like PagedAttention.
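The core idea of PagedAttention is borrowed from virtual memory: instead of reserving one large contiguous KV-cache region per sequence, the cache is split into fixed-size blocks, and each sequence keeps a table of (possibly non-contiguous) block ids. Here's a toy sketch of that allocation scheme; the block size, class, and method names are mine for illustration and are not vLLM's actual internals.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class PagedKVCache:
    """Toy block allocator mimicking PagedAttention-style KV-cache management."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of unused block ids
        self.block_tables = {}  # seq_id -> list of block ids owned by the sequence

    def append_token(self, seq_id, num_tokens_so_far):
        """Grab a fresh block only when the sequence's current block is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:  # first token, or block just filled
            table.append(self.free_blocks.pop())
        return table

    def free_sequence(self, seq_id):
        """Return all blocks of a finished sequence to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for t in range(20):            # generate 20 tokens for sequence 0
    cache.append_token(0, t)
print(len(cache.block_tables[0]))  # 20 tokens fit in 2 blocks of 16
cache.free_sequence(0)
print(len(cache.free_blocks))      # all 8 blocks are free again
```

Because blocks are small and uniformly sized, memory isn't wasted on padding a sequence out to its maximum possible length, which is where most of the throughput win over naive contiguous KV caching comes from.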

I attempted to deploy a few small models to a Triton Inference Server with vLLM underneath.
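For reference, Triton's vLLM backend works off a standard model repository layout: a `config.pbtxt` that selects the vLLM backend, plus a `model.json` holding the vLLM engine arguments. The sketch below is illustrative; the model name and engine arguments are placeholders, not the ones I actually deployed.

```
model_repository/
└── my_vllm_model/
    ├── config.pbtxt      # backend: "vllm"
    └── 1/
        └── model.json    # vLLM engine args, e.g.:
                          # {
                          #   "model": "facebook/opt-125m",
                          #   "gpu_memory_utilization": 0.9
                          # }
```

Triton then instantiates a vLLM engine per model, so requests sent to the Triton endpoint are batched and scheduled by vLLM.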

It’s 11 PM so I can’t write much more about it, but it was riveting and I was entirely captivated. I even rented a GPU for this lol.