Serving open models with vLLM
Hands-on guide to self-hosting open-weights LLMs with vLLM: install, serve an OpenAI-compatible API, quantize, benchmark, and manage VRAM.
Writing
9 posts on building with AI.
Hands-on guide to self-hosting open-weights LLMs with vLLM: install, serve an OpenAI-compatible API, quantize, benchmark, and manage VRAM.
Build a typed MCP server in Python, run it over stdio and HTTP, wire it into Cursor, and drive it from a custom client and an OpenAI-compatible model.
What I learned building a Whisper + LLaMA speech recognizer for Malay–English code-switching — where the WER actually came from, and what didn't help.
A hands-on, current (2026) path for taking a PyTorch CV model to ONNX and a TensorRT engine: export, parity, FP16/INT8 build, and latency gating.
Build agentic RAG in Python: hybrid retrieval as a tool, a bounded agent loop, sufficiency and grounding checks, against any OpenAI-compatible endpoint.
Building an end-to-end multi-agent system — support, recommendation, and pricing — with LangChain, AutoGen, FastAPI, Kafka, and Qdrant. What held up and what I'd change.
A hands-on QLoRA fine-tuning walkthrough: dataset prep, 4-bit training with peft and trl, merging, and vLLM serving behind an OpenAI-compatible API.
A layered, runnable approach to reliable structured LLM outputs: pydantic schemas, json_schema enforcement, bounded validate-and-retry, and constrained decoding.
Why combining CNNs, InceptionNeXt, and a Vision Transformer beat either alone for video deepfake detection — and why cross-dataset generalization is the metric that matters.