Mini-Attention
A personal research repo for high-performance GPU attention kernels. Implements Standard, Ring, and KV-Cached/Paged attention across PyTorch, Triton, and CUDA, with dedicated branches tuned for the RTX 3060, NVIDIA Blackwell (B200), and AMD MI300. Benchmarks single- and multi-GPU performance, throughput, and memory efficiency.
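For orientation, the baseline that the fused kernels are compared against is plain scaled dot-product attention. A minimal PyTorch sketch is below; the function name, shapes, and `causal` flag are illustrative, not the repo's actual API.

```python
import math
import torch

def standard_attention(q, k, v, causal=False):
    """Naive scaled dot-product attention (reference only, not a fused kernel).

    q, k, v: (batch, heads, seq_len, head_dim)
    """
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale      # (B, H, S, S)
    if causal:
        # Mask future positions so each query attends only to itself and earlier keys.
        mask = torch.triu(
            torch.ones(scores.shape[-2:], dtype=torch.bool, device=scores.device),
            diagonal=1,
        )
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)                             # (B, H, S, head_dim)

# Example: q = k = v = torch.randn(1, 8, 128, 64); out = standard_attention(q, k, v, causal=True)
```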
View on GitHub →