Master GPU Acceleration with Custom Triton Kernels: From Basics to a High-Performance Fused Softmax Implementation in PyTorch