Senior Performance Software Engineer

Microsoft
United States, Texas, Irving
7000 State Highway 161
Jul 25, 2025
Overview

We are looking for a Senior Software Engineer with deep expertise in High-Performance Computing (HPC) and Artificial Intelligence (AI) networking performance, particularly across InfiniBand-based GPU clusters. You will be a key technical leader focused on understanding, analyzing, and optimizing the performance of distributed workloads running at massive scale - often involving tens of thousands of GPUs interconnected via high-speed networks.

This role requires strong familiarity with Message Passing Interface (MPI), NVIDIA Collective Communications Library (NCCL), collective communication algorithms, and the underlying transport technologies (Remote Direct Memory Access (RDMA) over InfiniBand). You should have extensive experience with network-level debugging, topology-aware optimization, and low-latency, high-throughput communication tuning in Linux environments. If you enjoy solving hard problems at the convergence of network systems and distributed applications, we want to talk to you.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Diagnose and resolve communication-related performance bottlenecks across large-scale distributed systems.
Analyze InfiniBand fabric behavior, including congestion, credit starvation, route imbalance, and link-level errors.
Tune and debug Remote Direct Memory Access (RDMA) transport layers, message ordering, path migration, and Quality of Service (QoS) policies on multi-rail or multi-subnet fabrics.
Develop custom tooling or scripts to analyze fabric telemetry, traffic flows, and job-induced network behavior.
Collaborate with the operations team to ensure optimal use of fabric features (adaptive routing, congestion control, service level tuning).
Work closely with hardware vendors (e.g., NVIDIA) to evaluate and validate next-gen interconnects and switch firmware.
Support benchmarking of High-Performance Computing (HPC) / Artificial Intelligence (AI) workloads, including characterization of latency, bandwidth, and collective efficiency at scale.