Senior Site Reliability Engineer, HPC

Recruiting From Scratch
Contract
Remote
Brazil, United States, and United Kingdom
$150,000 - $250,000 USD yearly

 

Recruiting from Scratch is a premier talent firm that focuses on placing the best product management, software, and hardware talent at innovative companies. Our team is 100% remote, and we work with teams across the United States to help them hire.

 



About the role



Our client is an AI cloud provider working with top AI companies like Poolside, Meta, Modal, and Reka. The HPC Engineer will be responsible for ensuring peak performance of GPU infrastructure and providing top-tier support to customers.


Your main responsibilities include deploying new clusters, automating processes to support scalability, and offering client-facing support for GPU debugging and performance optimization.






Responsibilities



• Deployment. We will be onboarding new clusters at least monthly. You will help take bare-metal servers and deploy them for our customers as high-performance compute as a service.

• Automation. Our GPU fleet is large and growing. You will help us automate many of our processes and systems so that Fluidstack can continue to scale.

• Support. This will be a client-facing role. You will work closely with our customers to make sure they can use our infrastructure to achieve their goals. You will work on everything from GPU debugging and Slurm management to training performance optimization.



 




Candidate requirements








  • 4+ years of experience

  • Experience with shared storage platforms such as NFS, DDN, VAST, CephFS, etc.

  • Experience provisioning large-scale clusters and networks with tools such as BCM or UFM.

  • Experience with large-scale GPU systems, working with NVIDIA GPUs and InfiniBand networks.

  • Experience with HPC systems, system administration, SRE, or DevOps.

  • Experience with large-scale workloads utilizing orchestrators like Slurm or Kubernetes.

  • Experience with automation of bare-metal machines and containers, using tools such as Ansible, Bash, or Python.





 






The salary range is $150,000 - $250,000 USD yearly.