Description
Job Overview
The High-Performance Computing (HPC) Support Engineer will join a globally recognized customer success team that supports international organizations in their HPC and AI initiatives. The role combines hands-on technical engineering (advising customers and resolving their technical issues) with building and maintaining business relationships with key stakeholders at enterprise customers.
Key Responsibilities
- Oversee the management of high-performance computing (HPC) clusters and their associated storage systems, including large parallel file systems, NFS file servers, and underlying storage technologies[1].
- Perform Linux systems administration, including on-call duties; management of hardware, operating systems, and I/O; and installation and maintenance of the software environment[1].
- Support resource managers and schedulers, and ensure seamless client access to parallel and distributed file systems[1].
- Conduct in-depth analysis, testing, scripting, and benchmarking, working with advanced systems, data, and networks in a research and performance evaluation setting[1].
- Contribute to the design, installation, and upgrade of large-scale HPC clusters and storage resources[1].
- Integrate HPC systems into the network, cloud, and user environments, and help develop and implement security procedures to safeguard these systems[1].
- Manage performance testing at all levels, including CPU, memory, GPU, interconnect bandwidth, and file system IOPS[1].
- Deploy monitoring tools with appropriate alert mechanisms to ensure rapid incident detection and response, and collaborate with operations teams to streamline incident evaluation and resolution processes[1].
- Serve as a liaison between the Operations team and computational scientists, providing advanced technical guidance and supporting ongoing training efforts[1].
Technical Experience
- High-Performance Computing: Experience supporting users and managing HPC systems, including high-performance file systems such as Lustre, Ceph, and GPFS[1][3][5].
- Linux: Advanced knowledge of administering Linux systems, primarily Red Hat and its derivatives, including services, networking, and file systems[1].
- Infrastructure: Experience with cloud-native architecture, batch scheduling, and workload management[3].
- Networking: Strong knowledge of TCP/IP networking infrastructure and of interconnects such as InfiniBand, NVLink, and Ethernet[1].
- Programming Languages: Proficiency in languages such as Python, C, and C++[3][5].
Contract Details
- Contract Type: 18-month contract
- Rate: Competitive rate
- Location: Remote, UK-based
Qualifications
- Bachelor's degree in a related area, or equivalent experience or training. Professional technical engineering or technical programming experience, or a master's degree, is preferred[1].
- Advanced knowledge of systems integration, including deploying moderately complex integration solutions[1].
- Demonstrated experience with a major configuration management tool, including application packaging and installation[1].
- Strong, demonstrated programming and scripting experience, preferably in Bash and Python[1].
Additional Requirements
- Experience engaging with, and presenting to, technical stakeholders and executive leaders[3].
- Ability to work effectively in collaborative settings, often addressing undefined problems and making impactful recommendations[1].
- Strong communication skills and the ability to build and maintain business relationships with key stakeholders[3].