Job Description:-
- Provides support for a portfolio of technical solutions within a delivery channel, focusing on HPC, AI, and ML systems and software tools.
- Works with application, data, and infrastructure teams to produce optimal, high-level conceptual designs for projects. Supports enterprise-level solutions that integrate across applications, systems, and platforms.
- Manages changes in process, policy, and standards as they relate to the architecture and design principles.
- Researches and maintains knowledge in emerging technologies and solutions to solve business problems.
- Serves as a technical expert and critical resource across multiple disciplines.
Roles and Responsibilities:-
- Collaborates with internal stakeholders to understand future NVIDIA deployments in order to support project exigencies and improve DGX POD efficiency on a Kubernetes-based platform.
- Reviews application architecture and supports technical design sessions with architects and developers, including the creation of class models, sequence diagrams, component models, and design specifications.
- Creates project and application architecture deliverables that are consistent with architecture principles, standards, methodologies, and best practices. Researches and maintains knowledge in emerging technologies and possible application to the business. Designs and develops new tools to support Software Development Lifecycle (SDLC) processes.
- Serves as a liaison with the engineering team on required features, critical bugs, and testing of new functionality. Communicates the implications of architectural decisions, issues, and plans to business and IT leadership. Provides input to the development of project initiation documents, including objectives, scope, approach, and deliverables, when needed.
- Partners with ITS business representatives and business leaders to understand business drivers and critical needs. Ensures alignment between the business strategies and application technology roadmap while advising and consulting leadership on costs, benefits, and implementation requirements.
- Supports team initiatives across functions with application triage, performance engineering, and testing activities. Assists in the troubleshooting and triage of complex application issues. Provides support and guidance to development teams throughout the analysis, design, development, and testing processes. Resolves complex technical issues as needed to support solution development.
Requirements:-
- Bachelor’s degree in Computer Science (CS), Computer Engineering (CSEE), or a related STEM field, and/or equivalent professional experience.
- Strong experience supporting Linux, OS installation and automation (PXE, Kickstart, Ansible), networking, and storage.
- Strong experience with TCP/IP networking fundamentals, including ports, IP subnets, DNS, and routes.
- Expert programming/scripting skills in Linux Shell/CLI, Bash, Python, and Go.
- Strong understanding of CI/CD processes and deployment tools, including ArgoCD, Kubernetes, Helm, and Docker.
- Experience with resource management systems and job scheduling, including running and debugging parallel programs.
- Strong experience using Git and other version control systems.
- Experience supporting large-scale data management systems serving hundreds of users/data scientists.
- Experience with provisioning and configuration management tools such as Puppet, Ansible, Chef, and Terraform.
- Excellent critical thinking, verbal communication, and problem-solving skills.
Preferred Qualifications:-
- BS/MS in Computer Science (CS), Computer Engineering (CSEE), Electrical Engineering (EE), or a related STEM degree, with three or more years of experience supporting HPC- and AI-focused technologies.
- Familiarity with NVIDIA GPUs on Linux, High Performance Computing (HPC), InfiniBand, MPI, and RDMA technologies.
- Experience supporting AI data science projects and software tools in an HPC environment.
- Experience supporting modern deep learning software architectures and frameworks such as TensorFlow and PyTorch.
- Familiarity with supporting different cloud providers.
- Strong expertise in Agile methodology and supporting tools.
- Ability to effectively communicate and engage with AI engineering and data science teams.