Since 1996, epay, a business segment of Euronet, has been at the center of connecting local and global brands to consumers. Our capabilities, platforms, products, and solutions cater to the changing consumer demand for content and payments in categories such as mobile, gaming, and entertainment. We are dedicated to developing new distribution capabilities that serve customers' changing needs through our retailer network, helping our brand partners meet consumers where they shop: in physical stores, online, via mobile devices or wallets and through ATMs.
We currently have an opening in Las Vegas, NV for a Site Reliability Engineer Manager working with one of our strategic Fintech partners.
In this role, you will lead a small team of talented engineers dedicated to designing, implementing, and maintaining highly available, scalable, and secure systems in an exciting, fast-paced startup environment. Your responsibilities will include driving the adoption of SRE principles, overseeing the execution of SRE initiatives, collaborating with cross-regional teams, and contributing to the strategic direction of our IT infrastructure and services. The ideal candidate has a hands-on approach, communicates clearly and consistently, and demonstrates strong proficiency in Linux, Kubernetes, CI/CD, and cloud computing.
Responsibilities:
· Lead the team in designing, implementing, and maintaining highly available (99.98%), scalable, and secure systems.
· Develop and implement operational processes and procedures to ensure smooth IT infrastructure and service operations.
· Collaborate with cross-regional teams to implement best practices for building, deploying, and monitoring software systems.
· Stay calm under pressure.
· Manage major incidents through to mitigation and resolution, conduct post-incident reviews of all major incidents, and determine the action items needed to prevent similar issues and minimize downtime in future incidents.
· Define and track key performance indicators (KPIs, SLIs, SLAs, SLOs) to measure operational effectiveness.
· Monitor, analyze, and optimize system performance, capacity, and resource utilization.
· Manage budgets and resources effectively.
· Identify and implement continuous improvement initiatives to increase efficiency and reduce risks.
· Lead incident response activities, performing root cause analysis and implementing preventative solutions.
· Drive the development and implementation of automation solutions to streamline operations and reduce manual workloads.
Team Management and Leadership:
· Manage a team of Site Reliability / DevOps Engineers, including hiring, evaluating, training, and developing team members.
· Build a collaborative and productive team culture.
· Own and maintain the company's cloud infrastructure strategy and SRE team roadmap.
· Evaluate and improve SRE processes and procedures.
· Provide technical expertise by collaborating with stakeholders to make high-level decisions and provide technical direction to team members.
· Participate in deep system design and implementation discussions to ensure high-quality systems are built. Work closely with our Software Development and Engineering teams on platforms before they go live, building reliable, production-ready services and applications.
· Provide rotational on-call support, in which you will detect, triage, respond to, and resolve production incidents.
Qualifications:
· Bachelor's degree in Computer Science, Engineering, or a related discipline.
· Over a decade of experience in IT, including at least two years in a leadership capacity.
· Strong technical background in cloud computing, networking, security, and automation.
· Excellent leadership, communication, and interpersonal skills.
· Bachelor's degree in a related field, or equivalent experience, required.
· Strong knowledge of Linux and Windows operating systems and environments.
· Strong knowledge of networking, load balancers, DNS, NTP, and TCP/IP.
· Strong knowledge of GCP, including Google Cloud components such as Cloud Load Balancing, Cloud Run, GKE, Compute Engine, VPC, Cloud Storage, and Cloud SQL, or equivalent experience with another cloud such as AWS or Azure.
· Experience with containers and container orchestration.
· Experience with infrastructure automation tools such as Terraform, Ansible, or Puppet/Foreman.
· Experience with web servers such as Apache and Nginx.
· Proficiency in the design principles for monitoring and alerting systems.
· Experience with monitoring tools such as Zabbix, Nagios, Icinga, SolarWinds, New Relic, and Grafana.
· Solid scripting skills, with experience in Shell, Bash, Ansible, Python, PowerShell, or Ruby.
· Experience setting up CI/CD pipelines (GitLab or Azure DevOps).
· A willingness to learn on the job and take on tasks as needed.
Additional Desired Experience:
· Certifications such as AWS Certified DevOps Engineer or Google Professional Cloud DevOps Engineer are a plus.
· Experience with one or more of the following F5 products: LTM, AWAF, GTM, AFM, BIG-IQ.
· Experience with one or more of the following big data technologies: ELK, Beats, Kafka, Redis, Search Guard.
· Familiarity with ITIL frameworks and best practices for IT service management.
We are an Equal Opportunity Employer, and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability status, genetic information, protected veteran status, or any other characteristic protected by law.