Site Reliability Engineer

Sands Digital Services Opco
Full-time
Remote
United States

Job Description:

Position Overview

The primary responsibility of the Site Reliability Engineer (SRE) is to support applications, systems, and operations through the administration, configuration, troubleshooting, and automation of cloud hosting; to monitor and improve application performance; and to advance all service line objectives. In this role, the SRE will be responsible for the overall performance of our cloud applications and AWS cloud infrastructure. The SRE will work with the product engineering teams, the cloud architecture and engineering team, and the DevOps and DevSecOps teams.


All duties are to be performed in accordance with departmental and Las Vegas Sands Corp.’s policies, practices, and procedures. All Las Vegas Sands Corp. Team Members are expected to conduct and carry themselves in a professional manner at all times. Team Members are required to observe the Company’s standards, work requirements and rules of conduct.   

Essential Duties & Responsibilities

  • Monitor and enforce service-level agreements (SLAs) and service-level indicators (SLIs).

  • Handle and respond to service outages and interruptions. This includes troubleshooting, root cause analysis, and post-mortem reviews to prevent future incidents.

  • Monitor infrastructure and application performance to predict future system demands. This includes provisioning additional resources or optimizing the existing setup to handle the load.

  • Complete complex development, design, implementation, architecture design specification, and maintenance activities as needed.

  • Automate manual operations work, including the deployment of code and configuration changes.

  • Set up and maintain monitoring, logging, and alerting systems.

  • Monitor and analyze infrastructure costs to suggest ways to optimize and reduce unnecessary expenses.

  • Identify and remove bottlenecks in the system to improve performance. This might involve code optimizations, database tuning, or optimizing server configurations.

  • Build software and systems to manage platform infrastructure and applications

  • Provide operational support and engineering for multiple large distributed software applications

  • Fix issues escalated by the development team

  • Document technical systems

  • Document best practices, runbooks, and procedures for troubleshooting common issues.

  • Improve reliability, quality, and time-to-market of our suite of software solutions

  • Measure and optimize system performance, pushing our capabilities forward, getting ahead of product team needs, and innovating to continually improve

  • Work collaboratively with product & software engineering professionals to define infrastructure and deployment requirements.

  • Provision, configure and maintain cloud infrastructure defined as code.

  • Ensure that the infrastructure and applications meet security standards and comply with relevant regulations. This might involve regular security audits, patching, and vulnerability assessments.

  • Troubleshoot problems across a wide array of services and functional areas.

  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding

  • Partner with development teams to improve services through rigorous testing and release procedures

  • Create sustainable systems and services through automation and uplifts

  • Demonstrate strong organizational skills, a customer service focus, attention to detail, and process orientation

  • Distill and present information to senior leaders

  • Flexibly adapt to a changing environment

  • Perform job duties in a safe manner.

  • Attend work as scheduled on a consistent and regular basis.

  • Perform other related duties as assigned.

Minimum Qualifications

  • At least 21 years of age.

  • Proof of authorization to work in the United States.

  • Bachelor’s degree or equivalent in relevant discipline.

  • Must be able to obtain and maintain any certification or license, as required by law or policy.

  • 5 years of experience building and maintaining AWS infrastructure (VPC, EC2, Security Groups, IAM, ECS, CodeDeploy, CloudFront, S3)

  • Strong understanding of how to secure AWS environments and meet compliance requirements

  • Hands-on experience deploying and managing infrastructure with Terraform

  • Experience with Kubernetes, GitHub, Jenkins, and ELK, and with deploying applications on AWS

  • Ability to learn/use a wide variety of open-source technologies and tools

  • Ability to program (structured and object-oriented) in one or more high-level languages, with a strong preference for Go

  • Experience with distributed storage technologies such as NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)

  • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks

  • Strong bias for action and ownership

  • Strong interpersonal skills with the ability to communicate effectively and interact appropriately with management, other Team Members and outside contacts of different backgrounds and levels of experience.

  • Previous startup experience would be a huge plus

Physical Requirements

Must be able to:

  • Physically access assigned workspace areas with or without reasonable accommodation.

  • Work remotely as necessary

  • Work indoors and be exposed to various environmental factors such as, but not limited to, CRT, noise, and dust.

  • Utilize laptop and standard keyboard to perform essential functions of the job.