Site Reliability Engineer

Full-time

Remote

United States

Job Description:

Position Overview

The Site Reliability Engineer (SRE) supports the applications, systems, and operations behind our cloud hosting. This covers administration, configuration, troubleshooting, and automation, along with monitoring and improving application performance and advancing all service line objectives. In this role, the SRE is responsible for the overall performance of our cloud applications and AWS cloud infrastructure. The SRE works with the product engineering teams, the cloud architecture and engineering team, and the DevOps and DevSecOps teams.

All duties are to be performed in accordance with departmental and Las Vegas Sands Corp.’s policies, practices, and procedures. All Las Vegas Sands Corp. Team Members are expected to conduct and carry themselves in a professional manner at all times. Team Members are required to observe the Company’s standards, work requirements and rules of conduct.

Essential Duties & Responsibilities

Monitor and enforce service-level agreements (SLAs) and service-level indicators (SLIs).
Handle and respond to service outages and interruptions. This includes troubleshooting, root cause analysis, and post-mortem reviews to prevent future incidents.
Monitor infrastructure and application performance to predict future system demands. This includes provisioning additional resources or optimizing the existing setup to handle the load.
Complete complex development, design, implementation, architecture design specification, and maintenance activities as needed.
Automate manual operations work, including the deployment of code and configuration changes.
Set up and maintain monitoring, logging, and alerting systems.
Monitor and analyze infrastructure costs to suggest ways to optimize and reduce unnecessary expenses.
Identify and remove bottlenecks in the system to improve performance. This might involve code optimizations, database tuning, or optimizing server configurations.
Build software and systems to manage platform infrastructure and applications.
Provide operational support and engineering for multiple large distributed software applications.
Fix issues escalated by the development team.
Document technical systems.
Document best practices, runbooks, and procedures for troubleshooting common issues.
Improve reliability, quality, and time-to-market of our suite of software solutions
Measure and optimize system performance, pushing our capabilities forward, getting ahead of product team needs, and innovating to continually improve
Work collaboratively with product & software engineering professionals to define infrastructure and deployment requirements.
Provision, configure and maintain cloud infrastructure defined as code.
Ensure that the infrastructure and applications meet security standards and comply with relevant regulations. This might involve regular security audits, patching, and vulnerability assessments.
Troubleshoot problems across a wide array of services and functional areas.
Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
Partner with development teams to improve services through rigorous testing and release procedures
Create sustainable systems and services through automation and uplifts
Demonstrate strong organizational skills, customer service focus, attention to detail, and process orientation.
Distill and present information to senior leaders.
Adapt flexibly to a changing environment.
Perform job duties in a safe manner.
Attend work as scheduled on a consistent and regular basis.
Perform other related duties as assigned.

Minimum Qualifications

At least 21 years of age.
Proof of authorization to work in the United States.
Bachelor’s degree or equivalent in relevant discipline.
Must be able to obtain and maintain any certification or license, as required by law or policy.
5 years of experience building and maintaining AWS infrastructure (VPC, EC2, Security Groups, IAM, ECS, CodeDeploy, CloudFront, S3)
Strong understanding of how to secure AWS environments and meet compliance requirements
Hands-on experience deploying and managing infrastructure with Terraform
Experience with Kubernetes, GitHub, Jenkins, ELK and deploying applications on AWS
Ability to learn/use a wide variety of open-source technologies and tools
Ability to program (structured and OO) in one or more high-level languages, with a strong preference for Go (Golang)
Experience with distributed storage technologies such as NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
Strong bias for action and ownership
Strong interpersonal skills with the ability to communicate effectively and interact appropriately with management, other Team Members and outside contacts of different backgrounds and levels of experience.
Previous startup experience would be a huge plus

Physical Requirements

Must be able to:

Physically access assigned workspace areas with or without reasonable accommodation.
Work remotely as necessary.
Work indoors and be exposed to various environmental factors such as, but not limited to, CRT, noise, and dust.
Utilize laptop and standard keyboard to perform essential functions of the job.
