Senior DevOps Engineer (Cloud & On-Premises)

Logic Software Solutions

25 days ago

Full-time

On-site

New Jersey, United States

Job Title: Senior DevOps Engineer (Cloud & On-Premises) – Business Systems Development

Location: New Jersey, USA (Hybrid - typically 2 days in-office)
Salary Range: $160,000 - $190,000 + Performance Bonus + Comprehensive Benefits
Reporting To: Director of DevOps

The Opportunity

Are you a developer at heart who found a passion for infrastructure and automation? We are seeking a battle-hardened Senior DevOps Engineer for our Business Systems Development team, a role demanding 80-90% hands-on-keyboard technical work. This is not an advisory or architectural ivory tower position; it is for a "crack on" engineer who can step into our complex, hybrid environment and deliver from the first sprint.

You will own the end-to-end lifecycle of our Kubernetes ecosystems, spanning AWS EKS and on-premises data centers. Your primary mission is to build, secure, and optimize the platform that powers our trading and business applications. We operate in a fast-paced, regulated financial environment where stability, security, and performance are paramount. You will work elbow-to-elbow with InfoSec, Network Engineering, and application teams to drive our cloud-native roadmap, bringing a developer’s discipline and an operator’s resilience.

Given the seniority of the role, we explicitly seek a professional with at least 7 years of experience in a combined on-premises and cloud landscape, including significant tenure within the Financial Services sector (e.g., banking, capital markets, hedge funds, or fintech).

Why You?

You started your career writing code—Python, Java, Go, or Node.js—before pivoting into the infrastructure and automation space. This background is non-negotiable because you will be building against APIs, writing automation frameworks, and deeply understanding the application contracts that run on your clusters. You don't just throw containers at an orchestrator; you understand the runtime, the dependencies, and the network stack from the kernel up.

When you step in, you should be immediately capable of:

Refactoring a Helm chart to enforce new non-root security contexts.
Debugging a complex mTLS handshake failure between an on-prem Kafka cluster and an EKS-based microservice.
Writing a Python operator or a Jenkins shared library to automate a bespoke disaster-recovery process.

Key Responsibilities

Kubernetes Lifecycle & Application Onboarding (Primary 60%)

Own the end-to-end technical onboarding of applications onto on-premises K8s and AWS EKS.
Author and refactor Kubernetes app manifests using Kustomize and Helm charts, enforcing baseline policies and environment-specific overrides.
Implement and manage GitOps workflows using Flux and ArgoCD for declarative state management and drift reconciliation.
Conduct rigorous production-readiness reviews, enforcing checklist criteria around resource limits, liveness/readiness probes, pod disruption budgets, and topology spread constraints.

CI/CD & Developer Experience (15%)

Architect and maintain multi-branch Jenkins pipelines (scripted/declarative) for building secure container images.
Integrate with image registries (e.g., ECR, Harbor) and implement image-scanning gates.
Build internal tooling and automation using Python to streamline environment turn-ups, stress-test orchestration, and disaster-recovery runbooks.

Observability, Networking & Security (20%)

Deploy and tune the full observability stack: Prometheus (metrics), Grafana (dashboards), Splunk (logs), and Alertmanager for intelligent routing.
Configure, troubleshoot, and optimize Kubernetes networking: Ingress (NGINX/HAProxy), service mesh (Istio/Linkerd), and container networking (CNI plugins like Calico or Cilium).
Implement robust authentication and authorization workflows, integrating AWS IAM with EKS (IRSA) and centrally managing identity through Keycloak for on-prem workloads.
Master the TLS landscape: internal PKI, certificate renewals via cert-manager, and debugging mutual TLS across service-to-service communication.

Site Reliability Engineering (SRE) & Support (5%)

Define Service Level Indicators (SLIs) and error budgets in collaboration with application owners.
Participate in an on-call rota, leading complex incident triage, root cause analysis (RCA), and the implementation of permanent preventive fixes.
Champion a blameless post-mortem culture.

Detailed Technology Stack & Expected Proficiency

This section defines the specific tools and the depth of knowledge expected. Candidates must demonstrate expert-level, production-grade experience in the bolded categories.

1. Containerization & Orchestration (Core Platform)

Docker & containerd: Expert-level understanding of image layering, multi-stage builds, Docker BuildKit, and runtime security. You can debug a failing container start with nsenter.
Kubernetes Administration (EKS & On-Prem):
- On-Prem: Experience building or managing clusters deployed via kubeadm or kops on bare metal or VMs. Deep knowledge of etcd backup/restore, control plane upgrades, and node lifecycle.
- AWS EKS: Operating managed node groups, Fargate profiles, Cluster Autoscaler, and Karpenter.
- Scheduling: Expert use of taints/tolerations, node selectors, pod affinity/anti-affinity, topology spread constraints.
- RBAC & Security: Creating Roles, ClusterRoles, and Bindings; implementing Pod Security Standards (PSS) to restrict privileged containers.
- CRDs & Operators: Experience installing, managing, and potentially writing custom operators.
Container Storage:
- CSI Drivers: Managing AWS EBS CSI, EFS CSI, and on-prem storage solutions (e.g., Portworx, Longhorn, OpenEBS).
- Volume Management: Dynamic provisioning using StorageClasses, managing PersistentVolumes and PersistentVolumeClaims for stateful workloads.

2. Application Packaging & GitOps (Deployment Strategy)

Helm Charts: Building and maintaining complex Helm charts with _helpers.tpl, subcharts, and global values. Helm test hooks and helm-secrets integration.
Kustomize: Using overlays, patches, transformers, and generators to manage multi-environment variations without templating.
GitOps Agents:
- ArgoCD: Managing Applications, ApplicationSets, Projects, and RBAC. Proficiency with Sync Phases/Waves, hooks (PreSync, PostSync), and the Argo CD Image Updater.
- Flux CD: Creating Kustomizations, HelmReleases, and managing image automation with ImageRepositories and ImagePolicies.

3. Infrastructure as Code (IaC) & Cloud (AWS Heavy)

AWS CDK (Primary): Building stacks using Python or TypeScript. Authoring high-level L3 constructs for reusable patterns. Expert with aws-cdk-lib.
Compute: EC2 (Spot/Reserved Instances, Placement Groups), Lambda (building automation runtimes).
Networking: VPC design (public/private/database subnets, CIDR planning), Transit Gateway for hybrid connectivity, AWS Network Firewall, VPC Peering, and PrivateLink. Deep debugging of Security Groups vs. NACLs.
Load Balancing: Architecting Application Load Balancers (ALB) with path-based routing, WebSockets, and gRPC support; Network Load Balancers (NLB) for low-latency TCP/UDP traffic. In-depth knowledge of target groups, health checks, and SSL termination.
Storage & Database: S3 (lifecycle policies, event notifications, pre-signed URLs), RDS (Multi-AZ failover, read replicas, parameter groups), DynamoDB (RCU/WCU, global tables), ElastiCache.
Security & Identity: IAM – authoring complex, least-privilege policies; building permission boundaries and SCPs. IRSA (IAM Roles for Service Accounts) to map EKS service accounts to IAM roles. KMS multi-region key management.

4. CI/CD & Automation Engineering (The Pipeline)

Jenkins: Architecting Shared Libraries in Groovy to standardize CI steps. Implementing scripted and declarative pipelines, multi-branch scanning, and integrating with Active Directory/LDAP.
Image Registries & Security: AWS ECR (image scanning, lifecycle policies), Docker Hub, or Harbor. Integrating Trivy or Snyk into pipelines as gate mechanisms.
Scripting & Programming:
- Python (Expert): Primary language for automation. Using boto3 to interact with AWS, the kubernetes client to build operators/controllers, and requests for API integrations.
- Shell/Bash (Expert): Debugging init scripts, writing robust, idempotent automation scripts. jq, yq, awk, sed for in-line data manipulation.
- Go (Highly Desirable): Ability to write a simple but production-ready Kubernetes controller or a custom CLI tool using client-go.
- Java/Node.js (Desirable): Ability to read and reason about application code to assist in JVM tuning, dependency conflicts, and understanding the contract of the deployed services.

5. Observability & Data Streaming (The Nervous System)

Metrics & Dashboards:
- Prometheus: Writing PromQL queries for SLO calculations, managing ServiceMonitors and PodMonitors, configuring remote write to long-term storage (e.g., Thanos, Cortex, or Grafana Mimir).
- Grafana: Building and provisioning dashboards via ConfigMaps, integrating data sources, and setting up unified alerting.
Logging:
- Splunk: In-depth experience with SPL (Search Processing Language) for log aggregation, statistical analysis, and building dashboards/reports. Managing forwarders (UF/HF) and inputs.conf.
- AWS CloudWatch: Log Groups, Logs Insights APIs, Metric Filters, and Composite Alarms.
Alerting & Incident Response: Alertmanager configuration, routing trees, inhibition rules, and silencing. Integrating alerts with PagerDuty.
Streaming & Messaging (Desirable):
- Apache Kafka: Cluster management, topic partition strategy, consumer group lag debugging. Understanding of KRaft consensus vs. ZooKeeper. Working with the Strimzi operator on Kubernetes, KafkaTopics, and KafkaUsers.

6. Networking & Identity (The Secure Perimeter)

Ingress Controllers:
- NGINX Ingress: Expert with annotations for CORS, rewrite targets, server-snippets, and rate limiting. Load balancing algorithms (ewma, ip_hash).
- Service Mesh (Istio Preferred): Configuring VirtualServices, DestinationRules, Gateways, and PeerAuthentication for strict mTLS. Fault injection and circuit breaking.
Certificates & TLS:
- cert-manager: Automating Let's Encrypt or internal CA certificate issuance with ClusterIssuers and Certificate resources.
- mTLS: Deep understanding of X.509 certificate structure, trust chains, and debugging handshake failures using openssl s_client.
Identity Federation:
- Keycloak: Configuring Realms, Clients (OIDC/SAML), User Federation (LDAP/AD), and Identity Brokering.
- AWS IAM Identity Center: Mapping corporate identities to AWS permission sets.

Required Qualifications & Experience

Foundation: Bachelor’s degree in Computer Science, Engineering, or equivalent field.
Total Experience: 7+ years in technical roles, with a proven trajectory from application development into DevOps/Platform Engineering.
Domain: 3+ years of experience specifically within a Financial Services or other highly regulated institution (fintech, banking, insurance). You understand the security and audit rigor this demands.
Hands-on K8s: 4+ years of deep, hands-on experience managing production Kubernetes clusters on-prem and in AWS.
Linux Proficiency: You are comfortable deep in a Linux terminal, troubleshooting system calls, network interfaces, and kernel logs.
Communication: Strong written and verbal skills; able to articulate complex technical trade-offs and lead incident calls calmly and efficiently.
Work Authorization: Must be authorized to work in the United States without visa sponsorship.

Desired Bonuses (The "Unfair Advantages")

Experience migrating legacy monolithic Java/Spring applications into containerized microservices.
Active contributions to open-source projects, notably in the cloud-native landscape.
Experience with AWS CDK in Python.
Experience managing and tuning Apache Kafka clusters in production.

Compensation & Culture

Salary: Base salary of $160,000 - $190,000, commensurate with the depth of technical skill and financial sector expertise.
Bonus: Performance-based annual bonus.
Work Style: This is a Hybrid role based in our New Jersey office. We value the in-person collaboration for whiteboard sessions and incident response, balanced with focused remote work days.
Environment: Autonomy, high trust, and direct impact on business-critical infrastructure. No red tape—just a DevOps team that ships and supports elite-grade systems.
