DevOps Engineer – StackOps (Incident Management)

TalentPulse

On-site · Contract · SGD 85,000 – SGD 90,000 · Data & Analytics · 6+ yrs experience

About the role

We are seeking a highly technical DevOps Engineer to join our Central Enablement Team. In this role, you will be the core builder of our enterprise "Resiliency-as-a-Service" platform. Operating within the Resiliency Programme, your mission is to design, code, and deploy the automated infrastructure, toolchains, and workflows that Whole-of-Government (WOG) engineering teams rely on to manage incidents and minimize downtime. You will write the integrations that connect disparate observability tools to our incident management product and centralized intelligence hub, and build the automation that accelerates recovery across the organization.

Key Responsibilities

1. Engineering "Resiliency-as-Code"
• Build, maintain, and scale the platform using Infrastructure-as-Code (Pulumi/Terraform) so that teams can deploy standardized monitoring and incident workflows with a single click.
• Develop self-service onboarding portals and APIs that allow distributed engineering teams to easily hook their applications into the central resiliency framework.

2. Building Incident Management Integrations
• Develop and maintain robust API integrations and webhooks between specialized observability platforms (Elasticsearch, AWS/Azure cloud-native tools, Dynatrace) and our central IT Service Management system (Jira Service Management, JSM).
• Code the automation that routes alerts, enriches payloads with dependency metadata, and triggers specific JSM workflows without manual intervention.
• Develop and manage automation tools that enable system integrations into the central Elastic intelligence hub.

3. AI Enablement & Auto-Remediation
• Implement and configure AIOps capabilities within the observability pipeline to assist with Root Cause Analysis (RCA) and anomaly detection.
• Write complex automation scripts (Python, Go, or Node.js) to execute automated runbooks and self-healing tasks (auto-remediation) triggered by specific alerts or AI outputs.
• Integrate AI-driven "Incident Scribe" features into JSM to automatically summarize incident timelines and metrics for Post-Incident Reviews (PIRs).

4. Platform Reliability & CI/CD
• Ensure the high availability and performance of the central observability and incident management pipelines.
• Build and maintain CI/CD pipelines to test and deploy configuration changes to the StackOps architecture seamlessly.
• Work closely with the Product Lead to translate resiliency strategies into scalable, technical deliverables.
• Provide consultation and support to platform users, and develop solutions and enhance platform offerings based on user challenges.

Qualifications & Requirements
• Experience: 3-5+ years in DevOps, Software Engineering, or Site Reliability Engineering (SRE), ideally within a central platform team or Internal Developer Platform (IDP) environment.
• Programming & Automation: Strong coding skills in Python, Go, or Node.js, specifically for writing automation scripts, interacting with REST APIs, and building custom integrations.
• Infrastructure-as-Code: Deep, hands-on experience with Terraform, Pulumi, Ansible, or similar IaC tools within large-scale AWS or Azure environments.
• Observability Tooling: Practical experience configuring and managing telemetry pipelines using OpenTelemetry, Elastic Stack, Dynatrace, or Datadog.
• ITSM Integration: Experience working with the APIs of Jira Service Management (JSM), ServiceNow, or PagerDuty to build automated alerting and ticketing workflows.
