In modern DevOps workflows, even the most mature CI/CD pipelines often fail for reasons unrelated to code: transient network glitches, dependency timeouts, flaky tests, or resource limits. Each of these failures halts deployment and forces engineers to manually restart or troubleshoot pipelines, slowing release velocity and wasting valuable time.
This project introduces a Self-Healing Serverless CI/CD Pipeline built on AWS Developer Tools and serverless services. The solution detects pipeline or build failures as they happen and automatically recovers from them without human intervention, reducing downtime, improving reliability, and taking a step toward autonomous DevOps.
Core AWS Services Used
AWS CodePipeline: Orchestrates the CI/CD process (Source → Build → Deploy).
AWS CodeBuild: Builds and tests application code using buildspec.yml.
Amazon EventBridge: Matches CodePipeline or CodeBuild failure events and routes them to the Doctor.
AWS Lambda (“Pipeline Doctor”): Analyzes the failure and executes automated recovery actions.
Amazon DynamoDB: Stores retry counts and incident data to avoid infinite loops.
Amazon SNS: Sends notifications to developers if automatic healing fails.
Amazon CloudWatch: Captures logs and metrics for observability.
How It Works
CodePipeline triggers automatically when a commit is pushed to the source repository (CodeCommit or GitHub).
CodeBuild runs build and test commands.
If a build or pipeline action fails, the failure event reaches EventBridge, where a rule matches it and invokes the Doctor.
Lambda (“Pipeline Doctor”) receives the event, classifies the failure, and executes a healing playbook, such as retrying the failed build, increasing the build timeout, or retrying the failed pipeline stage. The Doctor records the incident in DynamoDB and increments the retry count so that it cannot loop indefinitely.
If the retry succeeds, CodePipeline resumes automatically and continues to deploy.
If the retry limit is reached, SNS sends a notification to engineers for manual review.
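The decision step in this flow can be sketched as two small functions. This is a minimal illustration, not the code in the repository: the names classifyFailure and choosePlaybook, the failure classes, and the playbook labels are all assumptions made for this example.

```javascript
// Hypothetical decision step for the Doctor; every name here is illustrative.
const DEFAULT_MAX_RETRIES = 2; // mirrors the MAX_RETRIES value used in this guide

// Map an EventBridge failure event to a coarse failure class.
function classifyFailure(event) {
  const detail = event.detail || {};
  if (event.source === "aws.codebuild") {
    return detail["build-status"] === "TIMED_OUT" ? "BUILD_TIMEOUT" : "BUILD_FAILED";
  }
  if (event.source === "aws.codepipeline") return "STAGE_FAILED";
  return "UNKNOWN";
}

// Pick a playbook, or escalate to a human once the retry budget is spent.
function choosePlaybook(failureClass, retriesSoFar, maxRetries = DEFAULT_MAX_RETRIES) {
  if (failureClass === "UNKNOWN" || retriesSoFar >= maxRetries) return "NOTIFY";
  const playbooks = {
    BUILD_FAILED: "RETRY_BUILD",
    BUILD_TIMEOUT: "INCREASE_TIMEOUT_AND_RETRY",
    STAGE_FAILED: "RETRY_STAGE",
  };
  return playbooks[failureClass];
}

// Example: a timed-out build on its first failure.
const sample = { source: "aws.codebuild", detail: { "build-status": "TIMED_OUT" } };
console.log(choosePlaybook(classifyFailure(sample), 0));
```

Keeping classification separate from the retry check makes it easy to add a new failure class without touching the loop-prevention logic.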
Step 1:
Create the Deploy-target Lambda
Go to the AWS Console, search for Lambda, then choose Create function with the parameters below:
Name: SelfHealingDemoTarget
Runtime: Node.js 20.x
Then click Create.
Replace the default code with the contents of lambda/demo/handler.js from the repository at https://github.com/Greynalytics/Self-healing-cicd.git:
exports.handler = async (event) => {
  console.log("Deploy stage invoked with:", JSON.stringify(event));
  return { statusCode: 200, body: "Hello from Self-Healing Pipeline target!" };
};
Then click Deploy. This is just a harmless endpoint that your pipeline “deploys” to by invoking it during the Deploy stage.
Step 2:
Create the CodeBuild project
Go to the Console, search for CodeBuild, then click Create project with the details below:
Name: self-healing-build
Source: https://github.com/Greynalytics/Self-healing-cicd.git
Repository: self-healing-src
Branch: main
Environment: Managed image → Amazon Linux 2 → Standard 7.0
Service role: create new (default)
Buildspec: Use a buildspec file (it’s already in the repo), then click Save.
My repo includes a buildspec.yml that runs test.js, which fails about 40% of the time to simulate a flaky test, then zips the app/ folder into build.zip as an artifact.
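To make the flaky behaviour concrete, here is a minimal stand-in for such a test. The repository's test.js may be written differently; the 0.4 failure rate is taken from the roughly 40% figure in this guide, and the function name runFlakyTest is hypothetical.

```javascript
// Illustrative stand-in for a flaky test; not the repository's actual test.js.
const FAILURE_RATE = 0.4; // assumed from the ~40% figure in this guide

// `random` is injectable so the behaviour can be checked deterministically.
function runFlakyTest(random = Math.random) {
  const passed = random() >= FAILURE_RATE;
  return { passed, exitCode: passed ? 0 : 1 };
}

// In a real test script you would finish with something like:
//   process.exitCode = runFlakyTest().exitCode;
// A non-zero exit code is what makes CodeBuild mark the build as failed.
const result = runFlakyTest(() => 0.9); // a draw of 0.9 is above the failure rate
console.log(result.passed ? "Test passed" : "Simulated flaky failure");
```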
Step 3:
Create the CodePipeline
In the Console, search for CodePipeline and choose to build a custom pipeline, then click Create pipeline with the parameters below:
Name: SelfHealingPipeline
Source: GitHub
Build: CodeBuild (select self-healing-build)
Deploy stage: choose Add stage → + Add action group
Action provider: Lambda
Function name: SelfHealingDemoTarget
User parameters: { "msg": "deployed" } (optional)
Click Save, then Create pipeline, and let it run once. It may succeed or fail at random because of the flaky test, which is exactly what this demo needs.
Step 4:
Create the DynamoDB incidents table
Go to the Console, search for DynamoDB, then choose Create table with the parameters below:
Table name: SelfHealingIncidents
Partition key: incidentId (String)
Billing: On-demand (default)
Then click Create.
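As a rough idea of how the Doctor might use this table, the sketch below builds the parameters for an atomic retry-counter update with the DynamoDB DocumentClient's UpdateCommand. The key format, the attribute names retries and lastSeen, and both function names are assumptions for illustration; the repository may store incidents differently.

```javascript
// Sketch: key format and attribute names are illustrative assumptions.

// Build an incidentId that stays the same across retries of one failure.
// `stableKey` should not change between attempts: a pipeline execution ID or a
// commit ID works, while a CodeBuild build ID changes on every retry.
function incidentIdFor(kind, name, stableKey) {
  return `${kind}#${name}#${stableKey}`;
}

// Parameters for an UpdateCommand that atomically increments the retry
// counter, creating the item on first use (ADD treats a missing number as 0).
function buildRetryIncrement(tableName, incidentId, nowIso) {
  return {
    TableName: tableName,
    Key: { incidentId },
    UpdateExpression: "SET lastSeen = :now ADD retries :one",
    ExpressionAttributeValues: { ":now": nowIso, ":one": 1 },
    ReturnValues: "UPDATED_NEW", // the response then carries the new count
  };
}

const id = incidentIdFor("pipeline", "SelfHealingPipeline", "example-execution-id");
console.log(buildRetryIncrement("SelfHealingIncidents", id, new Date().toISOString()));
```

Incrementing with ADD in a single call avoids a read-then-write race when two failure events for the same incident arrive close together.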
Step 5:
Create an SNS topic for alerts
In the Console, search for SNS, open Topics, and create a topic with the parameters below:
Type: Standard
Name: SelfHealingAlerts
Then click Create.
Next, choose Add subscription → Email, enter your email address, and confirm via the email you receive.
Note: copy the Topic ARN; you’ll need it for the Doctor’s environment variables.
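For a sense of what the alert could look like, here is a small sketch that builds the parameters for an SNS publish call. The incident fields and the function name buildAlert are hypothetical; only TopicArn, Subject, and Message are actual SNS Publish parameters.

```javascript
// Sketch: the incident shape is an assumption made for this example.
function buildAlert(topicArn, incident) {
  // SNS email subjects must stay under 100 characters, so trim defensively.
  const subject = `Self-healing failed: ${incident.name}`.slice(0, 99);
  return {
    TopicArn: topicArn,
    Subject: subject,
    Message: JSON.stringify(incident, null, 2),
  };
}

const alert = buildAlert("arn:aws:sns:us-east-1:123456789012:SelfHealingAlerts", {
  name: "SelfHealingPipeline", // placeholder values throughout
  status: "UNHEALED",
  retries: 2,
});
console.log(alert.Subject);
```

The object returned here could be passed to a PublishCommand from the AWS SDK v3 SNS client.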
Step 6:
Create the “Pipeline Doctor” Lambda
Go to the Console → Lambda → Create function
Name: PipelineDoctor
Runtime: Node.js 20.x
Then click Create.
Open the new function, go to the Code tab, and replace the handler with the contents of lambda/doctor/index.js from the repository.
Environment variables (Configuration → Environment variables):
TABLE = SelfHealingIncidents
TOPIC_ARN = your SNS topic ARN
MAX_RETRIES = 2
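A function that reads these variables would usually validate them once at startup. The sketch below is one way to do that; the function name loadConfig and the fallback of 2 retries when MAX_RETRIES is unset are assumptions, not necessarily what the repository does.

```javascript
// Sketch: validate the Doctor's environment variables up front.
function loadConfig(env = process.env) {
  const missing = ["TABLE", "TOPIC_ARN"].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  // Assumed default: fall back to 2 retries when MAX_RETRIES is not set.
  const maxRetries = Number.parseInt(env.MAX_RETRIES ?? "2", 10);
  if (!Number.isInteger(maxRetries) || maxRetries < 0) {
    throw new Error("MAX_RETRIES must be a non-negative integer");
  }
  return { table: env.TABLE, topicArn: env.TOPIC_ARN, maxRetries };
}

// Example with placeholder values instead of the real process environment.
console.log(loadConfig({
  TABLE: "SelfHealingIncidents",
  TOPIC_ARN: "arn:aws:sns:us-east-1:123456789012:SelfHealingAlerts",
  MAX_RETRIES: "2",
}));
```

Failing fast on a missing variable is easier to debug than a retry loop that silently never stops because the limit parsed as NaN.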
Permissions (Execution role):
Open the function’s Configuration → Permissions → Execution role:
Attach AmazonDynamoDBFullAccess (demo speed; tighten later).
Attach AmazonSNSFullAccess (demo speed; tighten later).
Add inline policy named Doctor-CodeBuild-Access:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["codebuild:BatchGetBuilds", "codebuild:StartBuild", "codebuild:UpdateProject"],
      "Resource": "*"
    }
  ]
}
Add inline policy named Doctor-CodePipeline-Access:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["codepipeline:RetryStageExecution", "codepipeline:GetPipelineExecution"],
      "Resource": "*"
    }
  ]
}
(For production, restrict "Resource": "*" to your specific project and pipeline ARNs.)
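As a starting point for that tightening, the sketch below generates a scoped policy document from the names used in this guide. The region and account ID are placeholders, and the function name scopedDoctorPolicy is made up; check the ARN formats against your own resources before using the output.

```javascript
// Sketch: generate a least-privilege policy for the Doctor's role.
function scopedDoctorPolicy({ region, accountId, projectName, pipelineName }) {
  const projectArn = `arn:aws:codebuild:${region}:${accountId}:project/${projectName}`;
  const pipelineArn = `arn:aws:codepipeline:${region}:${accountId}:${pipelineName}`;
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["codebuild:BatchGetBuilds", "codebuild:StartBuild", "codebuild:UpdateProject"],
        Resource: projectArn,
      },
      {
        Effect: "Allow",
        Action: ["codepipeline:RetryStageExecution", "codepipeline:GetPipelineExecution"],
        // The pipeline itself plus its stages (stage ARNs are pipeline/stage).
        Resource: [pipelineArn, `${pipelineArn}/*`],
      },
    ],
  };
}

const policy = scopedDoctorPolicy({
  region: "us-east-1",       // placeholder
  accountId: "123456789012", // placeholder
  projectName: "self-healing-build",
  pipelineName: "SelfHealingPipeline",
});
console.log(JSON.stringify(policy, null, 2));
```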
Click Deploy on the Lambda code.
Step 7:
Wire EventBridge rules to the Doctor
In the Console, search for Amazon EventBridge, then click Rules and Create rule.
Rule A: CodeBuild failures/timeouts
Name: OnCodeBuildFail
Event pattern → Pre-defined pattern by service:
Service: CodeBuild
Event type: CodeBuild Build State Change
Add pattern filter:
detail.build-status IN FAILED, TIMED_OUT
Target: Lambda function → PipelineDoctor
Then click Create rule.
The console steps are: define the EventBridge rule, build the event pattern, and select the target. The resulting event pattern for CodeBuild failures is:
{
  "source": ["aws.codebuild"],
  "detail-type": ["CodeBuild Build State Change"],
  "detail": { "build-status": ["FAILED", "TIMED_OUT"] }
}
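Before relying on the rule, it can help to check locally which events the CodeBuild pattern would accept. The matcher below is a deliberately tiny sketch that only understands the two features this pattern uses (nested objects and lists of exact values); real EventBridge patterns support much more, and the sample events are trimmed-down assumptions rather than full payloads.

```javascript
// Minimal local matcher: nested objects and arrays of allowed exact values only.
function matchesPattern(pattern, event) {
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = event == null ? undefined : event[key];
    if (Array.isArray(expected)) return expected.includes(actual);
    return typeof actual === "object" && actual !== null && matchesPattern(expected, actual);
  });
}

const codeBuildFailurePattern = {
  source: ["aws.codebuild"],
  "detail-type": ["CodeBuild Build State Change"],
  detail: { "build-status": ["FAILED", "TIMED_OUT"] },
};

// Trimmed sample events; real events carry many more fields.
const failedBuild = {
  source: "aws.codebuild",
  "detail-type": "CodeBuild Build State Change",
  detail: { "build-status": "FAILED", "project-name": "self-healing-build" },
};
const goodBuild = { ...failedBuild, detail: { "build-status": "SUCCEEDED" } };

console.log(matchesPattern(codeBuildFailurePattern, failedBuild)); // true
console.log(matchesPattern(codeBuildFailurePattern, goodBuild));   // false
```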
Rule B: CodePipeline action failures
Name: OnPipelineActionFail
Event pattern → Pre-defined pattern by service:
Service: CodePipeline
Event type: CodePipeline Action Execution State Change
Add pattern filter:
detail.state = FAILED
Target: Lambda function → PipelineDoctor
Then click Create rule.
The resulting event pattern for CodePipeline action failures is:
{
  "source": ["aws.codepipeline"],
  "detail-type": ["CodePipeline Action Execution State Change"],
  "detail": { "state": ["FAILED"] }
}
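Once an action-failure event arrives, the Doctor needs to turn it into a stage retry. The sketch below maps the event's detail fields onto the input of CodePipeline's RetryStageExecution API. The function name buildRetryStageInput is hypothetical, the sample event is trimmed and uses a placeholder execution ID, and the repository's Doctor may do this differently.

```javascript
// Sketch: translate a CodePipeline action-failure event into the input for
// RetryStageExecution (pipelineName, stageName, pipelineExecutionId, retryMode).
function buildRetryStageInput(event) {
  const detail = event.detail || {};
  if (event.source !== "aws.codepipeline" || detail.state !== "FAILED") return null;
  return {
    pipelineName: detail.pipeline,
    stageName: detail.stage,
    pipelineExecutionId: detail["execution-id"],
    retryMode: "FAILED_ACTIONS", // rerun only the actions that failed
  };
}

// Trimmed sample event with placeholder values.
const actionFailed = {
  source: "aws.codepipeline",
  "detail-type": "CodePipeline Action Execution State Change",
  detail: {
    pipeline: "SelfHealingPipeline",
    "execution-id": "example-execution-id",
    stage: "Build",
    action: "Build",
    state: "FAILED",
  },
};
console.log(buildRetryStageInput(actionFailed));
```

A failed Build action will typically show up under both rules (once from CodeBuild, once from CodePipeline), so it is worth having the Doctor treat the two events as one incident rather than retrying twice.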
Step 8:
Test the self-healing flow
Push a new commit to main in pipeline-src/.
Watch CodePipeline run.
If the build fails (flaky test.js), EventBridge triggers the Pipeline Doctor.
The Doctor checks the retry count in DynamoDB, then retries the build (and may raise the timeout).
If the retry passes, the pipeline turns green: it has self-healed.
If the retries are exhausted, you get an SNS email and the incident is marked UNHEALED.
Open CloudWatch Logs for PipelineDoctor to see which playbook ran.
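To get a feel for the outcomes you should expect while testing, the flow can be simulated without any AWS calls. This is a toy model under stated assumptions: each array entry is the result of one build attempt, the limit of 2 matches MAX_RETRIES in this guide, and the status labels other than UNHEALED are invented for the example.

```javascript
// Toy simulation of the healing loop; no AWS calls involved.
// `attempts` lists build outcomes in order: true = passed, false = failed.
function simulateHealing(attempts, maxRetries = 2) {
  let retries = 0;
  for (const passed of attempts) {
    if (passed) return { status: retries === 0 ? "SUCCEEDED" : "HEALED", retries };
    if (retries >= maxRetries) return { status: "UNHEALED", retries };
    retries += 1; // the Doctor schedules another attempt
  }
  return { status: "IN_PROGRESS", retries }; // waiting on the next attempt
}

console.log(simulateHealing([true]));                // first run passes
console.log(simulateHealing([false, true]));         // one retry heals it
console.log(simulateHealing([false, false, false])); // budget exhausted
```

If each attempt fails independently about 40% of the time, all three attempts fail in roughly 0.4 × 0.4 × 0.4 = 6.4% of runs, so expect to push several commits before you see the UNHEALED email.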