Commit 950b460

docs: add Day 2 Operations workshop on Drift Detection and Remediation
1 parent e189ae6 commit 950b460

1 file changed

+6
-283
lines changed
  • content/blog/day-2-operations-drift-detection-and-remediation

content/blog/day-2-operations-drift-detection-and-remediation/index.md

Lines changed: 6 additions & 283 deletions
@@ -196,263 +196,6 @@ Detecting drift is only valuable if the right people know about it. Pulumi's [we
196196

197197
You can route notifications wherever your team actually pays attention. Send alerts directly to Slack channels where your ops team congregates. Push notifications to Microsoft Teams if that's your collaboration platform. Use custom webhooks to integrate with PagerDuty, Datadog, or any other monitoring system. You can even use deployment triggers to automatically run dependent stacks when drift is detected and fixed.
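For custom webhook targets, a small translation layer usually sits between the drift event and the chat message. Here is a minimal Python sketch of that step. The event keys (`organization`, `project`, `stack`, `changes`) and the helper name `build_slack_alert` are hypothetical placeholders, not Pulumi's actual webhook schema; the only external contract assumed is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json


def build_slack_alert(event: dict, max_items: int = 5) -> dict:
    """Turn a drift event into a Slack incoming-webhook payload.

    The keys read from `event` are hypothetical; adapt them to whatever
    your webhook source actually sends.
    """
    stack = f'{event["organization"]}/{event["project"]}/{event["stack"]}'
    changes = event.get("changes", [])

    # Headline first, then a capped list of changed resources.
    lines = [f"⚠️ Drift detected in {stack}: {len(changes)} resource(s) changed"]
    lines += [f"• {name}" for name in changes[:max_items]]
    if len(changes) > max_items:
        lines.append(f"… and {len(changes) - max_items} more")

    return {"text": "\n".join(lines)}


example_event = {
    "organization": "my-org",
    "project": "core-infrastructure",
    "stack": "production",
    "changes": ["web-security-group", "api-instance"],
}
print(json.dumps(build_slack_alert(example_event), ensure_ascii=False, indent=2))
```

Capping the list keeps a large drift report from flooding the channel; the full details stay in the drift run itself.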
198198

199-
## Programmatic Drift Detection with Pulumi Service Provider
200-
201-
While the Pulumi Console provides a user-friendly interface for configuring drift detection, many teams prefer to manage everything as code. The [Pulumi Service Provider](/registry/packages/pulumiservice) brings infrastructure-as-code principles to drift detection itself, allowing you to define and manage drift schedules programmatically.
202-
203-
### Creating Drift Schedules with Code
204-
205-
Instead of clicking through UI screens, you can define your drift detection schedules in the same code that defines your infrastructure. This approach ensures your drift detection configuration is version-controlled, reviewable, and reproducible:
206-
207-
{{< chooser language "typescript,python,go,csharp,java,yaml" >}}
208-
209-
{{% choosable language typescript %}}
210-
211-
```typescript
import * as pulumi from "@pulumi/pulumi";
import * as pulumiservice from "@pulumi/pulumiservice";

const driftSchedule = new pulumiservice.DriftSchedule("production-drift-detection", {
    organization: "my-org",
    project: "core-infrastructure",
    stack: "production",

    // Run every 4 hours
    scheduleCron: "0 */4 * * *",

    // Automatically fix any drift found
    autoRemediate: true,
});

export const scheduleId = driftSchedule.scheduleId;
```
229-
230-
{{% /choosable %}}
231-
232-
{{% choosable language python %}}
233-
234-
```python
import pulumi
import pulumi_pulumiservice as pulumiservice

drift_schedule = pulumiservice.DriftSchedule("production-drift-detection",
    organization="my-org",
    project="core-infrastructure",
    stack="production",

    # Run every 4 hours
    schedule_cron="0 */4 * * *",

    # Automatically fix any drift found
    auto_remediate=True
)

pulumi.export('schedule_id', drift_schedule.schedule_id)
```
252-
253-
{{% /choosable %}}
254-
255-
{{% choosable language go %}}
256-
257-
```go
package main

import (
	"github.com/pulumi/pulumi-pulumiservice/sdk/go/pulumiservice"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		driftSchedule, err := pulumiservice.NewDriftSchedule(ctx, "production-drift-detection", &pulumiservice.DriftScheduleArgs{
			Organization: pulumi.String("my-org"),
			Project:      pulumi.String("core-infrastructure"),
			Stack:        pulumi.String("production"),

			// Run every 4 hours
			ScheduleCron: pulumi.String("0 */4 * * *"),

			// Automatically fix any drift found
			AutoRemediate: pulumi.Bool(true),
		})
		if err != nil {
			return err
		}

		ctx.Export("scheduleId", driftSchedule.ScheduleId)
		return nil
	})
}
```
287-
288-
{{% /choosable %}}
289-
290-
{{% choosable language csharp %}}
291-
292-
```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Pulumi;
using PulumiService = Pulumi.PulumiService;

class Program
{
    static Task<int> Main() => Deployment.RunAsync(() =>
    {
        var driftSchedule = new PulumiService.DriftSchedule("production-drift-detection", new PulumiService.DriftScheduleArgs
        {
            Organization = "my-org",
            Project = "core-infrastructure",
            Stack = "production",

            // Run every 4 hours
            ScheduleCron = "0 */4 * * *",

            // Automatically fix any drift found
            AutoRemediate = true,
        });

        return new Dictionary<string, object?>
        {
            { "scheduleId", driftSchedule.ScheduleId }
        };
    });
}
```
319-
320-
{{% /choosable %}}
321-
322-
{{% choosable language java %}}
323-
324-
```java
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.pulumiservice.DriftSchedule;
import com.pulumi.pulumiservice.DriftScheduleArgs;

public class App {
    public static void main(String[] args) {
        Pulumi.run(App::stack);
    }

    private static void stack(Context ctx) {
        var driftSchedule = new DriftSchedule("production-drift-detection", DriftScheduleArgs.builder()
            .organization("my-org")
            .project("core-infrastructure")
            .stack("production")

            // Run every 4 hours
            .scheduleCron("0 */4 * * *")

            // Automatically fix any drift found
            .autoRemediate(true)
            .build());

        ctx.export("scheduleId", driftSchedule.scheduleId());
    }
}
```
352-
353-
{{% /choosable %}}
354-
355-
{{% choosable language yaml %}}
356-
357-
```yaml
name: drift-detection
runtime: yaml

resources:
  production-drift-detection:
    type: pulumiservice:index:DriftSchedule
    properties:
      organization: my-org
      project: core-infrastructure
      stack: production

      # Run every 4 hours
      scheduleCron: "0 */4 * * *"

      # Automatically fix any drift found
      autoRemediate: true

outputs:
  scheduleId: ${production-drift-detection.scheduleId}
```
378-
379-
{{% /choosable %}}
380-
381-
{{< /chooser >}}
382-
383-
## Integrating with CI/CD Pipelines
384-
385-
While Pulumi Deployments provides excellent built-in scheduling, many teams prefer to integrate drift detection into their existing CI/CD pipelines. This approach gives you more control over the detection process and integrates naturally with your existing workflows.
386-
387-
### GitHub Actions Integration
388-
389-
For teams using GitHub Actions, adding drift detection to your [existing workflows](/docs/iac/packages-and-automation/continuous-delivery/github-actions) is straightforward:
390-
391-
```yaml
name: Drift Detection

on:
  schedule:
    # Run every 4 hours
    - cron: '0 */4 * * *'
  workflow_dispatch: # Allow manual triggers

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure Pulumi
        uses: pulumi/actions@v5
        with:
          cloud-url: ${{ secrets.PULUMI_CLOUD_URL }}

      - name: Detect Drift
        # --expect-no-changes makes the step exit non-zero when drift is found
        run: |
          pulumi refresh --stack production --preview-only --expect-no-changes
        env:
          PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}

      - name: Notify on Drift
        if: failure()
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        # Post to a Slack incoming webhook with curl
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text":"⚠️ Drift detected in production infrastructure"}'
```
424-
425-
This workflow runs every four hours and can also be triggered manually. When the `Detect Drift` step fails, the `Notify on Drift` step posts a message to your Slack channel so your team knows right away.
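To make the schedule concrete, here is a minimal Python sketch that expands the hour field of `0 */4 * * *` into the times it matches. The helper `expand_step_field` is hypothetical and handles only the `*/n` step form; it is an illustration, not a cron parser. Keep in mind that GitHub Actions evaluates scheduled workflows in UTC.

```python
def expand_step_field(field: str, upper: int) -> list[int]:
    """Expand a cron field of the form '*/n' into the values it matches in [0, upper)."""
    if not field.startswith("*/"):
        raise ValueError(f"only '*/n' fields are supported, got {field!r}")
    step = int(field[2:])
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, upper, step))


# Fields are: minute, hour, day of month, month, day of week.
minute, hour, *_ = "0 */4 * * *".split()
run_hours = expand_step_field(hour, 24)
run_times = [f"{h:02d}:{int(minute):02d}" for h in run_hours]
print(run_times)
# → ['00:00', '04:00', '08:00', '12:00', '16:00', '20:00']
```

Six runs a day means drift introduced just after a check can sit for up to four hours before the next one sees it, which is worth weighing when you pick the interval.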
426-
427-
### GitLab CI Integration
428-
429-
GitLab users can achieve similar functionality with [GitLab CI](/docs/iac/packages-and-automation/continuous-delivery/gitlab-ci):
430-
431-
```yaml
drift-detection:
  stage: monitor
  image: pulumi/pulumi:latest

  script:
    # --expect-no-changes makes the job fail when drift is found
    - pulumi refresh --stack production --preview-only --expect-no-changes

  only:
    - schedules # Run on schedule only

  variables:
    PULUMI_ACCESS_TOKEN: $PULUMI_ACCESS_TOKEN

  after_script:
    - |
      if [ "$CI_JOB_STATUS" = "failed" ]; then
        curl -X POST "$SLACK_WEBHOOK" \
          -H 'Content-Type: application/json' \
          -d '{"text":"⚠️ Drift detected in production"}'
      fi
```
453-
454-
The GitLab configuration uses scheduled pipelines to run drift detection and sends a notification when the job fails. The `after_script` section runs whether or not the job succeeds, so the `CI_JOB_STATUS` check is what restricts the Slack message to failed runs.
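Both pipelines come down to the exit code of `pulumi refresh`. If you would rather drive the check from a script, here is a minimal Python sketch of that logic. The function names are hypothetical, and it assumes the Pulumi CLI is installed and authenticated on the runner. It relies on `--expect-no-changes`, which makes the command return an error when the refresh finds changes.

```python
import subprocess
from typing import Callable, Sequence


def refresh_command(stack: str) -> list[str]:
    """Build a read-only drift check: preview the refresh, fail on any change."""
    return [
        "pulumi", "refresh",
        "--stack", stack,
        "--preview-only",        # do not write anything to the stack's state
        "--expect-no-changes",   # exit non-zero if the refresh finds changes
        "--non-interactive",     # never prompt in CI
    ]


def _run(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(cmd).returncode


def check_drift(stack: str, run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Return True when the drift check exits non-zero.

    A non-zero exit can also mean the command itself failed (bad credentials,
    unknown stack), so treat True as "needs attention", not proof of drift.
    """
    return run(refresh_command(stack)) != 0
```

Passing `run` as a parameter keeps the check testable without touching real infrastructure: `check_drift("production", run=lambda cmd: 1)` simulates a run that found drift.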
455-
456199
## Handling Common Drift Scenarios
457200

458201
Every organization faces similar drift scenarios. Understanding how to handle these common cases will prepare you for real-world operations.
@@ -518,23 +261,17 @@ By externalizing configuration values, you make them easier to update without co
518261

519262
## The Value of Automated Drift Detection
520263

521-
After implementing drift detection for hundreds of stacks, the value becomes undeniable. Organizations consistently report four major areas of improvement.
522-
523-
### 1. Security Compliance
524-
525-
Security teams sleep better knowing that unauthorized changes are detected within minutes, not discovered weeks later during an audit. Compliance becomes automatic rather than a quarterly fire drill. Every infrastructure change leaves an audit trail, making compliance reporting straightforward and comprehensive.
264+
After implementing drift detection for hundreds of stacks, patterns emerge that tell a compelling story about its value. The transformation begins subtly but grows profound as teams discover what they've been missing.
526265

527-
### 2. Operational Excellence
266+
The first change most organizations notice is in their security posture. Unauthorized changes that once went undetected for weeks, surfacing only during quarterly audits or, worse, during incident investigations, are now caught within minutes. Security teams who previously dreaded audit season find themselves with continuous compliance data at their fingertips. Every change, whether legitimate or not, creates an audit trail that turns compliance reporting from a last-minute scramble into a routine export of already-collected data.
528267

529-
Incidents caused by configuration drift drop dramatically. When issues do occur, MTTR improves because teams can quickly check whether drift contributed to the problem. Change management transforms from a bureaucratic process to an automated system with complete visibility into what changed, when, and why.
268+
As teams settle into this new reality, operational improvements become apparent. Incidents that used to take hours to diagnose now resolve in minutes because engineers can immediately check whether drift contributed to the problem. One team reported that their mean time to recovery dropped by 40% simply because they could eliminate drift as a cause within seconds rather than spending hours manually comparing configurations. Change management, once a bureaucratic nightmare of spreadsheets and approval chains, transforms into an automated system where every modification is tracked, timestamped, and traceable to its source.
530269

531-
### 3. Cost Optimization
270+
Then come the unexpected discoveries. Drift detection becomes a cost optimization tool that nobody anticipated. It uncovers orphaned resources spinning away forgotten in some corner of your cloud account, created during a debugging session six months ago. It catches that instance someone manually upgraded to a larger size during a performance investigation and forgot to downsize. One organization discovered they were spending $15,000 monthly on resources that weren't even supposed to exist, all caught by their drift detection system in its first week of operation.
532271

533-
Drift detection regularly uncovers cost savings. Orphaned resources created outside IaC get identified and cleaned up. Oversized instances from manual scaling get right-sized. Resource sprawl from untracked changes gets prevented before it impacts your cloud bill.
272+
Perhaps most importantly, drift detection changes how teams work. Engineers stop wasting time on manual infrastructure audits. Debugging accelerates when you have clear reports showing exactly what changed and when. The constant anxiety about whether production matches your code disappears, replaced by confidence that any discrepancies will be caught and reported immediately. Teams report feeling more empowered to move quickly, knowing they have a safety net that will catch configuration drift before it causes problems.
534273

535-
### 4. Team Productivity
536-
537-
Engineers stop wasting time on manual infrastructure checks. Debugging becomes faster with clear drift reports showing exactly what changed. Teams build confidence in their infrastructure state, knowing that drift detection has their back.
274+
The cumulative effect is transformative. Organizations running drift detection report not just fewer incidents and lower costs, but a fundamental shift in how they think about infrastructure management. It becomes proactive rather than reactive, confident rather than anxious, automated rather than manual. What starts as a simple scheduled check evolves into a cornerstone of operational excellence.
538275

539276
## Getting Started with Drift Detection
540277

@@ -560,20 +297,6 @@ Teams with robust day 2 operations move faster because they have confidence in t
560297

561298
The math is compelling. A single undetected security group change could lead to a breach costing millions. One overlooked configuration drift might cause hours of downtime. Yet implementing comprehensive drift detection takes just hours of setup and minutes of ongoing maintenance. The ROI is immediate and substantial.
562299

563-
## Next Steps
564-
565-
Your journey to comprehensive drift detection starts with these resources:
566-
567-
### 📚 Essential Documentation
568-
569-
Dive deep into [Pulumi Deployments Drift Detection](/docs/pulumi-cloud/deployments/drift) for complete configuration options. Master the [Pulumi Refresh Command](/docs/iac/cli/commands/pulumi_refresh) that powers all drift detection. Explore the [Pulumi Service Provider](/registry/packages/pulumiservice) for managing drift detection as code. For advanced scenarios, the [Automation API Guide](/docs/using-pulumi/automation-api) shows how to build custom drift workflows.
570-
571-
### 🎓 Hands-On Learning
572-
573-
The [Drift Detection Tutorial](/tutorials/drift-detection-and-remediation) provides step-by-step guidance for implementing drift detection in your environment.
574-
575-
## Coming Next in the Series
576-
577300
Our IDP journey continues with **"Extend Your IDP for AI Applications: GPUs, Models, and Cost Controls"**. As AI workloads become central to modern applications, we'll explore how to adapt your platform for machine learning workflows. You'll learn about GPU orchestration, model deployment pipelines, and the unique cost management challenges that AI infrastructure presents.
578301

579302
But don't wait for the next post to get started with drift detection. Even a simple hourly check can prevent major incidents. Set up basic detection today, and your future self will thank you the next time you're debugging a production issue at 3 AM. More importantly, your on-call team will appreciate not having to debug issues caused by accumulated drift that could have been prevented.
