Skip to content

Incident Response: Multi-System Log Correlation

Correlate logs across multiple systems during incident response to identify which errors are widespread vs. environment-specific, helping prioritize remediation efforts.

The Problem

During incidents, errors appear across multiple environments:

  • Hard to distinguish - Which errors are prod-specific vs. widespread?
  • Redundant investigation - Teams investigate same error in multiple environments
  • Poor prioritization - Don't know which issues to fix first
  • Context missing - Can't tell if dev errors are related to prod incident

Input Data

production.log
2024-01-15 14:30:00 [INFO] System startup completed
2024-01-15 14:30:15 [INFO] Database connection pool initialized
2024-01-15 14:30:30 [ERROR] Authentication service unreachable: Connection refused
2024-01-15 14:30:45 [WARN] Falling back to local authentication cache
2024-01-15 14:31:00 [ERROR] Authentication service unreachable: Connection refused
2024-01-15 14:31:15 [ERROR] User login failed: timeout waiting for auth response
2024-01-15 14:31:30 [ERROR] Authentication service unreachable: Connection refused
2024-01-15 14:31:45 [ERROR] User login failed: timeout waiting for auth response
2024-01-15 14:32:00 [CRITICAL] API gateway returning 503 Service Unavailable
2024-01-15 14:32:15 [ERROR] Authentication service unreachable: Connection refused

Production server log with 10 entries, including errors: - Authentication service unreachable (4×) - User login failures (2×) - API gateway 503 error (1×)

dev.log
2024-01-15 14:30:00 [INFO] System startup completed
2024-01-15 14:30:10 [DEBUG] Loading development configuration
2024-01-15 14:30:15 [INFO] Database connection pool initialized
2024-01-15 14:30:20 [DEBUG] Mock authentication service enabled
2024-01-15 14:30:30 [ERROR] Authentication service unreachable: Connection refused
2024-01-15 14:30:35 [DEBUG] Test data loader executing
2024-01-15 14:30:40 [WARN] Deprecated API endpoint /v1/users accessed
2024-01-15 14:30:45 [WARN] Falling back to local authentication cache
2024-01-15 14:30:50 [ERROR] Missing environment variable: STRIPE_API_KEY
2024-01-15 14:31:00 [ERROR] Authentication service unreachable: Connection refused
2024-01-15 14:31:05 [ERROR] Configuration validation failed: invalid webhook URL
2024-01-15 14:31:15 [ERROR] User login failed: timeout waiting for auth response
2024-01-15 14:31:20 [WARN] Hot reload triggered by file change: app.py
2024-01-15 14:31:30 [ERROR] Authentication service unreachable: Connection refused

Development server log with 14 entries, including: - Shared errors (also in production) - Dev-specific errors (configuration, environment) - Debug logging (not in production)

Output Data

expected-dev-only.log (Dev-Specific Errors)
2024-01-15 14:30:10 [DEBUG] Loading development configuration
2024-01-15 14:30:20 [DEBUG] Mock authentication service enabled
2024-01-15 14:30:35 [DEBUG] Test data loader executing
2024-01-15 14:30:40 [WARN] Deprecated API endpoint /v1/users accessed
2024-01-15 14:30:50 [ERROR] Missing environment variable: STRIPE_API_KEY
2024-01-15 14:31:05 [ERROR] Configuration validation failed: invalid webhook URL
2024-01-15 14:31:20 [WARN] Hot reload triggered by file change: app.py

7 patterns unique to dev (not seen in production): - Debug logging - Missing environment variables - Configuration validation errors - Hot reload warnings

expected-common.log (Cross-Environment Errors)
2024-01-15 14:30:00 [INFO] System startup completed
2024-01-15 14:30:15 [INFO] Database connection pool initialized
2024-01-15 14:30:30 [ERROR] Authentication service unreachable: Connection refused
2024-01-15 14:30:45 [WARN] Falling back to local authentication cache
2024-01-15 14:31:15 [ERROR] User login failed: timeout waiting for auth response

5 patterns common to both (production incident also affecting dev): - Authentication service unreachable - User login failures - Fallback to local cache

Solution

$ uniqseq production.log \
    --skip-chars 20 \
    --window-size 1 \
    --library-dir ./prod-patterns \
    --quiet > /dev/null

Result: Unique error patterns from production saved to ./prod-patterns

$ uniqseq dev.log \
    --skip-chars 20 \
    --window-size 1 \
    --read-sequences ./prod-patterns \
    --annotate | \
    grep -v "DUPLICATE" | \
    grep "^2024" > dev-only.log

Result: Errors in dev that were NOT seen in production

$ uniqseq dev.log \
    --skip-chars 20 \
    --window-size 1 \
    --read-sequences ./prod-patterns \
    --annotate | \
    grep "DUPLICATE" | \
    grep -v "^\[" > annotations.txt

$ # Extract the line numbers and pull those lines from dev.log

Result: Errors appearing in both environments

from uniqseq import UniqSeq

# Step 1: Extract unique production patterns into a set
prod_patterns = set()
prod_uniqseq = UniqSeq(
    skip_chars=20,
    window_size=1,
)

with open("production.log") as f:
    for line in f:
        # Capture each emitted line's hash (after skip_chars)
        line_stripped = line.rstrip("\n")
        # Compute hash of the relevant part (after skip_chars)
        import hashlib
        if len(line_stripped) > 20:
            relevant_part = line_stripped[20:]
        else:
            relevant_part = line_stripped
        line_hash = hashlib.md5(relevant_part.encode()).hexdigest()
        prod_patterns.add(line_hash)

# Step 2: Process dev log, outputting only lines NOT in production
with open("dev.log") as f:
    with open("output.log", "w") as out:
        for line in f:
            line_stripped = line.rstrip("\n")
            # Compute hash of relevant part
            if len(line_stripped) > 20:
                relevant_part = line_stripped[20:]
            else:
                relevant_part = line_stripped
            line_hash = hashlib.md5(relevant_part.encode()).hexdigest()

            # Only output if this pattern was NOT in production
            if line_hash not in prod_patterns:
                out.write(line_stripped + '\n')

How It Works

Pattern libraries enable cross-system correlation:

Step 1: Extract production patterns
production.log → uniqseq → ./prod-patterns/
  - Authentication service unreachable
  - User login failed
  - API gateway 503
  - Falling back to local cache
  - (6 total unique patterns)

Step 2: Compare dev against production patterns
dev.log + ./prod-patterns → identify dev-only issues

Patterns in both:
  ✓ Authentication service unreachable
  ✓ User login failed
  ✓ Falling back to local cache

Dev-only patterns:
  ! Loading development configuration
  ! Mock authentication service enabled
  ! Missing environment variable: STRIPE_API_KEY
  ! Configuration validation failed
  (7 total dev-specific issues)

Real-World Workflows

Incident Triage Workflow

#!/bin/bash
# Rapid incident correlation across environments

PROD_LOG="/var/log/production/app.log"
STAGE_LOG="/var/log/staging/app.log"
DEV_LOG="/var/log/dev/app.log"

# Extract production error patterns
echo "Extracting production error patterns..."
grep ERROR "$PROD_LOG" | \
    uniqseq --skip-chars 20 --window-size 1 \
        --library-dir ./prod-errors --quiet > /dev/null

# Check staging
echo "Checking staging environment..."
grep ERROR "$STAGE_LOG" | \
    uniqseq --skip-chars 20 --window-size 1 \
        --read-sequences ./prod-errors \
        --annotate | \
    grep -c "DUPLICATE" | \
    xargs -I {} echo "  Staging has {} errors also in production"

# Check dev
echo "Checking dev environment..."
grep ERROR "$DEV_LOG" | \
    uniqseq --skip-chars 20 --window-size 1 \
        --read-sequences ./prod-errors \
        --annotate | \
    grep -v "DUPLICATE" | \
    grep "^2024" | \
    wc -l | \
    xargs -I {} echo "  Dev has {} unique errors (not in production)"

Output:

Extracting production error patterns...
Checking staging environment...
  Staging has 8 errors also in production
Checking dev environment...
  Dev has 7 unique errors (not in production)

Conclusion: Incident is affecting production and staging, but dev has separate issues.

Multi-Region Correlation

#!/bin/bash
# Compare errors across geographic regions

# Extract US-East patterns
grep ERROR /var/log/us-east/app.log | \
    uniqseq --skip-chars 20 --window-size 1 \
        --library-dir ./us-east-patterns --quiet > /dev/null

# Compare other regions
for region in us-west eu-west ap-southeast; do
    echo "=== $region ==="
    grep ERROR /var/log/$region/app.log | \
        uniqseq --skip-chars 20 --window-size 1 \
            --read-sequences ./us-east-patterns \
            --annotate | \
        grep -v "DUPLICATE" | \
        grep "^2024" > ${region}-unique-errors.log

    echo "  Unique errors: $(wc -l < ${region}-unique-errors.log)"
done

Identifies region-specific issues vs. global outages.

Service Dependency Analysis

#!/bin/bash
# Determine which service is the root cause

# Extract patterns from suspected root cause (auth service)
grep ERROR /var/log/auth-service.log | \
    uniqseq --skip-chars 20 --window-size 1 \
        --library-dir ./auth-errors --quiet > /dev/null

# Check which downstream services show the same errors
for service in api gateway worker; do
    echo "=== $service ==="
    grep ERROR /var/log/${service}.log | \
        uniqseq --skip-chars 20 --window-size 1 \
            --read-sequences ./auth-errors \
            --annotate | \
        grep "DUPLICATE" | \
        wc -l | \
        xargs -I {} echo "  {} errors correlate with auth service"
done

Output:

=== api ===
  45 errors correlate with auth service
=== gateway ===
  67 errors correlate with auth service
=== worker ===
  0 errors correlate with auth service

Conclusion: Auth service is root cause, impacting API and Gateway but not Worker.

Time-Window Correlation

#!/bin/bash
# Compare errors during specific incident window

INCIDENT_START="2024-01-15 14:30:00"
INCIDENT_END="2024-01-15 14:35:00"

# Extract production errors during incident
sed -n "/$INCIDENT_START/,/$INCIDENT_END/p" production.log | \
    grep ERROR | \
    uniqseq --skip-chars 20 --window-size 1 \
        --library-dir ./incident-patterns --quiet > /dev/null

# Check if dev showed same issues
echo "Dev environment during incident:"
sed -n "/$INCIDENT_START/,/$INCIDENT_END/p" dev.log | \
    grep ERROR | \
    uniqseq --skip-chars 20 --window-size 1 \
        --read-sequences ./incident-patterns \
        --annotate | \
    grep "DUPLICATE" | \
    wc -l | \
    xargs -I {} echo "  {} errors match production incident"

Helps determine if incident was environment-specific or systemic.

Advanced Patterns

Multi-Stage Filtering

# Stage 1: Extract critical production errors
grep -E "ERROR|CRITICAL" production.log | \
    grep -v "HealthCheck" | \
    uniqseq --skip-chars 20 --window-size 1 \
        --library-dir ./critical-prod --quiet > /dev/null

# Stage 2: Find matching errors in dev
grep -E "ERROR|CRITICAL" dev.log | \
    uniqseq --skip-chars 20 --window-size 1 \
        --read-sequences ./critical-prod \
        --annotate

# Stage 3: Alert if critical prod errors appear in dev

Baseline Comparison

#!/bin/bash
# Compare current logs against known-good baseline

# Create baseline from last week (before incident)
uniqseq baseline-week.log --skip-chars 20 --window-size 1 \
    --library-dir ./known-good --quiet > /dev/null

# Compare current logs
uniqseq current.log --skip-chars 20 --window-size 1 \
    --read-sequences ./known-good \
    --annotate | \
    grep -v "DUPLICATE" | \
    grep "^2024" > new-error-patterns.log

# These are errors NOT seen in baseline
echo "New error patterns since baseline:"
cat new-error-patterns.log

Normalize Before Comparison

# Remove variable data before correlation
uniqseq production.log \
    --hash-transform 'sed -E "s/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z//g"' \
    --skip-chars 20 \
    --window-size 1 \
    --library-dir ./prod-normalized --quiet > /dev/null

# Now dev comparison ignores timestamps
uniqseq dev.log \
    --hash-transform 'sed -E "s/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z//g"' \
    --skip-chars 20 \
    --window-size 1 \
    --read-sequences ./prod-normalized \
    --annotate

Groups errors with different timestamps but identical messages.

Integration with Incident Response Tools

PagerDuty Alert Correlation

#!/bin/bash
# Correlate PagerDuty alerts with log patterns

# Export PagerDuty incident logs
pd incident logs --id $INCIDENT_ID > pd-errors.log

# Extract patterns from production
uniqseq production.log --skip-chars 20 --window-size 1 \
    --library-dir ./prod-patterns --quiet > /dev/null

# Check which PagerDuty errors match production
uniqseq pd-errors.log --skip-chars 20 --window-size 1 \
    --read-sequences ./prod-patterns \
    --annotate | \
    grep "DUPLICATE" | \
    wc -l | \
    xargs -I {} echo "PagerDuty: {} alerts correlate with production logs"

Slack Incident Channel

#!/bin/bash
# Post correlation results to Slack incident channel

DEV_UNIQUE=$(uniqseq dev.log --skip-chars 20 --window-size 1 \
    --read-sequences ./prod-patterns --annotate | \
    grep -v "DUPLICATE" | grep "^2024" | wc -l)

curl -X POST https://hooks.slack.com/... -d "{
  \"text\": \"🔍 Log Correlation Results\",
  \"attachments\": [{
    \"text\": \"Found $DEV_UNIQUE dev-specific errors not in production.\",
    \"color\": \"warning\"
  }]
}"

Grafana Dashboard

# Generate metrics for Grafana
PROD_UNIQUE=$(uniqseq production.log --skip-chars 20 --stats-format json \
    --quiet 2>&1 | jq '.statistics.lines.emitted')

DEV_UNIQUE=$(uniqseq dev.log --skip-chars 20 --read-sequences ./prod-patterns \
    --annotate | grep -v "DUPLICATE" | grep "^2024" | wc -l)

# Push to Prometheus
PUSHGATEWAY="http://pushgateway:9091"
METRICS_JOB="metrics/job/log_correlation"
cat <<EOF | curl --data-binary @- ${PUSHGATEWAY}/${METRICS_JOB}
incident_unique_errors{env="production"} ${PROD_UNIQUE}
incident_unique_errors{env="dev"} ${DEV_UNIQUE}
incident_correlation_overlap{env="dev"} $((PROD_UNIQUE - DEV_UNIQUE))
EOF

When to Use This

Good for: - ✅ Multi-environment incident response - ✅ Root cause analysis across services - ✅ Determining blast radius of outages - ✅ Prioritizing fixes (widespread vs. isolated) - ✅ Baseline comparison (current vs. known-good)

Not ideal for: - ❌ Single-environment debugging - ❌ Real-time streaming (use direct filtering instead) - ❌ Logs with no common patterns - ❌ Highly variable error messages

See Also