Production Monitoring: Long-Running Job Analysis

Monitor batch job progress and clean up verbose logs from long-running ETL processes, data migrations, or scheduled tasks using statistics tracking.

The Problem

Long-running batch jobs produce verbose logs with repeated progress messages:

Repetitive progress updates - "Processing batch N" appears hundreds of times
Hard to spot errors - Important warnings buried in progress noise
Log file bloat - Gigabytes of logs for multi-hour jobs
No summary metrics - Manual counting to understand job completion

Input Data

batch-job.log

2024-01-15 10:00:00 [INFO] Starting batch job: process_daily_transactions
2024-01-15 10:00:01 [INFO] Loading configuration from /etc/batch/config.json
2024-01-15 10:00:05 [INFO] Connected to database: postgres://prod-db:5432/transactions
2024-01-15 10:00:10 [INFO] Processing batch 1/100 (1000 records)
2024-01-15 10:00:15 [INFO] Processed 1000 records (1% complete)
2024-01-15 10:00:20 [INFO] Processing batch 2/100 (1000 records)
2024-01-15 10:00:25 [INFO] Processed 2000 records (2% complete)
2024-01-15 10:00:30 [INFO] Processing batch 3/100 (1000 records)
2024-01-15 10:00:35 [INFO] Processed 3000 records (3% complete)
2024-01-15 10:00:40 [INFO] Processing batch 4/100 (1000 records)
2024-01-15 10:00:45 [INFO] Processed 4000 records (4% complete)
2024-01-15 10:00:50 [INFO] Processing batch 5/100 (1000 records)
2024-01-15 10:00:55 [INFO] Processed 5000 records (5% complete)
2024-01-15 10:01:00 [WARN] Slow query detected: transaction_summary took 2.5s
2024-01-15 10:01:05 [INFO] Processing batch 6/100 (1000 records)
2024-01-15 10:01:10 [INFO] Processed 6000 records (6% complete)
2024-01-15 10:01:15 [INFO] Processing batch 7/100 (1000 records)
2024-01-15 10:01:20 [INFO] Processed 7000 records (7% complete)
2024-01-15 10:01:25 [WARN] Slow query detected: transaction_summary took 2.5s
2024-01-15 10:01:30 [INFO] Processing batch 8/100 (1000 records)
2024-01-15 10:01:35 [INFO] Processed 8000 records (8% complete)
2024-01-15 10:01:40 [INFO] Processing batch 9/100 (1000 records)
2024-01-15 10:01:45 [INFO] Processed 9000 records (9% complete)
2024-01-15 10:01:50 [INFO] Processing batch 10/100 (1000 records)
2024-01-15 10:01:55 [INFO] Processed 10000 records (10% complete)
2024-01-15 10:02:00 [INFO] Checkpoint: Committed 10000 records to database
2024-01-15 10:02:05 [INFO] Processing batch 11/100 (1000 records)
2024-01-15 10:02:10 [INFO] Processed 11000 records (11% complete)
2024-01-15 10:02:15 [INFO] Processing batch 12/100 (1000 records)
2024-01-15 10:02:20 [INFO] Processed 12000 records (12% complete)

Batch job log with 30 lines covering 12 batches:

"Processing batch N/100" appears 12×
"Processed N records" appears 12×
"Slow query detected" appears 2× (lines 14, 19)
Other: startup, connection, checkpoint messages

Output Data

expected-clean.log

2024-01-15 10:00:00 [INFO] Starting batch job: process_daily_transactions
2024-01-15 10:00:01 [INFO] Loading configuration from /etc/batch/config.json
2024-01-15 10:00:05 [INFO] Connected to database: postgres://prod-db:5432/transactions
2024-01-15 10:00:10 [INFO] Processing batch 1/100 (1000 records)
2024-01-15 10:00:15 [INFO] Processed 1000 records (1% complete)
2024-01-15 10:00:20 [INFO] Processing batch 2/100 (1000 records)
2024-01-15 10:00:25 [INFO] Processed 2000 records (2% complete)
2024-01-15 10:00:30 [INFO] Processing batch 3/100 (1000 records)
2024-01-15 10:00:35 [INFO] Processed 3000 records (3% complete)
2024-01-15 10:00:40 [INFO] Processing batch 4/100 (1000 records)
2024-01-15 10:00:45 [INFO] Processed 4000 records (4% complete)
2024-01-15 10:00:50 [INFO] Processing batch 5/100 (1000 records)
2024-01-15 10:00:55 [INFO] Processed 5000 records (5% complete)
2024-01-15 10:01:00 [WARN] Slow query detected: transaction_summary took 2.5s
2024-01-15 10:01:05 [INFO] Processing batch 6/100 (1000 records)
2024-01-15 10:01:10 [INFO] Processed 6000 records (6% complete)
2024-01-15 10:01:15 [INFO] Processing batch 7/100 (1000 records)
2024-01-15 10:01:20 [INFO] Processed 7000 records (7% complete)
2024-01-15 10:01:30 [INFO] Processing batch 8/100 (1000 records)
2024-01-15 10:01:35 [INFO] Processed 8000 records (8% complete)
2024-01-15 10:01:40 [INFO] Processing batch 9/100 (1000 records)
2024-01-15 10:01:45 [INFO] Processed 9000 records (9% complete)
2024-01-15 10:01:50 [INFO] Processing batch 10/100 (1000 records)
2024-01-15 10:01:55 [INFO] Processed 10000 records (10% complete)
2024-01-15 10:02:00 [INFO] Checkpoint: Committed 10000 records to database
2024-01-15 10:02:05 [INFO] Processing batch 11/100 (1000 records)
2024-01-15 10:02:10 [INFO] Processed 11000 records (11% complete)
2024-01-15 10:02:15 [INFO] Processing batch 12/100 (1000 records)
2024-01-15 10:02:20 [INFO] Processed 12000 records (12% complete)

Result: 29 lines (1 duplicate "Slow query" warning removed)

The repeated progress messages have different content (batch numbers, record counts, percentages), so they're not exact duplicates when ignoring timestamps.

Solution

CLI (Clean Log)CLI (Track Progress)Python (With Stats)

uniqseq batch-job.log \ --skip-chars 20 \ --window-size 1 \ --quiet > clean-job.log

Options:

--skip-chars 20: Skip timestamp prefix
--window-size 1: Deduplicate individual lines
--quiet: Suppress statistics (only output clean log)

$ uniqseq batch-job.log \
    --skip-chars 20 \
    --window-size 1 \
    --stats-format json \
    > clean-job.log \
    2> job-stats.json

Result: Clean log to stdout, metrics to stderr

Extract specific metrics with jq:

$ jq '.statistics.redundancy_pct' job-stats.json
3.3

$ jq '.statistics.lines' job-stats.json
{
  "total": 30,
  "emitted": 29,
  "skipped": 1
}

from uniqseq import UniqSeq
import json

uniqseq = UniqSeq(
    skip_chars=20,    # (1)!
    window_size=1,    # (2)!
)

# Process log file
with open("batch-job.log") as f:
    with open("output.log", "w") as out:
        for line in f:
            uniqseq.process_line(line.rstrip("\n"), out)
        uniqseq.flush_to_stream(out)

# Get statistics programmatically
stats = uniqseq.get_stats()  # (3)!

# Save metrics
with open("job-stats.json", "w") as f:
    json.dump(stats, f, indent=2)

Skip 20-character timestamp prefix
Deduplicate individual log lines
Access statistics from uniqseq object

How It Works

Deduplication removes exact duplicate messages while preserving unique progress updates:

Before (30 lines):
2024-01-15 10:00:10 [INFO] Processing batch 1/100 (1000 records)
2024-01-15 10:00:15 [INFO] Processed 1000 records (1% complete)
2024-01-15 10:00:20 [INFO] Processing batch 2/100 (1000 records)
2024-01-15 10:00:25 [INFO] Processed 2000 records (2% complete)
2024-01-15 10:01:00 [WARN] Slow query detected: transaction_summary took 2.5s
...
2024-01-15 10:01:25 [WARN] Slow query detected: transaction_summary \
    took 2.5s  ← duplicate

After (29 lines):
Same content, but duplicate warning removed

Real-World Workflows

Monitor Job Progress in Real-Time

Watch a running job and see only new, unique log entries:

# Follow log file, show only new unique patterns
tail -f /var/log/batch-job.log | \
    uniqseq --skip-chars 20 --window-size 1

Only new unique messages appear (repeated "Processing batch N" with different N will show, exact duplicates won't).

Extract Job Statistics

Monitor deduplication metrics to understand log verbosity:

#!/bin/bash
# Analyze job logs after completion

JOB_LOG="/var/log/etl-job-$(date +%Y%m%d).log"

uniqseq "$JOB_LOG" --skip-chars 20 --window-size 1 \
    --stats-format json \
    > "/var/clean-logs/$(basename $JOB_LOG)" \
    2> "/var/metrics/$(basename $JOB_LOG).json"

# Extract metrics
METRICS_FILE="/var/metrics/$(basename $JOB_LOG).json"
REDUNDANCY=$(jq '.statistics.redundancy_pct' "$METRICS_FILE")
echo "Log redundancy: ${REDUNDANCY}%"

# Alert if excessive duplication
if (( $(echo "$REDUNDANCY > 20" | bc -l) )); then
    echo "WARNING: High log duplication - consider reducing log verbosity"
fi

Daily Batch Job Cleanup

Automatically clean and archive batch job logs:

#!/bin/bash
# Daily log cleanup for batch jobs

LOG_DIR="/var/log/batch-jobs"
ARCHIVE_DIR="/var/archive/clean-logs"
STATS_DIR="/var/metrics/batch-stats"

for log in ${LOG_DIR}/*.log; do
    basename=$(basename "$log")

    # Deduplicate and capture stats
    uniqseq "$log" --skip-chars 20 --window-size 1 \
        --stats-format json \
        > "${ARCHIVE_DIR}/${basename}" \
        2> "${STATS_DIR}/${basename}.json"

    # Extract key metrics
    total=$(jq '.statistics.lines.total' "${STATS_DIR}/${basename}.json")
    emitted=$(jq '.statistics.lines.emitted' "${STATS_DIR}/${basename}.json")

    echo "${basename}: ${total} → ${emitted} lines"
done

# Compress original logs
gzip ${LOG_DIR}/*.log

Compare Job Runs

Track deduplication metrics over time to detect logging changes:

#!/bin/bash
# Compare batch job patterns week-over-week

# Process week 1 logs
uniqseq week1-job.log --stats-format json --quiet \
    > /dev/null 2> week1-stats.json

# Process week 2 logs
uniqseq week2-job.log --stats-format json --quiet \
    > /dev/null 2> week2-stats.json

# Compare redundancy
W1_RED=$(jq '.statistics.redundancy_pct' week1-stats.json)
W2_RED=$(jq '.statistics.redundancy_pct' week2-stats.json)

echo "Week 1 redundancy: ${W1_RED}%"
echo "Week 2 redundancy: ${W2_RED}%"

# Alert if significant change
DIFF=$(echo "$W2_RED - $W1_RED" | bc)
if (( $(echo "$DIFF > 10" | bc -l) )); then
    echo "⚠️  Redundancy increased by ${DIFF}% - check for logging loops"
fi

Integration with Monitoring

Push deduplication metrics to monitoring systems:

#!/bin/bash
# Send batch job metrics to Prometheus

JOB_NAME="daily_etl"
LOG_FILE="/var/log/${JOB_NAME}.log"

# Process and get stats
uniqseq "$LOG_FILE" --stats-format json --quiet \
    > /dev/null 2> stats.json

# Extract metrics
TOTAL=$(jq '.statistics.lines.total' stats.json)
EMITTED=$(jq '.statistics.lines.emitted' stats.json)
SKIPPED=$(jq '.statistics.lines.skipped' stats.json)
REDUNDANCY=$(jq '.statistics.redundancy_pct' stats.json)

# Push to Prometheus pushgateway
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/batch_logs
# TYPE batch_log_lines gauge
batch_log_lines{job="${JOB_NAME}",type="total"} ${TOTAL}
batch_log_lines{job="${JOB_NAME}",type="emitted"} ${EMITTED}
batch_log_lines{job="${JOB_NAME}",type="skipped"} ${SKIPPED}
batch_log_redundancy_pct{job="${JOB_NAME}"} ${REDUNDANCY}
EOF

Advanced Scenarios

Multi-Stage Log Processing

Clean logs and filter for errors:

# Stage 1: Deduplicate
uniqseq batch-job.log --skip-chars 20 --window-size 1 --quiet | \
# Stage 2: Extract errors and warnings
    grep -E "ERROR|WARN" | \
# Stage 3: Save for review
    tee errors.log

Normalize Variable Data

Remove timestamps and counters for more aggressive deduplication:

# Normalize progress messages before deduplication
uniqseq batch-job.log \
    --hash-transform 'sed -E "s/[0-9]+/N/g"' \
    --window-size 1 \
    --quiet

This groups all "Processing batch N" messages together (more aggressive).

Extract Job Summary

Build a concise job summary from verbose logs:

#!/bin/bash
# Create executive summary of batch job

echo "=== Batch Job Summary ==="
echo ""

# Start/end times
echo "Started: $(head -1 batch-job.log | cut -d' ' -f1-2)"
echo "Ended: $(tail -1 batch-job.log | cut -d' ' -f1-2)"
echo ""

# Deduplication metrics
uniqseq batch-job.log --stats-format json --quiet \
    > /dev/null 2> stats.json

echo "Log Statistics:"
jq -r '.statistics |
    "  Total lines: \(.lines.total)
     Unique events: \(.lines.emitted)
     Duplicates removed: \(.lines.skipped)
     Redundancy: \(.redundancy_pct)%"' stats.json
echo ""

# Error summary
echo "Issues Found:"
uniqseq batch-job.log --skip-chars 20 --window-size 1 --quiet | \
    grep -c ERROR | \
    xargs -I {} echo "  Errors: {}"

uniqseq batch-job.log --skip-chars 20 --window-size 1 --quiet | \
    grep -c WARN | \
    xargs -I {} echo "  Warnings: {}"

Detect Logging Loops

Find suspiciously repeated messages:

# Show messages that appear many times (potential loops)
uniqseq batch-job.log --skip-chars 20 --annotate --quiet | \
    grep "DUPLICATE" | \
    grep -oE "seen [0-9]+ times" | \
    awk '{print $2}' | \
    sort -rn | \
    head -1 | \
    xargs -I {} echo "Most repeated message seen {} times"

If a message repeats hundreds of times, investigate for logging loops.

Performance Benefits

Storage Savings

# Before cleanup
$ wc -c < batch-job-full.log
5242880  # 5 MB

# After deduplication (typical 10-30% reduction for batch jobs)
$ uniqseq batch-job-full.log --quiet | wc -c
4194304  # 4 MB → 20% reduction

Analysis Speed

# Time to grep through original log
$ time grep ERROR batch-job-full.log
real    0m2.5s

# Time to grep through deduplicated log
$ time grep ERROR batch-job-clean.log
real    0m2.0s  # 20% faster

When to Use This

Good candidates: - ✅ ETL jobs with repeated progress updates - ✅ Data migration scripts with batch processing - ✅ Long-running imports/exports (hours to days) - ✅ Scheduled tasks with verbose logging

Not recommended: - ❌ Short jobs (<1 minute) with unique messages - ❌ Jobs where every log line is critical for compliance - ❌ Real-time streaming applications (use tail -f | uniqseq instead)