Skip to content

Production Monitoring: Long-Running Job Analysis

Monitor batch job progress and clean up verbose logs from long-running ETL processes, data migrations, or scheduled tasks using statistics tracking.

The Problem

Long-running batch jobs produce verbose logs with repeated progress messages:

  • Repetitive progress updates - "Processing batch N" appears hundreds of times
  • Hard to spot errors - Important warnings buried in progress noise
  • Log file bloat - Gigabytes of logs for multi-hour jobs
  • No summary metrics - Manual counting to understand job completion

Input Data

batch-job.log
2024-01-15 10:00:00 [INFO] Starting batch job: process_daily_transactions
2024-01-15 10:00:01 [INFO] Loading configuration from /etc/batch/config.json
2024-01-15 10:00:05 [INFO] Connected to database: postgres://prod-db:5432/transactions
2024-01-15 10:00:10 [INFO] Processing batch 1/100 (1000 records)
2024-01-15 10:00:15 [INFO] Processed 1000 records (1% complete)
2024-01-15 10:00:20 [INFO] Processing batch 2/100 (1000 records)
2024-01-15 10:00:25 [INFO] Processed 2000 records (2% complete)
2024-01-15 10:00:30 [INFO] Processing batch 3/100 (1000 records)
2024-01-15 10:00:35 [INFO] Processed 3000 records (3% complete)
2024-01-15 10:00:40 [INFO] Processing batch 4/100 (1000 records)
2024-01-15 10:00:45 [INFO] Processed 4000 records (4% complete)
2024-01-15 10:00:50 [INFO] Processing batch 5/100 (1000 records)
2024-01-15 10:00:55 [INFO] Processed 5000 records (5% complete)
2024-01-15 10:01:00 [WARN] Slow query detected: transaction_summary took 2.5s
2024-01-15 10:01:05 [INFO] Processing batch 6/100 (1000 records)
2024-01-15 10:01:10 [INFO] Processed 6000 records (6% complete)
2024-01-15 10:01:15 [INFO] Processing batch 7/100 (1000 records)
2024-01-15 10:01:20 [INFO] Processed 7000 records (7% complete)
2024-01-15 10:01:25 [WARN] Slow query detected: transaction_summary took 2.5s
2024-01-15 10:01:30 [INFO] Processing batch 8/100 (1000 records)
2024-01-15 10:01:35 [INFO] Processed 8000 records (8% complete)
2024-01-15 10:01:40 [INFO] Processing batch 9/100 (1000 records)
2024-01-15 10:01:45 [INFO] Processed 9000 records (9% complete)
2024-01-15 10:01:50 [INFO] Processing batch 10/100 (1000 records)
2024-01-15 10:01:55 [INFO] Processed 10000 records (10% complete)
2024-01-15 10:02:00 [INFO] Checkpoint: Committed 10000 records to database
2024-01-15 10:02:05 [INFO] Processing batch 11/100 (1000 records)
2024-01-15 10:02:10 [INFO] Processed 11000 records (11% complete)
2024-01-15 10:02:15 [INFO] Processing batch 12/100 (1000 records)
2024-01-15 10:02:20 [INFO] Processed 12000 records (12% complete)

Batch job log with 30 lines covering 12 batches:

  • "Processing batch N/100" appears 12×
  • "Processed N records" appears 12×
  • "Slow query detected" appears (lines 14, 19)
  • Other: startup, connection, checkpoint messages

Output Data

expected-clean.log
2024-01-15 10:00:00 [INFO] Starting batch job: process_daily_transactions
2024-01-15 10:00:01 [INFO] Loading configuration from /etc/batch/config.json
2024-01-15 10:00:05 [INFO] Connected to database: postgres://prod-db:5432/transactions
2024-01-15 10:00:10 [INFO] Processing batch 1/100 (1000 records)
2024-01-15 10:00:15 [INFO] Processed 1000 records (1% complete)
2024-01-15 10:00:20 [INFO] Processing batch 2/100 (1000 records)
2024-01-15 10:00:25 [INFO] Processed 2000 records (2% complete)
2024-01-15 10:00:30 [INFO] Processing batch 3/100 (1000 records)
2024-01-15 10:00:35 [INFO] Processed 3000 records (3% complete)
2024-01-15 10:00:40 [INFO] Processing batch 4/100 (1000 records)
2024-01-15 10:00:45 [INFO] Processed 4000 records (4% complete)
2024-01-15 10:00:50 [INFO] Processing batch 5/100 (1000 records)
2024-01-15 10:00:55 [INFO] Processed 5000 records (5% complete)
2024-01-15 10:01:00 [WARN] Slow query detected: transaction_summary took 2.5s
2024-01-15 10:01:05 [INFO] Processing batch 6/100 (1000 records)
2024-01-15 10:01:10 [INFO] Processed 6000 records (6% complete)
2024-01-15 10:01:15 [INFO] Processing batch 7/100 (1000 records)
2024-01-15 10:01:20 [INFO] Processed 7000 records (7% complete)
2024-01-15 10:01:30 [INFO] Processing batch 8/100 (1000 records)
2024-01-15 10:01:35 [INFO] Processed 8000 records (8% complete)
2024-01-15 10:01:40 [INFO] Processing batch 9/100 (1000 records)
2024-01-15 10:01:45 [INFO] Processed 9000 records (9% complete)
2024-01-15 10:01:50 [INFO] Processing batch 10/100 (1000 records)
2024-01-15 10:01:55 [INFO] Processed 10000 records (10% complete)
2024-01-15 10:02:00 [INFO] Checkpoint: Committed 10000 records to database
2024-01-15 10:02:05 [INFO] Processing batch 11/100 (1000 records)
2024-01-15 10:02:10 [INFO] Processed 11000 records (11% complete)
2024-01-15 10:02:15 [INFO] Processing batch 12/100 (1000 records)
2024-01-15 10:02:20 [INFO] Processed 12000 records (12% complete)

Result: 29 lines (1 duplicate "Slow query" warning removed)

The repeated progress messages have different content (batch numbers, record counts, percentages), so they're not exact duplicates when ignoring timestamps.

Solution

uniqseq batch-job.log \ --skip-chars 20 \ --window-size 1 \ --quiet > clean-job.log

Options:

  • --skip-chars 20: Skip timestamp prefix
  • --window-size 1: Deduplicate individual lines
  • --quiet: Suppress statistics (only output clean log)
$ uniqseq batch-job.log \
    --skip-chars 20 \
    --window-size 1 \
    --stats-format json \
    > clean-job.log \
    2> job-stats.json

Result: Clean log to stdout, metrics to stderr

Extract specific metrics with jq:

$ jq '.statistics.redundancy_pct' job-stats.json
3.3

$ jq '.statistics.lines' job-stats.json
{
  "total": 30,
  "emitted": 29,
  "skipped": 1
}

from uniqseq import UniqSeq
import json

uniqseq = UniqSeq(
    skip_chars=20,    # (1)!
    window_size=1,    # (2)!
)

# Process log file
with open("batch-job.log") as f:
    with open("output.log", "w") as out:
        for line in f:
            uniqseq.process_line(line.rstrip("\n"), out)
        uniqseq.flush_to_stream(out)

# Get statistics programmatically
stats = uniqseq.get_stats()  # (3)!

# Save metrics
with open("job-stats.json", "w") as f:
    json.dump(stats, f, indent=2)

  1. Skip 20-character timestamp prefix
  2. Deduplicate individual log lines
  3. Access statistics from uniqseq object

How It Works

Deduplication removes exact duplicate messages while preserving unique progress updates:

Before (30 lines):
2024-01-15 10:00:10 [INFO] Processing batch 1/100 (1000 records)
2024-01-15 10:00:15 [INFO] Processed 1000 records (1% complete)
2024-01-15 10:00:20 [INFO] Processing batch 2/100 (1000 records)
2024-01-15 10:00:25 [INFO] Processed 2000 records (2% complete)
2024-01-15 10:01:00 [WARN] Slow query detected: transaction_summary took 2.5s
...
2024-01-15 10:01:25 [WARN] Slow query detected: transaction_summary \
    took 2.5s  ← duplicate

After (29 lines):
Same content, but duplicate warning removed

Real-World Workflows

Monitor Job Progress in Real-Time

Watch a running job and see only new, unique log entries:

# Follow log file, show only new unique patterns
tail -f /var/log/batch-job.log | \
    uniqseq --skip-chars 20 --window-size 1

Only new unique messages appear (repeated "Processing batch N" with different N will show, exact duplicates won't).

Extract Job Statistics

Monitor deduplication metrics to understand log verbosity:

#!/bin/bash
# Analyze job logs after completion

JOB_LOG="/var/log/etl-job-$(date +%Y%m%d).log"

uniqseq "$JOB_LOG" --skip-chars 20 --window-size 1 \
    --stats-format json \
    > "/var/clean-logs/$(basename $JOB_LOG)" \
    2> "/var/metrics/$(basename $JOB_LOG).json"

# Extract metrics
METRICS_FILE="/var/metrics/$(basename $JOB_LOG).json"
REDUNDANCY=$(jq '.statistics.redundancy_pct' "$METRICS_FILE")
echo "Log redundancy: ${REDUNDANCY}%"

# Alert if excessive duplication
if (( $(echo "$REDUNDANCY > 20" | bc -l) )); then
    echo "WARNING: High log duplication - consider reducing log verbosity"
fi

Daily Batch Job Cleanup

Automatically clean and archive batch job logs:

#!/bin/bash
# Daily log cleanup for batch jobs

LOG_DIR="/var/log/batch-jobs"
ARCHIVE_DIR="/var/archive/clean-logs"
STATS_DIR="/var/metrics/batch-stats"

for log in ${LOG_DIR}/*.log; do
    basename=$(basename "$log")

    # Deduplicate and capture stats
    uniqseq "$log" --skip-chars 20 --window-size 1 \
        --stats-format json \
        > "${ARCHIVE_DIR}/${basename}" \
        2> "${STATS_DIR}/${basename}.json"

    # Extract key metrics
    total=$(jq '.statistics.lines.total' "${STATS_DIR}/${basename}.json")
    emitted=$(jq '.statistics.lines.emitted' "${STATS_DIR}/${basename}.json")

    echo "${basename}: ${total}${emitted} lines"
done

# Compress original logs
gzip ${LOG_DIR}/*.log

Compare Job Runs

Track deduplication metrics over time to detect logging changes:

#!/bin/bash
# Compare batch job patterns week-over-week

# Process week 1 logs
uniqseq week1-job.log --stats-format json --quiet \
    > /dev/null 2> week1-stats.json

# Process week 2 logs
uniqseq week2-job.log --stats-format json --quiet \
    > /dev/null 2> week2-stats.json

# Compare redundancy
W1_RED=$(jq '.statistics.redundancy_pct' week1-stats.json)
W2_RED=$(jq '.statistics.redundancy_pct' week2-stats.json)

echo "Week 1 redundancy: ${W1_RED}%"
echo "Week 2 redundancy: ${W2_RED}%"

# Alert if significant change
DIFF=$(echo "$W2_RED - $W1_RED" | bc)
if (( $(echo "$DIFF > 10" | bc -l) )); then
    echo "⚠️  Redundancy increased by ${DIFF}% - check for logging loops"
fi

Integration with Monitoring

Push deduplication metrics to monitoring systems:

#!/bin/bash
# Send batch job metrics to Prometheus

JOB_NAME="daily_etl"
LOG_FILE="/var/log/${JOB_NAME}.log"

# Process and get stats
uniqseq "$LOG_FILE" --stats-format json --quiet \
    > /dev/null 2> stats.json

# Extract metrics
TOTAL=$(jq '.statistics.lines.total' stats.json)
EMITTED=$(jq '.statistics.lines.emitted' stats.json)
SKIPPED=$(jq '.statistics.lines.skipped' stats.json)
REDUNDANCY=$(jq '.statistics.redundancy_pct' stats.json)

# Push to Prometheus pushgateway
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/batch_logs
# TYPE batch_log_lines gauge
batch_log_lines{job="${JOB_NAME}",type="total"} ${TOTAL}
batch_log_lines{job="${JOB_NAME}",type="emitted"} ${EMITTED}
batch_log_lines{job="${JOB_NAME}",type="skipped"} ${SKIPPED}
batch_log_redundancy_pct{job="${JOB_NAME}"} ${REDUNDANCY}
EOF

Advanced Scenarios

Multi-Stage Log Processing

Clean logs and filter for errors:

# Stage 1: Deduplicate
uniqseq batch-job.log --skip-chars 20 --window-size 1 --quiet | \
# Stage 2: Extract errors and warnings
    grep -E "ERROR|WARN" | \
# Stage 3: Save for review
    tee errors.log

Normalize Variable Data

Remove timestamps and counters for more aggressive deduplication:

# Normalize progress messages before deduplication
uniqseq batch-job.log \
    --hash-transform 'sed -E "s/[0-9]+/N/g"' \
    --window-size 1 \
    --quiet

This groups all "Processing batch N" messages together (more aggressive).

Extract Job Summary

Build a concise job summary from verbose logs:

#!/bin/bash
# Create executive summary of batch job

echo "=== Batch Job Summary ==="
echo ""

# Start/end times
echo "Started: $(head -1 batch-job.log | cut -d' ' -f1-2)"
echo "Ended: $(tail -1 batch-job.log | cut -d' ' -f1-2)"
echo ""

# Deduplication metrics
uniqseq batch-job.log --stats-format json --quiet \
    > /dev/null 2> stats.json

echo "Log Statistics:"
jq -r '.statistics |
    "  Total lines: \(.lines.total)
     Unique events: \(.lines.emitted)
     Duplicates removed: \(.lines.skipped)
     Redundancy: \(.redundancy_pct)%"' stats.json
echo ""

# Error summary
echo "Issues Found:"
uniqseq batch-job.log --skip-chars 20 --window-size 1 --quiet | \
    grep -c ERROR | \
    xargs -I {} echo "  Errors: {}"

uniqseq batch-job.log --skip-chars 20 --window-size 1 --quiet | \
    grep -c WARN | \
    xargs -I {} echo "  Warnings: {}"

Detect Logging Loops

Find suspiciously repeated messages:

# Show messages that appear many times (potential loops)
uniqseq batch-job.log --skip-chars 20 --annotate --quiet | \
    grep "DUPLICATE" | \
    grep -oE "seen [0-9]+ times" | \
    awk '{print $2}' | \
    sort -rn | \
    head -1 | \
    xargs -I {} echo "Most repeated message seen {} times"

If a message repeats hundreds of times, investigate for logging loops.

Performance Benefits

Storage Savings

# Before cleanup
$ wc -c < batch-job-full.log
5242880  # 5 MB

# After deduplication (typical 10-30% reduction for batch jobs)
$ uniqseq batch-job-full.log --quiet | wc -c
4194304  # 4 MB → 20% reduction

Analysis Speed

# Time to grep through original log
$ time grep ERROR batch-job-full.log
real    0m2.5s

# Time to grep through deduplicated log
$ time grep ERROR batch-job-clean.log
real    0m2.0s  # 20% faster

When to Use This

Good candidates: - ✅ ETL jobs with repeated progress updates - ✅ Data migration scripts with batch processing - ✅ Long-running imports/exports (hours to days) - ✅ Scheduled tasks with verbose logging

Not recommended: - ❌ Short jobs (<1 minute) with unique messages - ❌ Jobs where every log line is critical for compliance - ❌ Real-time streaming applications (use tail -f | uniqseq instead)

See Also