Operations: Log Ingestion Preprocessing
Reduce log storage costs and improve query performance by deduplicating logs before sending them to Loki, Elasticsearch, or another centralized logging system.
The Problem
Centralized logging systems typically charge by ingestion volume, so duplicate log lines cost you in several ways:
- Storage costs - Duplicate logs waste expensive storage
- Query performance - Larger indices slow down searches
- Retention limits - Duplicates fill retention windows faster
- Network bandwidth - Shipping duplicate logs wastes bandwidth
Typical savings: 50-90% reduction in log volume for verbose applications.
Input Data
app-verbose.log
2024-01-15 10:30:01 INFO Starting request processing
2024-01-15 10:30:02 ERROR Database connection failed: timeout after 30s
2024-01-15 10:30:03 WARN Retrying database connection
2024-01-15 10:30:04 INFO Request queued for retry
2024-01-15 10:30:05 ERROR Database connection failed: timeout after 30s
2024-01-15 10:30:06 WARN Retrying database connection
2024-01-15 10:30:07 INFO Request queued for retry
2024-01-15 10:30:08 INFO Processing user authentication
2024-01-15 10:30:09 ERROR Database connection failed: timeout after 30s
2024-01-15 10:30:10 WARN Retrying database connection
A verbose application log with 10 entries, in which three messages repeat:
- "Database connection failed" (lines 2, 5, 9) - appears 3×
- "Retrying database connection" (lines 3, 6, 10) - appears 3×
- "Request queued for retry" (lines 4, 7) - appears 2×
Output Data
expected-loki-input.log
2024-01-15 10:30:01 INFO Starting request processing
2024-01-15 10:30:02 ERROR Database connection failed: timeout after 30s
2024-01-15 10:30:03 WARN Retrying database connection
2024-01-15 10:30:04 INFO Request queued for retry
2024-01-15 10:30:08 INFO Processing user authentication
Result: 5 duplicate entries removed → 50% reduction in ingestion volume
Solution
How it works:
- cat - Stream log file (in production, use tail -f for live logs)
- uniqseq - Deduplicate in real-time
- Output can be piped to promtail, filebeat, or other log shippers
Process log files in batches before ingestion with filebeat or other shippers.
from uniqseq import UniqSeq

uniqseq = UniqSeq(
    skip_chars=20,  # (1)!
    window_size=1,  # (2)!
)

# Stream processing for log shipper
with open("app-verbose.log") as f:
    with open("output.log", "w") as out:
        for line in f:
            uniqseq.process_line(line.rstrip("\n"), out)
        uniqseq.flush_to_stream(out)
- (1) Skip the 20-character timestamp prefix (YYYY-MM-DD HH:MM:SS plus the trailing space)
- (2) Deduplicate individual log lines
How It Works
Before logs reach your centralized logging system, uniqseq drops repeated entries and keeps the first occurrence of each message with its original timestamp:
Before (10 lines → Loki):
2024-01-15 10:30:02 ERROR Database connection failed...
2024-01-15 10:30:05 ERROR Database connection failed... ← duplicate
2024-01-15 10:30:09 ERROR Database connection failed... ← duplicate
After (5 lines → Loki):
2024-01-15 10:30:02 ERROR Database connection failed...
(2 duplicates removed, saving storage and bandwidth)
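The idea can be sketched in a few lines of plain Python. This illustrates first-occurrence deduplication keyed on the text after the timestamp; it is not uniqseq's actual implementation. The function name dedupe_lines is hypothetical, and the sketch assumes an unbounded in-memory set is acceptable for the input size.

```python
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str], skip_chars: int = 20) -> Iterator[str]:
    """Yield each line whose text after the first `skip_chars` characters is new."""
    seen = set()
    for line in lines:
        key = line[skip_chars:]  # ignore the timestamp prefix when comparing
        if key not in seen:
            seen.add(key)
            yield line  # the first occurrence keeps its original timestamp


sample = [
    "2024-01-15 10:30:02 ERROR Database connection failed: timeout after 30s",
    "2024-01-15 10:30:03 WARN Retrying database connection",
    "2024-01-15 10:30:05 ERROR Database connection failed: timeout after 30s",
    "2024-01-15 10:30:06 WARN Retrying database connection",
]
kept = list(dedupe_lines(sample))
for line in kept:
    print(line)
```

Only the 10:30:02 and 10:30:03 entries are printed; the later repeats differ from them only in the skipped prefix.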
Real-World Integration
Loki with Promtail
Stream logs through uniqseq before Promtail ingests them:
# In systemd service or Docker container
tail -f /var/log/app.log | \
  uniqseq --skip-chars 20 --window-size 1 | \
  promtail --stdin --config.file /etc/promtail/config.yml
Promtail config (promtail.yml):
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: deduplicated-logs
    stdin:
      use_incoming_timestamp: true
Elasticsearch with Filebeat
Deduplicate log files before Filebeat processes them:
#!/bin/bash
# Preprocessing script run by cron
for logfile in /var/log/app/*.log; do
  uniqseq "$logfile" --skip-chars 20 --window-size 1 > "${logfile}.deduped"
done
# Filebeat picks up *.deduped files
filebeat -c /etc/filebeat/filebeat.yml
Filebeat config (filebeat.yml):
Fluentd Pipeline
Integrate uniqseq as a preprocessing filter:
# Fluentd exec input (the log lines are plain text, so they are not parsed as JSON)
<source>
  @type exec
  command tail -f /var/log/app.log | uniqseq --skip-chars 20 --window-size 1
  format none
  tag app.deduped
</source>
<match app.deduped>
  @type elasticsearch
  host elasticsearch.internal
  index_name app-logs-deduped
</match>
Cost Analysis
Storage Savings
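Before deduplication: about 500 MB/day (the baseline implied by the 70% reduction and the 182.5 GB/year figure in this section)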
After deduplication:
$ uniqseq app-verbose.log --skip-chars 20 --window-size 1 | wc -c
150000000 # 150 MB/day → 70% reduction
Savings with one year of retention (at $0.10/GB/month):
- Before: 182.5 GB retained × $0.10 = $18.25/month
- After: 54.75 GB retained × $0.10 = $5.48/month
- Savings: $12.77/month per application
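The arithmetic behind these figures can be reproduced with a short script. This is a sketch under stated assumptions: the 500 MB/day baseline is inferred from the 182.5 GB/year figure, decimal gigabytes are used, and the helper name monthly_storage_cost is hypothetical.

```python
GB = 1_000_000_000  # decimal gigabyte, matching the MB/day figures


def monthly_storage_cost(bytes_per_day: int, retention_days: int = 365,
                         price_per_gb_month: float = 0.10) -> float:
    """Monthly cost of keeping `retention_days` worth of logs in storage."""
    retained_gb = bytes_per_day * retention_days / GB
    return retained_gb * price_per_gb_month


before = monthly_storage_cost(500_000_000)  # assumed baseline: 500 MB/day
after = monthly_storage_cost(150_000_000)   # 150 MB/day after deduplication
print(f"before: ${before:.2f}/month")
print(f"after:  ${after:.2f}/month")
print(f"saved:  ${before - after:.2f}/month")
```

Changing retention_days or price_per_gb_month shows how the savings scale with your own retention policy and pricing.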
Query Performance
Smaller indices mean faster queries:
# Loki query with duplicates: 10M entries
# Query time: 5-10 seconds
# Loki query without duplicates: 3M entries
# Query time: 1-2 seconds
Advanced Workflows
Multi-Stage Deduplication
Combine with other preprocessing:
# Stage 1: Remove debug logs
grep -v DEBUG app.log | \
# Stage 2: Deduplicate
uniqseq --skip-chars 20 --window-size 1 | \
# Stage 3: Send to Loki
promtail --stdin --config.file promtail.yml
Application-Specific Normalization
Normalize application-specific fields before deduplication:
# Remove request IDs before deduplicating
uniqseq app.log \
  --hash-transform 'sed -E "s/req-[a-z0-9]+/REQ/g"' \
  --skip-chars 20 \
  --window-size 1 | \
  promtail --stdin --config.file promtail.yml
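To see why normalization helps, here is a minimal Python sketch of the comparison key that such a transform produces. It is not uniqseq's code: the function name comparison_key is hypothetical, and applying the prefix skip before the substitution is an assumption about ordering. Only the key is normalized; the line sent to the log shipper stays unchanged.

```python
import re

# Same pattern as the sed expression: request IDs such as req-ab12
REQUEST_ID = re.compile(r"req-[a-z0-9]+")


def comparison_key(line: str, skip_chars: int = 20) -> str:
    """Return the text used for duplicate detection; the line itself is not modified."""
    return REQUEST_ID.sub("REQ", line[skip_chars:])


a = "2024-01-15 10:30:02 ERROR req-ab12 Database connection failed"
b = "2024-01-15 10:30:05 ERROR req-cd34 Database connection failed"
print(comparison_key(a) == comparison_key(b))  # True
```

Without the substitution the two lines would have different keys, so both would be shipped.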
Per-Service Deduplication
Deduplicate each microservice separately:
#!/bin/bash
for service in auth api gateway; do
  tail -f "/var/log/${service}.log" | \
    uniqseq --skip-chars 20 --window-size 1 | \
    promtail --stdin --config.file "${service}-promtail.yml" &
done
# Keep the script in the foreground while the background pipelines run
wait
Monitoring Deduplication
Track reduction metrics:
#!/bin/bash
# Log preprocessing metrics
ORIGINAL=$(wc -l < app.log)
uniqseq app.log --skip-chars 20 --window-size 1 > deduped.log
DEDUPED=$(wc -l < deduped.log)
REDUCTION=$(( 100 - (DEDUPED * 100 / ORIGINAL) ))
echo "Log reduction: ${REDUCTION}% (${ORIGINAL} → ${DEDUPED} lines)"
# Send to Prometheus/Grafana
echo "log_reduction_percent{app=\"myapp\"} ${REDUCTION}" | \
curl --data-binary @- http://pushgateway:9091/metrics/job/log_preprocessing
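The same metric can be computed in Python, which makes it easy to guard against an empty input file (the shell arithmetic above divides by zero in that case). This is a sketch; the function names reduction_percent and pushgateway_payload are hypothetical, and sending the payload to the Pushgateway is left to the caller.

```python
def reduction_percent(original_lines: int, deduped_lines: int) -> float:
    """Percentage of lines removed; 0.0 for an empty input instead of dividing by zero."""
    if original_lines <= 0:
        return 0.0
    return 100.0 * (original_lines - deduped_lines) / original_lines


def pushgateway_payload(app: str, percent: float) -> str:
    """Format one sample in the Prometheus text exposition format."""
    return f'log_reduction_percent{{app="{app}"}} {percent:.1f}\n'


# The 10-line sample log deduplicates to 5 lines
print(pushgateway_payload("myapp", reduction_percent(10, 5)), end="")
# log_reduction_percent{app="myapp"} 50.0
```

The trailing newline in the payload matters: the Pushgateway expects the request body to end with a line feed.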
When to Use This
Good candidates for preprocessing:
- ✅ Verbose applications with retry logic
- ✅ High-frequency logging (>1000 lines/second)
- ✅ Expensive log retention (Splunk, Datadog, etc.)
- ✅ Bandwidth-constrained environments
Not recommended:
- ❌ Low-volume logs (<100 MB/day)
- ❌ Logs where every entry is unique
- ❌ Compliance/audit logs requiring complete history
- ❌ Real-time alerting that depends on seeing every repeated entry (for example, error-rate thresholds)
See Also
- Skip Chars - Ignoring timestamp prefixes
- Hash Transform - Normalizing before deduplication
- Statistics - Measuring deduplication effectiveness