Data Processing: Log Normalization
Normalize log entries by removing timestamps, IDs, and normalizing whitespace before deduplication. Find truly unique messages despite formatting variations.
The Problem
Application logs often have the same message repeated with variations:
- Timestamps differ - Same error at different times
- Request IDs differ - Unique IDs make every line look different
- Whitespace varies - Inconsistent spacing in log messages
- UUIDs and session IDs - Obscure duplicate patterns
These variations prevent traditional line-based deduplication from working.
Input Data
request.log
2024-01-15T10:30:01.123Z | req-a1b2c3 | INFO | Processing payment
2024-01-15T10:30:02.456Z | req-d4e5f6 | ERROR | Payment gateway timeout
2024-01-15T10:30:03.789Z | req-g7h8i9 | INFO | Processing payment
2024-01-15T10:30:04.012Z | req-j0k1l2 | ERROR | Payment gateway timeout
2024-01-15T10:30:05.345Z | req-m3n4o5 | WARN | Retry attempt 1
2024-01-15T10:30:06.678Z | req-p6q7r8 | ERROR | Payment gateway timeout
The log contains 6 entries, but only 3 unique messages:
- "Processing payment" (lines 1, 3) - different timestamps/request IDs
- "Payment gateway timeout" (lines 2, 4, 6) - different whitespace & IDs
- "Retry attempt 1" (line 5) - unique
Output Data
expected-normalized.log
2024-01-15T10:30:01.123Z | req-a1b2c3 | INFO | Processing payment
2024-01-15T10:30:02.456Z | req-d4e5f6 | ERROR | Payment gateway timeout
2024-01-15T10:30:05.345Z | req-m3n4o5 | WARN | Retry attempt 1
Result: 3 duplicate entries removed → 3 unique messages
Solution
How it works:
- Remove ISO timestamps:
sed -E 's/[0-9]{4}-.../g' - Remove request IDs:
sed -E 's/req-[a-z0-9]+//g' - Normalize whitespace:
tr -s ' '(squeeze multiple spaces to one)
import re
from uniqseq import UniqSeq
def normalize_log(line):
# Remove ISO timestamp
line = re.sub(r'\d{4}-\d{2}-\d{2}T[\d:.]+Z', '', line)
# Remove request IDs
line = re.sub(r'req-[a-z0-9]+', '', line)
# Normalize whitespace
line = ' '.join(line.split())
return line
uniqseq = UniqSeq(
hash_transform=normalize_log, # (1)!
window_size=1, # (2)!
)
with open("request.log") as f:
with open("output.log", "w") as out:
for line in f:
uniqseq.process_line(line.rstrip("\n"), out)
uniqseq.flush_to_stream(out)
- Python lambda for multi-step normalization
- Deduplicate individual lines
How It Works
The --hash-transform normalizes each line before hashing, but preserves the original line in output:
Original:
2024-01-15T10:30:02.456Z | req-d4e5f6 | ERROR | Payment gateway timeout
After normalization (for hashing):
| | ERROR | Payment gateway timeout
↓ (timestamps/IDs removed, whitespace normalized)
Output (original preserved):
2024-01-15T10:30:02.456Z | req-d4e5f6 | ERROR | Payment gateway timeout
Lines with identical normalized content are considered duplicates.
Multi-Step Transformation
Complex normalizations combine multiple steps:
- Remove timestamps: Strip ISO 8601 timestamps
- Remove IDs: Strip request/session/trace IDs
- Normalize whitespace: Convert multiple spaces to single space
Real-World Workflows
Remove All Variable Data
Normalize timestamps, IDs, IP addresses, and numbers:
uniqseq app.log \
--hash-transform \
"sed -E 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/IP/g' | \
sed -E 's/[0-9]+/NUM/g' | \
tr -s ' '" \
--window-size 1 > normalized.log
Case-Insensitive + Normalization
Combine case conversion with other normalizations:
uniqseq app.log \
--hash-transform "tr '[:upper:]' '[:lower:]' | \
sed -E 's/user-[0-9]+/USER/g' | \
tr -s ' '" \
--window-size 1 > output.log
Extract Error Patterns
Remove context to find error message patterns:
# Keep only the error message part (field 4)
uniqseq app.log \
--hash-transform "awk -F'|' '{print \$4}' | tr -s ' '" \
--window-size 1 > error-patterns.log
Production Log Analysis
Analyze production logs with high cardinality IDs:
# Remove UUIDs, session IDs, timestamps
uniqseq production.log \
--hash-transform \
"sed -E 's/[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-\
[a-f0-9]{4}-[a-f0-9]{12}//g' | \
sed -E 's/session_[a-zA-Z0-9]+//g' | \
sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}//g' | \
tr -s ' '" \
--window-size 1 \
--annotate > unique-errors.log
The --annotate flag shows how many times each pattern appeared.
Performance Considerations
Hash transforms run a subprocess for each line. For large files:
Optimize the pipeline:
# Slower: Complex regex
--hash-transform "sed -E 's/very-complex-pattern//g'"
# Faster: Simple patterns
--hash-transform "sed 's/simple-string//g'"
# Faster: Multiple simple sed commands
--hash-transform "sed 's/foo//g' | sed 's/bar//g'"
Consider preprocessing:
# For very large files, preprocess once outside uniqseq
sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[\d:.]+Z//g' huge.log | \
uniqseq --window-size 1
Common Normalization Patterns
# Remove ISO timestamps
sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z//g'
# Remove UUIDs
sed -E 's/[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}//g'
# Remove IP addresses
sed -E 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}//g'
# Remove all numbers
sed 's/[0-9]\+//g'
# Normalize whitespace
tr -s ' '
# Case-insensitive
tr '[:upper:]' '[:lower:]'
# Remove email addresses
sed -E 's/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}//g'
See Also
- Hash Transform - Detailed hash transform documentation
- Field Extraction - Extract specific fields
- Skip Chars - Skip fixed-width prefixes