Operations: Log Template Extraction
Extract log message templates by normalizing variable parameters, enabling pattern discovery and integration with log analysis tools like Drain3, Spell, or LogPAI.
The Problem
Application logs contain repeated message templates with varying parameters:
- High cardinality - Thousands of unique messages with same structure
- Hard to analyze - Variable data obscures common patterns
- Poor aggregation - Can't group by message type
- Template extraction is slow - ML-based tools require significant computation
Preprocessing with normalization dramatically reduces log cardinality and speeds up template extraction.
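To make the cardinality claim concrete, the following minimal sketch counts distinct messages before and after normalization. The sample messages and the `normalize` helper are illustrative assumptions, not part of uniqseq.

```python
import re

# Illustrative sample messages (timestamps already stripped)
lines = [
    "User alice logged in from 192.168.1.100",
    "User bob logged in from 192.168.1.101",
    "User charlie logged in from 192.168.1.102",
    "Cache miss for key user:alice:profile",
    "Cache miss for key user:bob:profile",
]

def normalize(line):
    """Hypothetical helper: replace usernames and IPs with placeholders."""
    line = re.sub(r"(User |user:)[a-z]+", r"\1<USER>", line)
    line = re.sub(r"from [0-9.]+", "from <IP>", line)
    return line

raw_cardinality = len(set(lines))
template_cardinality = len({normalize(line) for line in lines})
print(raw_cardinality, "->", template_cardinality)
```

Five distinct messages collapse to two templates here; real logs typically need more rules than these two.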
Input Data
varied-logs.txt
2024-01-15 10:00:00 [INFO] User alice logged in from 192.168.1.100
2024-01-15 10:00:05 [INFO] User bob logged in from 192.168.1.101
2024-01-15 10:00:10 [ERROR] Failed to connect to database server db-prod-01 on port 5432
2024-01-15 10:00:15 [INFO] User charlie logged in from 192.168.1.102
2024-01-15 10:00:20 [ERROR] Failed to connect to database server db-prod-02 on port 5432
2024-01-15 10:00:25 [INFO] Processing request req-a1b2c3 for user alice took 150ms
2024-01-15 10:00:30 [INFO] User dave logged in from 192.168.1.103
2024-01-15 10:00:35 [INFO] Processing request req-d4e5f6 for user bob took 200ms
2024-01-15 10:00:40 [ERROR] Failed to connect to database server db-prod-01 on port 5432
2024-01-15 10:00:45 [INFO] Processing request req-g7h8i9 for user charlie took 175ms
2024-01-15 10:00:50 [INFO] User eve logged in from 192.168.1.104
2024-01-15 10:00:55 [WARN] Cache miss for key user:alice:profile
2024-01-15 10:01:00 [INFO] Processing request req-j1k2l3 for user dave took 180ms
2024-01-15 10:01:05 [WARN] Cache miss for key user:bob:profile
2024-01-15 10:01:10 [ERROR] Failed to connect to database server db-prod-03 on port 5432
Application log with 15 entries across 4 template patterns:
- "User X logged in from Y" (5 instances, different users/IPs)
- "Failed to connect to database server X" (4 instances, 3 different servers)
- "Processing request X for user Y took Zms" (4 instances, different IDs/users/times)
- "Cache miss for key X" (2 instances, different keys)
Output Data
expected-templates.txt
2024-01-15 10:00:00 [INFO] User <USER> logged in from <IP>
2024-01-15 10:00:10 [ERROR] Failed to connect to database server db-prod-<N> on port 5432
2024-01-15 10:00:25 [INFO] Processing request req-<ID> for user <USER> took <TIME>ms
2024-01-15 10:00:55 [WARN] Cache miss for key user:<USER>:profile
Result: 4 unique templates (15 → 4 lines, 73% reduction)
Variable parameters replaced with placeholders:
- <USER> - Username
- <IP> - IP address
- <ID> - Request ID
- <TIME> - Timing value
- <N> - Server number
Solution
Pipeline stages:
- sed - Normalize variable parameters to placeholders
- uniqseq - Deduplicate to extract unique templates
- Output contains one instance of each template
$ uniqseq varied-logs.txt \
--skip-chars 20 \
--hash-transform 'sed -E "s/(user |User )[a-z]+/\1<USER>/g; \
s/from [0-9.]+/from <IP>/g; \
s/req-[a-z0-9]+/req-<ID>/g; \
s/took [0-9]+ms/took <TIME>ms/g; \
s/db-prod-[0-9]+/db-prod-<N>/g; \
s/user:[a-z]+/user:<USER>/g"' \
--window-size 1 \
--quiet > grouped-by-template.log
Options:
- --hash-transform: Normalize before comparing (groups similar logs)
  - Original log lines are preserved in output
  - Deduplication happens on normalized version
import re
from uniqseq import UniqSeq

def normalize_log(line):
    """Normalize variable parameters to placeholders"""
    line = re.sub(r'(user |User )[a-z]+', r'\1<USER>', line)
    line = re.sub(r'from [0-9.]+', 'from <IP>', line)
    line = re.sub(r'req-[a-z0-9]+', 'req-<ID>', line)
    line = re.sub(r'took [0-9]+ms', 'took <TIME>ms', line)
    line = re.sub(r'db-prod-[0-9]+', 'db-prod-<N>', line)
    line = re.sub(r'user:[a-z]+', 'user:<USER>', line)
    return line

uniqseq = UniqSeq(
    skip_chars=20,  # (1)!
    window_size=1,  # (2)!
)

with open("varied-logs.txt") as f:
    with open("output.log", "w") as out:
        for line in f:
            line_clean = line.rstrip("\n")
            normalized = normalize_log(line_clean)  # (3)!
            # Process normalized line, deduplication happens automatically
            uniqseq.process_line(normalized, out)
        uniqseq.flush_to_stream(out)
- (1) Skip 20-character timestamp prefix
- (2) Deduplicate individual log lines
- (3) Normalize before checking for duplicates
How It Works
Normalization converts variable data to placeholders, revealing underlying templates:
Before normalization (15 unique lines):
User alice logged in from 192.168.1.100
User bob logged in from 192.168.1.101
User charlie logged in from 192.168.1.102
User dave logged in from 192.168.1.103
User eve logged in from 192.168.1.104
...
After normalization (1 template):
User <USER> logged in from <IP>
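The idea of comparing on the normalized form while emitting the original line can be sketched in plain Python. This illustrates the concept only and is not uniqseq's implementation; `dedupe_by_template` is a hypothetical helper, and the rules mirror the sed script in the Solution section.

```python
import re

def normalize(line):
    """Apply the same substitutions as the sed script, in Python."""
    line = re.sub(r"(user |User )[a-z]+", r"\1<USER>", line)
    line = re.sub(r"from [0-9.]+", "from <IP>", line)
    line = re.sub(r"req-[a-z0-9]+", "req-<ID>", line)
    line = re.sub(r"took [0-9]+ms", "took <TIME>ms", line)
    line = re.sub(r"db-prod-[0-9]+", "db-prod-<N>", line)
    line = re.sub(r"user:[a-z]+", "user:<USER>", line)
    return line

def dedupe_by_template(lines, skip_chars=20):
    """Yield the first original line seen for each normalized template."""
    seen = set()
    for line in lines:
        key = normalize(line[skip_chars:])  # ignore the timestamp prefix
        if key not in seen:
            seen.add(key)
            yield line  # original text, not the normalized key

sample = [
    "2024-01-15 10:00:00 [INFO] User alice logged in from 192.168.1.100",
    "2024-01-15 10:00:05 [INFO] User bob logged in from 192.168.1.101",
    "2024-01-15 10:00:55 [WARN] Cache miss for key user:alice:profile",
    "2024-01-15 10:01:05 [WARN] Cache miss for key user:bob:profile",
]
kept = list(dedupe_by_template(sample))
for line in kept:
    print(line)
```

Only the first login line and the first cache-miss line are printed, each with its original values intact.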
Real-World Workflows
Discover Log Templates
Extract all unique log message patterns:
#!/bin/bash
# Extract templates from application logs
cat /var/log/app.log | \
  # Normalize common variable patterns (one -e per rule, so no
  # backslash continuations are needed inside the quoted sed script)
  sed -E \
    -e 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/<IP>/g' \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z/<TIMESTAMP>/g' \
    -e 's/[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/<UUID>/g' \
    -e 's/(user|User|USER)[:=][a-zA-Z0-9]+/\1:<USER>/g' \
    -e 's/[0-9]+ ?ms/<TIME>ms/g' | \
  # Deduplicate to get unique templates
  uniqseq --skip-chars 20 --window-size 1 --quiet > templates.txt
# Count how many templates exist
echo "Discovered $(wc -l < templates.txt) unique log templates"
Frequency Analysis
Rank templates by occurrence:
# Normalize, strip the 20-character timestamp prefix so identical
# templates compare equal, count occurrences, sort by frequency
cat /var/log/app.log | \
  sed -E 's/<normalization-pattern>/<PLACEHOLDER>/g' | \
  cut -c 21- | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -10
Output:
1250 [INFO] User <USER> logged in from <IP>
890 [ERROR] Failed to connect to database server db-prod-<N> on port 5432
567 [INFO] Processing request req-<ID> for user <USER> took <TIME>ms
123 [WARN] Cache miss for key user:<USER>:profile
...
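The same ranking can be sketched in Python with `collections.Counter`. The `normalize` rules, the `template_counts` helper and the sample lines are illustrative assumptions; the 20-character slice matches the timestamp prefix used throughout this page.

```python
import re
from collections import Counter

def normalize(message):
    """Illustrative rules only; extend for your own log format."""
    message = re.sub(r"User [a-z]+", "User <USER>", message)
    message = re.sub(r"from [0-9.]+", "from <IP>", message)
    message = re.sub(r"db-prod-[0-9]+", "db-prod-<N>", message)
    return message

def template_counts(lines, skip_chars=20):
    """Count occurrences of each template, ignoring the timestamp prefix."""
    return Counter(normalize(line[skip_chars:].rstrip("\n")) for line in lines)

sample = [
    "2024-01-15 10:00:00 [INFO] User alice logged in from 192.168.1.100",
    "2024-01-15 10:00:05 [INFO] User bob logged in from 192.168.1.101",
    "2024-01-15 10:00:10 [ERROR] Failed to connect to database server db-prod-01 on port 5432",
]
counts = template_counts(sample)
for template, count in counts.most_common(10):
    print(f"{count:6d} {template}")
```

To run it on a real file, pass an open file object instead of `sample`.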
Integration with Drain3
Drain3 is an ML-based log template extraction tool. Pre-normalizing obvious patterns before feeding lines to it lowers the cardinality it has to handle:
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig
import re

# Step 1: Pre-normalize obvious patterns
def pre_normalize(line):
    line = re.sub(r'\b\d{1,3}(\.\d{1,3}){3}\b', '<IP>', line)
    line = re.sub(r'\b[a-f0-9]{32}\b', '<HASH>', line)
    line = re.sub(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}', '<TIMESTAMP>', line)
    return line

# Step 2: Feed to Drain3 (now processes faster with less cardinality)
config = TemplateMinerConfig()
template_miner = TemplateMiner(config=config)

with open("/var/log/app.log") as f:
    for line in f:
        normalized = pre_normalize(line)
        result = template_miner.add_log_message(normalized)

# Print discovered templates
for template in template_miner.drain.clusters:
    print(f"{template.size:6d} occurrences: {template.get_template()}")
Generate Template Catalog
Build a catalog of all application log templates:
#!/bin/bash
# Create comprehensive template catalog
echo "# Application Log Templates" > catalog.md
echo "" >> catalog.md
echo "Generated: $(date)" >> catalog.md
echo "" >> catalog.md
# Process each log level separately
for level in INFO WARN ERROR CRITICAL; do
  echo "## $level Messages" >> catalog.md
  echo "" >> catalog.md
  grep "\\[$level\\]" /var/log/app.log | \
    sed -E 's/<normalization>/<PLACEHOLDER>/g' | \
    uniqseq --skip-chars 20 --window-size 1 --quiet | \
    sed 's/^/- /' >> catalog.md
  echo "" >> catalog.md
done
Output (catalog.md):
# Application Log Templates
Generated: 2024-01-15 10:00:00
## INFO Messages
- User <USER> logged in from <IP>
- Processing request req-<ID> for user <USER> took <TIME>ms
...
## ERROR Messages
- Failed to connect to database server db-prod-<N> on port 5432
...
Compare Template Changes
Detect new log templates introduced by code changes:
#!/bin/bash
# Compare templates before and after deployment
# Extract baseline templates
cat logs-before-deploy.log | \
sed -E 's/<normalization>/<PLACEHOLDER>/g' | \
uniqseq --skip-chars 20 --window-size 1 \
--library-dir ./baseline-templates --quiet > /dev/null
# Find new templates after deployment
cat logs-after-deploy.log | \
sed -E 's/<normalization>/<PLACEHOLDER>/g' | \
uniqseq --skip-chars 20 --window-size 1 \
--read-sequences ./baseline-templates \
--annotate | \
grep -v "DUPLICATE" | \
grep "^2024" > new-templates.txt
echo "New log templates introduced:"
cat new-templates.txt
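Conceptually, finding new templates is a set difference between the templates seen after and before the deployment. The sketch below shows that idea in plain Python; it does not use uniqseq's library files, and the `normalize` rules, the `templates` helper and the sample lines are assumptions made for illustration.

```python
import re

def normalize(message):
    """Illustrative rule set: collapse usernames and all numbers."""
    message = re.sub(r"User [a-z]+", "User <USER>", message)
    return re.sub(r"[0-9]+", "<NUM>", message)

def templates(lines, skip_chars=20):
    """Return the set of templates, ignoring the timestamp prefix."""
    return {normalize(line[skip_chars:].rstrip("\n")) for line in lines}

before = [
    "2024-01-15 10:00:00 [INFO] User alice logged in",
    "2024-01-15 10:00:05 [INFO] User bob logged in",
]
after = [
    "2024-01-16 09:00:00 [INFO] User carol logged in",
    "2024-01-16 09:00:02 [WARN] Retry 3 of 5 for job 42",
]

# Templates present after the deployment but not before it
new_templates = sorted(templates(after) - templates(before))
print(new_templates)
```

Because the comparison is on templates rather than raw lines, the login by a new user is not reported as new; only the retry message is.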
Anomaly Detection
Use template frequency changes to detect anomalies:
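#!/bin/bash
# Compare template frequencies week-over-week

# Emit "template<TAB>count" lines, sorted by template so join can match on it.
# The 20-character timestamp prefix is cut so identical templates compare equal.
count_templates() {
  sed -E 's/<normalization>/<PLACEHOLDER>/g' "$1" | \
    cut -c 21- | \
    LC_ALL=C sort | uniq -c | \
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print $0 "\t" count }'
}

# Week 1 and week 2 template counts
count_templates week1.log > week1-counts.txt
count_templates week2.log > week2-counts.txt

# Find templates with significant frequency increase
# (join matches on the template text, not on the count)
TAB=$(printf '\t')
LC_ALL=C join -t "$TAB" week1-counts.txt week2-counts.txt | \
  awk -F '\t' '{
    increase = ($3 - $2) / $2 * 100
    if (increase > 50) print increase "% increase:", $1
  }' | \
  sort -rn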
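The week-over-week comparison can also be sketched in Python once per-template counts are available. The `frequency_increases` helper, the 50% threshold default and the counts below are illustrative assumptions, not measured data.

```python
from collections import Counter

def frequency_increases(week1, week2, threshold_pct=50.0):
    """Return (pct_increase, template) pairs above the threshold, largest first.

    Templates absent from week1 are skipped because the percentage increase
    is undefined for them; report those separately as new templates.
    """
    results = []
    for template, new_count in week2.items():
        old_count = week1.get(template, 0)
        if old_count == 0:
            continue
        increase = (new_count - old_count) / old_count * 100
        if increase > threshold_pct:
            results.append((increase, template))
    return sorted(results, reverse=True)

# Made-up counts for illustration
week1 = Counter({
    "User <USER> logged in from <IP>": 100,
    "Cache miss for key user:<USER>:profile": 40,
})
week2 = Counter({
    "User <USER> logged in from <IP>": 110,
    "Cache miss for key user:<USER>:profile": 90,
})
spikes = frequency_increases(week1, week2)
print(spikes)
```

Here only the cache-miss template is reported, since logins grew by 10% while cache misses grew by 125%.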
Advanced Patterns
Multi-Stage Normalization
Different normalization strategies for different log sections:
# Stage 1: Normalize timestamps and IDs
cat app.log | \
  sed -E \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z/<TS>/g' \
    -e 's/[a-f0-9]{8}-[a-f0-9]{4}/<ID>/g' | \
  # Stage 2: Normalize numeric values
  sed -E 's/[0-9]+ (ms|MB|requests)/<NUM> \1/g' | \
  # Stage 3: Deduplicate
  uniqseq --skip-chars 20 --window-size 1 --quiet
Hierarchical Templates
Extract templates at different specificity levels:
# High specificity (preserve more detail)
cat app.log | \
sed -E 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/<IP>/g' | \
uniqseq --skip-chars 20 --window-size 1 --quiet > templates-specific.txt
# Low specificity (more aggressive normalization)
cat app.log | \
sed -E 's/[0-9]+/<NUM>/g; s/[a-z]+/<WORD>/g' | \
uniqseq --skip-chars 20 --window-size 1 --quiet > templates-general.txt
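The trade-off between the two levels is easy to see on a few messages. In this sketch the `specific` and `general` helpers mirror the two sed rule sets above; the sample messages are assumptions chosen for illustration.

```python
import re

def specific(message):
    """High specificity: only IP addresses become placeholders."""
    return re.sub(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", "<IP>", message)

def general(message):
    """Low specificity: every number and lowercase word becomes a placeholder."""
    message = re.sub(r"[0-9]+", "<NUM>", message)
    return re.sub(r"[a-z]+", "<WORD>", message)

messages = [
    "User alice logged in from 192.168.1.100",
    "User bob logged in from 192.168.1.101",
    "User alice logged out from 192.168.1.100",
]
specific_templates = {specific(m) for m in messages}
general_templates = {general(m) for m in messages}
print(len(specific_templates), len(general_templates))
```

The specific rules keep three templates because usernames survive, while the general rules merge everything into one and can no longer tell a login from a logout. Picking a level is a judgment about which distinctions matter for the analysis.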
Template-Based Filtering
Use extracted templates to filter logs:
# Extract error templates
grep ERROR app.log | \
sed -E 's/<normalization>/<PLACEHOLDER>/g' | \
uniqseq --skip-chars 20 --window-size 1 \
--library-dir ./error-templates --quiet > /dev/null
# Filter production logs for known error templates
uniqseq production.log --skip-chars 20 --window-size 1 \
--read-sequences ./error-templates \
--track 'ERROR' | \
grep -v "NEW PATTERN"
Shows only errors matching known templates (filters out new errors).
Generate Regex Patterns
Convert templates to regex for monitoring:
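import re

# Placeholder -> regex fragment
PLACEHOLDERS = {
    "<USER>": r"[a-zA-Z0-9]+",
    "<IP>": r"\d{1,3}(\.\d{1,3}){3}",
    "<ID>": r"[a-z0-9]+",
    "<TIME>": r"\d+",
    "<N>": r"\d+",
}

# Read templates (drop the trailing newline of each line)
with open("templates.txt") as f:
    templates = [line.rstrip("\n") for line in f]

# Convert placeholders to regex
for template in templates:
    # Escape literal text first so "[INFO]" or "." are not read as regex syntax
    # (Python 3.7+ leaves "<" and ">" unescaped, so placeholders survive)
    regex = re.escape(template)
    for placeholder, pattern in PLACEHOLDERS.items():
        regex = regex.replace(placeholder, pattern)
    print(f"Template: {template}")
    print(f"Regex: {regex}")
    print()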
Use these regex patterns in monitoring tools like Grafana, Datadog, or Splunk.
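Before a generated pattern goes into a monitoring tool, it is worth checking it against real lines. The sketch below assumes Python 3.7 or later (where `re.escape` leaves `<` and `>` untouched); `template_to_regex` is a hypothetical helper covering only two placeholders.

```python
import re

def template_to_regex(template):
    """Hypothetical helper: escape literal text, then expand placeholders."""
    pattern = re.escape(template)
    pattern = pattern.replace("<USER>", r"[a-zA-Z0-9]+")
    pattern = pattern.replace("<IP>", r"\d{1,3}(?:\.\d{1,3}){3}")
    return re.compile(pattern)

matcher = template_to_regex("[INFO] User <USER> logged in from <IP>")

login = "2024-01-15 10:00:00 [INFO] User alice logged in from 192.168.1.100"
cache = "2024-01-15 10:00:55 [WARN] Cache miss for key user:alice:profile"

# search() is used so the timestamp prefix does not need to be part of the pattern
print(bool(matcher.search(login)))
print(bool(matcher.search(cache)))
```

Escaping the template first matters: without it, `[INFO]` would be interpreted as a character class rather than literal text.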
Integration Examples
Elasticsearch Mapping
# Extract templates for Elasticsearch field mapping
cat app.log | \
sed -E 's/<normalization>/<PLACEHOLDER>/g' | \
uniqseq --skip-chars 20 --window-size 1 --quiet | \
jq -R '{template: .}' | \
jq -s '{templates: .}'
Prometheus Alert Rules
# Generate alert rules from templates
groups:
  - name: log_patterns
    rules:
      - alert: NewLogTemplate
        expr: |
          log_template_count{template="User <USER> failed login"} > 100
        annotations:
          summary: High occurrence of login failures
LogPAI Integration
from logpai.logparser import Drain

# Pre-normalize logs (pre_normalize is the function defined in the Drain3 example)
with open("app.log") as f, open("normalized.log", "w") as out:
    for line in f:
        # Strip the existing newline so the output is not double-spaced
        normalized = pre_normalize(line.rstrip("\n"))
        out.write(normalized + "\n")

# Run Drain parser on pre-normalized logs (faster convergence)
parser = Drain.LogParser(log_format='<Time> <Level> <Content>')
parser.parse("normalized.log")
Performance Benefits
Reduced Cardinality
# Before normalization
$ cat app.log | wc -l
1,000,000 lines
$ cat app.log | sort | uniq | wc -l
850,000 unique lines (85% cardinality)
# After normalization
$ cat app.log | sed -E 's/<normalization>/<PLACEHOLDER>/g' | \
uniqseq --skip-chars 20 --quiet | wc -l
1,200 unique templates (0.12% cardinality)
Unique lines drop from 850,000 to 1,200, a 99.86% reduction in the input that ML-based template extraction has to process.
Faster Template Mining
# Without preprocessing
$ time drain3-mine app.log
real 15m23s
# With uniqseq preprocessing
$ cat app.log | sed -E 's/<normalization>/<PLACEHOLDER>/g' > normalized.log
$ time drain3-mine normalized.log
real 2m15s # 6.8× faster
Common Normalization Patterns
| Pattern | Regex | Example |
|---|---|---|
| IP Address | \d{1,3}(\.\d{1,3}){3} | 192.168.1.1 → <IP> |
| UUID | [a-f0-9]{8}-[a-f0-9]{4}... | 550e8400-e29b-... → <UUID> |
| Timestamp (ISO) | \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2} | 2024-01-15T10:00:00 → <TS> |
| Numbers | \d+ | 12345 → <NUM> |
| Hex Hash | [a-f0-9]{32,64} | a3b5c7d9... → <HASH> |
| Email | [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+ | user@example.com → <EMAIL> |
| URL | https?://[^\s]+ | https://api.example.com/... → <URL> |
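When several of these patterns are combined, the order of application matters: the generic number rule would otherwise break up IP addresses and timestamps before their own rules run. The sketch below applies the table's patterns from most to least specific. The ordering, the `\b` word boundaries around the hash pattern and the sample message are assumptions added for illustration.

```python
import re

# Most specific patterns first; the generic number rule runs last
PATTERNS = [
    (re.compile(r"https?://[^\s]+"), "<URL>"),
    (re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+"), "<EMAIL>"),
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"), "<TS>"),
    (re.compile(r"\d{1,3}(\.\d{1,3}){3}"), "<IP>"),
    (re.compile(r"\b[a-f0-9]{32,64}\b"), "<HASH>"),
    (re.compile(r"\d+"), "<NUM>"),
]

def normalize(message):
    """Apply every pattern in order and return the resulting template."""
    for pattern, placeholder in PATTERNS:
        message = pattern.sub(placeholder, message)
    return message

result = normalize(
    "GET https://api.example.com/v1 from 192.168.1.1 by user@example.com in 150 ms"
)
print(result)
```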
When to Use This
Good for:
- ✅ High-cardinality log analysis
- ✅ Template discovery and cataloging
- ✅ Pre-processing for ML-based log mining
- ✅ Anomaly detection (new template patterns)
- ✅ Log message standardization
Not ideal for:
- ❌ Logs already using structured logging (JSON)
- ❌ Low-volume logs (<1000 lines/day)
- ❌ Logs with no repeated patterns
- ❌ Real-time streaming (batch processing more effective)
See Also
- Hash Transform - Normalization before deduplication
- Library Mode - Saving template patterns
- Pattern Filtering - Filtering by template
- Log Normalization - Multi-step normalization workflows