Data Processing: Field-Based Deduplication
Deduplicate log lines based on specific fields while preserving the complete original lines in output. Useful when only certain fields matter for uniqueness.
The Problem
Server logs often have unique timestamps, server IDs, or request IDs, but the underlying messages repeat:
- Same error from different servers: different server names, same error message
- Same message at different times: different timestamps, same log content
- Unique IDs obscure patterns: request IDs make every line look unique
Traditional line-based deduplication can't ignore these varying fields.
Input Data
server.log
2024-01-15 10:30:01 | INFO | server-01 | Request processed successfully
2024-01-15 10:30:02 | ERROR | server-02 | Connection timeout
2024-01-15 10:30:03 | WARN | server-01 | High memory usage detected
2024-01-15 10:30:04 | INFO | server-03 | Request processed successfully
2024-01-15 10:30:05 | ERROR | server-01 | Connection timeout
2024-01-15 10:30:06 | INFO | server-02 | Request processed successfully
2024-01-15 10:30:07 | WARN | server-03 | High memory usage detected
2024-01-15 10:30:08 | ERROR | server-03 | Connection timeout
The log contains 8 entries, but only 3 unique messages:
- "Request processed successfully" (lines 1, 4, 6) - appears 3×
- "Connection timeout" (lines 2, 5, 8) - appears 3×
- "High memory usage detected" (lines 3, 7) - appears 2×
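The counts above can be checked with a short script. This is only an illustrative sketch in plain Python and does not use uniqseq; the helper name `message_field` is hypothetical, and it assumes the message is always the fourth pipe-delimited field.

```python
from collections import Counter

# The eight sample lines from server.log
LOG_LINES = [
    "2024-01-15 10:30:01 | INFO | server-01 | Request processed successfully",
    "2024-01-15 10:30:02 | ERROR | server-02 | Connection timeout",
    "2024-01-15 10:30:03 | WARN | server-01 | High memory usage detected",
    "2024-01-15 10:30:04 | INFO | server-03 | Request processed successfully",
    "2024-01-15 10:30:05 | ERROR | server-01 | Connection timeout",
    "2024-01-15 10:30:06 | INFO | server-02 | Request processed successfully",
    "2024-01-15 10:30:07 | WARN | server-03 | High memory usage detected",
    "2024-01-15 10:30:08 | ERROR | server-03 | Connection timeout",
]


def message_field(line: str) -> str:
    """Return the fourth pipe-delimited field (the message), trimmed."""
    return line.split("|")[3].strip()


# Count how often each message occurs, ignoring timestamp, level, and server
counts = Counter(message_field(line) for line in LOG_LINES)

for message, count in counts.most_common():
    print(f"{count}x {message}")
```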
Output Data
expected-field-output.log
2024-01-15 10:30:01 | INFO | server-01 | Request processed successfully
2024-01-15 10:30:02 | ERROR | server-02 | Connection timeout
2024-01-15 10:30:03 | WARN | server-01 | High memory usage detected
Result: 5 duplicate lines removed → only unique messages remain
Solution
Options:
- --hash-transform 'awk...': Extract field 4 (message) for comparison
- --window-size 1: Deduplicate individual lines (not sequences)
- --quiet: Suppress statistics
from uniqseq import UniqSeq

uniqseq = UniqSeq(
    hash_transform=lambda line: line.split("|")[3].strip(),  # (1)
    window_size=1,  # (2)
)

with open("server.log") as f:
    with open("output.log", "w") as out:
        for line in f:
            uniqseq.process_line(line.rstrip("\n"), out)
        # Flush any buffered lines while the output file is still open
        uniqseq.flush_to_stream(out)
- Extract field 4 (message) using Python lambda
- Deduplicate individual lines (window_size=1)
How It Works
The --hash-transform flag transforms each line for hashing purposes while keeping the original line in output:
Original line:
2024-01-15 10:30:01 | INFO | server-01 | Request processed successfully
↓
Hash only this part
(field 4: message)
↓
Output (original line preserved):
2024-01-15 10:30:01 | INFO | server-01 | Request processed successfully
Lines with the same field 4 value are considered duplicates, but the complete original line is written to output.
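The idea of hashing one part of a line while emitting the whole line can be sketched in a few lines of plain Python. This is not uniqseq's implementation, just a minimal model of the behavior described above for the single-line case; `dedupe_by_key` is a hypothetical name.

```python
from typing import Callable, Iterable, Iterator


def dedupe_by_key(lines: Iterable[str], key: Callable[[str], str]) -> Iterator[str]:
    """Yield each original line whose key has not been seen before."""
    seen = set()
    for line in lines:
        k = key(line)  # only this value is compared
        if k not in seen:
            seen.add(k)
            yield line  # the complete original line is emitted


sample = [
    "2024-01-15 10:30:01 | INFO | server-01 | Request processed successfully",
    "2024-01-15 10:30:02 | ERROR | server-02 | Connection timeout",
    "2024-01-15 10:30:04 | INFO | server-03 | Request processed successfully",
    "2024-01-15 10:30:05 | ERROR | server-01 | Connection timeout",
]

# Compare on field 4 (the message) only
unique = list(dedupe_by_key(sample, lambda line: line.split("|")[3].strip()))

for line in unique:
    print(line)
```

The first occurrence of each message is kept, which matches the expected output shown earlier.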
Why Window Size 1?
By default, uniqseq looks for repeated sequences of 10 lines. For field-based deduplication of individual log entries, use --window-size 1 to treat each line independently.
Real-World Workflows
Deduplicate by Error Code
Extract only the error code for comparison:
# Log format: "timestamp | level | ERROR_CODE_123 | message"
uniqseq app.log \
--hash-transform 'awk -F"|" "{print \$3}"' \
--window-size 1 > unique-errors.log
Multi-Field Deduplication
Combine multiple fields for uniqueness:
# Deduplicate by level + message (ignore timestamp and server)
uniqseq server.log \
--hash-transform 'awk -F"|" "{print \$2, \$4}"' \
--window-size 1 > output.log
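For the Python form, a multi-field key might look like the sketch below. The function name `level_and_message` is hypothetical, and passing it as `hash_transform` is an assumption based on the Python example earlier on this page. The fields are joined with an explicit separator so that two different field pairs cannot collapse into the same key.

```python
def level_and_message(line: str) -> str:
    """Build a comparison key from field 2 (level) and field 4 (message)."""
    fields = [field.strip() for field in line.split("|")]
    # An explicit separator keeps adjacent fields from running together
    return f"{fields[1]}|{fields[3]}"


a = level_and_message("2024-01-15 10:30:02 | ERROR | server-02 | Connection timeout")
b = level_and_message("2024-01-15 10:30:05 | ERROR | server-01 | Connection timeout")
c = level_and_message("2024-01-15 10:30:09 | WARN | server-01 | Connection timeout")

print(a == b)  # same level and message, different timestamp and server
print(a == c)  # same message, different level
```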
Case-Insensitive Field Matching
Combine with case normalization:
uniqseq server.log \
--hash-transform 'awk -F"|" "{print \$4}" | tr "[:upper:]" "[:lower:]"' \
--window-size 1 > output.log
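A rough Python counterpart of the case-insensitive transform is sketched below. The name `message_key_ci` is hypothetical. Note that `str.casefold()` is Unicode-aware and somewhat more aggressive than the `tr` pipeline above, so the two are not guaranteed to agree on non-ASCII text.

```python
def message_key_ci(line: str) -> str:
    """Return field 4 (the message), trimmed and case-folded."""
    return line.split("|")[3].strip().casefold()


upper = message_key_ci("2024-01-15 10:30:02 | ERROR | server-02 | Connection Timeout")
lower = message_key_ci("2024-01-15 10:30:05 | ERROR | server-01 | connection timeout")

print(upper == lower)
```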
Track Unique Messages Across Servers
Use a library to accumulate unique messages:
# Day 1: server-01 logs
uniqseq server-01.log \
--hash-transform 'awk -F"|" "{print \$4}"' \
--window-size 1 \
--library-dir messages-lib/ > clean-01.log
# Day 2: server-02 logs (reuses library)
uniqseq server-02.log \
--hash-transform 'awk -F"|" "{print \$4}"' \
--window-size 1 \
--library-dir messages-lib/ > clean-02.log
The library tracks messages across all servers.
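To make the cross-run behavior concrete, here is a conceptual sketch that persists seen keys in a JSON file between runs. It does not reflect uniqseq's actual library format or internals; `dedupe_with_library` and the `messages.json` file name are hypothetical, and the shortened timestamps are placeholders.

```python
import json
import tempfile
from pathlib import Path


def dedupe_with_library(lines, key, library_file: Path):
    """Keep lines whose key is new, remembering keys across calls via a file."""
    # Load keys recorded by earlier runs, if any
    seen = set(json.loads(library_file.read_text())) if library_file.exists() else set()
    kept = []
    for line in lines:
        k = key(line)
        if k not in seen:
            seen.add(k)
            kept.append(line)
    # Persist the updated key set for the next run
    library_file.write_text(json.dumps(sorted(seen)))
    return kept


def message(line: str) -> str:
    return line.split("|")[3].strip()


with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp) / "messages.json"
    # Day 1: server-01 logs
    day1 = dedupe_with_library(
        ["t1 | ERROR | server-01 | Connection timeout"], message, library
    )
    # Day 2: server-02 logs reuse the same library file
    day2 = dedupe_with_library(
        [
            "t2 | ERROR | server-02 | Connection timeout",
            "t3 | WARN | server-02 | High memory usage detected",
        ],
        message,
        library,
    )

print(day1)
print(day2)
```

On the second run, the "Connection timeout" line is dropped because its key was recorded on the first run.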
Performance Considerations
Hash transforms run an external command for each line. For large files:
- Use efficient commands: cut is faster than awk, awk is faster than sed
- Avoid complex regex: Simple field extraction is fastest
- Consider preprocessing: If possible, preprocess outside uniqseq
# Slower (subprocess per line)
uniqseq large.log --hash-transform 'awk...' --window-size 1
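# Faster (preprocess once)
# Note: the output contains the rewritten lines (timestamp and server blanked,
# "|" separators replaced by spaces), not the original lines
awk -F"|" '{$1=""; $3=""; print}' large.log | uniqseq --window-size 1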
See Also
- Hash Transform - Detailed hash transform documentation
- Window Size - Understanding window sizes
- Pattern Libraries - Cross-file deduplication