Choosing the Right Window Size
Window size is the most important parameter for tuning uniqseq to your data. This guide helps you choose the right value for different scenarios.
Quick Reference
| Scenario | Recommended Window Size | Rationale |
|---|---|---|
| Single-line log entries | 1 | Each line is independent |
| Short error messages (2-3 lines) | 3 | Detects patterns of 3+ lines |
| Stack traces (5-10 lines) | 5 | Typical stack trace length |
| Multi-line JSON/XML | 10 (default) | Structured data blocks |
| Large code blocks | 20-50 | Function-length patterns |
| Unknown data | 1 then increase | Start conservative |
Understanding Window Size
Window size = minimum sequence length to detect
- Window size 3 → detects sequences of 3+ lines
- Window size 10 → detects sequences of 10+ lines
- Sequences shorter than window size are never detected
Key trade-off:
- Smaller = more sensitive, finds shorter patterns
- Larger = less sensitive, only finds longer patterns
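To make the window idea concrete, the sketch below counts how many N-line windows in a file exactly match an earlier window. It is an illustration in plain awk, assuming exact line-for-line matching; it is not uniqseq's actual implementation, and count_repeated_windows is a hypothetical helper name.

```shell
#!/bin/bash
# count_repeated_windows N FILE  (hypothetical helper, not part of uniqseq)
# Prints how many N-line windows in FILE exactly match an earlier window.
count_repeated_windows() {
  awk -v n="$1" '
    {
      buf[NR % n] = $0                      # ring buffer of the last n lines
      if (NR >= n) {
        key = ""
        for (i = NR - n + 1; i <= NR; i++)  # join the window oldest-to-newest
          key = key buf[i % n] "\n"
        if (key in seen) repeats++
        seen[key] = 1
      }
    }
    END { print repeats + 0 }
  ' "$2"
}

sample=$(mktemp)
printf '%s\n' A B C X A B C Y > "$sample"

count_repeated_windows 3 "$sample"   # prints 1: the A,B,C run appears twice
count_repeated_windows 4 "$sample"   # prints 0: no 4-line run repeats
rm -f "$sample"
```

The same 3-line repeat is visible at window size 3 and invisible at window size 4, which is the trade-off described above.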
Decision Process
Step 1: Identify Your Pattern Length
Look at your data and count the lines in a typical repeated pattern:
- Error message with timestamp? → 1 line
- Exception with stack trace? → 5-10 lines
- Build output block? → 10-20 lines
- Multi-line JSON object? → Variable
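One way to ground this estimate is to list which lines recur most often, then look at what surrounds them in the file. The sketch below uses only sort, uniq, awk, and head; top_repeated_lines is a hypothetical helper and the sample data is made up for illustration.

```shell
#!/bin/bash
# top_repeated_lines FILE [COUNT]  (hypothetical helper, standard tools only)
# Lists lines that occur more than once, most frequent first.
top_repeated_lines() {
  sort "$1" | uniq -c | sort -rn | awk '$1 > 1' | head -n "${2:-10}"
}

sample=$(mktemp)
printf '%s\n' \
  'Traceback (most recent call last):' \
  '  db.connect()' \
  'ConnectionError: Connection refused' \
  'request ok' \
  'Traceback (most recent call last):' \
  '  db.connect()' \
  'ConnectionError: Connection refused' > "$sample"

# Shows the three traceback lines with a count of 2 each; "request ok" is omitted.
top_repeated_lines "$sample"
rm -f "$sample"
```

If several lines share the same count and sit next to each other in the file, they probably belong to one multi-line pattern, and the size of that group is a reasonable first estimate of the pattern length.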
Step 2: Start Conservative
Rule of thumb: Start with a window size slightly smaller than your estimated pattern length.
- Estimated 5-line pattern? Try --window-size 3
- Estimated 10-line pattern? Try --window-size 5
- Estimated 20-line pattern? Try --window-size 10
Why smaller? A smaller window catches both your target patterns and any shorter ones that also repeat.
Step 3: Test and Adjust
Run uniqseq with your initial window size and check the results:
# Test with initial window size
uniqseq your-file.log --window-size 5 > output.log
# Compare sizes
wc -l your-file.log output.log
- If too many lines are removed: increase the window size (the current setting is too aggressive)
- If too few lines are removed: decrease the window size (patterns are being missed)
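Raw line counts are easier to judge as a percentage. The helper below is a small sketch for turning the two wc -l numbers into a reduction figure; reduction_pct is a hypothetical name, and the counts in the example are illustrative rather than measured.

```shell
#!/bin/bash
# reduction_pct BEFORE AFTER  (hypothetical helper)
# Prints the percentage of lines removed, given two line counts.
reduction_pct() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before == 0) { print "0.0"; exit }
    printf "%.1f\n", (before - after) * 100 / before
  }'
}

# With real files, the counts would come from wc -l:
#   reduction_pct "$(wc -l < your-file.log)" "$(wc -l < output.log)"
reduction_pct 10000 7200   # prints 28.0
```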
Step 4: Use Statistics
Check the statistics to understand what's happening:
Key metrics:
- redundancy_pct: Percentage of duplicate lines (target: 20-80%)
- unique_sequences_tracked: Number of distinct patterns found
- lines.skipped: Total lines removed
Common Scenarios
Single-Line Deduplication
Use case: Log files where each line is independent
Window size 1:
- Treats each line as its own sequence
- Perfect for flat log files
- Comparable to sort | uniq, but preserves the original line order
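As a mental model for single-line deduplication, the standard awk idiom below keeps each line the first time it appears and drops later copies, without reordering anything. It is shown for comparison only; whether uniqseq behaves identically in every edge case is an assumption, and dedupe_lines is a hypothetical wrapper name.

```shell
#!/bin/bash
# Order-preserving line deduplication with plain awk:
# print a line only the first time it is seen.
dedupe_lines() {
  awk '!seen[$0]++' "$@"
}

printf '%s\n' start retry retry done retry | dedupe_lines
# prints:
#   start
#   retry
#   done
```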
Application Error Logs
Use case: Errors with timestamps + message
2024-01-15 10:30:15 ERROR: Database connection failed
2024-01-15 10:30:16 ERROR: Database connection failed
2024-01-15 10:30:17 ERROR: Database connection failed
Solution: Use --skip-chars with --window-size 1
# Ignore timestamps (first 20 chars), deduplicate messages
uniqseq error.log --skip-chars 20 --window-size 1
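To check that a prefix length is right before relying on it, strip that many characters with cut and count the distinct remainders. This sketch only mimics the effect of ignoring a prefix using standard tools; it does not call uniqseq, and distinct_after_skip is a hypothetical helper.

```shell
#!/bin/bash
# distinct_after_skip N FILE  (hypothetical helper)
# Prints how many distinct lines remain after dropping the first N characters.
distinct_after_skip() {
  cut -c"$(( $1 + 1 ))"- "$2" | sort -u | wc -l | tr -d ' '
}

# "2024-01-15 10:30:15" is 19 characters, plus one space = 20.
sample=$(mktemp)
printf '%s\n' \
  '2024-01-15 10:30:15 ERROR: Database connection failed' \
  '2024-01-15 10:30:16 ERROR: Database connection failed' \
  '2024-01-15 10:30:17 ERROR: Database connection failed' > "$sample"

distinct_after_skip 0 "$sample"    # prints 3: timestamps make every line unique
distinct_after_skip 20 "$sample"   # prints 1: only the message remains
rm -f "$sample"
```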
Stack Traces
Use case: Multi-line exceptions that repeat
Traceback (most recent call last):
File "app.py", line 42, in handler
process_request()
File "app.py", line 87, in process_request
db.connect()
ConnectionError: Connection refused
Typical length: 5-10 lines
Recommended: --window-size 5
Build Output
Use case: Compiler warnings or test failures
warning: unused variable: `result`
--> src/main.rs:42:9
|
42 | let result = calculate();
| ^^^^^^ help: if this is intentional,
| prefix it with an underscore: `_result`
Typical length: 3-5 lines
Recommended: --window-size 3
JSON/Structured Data
Use case: Multi-line JSON objects
{
"timestamp": "2024-01-15T10:30:15Z",
"level": "ERROR",
"message": "Connection failed",
"stack": "..."
}
Variable length: 5-20+ lines
Recommended: Start with --window-size 5, adjust up if needed
Advanced Tuning
Finding the Optimal Window Size
Try multiple window sizes and compare results:
#!/bin/bash
# Test different window sizes
for size in 1 3 5 10 20; do
  lines=$(uniqseq your-file.log --window-size "$size" --quiet | wc -l)
  echo "Window $size: $lines lines remaining"
done
Example output:
Window 1: 8500 lines remaining
Window 3: 7200 lines remaining ← Good balance
Window 5: 6800 lines remaining
Window 10: 9500 lines remaining ← Too large, missing patterns
Window 20: 9900 lines remaining
Choose the "elbow": Where increasing window size starts having less effect.
Window Size Too Large?
Symptoms:
- Very few lines removed
- Statistics show low redundancy (< 10%)
- Known duplicate patterns not detected
Solution: Decrease window size
# Before: Missing 4-line patterns
uniqseq app.log --window-size 10 # Only finds 10+ line patterns
# After: Catching 4-line patterns
uniqseq app.log --window-size 3 # Finds 3+ line patterns
Window Size Too Small?
Symptoms:
- Too many lines removed
- Important variations being deduplicated
- Statistics show very high redundancy (> 90%)
Solution: Increase window size or use pattern filtering
# Option 1: Increase window size
uniqseq app.log --window-size 10
# Option 2: Track only specific patterns
uniqseq app.log --window-size 3 --track "^ERROR"
Mixed Pattern Lengths
Problem: Your data has both short (3-line) and long (20-line) patterns
Solution 1: Choose smaller window (catches both)
Solution 2: Multiple passes
# Pass 1 removes long patterns; pass 2 removes short patterns from what is left
uniqseq app.log --window-size 20 |
  uniqseq --window-size 3 > output.log
Real-World Examples
CI Build Logs
Pattern: Repeated test failures with setup + error + teardown
# Typical test failure is 10-15 lines
# Use window size 5 to catch partial matches too
uniqseq build.log --window-size 5 > clean-build.log
Production Error Monitoring
Pattern: Same error repeating with different timestamps. Ignore the timestamp prefix with --skip-chars, as in the Application Error Logs scenario above.
Memory Dump Analysis
Pattern: Repeated 16-byte or 32-byte blocks
# Hexdump output has 16 bytes per line, so a 64-byte block spans 4 lines
# Use window size 4 to detect repeated 64-byte blocks
uniqseq memory.hex --window-size 4
Troubleshooting
"Nothing is being deduplicated"
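- Check if patterns actually repeat
- Try window size 1 (--window-size 1)
- Check for variable data (timestamps, IDs) that makes otherwise identical lines differ; --skip-chars can ignore a fixed-width prefix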
"Too much is being removed"
- Increase window size
- Use pattern filtering (for example, --track "^ERROR")
- Check with annotations
Best Practices
- Start small, increase if needed: Window size 1-3 is a safe starting point
- Use statistics: Let the data guide your decision
- Test on a sample: Try on 1000 lines before processing gigabytes
- Document your choice: Record why you chose a particular window size
- Re-evaluate periodically: Data patterns may change over time
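For the sampling step, head is usually enough. The sketch below assumes the first 1000 lines are representative of the whole file, which may not hold for logs whose content changes over time; make_sample is a hypothetical helper and the generated file stands in for a real log.

```shell
#!/bin/bash
# make_sample FILE [LINES]  (hypothetical helper)
# Writes the first LINES lines (default 1000) of FILE to stdout.
make_sample() {
  head -n "${2:-1000}" "$1"
}

big=$(mktemp)
seq 1 5000 > "$big"                    # stand-in for a large log file
make_sample "$big" > "${big}.sample"
wc -l < "${big}.sample"                # 1000
rm -f "$big" "${big}.sample"
```

Window sizes can then be tried against the sample file first, and only the chosen value run against the full data.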
See Also
- Window Size Feature - Technical details and examples
- Performance Guide - Optimization tips
- Pattern Filtering - Selective deduplication
- Skip Chars - Ignoring variable prefixes