Basic Concepts
Understand the core concepts behind uniqseq.
What is a Sequence?
A sequence is one or more consecutive lines that repeat in your input.
Single-Line Sequences (Window Size 1)
The default behavior - find duplicate lines:
Multi-Line Sequences (Window Size > 1)
With --window-size 3, uniqseq looks for repeating 3-line patterns:
Input: Output:
Line 1 Line 1
Line 2 Line 2
Line 3 Line 3
Line 4 Line 4
Line 1 ┐ Line 5
Line 2 │ duplicate
Line 3 ┘
Line 4
Line 5
The sequence "Line 1, Line 2, Line 3" repeats, so the second occurrence is removed.
How Deduplication Works
1. Sliding Window
uniqseq uses a sliding window to scan through your input:
Window Size 3:
[Line 1] ← Check if seen before
[Line 2]
[Line 3]
[Line 2] ← Move window down
[Line 3]
[Line 4]
[Line 3] ← Move window down
[Line 4]
[Line 5]
2. Hash-Based Detection
For each window position, uniqseq:
- Computes a hash of the lines in the window
- Checks if this hash was seen before
- Keeps or removes lines based on whether it's a duplicate
This makes deduplication very fast, even for large files.
3. First-Occurrence Preservation
When uniqseq finds a duplicate, it always keeps the first occurrence and removes later ones:
Key Parameters
Window Size
--window-size N - How many consecutive lines to consider as a unit
--window-size 1(default): Deduplicate individual lines--window-size 3: Deduplicate 3-line sequences--window-size 5: Deduplicate 5-line sequences
Rule of thumb: Set window size to match the number of lines in the pattern you want to remove.
Skip Characters
--skip-chars N - Ignore first N characters of each line when comparing
Useful for lines that differ only in prefixes like timestamps:
Before skip-chars:
[2024-01-15 10:30:01] Error: Connection failed ← Different
[2024-01-15 10:30:05] Error: Connection failed ← Different
After --skip-chars 21:
Error: Connection failed ← Same!
Error: Connection failed ← Same! (duplicate)
The first 21 characters ([2024-01-15 10:30:01]) are ignored during comparison.
Pattern Filtering
Track Patterns (--track)
Only deduplicate lines matching a regex pattern:
- ERROR lines: Deduplicated
- INFO lines: Passed through unchanged
Bypass Patterns (--bypass)
Never deduplicate lines matching a regex pattern:
- WARN lines: Always included, never deduplicated
- Other lines: Deduplicated normally
Memory Efficiency
uniqseq is designed for streaming - it processes input line-by-line without loading the entire file into memory.
Memory usage depends only on:
- Window size (larger windows need more memory per entry)
- Number of unique sequences (not total lines)
This means you can process GB-sized files with minimal memory.
Next Steps
- Quick Start - Try uniqseq with simple examples
- Choosing Window Size - Guide for selecting the right window size
- Common Patterns - Copy-paste ready examples
- Use Cases - See real-world applications
- Algorithm Details - Deep dive into how it works