Data Processing: CSV Deduplication
Remove duplicate records from CSV files while preserving the first occurrence of each unique record.
The Problem
CSV exports from databases or APIs often contain duplicate records due to:
- Data sync issues - Same records exported multiple times
- Join operations - SQL joins creating duplicate rows
- Merge operations - Combining files with overlapping data
Traditional tools like sort | uniq only work on complete lines, not custom delimiters.
Input Data
records.csv
user_001,Alice,alice@example.com,2024-01-15
user_002,Bob,bob@example.com,2024-01-16
user_003,Charlie,charlie@example.com,2024-01-17
user_004,Diana,diana@example.com,2024-01-18
user_005,Eve,eve@example.com,2024-01-19
user_001,Alice,alice@example.com,2024-01-15
user_002,Bob,bob@example.com,2024-01-16
user_003,Charlie,charlie@example.com,2024-01-17
user_006,Frank,frank@example.com,2024-01-20
user_007,Grace,grace@example.com,2024-01-21
The file contains 10 records, but 3 are duplicates:
user_001(lines 1 and 6)user_002(lines 2 and 7)user_003(lines 3 and 8)
Output Data
expected-output.csv
user_001
Alice
alice@example.com
2024-01-15
user_002
Bob
bob@example.com
2024-01-16
user_003
Charlie
charlie@example.com
2024-01-17
user_004
Diana
diana@example.com
2024-01-18
user_005
Eve
eve@example.com
2024-01-19
user_006
Frank
frank@example.com
2024-01-20
user_007
Grace
grace@example.com
2024-01-21
Result: 3 duplicate records removed → 7 unique records
Solution
How it works:
tr '\n' ','- Convert newlines to commas (make one line)uniqseq --delimiter ','- Deduplicate comma-separated recordstr ',' '\n'- Convert commas back to newlines
from uniqseq import UniqSeq
uniqseq = UniqSeq(
delimiter=",", # (1)!
)
from io import StringIO
with open("records.csv") as f:
# Convert file to single line with commas
content = f.read().replace("\n", ",")
# Process as comma-delimited stream
output = StringIO()
for record in content.split(","):
if record: # Skip empty
uniqseq.process_line(record, output)
uniqseq.flush_to_stream(output)
# Convert back to CSV format
result = output.getvalue().replace(",", "\n")
with open("output.csv", "w") as out:
out.write(result)
- Use comma as the record delimiter instead of newline
How It Works
By default, uniqseq treats newlines as record delimiters. For CSV:
- Convert format: Transform CSV so each record becomes a "line" with
,delimiter - Deduplicate: Use
--delimiter ','to treat commas as record boundaries - Restore format: Convert back to standard CSV with newlines
Alternative: Single-Line CSV
If your CSV is already single-line (one record per comma):
Real-World Workflows
TSV Files
For tab-separated values:
Multi-Column Deduplication
Deduplicate based on specific columns using --hash-transform:
# Deduplicate based on column 1 only (user ID)
cat records.csv | \
uniqseq --hash-transform "cut -d',' -f1" \
--quiet > output.csv
This keeps first occurrence of each unique user ID, even if other columns differ.
Large Files
For files too large for memory, process in chunks:
split -l 10000 huge.csv chunk_
for file in chunk_*; do
cat $file | tr '\n' ',' | \
uniqseq --delimiter ',' --library-dir dedup-lib/ --quiet | \
tr ',' '\n' >> output.csv
done
The library tracks seen records across chunks.
See Also
- Custom Delimiters - Using different record separators
- Hash Transform - Column-specific deduplication
- Pattern Libraries - Processing large files in chunks