Duplicate Detection & Deduplication¶
Identify and remove duplicate records based on key columns.
User Story¶
You import customer data from multiple sources into a Google Sheet. Some customers appear multiple times with slight variations (duplicate entries, data entry errors).
You need to:
- Identify duplicate records based on key columns (email, customer ID)
- Flag which records are duplicates and count occurrences
- Remove duplicates while keeping the first occurrence
- Export a clean dataset and a duplicate report
- Validate data types before deduplication
With gaslamp¶
JavaScript
function cookbook_deduplication() {
// Step 1: Define schema for customer data
const schema = {
email: gaslamp.FlameGuards.isString,
name: gaslamp.FlameGuards.isString,
phone: gaslamp.FlameGuards.isString,
};
const sourceSheet = SpreadsheetApp.getActiveSheet();
const workbook = SpreadsheetApp.getActiveSpreadsheet();
// Step 2: Load and validate customer data
const df = gaslamp.BareFrame.fromSheet(sourceSheet);
const { passed, failed } = gaslamp.FlameFrame.from(df, schema);
if (failed.length > 0) {
console.error(failed.length + ' validation error(s) found:');
console.error(failed.display());
return;
}
// Step 3: Normalize email for deduplication (lowercase, trimmed)
const normalized = passed.withColumn('email_normalized', (row) => {
return row.get('email').toLowerCase().trim();
});
// Step 4: Count occurrences of each email
const emailCounts = normalized
.groupBy(['email_normalized'])
.count();
// Build a map of normalized emails to their count
const countMap = {};
emailCounts.toRows().forEach((row) => {
countMap[row.get('email_normalized')] = row.get('count');
});
// Step 5: Mark duplicates with occurrence count
const withDuplicateInfo = normalized
.withColumn('is_duplicate', (row) => {
return countMap[row.get('email_normalized')] > 1;
})
.withColumn('occurrence_count', (row) => {
return countMap[row.get('email_normalized')];
});
// Step 6: Separate duplicates and unique records
const duplicateRows = withDuplicateInfo.filter((row) => row.get('is_duplicate'));
const uniqueRows = withDuplicateInfo.filter((row) => !row.get('is_duplicate'));
// Step 7: Keep only first occurrence of each duplicated email
const seenEmails = {};
const deduplicatedDuplicates = duplicateRows.filter((row) => {
const normalizedEmail = row.get('email_normalized');
if (!seenEmails[normalizedEmail]) {
seenEmails[normalizedEmail] = true;
return true;
}
return false;
});
// Step 8: Combine unique records with first occurrence of duplicates
const cleanData = uniqueRows.concat(deduplicatedDuplicates);
console.log(`Cleaned ${passed.length} records → ${cleanData.length} unique records`);
console.log(`Found ${duplicateRows.length} duplicate records (${duplicateRows.length - deduplicatedDuplicates.length} removed)`);
// Step 9: Export clean data (remove helper columns)
const cleanDataExport = cleanData
.drop(['email_normalized', 'is_duplicate', 'occurrence_count']);
const cleanSheet = workbook.getSheetByName('Clean Data')
?? workbook.insertSheet('Clean Data');
cleanDataExport.toSheet(cleanSheet);
// Step 10: Export duplicate report for review
const duplicateReport = duplicateRows
.select(['email', 'name', 'phone', 'is_duplicate', 'occurrence_count']);
const reportSheet = workbook.getSheetByName('Duplicate Report')
?? workbook.insertSheet('Duplicate Report');
duplicateReport.toSheet(reportSheet);
console.log('Clean data and duplicate report exported');
}
Key gaslamp features used:
- Schema validation with
FlameFrame.from()— ensure data types before deduplication .withColumn()for normalization — lowercase and trim for consistent comparison.groupBy()and.count()— count occurrences per group- Building lookup maps from grouped data — O(1) deduplication check
.filter()with boolean conditions — separate unique and duplicate rows- Chainable deduplication — filter duplicates, keep first occurrence
.concat()to combine results — merge unique and deduplicated records.drop()to remove helper columns — clean output before export.select()to choose columns — create focused duplicate report
Without gaslamp¶
JavaScript
function cookbook_deduplication_vanilla() {
const sourceSheet = SpreadsheetApp.getActiveSheet();
const workbook = SpreadsheetApp.getActiveSpreadsheet();
// Step 1: Load and validate customer data
const range = sourceSheet.getDataRange();
const values = range.getValues();
const headers = values[0];
const rows = values.slice(1);
const emailIdx = headers.indexOf('email');
const nameIdx = headers.indexOf('name');
const phoneIdx = headers.indexOf('phone');
const validated = [];
for (const row of rows) {
if (typeof row[emailIdx] !== 'string' || typeof row[nameIdx] !== 'string' || typeof row[phoneIdx] !== 'string') {
console.warn('Type error in row:', row);
continue;
}
validated.push(row);
}
if (validated.length === 0) {
console.warn('Cannot deduplicate: no valid rows');
return;
}
// Step 2: Manually normalize emails and build count map
const countMap = {};
for (const row of validated) {
const email = row[emailIdx].toLowerCase().trim();
countMap[email] = (countMap[email] || 0) + 1;
}
// Step 3: Mark duplicates and add occurrence count (manually)
const withDuplicateInfo = [];
for (const row of validated) {
const email = row[emailIdx].toLowerCase().trim();
const count = countMap[email];
const isDuplicate = count > 1;
withDuplicateInfo.push([...row, email, isDuplicate, count]);
}
// Step 4: Separate duplicates and unique records
const uniqueRows = [];
const duplicateRows = [];
for (const row of withDuplicateInfo) {
const isDuplicate = row[row.length - 2]; // is_duplicate column
if (isDuplicate) {
duplicateRows.push(row);
} else {
uniqueRows.push(row.slice(0, 3)); // Keep only original columns
}
}
// Step 5: Keep only first occurrence of each duplicated email
const seenEmails = {};
const deduplicatedDuplicates = [];
for (const row of duplicateRows) {
const normalizedEmail = row[row.length - 3]; // email_normalized column
if (!seenEmails[normalizedEmail]) {
seenEmails[normalizedEmail] = true;
deduplicatedDuplicates.push(row.slice(0, 3)); // Keep only original columns
}
}
// Step 6: Combine unique records with first occurrence of duplicates
const cleanData = [...uniqueRows, ...deduplicatedDuplicates];
console.log(`Cleaned ${validated.length} records → ${cleanData.length} unique records`);
console.log(`Found ${duplicateRows.length} duplicate records (${duplicateRows.length - deduplicatedDuplicates.length} removed)`);
// Step 7: Write clean data to sheet
const cleanSheet = workbook.getSheetByName('Clean Data')
?? workbook.insertSheet('Clean Data');
if (cleanData.length > 0) {
cleanSheet.getRange(1, 1, 1, headers.length).setValues([headers]);
cleanSheet.getRange(2, 1, cleanData.length, headers.length).setValues(cleanData);
}
// Step 8: Write duplicate report for review
const reportSheet = workbook.getSheetByName('Duplicate Report')
?? workbook.insertSheet('Duplicate Report');
const reportHeaders = ['Email', 'Name', 'Phone', 'Email Normalized', 'Is Duplicate', 'Occurrence Count'];
if (duplicateRows.length > 0) {
reportSheet.getRange(1, 1, 1, reportHeaders.length).setValues([reportHeaders]);
reportSheet.getRange(2, 1, duplicateRows.length, reportHeaders.length).setValues(duplicateRows);
}
console.log('Clean data and duplicate report exported');
}
Without gaslamp, you need to:
- Manually validate types for each row
- Manually normalize emails (lowercase, trim) in multiple places
- Manually count occurrences by iterating through data and updating map
- Manually separate rows into unique and duplicate arrays
- Manually track seen emails to avoid duplicate duplicates
- Manually combine unique and deduplicated results with array spread
- Manually slice arrays to remove helper columns before export
- Manually select and reorder columns for reports
- Manually write headers and data separately for each output sheet
- Duplicate normalization logic (once for counting, again for deduplication)
Key Techniques¶
- Normalize before comparison: Use
.withColumn()to standardize comparison keys (lowercase, trim) - Count occurrences: Use
.groupBy()and.count()to identify duplicates - Build lookup maps: Convert grouped counts to maps for O(1) duplicate detection
- Flag with markers: Add boolean
is_duplicatecolumn to mark duplicates - Keep first occurrence: Use a tracking object to skip subsequent duplicates
- Separate reporting: Export clean data and duplicate details to different sheets
- Clean before export: Use
.drop()to remove helper columns from final output
See Select & Drop to customize which columns appear in reports, and Filter Guide for more filtering patterns.