Duplicate Detection & Deduplication¶

Identify and remove duplicate records based on key columns.

User Story¶

You import customer data from multiple sources into a Google Sheet. Some customers appear multiple times with slight variations (duplicate entries, data entry errors).

You need to:

Identify duplicate records based on key columns (email, customer ID)
Flag which records are duplicates and count occurrences
Remove duplicates while keeping the first occurrence
Export a clean dataset and a duplicate report
Validate data types before deduplication

With gaslamp¶

JavaScript

function cookbook_deduplication() {
  // Step 1: Define schema for customer data
  const schema = {
    email:  gaslamp.FlameGuards.isString,
    name:   gaslamp.FlameGuards.isString,
    phone:  gaslamp.FlameGuards.isString,
  };

  const sourceSheet = SpreadsheetApp.getActiveSheet();
  const workbook    = SpreadsheetApp.getActiveSpreadsheet();

  // Step 2: Load and validate customer data
  const df = gaslamp.BareFrame.fromSheet(sourceSheet);
  const { passed, failed } = gaslamp.FlameFrame.from(df, schema);

  if (failed.length > 0) {
    console.error(failed.length + ' validation error(s) found:');
    console.error(failed.display());
    return;
  }

  // Step 3: Normalize email for deduplication (lowercase, trimmed)
  const normalized = passed.withColumn('email_normalized', (row) => {
    return row.get('email').toLowerCase().trim();
  });

  // Step 4: Count occurrences of each email
  const emailCounts = normalized
    .groupBy(['email_normalized'])
    .count();

  // Build a map of normalized emails to their count
  const countMap = {};
  emailCounts.toRows().forEach((row) => {
    countMap[row.get('email_normalized')] = row.get('count');
  });

  // Step 5: Mark duplicates with occurrence count
  const withDuplicateInfo = normalized
    .withColumn('is_duplicate', (row) => {
      return countMap[row.get('email_normalized')] > 1;
    })
    .withColumn('occurrence_count', (row) => {
      return countMap[row.get('email_normalized')];
    });

  // Step 6: Separate duplicates and unique records
  const duplicateRows = withDuplicateInfo.filter((row) => row.get('is_duplicate'));
  const uniqueRows = withDuplicateInfo.filter((row) => !row.get('is_duplicate'));

  // Step 7: Keep only first occurrence of each duplicated email
  const seenEmails = {};
  const deduplicatedDuplicates = duplicateRows.filter((row) => {
    const normalizedEmail = row.get('email_normalized');
    if (!seenEmails[normalizedEmail]) {
      seenEmails[normalizedEmail] = true;
      return true;
    }
    return false;
  });

  // Step 8: Combine unique records with first occurrence of duplicates
  const cleanData = uniqueRows.concat(deduplicatedDuplicates);

  console.log(`Cleaned ${passed.length} records → ${cleanData.length} unique records`);
  console.log(`Found ${duplicateRows.length} duplicate records (${duplicateRows.length - deduplicatedDuplicates.length} removed)`);

  // Step 9: Export clean data (remove helper columns)
  const cleanDataExport = cleanData
    .drop(['email_normalized', 'is_duplicate', 'occurrence_count']);

  const cleanSheet = workbook.getSheetByName('Clean Data')
    ?? workbook.insertSheet('Clean Data');
  cleanDataExport.toSheet(cleanSheet);

  // Step 10: Export duplicate report for review
  const duplicateReport = duplicateRows
    .select(['email', 'name', 'phone', 'is_duplicate', 'occurrence_count']);

  const reportSheet = workbook.getSheetByName('Duplicate Report')
    ?? workbook.insertSheet('Duplicate Report');
  duplicateReport.toSheet(reportSheet);

  console.log('Clean data and duplicate report exported');
}

Key gaslamp features used:

Schema validation with FlameFrame.from() — ensure data types before deduplication
.withColumn() for normalization — lowercase and trim for consistent comparison
.groupBy() and .count() — count occurrences per group
Building lookup maps from grouped data — O(1) deduplication check
.filter() with boolean conditions — separate unique and duplicate rows
Chainable deduplication — filter duplicates, keep first occurrence
.concat() to combine results — merge unique and deduplicated records
.drop() to remove helper columns — clean output before export
.select() to choose columns — create focused duplicate report

Without gaslamp¶

JavaScript

function cookbook_deduplication_vanilla() {
  const sourceSheet = SpreadsheetApp.getActiveSheet();
  const workbook    = SpreadsheetApp.getActiveSpreadsheet();

  // Step 1: Load and validate customer data
  const range = sourceSheet.getDataRange();
  const values = range.getValues();
  const headers = values[0];
  const rows = values.slice(1);

  const emailIdx = headers.indexOf('email');
  const nameIdx = headers.indexOf('name');
  const phoneIdx = headers.indexOf('phone');

  const validated = [];
  for (const row of rows) {
    if (typeof row[emailIdx] !== 'string' || typeof row[nameIdx] !== 'string' || typeof row[phoneIdx] !== 'string') {
      console.warn('Type error in row:', row);
      continue;
    }
    validated.push(row);
  }

  if (validated.length === 0) {
    console.warn('Cannot deduplicate: no valid rows');
    return;
  }

  // Step 2: Manually normalize emails and build count map
  const countMap = {};
  for (const row of validated) {
    const email = row[emailIdx].toLowerCase().trim();
    countMap[email] = (countMap[email] || 0) + 1;
  }

  // Step 3: Mark duplicates and add occurrence count (manually)
  const withDuplicateInfo = [];
  for (const row of validated) {
    const email = row[emailIdx].toLowerCase().trim();
    const count = countMap[email];
    const isDuplicate = count > 1;
    withDuplicateInfo.push([...row, email, isDuplicate, count]);
  }

  // Step 4: Separate duplicates and unique records
  const uniqueRows = [];
  const duplicateRows = [];
  for (const row of withDuplicateInfo) {
    const isDuplicate = row[row.length - 2]; // is_duplicate column
    if (isDuplicate) {
      duplicateRows.push(row);
    } else {
      uniqueRows.push(row.slice(0, 3)); // Keep only original columns
    }
  }

  // Step 5: Keep only first occurrence of each duplicated email
  const seenEmails = {};
  const deduplicatedDuplicates = [];
  for (const row of duplicateRows) {
    const normalizedEmail = row[row.length - 3]; // email_normalized column
    if (!seenEmails[normalizedEmail]) {
      seenEmails[normalizedEmail] = true;
      deduplicatedDuplicates.push(row.slice(0, 3)); // Keep only original columns
    }
  }

  // Step 6: Combine unique records with first occurrence of duplicates
  const cleanData = [...uniqueRows, ...deduplicatedDuplicates];

  console.log(`Cleaned ${validated.length} records → ${cleanData.length} unique records`);
  console.log(`Found ${duplicateRows.length} duplicate records (${duplicateRows.length - deduplicatedDuplicates.length} removed)`);

  // Step 7: Write clean data to sheet
  const cleanSheet = workbook.getSheetByName('Clean Data')
    ?? workbook.insertSheet('Clean Data');

  if (cleanData.length > 0) {
    cleanSheet.getRange(1, 1, 1, headers.length).setValues([headers]);
    cleanSheet.getRange(2, 1, cleanData.length, headers.length).setValues(cleanData);
  }

  // Step 8: Write duplicate report for review
  const reportSheet = workbook.getSheetByName('Duplicate Report')
    ?? workbook.insertSheet('Duplicate Report');

  const reportHeaders = ['Email', 'Name', 'Phone', 'Email Normalized', 'Is Duplicate', 'Occurrence Count'];
  if (duplicateRows.length > 0) {
    reportSheet.getRange(1, 1, 1, reportHeaders.length).setValues([reportHeaders]);
    reportSheet.getRange(2, 1, duplicateRows.length, reportHeaders.length).setValues(duplicateRows);
  }

  console.log('Clean data and duplicate report exported');
}

Without gaslamp, you need to:

Manually validate types for each row
Manually normalize emails (lowercase, trim) in multiple places
Manually count occurrences by iterating through data and updating map
Manually separate rows into unique and duplicate arrays
Manually track seen emails to avoid duplicate duplicates
Manually combine unique and deduplicated results with array spread
Manually slice arrays to remove helper columns before export
Manually select and reorder columns for reports
Manually write headers and data separately for each output sheet
Duplicate normalization logic (once for counting, again for deduplication)

Key Techniques¶

Normalize before comparison: Use .withColumn() to standardize comparison keys (lowercase, trim)
Count occurrences: Use .groupBy() and .count() to identify duplicates
Build lookup maps: Convert grouped counts to maps for O(1) duplicate detection
Flag with markers: Add boolean is_duplicate column to mark duplicates
Keep first occurrence: Use a tracking object to skip subsequent duplicates
Separate reporting: Export clean data and duplicate details to different sheets
Clean before export: Use .drop() to remove helper columns from final output

See Select & Drop to customize which columns appear in reports, and Filter Guide for more filtering patterns.