Skip to content

Duplicate Detection & Deduplication

Identify and remove duplicate records based on key columns.

User Story

You import customer data from multiple sources into a Google Sheet. Some customers appear multiple times with slight variations (duplicate entries, data entry errors).

You need to:

  • Identify duplicate records based on key columns (email, customer ID)
  • Flag which records are duplicates and count occurrences
  • Remove duplicates while keeping the first occurrence
  • Export a clean dataset and a duplicate report
  • Validate data types before deduplication

With gaslamp

JavaScript
function cookbook_deduplication() {
  // Step 1: Define schema for customer data
  const schema = {
    email:  gaslamp.FlameGuards.isString,
    name:   gaslamp.FlameGuards.isString,
    phone:  gaslamp.FlameGuards.isString,
  };

  const sourceSheet = SpreadsheetApp.getActiveSheet();
  const workbook    = SpreadsheetApp.getActiveSpreadsheet();

  // Step 2: Load and validate customer data
  const df = gaslamp.BareFrame.fromSheet(sourceSheet);
  const { passed, failed } = gaslamp.FlameFrame.from(df, schema);

  if (failed.length > 0) {
    console.error(failed.length + ' validation error(s) found:');
    console.error(failed.display());
    return;
  }

  // Step 3: Normalize email for deduplication (lowercase, trimmed)
  const normalized = passed.withColumn('email_normalized', (row) => {
    return row.get('email').toLowerCase().trim();
  });

  // Step 4: Count occurrences of each email
  const emailCounts = normalized
    .groupBy(['email_normalized'])
    .count();

  // Build a map of normalized emails to their count
  const countMap = {};
  emailCounts.toRows().forEach((row) => {
    countMap[row.get('email_normalized')] = row.get('count');
  });

  // Step 5: Mark duplicates with occurrence count
  const withDuplicateInfo = normalized
    .withColumn('is_duplicate', (row) => {
      return countMap[row.get('email_normalized')] > 1;
    })
    .withColumn('occurrence_count', (row) => {
      return countMap[row.get('email_normalized')];
    });

  // Step 6: Separate duplicates and unique records
  const duplicateRows = withDuplicateInfo.filter((row) => row.get('is_duplicate'));
  const uniqueRows = withDuplicateInfo.filter((row) => !row.get('is_duplicate'));

  // Step 7: Keep only first occurrence of each duplicated email
  const seenEmails = {};
  const deduplicatedDuplicates = duplicateRows.filter((row) => {
    const normalizedEmail = row.get('email_normalized');
    if (!seenEmails[normalizedEmail]) {
      seenEmails[normalizedEmail] = true;
      return true;
    }
    return false;
  });

  // Step 8: Combine unique records with first occurrence of duplicates
  const cleanData = uniqueRows.concat(deduplicatedDuplicates);

  console.log(`Cleaned ${passed.length} records → ${cleanData.length} unique records`);
  console.log(`Found ${duplicateRows.length} duplicate records (${duplicateRows.length - deduplicatedDuplicates.length} removed)`);

  // Step 9: Export clean data (remove helper columns)
  const cleanDataExport = cleanData
    .drop(['email_normalized', 'is_duplicate', 'occurrence_count']);

  const cleanSheet = workbook.getSheetByName('Clean Data')
    ?? workbook.insertSheet('Clean Data');
  cleanDataExport.toSheet(cleanSheet);

  // Step 10: Export duplicate report for review
  const duplicateReport = duplicateRows
    .select(['email', 'name', 'phone', 'is_duplicate', 'occurrence_count']);

  const reportSheet = workbook.getSheetByName('Duplicate Report')
    ?? workbook.insertSheet('Duplicate Report');
  duplicateReport.toSheet(reportSheet);

  console.log('Clean data and duplicate report exported');
}

Key gaslamp features used:

  • Schema validation with FlameFrame.from() — ensure data types before deduplication
  • .withColumn() for normalization — lowercase and trim for consistent comparison
  • .groupBy() and .count() — count occurrences per group
  • Building lookup maps from grouped data — O(1) deduplication check
  • .filter() with boolean conditions — separate unique and duplicate rows
  • Chainable deduplication — filter duplicates, keep first occurrence
  • .concat() to combine results — merge unique and deduplicated records
  • .drop() to remove helper columns — clean output before export
  • .select() to choose columns — create focused duplicate report

Without gaslamp

JavaScript
function cookbook_deduplication_vanilla() {
  const sourceSheet = SpreadsheetApp.getActiveSheet();
  const workbook    = SpreadsheetApp.getActiveSpreadsheet();

  // Step 1: Load and validate customer data
  const range = sourceSheet.getDataRange();
  const values = range.getValues();
  const headers = values[0];
  const rows = values.slice(1);

  const emailIdx = headers.indexOf('email');
  const nameIdx = headers.indexOf('name');
  const phoneIdx = headers.indexOf('phone');

  const validated = [];
  for (const row of rows) {
    if (typeof row[emailIdx] !== 'string' || typeof row[nameIdx] !== 'string' || typeof row[phoneIdx] !== 'string') {
      console.warn('Type error in row:', row);
      continue;
    }
    validated.push(row);
  }

  if (validated.length === 0) {
    console.warn('Cannot deduplicate: no valid rows');
    return;
  }

  // Step 2: Manually normalize emails and build count map
  const countMap = {};
  for (const row of validated) {
    const email = row[emailIdx].toLowerCase().trim();
    countMap[email] = (countMap[email] || 0) + 1;
  }

  // Step 3: Mark duplicates and add occurrence count (manually)
  const withDuplicateInfo = [];
  for (const row of validated) {
    const email = row[emailIdx].toLowerCase().trim();
    const count = countMap[email];
    const isDuplicate = count > 1;
    withDuplicateInfo.push([...row, email, isDuplicate, count]);
  }

  // Step 4: Separate duplicates and unique records
  const uniqueRows = [];
  const duplicateRows = [];
  for (const row of withDuplicateInfo) {
    const isDuplicate = row[row.length - 2]; // is_duplicate column
    if (isDuplicate) {
      duplicateRows.push(row);
    } else {
      uniqueRows.push(row.slice(0, 3)); // Keep only original columns
    }
  }

  // Step 5: Keep only first occurrence of each duplicated email
  const seenEmails = {};
  const deduplicatedDuplicates = [];
  for (const row of duplicateRows) {
    const normalizedEmail = row[row.length - 3]; // email_normalized column
    if (!seenEmails[normalizedEmail]) {
      seenEmails[normalizedEmail] = true;
      deduplicatedDuplicates.push(row.slice(0, 3)); // Keep only original columns
    }
  }

  // Step 6: Combine unique records with first occurrence of duplicates
  const cleanData = [...uniqueRows, ...deduplicatedDuplicates];

  console.log(`Cleaned ${validated.length} records → ${cleanData.length} unique records`);
  console.log(`Found ${duplicateRows.length} duplicate records (${duplicateRows.length - deduplicatedDuplicates.length} removed)`);

  // Step 7: Write clean data to sheet
  const cleanSheet = workbook.getSheetByName('Clean Data')
    ?? workbook.insertSheet('Clean Data');

  if (cleanData.length > 0) {
    cleanSheet.getRange(1, 1, 1, headers.length).setValues([headers]);
    cleanSheet.getRange(2, 1, cleanData.length, headers.length).setValues(cleanData);
  }

  // Step 8: Write duplicate report for review
  const reportSheet = workbook.getSheetByName('Duplicate Report')
    ?? workbook.insertSheet('Duplicate Report');

  const reportHeaders = ['Email', 'Name', 'Phone', 'Email Normalized', 'Is Duplicate', 'Occurrence Count'];
  if (duplicateRows.length > 0) {
    reportSheet.getRange(1, 1, 1, reportHeaders.length).setValues([reportHeaders]);
    reportSheet.getRange(2, 1, duplicateRows.length, reportHeaders.length).setValues(duplicateRows);
  }

  console.log('Clean data and duplicate report exported');
}

Without gaslamp, you need to:

  • Manually validate types for each row
  • Manually normalize emails (lowercase, trim) in multiple places
  • Manually count occurrences by iterating through data and updating map
  • Manually separate rows into unique and duplicate arrays
  • Manually track seen emails to avoid duplicate duplicates
  • Manually combine unique and deduplicated results with array spread
  • Manually slice arrays to remove helper columns before export
  • Manually select and reorder columns for reports
  • Manually write headers and data separately for each output sheet
  • Duplicate normalization logic (once for counting, again for deduplication)

Key Techniques

  • Normalize before comparison: Use .withColumn() to standardize comparison keys (lowercase, trim)
  • Count occurrences: Use .groupBy() and .count() to identify duplicates
  • Build lookup maps: Convert grouped counts to maps for O(1) duplicate detection
  • Flag with markers: Add boolean is_duplicate column to mark duplicates
  • Keep first occurrence: Use a tracking object to skip subsequent duplicates
  • Separate reporting: Export clean data and duplicate details to different sheets
  • Clean before export: Use .drop() to remove helper columns from final output

See Select & Drop to customize which columns appear in reports, and Filter Guide for more filtering patterns.