Skip to content

Data Cleanup

Clean and standardize messy data from different sources.

User Story

You've created a Google Form to collect customer information including names, emails, and subscription preferences. Form responses are automatically saved to a Google Sheet, but the data is messy:

  • Names have inconsistent capitalization and whitespace
  • Some emails are in different cases
  • Subscription preference values don't match the expected options ("yes" or "no")
  • A few rows have empty or invalid data

You need to validate the data against a schema (including the enum constraint for preferences), standardize it (lowercase, trimmed), and remove invalid rows before processing.

With gaslamp

JavaScript
function cookbook_dataCleanup() {
  // Step 1: Define what valid data looks like
  // Use FlameGuards for type validation and FlameWright for constraints like enum
  const schema = {
    name:              gaslamp.FlameGuards.isString,
    email:             gaslamp.FlameGuards.isString,
    subscription:      gaslamp.FlameWright.enumOf(['yes', 'no']),
  };

  // Step 2: Load data from the active sheet (typically Form Responses)
  const sheet = SpreadsheetApp.getActiveSheet();
  const df = gaslamp.BareFrame.fromSheet(sheet);

  // Step 3: Validate data against the schema
  // Returns { passed, failed } — use passed for valid rows, failed for errors
  const { passed, failed } = gaslamp.FlameFrame.from(df, schema);

  // Step 4: If there are validation errors, show them with suggestions
  if (failed.length > 0) {
    // Display error rows as a formatted table with invalid_message column
    console.error(failed.length + ' validation error(s) found:');
    console.error(failed.display(50, 30));
    console.error('→ Fix these rows in the Form Responses sheet and run again');
    return;
  }

  // Step 5: Normalize the data first
  // Standardize format: trim whitespace, convert to lowercase
  // Data is already validated as strings, so toString() is not needed
  const normalized = passed
    .withColumn('name',         (row) => row.get('name').trim().toLowerCase())
    .withColumn('email',        (row) => row.get('email').toLowerCase())
    .withColumn('subscription', (row) => row.get('subscription').toLowerCase());

  // Step 6: Filter out incomplete or invalid rows
  // Use Expression for readable, composable filter conditions
  const hasValidName = gaslamp.Expression.col('name').ne('');
  const hasValidEmail = gaslamp.Expression.col('email').contains('@');
  const isValid = hasValidName.and(hasValidEmail);

  const standardized = normalized.filter(isValid.toFunction());

  // Step 7: Report how many rows were cleaned
  console.log(`Cleaned ${df.length} -> ${standardized.length} rows`);

  // Step 8: Write the cleaned data to a new sheet
  const workbook   = SpreadsheetApp.getActiveSpreadsheet();
  const cleanSheet = workbook.getSheetByName('Cleaned Data')
    ?? workbook.insertSheet('Cleaned Data');

  standardized.toSheet(cleanSheet);
  console.log('Cleaned data exported to new sheet');
}

Key gaslamp features used:

  • Schema validation with FlameFrame.from() — catch type errors early
  • Chainable .filter() and .withColumn() — readable data transformations
  • .toSheet() — write DataFrame directly to Google Sheet

Without gaslamp

JavaScript
function cookbook_dataCleanup_vanilla() {
  // Step 1: Fetch all data from the sheet as a 2D array
  const sheet = SpreadsheetApp.getActiveSheet();
  const range = sheet.getDataRange();
  const values = range.getValues();
  const headers = values[0];
  const rows = values.slice(1);

  // Step 2: Find the column indexes manually
  const nameIdx = headers.indexOf('name');
  const emailIdx = headers.indexOf('email');
  const subscriptionIdx = headers.indexOf('subscription');

  // Step 3: Define valid enum values for subscription
  const validSubscriptions = ['yes', 'no'];

  // Step 4: Manually validate types and constraints, separate valid rows
  const passed = [];
  const failed = [];
  for (const row of rows) {
    const name = row[nameIdx];
    const email = row[emailIdx];
    const subscription = row[subscriptionIdx];

    // Manual type checking and validation (no schema validation)
    if (typeof name !== 'string' || typeof email !== 'string' || typeof subscription !== 'string') {
      failed.push(row);
      continue;
    }

    // Manual enum validation
    if (!validSubscriptions.includes(subscription.toLowerCase())) {
      failed.push(row);
      continue;
    }

    passed.push(row);
  }

  if (failed.length > 0) {
    console.error('Validation errors:', failed);
    return;
  }

  // Step 5: Normalize valid rows (only selected columns, headers lost)
  const normalized = passed.map((row) => [
    row[nameIdx].trim().toLowerCase(),
    row[emailIdx].toLowerCase(),
    row[subscriptionIdx].toLowerCase(),
  ]);

  // Step 6: Filter rows with business rules
  const standardized = normalized.filter((row) => {
    const [name, email, subscription] = row;
    return name !== '' && email.includes('@');
  });

  // Step 7: Report how many rows were cleaned
  console.log(`Cleaned ${rows.length} -> ${standardized.length} rows`);

  // Step 8: Write the cleaned data to a new sheet
  const workbook = SpreadsheetApp.getActiveSpreadsheet();
  const cleanSheet = workbook.getSheetByName('Cleaned Data')
    ?? workbook.insertSheet('Cleaned Data');

  if (standardized.length > 0) {
    // Must manually calculate range dimensions and write data
    // Note: headers are NOT written by vanilla code (unlike gaslamp's toSheet())
    cleanSheet.getRange(1, 1, standardized.length, 3).setValues(standardized);
  }

  console.log('Cleaned data exported to new sheet (without headers)');
}

Without gaslamp, you need to:

  • Manually index columns and track their positions
  • Manually type-check each value with typeof
  • Manually define and validate enum constraints (e.g., validSubscriptions array and .includes() checks)
  • Manually loop through rows and accumulate results with separate logic for each constraint
  • Select and transform only specific columns (losing the full row structure)
  • Manually write headers and data separately (no automatic header preservation)
  • Manually call setValues() with explicit range dimensions
  • Track column count and data size manually

Key Techniques

  • Schema validation first: Always check types before processing
  • Normalize before filtering: Transform data to a consistent format, then apply business rules
  • Use Expression for filter conditions: Separate and name conditions for clarity (e.g., hasValidName, hasValidEmail)
  • Write to a separate sheet: Keep the original data intact as a backup

See BareFrame Guide for more transformation patterns.