gcp-bq-data-loading

Skill from funnelenvy/agents_webinar_demos
What it does

gcp-bq-data-loading skill from funnelenvy/agents_webinar_demos

Installation

Install the skill:

npx skills add https://github.com/funnelenvy/agents_webinar_demos --skill gcp-bq-data-loading

Last updated: Jan 16, 2026

Skill Details

SKILL.md

Use when loading data into BigQuery from CSV, JSON, Avro, Parquet files, Cloud Storage, or local files. Covers bq load command, source formats, schema detection, incremental loading, and handling parsing errors.

Overview

# BigQuery Data Loading

Use this skill when importing data into BigQuery from various file formats and sources.

## Basic Load Command

```bash
bq load \
  --location=LOCATION \
  --source_format=FORMAT \
  PROJECT:DATASET.TABLE \
  SOURCE_PATH \
  SCHEMA
```

## Loading from CSV

### Basic CSV Load

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT,date:DATE
```

### CSV with Schema File

```bash
# Create schema.json
echo '[
  {"name": "customer_id", "type": "STRING"},
  {"name": "amount", "type": "FLOAT"},
  {"name": "date", "type": "DATE"}
]' > schema.json

# Load with schema file
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  ./schema.json
```

### CSV Options

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --field_delimiter=',' \
  --quote='"' \
  --allow_quoted_newlines \
  --allow_jagged_rows \
  --max_bad_records=100 \
  --null_marker='NULL' \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Key flags:

  • --skip_leading_rows=N - Skip header rows
  • --field_delimiter=',' - Column separator (default: comma)
  • --allow_quoted_newlines - Allow newlines in quoted fields
  • --allow_jagged_rows - Allow rows with missing fields
  • --max_bad_records=N - Tolerate N parsing errors
  • --null_marker='NULL' - String representing NULL values
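
For instance, a hypothetical file like the one below needs --allow_quoted_newlines (a newline inside a quoted field) and --null_marker='NULL' (a literal NULL in the date column) to load cleanly; the file name and values are placeholders:

```bash
# Hypothetical sample data illustrating when the flags above matter.
cat > sample.csv <<'EOF'
customer_id,amount,date
C001,99.99,2024-01-15
"C002
note the embedded newline",149.99,NULL
EOF
```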

### CSV with Auto-detect

```bash
bq load \
  --source_format=CSV \
  --autodetect \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv
```

Warning: Auto-detect is convenient but not recommended for production. Schema may change between loads.
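
One way to get the convenience of auto-detect without the drift is to pin whatever BigQuery inferred. A minimal sketch, assuming dataset.table already exists from an auto-detected load:

```bash
# Capture the auto-detected schema so later loads can use it explicitly.
bq show --schema --format=prettyjson dataset.table > schema.json
```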

## Loading from JSON

### Newline-Delimited JSON

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset.table \
  gs://bucket/data.json \
  customer_id:STRING,amount:FLOAT,date:DATE
```

Required format (one JSON object per line):

```json
{"customer_id": "C001", "amount": 99.99, "date": "2024-01-15"}
{"customer_id": "C002", "amount": 149.99, "date": "2024-01-15"}
```

Not a standard JSON array (this will not load):

```json
[
  {"customer_id": "C001", "amount": 99.99},
  {"customer_id": "C002", "amount": 149.99}
]
```

An array like this must be converted to newline-delimited JSON before loading.
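
A minimal conversion sketch, assuming jq is installed and data.json holds a top-level array (the file names are placeholders):

```bash
# Emit each array element as one compact JSON object per line.
jq -c '.[]' data.json > data.ndjson
```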

### JSON with Auto-detect

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  dataset.table \
  gs://bucket/data.json
```

### JSON with Nested Fields

Schema with nested STRUCT and REPEATED fields:

```json
[
  {"name": "customer_id", "type": "STRING"},
  {"name": "address", "type": "RECORD", "fields": [
    {"name": "street", "type": "STRING"},
    {"name": "city", "type": "STRING"},
    {"name": "zip", "type": "STRING"}
  ]},
  {"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "order_id", "type": "STRING"},
    {"name": "amount", "type": "FLOAT"}
  ]}
]
```
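
For reference, a hypothetical record matching this schema, written as newline-delimited JSON; the file names, values, and nested_schema.json (the schema above saved locally) are placeholders:

```bash
cat > nested.json <<'EOF'
{"customer_id": "C001", "address": {"street": "1 Main St", "city": "Springfield", "zip": "12345"}, "orders": [{"order_id": "O1", "amount": 99.99}, {"order_id": "O2", "amount": 149.99}]}
EOF

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset.table \
  ./nested.json \
  ./nested_schema.json
```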

## Loading from Avro

### Basic Avro Load

```bash
bq load \
  --source_format=AVRO \
  dataset.table \
  gs://bucket/data.avro
```

Key benefit: Schema is auto-detected from Avro metadata. No schema specification needed!

### Avro with Wildcards

```bash
bq load \
  --source_format=AVRO \
  dataset.table \
  "gs://bucket/path/to/*.avro"
```

Note: Quote wildcard paths so the shell doesn't expand the *.

### Avro Advantages

  • Self-describing: Schema embedded in file
  • Efficient: Compact binary format
  • Type-safe: Strong typing preserved
  • No schema drift: Schema always matches data

## Loading from Parquet

### Basic Parquet Load

```bash
bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.parquet
```

Like Avro: Schema auto-detected from Parquet metadata.

### Parquet with Compression

```bash
bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.snappy.parquet
```

Supported compression: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD

## Loading Strategies

### Append Data (Default)

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/new_data.csv \
  schema.json
```

Behavior: Adds rows to existing table.

### Replace Table

```bash
bq load \
  --source_format=CSV \
  --replace \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Deletes all existing data, loads new data.

### Overwrite Partition

```bash
bq load \
  --source_format=CSV \
  --replace \
  --time_partitioning_field=date \
  dataset.table\$20240115 \
  gs://bucket/data_20240115.csv \
  schema.json
```

Syntax: `TABLE$YYYYMMDD` targets a specific partition.

Behavior: Replaces only that partition and leaves others intact.
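
Note that the $ in the decorator must be hidden from the shell; quoting the table reference and backslash-escaping it are equivalent:

```bash
# Single quotes stop the shell from treating $20240115 as a variable.
bq load --source_format=CSV 'dataset.table$20240115' gs://bucket/data_20240115.csv schema.json

# A backslash escape works the same way.
bq load --source_format=CSV dataset.table\$20240115 gs://bucket/data_20240115.csv schema.json
```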

### Append or Skip

```bash
# Skip load if table already has data
bq load \
  --source_format=CSV \
  --if_not_exists \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

## Loading from Cloud Storage

### Single File

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

### Multiple Files (List)

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/file1.csv,gs://bucket/file2.csv,gs://bucket/file3.csv \
  schema.json
```

### Wildcard Pattern

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/path/data-*.csv" \
  schema.json
```

Patterns:

  • gs://bucket/*.csv - All CSV files in the bucket root
  • gs://bucket/2024/*.csv - All CSV files under the 2024/ prefix
  • Only a single * wildcard is supported per URI

## Loading from Local Files

### Local CSV

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  /path/to/local/data.csv \
  schema.json
```

Limitation: Local files are uploaded first and then loaded, which is slower for large files. Use Cloud Storage for better performance.
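
A sketch of the staged approach, assuming gsutil is installed; the bucket and paths are placeholders:

```bash
# Copy the file to Cloud Storage, then load from there.
gsutil cp /path/to/local/data.csv gs://bucket/staging/data.csv

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/staging/data.csv \
  schema.json
```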

## Schema Handling

### Inline Schema

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT64,order_date:DATE,active:BOOLEAN
```

### Schema File

```json
[
  {"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "amount", "type": "FLOAT64"},
  {"name": "order_date", "type": "DATE"},
  {"name": "metadata", "type": "JSON"}
]
```

Modes:

  • REQUIRED - Field must have value
  • NULLABLE - Field can be NULL (default)
  • REPEATED - Field is an array

### Auto-detect Schema

```bash
bq load \
  --source_format=CSV \
  --autodetect \
  dataset.table \
  gs://bucket/data.csv
```

Pros:

  • Quick for exploration
  • No schema definition needed

Cons:

  • Not reliable for production
  • Schema may change between loads
  • May misinterpret types

### Schema Evolution

Add new columns:

```bash
bq load \
  --source_format=CSV \
  --schema_update_option=ALLOW_FIELD_ADDITION \
  dataset.table \
  gs://bucket/data_with_new_column.csv \
  schema.json
```

Relax required columns to nullable:

```bash
bq load \
  --schema_update_option=ALLOW_FIELD_RELAXATION \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

## Compression

### Compressed Files

BigQuery automatically detects compression:

```bash
# Gzip compressed CSV
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv.gz \
  schema.json
```

Supported: GZIP, DEFLATE, SNAPPY, BZIP2, LZ4, ZSTD

Performance note: Uncompressed files load faster (parallel processing). Use compression only if network/storage is bottleneck.

## Error Handling

### Allow Bad Records

```bash
bq load \
  --source_format=CSV \
  --max_bad_records=1000 \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Skip up to 1000 rows with errors, load the rest.

### Ignore Unknown Values

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  dataset.table \
  gs://bucket/data.json \
  schema.json
```

Behavior: Ignore JSON fields not in schema.

### Validate Only (Dry Run)

```bash
bq load \
  --dry_run \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Validate schema and format without loading data.

## Loading to Partitioned Tables

### Time-Partitioned Table

```bash
bq load \
  --source_format=CSV \
  --time_partitioning_field=order_date \
  --time_partitioning_type=DAY \
  dataset.partitioned_orders \
  gs://bucket/orders_2024.csv \
  order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT
```

### Load Specific Partition

```bash
# Load into 2024-01-15 partition
bq load \
  --source_format=CSV \
  dataset.orders\$20240115 \
  gs://bucket/orders_20240115.csv \
  schema.json
```

### Range-Partitioned Table

```bash
bq load \
  --source_format=CSV \
  --range_partitioning=customer_id,0,1000,100 \
  dataset.range_partitioned \
  gs://bucket/data.csv \
  customer_id:INTEGER,amount:FLOAT
```

## Performance Optimization

### Parallel Loading

BigQuery automatically parallelizes loads from:

  • Multiple files with wildcards
  • Uncompressed files (internal parallelization)

Optimal: Split large files into 1GB chunks
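
A sketch of one way to do that, assuming GNU coreutils split and gsutil are available; the file names and bucket are placeholders:

```bash
# Drop the header, split the rows into ~1GB pieces, upload, and load them in one job.
tail -n +2 big.csv > big_body.csv
split -C 1G -d --additional-suffix=.csv big_body.csv chunk_
gsutil -m cp chunk_*.csv gs://bucket/chunks/

bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/chunks/chunk_*.csv" \
  schema.json
```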

### File Format Recommendations

Best performance:

  1. Avro - Fastest, self-describing, splittable
  2. Parquet - Fast, columnar, good compression
  3. CSV uncompressed - Good for parallel processing
  4. JSON - Flexible but slower

Avoid:

  • Very large single files (>10GB)
  • CSV with complex escaping
  • Highly compressed files (limits parallelism)

## Monitoring Loads

### Check Load Jobs

```bash
bq ls --jobs --max_results=10
```

### Job Details

```bash
bq show -j JOB_ID
```
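
For the full job configuration and error details in machine-readable form, the JSON output is handy:

```bash
# Prints the complete job resource, including the error list for failed loads.
bq show --format=prettyjson -j JOB_ID
```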

### Failed Loads

```sql
SELECT
  job_id,
  user_email,
  error_result.message AS error_message,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'LOAD'
  AND state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC;
```

## Common Patterns

### Daily Batch Load

```bash
#!/bin/bash
DATE=$(date +%Y%m%d)

bq load \
  --source_format=CSV \
  --replace \
  dataset.daily_data\$$DATE \
  gs://bucket/data_$DATE.csv \
  schema.json
```

### Incremental Load with Deduplication

```bash
# Load to staging
bq load \
  --source_format=CSV \
  dataset.staging_orders \
  gs://bucket/new_orders.csv \
  schema.json

# Merge to production (dedup)
bq query --use_legacy_sql=false '
MERGE project.dataset.orders T
USING project.dataset.staging_orders S
ON T.order_id = S.order_id
WHEN NOT MATCHED THEN INSERT ROW
'
```
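
One detail this pattern leaves open: the staging load appends by default, so already-merged rows stay in staging. A sketch of two ways to handle that, using the same table names as above:

```bash
# Option 1: reset the staging table after the merge.
bq rm -f -t dataset.staging_orders

# Option 2: overwrite staging on each load instead of appending.
bq load --source_format=CSV --replace dataset.staging_orders gs://bucket/new_orders.csv schema.json
```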

## Troubleshooting

### "Too many errors"

Problem: The number of bad rows exceeds --max_bad_records.

Solution: Increase --max_bad_records or fix the underlying data quality issues.

### "Schema mismatch"

Problem: CSV columns don't match the schema.

Solution: Verify column order and count.

### "Invalid CSV format"

Problem: Unescaped quotes or newlines.

Solution: Use --allow_quoted_newlines.

### "Permission denied"

Problem: No access to the GCS bucket.

Solution: Grant the principal running the load job the Storage Object Viewer role on the bucket.
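
A hedged sketch of the grant, assuming the principal running the load job is a service account SA_EMAIL and the source bucket is gs://bucket (both placeholders):

```bash
# Give the load job's principal read access to objects in the source bucket.
gcloud storage buckets add-iam-policy-binding gs://bucket \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/storage.objectViewer"
```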

## Quick Reference

Format priorities (fastest first):

  1. Avro (splittable, self-describing)
  2. Parquet (columnar, efficient)
  3. CSV uncompressed (parallel)
  4. JSON (flexible)

Schema strategies:

  • Production: Explicit schema file
  • Development: Auto-detect
  • Evolution: Allow field addition/relaxation

Loading strategies:

  • New data: Append (default)
  • Replace all: --replace
  • Replace partition: TABLE$YYYYMMDD
  • Incremental: Load staging → MERGE

More from this repository

  • codex-qa - Queries and analyzes code using OpenAI Codex CLI, providing read-only codebase exploration and technical insights.
  • playwright-browser - Automates browser interactions using Playwright Python, enabling web scraping, screenshot capture, and dynamic page testing with precise element selection.
  • data-export-formats - Generates precise CSV export formats for Google Ads Enhanced Conversions, Customer Match, and Meta Custom Audiences with specific data schema requirements.
  • gemini-qa - Queries code and files using Google Gemini CLI, providing AI-powered insights, explanations, and analysis for codebases.
  • gcp-cli-gotchas - gcp-cli-gotchas skill from funnelenvy/agents_webinar_demos.
  • gcp-bq-table-management - Automates BigQuery table operations like creation, schema updates, data loading, and management across Google Cloud Platform projects.
  • hubspot-crm - Syncs contacts and lists to HubSpot CRM using a private API token, enabling seamless data integration and management.
  • gcp-bq-data-export - Exports BigQuery data to Cloud Storage in various formats like CSV, JSON, Avro, and Parquet with flexible compression and export options.
  • n8n-workflow - Automates anomaly detection workflows by connecting data sources, triggering alerts in Slack, and orchestrating multi-step notification processes using N8N workflow automation.