gcp-bq-data-loading
Skill from funnelenvy/agents_webinar_demos
Installation
```bash
npx skills add https://github.com/funnelenvy/agents_webinar_demos --skill gcp-bq-data-loading
```
Skill Details
Use when loading data into BigQuery from CSV, JSON, Avro, or Parquet files, from Cloud Storage or local disk. Covers the bq load command, source formats, schema detection, incremental loading, and handling parsing errors.
Overview
# BigQuery Data Loading
Use this skill when importing data into BigQuery from various file formats and sources.
## Basic Load Command
```bash
bq load \
--location=LOCATION \
--source_format=FORMAT \
PROJECT:DATASET.TABLE \
SOURCE_PATH \
SCHEMA
```
## Loading from CSV
### Basic CSV Load
```bash
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv \
customer_id:STRING,amount:FLOAT,date:DATE
```
### CSV with Schema File
```bash
# Create schema.json
echo '[
{"name": "customer_id", "type": "STRING"},
{"name": "amount", "type": "FLOAT"},
{"name": "date", "type": "DATE"}
]' > schema.json
# Load with schema file
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv \
./schema.json
```
### CSV Options
```bash
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
--field_delimiter=',' \
--quote='"' \
--allow_quoted_newlines \
--allow_jagged_rows \
--max_bad_records=100 \
--null_marker='NULL' \
dataset.table \
gs://bucket/data.csv \
schema.json
```
Key flags:
- `--skip_leading_rows=N` - Skip header rows
- `--field_delimiter=','` - Column separator (default: comma)
- `--allow_quoted_newlines` - Allow newlines inside quoted fields
- `--allow_jagged_rows` - Allow rows with missing trailing columns
- `--max_bad_records=N` - Tolerate up to N parsing errors
- `--null_marker='NULL'` - String that represents NULL values
### CSV with Auto-detect
```bash
bq load \
--source_format=CSV \
--autodetect \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv
```
Warning: Auto-detect is convenient but not recommended for production; the inferred schema can change from one load to the next.
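One way to get auto-detect's convenience without the drift: run a single auto-detected load, export the schema BigQuery inferred, and pin it for subsequent loads (a sketch; the table and bucket names are placeholders):
```bash
# Capture the schema BigQuery inferred during the auto-detected load
bq show --schema --format=prettyjson dataset.table > schema.json
# Subsequent loads reuse the pinned schema instead of --autodetect
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/next_batch.csv \
  ./schema.json
```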
## Loading from JSON
### Newline-Delimited JSON
```bash
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
dataset.table \
gs://bucket/data.json \
customer_id:STRING,amount:FLOAT,date:DATE
```
Required format (one complete JSON object per line):
```json
{"customer_id": "C001", "amount": 99.99, "date": "2024-01-15"}
{"customer_id": "C002", "amount": 149.99, "date": "2024-01-15"}
```
NOT standard JSON array:
```json
// ❌ This won't work
[
{"customer_id": "C001", "amount": 99.99},
{"customer_id": "C002", "amount": 149.99}
]
```
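If a source file arrives as a JSON array, one option is converting it to newline-delimited JSON with jq before loading (assumes jq is available; file names are illustrative):
```bash
# -c prints each array element as a compact, single-line JSON object
jq -c '.[]' data.json > data.ndjson
```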
### JSON with Auto-detect
```bash
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--autodetect \
dataset.table \
gs://bucket/data.json
```
### JSON with Nested Fields
Schema with nested STRUCT:
```json
[
{"name": "customer_id", "type": "STRING"},
{"name": "address", "type": "RECORD", "fields": [
{"name": "street", "type": "STRING"},
{"name": "city", "type": "STRING"},
{"name": "zip", "type": "STRING"}
]},
{"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
{"name": "order_id", "type": "STRING"},
{"name": "amount", "type": "FLOAT"}
]}
]
```
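For reference, a single newline-delimited record matching that schema might look like this (values invented for illustration):
```json
{"customer_id": "C001", "address": {"street": "1 Main St", "city": "Springfield", "zip": "12345"}, "orders": [{"order_id": "O100", "amount": 99.99}, {"order_id": "O101", "amount": 149.99}]}
```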
## Loading from Avro
### Basic Avro Load
```bash
bq load \
--source_format=AVRO \
dataset.table \
gs://bucket/data.avro
```
Key benefit: Schema is auto-detected from Avro metadata. No schema specification needed!
### Avro with Wildcards
```bash
bq load \
--source_format=AVRO \
dataset.table \
"gs://bucket/path/to/*.avro"
```
Note: quote wildcard paths so the shell doesn't expand them.
### Avro Advantages
- Self-describing: Schema embedded in file
- Efficient: Compact binary format
- Type-safe: Strong typing preserved
- No schema drift: Schema always matches data
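If your Avro files use logical types (timestamps, dates, decimals), the `--use_avro_logical_types` flag asks BigQuery to load them as the corresponding native types rather than their raw underlying types (a sketch; paths are placeholders):
```bash
bq load \
  --source_format=AVRO \
  --use_avro_logical_types \
  dataset.table \
  gs://bucket/data.avro
```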
## Loading from Parquet
### Basic Parquet Load
```bash
bq load \
--source_format=PARQUET \
dataset.table \
gs://bucket/data.parquet
```
Like Avro: Schema auto-detected from Parquet metadata.
### Parquet with Compression
```bash
bq load \
--source_format=PARQUET \
dataset.table \
gs://bucket/data.snappy.parquet
```
Supported compression: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD
## Loading Strategies
### Append Data (Default)
```bash
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/new_data.csv \
schema.json
```
Behavior: Adds rows to existing table.
### Replace Table
```bash
bq load \
--source_format=CSV \
--replace \
dataset.table \
gs://bucket/data.csv \
schema.json
```
Behavior: Deletes all existing data, loads new data.
### Overwrite Partition
```bash
bq load \
--source_format=CSV \
--replace \
--time_partitioning_field=date \
dataset.table\$20240115 \
gs://bucket/data_20240115.csv \
schema.json
```
Syntax: `TABLE$YYYYMMDD` targets a specific partition.
Behavior: Replaces only that partition, leaving the others intact.
### Append or Skip
```bash
# Skip load if table already has data
bq load \
--source_format=CSV \
--if_not_exists \
dataset.table \
gs://bucket/data.csv \
schema.json
```
## Loading from Cloud Storage
### Single File
```bash
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
schema.json
```
### Multiple Files (List)
```bash
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/file1.csv,gs://bucket/file2.csv,gs://bucket/file3.csv \
schema.json
```
### Wildcard Pattern
```bash
bq load \
--source_format=CSV \
dataset.table \
"gs://bucket/path/data-*.csv" \
schema.json
```
Patterns:
- `gs://bucket/*.csv` - All CSV files in the bucket root
- `gs://bucket/2024/*.csv` - All CSV files under the `2024/` prefix
- `gs://bucket/data-*.csv` - Files whose names match the prefix
Note: a load URI may contain only one `*` wildcard; character classes like `[0-9]` are not supported.
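To preview which objects a pattern will match before loading, gsutil can list them (bucket and path are placeholders; gsutil's wildcard support is broader than BigQuery's):
```bash
gsutil ls "gs://bucket/path/data-*.csv"
```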
## Loading from Local Files
### Local CSV
```bash
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
/path/to/local/data.csv \
schema.json
```
Limitation: local files are first uploaded to BigQuery and then loaded, which is slow for large files. Staging data in Cloud Storage performs better.
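A common workaround is to stage the file in Cloud Storage with gsutil and load from there (paths and bucket are placeholders):
```bash
# Stage locally-generated data in GCS, then load from the staged copy
gsutil cp /path/to/local/data.csv gs://bucket/staging/
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/staging/data.csv \
  ./schema.json
```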
## Schema Handling
### Inline Schema
```bash
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
customer_id:STRING,amount:FLOAT64,order_date:DATE,active:BOOLEAN
```
### Schema File
```json
[
{"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
{"name": "amount", "type": "FLOAT64"},
{"name": "order_date", "type": "DATE"},
{"name": "metadata", "type": "JSON"}
]
```
Modes:
- `REQUIRED` - Field must have a value
- `NULLABLE` - Field can be NULL (default)
- `REPEATED` - Field is an array
### Auto-detect Schema
```bash
bq load \
--source_format=CSV \
--autodetect \
dataset.table \
gs://bucket/data.csv
```
Pros:
- Quick for exploration
- No schema definition needed
Cons:
- Not reliable for production
- Schema may change between loads
- May misinterpret types
### Schema Evolution
Add new columns:
```bash
bq load \
--source_format=CSV \
--schema_update_option=ALLOW_FIELD_ADDITION \
dataset.table \
gs://bucket/data_with_new_column.csv \
schema.json
```
Relax REQUIRED columns to NULLABLE:
```bash
bq load \
--schema_update_option=ALLOW_FIELD_RELAXATION \
dataset.table \
gs://bucket/data.csv \
schema.json
```
## Compression
### Compressed Files
BigQuery automatically detects compression:
```bash
# Gzip compressed CSV
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv.gz \
schema.json
```
Supported codecs depend on the source format: CSV and JSON files may only be gzip-compressed, while Avro data blocks can use Snappy or DEFLATE.
Performance note: uncompressed files load faster because BigQuery can split and read them in parallel, while a gzip file must be read by a single worker. Use compression only when network or storage is the bottleneck.
## Error Handling
### Allow Bad Records
```bash
bq load \
--source_format=CSV \
--max_bad_records=1000 \
dataset.table \
gs://bucket/data.csv \
schema.json
```
Behavior: Skip up to 1000 rows with errors, load the rest.
### Ignore Unknown Values
```bash
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--ignore_unknown_values \
dataset.table \
gs://bucket/data.json \
schema.json
```
Behavior: Ignore JSON fields not in schema.
### Validate Only (Dry Run)
```bash
bq load \
--dry_run \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
schema.json
```
Behavior: Validate schema and format without loading data.
## Loading to Partitioned Tables
### Time-Partitioned Table
```bash
bq load \
--source_format=CSV \
--time_partitioning_field=order_date \
--time_partitioning_type=DAY \
dataset.partitioned_orders \
gs://bucket/orders_2024.csv \
order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT
```
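Partitioning is often combined with clustering; `--clustering_fields` takes a comma-separated column list. A sketch extending the load above (the clustering column is illustrative):
```bash
bq load \
  --source_format=CSV \
  --time_partitioning_field=order_date \
  --time_partitioning_type=DAY \
  --clustering_fields=customer_id \
  dataset.partitioned_orders \
  gs://bucket/orders_2024.csv \
  order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT
```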
### Load Specific Partition
```bash
# Load into 2024-01-15 partition
bq load \
--source_format=CSV \
dataset.orders\$20240115 \
gs://bucket/orders_20240115.csv \
schema.json
```
### Range-Partitioned Table
```bash
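# --range_partitioning format: FIELD,START,END,INTERVAL
# Here: customer_id bucketed in steps of 100 over [0, 1000); END is exclusive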
bq load \
--source_format=CSV \
--range_partitioning=customer_id,0,1000,100 \
dataset.range_partitioned \
gs://bucket/data.csv \
customer_id:INTEGER,amount:FLOAT
```
## Performance Optimization
### Parallel Loading
BigQuery automatically parallelizes loads from:
- Multiple files with wildcards
- Uncompressed files (internal parallelization)
Optimal: split very large files into chunks of roughly 1 GB, as in the sketch below.
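A sketch of that chunking with GNU split and gsutil (file names and bucket are placeholders; `-C` caps chunk size while keeping lines intact):
```bash
# Peel off the header so chunks can load without --skip_leading_rows
tail -n +2 big.csv > body.csv
# Split into ~1 GB chunks with numeric suffixes (chunk_00, chunk_01, ...)
split -C 1G -d body.csv chunk_
# Upload in parallel, then load all chunks with a single wildcard URI
gsutil -m cp chunk_* gs://bucket/chunks/
bq load --source_format=CSV dataset.table "gs://bucket/chunks/chunk_*" ./schema.json
```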
### File Format Recommendations
Best performance:
- Avro - Fastest, self-describing, splittable
- Parquet - Fast, columnar, good compression
- CSV uncompressed - Good for parallel processing
- JSON - Flexible but slower
Avoid:
- Very large single files (>10GB)
- CSV with complex escaping
- Highly compressed files (limits parallelism)
## Monitoring Loads
### Check Load Jobs
```bash
bq ls --jobs --max_results=10
```
### Job Details
```bash
bq show -j JOB_ID
```
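For the full job configuration, statistics, and error stream, JSON output is more detailed (JOB_ID is a placeholder):
```bash
bq show --format=prettyjson -j JOB_ID
```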
### Failed Loads
```sql
SELECT
job_id,
user_email,
error_result.message as error_message,
creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
job_type = 'LOAD'
AND state = 'DONE'
AND error_result IS NOT NULL
AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC;
```
## Common Patterns
### Daily Batch Load
```bash
#!/bin/bash
DATE=$(date +%Y%m%d)
bq load \
--source_format=CSV \
--replace \
dataset.daily_data\$$DATE \
gs://bucket/data_$DATE.csv \
schema.json
```
### Incremental Load with Deduplication
```bash
# Load to staging
bq load \
--source_format=CSV \
dataset.staging_orders \
gs://bucket/new_orders.csv \
schema.json
# Merge to production (dedup)
bq query --use_legacy_sql=false '
MERGE project.dataset.orders T
USING project.dataset.staging_orders S
ON T.order_id = S.order_id
WHEN NOT MATCHED THEN INSERT ROW
'
```
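After the merge succeeds, the staging table is typically emptied so the next batch starts clean; a minimal sketch using the table names from the example above:
```bash
bq query --use_legacy_sql=false \
  'TRUNCATE TABLE project.dataset.staging_orders'
```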
## Troubleshooting
### "Too many errors"
Problem: The number of bad rows exceeds --max_bad_records.
Solution: Increase --max_bad_records or fix the underlying data quality.
### "Schema mismatch"
Problem: CSV columns don't match the schema.
Solution: Verify column order and count.
### "Invalid CSV format"
Problem: Unescaped quotes or embedded newlines.
Solution: Use --allow_quoted_newlines and check the --quote character.
### "Permission denied"
Problem: BigQuery has no access to the GCS bucket.
Solution: Grant the BigQuery service account the Storage Object Viewer role (see the sketch below).
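One way to grant that role, assuming you know which principal needs access (the service account email and bucket below are placeholders):
```bash
gsutil iam ch \
  serviceAccount:SA_EMAIL:roles/storage.objectViewer \
  gs://bucket
```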
## Quick Reference
Format priorities (fastest first):
- Avro (splittable, self-describing)
- Parquet (columnar, efficient)
- CSV uncompressed (parallel)
- JSON (flexible)
Schema strategies:
- Production: explicit schema file
- Development: auto-detect
- Evolution: allow field addition/relaxation
Loading strategies:
- New data: append (default)
- Replace all: `--replace`
- Replace partition: `TABLE$YYYYMMDD`
- Incremental: load staging → MERGE