gcp-bq-table-management
Skill from funnelenvy/agents_webinar_demos
Automates BigQuery table operations like creation, schema updates, data loading, and management across Google Cloud Platform projects
Installation
npx skills add https://github.com/funnelenvy/agents_webinar_demos --skill gcp-bq-table-management
Skill Details
Use when creating BigQuery tables, implementing partitioning or clustering, managing table schemas, or optimizing table structure. Covers time-based partitioning, range partitioning, clustering strategies, DDL commands, and table configuration.
Overview
# BigQuery Table Management
Use this skill when creating, modifying, or optimizing BigQuery table structures with partitioning and clustering.
Creating Tables
Basic Table Creation
Using bq mk:
```bash
bq mk -t \
--schema 'customer_id:STRING,amount:FLOAT,date:DATE' \
--description "Customer orders table" \
project:dataset.orders
```
Using SQL DDL:
```sql
CREATE TABLE project.dataset.orders (
customer_id STRING,
amount FLOAT64,
date DATE,
created_at TIMESTAMP
);
```
Partitioning Strategies
Time-Based Partitioning
Create time-partitioned table:
```bash
bq mk -t \
--schema 'timestamp:TIMESTAMP,customer_id:STRING,amount:FLOAT' \
--time_partitioning_field timestamp \
--time_partitioning_type DAY \
project:dataset.orders
```
SQL DDL version:
```sql
CREATE TABLE project.dataset.orders (
timestamp TIMESTAMP,
customer_id STRING,
amount FLOAT64
)
PARTITION BY DATE(timestamp);
```
Partitioning options (a monthly-granularity sketch follows this list):
- DAY - Daily partitions (most common)
- HOUR - Hourly partitions (for high-volume data)
- MONTH - Monthly partitions (for historical data)
- YEAR - Yearly partitions (for very old data)
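For granularities other than DAY, partition on a truncated timestamp. A minimal sketch of a monthly-partitioned table (table and column names are illustrative):
```sql
-- Monthly partitions: TIMESTAMP_TRUNC maps each row to the first day of its month
CREATE TABLE project.dataset.orders_monthly (
  timestamp TIMESTAMP,
  customer_id STRING,
  amount FLOAT64
)
PARTITION BY TIMESTAMP_TRUNC(timestamp, MONTH);
```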
Ingestion-Time Partitioning
Create table with automatic _PARTITIONTIME:
```bash
bq mk -t \
--schema 'customer_id:STRING,amount:FLOAT' \
--time_partitioning_type DAY \
project:dataset.orders
```
Query with ingestion-time partition:
```sql
SELECT * FROM project.dataset.orders
WHERE _PARTITIONTIME >= '2024-01-01'
```
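To see how data is distributed across ingestion-time partitions, group on the pseudo column; a quick sketch (table name is illustrative):
```sql
-- Row counts per ingestion-time partition
SELECT
  DATE(_PARTITIONTIME) AS partition_date,
  COUNT(*) AS row_count
FROM project.dataset.orders
GROUP BY partition_date
ORDER BY partition_date DESC;
```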
Range Partitioning
Create range-partitioned table:
```bash
bq mk -t \
--schema 'customer_id:INTEGER,region:STRING,sales:FLOAT' \
--range_partitioning=customer_id,0,100,10 \
project:dataset.sales
```
Parameters: field,start,end,interval
- Creates partitions: [0,10), [10,20), [20,30), ..., [90,100)
SQL DDL version:
```sql
CREATE TABLE project.dataset.sales (
customer_id INT64,
region STRING,
sales FLOAT64
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10));
```
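Queries that filter on the range column only scan the matching buckets; a sketch against the table above:
```sql
-- The customer_id filter prunes the scan to the [10,20) and [20,30) buckets
SELECT region, SUM(sales) AS total_sales
FROM project.dataset.sales
WHERE customer_id BETWEEN 10 AND 29
GROUP BY region;
```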
Clustering
Basic Clustering
Create clustered table:
```bash
bq mk -t \
--schema 'timestamp:TIMESTAMP,customer_id:STRING,product_id:STRING,amount:FLOAT' \
--clustering_fields customer_id,product_id \
project:dataset.orders
```
SQL DDL version:
```sql
CREATE TABLE project.dataset.orders (
timestamp TIMESTAMP,
customer_id STRING,
product_id STRING,
amount FLOAT64
)
CLUSTER BY customer_id, product_id;
```
Clustering rules:
- Up to 4 clustering columns
- Order matters (first column has most impact)
- Works best with WHERE, GROUP BY, JOIN filters
Partitioning + Clustering (Recommended)
Combined approach:
```bash
bq mk -t \
--schema 'timestamp:TIMESTAMP,customer_id:STRING,transaction_amount:FLOAT' \
--time_partitioning_field timestamp \
--clustering_fields customer_id \
--description "Partitioned by day, clustered by customer" \
project:dataset.transactions
```
SQL DDL version:
```sql
CREATE TABLE project.dataset.transactions (
timestamp TIMESTAMP,
customer_id STRING,
transaction_amount FLOAT64
)
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id;
```
Query benefits:
```sql
-- Partition pruning + clustering optimization
SELECT * FROM project.dataset.transactions
WHERE DATE(timestamp) = '2024-01-15' -- Partition filter
AND customer_id = 'CUST123' -- Cluster filter
```
Table Configuration Options
Expiration
Set table expiration:
```bash
bq mk -t \
--expiration 2592000 \
--schema 'field:TYPE' \
project:dataset.temp_table
```
Expiration in seconds: 2592000 = 30 days
Update existing table:
```bash
bq update --expiration 604800 project:dataset.table
```
Remove expiration:
```bash
bq update --expiration 0 project:dataset.table
```
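Expiration can also be set in SQL DDL with ALTER TABLE ... SET OPTIONS; a sketch, assuming the table already exists:
```sql
-- Expire the table 30 days from now; set expiration_timestamp to NULL to remove it
ALTER TABLE project.dataset.temp_table
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
);
```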
Labels
Add labels:
```bash
bq mk -t \
--schema 'field:TYPE' \
--label environment:production \
--label team:analytics \
project:dataset.table
```
Update labels:
```bash
bq update --set_label environment:staging project:dataset.table
```
Description
Set description:
```bash
bq update \
--description "Customer transaction history with daily partitioning" \
project:dataset.transactions
```
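Labels and descriptions can likewise be set through SQL DDL; a sketch with illustrative values:
```sql
-- labels is an array of (key, value) pairs; description is a plain string
ALTER TABLE project.dataset.transactions
SET OPTIONS (
  labels = [('environment', 'production'), ('team', 'analytics')],
  description = 'Customer transaction history with daily partitioning'
);
```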
Schema Management
Adding Columns
New columns must be added as NULLABLE (or REPEATED); you cannot add a REQUIRED column to an existing table:
```bash
# Add optional column
bq query --use_legacy_sql=false \
'ALTER TABLE project.dataset.table
ADD COLUMN new_field STRING'
```
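Several columns (including nested STRUCT fields) can be added in one statement, and IF NOT EXISTS makes the operation idempotent; a sketch with illustrative column names:
```sql
-- New columns are always added as NULLABLE
ALTER TABLE project.dataset.table
ADD COLUMN IF NOT EXISTS new_field STRING,
ADD COLUMN IF NOT EXISTS load_metadata STRUCT<source STRING, loaded_at TIMESTAMP>;
```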
Changing Column Modes
REQUIRED → NULLABLE (allowed):
```sql
ALTER TABLE project.dataset.table
ALTER COLUMN field_name DROP NOT NULL;
```
NULLABLE → REQUIRED (NOT allowed if data exists)
Relaxing Column Types
Allowed type changes:
- INT64 → FLOAT64 ✓
- INT64 → NUMERIC ✓
- INT64 → BIGNUMERIC ✓
- INT64 → STRING ✗ (not supported by ALTER COLUMN SET DATA TYPE)
Example:
```sql
ALTER TABLE project.dataset.table
ALTER COLUMN amount SET DATA TYPE FLOAT64;
```
External Tables
Create External Table (GCS)
CSV in GCS:
```bash
bq mk \
--external_table_definition='customer_id:STRING,amount:FLOAT@CSV=gs://bucket/*.csv' \
project:dataset.external_orders
```
Parquet in GCS (schema auto-detected):
```bash
bq mk \
--external_table_definition=PARQUET=gs://bucket/*.parquet \
project:dataset.external_data
```
Supported formats: CSV, JSON, AVRO, PARQUET, ORC
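External tables can also be defined with SQL DDL instead of bq mk; a minimal sketch (the bucket path is illustrative):
```sql
-- Equivalent external table definition in DDL
CREATE EXTERNAL TABLE project.dataset.external_orders_ddl (
  customer_id STRING,
  amount FLOAT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket/*.csv'],
  skip_leading_rows = 1  -- assumes a header row; drop if the files have none
);
```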
External Table Limitations
- No DML operations (INSERT, UPDATE, DELETE)
- No guaranteed performance SLAs
- Data must be in GCS, Drive, or Bigtable
- Cannot be partitioned (but can use hive partitioning)
Snapshots and Clones
Table Snapshots
Create snapshot:
```sql
CREATE SNAPSHOT TABLE project.dataset.orders_snapshot
CLONE project.dataset.orders;
```
Restore from snapshot:
```sql
CREATE OR REPLACE TABLE project.dataset.orders
CLONE project.dataset.orders_snapshot;
```
Snapshots do not expire by default; set an expiration explicitly if you want automatic cleanup (see the sketch below).
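A sketch that snapshots the table as it looked 24 hours ago (within the time travel window) and expires the snapshot after 30 days; names are illustrative:
```sql
-- Snapshot of the table's state 24 hours ago, expiring in 30 days
CREATE SNAPSHOT TABLE project.dataset.orders_snapshot_daily
CLONE project.dataset.orders
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
);
```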
Table Clones
Create table clone:
```sql
CREATE TABLE project.dataset.orders_clone
CLONE project.dataset.orders;
```
Difference from snapshot:
- Clone = new independent table
- Snapshot = point-in-time reference
Time Travel
Query historical data:
```sql
-- Query table as it was 1 hour ago
SELECT * FROM project.dataset.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
-- Query table at specific time
SELECT * FROM project.dataset.orders
FOR SYSTEM_TIME AS OF '2024-01-15 10:00:00 UTC';
```
Time travel window: 7 days (168 hours) by default
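Time travel can also be used to materialize an earlier state of a table, for example to recover from a bad update; a sketch with an illustrative target name:
```sql
-- Recreate the table's state from 2 hours ago under a new name
CREATE TABLE project.dataset.orders_recovered AS
SELECT *
FROM project.dataset.orders
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR);
```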
Best Practices
Partition Selection
Use time-based partitioning when:
- Data has timestamp/date column
- Queries filter by time ranges
- Data arrives chronologically
- Want automatic partition management
Use range partitioning when:
- Partitioning on integer column (ID, age, etc.)
- Predictable value distribution
- Fixed range boundaries
Use ingestion-time partitioning when:
- No natural timestamp column
- Loading data from streaming sources
- Want simple partition management
Clustering Selection
Cluster on columns that are:
- Frequently used in WHERE clauses
- Used in JOIN conditions
- Used in GROUP BY
- High cardinality (many distinct values)
Order matters:
- Most filtered column first
- Then second most filtered
- Up to 4 columns total
Partitioning + Clustering Strategy
Optimal pattern:
```sql
CREATE TABLE project.dataset.optimized (
event_timestamp TIMESTAMP, -- Partition on this
customer_id STRING, -- Cluster on this (1st)
product_category STRING, -- Cluster on this (2nd)
amount FLOAT64
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id, product_category;
```
Query pattern:
```sql
-- Both partition and cluster benefit
SELECT SUM(amount)
FROM project.dataset.optimized
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND customer_id = 'CUST123'
GROUP BY product_category;
```
Checking Table Metadata
Get table information:
```bash
bq show --format=prettyjson project:dataset.table
```
Check partition info:
```sql
SELECT
partition_id,
total_rows,
total_logical_bytes,
last_modified_time
FROM project.dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'orders'
ORDER BY partition_id DESC
LIMIT 10;
```
Check clustering info:
```sql
SELECT
table_name,
clustering_ordinal_position,
column_name
FROM project.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE clustering_ordinal_position IS NOT NULL
ORDER BY table_name, clustering_ordinal_position;
```
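Table-level settings such as expiration, partition filter requirements, and labels can be inspected the same way; a sketch:
```sql
-- Current option values for a single table
SELECT option_name, option_value
FROM project.dataset.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE table_name = 'orders';
```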
Common Pitfalls
❌ Too many partitions
Problem: Creating 100,000+ partitions
Limit: 10,000 partitions per table
Solution: Use larger partition granularity (MONTH vs DAY)
❌ Wrong partition column
Problem: Partitioning on a column not used in queries
Solution: Partition on columns used in WHERE clauses
❌ Not filtering on partition
Problem: Query scans all partitions
Solution: Always include a partition filter in WHERE (this can also be enforced; see the sketch after this list)
❌ Clustering too many columns
Problem: Clustering on 5+ columns
Limit: 4 columns maximum
Solution: Choose the most selective columns
❌ Wrong cluster order
Problem: Least selective column first
Solution: Put the most selective column first
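For the "not filtering on partition" pitfall, the partition filter can be made mandatory at the table level; a minimal sketch using the transactions table from earlier:
```sql
-- Queries that omit a filter on the partitioning column will now be rejected
ALTER TABLE project.dataset.transactions
SET OPTIONS (require_partition_filter = TRUE);
```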
Table Maintenance
Update partition expiration
Set partition expiration:
```bash
bq update \
--time_partitioning_expiration 2592000 \
project:dataset.partitioned_table
```
With this setting, partitions older than 30 days are deleted automatically.
Optimize table storage
There is no manual VACUUM or OPTIMIZE step; storage optimization happens automatically.
BigQuery automatically:
- Compacts data
- Sorts by clustering columns
- Removes deleted rows
- Optimizes storage format
Access Control & Security
Row-Level Security
Row-level access policies filter data based on user/group membership. They coexist with column-level security.
Creating row-level policies:
```sql
CREATE ROW ACCESS POLICY policy_name
ON dataset.table
GRANT TO ("user:[email protected]")
FILTER USING (region = "US");
```
Multiple policies:
```sql
-- Policy for US users
CREATE ROW ACCESS POLICY us_users_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
-- Policy for managers (see all regions)
CREATE ROW ACCESS POLICY managers_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (TRUE);
```
Viewing policies:
```sql
SELECT * FROM dataset.INFORMATION_SCHEMA.ROW_ACCESS_POLICIES
WHERE table_name = 'orders';
```
Dropping policies:
```sql
DROP ROW ACCESS POLICY policy_name ON dataset.table;
```
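To remove every row-level policy on a table in one statement:
```sql
-- Drops all row access policies; the table then falls back to regular table-level access
DROP ALL ROW ACCESS POLICIES ON dataset.table;
```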
Column-Level Security
Use policy tags from Data Catalog to restrict access to sensitive columns:
Creating table with policy tags:
```sql
CREATE TABLE dataset.customers (
customer_id STRING,
name STRING,
email STRING,
ssn STRING OPTIONS(
policy_tags=("projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG")
),
credit_score INT64 OPTIONS(
policy_tags=("projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/SENSITIVE_TAG")
)
);
```
Adding policy tags to existing columns:
```sql
ALTER TABLE dataset.customers
ALTER COLUMN ssn SET OPTIONS(
policy_tags=("projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG")
);
```
How it works:
- Create taxonomy and policy tags in Data Catalog
- Apply policy tags to table columns
- Grant IAM roles on policy tags (datacatalog.categoryFineGrainedReader)
- Users without permission cannot query those columns
Authorized Views
Views that allow users to query data without direct table access:
Use cases:
- Sharing specific columns/rows without full table access
- Implementing business logic in access control
- Best performance for row/column filtering
Setup process:
```sql
-- 1. Create the view in a separate dataset (datasetB) that consumers can query
CREATE VIEW datasetB.public_orders AS
SELECT order_id, customer_id, amount, order_date
FROM datasetA.orders
WHERE status = 'completed';
-- 2. Authorize the view on datasetA so it can read datasetA.orders.
--    This step is not SQL: add the view to datasetA's authorized views in the
--    Cloud Console (dataset sharing settings), or export the dataset's access
--    list with `bq show --format=prettyjson datasetA`, add a "view" entry for
--    datasetB.public_orders, and apply it with
--    `bq update --source <edited-json-file> datasetA`.
```
Example authorized view:
```sql
-- View in shared_views dataset
CREATE VIEW shared_views.customer_summary AS
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spent
FROM private_data.orders
GROUP BY customer_id;
-- Grant access to view (not underlying table)
-- Users can query shared_views.customer_summary
-- but cannot access private_data.orders
```
Benefits:
- Row/column filtering without policy overhead
- Business logic in SQL (e.g., only show completed orders)
- Best query performance
- Centralized access control
Security Best Practices
1. Layered security:
- Use row-level policies for user-based filtering
- Use column-level security for sensitive data (PII, PHI)
- Use authorized views for complex access patterns
2. Performance:
- Authorized views: Best performance (compiled into query)
- Row-level policies: Slight overhead (filter applied)
- Column-level: No performance impact
3. Combining approaches:
```sql
-- Table with column-level security AND row-level policy
CREATE TABLE dataset.sensitive_data (
user_id STRING,
region STRING,
created_at TIMESTAMP,
ssn STRING OPTIONS(policy_tags=("...")),
data JSON
)
PARTITION BY DATE(created_at);
-- Row-level policy
CREATE ROW ACCESS POLICY regional_access
ON dataset.sensitive_data
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
```
4. Auditing:
Monitor access with Cloud Audit Logs. The query below assumes Data Access audit logs are exported to a BigQuery dataset through a Cloud Logging sink; the dataset name and exact export schema depend on your sink configuration:
```sql
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS principal_email,
  protopayload_auditlog.resourceName AS resource_name,
  protopayload_auditlog.methodName AS method_name
FROM `project.dataset.cloudaudit_googleapis_com_data_access_*`
WHERE resource.type = "bigquery_dataset"
ORDER BY timestamp DESC;
```
Quick Reference
Partition types:
- Time-based: HOUR, DAY, MONTH, YEAR
- Ingestion-time: Automatic _PARTITIONTIME
- Range: Integer column ranges
Clustering:
- Max 4 columns
- Order matters
- Works with or without partitioning
Security:
- Row-level: Filter by user/group
- Column-level: Policy tags for sensitive data
- Authorized views: Business logic filtering
Limits:
- 10,000 partitions per table
- 4 clustering columns
- 7-day time travel window
- 10,000 columns per table
- 100 row-level policies per table