# polars
## What it does

Enables high-performance data manipulation with lazy evaluation, expression-based transformations, and efficient DataFrame operations using Apache Arrow.

Part of ovachiever/droid-tings (370 items).

## Installation

Clone the repository:

```bash
git clone https://github.com/ovachiever/droid-tings.git
```

## Skill Details

SKILL.md description:

> "Fast DataFrame library (Apache Arrow). Select, filter, group_by, joins, lazy evaluation, CSV/Parquet I/O, expression API, for high-performance data analysis workflows."

# Polars

## Overview

Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.

## Quick Start

### Installation and Basic Usage

Install Polars:

```bash
uv pip install polars
```

Basic DataFrame creation and operations:

```python
import polars as pl

# Create a DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"],
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(
    age_plus_10=pl.col("age") + 10
)
```

## Core Concepts

### Expressions

Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.

Key principles:

- Use pl.col("column_name") to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)

Example:

```python
# Expression-based computation
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months"),
)
```
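
Because expressions are ordinary Python objects, they can be bound to names and reused across contexts. A small sketch (the threshold is arbitrary):

```python
# An expression describes a computation; nothing runs until a context uses it
is_adult = pl.col("age") >= 18

df.filter(is_adult)                 # reused as a predicate
df.with_columns(is_adult=is_adult)  # reused as a new boolean column
```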

### Lazy vs Eager Evaluation

Eager (DataFrame): operations execute immediately.

```python
df = pl.read_csv("file.csv")  # Reads immediately
result = df.filter(pl.col("age") > 25)  # Executes immediately
```

Lazy (LazyFrame): operations build a query plan that is optimized before execution.

```python
lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()  # Now executes the optimized query
```

When to use lazy:

- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- When performance is critical

Benefits of lazy evaluation (illustrated in the sketch after this list):

- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution
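
As a minimal sketch of what the optimizer does, assuming a hypothetical sales.csv: LazyFrame.explain() prints the optimized plan, in which the filter and the column selection are pushed into the scan itself.

```python
import polars as pl

lf = (
    pl.scan_csv("sales.csv")            # hypothetical file
    .filter(pl.col("region") == "EU")   # candidate for predicate pushdown
    .select("date", "amount")           # candidate for projection pushdown
)

# The optimized plan applies the predicate and projection at scan time,
# so rows and columns that are never needed are never read.
print(lf.explain())

df = lf.collect()  # executes the optimized plan
```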

For detailed concepts, load references/core_concepts.md.

## Common Operations

### Select

Select and manipulate columns:

```python
# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age"),
)

# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```

### Filter

Filter rows by conditions:

```python
# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (cleaner than chaining with &)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY",
)

# Complex conditions
df.filter(
    (pl.col("age") > 25) | (pl.col("city") == "LA")
)
```

### With Columns

Add or modify columns while preserving existing ones:

```python
# Add new columns
df.with_columns(
    age_plus_10=pl.col("age") + 10,
    name_upper=pl.col("name").str.to_uppercase(),
)

# Parallel computation (all columns computed in parallel);
# aliases are needed so the two outputs don't share the name "value"
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    (pl.col("value") * 100).alias("value_x100"),
)
```

### Group By and Aggregations

Group data and compute aggregations:

```python
# Basic grouping
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count"),
)

# Multiple group keys
df.group_by("city", "department").agg(
    pl.col("salary").sum()
)

# Conditional aggregations
df.group_by("city").agg(
    (pl.col("age") > 30).sum().alias("over_30")
)
```

For detailed operation patterns, load references/operations.md.

## Aggregations and Window Functions

### Aggregation Functions

Common aggregations within the group_by context (see the sketch after this list):

- pl.len() - count rows
- pl.col("x").sum() - sum values
- pl.col("x").mean() - average
- pl.col("x").min() / pl.col("x").max() - extremes
- pl.first() / pl.last() - first/last values
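
A brief sketch combining these in a single agg call (the frame is hypothetical):

```python
df = pl.DataFrame({
    "city": ["NY", "NY", "LA"],
    "salary": [90_000, 110_000, 95_000],
})

df.group_by("city").agg(
    pl.len().alias("n"),                    # rows per group
    pl.col("salary").sum().alias("total"),
    pl.col("salary").mean().alias("avg"),
    pl.col("salary").min().alias("lo"),
    pl.col("salary").max().alias("hi"),
    pl.first("salary").alias("first_seen"),
)
```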

### Window Functions with `over()`

Apply aggregations while preserving row count:

```python
# Add group statistics to each row
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city"),
)

# Multiple grouping columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region")
)
```

Mapping strategies (see the example after this list):

- group_to_rows (default): preserves original row order
- explode: faster, but returns rows grouped together
- join: creates list columns
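
The strategy is passed as a keyword to over(). A sketch (ranking method and columns as in the examples above):

```python
# Default (group_to_rows): results are mapped back to the original rows
df.with_columns(
    rank_in_city=pl.col("salary").rank("dense").over("city")
)

# "explode" skips restoring the original row order, so rows come back
# grouped; use it in a context where that ordering is acceptable
df.select(
    pl.col("salary").rank("dense").over("city", mapping_strategy="explode")
)
```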

## Data I/O

### Supported Formats

Polars supports reading and writing the following (see the cloud I/O sketch after this list):

- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
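
A hedged sketch of cloud and multi-file reads (bucket and paths are hypothetical; credentials are assumed to be configured in the environment):

```python
# Scan Parquet directly from object storage
lf = pl.scan_parquet("s3://my-bucket/events/*.parquet")

# Globbing also covers multiple/partitioned local files
lf = pl.scan_csv("data/2024-*.csv")

result = lf.select("user_id", "amount").collect()  # hypothetical columns
```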

### Common I/O Operations

CSV:

```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```

Parquet (recommended for performance):

```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```

JSON:

```python
df = pl.read_json("file.json")
df.write_json("output.json")
```

For comprehensive I/O documentation, load references/io_guide.md.

## Transformations

### Joins

Combine DataFrames:

```python
# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```

### Concatenation

Stack DataFrames:

```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")

# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```

### Pivot and Unpivot

Reshape data:

```python
# Pivot (long to wide)
df.pivot(values="sales", index="date", on="product")

# Unpivot (wide to long)
df.unpivot(index="id", on=["col1", "col2"])
```

For detailed transformation examples, load references/transformations.md.

## Pandas Migration

Polars offers significant performance improvements over pandas with a cleaner API. Key differences:

### Conceptual Differences

- No index: Polars addresses rows by integer position only
- Strict typing: no silent type conversions (see the sketch after this list)
- Lazy evaluation: available via LazyFrame
- Parallel by default: operations are parallelized automatically
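
For instance, strict typing shows up at construction time; a small sketch (pandas would silently fall back to object dtype here):

```python
import polars as pl

try:
    pl.Series([1, 2, "three"])  # mixed types are rejected
except TypeError:
    print("mixed types rejected")

# Opting out is explicit; values are cast to a common supertype
s = pl.Series([1, 2, "three"], strict=False)
```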

### Common Operation Mappings

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | df["col"] | df.select("col") |
| Filter | df[df["col"] > 10] | df.filter(pl.col("col") > 10) |
| Add column | df.assign(x=...) | df.with_columns(x=...) |
| Group by | df.groupby("col").agg(...) | df.group_by("col").agg(...) |
| Window | df.groupby("col").transform(...) | df.with_columns(pl.col("x").mean().over("col")) |
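
To make the window row above concrete (column names hypothetical):

```python
# pandas:
#   df["avg_age_by_city"] = df.groupby("city")["age"].transform("mean")

# Polars: the aggregation plus .over() inside with_columns
df = df.with_columns(avg_age_by_city=pl.col("age").mean().over("city"))
```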

### Key Syntax Patterns

Pandas sequential (slow):

```python
df.assign(
    col_a=lambda df_: df_.value * 10,
    col_b=lambda df_: df_.value * 100,
)
```

Polars parallel (fast):

```python
df.with_columns(
    col_a=pl.col("value") * 10,
    col_b=pl.col("value") * 100,
)
```

For comprehensive migration guide, load references/pandas_migration.md.

## Best Practices

### Performance Optimization

1. Use lazy evaluation for large datasets:

```python
lf = pl.scan_csv("large.csv")  # Don't use read_csv here
result = lf.filter(...).select(...).collect()
```

2. Avoid Python functions in hot paths (see the contrast below):

- Stay within the expression API for parallelization
- Use .map_elements() only when necessary
- Prefer native Polars operations
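
A sketch of the contrast (the lambda stands in for any Python-level function):

```python
# Slow: the Python UDF runs once per row, outside the parallel engine
df.with_columns(
    age_plus_10=pl.col("age").map_elements(lambda a: a + 10, return_dtype=pl.Int64)
)

# Fast: the equivalent native expression is vectorized and parallelizable
df.with_columns(age_plus_10=pl.col("age") + 10)
```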

3. Use streaming for very large data:

```python
# Newer Polars versions spell this lf.collect(engine="streaming")
lf.collect(streaming=True)
```

4. Select only needed columns early:

```python
# Good: select columns early
lf.select("col1", "col2").filter(...)

# Bad: filter on all columns first
lf.filter(...).select("col1", "col2")
```

5. Use appropriate data types (see the casting sketch below):

- Categorical for low-cardinality strings
- Appropriate integer sizes (i32 vs i64)
- Date types for temporal data
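
A casting sketch (column names and date format are hypothetical):

```python
df = df.with_columns(
    pl.col("status").cast(pl.Categorical),  # low-cardinality strings
    pl.col("count").cast(pl.Int32),         # values fit comfortably in i32
    pl.col("day").str.to_date("%Y-%m-%d"),  # real Date dtype, not strings
)
```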

### Expression Patterns

Conditional operations:

```python
pl.when(condition).then(value).otherwise(other_value)
```
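
For example, deriving a label column (the threshold is arbitrary):

```python
df.with_columns(
    age_band=pl.when(pl.col("age") < 30)
    .then(pl.lit("under_30"))
    .otherwise(pl.lit("30_plus"))
)
```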

Column operations across multiple columns:

```python
df.select(pl.col("^.*_value$") * 2)  # Regex pattern selects matching columns
```

Null handling:

```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```

For additional best practices and patterns, load references/best_practices.md.

## Resources

This skill includes comprehensive reference documentation:

references/

- core_concepts.md - Detailed explanations of expressions, lazy evaluation, and the type system
- operations.md - Comprehensive guide to all common operations with examples
- pandas_migration.md - Complete migration guide from pandas to Polars
- io_guide.md - Data I/O operations for all supported formats
- transformations.md - Joins, concatenation, pivots, and reshaping operations
- best_practices.md - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.