Make your Data Exploration 10x faster using Polars
Polars is a DataFrame library written entirely in Rust. This piece walks through the fundamentals of Polars and its potential as a Pandas alternative.
In brief, Polars can be seen as a faster, more memory-efficient dataframe library compared to Pandas.
- Speedy performance: Polars consistently outpaces Pandas on common operations.
- Robust expression language: it offers a powerful, composable expression syntax.
- Lazy evaluation support: Polars can defer computations and optimize the whole query plan.
- Memory-friendly: it is designed to be memory-efficient as well.
Why Opt for Polars Over Pandas?
Pandas is known for its flexibility, but its largely single-threaded execution slows it down on big data: as datasets grow, processing times in Pandas grow with them.
Polars is built for handling big data. It combines lazy evaluation with parallel processing, spreading work across many cores for a substantial speed boost.
To install Polars, use pip:
pip install polars
Read time comparison:
Let’s gauge how quickly Polars performs compared to Pandas by measuring how long it takes to read data. We’ll employ Python’s ‘time’ module to achieve this. For instance, we’ll read the same CSV file using both Pandas and Polars and compare their read times.
import time
import pandas as pd
import polars as pl
# Measure read time with pandas
start_time = time.time()
pandas_df = pd.read_csv('google_historical_data.csv')
pandas_read_time = time.time() - start_time
# Measure read time with Polars
start_time = time.time()
polars_df = pl.read_csv('google_historical_data.csv')
polars_read_time = time.time() - start_time
print("Pandas read time:", pandas_read_time)
print("Polars read time:", polars_read_time)
Output:
Pandas read time: 0.010694742202758789
Polars read time: 0.0019254684448242188
From the displayed results, Polars reads the file roughly five times faster than Pandas in this run (exact numbers will vary with hardware and file size). The measurement is simply the difference between the timestamps taken before and after each read.
Sorting time comparison:
We expanded the evaluation to cover sorting as well. Using Python’s ‘time’ module again, we load the same CSV file into each library and then time the sort separately from the read.
import time
import pandas as pd
import polars as pl
# Measure pandas read and sort times separately
start_time = time.time()
pandas_df = pd.read_csv('google_historical_data.csv')
pandas_read_time = time.time() - start_time

start_time = time.time()  # reset, so the sort time excludes the read
pandas_sorted_df = pandas_df.sort_values(by='Date')
pandas_sort_time = time.time() - start_time

# Measure Polars read and sort times separately
start_time = time.time()
polars_df = pl.read_csv('google_historical_data.csv')
polars_read_time = time.time() - start_time

start_time = time.time()  # reset, so the sort time excludes the read
polars_sorted_df = polars_df.sort('Date')
polars_sort_time = time.time() - start_time
print("Pandas sorting time:", pandas_sort_time)
print("Polars sorting time:", polars_sort_time)
Output:
Pandas sorting time: 0.012721776962280273
Polars sorting time: 0.0016827583312988281
Reviewing the outcomes, Polars continues to outperform Pandas, on sorting as well as reading. The script measures the elapsed time around each library’s sort call, and the efficiency gap generally widens on larger datasets and more computationally intensive operations.
GroupBy time comparison:
import time
import pandas as pd
import polars as pl
# Sample data
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 15, 25, 30, 40, 35, 45]}
# Create a Pandas DataFrame
pandas_df = pd.DataFrame(data)
# Create a Polars DataFrame
polars_df = pl.DataFrame(data)
# Measure Pandas groupby time
start_time = time.time()
pandas_grouped = pandas_df.groupby('Category').sum()
pandas_groupby_time = time.time() - start_time
# Measure Polars groupby time
start_time = time.time()
polars_grouped = polars_df.group_by('Category').agg(pl.sum('Value'))  # older Polars releases spelled this .groupby
polars_groupby_time = time.time() - start_time
print("Pandas groupby time:", pandas_groupby_time)
print("Polars groupby time:", polars_groupby_time)
Output:
Pandas groupby time: 0.014025449752807617
Polars groupby time: 0.02013707160949707
This code groups a small table of categories and values, computes the per-category sum, and times the operation in both libraries. Note that Pandas actually wins on this tiny dataset: Polars’ parallel machinery has a fixed overhead that only pays off on larger data. Make sure you have the Pandas and Polars libraries installed to run this code.
Map function time comparison:
import time
import polars as pl

data = {'keys': ['a', 'a', 'b'],
        'values': [10, 7, 1]}
# Create a Polars DataFrame
df = pl.DataFrame(data)
# Measure time for the map function using Polars
start_time = time.time()
out_polars = df.group_by("keys", maintain_order=True).agg(  # older Polars: .groupby
    pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"),  # older Polars: .map
    pl.col("values").shift().alias("shift_expression"),
)
polars_map_time = time.time() - start_time
print("Polars map function time:", polars_map_time)
print(out_polars)
# Convert Polars DataFrame to Pandas DataFrame
pandas_df = df.to_pandas()
# Measure time for map function using Pandas
start_time = time.time()
pandas_df['shift_map'] = pandas_df['values'].shift()
pandas_df['shift_expression'] = pandas_df['values'].shift()
pandas_map_time = time.time() - start_time
print("Pandas map function time:", pandas_map_time)
print(pandas_df)
Output:
Polars map function time: 0.00629425048828125
shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map ┆ shift_expression │
│ --- ┆ --- ┆ --- │
│ str ┆ list[i64] ┆ list[i64] │
╞══════╪════════════╪══════════════════╡
│ a ┆ [null, 10] ┆ [null, 10] │
│ b ┆ [7] ┆ [null] │
└──────┴────────────┴──────────────────┘
Pandas map function time: 0.0028917789459228516
keys values shift_map shift_expression
0 a 10 NaN NaN
1 a 7 10.0 10.0
2 b 1 7.0 7.0
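Note what the Polars output above reveals: shift_map applied the shift over the whole column (group ‘b’ sees 7, leaked from group ‘a’), while shift_expression shifted within each group. The Pandas snippet only reproduced the whole-column behavior; the true per-group counterpart uses groupby().shift(), sketched here:

```python
import pandas as pd

df = pd.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})

# Whole-column shift (what the snippet above times): the 'b' row
# receives 7, leaked from the end of group 'a'.
df["shift_map"] = df["values"].shift()

# Per-group shift, the counterpart of Polars' shift_expression:
# each group restarts with NaN instead of seeing the previous group's value.
df["shift_expression"] = df.groupby("keys")["values"].shift()

print(df)
```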
Resampling (DataFrame creation) time comparison:
import time
from datetime import datetime
import pandas as pd
import polars as pl

# Measure time for creating the DataFrame using Polars
start_time = time.time()
df_polars = pl.DataFrame(
    {"time": pl.datetime_range(datetime(2021, 12, 16), datetime(2021, 12, 16, 3),
                               interval="30m", eager=True),  # older Polars: pl.date_range(low=..., high=...)
     "groups": ["a", "a", "a", "b", "b", "a", "a"],
     "values": [1., 2., 3., 4., 5., 6., 7.]
    })
polars_time = time.time() - start_time
print("Polars DataFrame creation time:", polars_time)
print(df_polars)

# Measure time for creating the DataFrame using Pandas
start_time = time.time()
df_pandas = pd.DataFrame(
    {"time": pd.date_range(start=datetime(2021, 12, 16), end=datetime(2021, 12, 16, 3),
                           freq='30min'),  # older pandas used freq='30T'
     "groups": ["a", "a", "a", "b", "b", "a", "a"],
     "values": [1., 2., 3., 4., 5., 6., 7.]})
pandas_time = time.time() - start_time
print("Pandas DataFrame creation time:", pandas_time)
print(df_pandas)
Output:
Polars DataFrame creation time: 0.00010323524475097656
shape: (7, 3)
┌─────────────────────┬────────┬────────┐
│ time ┆ groups ┆ values │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ f64 │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a ┆ 1.0 │
│ 2021-12-16 00:30:00 ┆ a ┆ 2.0 │
│ 2021-12-16 01:00:00 ┆ a ┆ 3.0 │
│ 2021-12-16 01:30:00 ┆ b ┆ 4.0 │
│ 2021-12-16 02:00:00 ┆ b ┆ 5.0 │
│ 2021-12-16 02:30:00 ┆ a ┆ 6.0 │
│ 2021-12-16 03:00:00 ┆ a ┆ 7.0 │
└─────────────────────┴────────┴────────┘
Pandas DataFrame creation time: 0.00010347366333007812
time groups values
0 2021-12-16 00:00:00 a 1.0
1 2021-12-16 00:30:00 a 2.0
2 2021-12-16 01:00:00 a 3.0
3 2021-12-16 01:30:00 b 4.0
4 2021-12-16 02:00:00 b 5.0
5 2021-12-16 02:30:00 a 6.0
6 2021-12-16 03:00:00 a 7.0
Exploring the Data
Dive into data exploration by obtaining summary statistics like count, mean, minimum, maximum, and more. Achieve this using the “describe” method as shown.
df.describe()
The “shape” method provides information about the data frame’s size, specifically the total count of rows and columns.
print(df.shape)
Output:
(251, 7)
Full code and summary table for all the time comparisons:
import time
import pandas as pd
import polars as pl
from datetime import datetime
def measure_time(func):
    start_time = time.time()
    result = func()
    elapsed_time = time.time() - start_time
    return result, elapsed_time
# Read time comparison
def read_time_comparison():
    _, pandas_read_time = measure_time(lambda: pd.read_csv('google_historical_data.csv'))
    _, polars_read_time = measure_time(lambda: pl.read_csv('google_historical_data.csv'))
    return pandas_read_time, polars_read_time
# Sorting time comparison (reads are done up front, so only the sort is timed)
def sorting_comparison():
    pandas_df = pd.read_csv('google_historical_data.csv')
    polars_df = pl.read_csv('google_historical_data.csv')
    _, pandas_sort_time = measure_time(lambda: pandas_df.sort_values(by='Date'))
    _, polars_sort_time = measure_time(lambda: polars_df.sort('Date'))
    return pandas_sort_time, polars_sort_time
# GroupBy function time comparison
def groupby_comparison():
    data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
            'Value': [10, 20, 15, 25, 30, 40, 35, 45]}
    pandas_df = pd.DataFrame(data)
    polars_df = pl.DataFrame(data)
    _, pandas_groupby_time = measure_time(lambda: pandas_df.groupby('Category').sum())
    _, polars_groupby_time = measure_time(
        lambda: polars_df.group_by('Category').agg(pl.sum('Value')))  # older Polars: .groupby
    return pandas_groupby_time, polars_groupby_time
# Map function time comparison
def map_function_comparison():
    data = {'keys': ['a', 'a', 'b'],
            'values': [10, 7, 1]}
    df = pl.DataFrame(data)

    def polars_map_function(df):
        return df.group_by("keys", maintain_order=True).agg(  # older Polars: .groupby
            pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"),  # older Polars: .map
            pl.col("values").shift().alias("shift_expression"),
        )

    out_polars, polars_map_time = measure_time(lambda: polars_map_function(df))
    pandas_df = df.to_pandas()
    _, pandas_map_time = measure_time(
        lambda: pandas_df.assign(shift_map=pandas_df['values'].shift(),
                                 shift_expression=pandas_df['values'].shift()))
    return pandas_map_time, polars_map_time, out_polars
# Resampling (DataFrame creation) time comparison: the construction itself is timed
def resampling_comparison():
    def polars_creation():
        return pl.DataFrame(
            {"time": pl.datetime_range(datetime(2021, 12, 16), datetime(2021, 12, 16, 3),
                                       interval="30m", eager=True),  # older Polars: pl.date_range(low=..., high=...)
             "groups": ["a", "a", "a", "b", "b", "a", "a"],
             "values": [1., 2., 3., 4., 5., 6., 7.]})

    def pandas_creation():
        return pd.DataFrame(
            {"time": pd.date_range(start=datetime(2021, 12, 16), end=datetime(2021, 12, 16, 3),
                                   freq='30min'),  # older pandas used freq='30T'
             "groups": ["a", "a", "a", "b", "b", "a", "a"],
             "values": [1., 2., 3., 4., 5., 6., 7.]})

    _, polars_time = measure_time(polars_creation)
    _, pandas_time = measure_time(pandas_creation)
    return pandas_time, polars_time
# Main function
if __name__ == "__main__":
    # Run each comparison exactly once and unpack both timings,
    # rather than running every benchmark twice.
    pandas_read, polars_read = read_time_comparison()
    pandas_sort, polars_sort = sorting_comparison()
    pandas_gb, polars_gb = groupby_comparison()
    pandas_map, polars_map, _ = map_function_comparison()
    pandas_resample, polars_resample = resampling_comparison()

    results = {
        "Operation": ["Read", "Sort", "GroupBy", "Map Function", "Resampling"],
        "Pandas Time": [pandas_read, pandas_sort, pandas_gb, pandas_map, pandas_resample],
        "Polars Time": [polars_read, polars_sort, polars_gb, polars_map, polars_resample],
    }
    # Build a DataFrame for the time comparison table
    comparison_df = pd.DataFrame(results)
    print(comparison_df)
Operation Pandas Time Polars Time
0 Read 0.008151 1.175642e-03
1 Sort 0.001095 4.956722e-04
2 GroupBy 0.002908 7.240772e-04
3 Map Function 0.002230 1.012325e-03
4 Resampling 0.000970 9.536743e-07
If you enjoyed this article, give me a follow and visit my profile for more such insights.