Make your Data Exploration 10x faster using Polars
Polars is a DataFrame library written entirely in Rust. This piece walks through the fundamentals of Polars and its potential as a Pandas alternative.
In brief, Polars can be seen as a faster, more memory-efficient dataframe library compared to Pandas.
- Speedy performance: Polars consistently outpaces Pandas on common operations.
- Robust expression language: it offers a powerful, composable expression syntax.
- Lazy evaluation support: Polars can defer computations and optimize the whole query plan.
- Memory-friendly: it is designed to be memory-efficient as well.
Why Opt for Polars Over Pandas?
Pandas is known for its flexibility, but its largely single-threaded execution slows it down on big data: as datasets grow, processing times in Pandas grow with them.
Polars is built for handling big data. It combines lazy evaluation with parallel processing, spreading work across many cores for a substantial speed boost.
To install Polars, use pip:
pip install polars
Read time comparison:
Let’s gauge how quickly Polars performs compared to Pandas by measuring how long it takes to read data. We’ll employ Python’s ‘time’ module to achieve this. For instance, we’ll read the same CSV file using both Pandas and Polars and compare their read times.
import time
import pandas as pd
import polars as pl
# Measure read time with pandas
start_time = time.time()
pandas_df = pd.read_csv('google_historical_data.csv')
pandas_read_time = time.time() - start_time
# Measure read time with Polars
start_time = time.time()
polars_df = pl.read_csv('google_historical_data.csv')
polars_read_time = time.time() - start_time
print("Pandas read time:", pandas_read_time)
print("Polars read time:", polars_read_time)
Output:
Pandas read time: 0.010694742202758789
Polars read time: 0.0019254684448242188
From the displayed results, Polars reads the file roughly five times faster than Pandas in this run (exact numbers will vary with hardware and file size). The measurement is simply the difference between the timestamps taken before and after each read.
Sorting time comparison:
We expanded the evaluation to cover sorting as well. Using Python’s ‘time’ module again, we load the same CSV file into each library and then time the sort separately from the read.
import time
import pandas as pd
import polars as pl
# Measure pandas read and sort times separately
start_time = time.time()
pandas_df = pd.read_csv('google_historical_data.csv')
pandas_read_time = time.time() - start_time

start_time = time.time()  # reset, so the sort time excludes the read
pandas_sorted_df = pandas_df.sort_values(by='Date')
pandas_sort_time = time.time() - start_time

# Measure Polars read and sort times separately
start_time = time.time()
polars_df = pl.read_csv('google_historical_data.csv')
polars_read_time = time.time() - start_time

start_time = time.time()  # reset, so the sort time excludes the read
polars_sorted_df = polars_df.sort('Date')
polars_sort_time = time.time() - start_time
print("Pandas sorting time:", pandas_sort_time)
print("Polars sorting time:", polars_sort_time)
Output:
Pandas sorting time: 0.012721776962280273
Polars sorting time: 0.0016827583312988281
Reviewing the outcomes, Polars continues to outperform Pandas, on sorting as well as reading. The script measures the elapsed time around each library’s sort call, and the efficiency gap generally widens on larger datasets and more computationally intensive operations.
GroupBy time comparison:
import time
import pandas as pd
import polars as pl
# Sample data
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 15, 25, 30, 40, 35, 45]}
# Create a Pandas DataFrame
pandas_df = pd.DataFrame(data)
# Create a Polars DataFrame
polars_df = pl.DataFrame(data)
# Measure Pandas groupby time
start_time = time.time()
pandas_grouped = pandas_df.groupby('Category').sum()
pandas_groupby_time = time.time() - start_time
# Measure Polars groupby time
start_time = time.time()
polars_grouped = polars_df.group_by('Category').agg(pl.sum('Value'))  # older Polars releases spelled this .groupby
polars_groupby_time = time.time() - start_time
print("Pandas groupby time:", pandas_groupby_time)
print("Polars groupby time:", polars_groupby_time)
Output:
Pandas groupby time: 0.014025449752807617
Polars groupby time: 0.02013707160949707
This code groups a small table of categories and values, computes the per-category sum, and times the operation in both libraries. Note that Pandas actually wins on this tiny dataset: Polars’ parallel machinery has a fixed overhead that only pays off on larger data. Make sure you have the Pandas and Polars libraries installed to run this code.
Map function time comparison:
import time
import polars as pl

data = {'keys': ['a', 'a', 'b'],
        'values': [10, 7, 1]}
# Create a Polars DataFrame
df = pl.DataFrame(data)
# Measure time for the map function using Polars
start_time = time.time()
out_polars = df.group_by("keys", maintain_order=True).agg(  # older Polars: .groupby
    pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"),  # older Polars: .map
    pl.col("values").shift().alias("shift_expression"),
)
polars_map_time = time.time() - start_time
print("Polars map function time:", polars_map_time)
print(out_polars)
# Convert Polars DataFrame to Pandas DataFrame
pandas_df = df.to_pandas()
# Measure time for map function using Pandas
start_time = time.time()
pandas_df['shift_map'] = pandas_df['values'].shift()
pandas_df['shift_expression'] = pandas_df['values'].shift()
pandas_map_time = time.time() - start_time
print("Pandas map function time:", pandas_map_time)
print(pandas_df)
Output:
Polars map function time: 0.00629425048828125
shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map ┆ shift_expression │
│ --- ┆ --- ┆ --- │
│ str ┆ list[i64] ┆ list[i64] │
╞══════╪════════════╪══════════════════╡
│ a ┆ [null, 10] ┆ [null, 10] │
│ b ┆ [7] ┆ [null] │
└──────┴────────────┴──────────────────┘
Pandas map function time: 0.0028917789459228516
keys values shift_map shift_expression
0 a 10 NaN NaN
1 a 7 10.0 10.0
2 b 1 7.0 7.0
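Note what the Polars output above reveals: shift_map applied the shift over the whole column (group ‘b’ sees 7, leaked from group ‘a’), while shift_expression shifted within each group. The Pandas snippet only reproduced the whole-column behavior; the true per-group counterpart uses groupby().shift(), sketched here:

```python
import pandas as pd

df = pd.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})

# Whole-column shift (what the snippet above times): the 'b' row
# receives 7, leaked from the end of group 'a'.
df["shift_map"] = df["values"].shift()

# Per-group shift, the counterpart of Polars' shift_expression:
# each group restarts with NaN instead of seeing the previous group's value.
df["shift_expression"] = df.groupby("keys")["values"].shift()

print(df)
```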
Resampling (DataFrame creation) time comparison:
import time
from datetime import datetime
import pandas as pd
import polars as pl

# Measure time for creating the DataFrame using Polars
start_time = time.time()
df_polars = pl.DataFrame(
    {"time": pl.datetime_range(datetime(2021, 12, 16), datetime(2021, 12, 16, 3),
                               interval="30m", eager=True),  # older Polars: pl.date_range(low=..., high=...)
     "groups": ["a", "a", "a", "b", "b", "a", "a"],
     "values": [1., 2., 3., 4., 5., 6., 7.]
    })
polars_time = time.time() - start_time
print("Polars DataFrame creation time:", polars_time)
print(df_polars)

# Measure time for creating the DataFrame using Pandas
start_time = time.time()
df_pandas = pd.DataFrame(
    {"time": pd.date_range(start=datetime(2021, 12, 16), end=datetime(2021, 12, 16, 3),
                           freq='30min'),  # older pandas used freq='30T'
     "groups": ["a", "a", "a", "b", "b", "a", "a"],
     "values": [1., 2., 3., 4., 5., 6., 7.]})
pandas_time = time.time() - start_time
print("Pandas DataFrame creation time:", pandas_time)
print(df_pandas)
Output:
Polars DataFrame creation time: 0.00010323524475097656
shape: (7, 3)
┌─────────────────────┬────────┬────────┐
│ time ┆ groups ┆ values │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ f64 │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a ┆ 1.0 │
│ 2021-12-16 00:30:00 ┆ a ┆ 2.0 │
│ 2021-12-16 01:00:00 ┆ a ┆ 3.0 │
│ 2021-12-16 01:30:00 ┆ b ┆ 4.0 │
│ 2021-12-16 02:00:00 ┆ b ┆ 5.0 │
│ 2021-12-16 02:30:00 ┆ a ┆ 6.0 │
│ 2021-12-16 03:00:00 ┆ a ┆ 7.0 │
└─────────────────────┴────────┴────────┘
Pandas DataFrame creation time: 0.00010347366333007812
time groups values
0 2021-12-16 00:00:00 a 1.0
1 2021-12-16 00:30:00 a 2.0
2 2021-12-16 01:00:00 a 3.0
3 2021-12-16 01:30:00 b 4.0
4 2021-12-16 02:00:00 b 5.0
5 2021-12-16 02:30:00 a 6.0
6 2021-12-16 03:00:00 a 7.0
Exploring the Data
Dive into data exploration by obtaining summary statistics like count, mean, minimum, maximum, and more. Achieve this using the “describe” method as shown.
df.describe()
The “shape” method provides information about the data frame’s size, specifically the total count of rows and columns.
print(df.shape)
Output:
(251, 7)
Full code and summary table for all the time comparisons:
import time
import pandas as pd
import polars as pl
from datetime import datetime
def measure_time(func):
    start_time = time.time()
    result = func()
    elapsed_time = time.time() - start_time
    return result, elapsed_time
# Read time comparison
def read_time_comparison():
    _, pandas_read_time = measure_time(lambda: pd.read_csv('google_historical_data.csv'))
    _, polars_read_time = measure_time(lambda: pl.read_csv('google_historical_data.csv'))
    return pandas_read_time, polars_read_time
# Sorting time comparison (reads are done up front, so only the sort is timed)
def sorting_comparison():
    pandas_df = pd.read_csv('google_historical_data.csv')
    polars_df = pl.read_csv('google_historical_data.csv')
    _, pandas_sort_time = measure_time(lambda: pandas_df.sort_values(by='Date'))
    _, polars_sort_time = measure_time(lambda: polars_df.sort('Date'))
    return pandas_sort_time, polars_sort_time
# GroupBy function time comparison
def groupby_comparison():
    data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
            'Value': [10, 20, 15, 25, 30, 40, 35, 45]}
    pandas_df = pd.DataFrame(data)
    polars_df = pl.DataFrame(data)
    _, pandas_groupby_time = measure_time(lambda: pandas_df.groupby('Category').sum())
    _, polars_groupby_time = measure_time(
        lambda: polars_df.group_by('Category').agg(pl.sum('Value')))  # older Polars: .groupby
    return pandas_groupby_time, polars_groupby_time
# Map function time comparison
def map_function_comparison():
    data = {'keys': ['a', 'a', 'b'],
            'values': [10, 7, 1]}
    df = pl.DataFrame(data)

    def polars_map_function(df):
        return df.group_by("keys", maintain_order=True).agg(  # older Polars: .groupby
            pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"),  # older Polars: .map
            pl.col("values").shift().alias("shift_expression"),
        )

    out_polars, polars_map_time = measure_time(lambda: polars_map_function(df))
    pandas_df = df.to_pandas()
    _, pandas_map_time = measure_time(
        lambda: pandas_df.assign(shift_map=pandas_df['values'].shift(),
                                 shift_expression=pandas_df['values'].shift()))
    return pandas_map_time, polars_map_time, out_polars
# Resampling (DataFrame creation) time comparison: the construction itself is timed
def resampling_comparison():
    def polars_creation():
        return pl.DataFrame(
            {"time": pl.datetime_range(datetime(2021, 12, 16), datetime(2021, 12, 16, 3),
                                       interval="30m", eager=True),  # older Polars: pl.date_range(low=..., high=...)
             "groups": ["a", "a", "a", "b", "b", "a", "a"],
             "values": [1., 2., 3., 4., 5., 6., 7.]})

    def pandas_creation():
        return pd.DataFrame(
            {"time": pd.date_range(start=datetime(2021, 12, 16), end=datetime(2021, 12, 16, 3),
                                   freq='30min'),  # older pandas used freq='30T'
             "groups": ["a", "a", "a", "b", "b", "a", "a"],
             "values": [1., 2., 3., 4., 5., 6., 7.]})

    _, polars_time = measure_time(polars_creation)
    _, pandas_time = measure_time(pandas_creation)
    return pandas_time, polars_time
# Main function
if __name__ == "__main__":
    # Run each comparison exactly once and unpack both timings,
    # rather than running every benchmark twice.
    pandas_read, polars_read = read_time_comparison()
    pandas_sort, polars_sort = sorting_comparison()
    pandas_gb, polars_gb = groupby_comparison()
    pandas_map, polars_map, _ = map_function_comparison()
    pandas_resample, polars_resample = resampling_comparison()

    results = {
        "Operation": ["Read", "Sort", "GroupBy", "Map Function", "Resampling"],
        "Pandas Time": [pandas_read, pandas_sort, pandas_gb, pandas_map, pandas_resample],
        "Polars Time": [polars_read, polars_sort, polars_gb, polars_map, polars_resample],
    }
    # Build a DataFrame for the time comparison table
    comparison_df = pd.DataFrame(results)
    print(comparison_df)
Operation Pandas Time Polars Time
0 Read 0.008151 1.175642e-03
1 Sort 0.001095 4.956722e-04
2 GroupBy 0.002908 7.240772e-04
3 Map Function 0.002230 1.012325e-03
4 Resampling 0.000970 9.536743e-07
If you enjoyed this article, give me a follow and visit my profile for more such insights.