Pandas is one of the most widely used Python libraries for data analysis. With Storj, you have the option of decentralized storage for your datasets, which offers greater durability, privacy, and security. This tutorial is a short walkthrough of using decentralized storage from pandas.
This post was originally published in Storj.
In this post, we will look at how to use pandas to save and load data to Storj DCS.
Requirements for Decentralized Storage
We rely on a pandas feature called storage_options. This extra parameter lets us pass connection settings for a specific storage backend. It was introduced in version 1.2.0; further details are in the pandas documentation for storage_options.
From the pandas 0.20.1 documentation:
“pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.”
Installing pandas Version 1.2.0
pip3 install pandas==1.2.0
Installing s3fs
pip3 install s3fs
Configuring pandas
If you already have a Storj DCS account, you just need your access keys and endpoint URL.
We are going to load the credentials from environment variables. You should have these three variables available: ACCESS_KEY_ID, SECRET_ACCESS_KEY, and ENDPOINT_URL.
This configuration works for all methods that accept custom storage options, such as read_csv, read_excel, read_table, etc.
import os
# loading environment variables
ACCESS_KEY_ID = os.getenv("ACCESS_KEY_ID")
SECRET_ACCESS_KEY = os.getenv("SECRET_ACCESS_KEY")
ENDPOINT_URL = os.getenv("ENDPOINT_URL")
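Note that os.getenv returns None when a variable is not set, so a missing credential would only show up later as a failed connection. A minimal sketch of an early check follows; the helper name missing_variables is hypothetical and not part of pandas or s3fs.

```python
import os

# The three variable names used in this post.
REQUIRED_VARIABLES = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "ENDPOINT_URL")


def missing_variables(environ=os.environ, required=REQUIRED_VARIABLES):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All Storj DCS credentials are set.")
```

Running the check before building the configuration makes a forgotten export easy to spot.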
We need to override client_kwargs and set endpoint_url. In this case, the address must be the gateway URL. Example: https://gateway.us1.storjshare.io
storage_options = {
    'key': ACCESS_KEY_ID,
    'secret': SECRET_ACCESS_KEY,
    'client_kwargs': {
        'endpoint_url': ENDPOINT_URL
    }
}
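If several scripts need the same settings, the dictionary above can be wrapped in a small function. This is only a sketch: build_storage_options is a hypothetical helper name, and the key values shown are placeholders.

```python
def build_storage_options(access_key_id, secret_access_key, endpoint_url):
    """Return a storage_options dict pointing s3fs at a custom S3 gateway."""
    return {
        "key": access_key_id,
        "secret": secret_access_key,
        # client_kwargs is forwarded to the underlying S3 client,
        # so endpoint_url replaces the default AWS endpoint.
        "client_kwargs": {"endpoint_url": endpoint_url},
    }


# Placeholder values for illustration only; use your own credentials.
example_options = build_storage_options(
    "my-access-key", "my-secret-key", "https://gateway.us1.storjshare.io"
)
print(example_options["client_kwargs"]["endpoint_url"])
```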
Saving Data to Storj DCS
In this blog post, we are going to save and load a pandas DataFrame in CSV format. Other formats are supported too, as mentioned in the previous section.
import numpy as np
import pandas as pd
bucket = "mybucket"
key = "random.csv"
# Creating a random dataframe.
df = pd.DataFrame(np.random.uniform(0,1,[10**3,3]), columns=list('ABC'))
# Saving as CSV
df.to_csv(
    f"s3://{bucket}/{key}",
    index=False,
    storage_options=storage_options)
Loading Data from Storj DCS
Loading works the same way: pass storage_options as a parameter.
new_df = pd.read_csv(
    f"s3://{bucket}/{key}",
    storage_options=storage_options)
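To confirm that what comes back matches what was saved, the two steps can be combined into a round-trip check. The sketch below is one possible way to do it; roundtrip_matches is a hypothetical helper, and the demonstration writes to a local temporary file so that it runs without credentials.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def roundtrip_matches(df, path, storage_options=None):
    """Write df as CSV to path, read it back, and compare the two frames."""
    df.to_csv(path, index=False, storage_options=storage_options)
    loaded = pd.read_csv(path, storage_options=storage_options)
    try:
        # The default comparison tolerates tiny floating-point differences
        # that can appear when floats pass through text.
        pd.testing.assert_frame_equal(df, loaded)
    except AssertionError:
        return False
    return True


# Local demonstration with a small random dataframe.
sample = pd.DataFrame(np.random.uniform(0, 1, [10, 3]), columns=list("ABC"))
with tempfile.TemporaryDirectory() as tmp:
    local_ok = roundtrip_matches(sample, os.path.join(tmp, "random.csv"))
print(local_ok)
```

Against Storj DCS, the same function would be called as roundtrip_matches(df, f"s3://{bucket}/{key}", storage_options).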
Conclusion
Using pandas with Storj DCS is easy and requires only a few lines of configuration.
If you already use pandas with S3, migrating to Storj DCS is straightforward.