Managing Data Files With Decentralized Storage

Pandas is one of the most widely used Python libraries for data analysis. With Storj, you have the option of decentralized storage for your datasets, which offers greater durability, privacy and security. This tutorial walks through a short implementation of decentralized storage.

This post was originally published in Storj.

In this post, we will look at how to use pandas to save and load data to Storj DCS.

Requirements for Decentralized Storage

We rely on a pandas feature called storage_options. This extra option lets us pass connection settings to a specific storage backend. It was introduced in version 1.2.0; further details are in the storage_options section of the pandas documentation.

From the pandas 0.20.1 documentation:

“pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.”


Installing pandas Version 1.2.0

pip3 install pandas==1.2.0

Installing s3fs

pip3 install s3fs
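Before going further, it can help to confirm that both packages are installed and that pandas is at least 1.2.0. The following is a minimal sketch using only the standard library; the helper names version_tuple, meets_minimum and check_requirements are illustrative and not part of pandas or s3fs, and the version parsing is deliberately simplistic (it looks only at the leading digits of the first three components).

```python
import re
from importlib import metadata

MINIMUM_PANDAS = "1.2.0"  # release that introduced storage_options


def version_tuple(version):
    """Turn '1.2.0' or '2.1.0rc1' into a comparable tuple of three ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum=MINIMUM_PANDAS):
    """True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements():
    """Return a list of problems found; an empty list means ready to go."""
    problems = []
    for package in ("pandas", "s3fs"):
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if package == "pandas" and not meets_minimum(installed):
            problems.append(f"pandas {installed} is older than {MINIMUM_PANDAS}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Reading the installed version through importlib.metadata avoids importing pandas itself, so the check still reports a useful message when a package is missing.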

Configuring pandas

If you already have a Storj DCS account, you just need to get your keys and endpoint URL.

We are going to load the credentials from environment variables. You should have these three variables available: ACCESS_KEY_ID, SECRET_ACCESS_KEY and ENDPOINT_URL.

This configuration works for all methods that allow custom storage options, such as read_csv, read_excel and read_table.

import os

# Load the credentials from environment variables
ACCESS_KEY_ID = os.getenv("ACCESS_KEY_ID")
SECRET_ACCESS_KEY = os.getenv("SECRET_ACCESS_KEY")
ENDPOINT_URL = os.getenv("ENDPOINT_URL")

We need to override client_kwargs and set endpoint_url. In this case the address must be the gateway URL, for example https://gateway.us1.storjshare.io

storage_options = {
    'key': ACCESS_KEY_ID,
    'secret': SECRET_ACCESS_KEY,
    'client_kwargs': {
        'endpoint_url': ENDPOINT_URL
    }
}
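Because os.getenv returns None for an unset variable, a missing credential would only surface later as a connection error. One possible way to fail early is sketched below; build_storage_options and REQUIRED_VARIABLES are hypothetical names introduced for illustration, and the returned dictionary has the same shape as the storage_options used in this post.

```python
import os

# The three variables this post expects to find in the environment.
REQUIRED_VARIABLES = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "ENDPOINT_URL")


def build_storage_options(env=None):
    """Build the storage_options dict, raising if a variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "key": env["ACCESS_KEY_ID"],
        "secret": env["SECRET_ACCESS_KEY"],
        "client_kwargs": {"endpoint_url": env["ENDPOINT_URL"]},
    }
```

Calling build_storage_options() with no argument reads from the real environment; passing a plain dictionary instead makes the helper easy to exercise without touching real credentials.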

Saving Data to Storj DCS

In this blog post, we are going to save and load our pandas DataFrame in CSV format. Other formats are supported too, as mentioned in the previous section.

import numpy as np
import pandas as pd

bucket = "mybucket"
key = "random.csv"

# Creating a random dataframe.
df = pd.DataFrame(np.random.uniform(0, 1, [10**3, 3]), columns=list('ABC'))

# Saving as CSV
df.to_csv(
    f"s3://{bucket}/{key}",
    index=False,
    storage_options=storage_options)

Loading Data from Storj DCS

Loading works the same way: pass storage_options as a parameter.

new_df = pd.read_csv(
    f"s3://{bucket}/{key}",
    storage_options=storage_options)
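To confirm that what comes back matches what was saved, the save and load steps can be wrapped in a small round-trip check. This is only a sketch: round_trip_csv and frames_match are hypothetical helper names, and the comparison uses a floating-point tolerance because CSV is a text format and the last digit of a float may not survive the trip exactly.

```python
import numpy as np
import pandas as pd


def round_trip_csv(df, url, storage_options=None):
    """Write df to url as CSV, read it back, and return the reloaded frame."""
    df.to_csv(url, index=False, storage_options=storage_options)
    return pd.read_csv(url, storage_options=storage_options)


def frames_match(original, reloaded):
    """True when column names and shapes agree and values are numerically close."""
    if list(original.columns) != list(reloaded.columns):
        return False
    if original.shape != reloaded.shape:
        return False
    return bool(np.allclose(original.to_numpy(), reloaded.to_numpy()))
```

With the variables defined earlier in this post, the check would be frames_match(df, round_trip_csv(df, f"s3://{bucket}/{key}", storage_options)). Leaving storage_options as None and passing a local file path exercises the same code without any network access.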

Conclusion

Using pandas with Storj DCS requires only a few lines of configuration.

If you already use pandas with S3, migrating to Storj DCS is straightforward.

About the author

Radiostud.io Staff

Showcasing and curating a knowledge base of tech use cases from across the web.
