
Storage

The SPAI library also includes a module called storage, containing a set of Storage object declarations for easily managing different types of storage. This is especially useful for EO applications that need to manage cloud buckets for TIFF images, vector files, and other analytics files such as .csv files or DataFrames.

The Storage class serves as an interface for accessing different storage backends. It inherits from the BaseStorage class and dynamically sets up access to various storage options, such as local file systems and Amazon S3, based on the environment variables provided (if a SPAI project has been created, the storage environment variables are generated automatically from the spai.config.yaml file). The storage backends available to the Storage class are determined by these environment variables, allowing flexible configuration and initialization of the storage solutions in use.

class Storage(BaseStorage):

Attributes:

  • storage_names: An attribute that holds a list of storage backend names as strings. These names are read from the "SPAI_STORAGE_NAMES" environment variable and represent the storage backends that are enabled and can be utilized by the Storage class.

  • storages: A dictionary attribute that maps the storage backend names (keys) to their respective initialized instances (values). This allows for quick retrieval and manipulation of the different storage backends configured.

Methods:

  • __init__(self)

Initializes the Storage class instance by reading the relevant environment variables and setting up storage backends as specified in those variables.

  • __getitem__(self, name)

Provides a convenient dictionary-like access to the storage backends using the storage backend name.

  • Parameters:

    • name: A string representing the name of the storage backend to access.
  • Returns: Returns the storage backend instance corresponding to the provided name.

  • initialize_storage(self)

Interprets the environment variables and initializes each named storage backend with the appropriate settings and configurations. This method is responsible for the bootstrapping process of all storage backends.

  • initialize_local(self)

Specifically initializes a local storage backend, setting up the file system paths and other necessary details for the storage backend to function correctly with local files.

  • initialize_s3(self)

Specifically initializes an Amazon S3 storage backend, configuring the necessary credentials, bucket names, and other required parameters to interact with S3 services.
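
As a rough illustration of this bootstrapping process, the sketch below reads a comma-separated SPAI_STORAGE_NAMES variable and builds a name-to-backend mapping. The per-backend SPAI_STORAGE_TYPE_* variable and the placeholder backend values are illustrative assumptions, not the library's actual implementation:

```python
import os

class StorageSketch:
    """Illustrative sketch of environment-driven backend initialization."""

    def __init__(self):
        # Comma-separated backend names, e.g. "data,cloud"
        self.storage_names = os.environ.get("SPAI_STORAGE_NAMES", "").split(",")
        self.storages = {}
        self.initialize_storage()

    def initialize_storage(self):
        for name in self.storage_names:
            # Hypothetical per-backend type variable, e.g. SPAI_STORAGE_TYPE_CLOUD=s3
            kind = os.environ.get(f"SPAI_STORAGE_TYPE_{name.upper()}", "local")
            if kind == "local":
                self.storages[name] = f"<local backend '{name}'>"
            elif kind == "s3":
                self.storages[name] = f"<s3 backend '{name}'>"

    def __getitem__(self, name):
        # Dictionary-like access by backend name
        return self.storages[name]
```

The real class initializes LocalStorage and S3Storage instances instead of the placeholder strings used here.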

Initialization

Imagine you have a spai.config.yaml with two storages defined as:

storage:
  - name: data
    type: local
    path: data            # optional
  - name: cloud
    type: s3
    url: localhost:9000   # optional
    access: devuser       # optional
    secret: devpassword   # optional
    region: us-east       # optional
    bucket: devbucket     # optional

After this, it is possible to instantiate a Storage object in your code like this:

from spai.storage import Storage

storage = Storage()

After Storage object initialization, it is possible to access both types of storages like this:

data = storage["data"]      # Accessing local storage by name
cloud = storage["cloud"]    # Accessing S3 storage by name

Local Storage

LocalStorage is a storage system that initializes a directory on the local filesystem. This storage class inherits from BaseStorage and is responsible for managing a local directory where data can be stored.

class LocalStorage(BaseStorage):

Attributes:

  • path (str): The file system path to the directory where data will be stored.

Methods:

  • __init__(self, path="data"): Initializes the local storage with the given path.

Upon initialization, if the directory does not exist at the specified path, it will be created. If it exists, a ready-to-use message is printed.

  • exists(self, name): Checks if a file with the given name exists in the storage.

  • get_path(self, name): Constructs and returns the full path to a file with the given name within the local storage.

  • create_from_path(self, data, name): Moves a file from its current location to the storage, renaming it if necessary.

  • create_from_dict(self, data, name): Saves a dictionary to a file in JSON format within the storage path.

  • create_from_string(self, data, name): Writes a string to a plain text file within the storage.

  • create_from_dataframe(self, data, name): Saves a pandas DataFrame in CSV or JSON format within the storage, based on the file extension provided.

  • create_from_image(self, data, name): Saves an image file within the storage.

  • create_from_rasterio(self, rio, x, name, ds, window=None): Saves a raster dataset from a NumPy array using rasterio to the storage.

  • create_from_array(self, data, name): Saves a NumPy array to a binary file within the storage.

  • create_from_csv(self, data, name): Saves a pandas DataFrame as a CSV file within the storage.

  • create_from_json(self, data, name): Saves data in JSON format to a file within the storage.

  • create_from_parquet(self, data, name): Saves a pandas DataFrame as a Parquet file within the storage.

  • create_from_zarr(self, data, name): Saves data as a zarr archive within the storage.

  • list(self, pattern="*"): Lists all file names in the storage that match the given pattern.

  • read_from_array(self, name): Loads a NumPy array from a binary file within the storage.

  • read_from_rasterio(self, rio, name): Loads a raster file into a rasterio dataset.

  • read_from_csv(self, name): Reads a CSV file into a pandas DataFrame.

  • read_from_json(self, name): Reads a JSON file into a pandas DataFrame.

  • read_from_geojson(self, gpd, name): Reads a GeoJSON file and converts it to a GeoDataFrame.

  • read_from_parquet(self, gpd, name): Reads a Parquet file and converts it to a DataFrame.

  • read_from_zarr(self, xr, name): Opens a zarr archive and returns an xarray Dataset.
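
To make the behavior above concrete, here is a simplified, self-contained sketch of a local backend. It mirrors the documented __init__, get_path, exists and create_from_dict behavior, but for simplicity read_from_json returns a plain dict rather than a DataFrame; none of this is the library's actual code:

```python
import json
import os

class LocalStorageSketch:
    """Simplified illustration of the documented LocalStorage behavior."""

    def __init__(self, path="data"):
        self.path = path
        # Create the storage directory if it does not exist yet
        os.makedirs(self.path, exist_ok=True)

    def get_path(self, name):
        # Full path of a file inside the storage directory
        return os.path.join(self.path, name)

    def exists(self, name):
        return os.path.exists(self.get_path(name))

    def create_from_dict(self, data, name):
        # Persist a dictionary as JSON inside the storage
        with open(self.get_path(name), "w") as f:
            json.dump(data, f)
        return self.get_path(name)

    def read_from_json(self, name):
        # Simplified: returns a dict, not a pandas DataFrame
        with open(self.get_path(name)) as f:
            return json.load(f)
```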

S3 Cloud Storage

S3Storage is a storage system that initializes and interacts with an S3-compatible object storage service. This storage class inherits from BaseStorage and is meant for handling objects in remote buckets using the MinIO client.

class S3Storage(BaseStorage):

Attributes:

  • url (str): The URL of the S3-compatible service.
  • access (str): The access key used to authenticate with the storage service.
  • secret (str): The secret key used to authenticate with the storage service.
  • bucket (str): The name of the default bucket where data will be stored.
  • region (str, optional): The region where the bucket is located (may be necessary for certain storage providers).
  • client (Minio): An instance of the Minio client for interacting with the object storage.

Methods:

  • __init__(self, url, access, secret, bucket, region=None): Initializes the S3 storage with the given credentials and bucket.

When the storage is initialized, it creates an S3 client instance with the provided details. If the specified bucket does not exist, it is created and a creation message is printed; otherwise, a ready-to-use message is displayed. If no credentials are provided, the bucket is created within EarthPulse’s cloud infrastructure, which incurs usage charges.

  • get_path(self, name): Returns a full path for an object within the bucket.
  • get_url(self, name): Provides a presigned URL to access the specified object.
  • list(self, pattern="*"): Lists all objects in the bucket that match the given pattern.
  • create_from_path(self, data, name): Stores an object in the bucket based on the file path provided.
  • create_from_image(self, data, name): Stores an image object in the bucket.
  • create_from_rasterio(self, rio, x, name, ds, window=None): Stores a rasterio object.
  • create_from_array(self, data, name): Stores a NumPy array.
  • create_from_csv(self, data, name): Stores a pandas DataFrame as a CSV file.
  • create_from_json(self, data, name): Stores a JSON file from a pandas DataFrame or a dictionary.
  • create_from_dict(self, data, name): Stores a dictionary as a JSON file.
  • create_from_string(self, data, name): Stores a string as a plain text file.
  • create_from_dataframe(self, data, name): Stores a pandas DataFrame in either CSV or JSON format.
  • read_object(self, name): Retrieves an object as BytesIO from the bucket.
  • read_from_json(self, name): Reads a JSON file from the bucket into a pandas DataFrame.
  • read_from_array(self, name): Reads binary data into a NumPy array.
  • read_from_geojson(self, gpd, name): Reads a GeoJSON file and converts it to a GeoDataFrame.
  • read_from_rasterio(self, rio, name): Reads a raster file into a rasterio dataset.
  • read_from_csv(self, name): Reads a CSV file into a pandas DataFrame.
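
A plausible sketch of such a backend built on the MinIO client is shown below. The lazy client property and the secure-flag heuristic are illustrative assumptions; only the Minio constructor, bucket_exists and make_bucket calls are real MinIO client APIs:

```python
class S3StorageSketch:
    """Illustrative sketch of an S3-compatible backend using the MinIO client."""

    def __init__(self, url, access, secret, bucket, region=None):
        self.url = url
        self.access = access
        self.secret = secret
        self.bucket = bucket
        self.region = region
        self._client = None  # built lazily on first use

    @property
    def client(self):
        if self._client is None:
            from minio import Minio  # requires the 'minio' package
            self._client = Minio(
                self.url,
                access_key=self.access,
                secret_key=self.secret,
                region=self.region,
                # Assumption: plain HTTP for local development endpoints
                secure=not self.url.startswith("localhost"),
            )
            # Create the default bucket if it does not exist yet
            if not self._client.bucket_exists(self.bucket):
                self._client.make_bucket(self.bucket)
        return self._client

    def get_path(self, name):
        # Object path within the default bucket
        return f"{self.bucket}/{name}"
```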

CRUD Operations

Once local and/or S3 storages are initialized, you can interact with them in a unified way thanks to the BaseStorage superclass.

class BaseStorage:
    """
    A base storage class providing CRUD operations for different types of data representation.
    """

Create

Creates a new storage entity based on the data provided. It can also update an existing storage object.

create(self, data, name, **kwargs):
  • data: The data to be stored. This might be a string representing either file content or a path to a file, an image object from the PIL.Image module, a NumPy array, a pandas DataFrame, or a dictionary.
  • name: The name (including extension) to be used for the file that represents the stored data.
  • **kwargs: Additional keyword arguments that may be required for certain data types, such as a dataset (ds) for GeoTIFF files when working with rasterio.

Example:

# Local Storage
data.create(data='source/growth.json', name='growth.json')
# S3 Storage
cloud.create(data='data/S2L2A_2019-06-02.tif', name='S2L2A_2019-06-02.tif')

Read

Reads data from storage.

read(self, name):
  • name: The name of the file to be read. The extension is used to determine the method of reading the data.

Example:

# Local Storage
data.read('S2L2A_2019-06-02.tif')
# S3 Storage
cloud.read('S2L2A_2019-06-02.tif')

List

Lists all entities in the storage that match the given pattern.

list(self, pattern="*"):
  • pattern: A pattern string to filter the files. Defaults to "*" which lists all files.

Example:

# Local Storage
data.list()             # all files
data.list("*.tif")      # only files matching the pattern
# S3 Storage
cloud.list()

Delete

Deletes data from storage.

delete(self, name):
  • name: The name of the file to be deleted from the storage.

Example:

# Local Storage
data.delete('verges.geojson')
# S3 Storage
cloud.delete('S2L2A_2019-06-02.tif')

Troubleshooting

If you encounter any issues while using SPAI, please get in touch with us through our Discord server.
