Dataflux for Google Cloud Storage Python client library

Overview

This is the client library backing the Dataflux Dataset for PyTorch. The purpose of this client is to quickly list and download data stored in GCS for use in Python machine learning applications. The core functionality of this client can be broken down into two key parts.

Fast List

The fast list component of this client leverages Python multiprocessing to parallelize the listing of files within a GCS bucket. It does this by implementing a work-stealing algorithm, in which each worker in the list operation can steal work from its siblings once it has finished all of its currently slated listing work. This parallelization yields real-world listing speeds up to 10 times faster than sequential listing. Note that parallelization is limited by the machine on which the client runs, and optimal performance is typically found with a worker count that is 1:1 with the available cores. Benchmarking has demonstrated that the larger the object count, the better Dataflux performs compared to a linear listing.

Example Code

from dataflux_core import fast_list

number_of_workers = 20
project = "MyProject"
bucket = "TargetBucket"
target_folder_prefix = "folder1/"

print("Fast list operation starting...")
list_result = fast_list.ListingController(
    max_parallelism=number_of_workers,
    project=project,
    bucket=bucket,
    prefix=target_folder_prefix,
).run()
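
As a brief follow-up, and assuming each listing result is an (object_name, size) tuple (the shape consumed by the download example later in this document), the results can be inspected directly:

# Assumption: each entry in list_result is an (object_name, size_in_bytes) tuple.
total_bytes = sum(size for _, size in list_result)
print(f"Listed {len(list_result)} objects totaling {total_bytes} bytes.")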

Storage Class

By default, fast list only lists objects of the STANDARD storage class in GCS buckets. This can be overridden by passing in a list of storage class strings to include when running the ListingController, as sketched below. This default behavior was chosen to avoid the retrieval costs associated with downloading objects in non-standard GCS storage classes. Details on GCS storage classes can be explored further in the Storage Class Documentation.
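
A minimal sketch of the override follows. Note that the storage_classes keyword here is a hypothetical name used for illustration; consult the ListingController signature in fast_list.py for the exact argument.

from dataflux_core import fast_list

# NOTE: the storage_classes keyword below is a hypothetical name used for
# illustration; check the ListingController signature in fast_list.py for
# the actual argument it accepts.
list_result = fast_list.ListingController(
    max_parallelism=20,
    project="MyProject",
    bucket="TargetBucket",
    storage_classes=["STANDARD", "NEARLINE"],
).run()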

API Call Count

Using fast list increases the total number of calls made to the GCS bucket. The increase can vary greatly based on the size of the workload and the number of threads, but our benchmarking has shown an upper bound of approximately 2x the number of API calls compared to a sequential list. API call count tracking is displayed in the logs when the log level is set to DEBUG. To enable these logs when running the tests, we recommend pytest's --log-cli-level=DEBUG flag.
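
Outside of the test suite, a minimal sketch for surfacing the same debug output from a standalone script uses Python's standard logging module; this assumes the client emits its call-count tracking through that framework:

import logging

# Assumption: the client reports API call counts via Python's standard
# logging module, so raising the level to DEBUG surfaces them here.
logging.basicConfig(level=logging.DEBUG)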

Fast List Benchmark Results

| File Count | VM Core Count | List Time Without Dataflux | List Time With Dataflux |
| --- | --- | --- | --- |
| 17944239 Obj | 48 Core | 1630.75s | 79.55s |
| 5000000 Obj | 48 Core | 289.95s | 23.43s |
| 1999002 Obj | 48 Core | 117.61s | 12.45s |
| 578411 Obj | 48 Core | 30.70s | 9.39s |
| 10013 Obj | 48 Core | 2.35s | 6.06s |

Composed Download

The composed download component of the client uses the results of the fast list to efficiently download the files necessary for a machine learning workload. When downloading files from remote stores, small file sizes often bottleneck download speed. To avoid this bottleneck, composed download leverages the GCS Compose Objects API to concatenate small files into larger composed files in GCS prior to downloading. This greatly improves download performance, particularly on datasets with very large numbers of small files.

Example Code

from dataflux_core import download

# The maximum size in bytes of a composite download object.
# If this value is set to 0, no composition will occur.
max_compose_bytes = 10000000
project = "MyProject"
bucket = "TargetBucket"

download_params = download.DataFluxDownloadOptimizationParams(
    max_compose_bytes
)

print("Download operation starting...")
download_result = download.dataflux_download(
    project_name=project,
    bucket_name=bucket,
    # The list_results parameter is the value returned by fast list in the previous code example.
    objects=list_result,
    dataflux_download_optimization_params=download_params,
)
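
As a brief follow-up, and assuming dataflux_download returns the downloaded objects as a list of bytes in listing order, the results can be checked directly:

# Assumption: download_result is a list of bytes objects, one per listed file.
print(f"Downloaded {len(download_result)} objects, "
      f"{sum(len(b) for b in download_result)} bytes total.")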

Multiple Download Options

Looking at the download code, you will notice three distinct download functions. The default function used in the dataflux-pytorch client is dataflux_download. The other two functions serve to improve performance for specific use cases.

Parallel Download

The dataflux_download_parallel function is the most performant stand-alone download function. When using the Dataflux client library in isolation, this is the recommended download function. The degree of parallelization must be tuned based on available CPU power and network bandwidth; a sketch follows below.
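
A minimal sketch of a parallel download follows, reusing list_result and download_params from the earlier examples. Note that the parallelization keyword is a hypothetical name used for illustration; consult download.py for the exact parameter accepted by dataflux_download_parallel.

from dataflux_core import download

# NOTE: the parallelization keyword below is a hypothetical name used for
# illustration; check download.py for the actual parameter accepted by
# dataflux_download_parallel.
download_result = download.dataflux_download_parallel(
    project_name="MyProject",
    bucket_name="TargetBucket",
    objects=list_result,
    dataflux_download_optimization_params=download_params,
    parallelization=8,
)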

Threaded Download

The dataflux_download_threaded function allows for some amount of download parallelization while running within daemonic processes (e.g., a distributed ML workload leveraging Ray). Daemonic processes are not permitted to spawn child processes, so threading must be used in these instances. Threaded download performance is similar to that of multiprocessing for most use cases, but loses ground as the thread/process count increases. Additionally, threading does not allow for signal interruption, so SIGINT cleanup triggers are disabled when running a threaded download. A sketch follows below.
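
A minimal sketch of a threaded download, mirroring the parallel example above. Note that the threads keyword is a hypothetical name used for illustration; consult download.py for the exact parameter accepted by dataflux_download_threaded.

from dataflux_core import download

# NOTE: the threads keyword below is a hypothetical name used for
# illustration; check download.py for the actual parameter accepted by
# dataflux_download_threaded.
download_result = download.dataflux_download_threaded(
    project_name="MyProject",
    bucket_name="TargetBucket",
    objects=list_result,
    dataflux_download_optimization_params=download_params,
    threads=8,
)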

Dataflux Download Benchmark Results

These benchmarks were performed on an n2-standard-48 (48 vCPU) virtual machine on files of approximately 10 KB each.

| Number of Objects | Standard Linear Download | Dataflux Composed Download | Dataflux Threaded Composed Download (48 Threads) | Dataflux Parallel Composed Download (48 Processes) |
| --- | --- | --- | --- | --- |
| 111 | 18.27 Seconds | 5.17 Seconds | 3.94 Seconds | 2.06 Seconds |
| 1111 | 176.22 Seconds | 61.78 Seconds | 5.21 Seconds | 3.14 Seconds |
| 11098 | 1396.98 Seconds | 392.23 Seconds | 16.85 Seconds | 14.88 Seconds |

Getting Started

To get started with the Dataflux client library, we encourage you to start from the Dataflux Dataset for PyTorch. For an example of a client-specific implementation, please see the benchmark code.

Support

Contributing

We welcome your feedback, issues, and bug fixes. We have a tight roadmap at this time, so if you have a major feature or change in functionality you'd like to contribute, please open a GitHub issue for discussion prior to sending a pull request. Please see CONTRIBUTING for more information on how to report bugs or submit pull requests.

Code of Conduct

This project has adopted the Google Open Source Code of Conduct. Please see code-of-conduct.md for more information.

License

The Dataflux Python client is licensed under the Apache License 2.0. Please see the LICENSE file for more information.
