
How do I load very large datasets to Catalog?

Building ingest pipelines with DL Tasks

Datasets in the tens to hundreds of gigabytes (or larger) are generally too unwieldy to upload from a single machine. In these cases it is best to ingest directly from a cloud storage system into the DL Catalog, which eliminates the bottleneck introduced by a single local machine and its (generally slower) network connection. The Tasks scaled compute environment lets an ingest like this maximize throughput by running ingest jobs in parallel.
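
As a rough sketch of the unit of work, the function below downloads a single source file from Google Cloud Storage and uploads it to Catalog; one such task runs per image. It assumes the Catalog Python client's `Image.upload` method and the standard google-cloud-storage library; the product ID, GCS path layout, and acquired date are placeholders rather than part of this article.

```python
def ingest_one(gcs_path, product_id="myorg:my-large-product"):
    """Download one source file from cloud storage and upload it to Catalog.

    Runs inside a Tasks worker, so data moves over the (fast) cloud-to-cloud
    network path rather than a local workstation's connection.
    """
    # Imports live inside the function so they are resolved on the worker.
    import os
    from google.cloud import storage
    from descarteslabs.catalog import Image, Product

    # Fetch the source file onto the worker's local disk.
    bucket_name, _, blob_name = gcs_path.partition("/")
    local_path = os.path.join("/tmp", os.path.basename(blob_name))
    storage.Client().bucket(bucket_name).blob(blob_name).download_to_filename(local_path)

    # Create the Catalog Image entry and upload the file.
    product = Product.get(product_id)
    image = Image(product=product, name=os.path.splitext(os.path.basename(local_path))[0])
    image.acquired = "2023-01-01"  # placeholder; normally parsed from scene metadata
    upload = image.upload(local_path)
    upload.wait_for_completion()
    return image.id
```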

Prerequisites

Imagery loaded in this manner must reside on a cloud datastore capable of handling many concurrent download requests. This includes cloud object stores such as Google Cloud Storage, Amazon S3, and Azure Storage, as well as the Storage service provided by Descartes Labs. Many imagery providers will deliver data directly into these services. If given a choice, Google Cloud Storage is the most efficient service to ingest from, as the DL platform runs on Google Cloud Platform.
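
For example, when the source imagery sits in a Google Cloud Storage bucket, the objects to ingest can be enumerated up front and used as the arguments for individual tasks. This minimal sketch uses the standard google-cloud-storage client; the bucket name, prefix, and file extension are placeholders.

```python
from google.cloud import storage


def list_source_scenes(bucket_name="my-imagery-bucket", prefix="deliveries/2023/"):
    """Return the GCS paths of all GeoTIFFs under a prefix.

    Each returned path becomes the argument for one ingest task.
    """
    client = storage.Client()
    return [
        f"{bucket_name}/{blob.name}"
        for blob in client.list_blobs(bucket_name, prefix=prefix)
        if blob.name.endswith(".tif")
    ]
```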

Pipeline architecture

When building an ingest pipeline on Tasks, consider where in the processing chain delays are most likely to occur. For instance, if the service housing the imagery supports only a low read concurrency, the maximum concurrent worker setting on the Task group should be set to match. Otherwise, the cloud storage system's rate limits will be exceeded, which in turn leads to task timeouts and retries in DL Tasks. Also consider the size of the imagery and how much preprocessing each image requires when sizing the task group workers.
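
To make the sizing concrete, the sketch below registers the ingest function as a Task group whose concurrency is capped to match a source store that tolerates roughly 20 concurrent reads, with retries and a per-task timeout to absorb transient failures. It assumes the Tasks client's `create_function` accepts `maximum_concurrency`, `retry_count`, and `task_timeout` keyword arguments; the Docker image tag and the specific numbers are illustrative only.

```python
import descarteslabs as dl

# Assume the source store tolerates ~20 concurrent reads and each image
# takes a few minutes to download, preprocess, and upload.
async_func = dl.tasks.create_function(
    ingest_one,                       # per-image ingest function from above
    name="catalog-bulk-ingest",
    image="us.gcr.io/dl-ci-cd/images/tasks/public/py3.8:latest",  # placeholder tag
    maximum_concurrency=20,           # match the source store's read limit
    retry_count=3,                    # retry transient download/upload failures
    task_timeout=1800,                # per-task timeout in seconds
)

# Fan out one task per source file; Tasks schedules them across workers.
tasks = [async_func(path) for path in list_source_scenes()]
```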