Parallelizing Processes
Parallelizing Raster, Storage, and Metadata requests, and Workflows jobs
- What and when to parallelize?
- For interactive work, parallelize locally on a single machine
- To run a large job reliably over a long period (e.g. overnight), use Tasks
- Best practices for Tasks are published in the docs
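As a minimal sketch of the local, interactive case: a thread pool from Python's standard library can fan out I/O-bound calls. The `fetch` function here is a hypothetical stand-in for a real API request, not part of any client library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(item):
    # Hypothetical stand-in for an I/O-bound call (e.g. one API request).
    return item * 2

items = list(range(10))
results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit all units of work, then collect them as they finish.
    futures = {pool.submit(fetch, i): i for i in items}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()
```

Threads suit I/O-bound work (network requests); for CPU-bound work a `ProcessPoolExecutor` avoids the GIL at the cost of inter-process serialization.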
Use cases for parallel processing
- Speed up access to an API (ex: requests to Storage, or Task results)
- Download a large collection of model outputs (ex: csv files) from DL Storage
- Quickly download imagery across many AOIs with Raster
- Handle a compute-heavy workload that can be split into pieces
- Deploy a model over a large AOI split into tiles
- Scale handling of large datasets that can be split into pieces
- Ingest a large dataset consisting of many images on an object storage service (Google Cloud Storage, Azure Storage)
- Leverage larger memory footprints in DL Tasks to download very large files and upload to Catalog
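The "download a large collection of model outputs from Storage" use case above can be sketched with a thread pool. `FAKE_STORAGE` and `download` are hypothetical stand-ins for a real Storage client and its get-by-key call; only the fan-out pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a Storage service: key -> bytes.
FAKE_STORAGE = {f"model_output_{i}.csv": f"data{i}".encode() for i in range(100)}

def download(key):
    # A real client call would fetch the blob over the network.
    return key, FAKE_STORAGE[key]

# Fetch all blobs with up to 8 concurrent "requests".
with ThreadPoolExecutor(max_workers=8) as pool:
    blobs = dict(pool.map(download, FAKE_STORAGE))
```

With a real client, `max_workers` should be tuned against the service's rate limits (see the considerations below).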
Methods for parallel processing
- Multithreading and multiprocessing on a single machine (e.g. Workbench)
- Split across multiple machines in a scaled-compute environment (e.g. Tasks)
- Combination of both methods (multithreading within each Tasks worker)
- Workflows map functions and built-in parallelism (autoscaling)
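The combined method can be sketched as follows: each Tasks worker receives a batch of tiles and fans out over the batch with threads. The workers here are simulated as plain function calls; `process_tile` is a hypothetical per-tile operation, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor

def process_tile(tile_id):
    # Hypothetical I/O-bound work for one tile (e.g. a raster request).
    return tile_id ** 2

def task_worker(tile_batch):
    # One (simulated) Tasks worker: multithread within the worker.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(process_tile, tile_batch))

# Split 16 tiles into 4 batches, one per worker; in a real deployment
# each batch would be submitted as a separate Tasks job.
tiles = list(range(16))
batches = [tiles[i::4] for i in range(4)]
results = [task_worker(b) for b in batches]
```

This layering multiplies throughput: N workers times M threads each, bounded by the downstream API's rate limits.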
Considerations to make when choosing a method
- If the worker process is accessing an API, what are that API’s rate limits?
- How much memory will each individual unit of work require? (Will the memory consumed by all active workers exceed local memory? If so choose Tasks instead of local processing)
- Will a unit of work exceed Workflows' memory or complexity constraints?
- Can a unit of work be expressed in a Workflows query?
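One way to address the rate-limit consideration is to cap in-flight requests with a semaphore, independently of the thread-pool size. This is a generic sketch (the bookkeeping exists only to make the cap observable), not any particular API's client.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4              # assumed limit for this sketch
gate = threading.Semaphore(MAX_IN_FLIGHT)
lock = threading.Lock()
active = 0                     # requests currently in flight
peak = 0                       # highest concurrency observed

def call_api(i):
    # Hypothetical API call; the semaphore bounds concurrency even
    # though the pool has more threads than MAX_IN_FLIGHT.
    global active, peak
    with gate:
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return i + 1       # pretend response
        finally:
            with lock:
                active -= 1

with ThreadPoolExecutor(max_workers=16) as pool:
    out = list(pool.map(call_api, range(50)))
```

For per-second (rather than concurrency) limits, a token-bucket or sleep-and-retry-with-backoff scheme is the usual complement to this pattern.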