Parallelizing Processes
Parallelizing Raster, Storage, and Metadata requests, and Workflows jobs
- What and when to parallelize?
- For interactive work, parallelize locally on a single machine
- To run a large job reliably over a long period (e.g. overnight), use Tasks
- Best practices for Tasks are published in the docs
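As a minimal sketch of the local, interactive case: a thread pool from Python's standard library can fan out I/O-bound calls. The `fetch` function here is a hypothetical stand-in for a real API request, not part of any client library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(item):
    # Hypothetical stand-in for an I/O-bound call (e.g. one API request).
    return item * 2

items = list(range(10))
results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit all units of work, then collect them as they finish.
    futures = {pool.submit(fetch, i): i for i in items}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()
```

Threads suit I/O-bound work (network requests); for CPU-bound work a `ProcessPoolExecutor` avoids the GIL at the cost of inter-process serialization.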
Use cases for parallel processing
- Speed up access to an API (ex: requests to Storage, or Task results)
- Download a large collection of model outputs (ex: csv files) from DL Storage
- Quickly download imagery across many AOIs with Raster
- Handle a compute-heavy workload that can be split into pieces
- Deploy a model over a large AOI split into tiles
- Scale handling of large datasets that can be split into pieces
- Ingest a large dataset consisting of many images on an object storage service (Google Cloud Storage, Azure Storage)
- Leverage larger memory footprints in DL Tasks to download very large files and upload to Catalog
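The "download a large collection of model outputs from Storage" use case above can be sketched with a thread pool. `FAKE_STORAGE` and `download` are hypothetical stand-ins for a real Storage client and its get-by-key call; only the fan-out pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a Storage service: key -> bytes.
FAKE_STORAGE = {f"model_output_{i}.csv": f"data{i}".encode() for i in range(100)}

def download(key):
    # A real client call would fetch the blob over the network.
    return key, FAKE_STORAGE[key]

# Fetch all blobs with up to 8 concurrent "requests".
with ThreadPoolExecutor(max_workers=8) as pool:
    blobs = dict(pool.map(download, FAKE_STORAGE))
```

With a real client, `max_workers` should be tuned against the service's rate limits (see the considerations below).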
Methods for parallel processing
- Multithreading and multiprocessing on a single machine (e.g. Workbench)
- Split across multiple machines in a scaled-compute environment (e.g. Tasks)
- Combination of both methods (multithreading within each Tasks worker)
- Workflows map functions and built-in parallelism (autoscaling)
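The combined method can be sketched as follows: each Tasks worker receives a batch of tiles and fans out over the batch with threads. The workers here are simulated as plain function calls; `process_tile` is a hypothetical per-tile operation, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor

def process_tile(tile_id):
    # Hypothetical I/O-bound work for one tile (e.g. a raster request).
    return tile_id ** 2

def task_worker(tile_batch):
    # One (simulated) Tasks worker: multithread within the worker.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(process_tile, tile_batch))

# Split 16 tiles into 4 batches, one per worker; in a real deployment
# each batch would be submitted as a separate Tasks job.
tiles = list(range(16))
batches = [tiles[i::4] for i in range(4)]
results = [task_worker(b) for b in batches]
```

This layering multiplies throughput: N workers times M threads each, bounded by the downstream API's rate limits.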
Considerations to make when choosing a method
- If the worker process is accessing an API, what are that API’s rate limits?
- How much memory will each individual unit of work require? (Will the memory consumed by all active workers exceed local memory? If so choose Tasks instead of local processing)
- Will a unit of work exceed Workflows' memory or complexity constraints?
- Can a unit of work be expressed in a Workflows query?
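One way to address the rate-limit consideration is to cap in-flight requests with a semaphore, independently of the thread-pool size. This is a generic sketch (the bookkeeping exists only to make the cap observable), not any particular API's client.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4              # assumed limit for this sketch
gate = threading.Semaphore(MAX_IN_FLIGHT)
lock = threading.Lock()
active = 0                     # requests currently in flight
peak = 0                       # highest concurrency observed

def call_api(i):
    # Hypothetical API call; the semaphore bounds concurrency even
    # though the pool has more threads than MAX_IN_FLIGHT.
    global active, peak
    with gate:
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return i + 1       # pretend response
        finally:
            with lock:
                active -= 1

with ThreadPoolExecutor(max_workers=16) as pool:
    out = list(pool.map(call_api, range(50)))
```

For per-second (rather than concurrency) limits, a token-bucket or sleep-and-retry-with-backoff scheme is the usual complement to this pattern.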