Dask Gateway

By default, Calrissian schedules one Kubernetes pod per CWL step. Some CWL tools, however, need distributed compute for array- or window-based processing.

The Calrissian DaskGatewayRequirement extension provides a first-class way to provision and manage a transient Dask cluster that a CommandLineTool can talk to.

With this extension, a tool requests a short-lived Dask cluster (via Dask Gateway) as part of step execution.

Calrissian will:

  • Bring up a Dask cluster prior to tool execution (init container),
  • Surface scheduler connection info to the main container,
  • Optionally run user-provided init and dispose scripts,
  • Tear everything down reliably (including on failures).

The Calrissian DaskGatewayRequirement extension makes Dask usage declarative and portable at the CWL layer while keeping cluster lifecycle under Calrissian control.

DaskGatewayRequirement extension schema

The schema for the DaskGatewayRequirement extension is shown below:

$base: https://calrissian-cwl.github.io/schema#
$namespaces:
  cwl: "https://w3id.org/cwl/cwl#"
$graph:
- $import: https://w3id.org/cwl/CommonWorkflowLanguage.yml

- name: DaskGatewayRequirement
  type: record
  extends: cwl:ProcessRequirement
  inVocab: false
  doc: "Indicates that a process requires a Dask cluster procured via [Dask Gateway](https://gateway.dask.org/) runtime."
  fields:
    class:
      type: 'string'
      doc: "Always 'DaskGatewayRequirement'"
      jsonldPredicate:
        "_id": "@type"
        "_type": "@vocab"
    workerCores:
      type:
        - 'int'
        - 'cwl:Expression'
      doc: |
        Number of cpu-cores available for a dask worker.
    workerCoresLimit:
      type:
        - 'int'
        - 'cwl:Expression'
      doc: |
        Maximum number of cpu-cores available for a dask worker.
    workerMemory:
      type:
        - 'string'
        - 'cwl:Expression'
      doc: |
        Maximum amount of memory available for a dask worker.
    clusterMaxCores:
      type:
        - 'int'
        - 'cwl:Expression'
      doc: |
        Maximum number of cores available for the dask cluster
    clusterMaxMemory:
      type:
        - 'string'
        - 'cwl:Expression'
      doc: |
        Maximum amount of memory available for the dask cluster

CWL CommandLineTool example

cwlVersion: v1.2

$namespaces:
  s: https://schema.org/
  calrissian: https://calrissian-cwl.github.io/schema#
$schemas:
  - http://schema.org/version/9.0/schemaorg-current-http.rdf

class: CommandLineTool
id: wrs-coverage-tool
requirements:
    EnvVarRequirement:
        envDef: {}
    NetworkAccess:
        networkAccess: true
hints:
    DockerRequirement:
        dockerPull: docker.io/library/wrs-coverage
    calrissian:DaskGatewayRequirement:
        workerCores: 1
        workerCoresLimit: 1
        workerMemory: "1G"
        clusterMaxCores: 10
        clusterMaxMemory: "20G"
baseCommand: ["wrs-coverage"]
arguments: []
inputs:
    collection-id:
        type: string
        inputBinding:
            position: 1
            prefix: "--collection-id"
outputs:
    wrs-coverage-image:
        type: File
        outputBinding:
            glob: acq-by-wrs-tile.png
    wrs-coverage-parquet:
        type: File
        outputBinding:
            glob: acq-by-wrs-tile.parquet
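
The schema types every sizing field as either a literal or a cwl:Expression, so worker sizing can presumably be driven by tool inputs instead of fixed values. The fragment below is a minimal sketch of that idea, not a verified configuration: the input names worker_cores and worker_memory are hypothetical, and it assumes Calrissian resolves CWL parameter references inside this hint.

```yaml
# Sketch only: input names are hypothetical, and evaluation of parameter
# references inside this hint is assumed from the schema, not verified.
inputs:
    worker_cores:
        type: int
        default: 1
    worker_memory:
        type: string
        default: "1G"
hints:
    calrissian:DaskGatewayRequirement:
        # Per-worker sizing taken from the tool inputs
        workerCores: $(inputs.worker_cores)
        workerCoresLimit: $(inputs.worker_cores)
        workerMemory: $(inputs.worker_memory)
        # Cluster-wide caps kept as literals
        clusterMaxCores: 10
        clusterMaxMemory: "20G"
```

Keeping the cluster-wide caps as literals while parameterizing only the per-worker values means a caller can tune worker size without being able to raise the overall budget.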

Additional CLI options

The following Calrissian CLI options configure the Dask Gateway integration:

  • --dask-gateway-url: the URL of the Dask Gateway, given as its internal service URL.

Optional:

  • --dask-script-configmap: name of an existing ConfigMap containing a custom script that overrides the default Dask cluster initialization.
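
As an illustration of how these options might be passed, the fragment below sketches the container section of a Kubernetes Job that runs Calrissian. Every value is an assumption made for the example: the gateway URL, the ConfigMap name, and the file names wrs-coverage.cwl and params.yml are hypothetical, and --max-ram and --max-cores are Calrissian's general resource-cap options rather than anything specific to Dask.

```yaml
# Fragment of a Kubernetes Job container spec (all values hypothetical)
command: ["calrissian"]
args:
  - "--max-ram"
  - "16G"
  - "--max-cores"
  - "8"
  # Internal service URL of the Dask Gateway (hypothetical address)
  - "--dask-gateway-url"
  - "http://traefik-dask-gateway.dask-gateway.svc.cluster.local:80"
  # Optional: existing ConfigMap holding a custom initialization script
  - "--dask-script-configmap"
  - "dask-cluster-scripts"
  # CWL document and its parameters file
  - "wrs-coverage.cwl"
  - "params.yml"
```

Omitting the two --dask-script-configmap entries leaves the default Dask cluster initialization in place.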