CLI Documentation

datachain

DataChain: Wrangle unstructured AI data at scale

Usage:

datachain [-h] [-V] command ...

Options:

  • -V / --version — show program's version number and exit

Arguments:

  • command — Use datachain command --help for command-specific help.

datachain clear-cache

Clear the local file cache

Usage:

datachain clear-cache [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                             [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception

datachain clone

Copy data files from the cloud

Usage:

datachain clone [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
                       [-u] [-v] [-q] [--debug-sql] [--pdb] [-f] [-r]
                       [--no-glob] [--no-cp] [--edatachain]
                       [--edatachain-file EDATACHAIN_FILE]
                       sources [sources ...] output

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • -f / --force — Force creating outputs
  • -r / -R / --recursive — Copy directories recursively
  • --no-glob — Do not expand globs (such as * or ?)
  • --no-cp — Do not copy files, just create a dataset
  • --edatachain — Create a .edatachain file
  • --edatachain-file — Use a different filename for the resulting .edatachain file

Arguments:

  • sources — Data sources - paths to cloud storage dirs
  • output — Output

datachain completion

Output shell completion script

Usage:

datachain completion [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                            [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
                            [-s {bash,zsh,tcsh}]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • -s / --shell — Shell syntax for completions. (default: "bash")

datachain cp

Copy data files from the cloud

Usage:

datachain cp [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
                    [-v] [-q] [--debug-sql] [--pdb] [-f] [-r] [--no-glob]
                    sources [sources ...] output

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • -f / --force — Force creating outputs
  • -r / -R / --recursive — Copy directories recursively
  • --no-glob — Do not expand globs (such as * or ?)

Arguments:

  • sources — Data sources - paths to cloud storage dirs
  • output — Output
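
To illustrate what expanding globs means for the sources argument, here is a minimal sketch using Python's standard fnmatch module. It is not DataChain's matching code; the object keys are hypothetical, and fnmatch lets * cross "/" boundaries, which DataChain may or may not do.

```python
from fnmatch import fnmatchcase

# Hypothetical object keys; not taken from any real bucket.
keys = ["img/cat1.jpg", "img/cat2.jpg", "img/cat10.jpg", "img/dog1.png"]


def expand(pattern, names):
    """Return the names matching a shell-style pattern.

    * matches any run of characters, ? matches exactly one character.
    """
    return [n for n in names if fnmatchcase(n, pattern)]


star = expand("img/cat*.jpg", keys)      # cat1, cat2 and cat10
question = expand("img/cat?.jpg", keys)  # cat1 and cat2 only

# Without expansion the pattern is compared literally, so nothing matches here.
literal = [n for n in keys if n == "img/cat*.jpg"]

print(star)
print(question)
print(literal)
```

With --no-glob, a source containing * or ? would be treated like the literal case above rather than as a pattern.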

datachain dataset-stats

Shows basic dataset stats

Usage:

datachain dataset-stats [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                               [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
                               [--version VERSION] [-b] [--si]
                               name

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --version — Dataset version
  • -b / --bytes — Display size in bytes instead of human-readable size
  • --si — Display size using powers of 1000 not 1024

Arguments:

  • name — Dataset name

datachain datasets

List datasets

Usage:

datachain datasets [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
                          [-u] [-v] [-q] [--debug-sql] [--pdb] [--studio] [-L]
                          [-a] [--team TEAM]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --studio — List the files in the Studio
  • -L / --local — List local files only
  • -a / --all — List all files including hidden files (default: true)
  • --team — The team to list datasets for. By default, it will use team from config.

datachain du

Display space usage

Usage:

datachain du [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
                    [-v] [-q] [--debug-sql] [--pdb] [-b] [-d N] [--si]
                    sources [sources ...]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • -b / --bytes — Display sizes in bytes instead of human-readable sizes
  • -d / --depth / --max-depth — Display sizes for N directory levels below the given directory. The default is 0 (summarize the provided directory only).
  • --si — Display sizes using powers of 1000 not 1024

Arguments:

  • sources — Data sources - paths to cloud storage dirs
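
The difference between -b / --bytes, the default human-readable sizes, and --si can be sketched as follows. The helper human_size is hypothetical and the exact output format of datachain du may differ; only the choice of base (1024 versus 1000) is taken from the option descriptions above.

```python
def human_size(num_bytes, si=False):
    """Format a byte count using powers of 1024 (default) or 1000 (si=True)."""
    base = 1000 if si else 1024
    units = ["", "K", "M", "G", "T", "P"]
    size = float(num_bytes)
    for unit in units:
        # Stop once the value fits in the current unit, or units run out.
        if size < base or unit == units[-1]:
            break
        size /= base
    return f"{size:.1f}{unit}" if unit else str(num_bytes)


n = 1_500_000
print(n)                       # raw byte count, as with -b / --bytes
print(human_size(n))           # powers of 1024
print(human_size(n, si=True))  # powers of 1000, as with --si
```

The same byte count reads as 1.4M with base 1024 and 1.5M with base 1000, which is why the two modes should not be mixed when comparing totals.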

datachain edit-dataset

Edit dataset metadata

Usage:

datachain edit-dataset [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                              [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
                              [--new-name NEW_NAME]
                              [--description DESCRIPTION]
                              [--labels LABELS [LABELS ...]]
                              name

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --new-name — Dataset new name
  • --description — Dataset description
  • --labels — Dataset labels

Arguments:

  • name — Dataset name

datachain find

Search in a directory hierarchy

Usage:

datachain find [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
                      [-v] [-q] [--debug-sql] [--pdb] [--name NAME]
                      [--iname INAME] [--path PATH] [--ipath IPATH]
                      [--size SIZE] [--type TYPE] [-c COLUMNS]
                      sources [sources ...]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --name — Filename to match pattern.
  • --iname — Like --name but case insensitive.
  • --path — Path to match pattern.
  • --ipath — Like --path but case insensitive.
  • --size — Filter by size (+ is greater or equal, - is less or equal). Specified size is in bytes, or use a suffix like K, M, G for kilobytes, megabytes, gigabytes, etc.
  • --type — File type: "f" - regular, "d" - directory
  • -c / --columns — A comma-separated list of columns to print for each result. Options are: du,name,path,size,type (Default: path)

Arguments:

  • sources — Data sources - paths to cloud storage dirs
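
A minimal sketch of how a --size value could be interpreted, based only on the option description above. It is not DataChain's parser. Two points are assumptions: suffixes are treated as powers of 1024, and a value with no sign is treated as an exact match.

```python
import re

# Assumption: suffixes are powers of 1024; DataChain may define them differently.
_MULT = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size_filter(spec):
    """Turn a spec such as '+10M' or '-512' into (comparison, bytes)."""
    m = re.fullmatch(r"([+-]?)(\d+)([KMGT]?)", spec.strip().upper())
    if m is None:
        raise ValueError(f"unrecognised size spec: {spec!r}")
    sign, digits, suffix = m.groups()
    # "+" is greater or equal, "-" is less or equal (from the option text);
    # treating an unsigned value as an exact match is an assumption.
    op = {"+": ">=", "-": "<=", "": "=="}[sign]
    return op, int(digits) * _MULT[suffix]


def matches(size, spec):
    """Check a file size in bytes against a size spec."""
    op, limit = parse_size_filter(spec)
    return {">=": size >= limit, "<=": size <= limit, "==": size == limit}[op]


print(parse_size_filter("+10M"))
print(matches(2048, "+2K"), matches(2047, "+2K"))
```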

datachain gc

Garbage collect temporary tables

Usage:

datachain gc [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
                    [-v] [-q] [--debug-sql] [--pdb]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception

datachain index

Index storage location

Usage:

datachain index [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
                       [-u] [-v] [-q] [--debug-sql] [--pdb]
                       sources [sources ...]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception

Arguments:

  • sources — Data sources - paths to cloud storage dirs

datachain internal-run-udf

Usage:

datachain internal-run-udf [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                                  [--anon] [-u] [-v] [-q] [--debug-sql]
                                  [--pdb]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception

datachain internal-run-udf-worker

Usage:

datachain internal-run-udf-worker [-h]
                                         [--aws-endpoint-url AWS_ENDPOINT_URL]
                                         [--anon] [-u] [-v] [-q] [--debug-sql]
                                         [--pdb]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception

datachain ls

List storage contents

Usage:

datachain ls [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
                    [-v] [-q] [--debug-sql] [--pdb] [-l] [--studio] [-L] [-a]
                    [--team TEAM]
                    [sources ...]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • -l / --long — List files in the long format
  • --studio — List the files in the Studio
  • -L / --local — List local files only
  • -a / --all — List all files including hidden files (default: true)
  • --team — The team to list datasets for. By default, it will use team from config.

Arguments:

  • sources — Data sources - paths to cloud storage dirs

datachain pull

Pull a specific dataset version from SaaS

Usage:

datachain pull [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
                      [-v] [-q] [--debug-sql] [--pdb] [-o OUTPUT] [-f] [-r]
                      [--no-cp] [--edatachain]
                      [--edatachain-file EDATACHAIN_FILE]
                      dataset

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • -o / --output — Output
  • -f / --force — Force creating outputs
  • -r / -R / --recursive — Copy directories recursively
  • --no-cp — Do not copy files, just pull a remote dataset into local DB
  • --edatachain — Create .edatachain file
  • --edatachain-file — Use a different filename for the resulting .edatachain file

Arguments:

  • dataset — Name and version of remote dataset created in SaaS

datachain query

Create a new dataset with a query script

Usage:

datachain query [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
                       [-u] [-v] [-q] [--debug-sql] [--pdb] [--parallel [N]]
                       [-p param=value]
                       <script.py>

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --parallel — Use multiprocessing to run any query script UDFs with N worker processes. N defaults to the CPU count.
  • -p / --param — Query parameters

Arguments:

  • <script.py> — Filepath for script
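
When calling datachain query from another program, the argument list can be assembled in the order given by the usage line. The sketch below only builds and prints the list; it does not run anything. The helper query_argv and the parameter names limit and split are hypothetical, and repeating -p once per parameter is an assumption rather than documented behaviour.

```python
import shlex


def query_argv(script, params=None, parallel=None):
    """Build an argv list for `datachain query`.

    parallel: None omits the flag; an integer is passed as --parallel N.
    params:   mapping rendered as one "-p key=value" pair per entry.
    """
    argv = ["datachain", "query"]
    if parallel is not None:
        argv += ["--parallel", str(parallel)]
    for key, value in (params or {}).items():
        argv += ["-p", f"{key}={value}"]  # assumes -p may be repeated
    argv.append(script)
    return argv


argv = query_argv("script.py", params={"limit": 100, "split": "train"}, parallel=4)
print(shlex.join(argv))
```

Where DataChain is installed, the resulting list can be handed to subprocess.run(argv, check=True).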

datachain rm-dataset

Removes dataset

Usage:

datachain rm-dataset [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                            [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
                            [--version VERSION] [--force | --no-force]
                            name

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --version — Dataset version
  • --force / --no-force — Force delete registered dataset with all of its versions (default: false)

Arguments:

  • name — Dataset name

datachain show

Show the contents of a dataset

Usage:

datachain show [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
                      [-v] [-q] [--debug-sql] [--pdb] [--version VERSION]
                      [--schema] [--limit LIMIT] [--offset OFFSET]
                      [--columns COLUMNS] [--no-collapse]
                      name

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --version — Dataset version
  • --schema — Show schema
  • --limit — Number of rows to show (default: 10)
  • --offset — Number of rows to offset
  • --columns — Columns to show
  • --no-collapse — Do not collapse the columns

Arguments:

  • name — Dataset name
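
The interaction of --limit and --offset can be pictured as a window over the dataset rows. This sketch assumes the usual SQL-style meaning (skip the first offset rows, then return at most limit rows); the source does not spell out the semantics, and the window helper is hypothetical.

```python
def window(rows, limit=10, offset=0):
    """Return at most `limit` rows after skipping the first `offset` rows."""
    return rows[offset:offset + limit]


rows = list(range(25))  # stand-in for dataset rows

first_page = window(rows)             # default: first 10 rows
second_page = window(rows, offset=10)
last_page = window(rows, offset=20)   # only 5 rows remain

print(first_page)
print(second_page)
print(last_page)
```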

datachain studio

Authenticate DataChain with Studio and set the token. Once the token is configured, DataChain uses it to share datasets and access Studio features from the CLI.

Usage:

datachain studio [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
                        [-u] [-v] [-q] [--debug-sql] [--pdb]
                        {login,logout,team,token,datasets} ...

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception

Arguments:

  • cmd — Use datachain studio CMD --help to display command-specific help.

datachain studio datasets

This command lists all the datasets available in Studio. It will show the dataset name and the number of versions available.

Usage:

datachain studio datasets [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                                 [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
                                 [--team TEAM]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --team — The team to list datasets for. By default, it will use team from config.

datachain studio login

By default, this command authenticates DataChain with Studio using default scopes and assigns a random name as the token name.

Usage:

datachain studio login [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                              [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
                              [-H HOSTNAME] [-s SCOPES] [-n NAME] [--no-open]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • -H / --hostname — The hostname of the Studio instance to authenticate with.
  • -s / --scopes — The scopes for the authentication token.
  • -n / --name — The name of the authentication token. It identifies the token shown in your Studio profile.
  • --no-open — Use the authentication flow based on a user code. You will be presented with a user code to enter in the browser. DataChain also uses this flow if it cannot launch a browser on your behalf.

datachain studio logout

This removes the Studio token from your global config.

Usage:

datachain studio logout [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                               [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception

datachain studio team

Set the default team for DataChain to use when interacting with Studio.

Usage:

datachain studio team [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                             [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
                             [--global]
                             team_name

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception
  • --global — Set the team globally for all DataChain projects.

Arguments:

  • team_name — The name of the team to set as the default.

datachain studio token

View the token datachain uses to contact Studio

Usage:

datachain studio token [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
                              [--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]

Options:

  • --aws-endpoint-url — AWS endpoint URL
  • --anon — AWS anon (aka awscli's --no-sign-request)
  • -u / --update — Update cache
  • -v / --verbose — Verbose
  • -q / --quiet — Be quiet
  • --debug-sql — Show All SQL Queries (very verbose output, for debugging only)
  • --pdb — Drop into the pdb debugger on fatal exception