Datasets

Dataset Storage Options

Each section below shows how to define a dataset in both Python and YAML. All Python examples require the dataset module to be imported from servicex.

See also

For how datasets are added to a sample, see Samples.

Rucio

This dataset declaration looks up a dataset using a query to the Rucio data management system. The request is assumed to be for a Rucio dataset or container.

"Dataset": servicex.dataset.Rucio("my.rucio.dataset.name")
Dataset: !Rucio my.rucio.dataset.name

EOS

For files stored on EOS, two access methods are available. For discrete file selection, FileList is recommended; for entire directories or wildcard patterns, XRootD is the appropriate dataset type.

Danger

The ServiceX instance must have permission to read these files. In particular, if generic members of your experiment can’t access the files, ServiceX will probably not be able to either.

FileList (Python)

"Dataset": servicex.dataset.FileList(["root://eospublic.cern.ch//eos/opendata/mystuff/file1.root", "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root"])

XRootD (Python)

Added in version 3.0.1.

"Dataset": servicex.dataset.XRootD("root://eospublic.cern.ch//eos/opendata/mystuff/*")

FileList (YAML)

Dataset: !FileList ["root://eospublic.cern.ch//eos/opendata/mystuff/file1.root", "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root"]

XRootD (YAML)

Added in version 3.0.1.

Dataset: !XRootD root://eospublic.cern.ch//eos/opendata/mystuff/*
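
The choice between the two EOS dataset types can be sketched as a small helper. This is an illustration only: eos_dataset_yaml is a hypothetical function, not part of ServiceX, and it assumes that a `*` in a URL means a wildcard pattern was intended. It uses only the standard library and emits the YAML forms shown in this section.

```python
import json
from typing import List, Union


def eos_dataset_yaml(source: Union[str, List[str]]) -> str:
    """Return a YAML `Dataset:` line for EOS files (hypothetical helper)."""
    urls = [source] if isinstance(source, str) else list(source)
    if not urls:
        raise ValueError("at least one URL is required")

    if any("*" in url for url in urls):
        # A wildcard maps to the XRootD dataset type, which takes a
        # single pattern string rather than a list.
        if len(urls) != 1:
            raise ValueError("XRootD takes exactly one pattern")
        return f"Dataset: !XRootD {urls[0]}"

    # Discrete files map to FileList. A JSON array is also a valid YAML
    # flow sequence, so json.dumps produces the list syntax.
    return f"Dataset: !FileList {json.dumps(urls)}"


print(eos_dataset_yaml("root://eospublic.cern.ch//eos/opendata/mystuff/*"))
```

Keeping the decision in one place makes it harder to pass a wildcard to FileList by accident, where it would be treated as a literal file name.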

CERN Open Data Portal

Datasets from the CERN Open Data Portal are referenced by their numeric record ID.

"Dataset": servicex.dataset.CERNOpenData(179)
Dataset: !CERNOpenData 179

Network Accessible Files

Files accessible via HTTP or XRootD protocols can be provided directly as a list of URLs.

Danger

The ServiceX instance must have permission to read these files. In particular, if generic members of your experiment can’t access the files, ServiceX will probably not be able to either.

"Dataset": servicex.dataset.FileList(["http://server/file1.root", "root://server/file2.root"])
Dataset: !FileList ["http://server/file1.root", "root://server/file2.root"]
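
Before submitting a list of URLs, it can help to check that each one uses a network protocol. The following is a minimal sketch using only the standard library. The set of accepted schemes (including https) is an assumption based on the description above, unsupported_urls is a hypothetical helper, and /local/file3.root is a made-up path used to show a rejection.

```python
from typing import Iterable, List
from urllib.parse import urlparse

# Assumption: these schemes correspond to "HTTP or XRootD" access.
ALLOWED_SCHEMES = {"http", "https", "root"}


def unsupported_urls(urls: Iterable[str]) -> List[str]:
    """Return the URLs whose scheme is not in ALLOWED_SCHEMES."""
    return [u for u in urls if urlparse(u).scheme.lower() not in ALLOWED_SCHEMES]


files = [
    "http://server/file1.root",
    "root://server/file2.root",
    "/local/file3.root",  # hypothetical local path with no scheme
]
bad = unsupported_urls(files)
if bad:
    print("Not network accessible:", bad)
```

This check covers only the URL form. It cannot tell whether the ServiceX instance has permission to read the files.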

API Reference

DatasetGroup

class servicex.dataset_group.DatasetGroup(datasets: List[Query])

A group of datasets that are to be transformed together. This is a convenience class to allow you to submit multiple datasets to a ServiceX instance and then wait for all of them to complete.

Parameters:

datasets – List of transform requests as dataset instances

set_result_format

DatasetGroup.set_result_format(result_format: ResultFormat)

Set the result format for all the datasets in the group.

Parameters:

result_format – ResultFormat instance

as_signed_urls

DatasetGroup.as_signed_urls(display_progress: bool = True, provided_progress: Progress | None = None, return_exceptions: bool = False, overall_progress: bool = False) → List[TransformedResults | BaseException]

as_files

DatasetGroup.as_files(display_progress: bool = True, provided_progress: Progress | None = None, return_exceptions: bool = False, overall_progress: bool = False) → List[TransformedResults | BaseException]
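
The return type of as_signed_urls and as_files allows a result list that mixes successful results with exceptions. The sketch below shows one way to separate the two. It assumes, based on the signature alone, that failures appear as exception objects in the list when return_exceptions is True. split_results is a hypothetical helper, and plain strings stand in for TransformedResults objects.

```python
from typing import Any, List, Tuple


def split_results(results: List[Any]) -> Tuple[List[Any], List[BaseException]]:
    """Partition a mixed result list into successes and exceptions."""
    succeeded: List[Any] = []
    failed: List[BaseException] = []
    for item in results:
        if isinstance(item, BaseException):
            failed.append(item)
        else:
            succeeded.append(item)
    return succeeded, failed


# Stand-in values; real entries would be TransformedResults objects.
mixed = ["sample-a", RuntimeError("transform failed"), "sample-b"]
ok, errors = split_results(mixed)
print(len(ok), "succeeded;", len(errors), "failed")
```

Handling failures this way lets the successful datasets be used even if one transform in the group fails.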

Rucio

class servicex.dataset_identifier.RucioDatasetIdentifier(dataset: str, num_files: int | None = None)

Rucio Dataset - this will be looked up using the Rucio data management service.

Parameters:
  • dataset – The Rucio DID - this can be a dataset or a container of datasets.

  • num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.
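
The value of a stable subset is that repeated development runs see the same input files. The sketch below shows one simple way to get that property: sort the file names and take the first num_files. This is an illustration of the idea only, not a description of how ServiceX selects files, and first_n_files is a hypothetical helper.

```python
from typing import List, Optional


def first_n_files(files: List[str], num_files: Optional[int] = None) -> List[str]:
    """Return a subset that does not depend on the input order."""
    ordered = sorted(files)
    # None means no limit, mirroring the num_files default.
    return ordered if num_files is None else ordered[:num_files]


# The same two files are chosen whatever order the listing arrives in.
print(first_n_files(["c.root", "a.root", "b.root"], num_files=2))
print(first_n_files(["b.root", "c.root", "a.root"], num_files=2))
```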

FileList

class servicex.dataset_identifier.FileListDataset(files: List[str] | str)

Dataset specified as a list of XRootD URIs.

Parameters:

files – Either a list of URIs or a single URI string
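
Because files accepts either form, code that handles both usually normalizes to a list first. The following is a minimal sketch of that normalization. as_file_list is a hypothetical helper and is not taken from the FileListDataset implementation.

```python
from typing import List, Union


def as_file_list(files: Union[List[str], str]) -> List[str]:
    """Wrap a single URI in a list; copy an existing list unchanged."""
    # The str check comes first because a string is itself iterable and
    # would otherwise be split into single characters.
    return [files] if isinstance(files, str) else list(files)


print(as_file_list("root://server/file1.root"))
```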

XRootD

class servicex.dataset_identifier.XRootDDatasetIdentifier(pattern: str, num_files: int | None = None)

XRootD pattern Dataset - the wildcard pattern is expanded into a list of files using the XRootD protocol.

Parameters:
  • pattern – The wildcard pattern to be evaluated.

  • num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.
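
To get a feel for what a wildcard pattern selects, the sketch below applies shell-style matching from the standard library to a few URLs. It assumes shell-style semantics for `*`. The real expansion is done through XRootD and may behave differently (for example, fnmatch lets `*` match across `/`). The third URL is a made-up path included to show a non-match.

```python
from fnmatch import fnmatchcase

pattern = "root://eospublic.cern.ch//eos/opendata/mystuff/*"
candidates = [
    "root://eospublic.cern.ch//eos/opendata/mystuff/file1.root",
    "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root",
    "root://eospublic.cern.ch//eos/opendata/other/file3.root",  # hypothetical
]

# Keep only the URLs that the pattern would select.
matched = [url for url in candidates if fnmatchcase(url, pattern)]
for url in matched:
    print(url)
```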

CERNOpenData

class servicex.dataset_identifier.CERNOpenDataDatasetIdentifier(dataset: int, num_files: int | None = None)

CERN Open Data Dataset - this will be looked up using the CERN Open Data DID finder.

Parameters:
  • dataset – The dataset ID - this is an integer.

  • num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.

GenericDataSet

class servicex.dataset_identifier.DataSetIdentifier(scheme: str, dataset: str, num_files: int | None = None)

Base class for specifying the dataset to transform. This can be either a list of XRootD URIs or a Rucio DID.