Datasets

Dataset Storage Options

Each section below shows how to define a dataset in both Python and YAML. For all Python examples, dataset must be imported from servicex.

Rucio

This dataset declaration looks up a dataset using a query to the Rucio data management system. The request is assumed to be for a Rucio dataset or container.

Python

"Dataset": servicex.dataset.Rucio("my.rucio.dataset.name")

YAML

Dataset: !Rucio my.rucio.dataset.name

EOS

For files stored on EOS, two access methods are available. For discrete file selection, FileList is recommended; for entire directories or wildcard patterns, XRootD is the appropriate dataset type.

Danger

The ServiceX instance must have permission to read these files. In particular, if generic members of your experiment can’t access the files, ServiceX will probably not be able to either.

Python

FileList

"Dataset": servicex.dataset.FileList(["root://eospublic.cern.ch//eos/opendata/mystuff/file1.root", "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root"])

XRootD

Added in version 3.0.1.

"Dataset": servicex.dataset.XRootD("root://eospublic.cern.ch//eos/opendata/mystuff/*")

YAML

FileList

Dataset: !FileList ["root://eospublic.cern.ch//eos/opendata/mystuff/file1.root", "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root"]

XRootD

Added in version 3.0.1.

Dataset: !XRootD root://eospublic.cern.ch//eos/opendata/mystuff/*
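The choice between FileList and XRootD described above can be sketched as a small helper. This is only an illustration: choose_eos_dataset_kind is a hypothetical function, not part of ServiceX, and it assumes that "*" is the only wildcard character of interest. It returns the name of the dataset type to use together with the value that would be passed to it.

```python
from typing import List, Tuple, Union


def choose_eos_dataset_kind(
    paths: Union[str, List[str]],
) -> Tuple[str, Union[str, List[str]]]:
    """Hypothetical helper: pick "XRootD" for a wildcard pattern, else "FileList"."""
    if isinstance(paths, str):
        paths = [paths]
    # Assumption: "*" is the only wildcard character we look for.
    patterns = [p for p in paths if "*" in p]
    if not patterns:
        # Discrete files: pass them on as an explicit list.
        return ("FileList", list(paths))
    if len(paths) != 1:
        raise ValueError(
            "a wildcard pattern must be passed on its own, not mixed into a file list"
        )
    # A single wildcard pattern: XRootD takes one pattern string.
    return ("XRootD", paths[0])


kind, value = choose_eos_dataset_kind(
    "root://eospublic.cern.ch//eos/opendata/mystuff/*"
)
```

Here kind names the dataset type from this section and value is what would be handed to it, either a single pattern string or a list of URIs.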

CERN Open Data Portal

Datasets from the CERN Open Data Portal are referenced by their numeric record ID.

Python

"Dataset": servicex.dataset.CERNOpenData(179)

YAML

Dataset: !CERNOpenData 179
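If you start from a portal page address rather than a bare number, the record ID can be pulled out of the URL. The sketch below assumes record pages have the form https://opendata.cern.ch/record/<id>; that URL shape and the helper name record_id_from_url are assumptions made for illustration and are not part of ServiceX.

```python
from urllib.parse import urlparse


def record_id_from_url(url: str) -> int:
    """Hypothetical helper: extract the numeric ID from a .../record/<id> URL."""
    # Split the path into non-empty segments, e.g. ["record", "179"].
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2 or parts[-2] != "record" or not parts[-1].isdecimal():
        raise ValueError(f"no numeric record ID found in {url!r}")
    return int(parts[-1])


# The integer result is what the CERNOpenData dataset type expects.
record_id = record_id_from_url("https://opendata.cern.ch/record/179")
```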

Network Accessible Files

Files accessible via HTTP or XRootD protocols can be provided directly as a list of URLs.

Danger

The ServiceX instance must have permission to read these files. In particular, if generic members of your experiment can’t access the files, ServiceX will probably not be able to either.

Python

"Dataset": servicex.dataset.FileList(["http://server/file1.root", "root://server/file2.root"])

YAML

Dataset: !FileList ["http://server/file1.root", "root://server/file2.root"]
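Before handing a list of URLs to FileList, it can help to catch obviously malformed entries locally. The following is a minimal sketch using only the standard library; check_file_urls is a hypothetical helper, and the set of accepted schemes (including https) is an assumption. It checks syntax only and says nothing about whether the ServiceX instance can actually read the files.

```python
from typing import Iterable, List
from urllib.parse import urlparse

# Assumption: https is accepted alongside the http and root schemes shown above.
ALLOWED_SCHEMES = {"http", "https", "root"}


def check_file_urls(urls: Iterable[str]) -> List[str]:
    """Hypothetical helper: return the URLs unchanged, or raise on malformed ones."""
    checked = list(urls)
    bad = []
    for url in checked:
        parsed = urlparse(url)
        # Require a known scheme and a host part such as "server".
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
            bad.append(url)
    if bad:
        raise ValueError(f"unsupported or malformed URLs: {bad}")
    return checked


urls = check_file_urls(["http://server/file1.root", "root://server/file2.root"])
```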

API Reference

DatasetGroup

class servicex.dataset_group.DatasetGroup(datasets: List[Query])

A group of datasets that are to be transformed together. This is a convenience class that lets you submit multiple datasets to a ServiceX instance and then wait for all of them to complete.

Parameters: datasets – List of transform requests as dataset instances

set_result_format

DatasetGroup.set_result_format(result_format: ResultFormat)

Set the result format for all the datasets in the group.

Parameters: result_format – ResultFormat instance

as_signed_urls

DatasetGroup.as_signed_urls(display_progress: bool = True, provided_progress: Progress | None = None, return_exceptions: bool = False, overall_progress: bool = False) → List[TransformedResults | BaseException]

as_files

DatasetGroup.as_files(display_progress: bool = True, provided_progress: Progress | None = None, return_exceptions: bool = False, overall_progress: bool = False) → List[TransformedResults | BaseException]
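The return type List[TransformedResults | BaseException] together with the return_exceptions flag suggests a pattern similar to asyncio.gather, where failures can be returned in place rather than raised. The sketch below illustrates that general pattern with stand-in coroutines. It does not call ServiceX, fake_transform and run_group are hypothetical names, and it is not a description of how DatasetGroup is implemented.

```python
import asyncio
from typing import List, Union


async def fake_transform(name: str, fail: bool = False) -> str:
    """Stand-in for one transform request (hypothetical, no ServiceX involved)."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    if fail:
        raise RuntimeError(f"transform failed for {name}")
    return f"{name}: done"


async def run_group(return_exceptions: bool = False) -> List[Union[str, BaseException]]:
    """Run all stand-in transforms together and wait for every one of them."""
    tasks = [fake_transform("sample_a"), fake_transform("sample_b", fail=True)]
    # With return_exceptions=True, errors come back as list entries in order.
    return await asyncio.gather(*tasks, return_exceptions=return_exceptions)


results = asyncio.run(run_group(return_exceptions=True))
succeeded = [r for r in results if not isinstance(r, BaseException)]
failed = [r for r in results if isinstance(r, BaseException)]
```

With return_exceptions left at False, the first failure would instead propagate out of run_group as an exception, so the caller has to decide which behavior suits the workflow.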

Rucio

class servicex.dataset_identifier.RucioDatasetIdentifier(dataset: str, num_files: int | None = None)

Rucio Dataset - this will be looked up using the Rucio data management service.

Parameters:

dataset – The Rucio DID - this can be a dataset or a container of datasets.
num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.
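
How ServiceX chooses the subset for num_files is not described here. As an illustration of the general idea only, one simple way to make a truncated file list reproducible is to sort it before cutting it off, so the result does not depend on the order in which files were discovered. The helper stable_subset below is hypothetical and is not the ServiceX implementation.

```python
from typing import List, Optional


def stable_subset(files: List[str], num_files: Optional[int] = None) -> List[str]:
    """Hypothetical helper: return at most num_files entries, always the same ones."""
    # Sorting removes any dependence on discovery order.
    ordered = sorted(files)
    if num_files is None:
        return ordered
    if num_files < 0:
        raise ValueError("num_files must be non-negative")
    return ordered[:num_files]


first_run = stable_subset(["c.root", "a.root", "b.root"], num_files=2)
second_run = stable_subset(["b.root", "c.root", "a.root"], num_files=2)
```

Both calls select the same two files even though the inputs arrive in a different order, which is the property that makes quick development runs comparable.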

FileList

class servicex.dataset_identifier.FileListDataset(files: List[str] | str)

Dataset specified as a list of XRootD URIs.

Parameters: files – Either a list of URIs or a single URI string

XRootD

class servicex.dataset_identifier.XRootDDatasetIdentifier(pattern: str, num_files: int | None = None)

XRootD pattern Dataset - the files are looked up over the XRootD protocol by expanding a wildcard pattern.

Parameters:

pattern – The wildcard pattern to be evaluated.
num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.

CERNOpenData

class servicex.dataset_identifier.CERNOpenDataDatasetIdentifier(dataset: int, num_files: int | None = None)

CERN Open Data Dataset - this will be looked up using the CERN Open Data DID finder.

Parameters:

dataset – The dataset ID - this is an integer.
num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.

GenericDataSet

class servicex.dataset_identifier.DataSetIdentifier(scheme: str, dataset: str, num_files: int | None = None)

Base class for specifying the dataset to transform. This can either be a list of XRootD URIs or a Rucio DID.