Datasets¶
Dataset Storage Options¶
Each section below shows how to define a dataset in both Python and YAML. The Python examples require dataset to be imported from servicex.
See also
For how datasets are added to a sample, see Samples.
Rucio¶
This dataset declaration looks up a dataset using a query to the Rucio data management system. The request is assumed to be for a Rucio dataset or container.
Python
"Dataset": servicex.dataset.Rucio("my.rucio.dataset.name")
YAML
Dataset: !Rucio my.rucio.dataset.name
EOS¶
For files stored on EOS, two access methods are available. For discrete file selection, FileList is recommended; for entire directories or wildcard patterns, XRootD is the appropriate dataset type.
Danger
The ServiceX instance must have permission to read these files. In particular, if ordinary members of your experiment can’t access the files, ServiceX will probably not be able to either.
FileList (Python)
"Dataset": servicex.dataset.FileList(["root://eospublic.cern.ch//eos/opendata/mystuff/file1.root", "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root"])
XRootD (Python)
Added in version 3.0.1.
"Dataset": servicex.dataset.XRootD("root://eospublic.cern.ch//eos/opendata/mystuff/*")
FileList (YAML)
Dataset: !FileList ["root://eospublic.cern.ch//eos/opendata/mystuff/file1.root", "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root"]
XRootD (YAML)
Added in version 3.0.1.
Dataset: !XRootD root://eospublic.cern.ch//eos/opendata/mystuff/*
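The rule of thumb above (explicit files go in a FileList, wildcard patterns go in XRootD) can be sketched as a small helper. This is only an illustration: choose_dataset_kind is a hypothetical function, not part of servicex, and it assumes that `*` is the only wildcard character of interest, since that is the only one shown in the examples here.

```python
# Hypothetical helper, not part of servicex: pick a dataset type name
# for one or more EOS paths, following the guidance in this section.

def choose_dataset_kind(paths):
    """Return "XRootD" for a single wildcard pattern, "FileList" for explicit files."""
    if isinstance(paths, str):
        paths = [paths]

    # Assumption: "*" is the only wildcard character we look for.
    wildcard_paths = [p for p in paths if "*" in p]

    if not wildcard_paths:
        return "FileList"
    if len(paths) != 1:
        # Mixing patterns and explicit files is outside the scope of this sketch.
        raise ValueError("expected either one wildcard pattern or explicit files")
    return "XRootD"


kind = choose_dataset_kind("root://eospublic.cern.ch//eos/opendata/mystuff/*")
```

The returned name would then tell you which of the two declarations shown above to write.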
CERN Open Data Portal¶
Datasets from the CERN Open Data Portal are referenced by their numeric record ID.
Python
"Dataset": servicex.dataset.CERNOpenData(179)
YAML
Dataset: !CERNOpenData 179
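If you only have a link to a portal record, the numeric ID can be pulled out of it before writing the declaration. The sketch below assumes record links of the form https://opendata.cern.ch/record/<id>; record_id_from_url is a hypothetical helper and is not provided by servicex.

```python
import re

# Hypothetical helper, not part of servicex.
# Assumption: record links end in "/record/<digits>".

def record_id_from_url(url):
    """Extract the integer record ID from a CERN Open Data record link."""
    match = re.search(r"/record/(\d+)/?$", url)
    if match is None:
        raise ValueError(f"no numeric record ID found in {url!r}")
    return int(match.group(1))


record_id = record_id_from_url("https://opendata.cern.ch/record/179")
```

The resulting integer is what would be passed to CERNOpenData in Python or written after !CERNOpenData in YAML.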
Network Accessible Files¶
Files accessible via HTTP or XRootD protocols can be provided directly as a list of URLs.
Danger
The ServiceX instance must have permission to read these files. In particular, if ordinary members of your experiment can’t access the files, ServiceX will probably not be able to either.
Python
"Dataset": servicex.dataset.FileList(["http://server/file1.root", "root://server/file2.root"])
YAML
Dataset: !FileList ["http://server/file1.root", "root://server/file2.root"]
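Before handing a list of URLs to FileList, it can help to check locally that each one uses an expected protocol. The following is a minimal sketch using only the standard library. check_file_urls is a hypothetical helper, and accepting https alongside http is an assumption, since only HTTP and XRootD are named above. It does not check whether ServiceX can actually read the files.

```python
from urllib.parse import urlparse

# Assumption: "https" is accepted in addition to the "http" and "root"
# schemes that appear in the examples above.
ALLOWED_SCHEMES = {"http", "https", "root"}


def check_file_urls(urls):
    """Return the URLs unchanged, or raise if any has an unexpected scheme or no host."""
    bad = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
            bad.append(url)
    if bad:
        raise ValueError(f"unsupported or malformed URLs: {bad}")
    return list(urls)


checked = check_file_urls(["http://server/file1.root", "root://server/file2.root"])
```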
API Reference¶
DatasetGroup¶
- class servicex.dataset_group.DatasetGroup(datasets: List[Query])[source]
A group of datasets that are to be transformed together. This is a convenience class to allow you to submit multiple datasets to a ServiceX instance and then wait for all of them to complete.
- Parameters:
datasets – List of transform requests, given as dataset instances
set_result_format¶
- DatasetGroup.set_result_format(result_format: ResultFormat)[source]
Set the result format for all the datasets in the group.
- Parameters:
result_format – ResultFormat instance
as_signed_urls¶
- DatasetGroup.as_signed_urls(display_progress: bool = True, provided_progress: Progress | None = None, return_exceptions: bool = False, overall_progress: bool = False) → List[TransformedResults | BaseException]
as_files¶
- DatasetGroup.as_files(display_progress: bool = True, provided_progress: Progress | None = None, return_exceptions: bool = False, overall_progress: bool = False) → List[TransformedResults | BaseException]
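The return type of these methods, a list whose entries are either results or exceptions, suggests that return_exceptions works like the flag of the same name on asyncio.gather. That reading is an assumption drawn from the signature, not a statement about the servicex implementation. The sketch below uses only asyncio, with a hypothetical fake_transform coroutine standing in for a real transform request, to show what "submit several, wait for all, collect failures in place" looks like.

```python
import asyncio


async def fake_transform(name, fail=False):
    """Stand-in for one transform request (hypothetical, not servicex code)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if fail:
        raise RuntimeError(f"transform failed for {name}")
    return f"results for {name}"


async def run_group(requests, return_exceptions=False):
    """Run all requests concurrently and wait for every one to finish."""
    tasks = [fake_transform(name, fail) for name, fail in requests]
    return await asyncio.gather(*tasks, return_exceptions=return_exceptions)


# With return_exceptions=True the failure is returned in the list, in order,
# instead of aborting the whole group.
results = asyncio.run(
    run_group([("sample_a", False), ("sample_b", True)], return_exceptions=True)
)
```

With return_exceptions=False, asyncio.gather would instead raise the first exception to the caller.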
Rucio¶
- class servicex.dataset_identifier.RucioDatasetIdentifier(dataset: str, num_files: int | None = None)[source]
Rucio Dataset - this will be looked up using the Rucio data management service.
- Parameters:
dataset – The Rucio DID. This can be a dataset or a container of datasets.
num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.
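The "same subset every time" property of num_files can be illustrated with a simple deterministic rule: order the files, then take the first N. How ServiceX actually selects the subset is not described here, so the sorting rule below is purely an assumption, and first_n_files is a hypothetical helper.

```python
# Hypothetical illustration of a deterministic num_files cut.
# Assumption: files are ordered lexicographically before truncation;
# the real ServiceX selection rule may differ.

def first_n_files(files, num_files=None):
    """Return at most num_files entries, independent of the input order."""
    ordered = sorted(files)
    if num_files is None:
        return ordered
    return ordered[:num_files]


subset_a = first_n_files(["c.root", "a.root", "b.root"], num_files=2)
subset_b = first_n_files(["b.root", "c.root", "a.root"], num_files=2)
```

Because the cut is applied after ordering, both calls yield the same two files even though the inputs arrive in different orders.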
FileList¶
- class servicex.dataset_identifier.FileListDataset(files: List[str] | str)[source]
Dataset specified as a list of file URIs (XRootD or HTTP), or as a single URI.
- Parameters:
files – Either a list of URIs or a single URI string
XRootD¶
- class servicex.dataset_identifier.XRootDDatasetIdentifier(pattern: str, num_files: int | None = None)[source]
XRootD pattern dataset. The wildcard pattern is resolved to a list of files using the XRootD protocol.
- Parameters:
pattern – The wildcard pattern to be evaluated.
num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.
CERNOpenData¶
- class servicex.dataset_identifier.CERNOpenDataDatasetIdentifier(dataset: int, num_files: int | None = None)[source]
CERN Open Data Dataset - this will be looked up using the CERN Open Data DID finder.
- Parameters:
dataset – The dataset ID - this is an integer.
num_files – Maximum number of files to return. This is useful during development to perform quick runs. ServiceX is careful to make sure it always returns the same subset of files.
GenericDataSet¶
- class servicex.dataset_identifier.DataSetIdentifier(scheme: str, dataset: str, num_files: int | None = None)[source]
Base class for specifying the dataset to transform. This can be either a list of XRootD URIs or a Rucio DID.
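The three constructor arguments suggest that an identifier combines a scheme, a dataset name, and an optional file limit. As a rough mental model only, the sketch below joins them into a single string. SketchIdentifier and the exact string layout (scheme://dataset?files=N) are assumptions made for illustration and should not be read as the format servicex sends.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchIdentifier:
    """Hypothetical stand-in for a dataset identifier (not the servicex class)."""

    scheme: str
    dataset: str
    num_files: Optional[int] = None

    @property
    def did(self) -> str:
        # Assumed layout: append the file limit only when one was given.
        suffix = "" if self.num_files is None else f"?files={self.num_files}"
        return f"{self.scheme}://{self.dataset}{suffix}"


limited = SketchIdentifier("rucio", "my.rucio.dataset.name", num_files=5)
unlimited = SketchIdentifier("rucio", "my.rucio.dataset.name")
```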