Specifying Datasets¶
You Will Learn:
What dataset source types are supported by ServiceX
How to define a dataset in Python and YAML for each source type
Physics analyses use a wide range of data types stored in a wide range of locations. The storage location determines the dataset definition, while the data type requires no special configuration. Four dataset source types are currently accepted.
Dataset Storage Options¶
Each section below shows how to define the dataset in both Python and YAML. For all Python examples, dataset must be imported from servicex.
See also
For how datasets are added to a sample, see Samples.
Rucio¶
This dataset declaration looks up a dataset using a query to the Rucio data management system. The request is assumed to be for a Rucio dataset or container.
Python
"Dataset": servicex.dataset.Rucio("my.rucio.dataset.name")
YAML
Dataset: !Rucio my.rucio.dataset.name
EOS¶
For files stored on EOS, two access methods are available. For discrete file selection, FileList is recommended; for entire directories or wildcard patterns, XRootD is the appropriate dataset type.
Danger
The ServiceX instance must have permission to read these files. In particular, if ordinary members of your experiment can’t access the files, ServiceX will probably not be able to either.
Python
FileList
"Dataset": servicex.dataset.FileList(["root://eospublic.cern.ch//eos/opendata/mystuff/file1.root", "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root"])
XRootD
Added in version 3.0.1.
"Dataset": servicex.dataset.XRootD("root://eospublic.cern.ch//eos/opendata/mystuff/*")
YAML
FileList
Dataset: !FileList ["root://eospublic.cern.ch//eos/opendata/mystuff/file1.root", "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root"]
XRootD
Added in version 3.0.1.
Dataset: !XRootD root://eospublic.cern.ch//eos/opendata/mystuff/*
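The choice between the two EOS access methods (FileList for discrete files, XRootD for wildcard patterns) can be sketched as a small helper that renders the matching YAML line. The function name eos_dataset_yaml is hypothetical and not part of ServiceX; the sketch only detects the * wildcard and does not check that the ServiceX instance can actually read the files.

```python
import json


def eos_dataset_yaml(paths):
    """Hypothetical helper (not part of ServiceX): render the YAML
    Dataset line for one EOS pattern or a list of discrete EOS files."""
    if isinstance(paths, str):
        paths = [paths]
    paths = list(paths)
    if not paths:
        raise ValueError("at least one path is required")

    # Assumption of this sketch: a "*" anywhere in a path marks a wildcard pattern.
    has_wildcard = any("*" in p for p in paths)
    if has_wildcard:
        if len(paths) != 1:
            raise ValueError("pass a single wildcard pattern, not a mixed list")
        return f"Dataset: !XRootD {paths[0]}"

    # Discrete files: json.dumps yields the double-quoted flow-list form used above.
    return f"Dataset: !FileList {json.dumps(paths)}"


# Usage: one wildcard pattern, then two explicit files.
print(eos_dataset_yaml("root://eospublic.cern.ch//eos/opendata/mystuff/*"))
print(eos_dataset_yaml([
    "root://eospublic.cern.ch//eos/opendata/mystuff/file1.root",
    "root://eospublic.cern.ch//eos/opendata/mystuff/file2.root",
]))
```

Keeping the decision in one place avoids mixing wildcard patterns into a FileList, where each entry is expected to name a single file.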
CERN Open Data Portal¶
Datasets from the CERN Open Data Portal are referenced by their numeric record ID.
Python
"Dataset": servicex.dataset.CERNOpenData(179)
YAML
Dataset: !CERNOpenData 179
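Because the record ID is numeric, it can be worth validating it before it reaches a configuration file, for example when the ID comes from a command-line argument. The following is a minimal sketch; cern_opendata_yaml is a hypothetical helper, not a ServiceX function, and it does not check that the record exists on the portal.

```python
def cern_opendata_yaml(record_id):
    """Hypothetical helper (not part of ServiceX): render the YAML
    Dataset line for a CERN Open Data record ID."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(record_id, bool):
        raise ValueError("record ID must be a positive integer")

    # Accept digit-only strings such as "179", e.g. from a command line.
    if isinstance(record_id, str):
        text = record_id.strip()
        if not text.isdecimal():
            raise ValueError(f"not a numeric record ID: {record_id!r}")
        record_id = int(text)

    if not isinstance(record_id, int) or record_id <= 0:
        raise ValueError("record ID must be a positive integer")
    return f"Dataset: !CERNOpenData {record_id}"


# Usage with the record ID from the example above.
print(cern_opendata_yaml(179))
```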
Network Accessible Files¶
Files accessible via HTTP or XRootD protocols can be provided directly as a list of URLs.
Danger
The ServiceX instance must have permission to read these files. In particular, if ordinary members of your experiment can’t access the files, ServiceX will probably not be able to either.
Python
"Dataset": servicex.dataset.FileList(["http://server/file1.root", "root://server/file2.root"])
YAML
Dataset: !FileList ["http://server/file1.root", "root://server/file2.root"]
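Since only HTTP and XRootD URLs are described here, a quick scheme check before building the list can catch local paths or typos early. This is a minimal sketch using only the Python standard library; check_file_urls and ALLOWED_SCHEMES are hypothetical names, and treating https as acceptable alongside http is an assumption. The check is purely syntactic and says nothing about whether ServiceX has permission to read the files.

```python
from urllib.parse import urlsplit

# "http" and "root" follow the text above; "https" is an assumption of this sketch.
ALLOWED_SCHEMES = {"http", "https", "root"}


def check_file_urls(urls):
    """Hypothetical helper (not part of ServiceX): return the URLs as a
    list, or raise ValueError for the first one that is not an
    HTTP or XRootD URL with a host."""
    checked = []
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme not in ALLOWED_SCHEMES or not parts.netloc:
            raise ValueError(f"unsupported or malformed file URL: {url!r}")
        checked.append(url)
    return checked


# Usage: the validated list can then be passed on to FileList.
files = check_file_urls(["http://server/file1.root", "root://server/file2.root"])
print(files)
```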