Remote File Introspecting¶
The get_structure() function allows inspection of the internal structure of datasets available through ServiceX. This is useful for determining which branches exist in a given dataset before running a full transformation, and for any lightweight exploration where only metadata or structure information is required without fetching event-level data.
Overview¶
The function internally issues a ServiceX request, using the Python function backend, for the specified dataset(s) and returns a simplified summary of the file structure, such as branches and types, in a string formatted for readability.
It can be used both programmatically and from the command line, and its return type depends on the arguments passed.
Function¶
get_structure(datasets, array_out=False, **kwargs)
Parameters:
datasets (dict, str, or list[str]): One or more datasets to inspect. Designed for Rucio Dataset Identifiers (DIDs). If a dictionary is used, keys are used as labels for each dataset in the output string.
array_out (bool): If True, empty Awkward Arrays are reconstructed from the structure information. The function returns a dictionary of ak.Array.type objects, allowing programmatic access to the dataset structure.
**kwargs: Additional arguments forwarded to the helper function print_structure_from_str, such as filter_branch to apply a filter to displayed branches, do_print to print the output during the function call, or save_to_txt to save the output to samples_structure.txt.
Returns:
str: The formatted file structure string.
None: If do_print or save_to_txt is True, the function prints or saves the output to a file instead of returning it.
dict: If array_out is True, keys are sample names and values are ak.Array.type objects with the same dataset structure.
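The three accepted forms of datasets can be thought of as collapsing to a single label-to-DID mapping. The sketch below illustrates that idea only: normalize_datasets is a hypothetical helper written for this page, not part of servicex_analysis_utils, and using the DID itself as the label for unlabelled inputs is an assumption.

```python
from typing import Dict, List, Union

DatasetSpec = Union[str, List[str], Dict[str, str]]


def normalize_datasets(datasets: DatasetSpec) -> Dict[str, str]:
    """Return a {label: DID} mapping for any accepted input form.

    Hypothetical helper for illustration; the real package may differ.
    """
    if isinstance(datasets, dict):
        # Keys are already the user-chosen labels.
        return dict(datasets)
    if isinstance(datasets, str):
        # A single DID: assume the DID doubles as its own label.
        return {datasets: datasets}
    if isinstance(datasets, list):
        # Several DIDs: each one is labelled by itself, in input order.
        return {did: did for did in datasets}
    raise TypeError(f"Unsupported datasets type: {type(datasets).__name__}")


labelled = normalize_datasets({"Signal": "mc23_13TeV:signal-dataset-rucio-id"})
unlabelled = normalize_datasets(["scope:dataset-rucio-id1", "scope:dataset-rucio-id2"])
print(labelled)
print(unlabelled)
```

Passing a dictionary is the only form that lets the caller choose the labels that appear in the output.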
Command-Line Usage¶
The function is also available as a CLI tool:
$ servicex-get-structure "scope:dataset-rucio-id" --filter_branch "el_"
This prints a summary of the dataset structure to the shell, restricted to branches whose names contain "el_".
$ servicex-get-structure "scope:dataset-rucio-id1" "scope:dataset-rucio-id2" --filter_branch "el_"
This outputs a combined summary of both datasets with the same filter applied.
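The filter described above keeps branches whose names contain the given text. A minimal sketch of that behaviour, assuming a plain substring match (the package's actual matching logic is not shown here); filter_branches and the branch names other than el_pt_NOSYS and runNumber are hypothetical:

```python
from typing import List


def filter_branches(branches: List[str], pattern: str) -> List[str]:
    """Keep branches whose name contains `pattern` (plain substring match)."""
    return [name for name in branches if pattern in name]


# Illustrative branch names; only some of them appear in this page's examples.
branches = ["el_pt_NOSYS", "el_eta", "mu_pt_NOSYS", "jet_pt_NOSYS", "runNumber"]

print(filter_branches(branches, "el_"))
print(filter_branches(branches, "el_pt"))
```

Under this assumption a longer pattern is strictly more selective, and an empty pattern keeps every branch.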
Practical Output Example¶
The following command filters dataset branches by the pattern el_pt:
Command:
$ servicex-get-structure user.mtost:user.mtost.all.Mar11 --filter-branch el_pt
Output on shell:
File structure of all samples with branch filter 'el_pt':
---------------------------
📁 Sample: user.mtost:user.mtost.all.Mar11
---------------------------
🌳 Tree: EventLoop_FileExecuted
├── Branches:
🌳 Tree: EventLoop_JobStats
├── Branches:
🌳 Tree: reco
├── Branches:
│ ├── el_pt_NOSYS ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_RESOLUTION_ALL__1down ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_RESOLUTION_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_SCALE_ALL__1down ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_SCALE_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
The output lists all trees and branch names matching the specified filter pattern for each requested dataset.
It shows the branch data type information as interpreted by uproot. This includes the vector nesting level (jagged arrays) and the base type (e.g., f4 for 32-bit floats).
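Because the summary is plain text, it can be turned back into a dictionary for scripting. The sketch below assumes the layout shown in the example above ("Tree:" lines followed by "name ; dtype: ..." lines) and handles a single sample; parse_structure is a hypothetical helper, and the layout is not a guaranteed interface.

```python
import re
from typing import Dict, Optional

TREE_RE = re.compile(r"Tree:\s*(\S+)")
BRANCH_RE = re.compile(r"──\s*(\S+)\s*;\s*dtype:\s*(.+)$")


def parse_structure(text: str) -> Dict[str, Dict[str, str]]:
    """Map each tree name to a {branch name: dtype string} dictionary."""
    trees: Dict[str, Dict[str, str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        tree_match = TREE_RE.search(line)
        if tree_match:
            # A new tree starts; later branch lines belong to it.
            current = tree_match.group(1)
            trees[current] = {}
            continue
        branch_match = BRANCH_RE.search(line)
        if branch_match and current is not None:
            trees[current][branch_match.group(1)] = branch_match.group(2).strip()
    return trees


sample_output = """\
🌳 Tree: EventLoop_JobStats
├── Branches:
🌳 Tree: reco
├── Branches:
│ ├── el_pt_NOSYS ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_SCALE_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
"""

parsed = parse_structure(sample_output)
print(sorted(parsed["reco"]))
```

Trees with no matching branches, such as EventLoop_JobStats here, end up with an empty dictionary.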
JSON Input¶
A JSON file can be used as input to simplify the command for multiple samples. The file path is passed as the sole argument to servicex-get-structure.
$ servicex-get-structure "path/to/datasets.json"
The JSON file maps sample labels to Rucio Dataset Identifiers (DIDs). For example, datasets.json might contain:
{
"Signal": "mc23_13TeV:signal-dataset-rucio-id",
"Background W+jets": "mc23_13TeV:background-dataset-rucio-id1",
"Background Z+jets": "mc23_13TeV:background-dataset-rucio-id2",
"Background Drell-Yan": "mc23_13TeV:background-dataset-rucio-id3"
}
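Since get_structure accepts a dictionary whose keys are used as labels, the same mapping can also be loaded in Python and passed on directly. A minimal sketch, using an inline JSON string in place of a file; inspect_from_mapping is a hypothetical wrapper, and it is defined but not called here because the call needs ServiceX access:

```python
import json

datasets_json = """
{
  "Signal": "mc23_13TeV:signal-dataset-rucio-id",
  "Background W+jets": "mc23_13TeV:background-dataset-rucio-id1"
}
"""

# json.loads raises json.JSONDecodeError if the text is not valid JSON.
datasets = json.loads(datasets_json)


def inspect_from_mapping(mapping):
    """Pass the label -> DID mapping to get_structure (needs ServiceX access)."""
    from servicex_analysis_utils import get_structure

    return get_structure(mapping)


print(list(datasets))
```

To read from a file instead, json.load on an open file handle returns the same kind of dictionary.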
Programmatic Example¶
Similar to the CLI, the output string containing the dataset structure can be retrieved programmatically:
from servicex_analysis_utils import get_structure
# Retrieve structure of a specific dataset
file_structure = get_structure("mc23_13TeV:some-dataset-rucio-id")
Other options¶
With do_print and save_to_txt, the dataset-structure string can instead be printed to standard output or saved to a text file in the current working directory.
from servicex_analysis_utils import get_structure
# Directly dump structure to std_out
get_structure("mc23_13TeV:some-dataset-rucio-id", do_print=True)
# Save to samples_structure.txt
get_structure("mc23_13TeV:some-dataset-rucio-id", save_to_txt=True)
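The effect of these two options on the return value can be pictured with a small stand-alone sketch. It mirrors the documented behaviour (the string is returned only when neither option is set) but is not the package's code; route_output and its out_path parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def route_output(
    text: str,
    do_print: bool = False,
    save_to_txt: bool = False,
    out_path: str = "samples_structure.txt",  # hypothetical parameter
) -> Optional[str]:
    """Print and/or save `text`; return it only when neither option is set."""
    if do_print:
        print(text)
    if save_to_txt:
        Path(out_path).write_text(text, encoding="utf-8")
    if do_print or save_to_txt:
        return None
    return text


returned = route_output("demo structure")
printed = route_output("demo structure", do_print=True)
print(returned is not None, printed is None)
```

A caller that wants both the printed summary and the string would therefore need two calls, or print the returned string itself.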
Return Awkward Array Type¶
If array_out is set to True, the function reconstructs dummy arrays with the correct structure and returns their ak.Array.type object.
from servicex_analysis_utils import get_structure
DS = {"sample1": "user.mtost:user.mtost.all.Mar11"}
ak_type = get_structure(DS, array_out=True)
rec = ak_type["sample1"].content  # get the RecordType
# Find index of reco tree and runNumber branch
reco_idx = rec.fields.index("reco")
branch_idx = rec.contents[reco_idx].fields.index("runNumber")
print("Type for branch 'runNumber':", rec.contents[reco_idx].contents[branch_idx])
Output:
Type for branch 'runNumber': var * int64
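The index lookups above can be wrapped in a small helper that finds a branch by tree and branch name. The sketch relies only on the fields and contents attributes used in the example; branch_type is a hypothetical helper, and FakeRecord is a stand-in used here so the snippet runs without ServiceX or Awkward Array installed.

```python
from dataclasses import dataclass
from typing import Any, List


def branch_type(record: Any, tree: str, branch: str) -> Any:
    """Look up the type of `branch` inside `tree` by field name.

    Raises ValueError (from list.index) if the tree or branch is missing.
    """
    tree_record = record.contents[record.fields.index(tree)]
    return tree_record.contents[tree_record.fields.index(branch)]


@dataclass
class FakeRecord:
    """Stand-in exposing `fields` and `contents` like a record type."""

    fields: List[str]
    contents: List[Any]


# Placeholder strings take the place of real type objects in this demo.
reco = FakeRecord(["runNumber", "el_pt_NOSYS"], ["var * int64", "<type of el_pt_NOSYS>"])
top = FakeRecord(["reco"], [reco])

print(branch_type(top, "reco", "runNumber"))
```

With the real return value, the equivalent call would be branch_type(ak_type["sample1"].content, "reco", "runNumber").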
Notes¶
The function does not retrieve event data — only structure/metadata.
When using JSON input to the CLI, the same branch filtering is applied to all samples.
Many types show as None or unknown when not interpretable by uproot or when reconstruction to ak.Array fails.