Remote File Introspection

The get_structure() function allows inspection of the internal structure of datasets available through ServiceX. This is useful for determining which branches exist in a given dataset before running a full transformation, and for any lightweight exploration where only metadata or structure information is required without fetching event-level data.


Overview

The function internally issues a ServiceX request, using the Python function backend, for the specified dataset(s) and returns a simplified summary of the file structure, such as branches and types, in a string formatted for readability.

It can be called from Python or from the command line, and its return type depends on the options passed.


Function

get_structure(datasets, array_out=False, **kwargs)

Parameters:

  • datasets (dict, str, or list[str]): One or more datasets to inspect. Designed for Rucio Dataset Identifiers (DIDs). If a dictionary is used, keys are used as labels for each dataset in the output string.

  • array_out (bool): If True, empty Awkward Arrays are reconstructed from the structure information, and the function returns a dictionary of ak.Array.type objects keyed by sample name. This allows programmatic access to the dataset structure. Defaults to False.

  • **kwargs: Additional arguments forwarded to the helper function print_structure_from_str, such as filter_branch to apply a filter to displayed branches, do_print to print the output during the function call, or save_to_txt to save the output to samples_structure.txt.

Returns:

  • str: The formatted file structure string. This is the default.

  • None: If do_print or save_to_txt is True, the output is printed or saved to a file instead of being returned.

  • dict: If array_out is True, a dictionary whose keys are sample names and whose values are ak.Array.type objects that mirror the dataset structure.
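
Because the return type depends on the options passed, calling code may need to branch on what it receives. The sketch below is an illustration only: describe_result is a hypothetical helper, not part of servicex_analysis_utils, and it is exercised with stand-in values so that it runs without contacting ServiceX.

```python
def describe_result(result):
    """Classify a get_structure() return value by its documented type."""
    if result is None:
        # do_print=True or save_to_txt=True: output was printed or saved
        return "printed or saved"
    if isinstance(result, dict):
        # array_out=True: one entry per sample label
        return "types for samples: " + ", ".join(sorted(result))
    if isinstance(result, str):
        # Default: the formatted structure string
        return f"structure string ({len(result.splitlines())} lines)"
    raise TypeError(f"unexpected return type: {type(result).__name__}")


# Stand-in values (hypothetical), used instead of real ServiceX responses
print(describe_result(None))
print(describe_result({"sample1": object()}))
print(describe_result("Tree: reco\n   Branches:"))
```

With a real call, the argument would be the value returned by get_structure(...).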


Command-Line Usage

The function is also available as a CLI tool:

$ servicex-get-structure "scope:dataset-rucio-id" --filter-branch "el_"

This prints a summary of the dataset structure to the shell, restricted to branches whose names contain "el_".

$ servicex-get-structure "scope:dataset-rucio-id1" "scope:dataset-rucio-id2" --filter-branch "el_"

This outputs a combined summary of both datasets with the same filter applied.


Practical Output Example

The following command filters dataset branches by the pattern el_pt:

Command:

$ servicex-get-structure user.mtost:user.mtost.all.Mar11 --filter-branch el_pt

Output on shell:

File structure of all samples with branch filter 'el_pt':

---------------------------
📁 Sample: user.mtost:user.mtost.all.Mar11
---------------------------

🌳 Tree: EventLoop_FileExecuted
   ├── Branches:

🌳 Tree: EventLoop_JobStats
   ├── Branches:

🌳 Tree: reco
   ├── Branches:
      ├── el_pt_NOSYS ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
      ├── el_pt_EG_RESOLUTION_ALL__1down ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
      ├── el_pt_EG_RESOLUTION_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
      ├── el_pt_EG_SCALE_ALL__1down ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
      ├── el_pt_EG_SCALE_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)

The output lists all trees and branch names matching the specified filter pattern for each requested dataset. It shows the branch data type information as interpreted by uproot. This includes the vector nesting level (jagged arrays) and the base type (e.g., f4 for 32-bit floats).
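
When the structure string is needed in a machine-readable form and array_out is not used, it can be parsed with a few regular expressions. The following is a minimal sketch that assumes the layout shown in the example output above ("Sample:", "Tree:" and "name ; dtype: ..." lines). parse_structure is a hypothetical helper, not part of servicex_analysis_utils, and the layout may differ between versions.

```python
import re

# Patterns inferred from the example output above (an assumption, not a spec)
SAMPLE_RE = re.compile(r"Sample:\s*(.+?)\s*$")
TREE_RE = re.compile(r"Tree:\s*(.+?)\s*$")
BRANCH_RE = re.compile(r"├──\s*(.+?)\s*;\s*dtype:\s*(.+?)\s*$")


def parse_structure(text):
    """Return {sample: {tree: {branch: dtype_string}}} from a structure string."""
    structure = {}
    trees = None      # trees of the sample currently being read
    branches = None   # branches of the tree currently being read
    for line in text.splitlines():
        if m := BRANCH_RE.search(line):
            if branches is not None:
                branches[m.group(1)] = m.group(2)
        elif m := SAMPLE_RE.search(line):
            trees = structure.setdefault(m.group(1), {})
            branches = None
        elif m := TREE_RE.search(line):
            if trees is None:
                # Tree listed before any sample header: file it under None
                trees = structure.setdefault(None, {})
            branches = trees.setdefault(m.group(1), {})
    return structure


# Excerpt of the example output above
EXAMPLE = """\
File structure of all samples with branch filter 'el_pt':

---------------------------
📁 Sample: user.mtost:user.mtost.all.Mar11
---------------------------

🌳 Tree: EventLoop_JobStats
   ├── Branches:

🌳 Tree: reco
   ├── Branches:
      ├── el_pt_NOSYS ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
      ├── el_pt_EG_SCALE_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
"""

parsed = parse_structure(EXAMPLE)
reco = parsed["user.mtost:user.mtost.all.Mar11"]["reco"]
print(sorted(reco))
```

Trees with no matching branches, such as EventLoop_JobStats here, end up with an empty dictionary.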

JSON Input

A JSON file can be used as input to simplify the command for multiple samples. The file path is passed to servicex-get-structure in place of the dataset identifiers.

$ servicex-get-structure "path/to/datasets.json"

The JSON file maps sample labels to Rucio Dataset Identifiers (DIDs). For example, datasets.json might contain:

{
  "Signal": "mc23_13TeV:signal-dataset-rucio-id",
  "Background W+jets": "mc23_13TeV:background-dataset-rucio-id1",
  "Background Z+jets": "mc23_13TeV:background-dataset-rucio-id2",
  "Background Drell-Yan": "mc23_13TeV:background-dataset-rucio-id3"
}
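
Such a file can also be generated from Python, which avoids hand-editing mistakes: json.dumps never emits trailing commas, which standard JSON parsers reject. The sketch below uses only the standard library. write_dataset_file is a hypothetical helper, the identifiers are the placeholders from the example above, and a temporary directory stands in for the real output location.

```python
import json
import tempfile
from pathlib import Path


def write_dataset_file(datasets, path):
    """Write a {label: DID} mapping as JSON and return the parsed-back content."""
    for label, did in datasets.items():
        if not isinstance(label, str) or not isinstance(did, str):
            raise TypeError("labels and DIDs must both be strings")
    path = Path(path)
    path.write_text(json.dumps(datasets, indent=2) + "\n", encoding="utf-8")
    # Reading the file back confirms that it parses as valid JSON
    return json.loads(path.read_text(encoding="utf-8"))


# Placeholder identifiers taken from the example above, not real datasets
samples = {
    "Signal": "mc23_13TeV:signal-dataset-rucio-id",
    "Background W+jets": "mc23_13TeV:background-dataset-rucio-id1",
}

with tempfile.TemporaryDirectory() as tmp:
    written = write_dataset_file(samples, Path(tmp) / "datasets.json")

print(written == samples)
```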

Programmatic Example

Similar to the CLI, the output string containing the dataset structure can be retrieved programmatically:

from servicex_analysis_utils import get_structure

# Retrieve structure of a specific dataset
file_structure = get_structure("mc23_13TeV:some-dataset-rucio-id")

Other options

With do_print and save_to_txt, the dataset-structure string is instead printed to stdout or written to a text file in the current working directory.

from servicex_analysis_utils import get_structure

# Print the structure directly to stdout
get_structure("mc23_13TeV:some-dataset-rucio-id", do_print=True)
# Save to samples_structure.txt
get_structure("mc23_13TeV:some-dataset-rucio-id", save_to_txt=True)

Return Awkward Array Type

If array_out is set to True, the function reconstructs dummy arrays with the correct structure and returns a dictionary of their ak.Array.type objects, keyed by sample name.

from servicex_analysis_utils import get_structure

DS = {"sample1": "user.mtost:user.mtost.all.Mar11"}
ak_type = get_structure(DS, array_out=True)

rec = ak_type["sample1"].content  # get the RecordType

# Find index of reco tree and runNumber branch
reco_idx = rec.fields.index("reco")
branch_idx = rec.contents[reco_idx].fields.index("runNumber")

print("Type for branch 'runNumber':", rec.contents[reco_idx].contents[branch_idx])

Output:

Type for branch 'runNumber': var * int64
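
The index lookups above can be wrapped in a small helper that returns every branch of a tree at once. This is a sketch only: branch_types is a hypothetical function, not part of servicex_analysis_utils, and it relies solely on the .fields and .contents attributes used in the example above. So that it runs without ServiceX or Awkward Array installed, it is demonstrated on stand-in objects that mimic those two attributes, with plain strings in place of real type objects.

```python
from types import SimpleNamespace


def branch_types(record_type, tree_name):
    """Return {branch_name: branch_type} for one tree of a record type.

    Raises ValueError if tree_name is not among the record's fields.
    """
    tree = record_type.contents[record_type.fields.index(tree_name)]
    return dict(zip(tree.fields, tree.contents))


# Stand-in objects (hypothetical): they only mimic .fields and .contents
reco = SimpleNamespace(fields=["runNumber"], contents=["var * int64"])
job_stats = SimpleNamespace(fields=[], contents=[])
rec = SimpleNamespace(
    fields=["EventLoop_JobStats", "reco"],
    contents=[job_stats, reco],
)

print(branch_types(rec, "reco"))
```

With a real result, the first argument would be ak_type["sample1"].content, as in the example above.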

Notes

  • The function retrieves only structure and metadata, not event data.

  • When using JSON input with the CLI, the same branch filter is applied to all samples.

  • Types are shown as None or unknown when uproot cannot interpret them or when reconstruction into an ak.Array fails.