Experimental data files
Data made available by experimental collaborations comes in a variety of
formats. For use in a fitting code, this data must be converted into a common
format that contains all the information required for PDF fitting.
Existing formats commonly used by the community, such as those in HepData,
are generally unsuitable, principally because they often do not fully describe the
breakdown of systematic uncertainties. Therefore, over several years an NNPDF
standard data format has been iteratively developed, now denoted CommonData.
This documentation describes the CommonData format used in NNPDF starting from code version 4.0.10 and compatible with releases beyond 4.0.
Naming convention and organization of the datasets
All datasets in the new data format follow the exact same naming convention:
<setname>_<observable>
where the setname is defined by:
<experiment>_<process>_<energy>{_<extras>}
The naming convention for the set names is defined in the naming convention documentation.
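As an illustration, the LHCB measurement used as an example later on this page decomposes as follows:
LHCB_Z0_13TEV              # setname: experiment LHCB, process Z0, energy 13TEV, no extras
LHCB_Z0_13TEV_DIMUON-Y     # full dataset name: setname plus the observable DIMUON-Y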
Each <setname> defines a folder in which the data is contained.
While the separation of data into different folders can be somewhat arbitrary,
a folder cannot contain more than one hepdata entry,
nor datasets that mix different processes, energies or experiments.
For historical reasons and for backwards compatibility, the special energy NOTFIXED
is used for datasets in which more than one center-of-mass energy is used.
When in doubt, it is preferable to use two different folders.
The <extras> string is free and can be used to disambiguate.
The data downloaded or parsed from hepdata or other sources is kept in the
<setname>/rawdata folder and is not installed with the rest of the code.
Each folder must contain a <setname>/metadata.yaml file, described below,
which defines all datasets implemented within the folder.
Only .yaml files are allowed to be installed together with the nnpdf code.
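As an orientation sketch (the file names are those used in the example metadata below; the corresponding files for the DIELECTRON-Y observable would live in the same folder), a dataset folder may then look like:
LHCB_Z0_13TEV/
    metadata.yaml
    data_dimuon.yaml
    kinematics_dimuon.yaml
    uncertainties_dimuon.yaml
    rawdata/    # raw hepdata tables, not installed with the code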
In order to keep backwards compatibility and allow the reproducibility of the 4.0 family of fits,
a dataset_names.yml file keeps a mapping of the dataset names that were used in 4.0.
When the old names are used in a runcard, validphys will automatically translate
them using this file.
The format of this mapping is as follows:
old_name_1:
  dataset: new_name_1
  variant: legacy
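For instance, a runcard entry using a hypothetical old name would be read, thanks to this mapping, as the corresponding new dataset with the legacy variant (a sketch, with placeholder names):
dataset_inputs:
  - {dataset: old_name_1}   # interpreted by validphys as new_name_1 with variant: legacy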
CommonData Metadata specification
The metadata.yaml file defines unequivocally the datasets implemented within a folder.
The general structure is a first portion of general information (references, name of the set)
followed by a list of implemented_observables, each of which defines a separate dataset.
Observable specific information
Within a metadata.yaml we can find one or more implemented datasets.
These correspond to different observables of a single measurement.
For instance, the LHCB publication of Z rapidity measurements at 13 TeV
(setname: LHCB_Z0_13TEV) contains two observables: the Z decay into two electrons
and the Z decay into two muons.
This setname therefore contains two datasets: LHCB_Z0_13TEV_DIELECTRON-Y
and LHCB_Z0_13TEV_DIMUON-Y.
In the following we describe the metadata corresponding to one observable within the metadata.yaml file.
implemented_observables:
  - observable_name: "DIMUON-Y"
    process_type: "EWK_RAP"
    tables: [5]
    ndata: 18
    observable:
      description: "Differential cross-section of Z-->µµ as a function of Z-rapidity"
      label: r"$d\sigma / d|y|$"
      units: "[fb]"
    kinematics:
      file: kinematics_dimuon.yaml
      variables:
        y: {description: "Z boson rapidity", label: "$y$", units: ""}
        M2: {description: "Z boson Mass", label: "$M^2$", units: "$GeV^2$"}
        sqrts: {description: "Center of Mass Energy", label: '$\sqrt{s}$', units: "$GeV$"}
    kinematic_coverage: [y, M2, sqrts]
    data_central: data_dimuon.yaml
    data_uncertainties:
      - uncertainties_dimuon.yaml
    variants:
      example_variant:
        data_uncertainties:
          - uncertainties_different_treatment.yaml
    theory:
      FK_tables:
        - - LHCB_DY_13TEV_DIMUON
      operation: 'null'
      conversion_factor: 1000.0
    # Plotting information
    plotting:
      dataset_label: "LHCb $Z\\to µµ$"
      plot_x: y
      y_label: '$d\sigma_{Z}/dy$ (fb)'
observable_name
The observable name is used to construct the full name of the dataset, <setname>_<observable_name>.
It must be unique within a set and must not contain underscores (_), as they could lead to confusion.
process_type
One of the processes defined in the process_options module at
validphys/src/validphys2/process_options.py.
This is used internally by validphys to describe the combination of observable
and process in various plots, to check that the kinematic variables used by the
dataset are sensible, and to generate derived plots such as the x-Q2
kinematic coverage plots.
tables
The tables from the hepdata entry that have been used to construct the dataset.
ndata
Number of datapoints in the dataset. While this quantity could be derived from the data itself, many other pieces (crucially, backwards compatibility with cuts and theories) require the number of datapoints to be set in stone. If an update requires changing the number of datapoints, it should be added as a separate observable.
observable
This is a dictionary with the entries description, label and units.
All entries must be LaTeX-compilable as they are used by various plotting routines in validphys.
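For instance, the observable entry of the example above reads:
observable:
  description: "Differential cross-section of Z-->µµ as a function of Z-rapidity"
  label: r"$d\sigma / d|y|$"
  units: "[fb]"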
kinematics::file
A reference to a .yaml file containing all kinematic information.
The file contains a list of ndata bins, each of which includes information about
all kinematic variables.
When mid is not given, it will be automatically filled with the midpoint between min and max.
Only mid is used for cuts, while min and max may be used by plotting routines.
bins:
  - var_1:
      min: 0
      max: 1
      mid: 0.5
    var_2:
      min: 0
      max: 1
      mid: 0.5
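As a sketch of the automatic filling mentioned above, a bin given only by its edges would be completed with the midpoint:
bins:
  - var_1:
      min: 0.0
      max: 1.0
      # mid not given: filled automatically as (min + max) / 2 = 0.5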
kinematics::variables
Metadata for each of the variables contained in the kinematics::file, consisting of a
description, a label and units.
LaTeX syntax is accepted and encouraged, since these entries are used by plotting routines.
variables:
  var_1: {description: "my var 1", label: "$m$", units: "GeV"}
kinematic_coverage
A list of the variables contained in the kinematics file.
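In the example above this is simply the following, where every name corresponds to an entry of kinematics::variables:
kinematic_coverage: [y, M2, sqrts]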
data_central
A reference to a yaml file containing the central values of the measurement.
The file consists of a single entry, data_central, which lists the central value for every bin.
data_central:
  - val1
  - val2
  - val3
data_uncertainties
A list of .yaml files containing the uncertainty information for the measurement.
When more than one uncertainty file is used, they will be concatenated.
This gives the user the flexibility to create variants
in which only a subset of the uncertainties is modified.
Each uncertainty file consists of two fields: a definitions field, which contains
metadata about every uncertainty (its name, its treatment (ADD or MULT) and its type),
and a bins field, which is a list of ndata mappings with the named uncertainties.
Note that, regardless of their treatment, uncertainties should always be written as absolute values and not relative to the data values. If the data is updated, the uncertainties have to be updated too.
definitions:
  stat:
    description: statistical error
    treatment: ADD
    type:
  error_name:
    description: an additive uncertainty
    treatment: ADD
    type:
  error_name_2:
    description: a multiplicative uncertainty
    treatment: MULT
    type:
bins:
  - stat: 1.0
    error_name: 2.0
    error_name_2: 3.0
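As a sketch of the concatenation described above, a second, hypothetical, uncertainty file listed in data_uncertainties could contribute one additional source of uncertainty; its columns are simply appended to those of the first file:
# hypothetical second uncertainty file
definitions:
  lumi:
    description: luminosity uncertainty
    treatment: MULT
    type:
bins:
  - lumi: 0.5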
variants
On some occasions we might want to maintain two variations of the same observable; for instance, we might have two incompatible sources of uncertainties. In such a case a variant can be added. Variants can overwrite certain keys if necessary: when a variant is used, the key defined under the variant is used instead of the key defined in the observable.
A variant can only overwrite the entries data_central, theory, and data_uncertainties.
If the kinematics or the number of points change, it should be considered a different observable.
Example:
data_uncertainties:
  - uncertainties.yaml
variants:
  name_of_the_variant:
    data_uncertainties:
      - uncertainties.yaml
      - extra_uncertainties.yaml
  another_variant:
    data_central: different_data.yaml
    data_uncertainties:
      - different_uncertainties.yaml
When loading this dataset with no variant, only the uncertainties.yaml file will be read.
When choosing variant: name_of_the_variant instead, both uncertainties.yaml
and extra_uncertainties.yaml will be loaded.
If we select variant: another_variant, both the data_uncertainties
and the data_central keys will be substituted.
Note that if we want to substitute (rather than extend) the default set of uncertainties, we simply do not include it in the variant (as done in another_variant).
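As a sketch of how a variant is then selected, a validphys runcard entry would include the variant key alongside the dataset name (placeholder names from the example above):
dataset_inputs:
  - {dataset: "<setname>_<observable_name>", variant: name_of_the_variant}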
theory
The theory field defines how predictions for the dataset are to be computed. It includes two entries:
FK_tables: a list of lists which defines the FK tables to be loaded. The outermost list contains the operands (in case an operation is needed to recover the observable, more on that below), while each innermost list contains the grids that are to be concatenated in order to form that operand.
operation: the operation to be applied in order to compute the observable. If no operation is needed it can be written as 'null' or None. validphys currently supports RATIO, ASY, ADD, SMN, COM, SMT and NULL.
Example:
theory:
  FK_tables:
    - - Z_contribution
      - Wp_contribution
      - Wm_total
    - - total_xs
  operation: 'ratio'
In this case the fktables for the Z, W+ and W- contributions will be concatenated (the dataset might include predictions for all three contributions).
After that, the final observable will be computed by taking the ratio of this concatenation to the total cross section (total_xs).
plotting
The plotting section defines the plotting style inside validphys and is described in detail in Plotting format.
Note that the names of the variables need to be the same in the plotting and kinematics sections.