Experimental data files
Data made available by experimental collaborations comes in a variety of
formats. For use in a fitting code, this data must be converted into a common
format that contains all the information required for PDF fitting. Existing
formats commonly used by the community, such as those in HepData, are generally
unsuitable, principally because they often do not fully describe the breakdown
of systematic uncertainties. Therefore, over several years an NNPDF standard
data format has been iteratively developed, now denoted CommonData. In addition
to the CommonData files themselves, in the nnpdf++ project the user can vary
the treatment of individual systematic errors by means of parameter files
denoted SYSTYPE files. In this section we detail the specifications of these
two file types.
In principle, the file specification and classes described in this section are
independent of the nnpdf++ project and may be generated by whatever means the
user sees fit. In practice, the CommonData and SYSTYPE files are generated by
the buildmaster project of nnpdf++ from the raw experimental data files.
Process types and kinematics
Before going into the file formats, we shall summarise the identifying features
used for data in the nnpdf++ code.
Each data point has an associated process type string. This can be specified by the user, but must begin with the appropriate identifying base process type. In addition, three kinematic values are given for each data point; the main purpose of the process type is to identify the nature of these values. Typically the first kinematic variable is the principal differential quantity used in the measurement, and the second defines the scale of the process. The third is generally the centre-of-mass energy of the process, or the inelasticity in the case of DIS. The allowed base process types and their corresponding three kinematic variables are listed below.
DIS - Deep inelastic scattering measurements: \((x,Q^2,y)\)
DYP - Fixed-target Drell-Yan measurements: \((y,M^2,\sqrt{s})\)
JET - Jet production: \((\eta,p_T^2,\sqrt{s})\)
DIJET - Dijet production: \((\eta,m_{12},\sqrt{s})\)
PHT - Photon production: \((\eta_\gamma,E_{T,\gamma}^2,\sqrt{s})\)
INC - A total inclusive cross-section: \((0,\mu^2,\sqrt{s})\)
EWK_RAP - Collider electroweak rapidity distribution: \((\eta/y,M^2,\sqrt{s})\)
EWK_PT - Collider electroweak \(p_T\) distribution: \((p_T,M^2,\sqrt{s})\)
EWK_PTRAP - Collider electroweak \(p_T, y\) distribution: \((\eta/y, p_T^2,\sqrt{s})\)
EWK_MLL - Collider electroweak lepton-pair mass distribution: \((M_{ll},M_{ll}^2,\sqrt{s})\)
EWJ_(J)RAP - Collider electroweak + jet boson(jet) rapidity distribution: \((\eta/y,M^2,\sqrt{s})\)
EWJ_(J)PT - Collider electroweak + jet boson(jet) \(p_T\) distribution: \((p_T,M^2,\sqrt{s})\)
EWJ_(J)PTRAP - Collider electroweak + jet boson(jet) \(p_T, y\) distribution: \((\eta/y, p_T^2,\sqrt{s})\)
EWJ_MLL - Collider electroweak+jet lepton-pair mass distribution: \((M_{ll},M_{ll}^2,\sqrt{s})\)
HQP_YQQ - Heavy diquark system rapidity: \((y^{QQ},\mu^2,\sqrt{s})\)
HQP_MQQ - Heavy diquark system mass: \((M^{QQ},\mu^2,\sqrt{s})\)
HQP_PTQQ - Heavy diquark system \(p_T\): \((p_T^{QQ},\mu^2,\sqrt{s})\)
HQP_YQ - Heavy quark rapidity: \((y^Q,\mu^2,\sqrt{s})\)
HQP_PTQ - Heavy quark \(p_T\): \((p_T^Q,\mu^2,\sqrt{s})\)
HIG_RAP - Higgs boson rapidity distribution: \((y,M_H^2,\sqrt{s})\)
As examples of process type strings, consider EWK_RAP for a collider \(W\) boson asymmetry measurement binned in rapidity, and DIS_F2P for the \(F_2^p\) structure function in DIS. The user is free to choose something identifying for the second segment of the process type, the important feature being the basic process type. However, users are encouraged to only use this freedom when absolutely necessary (such as when used in combination with APFEL).
One special case is that of \(W\) boson lepton asymmetry measurements, which being cross-section asymmetries may occasionally have negative data points. Therefore asymmetry measurements must have the final tag ASY to ensure that artificial data generation permits negative data values. An example process type string would be EWK_RAP_ASY.
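The naming rules above can be sketched in a few lines of Python. This is only an illustration, not the nnpdf++ implementation: the dictionary holds a subset of the base process types listed above, the kinematic labels are plain-text stand-ins for the symbols, the helper names are hypothetical, and the assumption that segments are separated by underscores is inferred from the examples DIS_F2P and EWK_RAP_ASY.

```python
# Subset of the base process types listed above, with plain-text labels
# for their three kinematic variables (labels are illustrative only).
KINEMATICS = {
    "DIS": ("x", "Q2", "y"),
    "DYP": ("y", "M2", "sqrts"),
    "JET": ("eta", "pT2", "sqrts"),
    "INC": ("0", "mu2", "sqrts"),
    "EWK_RAP": ("eta/y", "M2", "sqrts"),
    "EWK_PT": ("pT", "M2", "sqrts"),
    "EWK_PTRAP": ("eta/y", "pT2", "sqrts"),
    "HIG_RAP": ("y", "MH2", "sqrts"),
}


def base_process_type(process_type):
    """Return the longest known base type that the string begins with.

    Assumption: extra segments are appended after an underscore.
    """
    matches = [
        base for base in KINEMATICS
        if process_type == base or process_type.startswith(base + "_")
    ]
    if not matches:
        raise ValueError(f"unknown base process type in {process_type!r}")
    return max(matches, key=len)


def allows_negative_data(process_type):
    """Asymmetry measurements carry the final tag ASY."""
    return process_type.split("_")[-1] == "ASY"


for name in ("DIS_F2P", "EWK_RAP_ASY", "EWK_PTRAP"):
    base = base_process_type(name)
    print(name, base, KINEMATICS[base], allows_negative_data(name))
```

Matching on the longest prefix followed by an underscore keeps EWK_PT and EWK_PTRAP distinct, which a bare prefix test would not.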
Notes for the future
In the future it would be nice to have a more flexible treatment of the kinematic variables, both in their number and labelling.
CommonData file format
Each experimental Dataset has its own CommonData file. CommonData files contain
the bulk of the experimental information used in the nnpdf++ project, with the
only other experimental data files controlling the treatment and correlation of
systematic errors. Each CommonData file is a plaintext file whose layout is
described in the following.
The first line begins with the Dataset name, the number of systematic errors, and the number of data points in the set, whitespace separated. For example, for the ATLAS 2010 jet measurement the first line of the file reads:
ATLASR04JETS36PB 91 90
This shows that the set name is ‘ATLASR04JETS36PB’, that there are 91 sources
of systematic uncertainty, and that there are 90 data points. As another
example, consider the NMCPD Dataset:
NMCPD 5 211
Here there are 5 sources of systematic uncertainty and 211 data points. Following this, each line specifies the details of a single data point. The first value is the data point index \(1 \leq i_{\text{dat}} \leq N_{\mathrm{dat}}\), followed by the process type string as outlined above, and the three kinematic variables in order. These are followed by the value of the experimental data point itself, and the value of the statistical uncertainty associated with it (absolute value). Finally the systematic uncertainties are specified. The layout per data point is therefore
\(i_{\mathrm{dat}}\) ProcessType \(\text{kin}_1 \text{kin}_2 \text{kin}_3\) data_value stat_error \([..\) systematics \(..]\)
For example, in the case of a DIS data point from the BCDMSD Dataset:
1 DIS_F2D 7.0e-02 8.75e+00 5.666e-01 3.6575e-01 6.43e-03 \([..\) systematics \(..]\)
In these lines the systematic uncertainties are laid out as follows. For each uncertainty, additive and multiplicative versions are given. The additive uncertainty is given as an absolute value, and the multiplicative as a percentage of the data value (that is, the relative error multiplied by 100). The systematics string is formed by the sequence of \(N_{\text{sys}}\) pairs of systematic uncertainties:
\([..\) systematics \(..] = \sigma^{\mathrm{add}}_0 \quad \sigma^{\mathrm{mul}}_0\quad \sigma^{\mathrm{add}}_1 \quad \sigma^{\mathrm{mul}}_1 \quad....\quad \sigma^{\mathrm{add}}_n \quad\sigma^{\mathrm{mul}}_n\)
where \(\sigma^{\mathrm{add}}_i\) and \(\sigma^{\mathrm{mul}}_i\) are the additive
and multiplicative versions respectively of the systematic uncertainty arising
from the \(i\text{th}\) source. While it may seem at first that the multiplicative error
is spurious given the presence of the additive error and data central value,
this may not be the case. For example, in a closure test scenario, the data
central values may have been replaced in the CommonData file by theoretical
predictions. Therefore if you wish to use a covariance matrix generated with
the original multiplicative uncertainties via the \(t_0\) method, you must also
store the original multiplicative (percentage) error. For flexibility and ease
of I/O this is therefore done in the CommonData file itself.
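A small numerical sketch may make this point concrete. All numbers below are hypothetical and the helper names are not part of nnpdf++; the only relation taken from the text is that the multiplicative uncertainty is the relative error multiplied by 100.

```python
def percent_from_absolute(sigma_add, central_value):
    """Multiplicative (percentage) form of an absolute uncertainty."""
    return 100.0 * sigma_add / central_value


def absolute_from_percent(sigma_mul, central_value):
    """Absolute uncertainty implied by a percentage of a central value."""
    return sigma_mul * central_value / 100.0


# Hypothetical numbers, chosen only for illustration.
data_value = 2.0   # original experimental central value
sigma_add = 0.25   # absolute systematic uncertainty
sigma_mul = percent_from_absolute(sigma_add, data_value)  # 12.5 (percent)

# Closure test: the central value in the file is replaced by a prediction.
closure_value = 2.5

# Recomputing the percentage from the stored absolute error no longer
# reproduces the original multiplicative uncertainty ...
recomputed = percent_from_absolute(sigma_add, closure_value)  # 10.0, not 12.5

# ... whereas the stored percentage can still be applied to any central value.
rescaled = absolute_from_percent(sigma_mul, closure_value)  # 0.3125

print(sigma_mul, recomputed, rescaled)
```

Once the central value has changed, the original percentage cannot be recovered from the additive column alone, which is why both columns are kept.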
For a Dataset with \(N_{\text{dat}}\) data points and \(N_{\text{sys}}\)
sources of systematic uncertainty, the total CommonData file should therefore
be \(N_{\text{dat}}+1\) lines long. Its first line contains the set parameters,
and every subsequent line should consist of the description of a single data
point. Each data point line should therefore contain
\(7 + 2N_{\text{sys}}\) columns.
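The layout described in this section can be read and checked with a short script. The following is a minimal sketch under stated assumptions rather than the nnpdf++ reader: the `DataPoint` class and `parse_commondata` function are hypothetical names, the set TOYSET is invented, and only the first seven fields of its first data line are taken from the BCDMSD example above (all systematic values and the second point are made up).

```python
from dataclasses import dataclass


@dataclass
class DataPoint:
    index: int
    process_type: str
    kinematics: tuple
    value: float
    stat_error: float
    sys_add: list  # absolute values
    sys_mul: list  # percentages of the data value


def parse_commondata(text):
    """Parse a CommonData-style text and check its line and column counts."""
    lines = [line for line in text.splitlines() if line.strip()]
    name, nsys, ndat = lines[0].split()
    nsys, ndat = int(nsys), int(ndat)
    if len(lines) != ndat + 1:
        raise ValueError(f"expected {ndat + 1} lines, found {len(lines)}")
    points = []
    for line in lines[1:]:
        cols = line.split()
        if len(cols) != 7 + 2 * nsys:
            raise ValueError(f"expected {7 + 2 * nsys} columns: {line!r}")
        systematics = [float(c) for c in cols[7:]]
        points.append(DataPoint(
            index=int(cols[0]),
            process_type=cols[1],
            kinematics=tuple(float(c) for c in cols[2:5]),
            value=float(cols[5]),
            stat_error=float(cols[6]),
            sys_add=systematics[0::2],  # additive entries come first in each pair
            sys_mul=systematics[1::2],
        ))
    return name, nsys, points


# Invented two-point set with two systematics per point.
example = """TOYSET 2 2
1 DIS_F2D 7.0e-02 8.75e+00 5.666e-01 3.6575e-01 6.43e-03 1.0e-03 0.27 2.0e-03 0.55
2 DIS_F2D 1.0e-01 8.75e+00 3.966e-01 3.4000e-01 5.00e-03 1.0e-03 0.29 2.0e-03 0.59
"""

name, nsys, points = parse_commondata(example)
print(name, nsys, len(points), points[0].sys_add, points[0].sys_mul)
```

Checking the line and column counts up front catches a header that disagrees with the body before any uncertainty is misassigned.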
SYSTYPE file format
The explicit presentation of the systematic uncertainties in the CommonData
file allows for a great deal of flexibility in the treatment of these errors:
specifically, whether they should be treated as additive or multiplicative
uncertainties, and how they are correlated, both within the Dataset and within
a larger Experiment. A specification for how the systematic uncertainties
should be treated is provided by a SYSTYPE file. As there is not always an
unambiguous method for the treatment of these uncertainties, this information
is kept outside the (unambiguous) CommonData file. Several options for this
treatment are often provided in the form of multiple SYSTYPE files, which may
be selected between in the fit.
Each SYSTYPE file begins with a line specifying the total number of
systematics. Naturally this must match the \(N_{\text{sys}}\) variable
specified in the associated CommonData file. This is presented as a single
integer. For example, in the case of the BCDMSD SYSTYPE files, the first line
is
8
as there are \(N_{\text{sys}}=8\) sources of systematic uncertainty for this Dataset. Following this line there are \(N_{\text{sys}}\) lines describing each source of systematic uncertainty. For each source two parameters are provided, the uncertainty treatment and the uncertainty description. These are laid out for each systematic as:
\(i_{\text{sys}}\) [uncertainty treatment] [uncertainty description]
where \(1 \leq i_{\text{sys}} \leq N_{\mathrm{sys}}\) enumerates each
systematic. The uncertainty treatment determines whether the uncertainty should
be treated as additive, multiplicative, or in cases where the choice is
unclear, as randomised on a replica by replica basis. These choices are
selected by using the strings ADD, MULT, or RAND. The uncertainty description
specifies how the systematic is to be correlated with other data points. There
are five special cases for the uncertainty description, specified by the
strings CORR, UNCORR, THEORYCORR, THEORYUNCORR and SKIP. The first two specify
whether the systematic is fully correlated only within the Dataset (CORR), or
whether the systematic is totally uncorrelated (UNCORR). The THEORYCORR and
THEORYUNCORR descriptors are used to describe theoretical systematics due to
e.g. missing NNLO corrections, which are treated as either CORR or UNCORR
according to their suffix, but are not included in the generation of artificial
replicas (their only contribution is to the fitting error function). If the
user wishes to correlate a specific uncertainty between multiple Datasets
within an Experiment, then they should use a custom uncertainty description.
When building a covariance matrix for an Experiment, the nnpdf++ code checks
for matches between the uncertainty descriptions of systematics of its
constituent Datasets. If a match is found, the code will correlate those
systematics over the relevant datasets. The SKIP descriptor removes the
systematic from the covariance matrices for debugging purposes.
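The rules for a SYSTYPE file can be expressed as a small validating reader. This is a hedged sketch rather than the nnpdf++ reader: the function names are hypothetical, the three-systematic file in the example is invented (including the custom description TOYNORM), and it assumes the index \(i_{\text{sys}}\) runs consecutively from 1.

```python
TREATMENTS = {"ADD", "MULT", "RAND"}
SPECIAL_DESCRIPTIONS = {"CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR", "SKIP"}


def parse_systype(text):
    """Return a list of (treatment, description) pairs, one per systematic."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    nsys = int(rows[0][0])
    if len(rows) - 1 != nsys:
        raise ValueError(f"header says {nsys} systematics, found {len(rows) - 1}")
    entries = []
    for expected_index, cols in enumerate(rows[1:], start=1):
        if len(cols) != 3:
            raise ValueError(f"expected 3 fields, got {cols!r}")
        index, treatment, description = int(cols[0]), cols[1], cols[2]
        if index != expected_index:
            raise ValueError(f"expected index {expected_index}, got {index}")
        if treatment not in TREATMENTS:
            raise ValueError(f"unknown treatment {treatment!r}")
        entries.append((treatment, description))
    return entries


def is_custom(description):
    """Custom descriptions are the ones that can be matched across Datasets."""
    return description not in SPECIAL_DESCRIPTIONS


# Invented example file with three systematics.
example = """3
1 ADD TOYNORM
2 MULT CORR
3 RAND UNCORR
"""

entries = parse_systype(example)
print(entries)
print([desc for _, desc in entries if is_custom(desc)])
```

A real reader would also compare the header value with \(N_{\text{sys}}\) from the associated CommonData file, as required above.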
As an example, let us consider an NNPDF2.3 standard SYSTYPE for the BCDMSD
Dataset:
8
1 ADD BCDMSFB
2 ADD BCDMSFS
3 ADD BCDMSFR
4 MULT BCDMSNORM
5 MULT BCDMSRELNORMTARGET
6 MULT CORR
7 MULT CORR
8 MULT CORR
Here the first five systematics have custom uncertainty descriptions,
thereby allowing them to be cross-correlated with other Datasets in a
larger Experiment. Systematics six to eight are specified as being fully
correlated, but only within the BCDMSD Dataset. Additionally note that
the first three systematics are specified as additive, and the remainder are
multiplicative. If we compare now to the equivalent SYSTYPE file for the
BCDMSP Dataset:
11
1 ADD BCDMSFB
2 ADD BCDMSFS
3 ADD BCDMSFR
4 MULT BCDMSNORM
5 MULT BCDMSRELNORMTARGET
6 MULT CORR
7 MULT CORR
8 MULT CORR
9 MULT CORR
10 MULT CORR
11 MULT CORR
it is clear that the first five systematics are the same as in the BCDMSD
Dataset, and therefore should the two sets be combined into a common
Experiment, the code will cross-correlate them appropriately. The combination
of SYSTYPE and CommonData is quite flexible. Once generated from the original
raw experimental data, the CommonData file is fixed and should not be altered
except to correct errors. In practice the full details of the systematic
correlations and their treatment are often not precisely specified. This system
allows for the safe variation of these parameters for testing purposes.
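The matching of uncertainty descriptions between Datasets can be illustrated as follows. This sketch only mimics the name matching described in this section; it is not the nnpdf++ covariance matrix code, and the function name `shared_systematics` is hypothetical. The description lists are those of the BCDMSD and BCDMSP SYSTYPE files discussed above.

```python
SPECIAL_DESCRIPTIONS = {"CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR", "SKIP"}

# Uncertainty descriptions, in order, for the two Datasets discussed above.
CUSTOM = ["BCDMSFB", "BCDMSFS", "BCDMSFR", "BCDMSNORM", "BCDMSRELNORMTARGET"]
datasets = {
    "BCDMSD": CUSTOM + ["CORR"] * 3,  # 8 systematics
    "BCDMSP": CUSTOM + ["CORR"] * 6,  # 11 systematics
}


def shared_systematics(datasets):
    """Map each custom description found in more than one Dataset to the
    (dataset name, systematic index) pairs where it occurs (1-based index)."""
    occurrences = {}
    for name, descriptions in datasets.items():
        for index, description in enumerate(descriptions, start=1):
            if description not in SPECIAL_DESCRIPTIONS:
                occurrences.setdefault(description, []).append((name, index))
    return {
        description: where
        for description, where in occurrences.items()
        if len({name for name, _ in where}) > 1
    }


shared = shared_systematics(datasets)
for description, where in shared.items():
    print(description, where)
```

The CORR entries are skipped on purpose: they are correlated only within their own Dataset, so they never produce a match across the Experiment.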