# Experimental data files

Data made available by experimental collaborations comes in a variety of formats. For use in a fitting code, this data must be converted into a common format that contains all the information required for PDF fitting. Existing formats commonly used by the community, such as those in HepData, are generally unsuitable, principally because they often do not fully describe the breakdown of systematic uncertainties. Therefore, over several years an NNPDF standard data format has been iteratively developed, now denoted CommonData. In addition to the CommonData files themselves, in the nnpdf++ project the user can vary the treatment of individual systematic errors by use of parameter files denoted SYSTYPE files. In this section we detail the specifications of these two files.

In principle, the file specification and classes described in this section are independent of the nnpdf++ project and may be generated by whatever means the user sees fit. In practice, the CommonData and SYSTYPE files are generated by the buildmaster project of nnpdf++ from the raw experimental data files.

## Process types and kinematics

Before going into the file formats, we shall summarise the identifying features used for data in the nnpdf++ code.

Each data point has an associated process type string. This can be specified by the user, but must begin with the appropriate identifying base process type. Additionally, three kinematic values are given for each data point; the main role of the process type is to identify the nature of these values. Typically the first kinematic variable is the principal differential quantity used in the measurement. The second kinematic variable defines the scale of the process. The third is generally the centre-of-mass energy of the process, or the inelasticity in the case of DIS. The allowed base process types and their corresponding three kinematic variables are outlined below.

• DIS - Deep inelastic scattering measurements: $$(x,Q^2,y)$$

• DYP - Fixed-target Drell-Yan measurements: $$(y,M^2,\sqrt{s})$$

• JET - Jet production: $$(\eta,p_T^2,\sqrt{s})$$

• DIJET - Dijet production: $$(\eta,m_{12},\sqrt{s})$$

• PHT - Photon production: $$(\eta_\gamma,E_{T,\gamma}^2,\sqrt{s})$$

• INC - A total inclusive cross-section: $$(0,\mu^2,\sqrt{s})$$

• EWK_RAP - Collider electroweak rapidity distribution: $$(\eta/y,M^2,\sqrt{s})$$

• EWK_PT - Collider electroweak $$p_T$$ distribution: $$(p_T,M^2,\sqrt{s})$$

• EWK_PTRAP - Collider electroweak $$p_T, y$$ distribution: $$(\eta/y, p_T^2,\sqrt{s})$$

• EWK_MLL - Collider electroweak lepton-pair mass distribution: $$(M_{ll},M_{ll}^2,\sqrt{s})$$

• EWJ_(J)RAP - Collider electroweak + jet boson(jet) rapidity distribution: $$(\eta/y,M^2,\sqrt{s})$$

• EWJ_(J)PT - Collider electroweak + jet boson(jet) $$p_T$$ distribution: $$(p_T,M^2,\sqrt{s})$$

• EWJ_(J)PTRAP - Collider electroweak + jet boson(jet) $$p_T, y$$ distribution: $$(\eta/y, p_T^2,\sqrt{s})$$

• EWJ_MLL - Collider electroweak+jet lepton-pair mass distribution: $$(M_{ll},M_{ll}^2,\sqrt{s})$$

• HQP_YQQ - Heavy diquark system rapidity $$(y^{QQ},\mu^2,\sqrt{s})$$

• HQP_MQQ - Heavy diquark system mass $$(M^{QQ},\mu^2,\sqrt{s})$$

• HQP_PTQQ - Heavy diquark system $$p_T$$ $$(p_T^{QQ},\mu^2,\sqrt{s})$$

• HQP_YQ - Heavy quark rapidity $$(y^Q,\mu^2,\sqrt{s})$$

• HQP_PTQ - Heavy quark $$p_T$$ $$(p_T^Q,\mu^2,\sqrt{s})$$

• HIG_RAP - Higgs boson rapidity distribution $$(y,M_H^2,\sqrt{s})$$

As examples of process type strings, consider EWK_RAP for a collider $$W$$ boson asymmetry measurement binned in rapidity, and DIS_F2P for the $$F_2^p$$ structure function in DIS. The user is free to choose something identifying for the second segment of the process type, the important feature being the basic process type. However, users are encouraged to only use this freedom when absolutely necessary (such as when used in combination with APFEL).

One special case is that of $$W$$ boson lepton asymmetry measurements, which being cross-section asymmetries may occasionally have negative data points. Therefore asymmetry measurements must have the final tag ASY to ensure that artificial data generation permits negative data values. An example process type string would be EWK_RAP_ASY.
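The naming rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the nnpdf++ implementation: the function name `describe_process`, the plain-text kinematic labels, and the prefix-matching rule are assumptions, and only a subset of the base process types is listed.

```python
# Subset of the base process types listed above, with plain-text labels
# for the three kinematic variables (the labels are illustrative).
KINEMATICS = {
    "DIS": ("x", "Q2", "y"),
    "DYP": ("y", "M2", "sqrt(s)"),
    "JET": ("eta", "pT2", "sqrt(s)"),
    "INC": ("0", "mu2", "sqrt(s)"),
    "EWK_RAP": ("eta/y", "M2", "sqrt(s)"),
    "EWK_PT": ("pT", "M2", "sqrt(s)"),
    "EWK_PTRAP": ("eta/y", "pT2", "sqrt(s)"),
    "HQP_YQQ": ("yQQ", "mu2", "sqrt(s)"),
    "HIG_RAP": ("y", "MH2", "sqrt(s)"),
}


def describe_process(proc):
    """Return (base type, kinematic labels, negative values allowed)."""
    # Assumed rule: the string is either a base type itself or a base type
    # followed by "_" and a user-chosen segment. The longest match wins.
    matches = [b for b in KINEMATICS if proc == b or proc.startswith(b + "_")]
    if not matches:
        raise ValueError(f"unknown base process type in {proc!r}")
    base = max(matches, key=len)
    # A final ASY tag marks asymmetries, whose data values may be negative.
    allows_negative = proc.endswith("_ASY")
    return base, KINEMATICS[base], allows_negative


base, labels, allows_negative = describe_process("EWK_RAP_ASY")
```

For `DIS_F2P` the same function returns the base type `DIS` with the labels for $$(x,Q^2,y)$$ and reports that negative values are not expected.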

### Notes for the future

In the future it would be nice to have a more flexible treatment of the kinematic variables, both in their number and labelling.

## CommonData file format

Each experimental Dataset has its own CommonData file. CommonData files contain the bulk of the experimental information used in the nnpdf++ project, with the only other experimental data files controlling the treatment and correlation of systematic errors. Each CommonData file is a plaintext file whose layout is described in the following.

The first line begins with the Dataset name, the number of systematic errors, and the number of data points in the set, whitespace separated. For example, for the ATLAS 2010 jet measurement the first line of the file reads:

ATLASR04JETS36PB 91 90

This shows that the set name is ‘ATLASR04JETS36PB’, that there are 91 sources of systematic uncertainty, and that there are 90 data points. As another example, consider the NMCPD Dataset:

NMCPD 5 211

Here there are 5 sources of systematic uncertainty and 211 data points. Following this, each line specifies the details of a single data point. The first value is the data point index $$1 \leq i_{\text{dat}} \leq N_{\mathrm{dat}}$$, followed by the process type string as outlined above, and the three kinematic variables in order. These are followed by the value of the experimental data point itself, and the value of the statistical uncertainty associated with it (absolute value). Finally the systematic uncertainties are specified. The layout per data point is therefore

$$i_{\mathrm{dat}}$$ ProcessType $$\text{kin}_1 \text{kin}_2 \text{kin}_3$$ data_value stat_error $$[..$$ systematics $$..]$$

For example, in the case of a DIS data point from the BCDMSD Dataset:

1 DIS_F2D 7.0e-02 8.75e+00 5.666e-01 3.6575e-01 6.43e-03 $$[..$$ systematics $$..]$$

In these lines the systematic uncertainties are laid out as follows. For each uncertainty, additive and multiplicative versions are given. The additive uncertainty is given as an absolute value, and the multiplicative as a percentage of the data value (that is, the relative error multiplied by 100). The systematics string is formed by the sequence of $$N_{\text{sys}}$$ pairs of systematic uncertainties:

$$[..$$ systematics $$..] = \sigma^{\mathrm{add}}_1 \quad \sigma^{\mathrm{mul}}_1\quad \sigma^{\mathrm{add}}_2 \quad \sigma^{\mathrm{mul}}_2 \quad \ldots \quad \sigma^{\mathrm{add}}_{N_{\text{sys}}} \quad\sigma^{\mathrm{mul}}_{N_{\text{sys}}}$$

where $$\sigma^{\mathrm{add}}_i$$ and $$\sigma^{\mathrm{mul}}_i$$ are the additive and multiplicative versions respectively of the systematic uncertainty arising from the $$i\text{th}$$ source. While it may seem at first that the multiplicative error is spurious given the presence of the additive error and data central value, this may not be the case. For example, in a closure test scenario, the data central values may have been replaced in the CommonData file by theoretical predictions. Therefore if you wish to use a covariance matrix generated with the original multiplicative uncertainties via the $$t_0$$ method, you must also store the original multiplicative (percentage) error. For flexibility and ease of I/O this is therefore done in the CommonData file itself.
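To make the distinction concrete, the short sketch below rescales a percentage uncertainty to an absolute one at two different central values. All numbers are invented for illustration, and the helper name `mult_to_add` is hypothetical; it simply applies the definition of the multiplicative error as a percentage of the central value.

```python
def mult_to_add(sigma_mul_percent, central_value):
    """Absolute uncertainty implied by a percentage uncertainty."""
    return sigma_mul_percent * central_value / 100.0


# Invented example values.
data_value = 0.36575   # original experimental central value
sigma_mul = 2.0        # multiplicative uncertainty, in percent

# With the original data value, the two versions carry the same information.
sigma_add = mult_to_add(sigma_mul, data_value)

# If the central value is replaced (e.g. by a theoretical prediction in a
# closure test), the stored percentage is needed to rescale consistently.
theory_value = 0.40
sigma_add_rescaled = mult_to_add(sigma_mul, theory_value)
```

Once the original data value has been overwritten, `sigma_add` alone is no longer enough to recover `sigma_mul`, which is why both are kept in the file.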

For a Dataset with $$N_{\text{dat}}$$ data points and $$N_{\text{sys}}$$ sources of systematic uncertainty, the total CommonData file should therefore be $$N_{\text{dat}}+1$$ lines long. Its first line contains the set parameters, and every subsequent line should consist of the description of a single data point. Each data point line should therefore contain $$7 + 2N_{\text{sys}}$$ columns.
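The layout described above can be checked with a small reader such as the following. This is a minimal sketch under the stated format, not the nnpdf++ reader: the function name `parse_commondata`, the dictionary keys, and the two-point `TOYSET` content are all invented for illustration.

```python
def parse_commondata(text):
    """Parse CommonData-style text into (name, nsys, ndat, points)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    name, nsys, ndat = lines[0].split()
    nsys, ndat = int(nsys), int(ndat)
    if len(lines) != ndat + 1:
        raise ValueError(f"expected {ndat + 1} lines, found {len(lines)}")

    points = []
    for ln in lines[1:]:
        cols = ln.split()
        if len(cols) != 7 + 2 * nsys:
            raise ValueError(f"expected {7 + 2 * nsys} columns: {ln!r}")
        values = [float(c) for c in cols[2:]]
        points.append({
            "index": int(cols[0]),
            "process": cols[1],
            "kin": tuple(values[0:3]),
            "data": values[3],
            "stat": values[4],
            "add": values[5::2],   # additive systematics (absolute)
            "mult": values[6::2],  # multiplicative systematics (percent)
        })
    return name, nsys, ndat, points


# Invented toy file: 2 systematics, 2 data points.
toy = """TOYSET 2 2
1 DIS_F2D 1.0e-01 1.0e+01 5.0e-01 3.5e-01 5.0e-03 1.0e-03 0.27 2.0e-03 0.55
2 DIS_F2D 2.0e-01 1.2e+01 4.0e-01 3.0e-01 4.0e-03 1.5e-03 0.43 1.0e-03 0.29
"""
name, nsys, ndat, points = parse_commondata(toy)
```

The two consistency checks correspond directly to the $$N_{\text{dat}}+1$$ line count and the $$7 + 2N_{\text{sys}}$$ column count.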

## SYSTYPE file format

The explicit presentation of the systematic uncertainties in the CommonData file allows for a great deal of flexibility in the treatment of these errors: specifically, whether they should be treated as additive or multiplicative uncertainties, and how they are correlated, both within the Dataset and within a larger Experiment. A specification for how the systematic uncertainties should be treated is provided by a SYSTYPE file. As there is not always an unambiguous method for the treatment of these uncertainties, this information is kept outside the (unambiguous) CommonData file. Several options for this treatment are often provided in the form of multiple SYSTYPE files, which may be selected between in the fit.

Each SYSTYPE file begins with a line specifying the total number of systematics. Naturally this must match with the $$N_{\text{sys}}$$ variable specified in the associated CommonData file. This is presented as a single integer. For example, in the case of the BCDMSD SYSTYPE files, the first line is

8

as there are $$N_{\text{sys}}=8$$ sources of systematic uncertainty for this Dataset. Following this line there are $$N_{\text{sys}}$$ lines describing each source of systematic uncertainty. For each source two parameters are provided, the uncertainty treatment and the uncertainty description. These are laid out for each systematic as:

$$i_{\text{sys}}$$ [uncertainty treatment] [uncertainty description]

where $$1 \leq i_{\text{sys}} \leq N_{\mathrm{sys}}$$ enumerates each systematic. The uncertainty treatment determines whether the uncertainty should be treated as additive, multiplicative, or, in cases where the choice is unclear, as randomised on a replica-by-replica basis. These choices are selected by using the strings ADD, MULT, or RAND respectively. The uncertainty description specifies how the systematic is to be correlated with other data points. There are five special cases for the uncertainty description, specified by the strings CORR, UNCORR, THEORYCORR, THEORYUNCORR and SKIP. The first two specify whether the systematic is fully correlated only within the Dataset (CORR), or whether the systematic is totally uncorrelated (UNCORR). The THEORYCORR and THEORYUNCORR descriptors are used for theoretical systematics due to e.g. missing NNLO corrections, which are treated as either CORR or UNCORR according to their suffix, but are not included in the generation of artificial replicas (their only contribution is to the fitting error function). The SKIP descriptor removes the systematic from the covariance matrices for debugging purposes.

If the user wishes to correlate a specific uncertainty between multiple Datasets within an Experiment, then they should use a custom uncertainty description. When building a covariance matrix for an Experiment, the nnpdf++ code checks for matches between the uncertainty descriptions of the systematics of its constituent Datasets. If a match is found, the code will correlate those systematics over the relevant Datasets.
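A SYSTYPE file following this layout could be read and validated along the lines below. This is a hedged sketch rather than the nnpdf++ parser: the function name `parse_systype`, the returned dictionary keys, and the three-line toy file (including the description `TOYNORM`) are invented for illustration.

```python
TREATMENTS = {"ADD", "MULT", "RAND"}
SPECIAL = {"CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR", "SKIP"}


def parse_systype(text):
    """Parse SYSTYPE-style text into a list of systematic specifications."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    nsys = int(rows[0][0])
    entries = rows[1:]
    if len(entries) != nsys:
        raise ValueError(f"expected {nsys} systematics, found {len(entries)}")

    systematics = []
    for expected, (idx, treatment, description) in enumerate(entries, start=1):
        if int(idx) != expected:
            raise ValueError(f"expected index {expected}, found {idx}")
        if treatment not in TREATMENTS:
            raise ValueError(f"unknown treatment {treatment!r}")
        systematics.append({
            "index": expected,
            "treatment": treatment,
            "description": description,
            # Anything that is not a special string is a custom description.
            "custom": description not in SPECIAL,
        })
    return systematics


# Invented toy file with three systematics.
toy = """3
1 ADD TOYNORM
2 MULT CORR
3 RAND UNCORR
"""
systematics = parse_systype(toy)
```

Only the first entry of the toy file is flagged as custom, so it is the only one that could be matched against systematics of another Dataset.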

As an example, let us consider an NNPDF2.3 standard SYSTYPE for the BCDMSD Dataset:

8
4 MULT BCDMSNORM
5 MULT BCDMSRELNORMTARGET
6 MULT CORR
7 MULT CORR
8 MULT CORR

Here the first five systematics have custom uncertainty descriptions, thereby allowing them to be cross-correlated with other Datasets in a larger Experiment. Systematics six to eight are specified as being fully correlated, but only within the BCDMSD Dataset. Additionally note that the first three systematics are specified as additive, and the remainder are multiplicative. If we compare now to the equivalent SYSTYPE file for the BCDMSP Dataset:

11
4 MULT BCDMSNORM
5 MULT BCDMSRELNORMTARGET
6 MULT CORR
7 MULT CORR
8 MULT CORR
9 MULT CORR
10 MULT CORR
11 MULT CORR

it is clear that the first five systematics are the same as in the BCDMSD Dataset, and therefore should the two sets be combined into a common Experiment, the code will cross-correlate them appropriately. The combination of SYSTYPE and CommonData is quite flexible. Once generated from the original raw experimental data, the CommonData file is fixed and should not be altered except to correct errors. In practice the full details of the systematic correlations and their treatment are often not precisely specified. This system allows for the safe variation of these parameters for testing purposes.
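The matching of uncertainty descriptions between Datasets can be illustrated as follows. This sketch only mimics the idea of the check and is not the nnpdf++ logic: the function name `shared_systematics` is hypothetical, and the two lists contain only the descriptions that appear in the listings above.

```python
SPECIAL = {"CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR", "SKIP"}


def shared_systematics(descriptions_a, descriptions_b):
    """Return the custom descriptions common to two Datasets."""
    # Special strings such as CORR never correlate across Datasets,
    # so only custom descriptions take part in the matching.
    custom_a = {d for d in descriptions_a if d not in SPECIAL}
    custom_b = {d for d in descriptions_b if d not in SPECIAL}
    return sorted(custom_a & custom_b)


# Descriptions taken from the entries shown in the two listings above.
bcdmsd = ["BCDMSNORM", "BCDMSRELNORMTARGET", "CORR", "CORR", "CORR"]
bcdmsp = ["BCDMSNORM", "BCDMSRELNORMTARGET"] + ["CORR"] * 6

shared = shared_systematics(bcdmsd, bcdmsp)
```

Although `CORR` appears in both lists, it is excluded from the result, reflecting that such systematics are correlated only within their own Dataset.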