.. _exp_data_files:

=======================
Experimental data files
=======================

Data made available by experimental collaborations comes in a variety of
formats. For use in a fitting code, this data must be converted into a common
format that contains all the required information for use in PDF fitting.
Existing formats commonly used by the community, such as `HepData <https://www.hepdata.net/>`_,
are generally unsuitable, principally because they often do not fully describe the
breakdown of systematic uncertainties. Therefore, over several years an NNPDF
standard data format has been iteratively developed, now denoted
``CommonData``. In addition to the ``CommonData`` files themselves, in the
``nnpdf++`` project the user has the ability to vary the treatment of individual
systematic errors by use of parameter files denoted ``SYSTYPE`` files. In this
section we shall detail the specifications of these two files.

In principle, the file specification and classes described in this section are
independent of the ``nnpdf++`` project and may be generated by whatever means
the user sees fit.  In practice, the ``CommonData`` and ``SYSTYPE`` files
are generated by the ``buildmaster`` project of ``nnpdf++`` from the raw
experimental data files.

.. _process_type_label:

Process types and kinematics
============================

Before going into the file formats, we shall summarise the identifying features
used for data in the ``nnpdf++`` code.

Each data point has an associated *process type* string. This can be
specified by the user, but **must** begin with the appropriate identifying
base process type. Additionally, three kinematic values are given for each data
point; the main role of the *process type* is to identify the nature of these
values. Typically the first kinematic variable is the principal differential
quantity used in the measurement. The second kinematic variable defines the
scale of the process. The third is generally the centre-of-mass energy of the
process, or inelasticity in the case of DIS. The allowed basic process types,
and their corresponding three kinematic variables are outlined below.

* **DIS** - Deep inelastic scattering measurements: :math:`(x,Q^2,y)`
* **DYP** - Fixed-target Drell-Yan measurements: :math:`(y,M^2,\sqrt{s})`
* **JET** - Jet production: :math:`(\eta,p_T^2,\sqrt{s})`
* **DIJET** - Dijet production: :math:`(\eta,m_{12},\sqrt{s})`
* **PHT** - Photon production: :math:`(\eta_\gamma,E_{T,\gamma}^2,\sqrt{s})`
* **INC** - A total inclusive cross-section: :math:`(0,\mu^2,\sqrt{s})`
* **EWK\_RAP** - Collider electroweak rapidity distribution: :math:`(\eta/y,M^2,\sqrt{s})`
* **EWK\_PT** - Collider electroweak :math:`p_T` distribution: :math:`(p_T,M^2,\sqrt{s})`
* **EWK\_PTRAP** - Collider electroweak :math:`p_T, y` distribution: :math:`(\eta/y, p_T^2,\sqrt{s})`
* **EWK\_MLL** - Collider electroweak lepton-pair mass distribution: :math:`(M_{ll},M_{ll}^2,\sqrt{s})`
* **EWJ\_(J)RAP** - Collider electroweak + jet boson(jet) rapidity distribution: :math:`(\eta/y,M^2,\sqrt{s})`
* **EWJ\_(J)PT** - Collider electroweak + jet boson(jet) :math:`p_T` distribution: :math:`(p_T,M^2,\sqrt{s})`
* **EWJ\_(J)PTRAP** - Collider electroweak + jet boson(jet) :math:`p_T, y` distribution: :math:`(\eta/y, p_T^2,\sqrt{s})`
* **EWJ\_MLL** - Collider electroweak+jet lepton-pair mass distribution: :math:`(M_{ll},M_{ll}^2,\sqrt{s})`
* **HQP\_YQQ** - Heavy diquark system rapidity: :math:`(y^{QQ},\mu^2,\sqrt{s})`
* **HQP\_MQQ** - Heavy diquark system mass: :math:`(M^{QQ},\mu^2,\sqrt{s})`
* **HQP\_PTQQ** - Heavy diquark system :math:`p_T`: :math:`(p_T^{QQ},\mu^2,\sqrt{s})`
* **HQP\_YQ** - Heavy quark rapidity: :math:`(y^Q,\mu^2,\sqrt{s})`
* **HQP\_PTQ** - Heavy quark :math:`p_T`: :math:`(p_T^Q,\mu^2,\sqrt{s})`
* **HIG\_RAP** - Higgs boson rapidity distribution: :math:`(y,M_H^2,\sqrt{s})`

As examples of *process type* strings, consider **EWK\_RAP** for a
collider :math:`W` boson asymmetry measurement binned in rapidity, and
**DIS\_F2P** for the :math:`F_2^p` structure function in DIS. The user is free to
choose something identifying for the second segment of the process type, the
important feature being the basic process type. However, users are encouraged to
only use this freedom when absolutely necessary (such as when used in
combination with APFEL).

One special case is that of :math:`W` boson lepton asymmetry measurements, which, being
cross-section asymmetries, may occasionally have negative data points. Therefore
asymmetry measurements must have the final tag **ASY** to ensure that
artificial data generation permits negative data values. An example
*process type* string would be **EWK\_RAP\_ASY**.
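
As an illustration, the base process type and the **ASY** tag could be read off
a *process type* string along the following lines. This is a minimal Python
sketch, not the ``nnpdf++`` implementation: the ``KINEMATICS`` table, its
plain-text kinematic labels and both helper functions are hypothetical names,
and expanding the ``(J)`` entries into variants with and without ``J`` is an
assumption about the notation used in the list above.

.. code-block:: python

    S = "sqrt(s)"

    # Base process types and their three kinematic labels, transcribed from
    # the list above. The "(J)" entries are expanded by assumption.
    KINEMATICS = {
        "DIS": ("x", "Q2", "y"),
        "DYP": ("y", "M2", S),
        "JET": ("eta", "pT2", S),
        "DIJET": ("eta", "m12", S),
        "PHT": ("eta_gamma", "ET_gamma2", S),
        "INC": ("0", "mu2", S),
        "EWK_RAP": ("eta/y", "M2", S),
        "EWK_PT": ("pT", "M2", S),
        "EWK_PTRAP": ("eta/y", "pT2", S),
        "EWK_MLL": ("Mll", "Mll2", S),
        "EWJ_RAP": ("eta/y", "M2", S),
        "EWJ_JRAP": ("eta/y", "M2", S),
        "EWJ_PT": ("pT", "M2", S),
        "EWJ_JPT": ("pT", "M2", S),
        "EWJ_PTRAP": ("eta/y", "pT2", S),
        "EWJ_JPTRAP": ("eta/y", "pT2", S),
        "EWJ_MLL": ("Mll", "Mll2", S),
        "HQP_YQQ": ("yQQ", "mu2", S),
        "HQP_MQQ": ("MQQ", "mu2", S),
        "HQP_PTQQ": ("pTQQ", "mu2", S),
        "HQP_YQ": ("yQ", "mu2", S),
        "HQP_PTQ": ("pTQ", "mu2", S),
        "HIG_RAP": ("y", "MH2", S),
    }


    def base_process_type(process_type):
        """Return the longest underscore-delimited prefix that is a known base type."""
        parts = process_type.split("_")
        for n in range(len(parts), 0, -1):
            candidate = "_".join(parts[:n])
            if candidate in KINEMATICS:
                return candidate
        raise ValueError(f"unknown base process type in {process_type!r}")


    def allows_negative_data(process_type):
        """True if the final tag is ASY, i.e. an asymmetry measurement."""
        return process_type.split("_")[-1] == "ASY"


    base = base_process_type("EWK_RAP_ASY")
    print(base, KINEMATICS[base], allows_negative_data("EWK_RAP_ASY"))

Matching the longest prefix first keeps **EWK\_PTRAP** from being mistaken for
**EWK\_PT** followed by a user-defined segment.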

Notes for the future
--------------------

In the future it would be nice to have a more flexible treatment of the
kinematic variables, both in their number and labelling.

``CommonData`` file format
==============================

Each experimental *Dataset* has its own ``CommonData`` file.
``CommonData`` files contain the bulk of the experimental information used in the
``nnpdf++`` project, with the only other experimental data files controlling
the treatment and correlation of systematic errors. Each ``CommonData`` file
is a plaintext file whose layout is described in the following.

The first line contains the *Dataset* name, the number of systematic
errors, and the number of data points in the set, whitespace separated. For
example, for the ATLAS 2010 jet measurement the first line of the file reads:

	ATLASR04JETS36PB        91      90

This shows that the set *name* is 'ATLASR04JETS36PB', that there
are 91 sources of systematic uncertainty, and that there are 90 data points. As
another example, consider the NMCPD *Dataset*:

	NMCPD   5       211

Here there are 5 sources of systematic uncertainty and 211 data points.
Following this, each line specifies the details of a single data point. The first
value is the data point index :math:`1 \leq i_{\text{dat}} \leq N_{\mathrm{dat}}`,
followed by the *process type* string as outlined above, and the three
kinematic variables in order. These are followed by the value of the
experimental data point itself, and the value of the statistical uncertainty
associated with it (absolute value). Finally the systematic uncertainties are
specified. The layout per data point is therefore

	:math:`i_{\mathrm{dat}}`   *ProcessType* :math:`\text{kin}_1 \text{kin}_2 \text{kin}_3` data\_value stat\_error  :math:`[..` systematics :math:`..]`

For example, in the case of a DIS data point from the BCDMSD *Dataset*:

	1    DIS\_F2D 7.0e-02   8.75e+00   5.666e-01   3.6575e-01   6.43e-03 :math:`[..` systematics :math:`..]`

In these lines the systematic uncertainties are laid out as follows. For each
uncertainty, additive and multiplicative versions are given. The additive
uncertainty is given by absolute value, and the multiplicative as a percentage
of the data value (that is, relative error multiplied by 100). The systematics
string is formed by the sequence of :math:`N_{\text{sys}}` pairs of systematic
uncertainties:

	:math:`[..` systematics :math:`..] =  \sigma^{\mathrm{add}}_1 \quad  \sigma^{\mathrm{mul}}_1\quad \sigma^{\mathrm{add}}_2 \quad \sigma^{\mathrm{mul}}_2 \quad....\quad \sigma^{\mathrm{add}}_{N_{\text{sys}}}  \quad\sigma^{\mathrm{mul}}_{N_{\text{sys}}}`

where :math:`\sigma^{\mathrm{add}}_i` and :math:`\sigma^{\mathrm{mul}}_i` are the additive
and multiplicative versions respectively of the systematic uncertainty arising
from the :math:`i\text{th}` source. While it may seem at first that the multiplicative error
is spurious given the presence of the additive error and data central value,
this may not be the case. For example, in a closure test scenario, the data
central values may have been replaced in the ``CommonData`` file by
theoretical predictions. Therefore if you wish to use a covariance matrix
generated with the original multiplicative uncertainties via the :math:`t_0` method,
you must also store the original multiplicative (percentage) error. For
flexibility and ease of I/O this is therefore done in the ``CommonData`` file
itself.
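
The relation between the two versions, and the reason the percentage error
cannot always be recomputed, can be sketched in a few lines of Python. The
helper names are hypothetical, and the 1% systematic and the shifted central
value are invented numbers for illustration; only the data value is taken from
the BCDMSD example above.

.. code-block:: python

    def mult_from_add(add, data):
        """Multiplicative (percentage) error from an additive (absolute) one."""
        return 100.0 * add / data


    def add_from_mult(mult, data):
        """Additive (absolute) error from a multiplicative (percentage) one."""
        return mult * data / 100.0


    data = 3.6575e-01   # data value of the BCDMSD point quoted above
    mult = 1.0          # hypothetical 1% systematic uncertainty
    add = add_from_mult(mult, data)

    # With the original central value the two versions are equivalent.
    print(mult_from_add(add, data))

    # Invented stand-in for a closure-test prediction replacing the data value:
    # the percentage recomputed from it no longer matches the stored 1%.
    theory = 0.9 * data
    recomputed = mult_from_add(add, theory)
    print(recomputed)

Once the central value has been replaced, only the stored multiplicative column
still carries the original percentage uncertainty.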

For a *Dataset* with :math:`N_{\text{dat}}` data points and :math:`N_{\text{sys}}`
sources of systematic uncertainty, the total ``CommonData`` file should
therefore be :math:`N_{\text{dat}}+1` lines long. Its first line contains the set
parameters, and every subsequent line should consist of the description of a
single data point. Each data point line should therefore contain :math:`7 +
2N_{\text{sys}}` columns.
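
The layout described above can be checked with a small reader. The following
is a minimal Python sketch under the assumptions stated in this section
(whitespace-separated columns, one header line); it is not the ``nnpdf++``
parser. ``DataPoint``, ``parse_commondata`` and the ``TOYSET`` file content are
hypothetical, and the systematic values in the toy file are invented.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class DataPoint:
        index: int
        process_type: str
        kinematics: tuple
        value: float
        stat: float
        add: list    # additive systematics (absolute)
        mult: list   # multiplicative systematics (percent)


    def parse_commondata(text):
        """Parse CommonData-style text into (name, nsys, ndat, points)."""
        lines = [ln for ln in text.splitlines() if ln.strip()]
        name, nsys, ndat = lines[0].split()
        nsys, ndat = int(nsys), int(ndat)
        if len(lines) != ndat + 1:
            raise ValueError(f"expected {ndat + 1} lines, found {len(lines)}")
        points = []
        for line in lines[1:]:
            cols = line.split()
            if len(cols) != 7 + 2 * nsys:
                raise ValueError(f"expected {7 + 2 * nsys} columns, found {len(cols)}")
            systematics = [float(c) for c in cols[7:]]
            points.append(DataPoint(
                index=int(cols[0]),
                process_type=cols[1],
                kinematics=tuple(float(c) for c in cols[2:5]),
                value=float(cols[5]),
                stat=float(cols[6]),
                add=systematics[0::2],    # first member of each pair
                mult=systematics[1::2],   # second member of each pair
            ))
        return name, nsys, ndat, points


    # Toy file with two data points and two systematics (invented values).
    toy = (
        "TOYSET 2 2\n"
        "1 DIS_F2D 7.0e-02 8.75e+00 5.666e-01 3.6575e-01 6.43e-03 1.0e-03 0.5 2.0e-03 1.0\n"
        "2 DIS_F2D 1.0e-01 8.75e+00 4.0e-01 3.5e-01 6.0e-03 1.0e-03 0.5 2.0e-03 1.0\n"
    )
    name, nsys, ndat, points = parse_commondata(toy)
    print(name, nsys, ndat, points[0].add, points[0].mult)

The two length checks correspond directly to the line count and column count
given above.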

``SYSTYPE`` file format
=======================

The explicit presentation of the systematic uncertainties in the
``CommonData`` file allows for a great deal of flexibility in the treatment of
these errors: specifically, in whether they should be treated as additive or
multiplicative uncertainties, and in how they are correlated, both within the
*Dataset* and within a larger *Experiment*. A specification for how
the systematic uncertainties should be treated is provided by a ``SYSTYPE``
file. As there is not always an unambiguous method for the treatment of these
uncertainties, this information is kept outside the (unambiguous)
``CommonData`` file. Several options for this treatment are often provided in the
form of multiple ``SYSTYPE`` files, one of which is selected in the fit.

Each ``SYSTYPE`` file begins with a line specifying the total number of
systematics. Naturally this must match with the :math:`N_{\text{sys}}` variable
specified in the associated ``CommonData`` file. This is presented as a single
integer. For example, in the case of the BCDMSD ``SYSTYPE`` files, the first line is

	8

as there are :math:`N_{\text{sys}}=8` sources of systematic uncertainty for this
*Dataset*. Following this line there are :math:`N_{\text{sys}}` lines describing each
source of systematic uncertainty. For each source two parameters are provided,
the *uncertainty treatment* and the *uncertainty description*. These
are laid out for each systematic as:

	:math:`i_{\text{sys}}`	[*uncertainty treatment*]	[*uncertainty description*]

where :math:`1 \leq i_{\text{sys}} \leq N_{\mathrm{sys}}` enumerates each systematic. The
*uncertainty treatment* determines whether the uncertainty should be
treated as additive, multiplicative, or in cases where the choice is unclear, as
randomised on a replica by replica basis. These choices are selected by using
the strings **ADD**, **MULT**, or **RAND**. The *uncertainty
description* specifies how the systematic is to be correlated with other
data points. There are five special cases for the *uncertainty
description*, specified by the strings **CORR**, **UNCORR**,
**THEORYCORR**, **THEORYUNCORR** and **SKIP**. The first two
specify whether the systematic is fully correlated **only** within the
*Dataset* (**CORR**), or whether the systematic is totally
uncorrelated (**UNCORR**). The **THEORY** descriptors are used to
describe theoretical systematics due to e.g. missing NNLO corrections, which are
treated as either **CORR** or **UNCORR** according to their suffix,
but are not included in the generation of artificial replicas (their only
contribution is to the fitting error function). If the user wishes to correlate
a specific uncertainty between multiple *Datasets* within an
*Experiment*, then they should use a custom *uncertainty description*.
When building a covariance matrix for an *Experiment*, the ``nnpdf++``
code checks for matches between the *uncertainty descriptions* of
systematics of its constituent *Datasets*. If a match is found, the code
will correlate those systematics over the relevant datasets. The **SKIP**
descriptor removes the systematic from the covariance matrices for debugging
purposes.
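
The ``SYSTYPE`` layout lends itself to a simple validating reader. The
following minimal Python sketch assumes whitespace-separated fields as
described above; it is not the ``nnpdf++`` code, and ``parse_systype``, the
constant names and the toy file (including the custom description ``TOYNORM``)
are hypothetical.

.. code-block:: python

    TREATMENTS = {"ADD", "MULT", "RAND"}
    SPECIAL_DESCRIPTIONS = {"CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR", "SKIP"}


    def parse_systype(text):
        """Return a list of (treatment, description) pairs, one per systematic."""
        rows = [ln.split() for ln in text.splitlines() if ln.strip()]
        nsys = int(rows[0][0])
        entries = rows[1:]
        if len(entries) != nsys:
            raise ValueError(f"header says {nsys} systematics, found {len(entries)}")
        result = []
        for expected, fields in enumerate(entries, start=1):
            index, treatment, description = fields
            if int(index) != expected:
                raise ValueError(f"expected index {expected}, found {index}")
            if treatment not in TREATMENTS:
                raise ValueError(f"unknown uncertainty treatment {treatment!r}")
            result.append((treatment, description))
        return result


    toy = "3\n1 ADD UNCORR\n2 MULT CORR\n3 RAND TOYNORM\n"
    systypes = parse_systype(toy)
    for treatment, description in systypes:
        custom = description not in SPECIAL_DESCRIPTIONS
        print(treatment, description, "custom" if custom else "special")

Any description outside the five special strings is treated here as a custom
label that may be shared between *Datasets*.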

As an example, let us consider an NNPDF2.3 standard ``SYSTYPE`` for the BCDMSD
*Dataset*:

	| 8
	| 1    ADD    BCDMSFB
	| 2    ADD    BCDMSFS
	| 3    ADD    BCDMSFR
	| 4    MULT    BCDMSNORM
	| 5    MULT    BCDMSRELNORMTARGET
	| 6    MULT    CORR
	| 7    MULT    CORR
	| 8    MULT    CORR

Here the first five systematics have custom *uncertainty descriptions*,
thereby allowing them to be cross-correlated with other *Datasets* in a
larger *Experiment*. Systematics six to eight are specified as being fully
correlated, but only within the BCDMSD  *Dataset*. Additionally note that
the first three systematics are specified as additive, and the remainder are
multiplicative. If we compare now to the equivalent ``SYSTYPE`` file for the
BCDMSP *Dataset*:

	| 11
	| 1    ADD    BCDMSFB
	| 2    ADD    BCDMSFS
	| 3    ADD    BCDMSFR
	| 4    MULT    BCDMSNORM
	| 5    MULT    BCDMSRELNORMTARGET
	| 6    MULT    CORR
	| 7    MULT    CORR
	| 8    MULT    CORR
	| 9    MULT    CORR
	| 10    MULT    CORR
	| 11    MULT    CORR

it is clear that the first five systematics are the same as in the BCDMSD
*Dataset*, and therefore should the two sets be combined into a common
*Experiment*, the code will cross-correlate them appropriately. The
combination of ``SYSTYPE`` and ``CommonData`` is quite flexible. Once generated
from the original raw experimental data, the
``CommonData`` file is fixed and should not be altered except to correct
errors. In practice the full details of the systematic correlations
and their treatment are often not precisely specified. This system allows for the
safe variation of these parameters for testing purposes.
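
The matching of *uncertainty descriptions* between the two BCDMS files shown
above can be illustrated with a short Python sketch. It only finds the shared
custom labels and does not build a covariance matrix; it is not the
``nnpdf++`` implementation, and ``custom_descriptions`` is a hypothetical
helper name.

.. code-block:: python

    SPECIAL_DESCRIPTIONS = {"CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR", "SKIP"}

    # The two SYSTYPE files quoted above.
    BCDMSD = """8
    1 ADD BCDMSFB
    2 ADD BCDMSFS
    3 ADD BCDMSFR
    4 MULT BCDMSNORM
    5 MULT BCDMSRELNORMTARGET
    6 MULT CORR
    7 MULT CORR
    8 MULT CORR
    """

    BCDMSP = """11
    1 ADD BCDMSFB
    2 ADD BCDMSFS
    3 ADD BCDMSFR
    4 MULT BCDMSNORM
    5 MULT BCDMSRELNORMTARGET
    6 MULT CORR
    7 MULT CORR
    8 MULT CORR
    9 MULT CORR
    10 MULT CORR
    11 MULT CORR
    """


    def custom_descriptions(text):
        """Map each custom uncertainty description to its systematic index."""
        rows = [ln.split() for ln in text.splitlines()[1:] if ln.strip()]
        return {
            description: int(index)
            for index, _treatment, description in rows
            if description not in SPECIAL_DESCRIPTIONS
        }


    in_d = custom_descriptions(BCDMSD)
    in_p = custom_descriptions(BCDMSP)
    shared = sorted(set(in_d) & set(in_p))
    for description in shared:
        print(description, in_d[description], in_p[description])

Systematics labelled **CORR** are deliberately excluded from the match, since
they are correlated only within their own *Dataset*.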