.. _exp_data_files:

=======================
Experimental data files
=======================

Data made available by experimental collaborations comes in a variety of
formats. For use in a fitting code, this data must be converted into a common
format that contains all the required information for use in PDF fitting.
Existing formats commonly used by the community, such as in `HepData `_, are
generally unsuitable, principally because they often do not fully describe the
breakdown of systematic uncertainties. Therefore, over several years an NNPDF
standard data format has been iteratively developed, now denoted
``CommonData``. In addition to the ``CommonData`` files themselves, in the
``nnpdf++`` project the user can vary the treatment of individual systematic
errors by use of parameter files denoted ``SYSTYPE`` files. In this section we
detail the specifications of these two files.

In principle, the file specification and classes described in this section are
independent of the ``nnpdf++`` project and may be generated by whatever means
the user sees fit. In practice, the ``CommonData`` and ``SYSTYPE`` files are
generated by the ``buildmaster`` project of ``nnpdf++`` from the raw
experimental data files.

.. _process_type_label:

Process types and kinematics
============================

Before going into the file formats, we summarise the identifying features used
for data in the ``nnpdf++`` code. Each data point has an associated
*process type* string. This can be specified by the user, but **must** begin
with the appropriate identifying base process type. Additionally, three
kinematic values are given for each data point; the *process type* serves
primarily to identify the nature of these values. Typically the first kinematic
variable is the principal differential quantity used in the measurement. The
second kinematic variable defines the scale of the process.
The third is generally the centre-of-mass energy of the process, or
inelasticity in the case of DIS.

The allowed basic process types, and their corresponding three kinematic
variables are outlined below.

*  **DIS** - Deep inelastic scattering measurements: :math:`(x,Q^2,y)`
*  **DYP** - Fixed-target Drell-Yan measurements: :math:`(y,M^2,\sqrt{s})`
*  **JET** - Jet production: :math:`(\eta,p_T^2,\sqrt{s})`
*  **DIJET** - Dijet production: :math:`(\eta,m_{12},\sqrt{s})`
*  **PHT** - Photon production: :math:`(\eta_\gamma,E_{T,\gamma}^2,\sqrt{s})`
*  **INC** - A total inclusive cross-section: :math:`(0,\mu^2,\sqrt{s})`
*  **EWK\_RAP** - Collider electroweak rapidity distribution: :math:`(\eta/y,M^2,\sqrt{s})`
*  **EWK\_PT** - Collider electroweak :math:`p_T` distribution: :math:`(p_T,M^2,\sqrt{s})`
*  **EWK\_PTRAP** - Collider electroweak :math:`p_T, y` distribution: :math:`(\eta/y, p_T^2,\sqrt{s})`
*  **EWK\_MLL** - Collider electroweak lepton-pair mass distribution: :math:`(M_{ll},M_{ll}^2,\sqrt{s})`
*  **EWJ\_(J)RAP** - Collider electroweak + jet boson(jet) rapidity distribution: :math:`(\eta/y,M^2,\sqrt{s})`
*  **EWJ\_(J)PT** - Collider electroweak + jet boson(jet) :math:`p_T` distribution: :math:`(p_T,M^2,\sqrt{s})`
*  **EWJ\_(J)PTRAP** - Collider electroweak + jet boson(jet) :math:`p_T, y` distribution: :math:`(\eta/y, p_T^2,\sqrt{s})`
*  **EWJ\_MLL** - Collider electroweak+jet lepton-pair mass distribution: :math:`(M_{ll},M_{ll}^2,\sqrt{s})`
*  **HQP\_YQQ** - Heavy diquark system rapidity: :math:`(y^{QQ},\mu^2,\sqrt{s})`
*  **HQP\_MQQ** - Heavy diquark system mass: :math:`(M^{QQ},\mu^2,\sqrt{s})`
*  **HQP\_PTQQ** - Heavy diquark system :math:`p_T`: :math:`(p_T^{QQ},\mu^2,\sqrt{s})`
*  **HQP\_YQ** - Heavy quark rapidity: :math:`(y^Q,\mu^2,\sqrt{s})`
*  **HQP\_PTQ** - Heavy quark :math:`p_T`: :math:`(p_T^Q,\mu^2,\sqrt{s})`
*  **HIG\_RAP** - Higgs boson rapidity distribution: :math:`(y,M_H^2,\sqrt{s})`

As examples of *process type* strings, consider **EWK\_RAP** for a collider :math:`W`
boson asymmetry measurement binned in rapidity, and **DIS\_F2P** for the
:math:`F_2^p` structure function in DIS. The user is free to choose something
identifying for the second segment of the process type, the important feature
being the basic process type. However, users are encouraged to use this freedom
only when absolutely necessary (such as when used in combination with APFEL).

One special case is that of :math:`W` boson lepton asymmetry measurements,
which being cross-section asymmetries may occasionally have negative data
points. Therefore asymmetry measurements must have the final tag **ASY** to
ensure that artificial data generation permits negative data values. An example
*process type* string would be **EWK\_RAP\_ASY**.

Notes for the future
--------------------

In the future it would be nice to have a more flexible treatment of the
kinematic variables, both in their number and labelling.

``CommonData`` file format
==========================

Each experimental *Dataset* has its own ``CommonData`` file. ``CommonData``
files contain the bulk of the experimental information used in the ``nnpdf++``
project; the only other experimental data files are those controlling the
treatment and correlation of systematic errors. Each ``CommonData`` file is a
plaintext file whose layout is described in the following.

The first line contains the *Dataset* name, the number of systematic errors,
and the number of data points in the set, separated by whitespace. For example,
for the ATLAS 2010 jet measurement the first line of the file reads:

| ATLASR04JETS36PB 91 90

This shows that the set *name* is 'ATLASR04JETS36PB', that there are 91 sources
of systematic uncertainty, and that there are 90 data points. As another
example, consider the NMCPD *Dataset*:

| NMCPD 5 211

Here there are 5 sources of systematic uncertainty and 211 data points.
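
As a minimal sketch of how the header line described above might be read, the
following Python snippet splits it into its three fields. The names
``CommonDataHeader`` and ``parse_header`` are illustrative assumptions and are
not part of the ``nnpdf++`` code.

.. code-block:: python

    from typing import NamedTuple


    class CommonDataHeader(NamedTuple):
        """Set parameters from the first line of a CommonData file (hypothetical helper)."""
        name: str
        nsys: int  # number of systematic uncertainties
        ndat: int  # number of data points


    def parse_header(line: str) -> CommonDataHeader:
        """Split the whitespace-separated header into name, N_sys and N_dat."""
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 header fields, found {len(fields)}")
        name, nsys, ndat = fields
        return CommonDataHeader(name, int(nsys), int(ndat))


    header = parse_header("NMCPD 5 211")
    print(header.name, header.nsys, header.ndat)
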

Following this, each line specifies the details of a single data point. The
first value is the data point index
:math:`1 \leq i_{\text{dat}} \leq N_{\text{dat}}`, followed by the
*process type* string as outlined above, and the three kinematic variables in
order. These are followed by the value of the experimental data point itself,
and the value of the statistical uncertainty associated with it (absolute
value). Finally the systematic uncertainties are specified. The layout per data
point is therefore

| :math:`i_{\text{dat}}` *ProcessType* :math:`\text{kin}_1 \quad \text{kin}_2 \quad \text{kin}_3` data\_value stat\_error :math:`[..` systematics :math:`..]`

For example, in the case of a DIS data point from the BCDMSD *Dataset*:

| 1 DIS\_F2D 7.0e-02 8.75e+00 5.666e-01 3.6575e-01 6.43e-03 :math:`[..` systematics :math:`..]`

In these lines the systematic uncertainties are laid out as follows. For each
uncertainty, additive and multiplicative versions are given. The additive
uncertainty is given by absolute value, and the multiplicative as a percentage
of the data value (that is, relative error multiplied by 100). The systematics
string is formed by the sequence of :math:`N_{\text{sys}}` pairs of systematic
uncertainties:

| :math:`[..` systematics :math:`..] = \sigma^{\mathrm{add}}_1 \quad \sigma^{\mathrm{mul}}_1 \quad \sigma^{\mathrm{add}}_2 \quad \sigma^{\mathrm{mul}}_2 \quad \ldots \quad \sigma^{\mathrm{add}}_{N_{\text{sys}}} \quad \sigma^{\mathrm{mul}}_{N_{\text{sys}}}`

where :math:`\sigma^{\mathrm{add}}_i` and :math:`\sigma^{\mathrm{mul}}_i` are
the additive and multiplicative versions respectively of the systematic
uncertainty arising from the :math:`i\text{th}` source. While it may seem at
first that the multiplicative error is redundant given the presence of the
additive error and data central value, this may not be the case. For example,
in a closure test scenario, the data central values may have been replaced in
the ``CommonData`` file by theoretical predictions.
Therefore, if you wish to use a covariance matrix generated with the original
multiplicative uncertainties via the :math:`t_0` method, you must also store
the original multiplicative (percentage) error. For flexibility and ease of I/O
this is therefore done in the ``CommonData`` file itself.

For a *Dataset* with :math:`N_{\text{dat}}` data points and
:math:`N_{\text{sys}}` sources of systematic uncertainty, the total
``CommonData`` file should therefore be :math:`N_{\text{dat}}+1` lines long.
Its first line contains the set parameters, and every subsequent line should
consist of the description of a single data point. Each data point line should
therefore contain :math:`7 + 2N_{\text{sys}}` columns.

``SYSTYPE`` file format
=======================

The explicit presentation of the systematic uncertainties in the
``CommonData`` file allows for a great deal of flexibility in the treatment of
these errors. Specifically, one can choose whether they are treated as additive
or multiplicative uncertainties, and how they are correlated, both within the
*Dataset* and within a larger *Experiment*. A specification for how the
systematic uncertainties should be treated is provided by a ``SYSTYPE`` file.
As there is not always an unambiguous method for the treatment of these
uncertainties, this information is kept outside the (unambiguous)
``CommonData`` file. Several options for this treatment are often provided in
the form of multiple ``SYSTYPE`` files, one of which may be selected in the fit.

Each ``SYSTYPE`` file begins with a line specifying the total number of
systematics, presented as a single integer. Naturally this must match the
:math:`N_{\text{sys}}` variable specified in the associated ``CommonData``
file. For example, in the case of the BCDMSD ``SYSTYPE`` files, the first line
is

| 8

as there are :math:`N_{\text{sys}}=8` sources of systematic uncertainty for
this *Dataset*.
Following this line there are :math:`N_{\text{sys}}` lines describing each
source of systematic uncertainty. For each source two parameters are provided,
the *uncertainty treatment* and the *uncertainty description*. These are laid
out for each systematic as:

| :math:`i_{\text{sys}}` [*uncertainty treatment*] [*uncertainty description*]

where :math:`1 \leq i_{\text{sys}} \leq N_{\text{sys}}` enumerates each
systematic. The *uncertainty treatment* determines whether the uncertainty
should be treated as additive, multiplicative, or in cases where the choice is
unclear, as randomised on a replica by replica basis. These choices are
selected by using the strings **ADD**, **MULT**, or **RAND**. The
*uncertainty description* specifies how the systematic is to be correlated with
other data points.

There are five special cases for the *uncertainty description*, specified by
the strings **CORR**, **UNCORR**, **THEORYCORR**, **THEORYUNCORR** and
**SKIP**. The first two specify whether the systematic is fully correlated
**only** within the *Dataset* (**CORR**), or whether the systematic is totally
uncorrelated (**UNCORR**). The **THEORY** descriptors are used to describe
theoretical systematics due to e.g. missing NNLO corrections, which are treated
as either **CORR** or **UNCORR** according to their suffix, but are not
included in the generation of artificial replicas (their only contribution is
to the fitting error function). The **SKIP** descriptor removes the systematic
from the covariance matrices for debugging purposes.

If the user wishes to correlate a specific uncertainty between multiple
*Datasets* within an *Experiment*, then they should use a custom
*uncertainty description*. When building a covariance matrix for an
*Experiment*, the ``nnpdf++`` code checks for matches between the
*uncertainty descriptions* of systematics of its constituent *Datasets*. If a
match is found, the code will correlate those systematics over the relevant
datasets.
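
The ``SYSTYPE`` layout described above can be read with a few lines of Python.
The sketch below is illustrative only: ``Systematic``, ``parse_systype`` and
``is_custom`` are hypothetical names rather than ``nnpdf++`` functions, and the
two-systematic input is made up for the example.

.. code-block:: python

    from typing import List, NamedTuple

    # Strings taken from the format description above.
    TREATMENTS = {"ADD", "MULT", "RAND"}
    SPECIAL_DESCRIPTIONS = {"CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR", "SKIP"}


    class Systematic(NamedTuple):
        index: int
        treatment: str
        description: str


    def parse_systype(text: str) -> List[Systematic]:
        """Parse SYSTYPE text: one count line, then one line per systematic."""
        lines = [ln for ln in text.splitlines() if ln.strip()]
        nsys = int(lines[0])
        if len(lines) - 1 != nsys:
            raise ValueError(f"header says {nsys} systematics, found {len(lines) - 1}")
        systematics = []
        for expected, line in enumerate(lines[1:], start=1):
            index, treatment, description = line.split()
            if int(index) != expected:
                raise ValueError(f"expected index {expected}, found {index}")
            if treatment not in TREATMENTS:
                raise ValueError(f"unknown uncertainty treatment: {treatment}")
            systematics.append(Systematic(int(index), treatment, description))
        return systematics


    def is_custom(systematic: Systematic) -> bool:
        """True if the description is a user-chosen name, not a special string."""
        return systematic.description not in SPECIAL_DESCRIPTIONS


    # Made-up input, for illustration only.
    example = parse_systype("2\n1 ADD CORR\n2 MULT MYNORM\n")
    print([s.description for s in example if is_custom(s)])

Checking the line count against the leading integer mirrors the requirement
that the number of systematics be stated on the first line.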

As an example, let us consider an NNPDF2.3 standard ``SYSTYPE`` for the BCDMSD
*Dataset*:

| 8
| 1 ADD BCDMSFB
| 2 ADD BCDMSFS
| 3 ADD BCDMSFR
| 4 MULT BCDMSNORM
| 5 MULT BCDMSRELNORMTARGET
| 6 MULT CORR
| 7 MULT CORR
| 8 MULT CORR

Here the first five systematics have custom *uncertainty descriptions*, thereby
allowing them to be cross-correlated with other *Datasets* in a larger
*Experiment*. Systematics six to eight are specified as being fully correlated,
but only within the BCDMSD *Dataset*. Additionally note that the first three
systematics are specified as additive, and the remainder are multiplicative. If
we compare now to the equivalent ``SYSTYPE`` file for the BCDMSP *Dataset*:

| 11
| 1 ADD BCDMSFB
| 2 ADD BCDMSFS
| 3 ADD BCDMSFR
| 4 MULT BCDMSNORM
| 5 MULT BCDMSRELNORMTARGET
| 6 MULT CORR
| 7 MULT CORR
| 8 MULT CORR
| 9 MULT CORR
| 10 MULT CORR
| 11 MULT CORR

it is clear that the first five systematics are the same as in the BCDMSD
*Dataset*, and therefore should the two sets be combined into a common
*Experiment*, the code will cross-correlate them appropriately.

The combination of ``SYSTYPE`` and ``CommonData`` is quite flexible. As stated
previously, once generated from the original raw experimental data, the
``CommonData`` file is fixed and should not be altered except to correct
errors. In practice the full details of the systematic correlations and their
treatment are often not precisely specified. This system allows for the safe
variation of these parameters for testing purposes.
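
The matching of custom *uncertainty descriptions* between the two BCDMS
``SYSTYPE`` files can be illustrated with a short Python sketch. It is not the
``nnpdf++`` implementation: the function names are hypothetical, and it assumes
that none of the five special description strings is ever matched across
*Datasets*.

.. code-block:: python

    from typing import Dict, Tuple

    # Assumption: special descriptions are never matched across Datasets.
    SPECIAL = {"CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR", "SKIP"}

    BCDMSD = """8
    1 ADD BCDMSFB
    2 ADD BCDMSFS
    3 ADD BCDMSFR
    4 MULT BCDMSNORM
    5 MULT BCDMSRELNORMTARGET
    6 MULT CORR
    7 MULT CORR
    8 MULT CORR"""

    BCDMSP = """11
    1 ADD BCDMSFB
    2 ADD BCDMSFS
    3 ADD BCDMSFR
    4 MULT BCDMSNORM
    5 MULT BCDMSRELNORMTARGET
    6 MULT CORR
    7 MULT CORR
    8 MULT CORR
    9 MULT CORR
    10 MULT CORR
    11 MULT CORR"""


    def custom_descriptions(systype_text: str) -> Dict[str, int]:
        """Map each custom uncertainty description to its systematic index."""
        lines = [ln for ln in systype_text.splitlines() if ln.strip()]
        result = {}
        for line in lines[1:]:  # skip the leading count line
            index, _treatment, description = line.split()
            if description not in SPECIAL:
                result[description] = int(index)
        return result


    def shared_systematics(first: str, second: str) -> Dict[str, Tuple[int, int]]:
        """Return descriptions present in both files, with their indices in each."""
        a, b = custom_descriptions(first), custom_descriptions(second)
        return {name: (a[name], b[name]) for name in a if name in b}


    shared = shared_systematics(BCDMSD, BCDMSP)
    for name, (i_d, i_p) in shared.items():
        print(f"{name}: BCDMSD systematic {i_d} <-> BCDMSP systematic {i_p}")

For these two files the shared descriptions are the five custom names, which
are the systematics that would be cross-correlated within a common
*Experiment*.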