Python based data objects

Internal data formats such as CommonData, or FKTables are internally always numpy arrays or pandas dataframes. PDF sets are a bit more complicated since they use lhapdf.

Loading FKTables

This is implemented in the validphys.fkparser module. For example:

from validphys.fkparser import load_fktable
from validphys.loader import Loader
l = Loader()
fk = l.check_fktable(setname="LHCB_Z0_13TEV_DIMUON-Y", theoryID=40_000_000)
res = load_fktable(fk)

results in an validphys.coredata.FKTableData object containing all the information needed to compute a convolution. In particular the sigma property contains a dataframe representing the partonic cross-section (including the cfactors).

Computing theory predictions

The validphys.convolution module implements the necessary tooling to compute theory predictions using Numpy and Pandas. In particular the validphys.convolution.predictions() function returns predictions in terms of PDF and dataset objects that can be obtained directly from validphys runcards:

from validphys.api import API
from validphys.convolution import predictions

inp = {
    'dataset_input': {'dataset': 'LHCB_Z0_13TEV_DIMUON-Y'},
    'theoryid': 40_000_000,
    'use_cuts': 'internal',
    'pdf': 'NNPDF40_nnlo_as_01180'
}

preds = predictions(API.dataset(**inp), API.pdf(**inp))

print(preds.values.mean(axis=1))

The usage of standard scientific Python types opens interesting avenues for parallelization. For example here is how to compute the mean prediction for all datasets using the Dask library:

from dask.distributed import Client

from validphys.api import API
from validphys.convolution import predictions

c = Client()

inp = {
    'fit': 'NNPDF40_nlo_as_01180',
    'use_cuts': 'internal',
    'theoryid': 40_000_000,
    'pdf': 'NNPDF40_nnlo_as_01180',
    'experiments': {'from_': 'fit'}
}


all_datasets = [ds for e in API.experiments(**inp) for ds in e.datasets]

pdf = API.pdf(**inp)

future_pred = dask.delayed(pure=True)(predictions)
c.gather(c.compute([np.mean(future_pred(ds, pdf), axis=0) for ds in all_datasets]))

Central predictions

The default validphys.convolution.predictions() computes one prediction for each replica in the PDF set (for Monte Carlo PDF sets). The user is then supposed to average the replica predictions to get a central value. A quick approximation is to use the central value directly. This is exact for DIS observables and a generally very good approximation for hadronic observables. The validphys.convolution.central_predictions() function may be appropriate for computations where the PDF error is not required, such as the central χ².

The previous example can be simpler using central_predictions:

from validphys.api import API
from validphys.convolution import central_predictions

inp = {
    'dataset_input': {'dataset': 'LHCB_Z0_13TEV_DIMUON-Y'},
    'theoryid': 40_000_000,
    'use_cuts': 'internal',
    'pdf': 'NNPDF40_nnlo_as_01180'
}


central_preds = central_predictions(API.dataset(**inp), API.pdf(**inp))

print(central_preds)

Linear predictions

DIS predictions are linear in the difference between PDF and central value, and hence in the Hessian error parameters. For hadronic observables this is only true to a good approximation. The validphys.convolution.linear_predictions() computes approximate predictions that are linear in the error parameters, and which may be useful in specific situations. In particular, for such predictions the prediction of the central replica is the same as the mean of the replica predictions:

import numpy as np
from validphys.loader import Loader
from validphys.convolution import predictions, linear_predictions, central_predictions

l = Loader()
pdf = l.check_pdf('NNPDF40_nnlo_as_01180')
ds = l.check_dataset('LHCB_Z0_13TEV_DIMUON-Y', theoryid=40_000_000)

# "Exact" predictions
p = predictions(ds, pdf).T
# Approximate predictions, neglecting the quadratic terms in the
# differences between each replica and the central value.
lp = linear_predictions(ds, pdf).T
# Central predictions
cp = central_predictions(ds, pdf).T


assert np.allclose(lp.mean(), cp)
assert not np.allclose(p.mean(), cp)
# Compute the size of the differences between approximate and true predictions
# over the PDF uncertainty. Take the maximum over the three ttbar data points.
print(((p - lp).std() / p.std()).max())

Loading CommonData

The underlying functions for loading CommonData can be found in nnpdf_data.commondataparser. The data is loaded as nnpdf_data.coredata.CommonData, which uses the dataclasses module which automatically generates some special methods for the class. The underlying data is stored as DataFrames, and so can be used with the standard pandas machinery:

import pandas as pd

from validphys.api import API
from nnpdf_data.commondataparser import load_commondata
# define dataset settings
ds_input={'dataset': 'CMS_Z0J_8TEV_PT-Y', 'cfac':('NRM'), 'variant':'sys_10'}
# first get the CommonDataSpec
cd = API.commondata(dataset_input=ds_input)
lcd = load_commondata(cd)
assert isinstance(lcd.central_values, pd.Series)
assert isinstance(lcd.systematics_table, pd.DataFrame)

The nnpdf_data.coredata.CommonData class has a method which returns a new instance of the class with cuts applied:

from validphys.api import API
from nnpdf_data.commondataparser import load_commondata
# define dataset and additional settings
ds_input={'dataset': 'CMS_Z0J_8TEV_PT-Y', 'cfac':('NRM'), 'variant':'sys_10'}
inp = {
    "dataset_input": ds_input,
    "use_cuts": "internal",
    "theoryid": 40_000_000
}
# first get the CommonDataSpec
cd = API.commondata(**inp)
lcd = load_commondata(cd)
# CommonDataSpec object ndata is always total data points uncut
assert lcd.ndata == cd.ndata
cuts = API.cuts(**inp)
lcd_cut = lcd.with_cuts(cuts)
# data has been cut, ndata should have changed.
assert lcd_cut.ndata != cd.ndata

An action already exists which returns the loaded and cut commondata, which is more convenient than calling the underlying functions:

api_lcd_cut = API.loaded_commondata_with_cuts(**inp)
assert api_lcd_cut.ndata == lcd_cut.ndata

Loading Covariance Matrices

Functions which take nnpdf_data.coredata.CommonData s and return covariance matrices can be found in validphys.covmats. As with the commondata the functions can be called in scripts directly:

import numpy as np
from validphys.api import API
from validphys.covmats import covmat_from_systematics

inp = {
    "dataset_input": {"dataset":"NMC_NC_NOTFIXED_P_EM-SIGMARED", "variant": "legacy"},
    "use_cuts": "internal",
    "theoryid": 40_000_000
}
lcd = API.loaded_commondata_with_cuts(**inp)
cov = covmat_from_systematics(lcd)
assert isinstance(cov, np.ndarray)
assert cov.shape == (lcd.ndata, lcd.ndata)

There exists a similar function which acts upon a list of multiple commondatas and takes into account correlations between datasets:

from validphys.covmats import dataset_inputs_covmat_from_systematics
inp = {
    "dataset_inputs": [
        {"dataset":"NMC_NC_NOTFIXED_P_EM-SIGMARED", "variant": "legacy_dw"}
        {"dataset":"NMC_NC_NOTFIXED_EM-F2", "variant": "legacy_dw"}
    ],
    "use_cuts": "internal",
    "theoryid": 40_000_000
}
lcds = API.dataset_inputs_loaded_cd_with_cuts(**inp)
total_ndata = np.sum([lcd.ndata for lcd in lcds])
total_cov = dataset_inputs_covmat_from_systematics(lcds)
assert total_cov.shape == (total_ndata, total_ndata)

These functions are also actions, which can be accessed directly from the API:

from validphys.api import API

inp = {
    "dataset_input": {"dataset":"NMC_NC_NOTFIXED_P_EM-SIGMARED", "variant": "legacy_dw"},
    "use_cuts": "internal",
    "theoryid": 40_000_000
}
# single dataset covmat
cov = API.covmat_from_systematics(**inp)
inp = {
    "dataset_inputs": [
        {"dataset":"NMC_NC_NOTFIXED_P_EM-SIGMARED", "variant": "legacy_dw"}
        {"dataset":"NMC_NC_NOTFIXED_EM-F2", "variant": "legacy_dw"}
    ],
    "use_cuts": "internal",
    "theoryid": 40_000_000
}
total_cov = API.dataset_inputs_covmat_from_systematics(**inp)

Loading LHAPDF PDFs

A wrapper class for LHAPDF PDFs is implemented in the validphys.lhapdfset module. An instance of this module will provide with a handful of useful wrappers to the underlying LHAPDF python interface. This is also the output of the pdf.load() method.

For example, the following will return the values for all 100 members of NNPDF4.0 for the gluon and the d-quark, at three values of x at Q=91.2.

from validphys.api import API
pdf = API.pdf(pdf="NNPDF40_nnlo_as_01180")
l_pdf = pdf.load()
alpha_s = l_pdf.central_member.alphasQ(91.2)
results = l_pdf.grid_values([21,1], [0.1, 0.2, 0.3], [91.2])