Obtaining the pseudodata used by an n3fit fit
Since version 4.0.7, the Monte Carlo data replicas are saved to disk by default as the fit is performed. This can be deactivated by setting the savepseudodata flag to False under the fitting namespace in the fit runcard:
fitting:
  savepseudodata: False
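As a rough illustration of how the default applies, the sketch below treats an already-parsed runcard as a plain Python dictionary and checks whether the pseudodata would be saved. The helper name pseudodata_is_saved is hypothetical and not part of n3fit or validphys; it only encodes the rule stated above, namely that saving is on unless savepseudodata is explicitly set to False.

```python
def pseudodata_is_saved(runcard: dict) -> bool:
    """Hypothetical helper (not part of n3fit): apply the documented default."""
    # A runcard without a "fitting" namespace falls back to an empty mapping.
    fitting = runcard.get("fitting") or {}
    # A missing flag means the default behaviour: pseudodata is saved.
    return fitting.get("savepseudodata", True) is not False


default_runcard = {"fitting": {}}
disabled_runcard = {"fitting": {"savepseudodata": False}}

print(pseudodata_is_saved(default_runcard))
print(pseudodata_is_saved(disabled_runcard))
```

In a real workflow the dictionary would come from parsing the runcard YAML file; that step is left out here to keep the sketch free of third-party dependencies.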
If the savepseudodata flag is not set to False, the training and validation splits are saved to disk: the training split in a file named datacuts_theory_fitting_training_pseudodata.csv, and the validation split in a similarly named file. These can then be loaded within validphys using the validphys.pseudodata.read_fit_pseudodata() action:
>>> from validphys.api import API
>>> pseudodata = API.read_fit_pseudodata(fit="pseudodata_test_fit_n3fit")
>>> replica1_info = pseudodata[0]
>>> replica1_info.pseudodata.loc[replica1_info.tr_idx]
                             replica 1
group dataset           id
ATLAS ATLASZPT8TEVMDIST 1    29.856281
                        3    14.686290
                        4     8.568288
                        5     2.848544
                        6     0.704977
...                                ...
NMC   NMCPD             247   0.688019
                        249   0.713272
                        255   0.673997
                        256   0.751973
                        259   0.750572

[223 rows x 1 columns]
To account for the postfit reshuffling of the replicas, use validphys.pseudodata.read_pdf_pseudodata() instead.
Reconstructing pseudodata
Warning
The functionality described here is not guaranteed to work across different versions of the code or its dependencies. Specifically, if anything changes the pseudodata generation between commits (for example, changes to the theory predictions, to the settings, or to the random number generator), pseudodata generated at one commit cannot be reconstructed with the code at the other.
Suppose one has obtained a fit using the n3fit framework and wants to do some analysis that requires knowing exactly the data that the neural networks saw during the fitting procedure, but this data has not been stored.
The information is reproducible given the various seeds in the fit runcard. The three random seeds used in the fit are trvlseed, which determines the training/validation splitting; nnseed, which concerns the initialization of the neural networks themselves; and mcseed, which is the seed used by the pseudodata generation. The ones we are interested in here are trvlseed and mcseed.
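The reason fixed seeds are enough to recover the data can be illustrated with a toy example. The sketch below is not how n3fit generates replicas or splits data: the functions toy_pseudodata and toy_trval_split, the 10% Gaussian fluctuation, the 75% training fraction, and the central values are all made up for illustration. It only shows that one seed fixes the fluctuated data and a separate seed fixes the training/validation split, so rerunning with the same seeds gives the same result.

```python
import random


def toy_pseudodata(central_values, mcseed):
    """Toy replica generation: Gaussian fluctuations driven only by mcseed."""
    rng = random.Random(mcseed)
    # Assumed 10% relative uncertainty, purely for illustration.
    return [rng.gauss(value, 0.1 * abs(value)) for value in central_values]


def toy_trval_split(n_points, trvlseed, training_fraction=0.75):
    """Toy training/validation split driven only by trvlseed."""
    rng = random.Random(trvlseed)
    indices = list(range(n_points))
    rng.shuffle(indices)
    n_train = int(training_fraction * n_points)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Made-up central values standing in for experimental data points.
central_values = [30.0, 15.0, 8.5, 3.0, 0.7, 0.69, 0.71, 0.67]

first = toy_pseudodata(central_values, mcseed=1)
second = toy_pseudodata(central_values, mcseed=1)
tr_idx, val_idx = toy_trval_split(len(central_values), trvlseed=2)

# Same mcseed, same pseudodata; the split depends only on trvlseed.
print(first == second)
print(tr_idx, val_idx)
```

Neither toy function uses a seed for network initialization, which mirrors why nnseed is not needed to recover the data.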
This functionality is exposed through the API via validphys.pseudodata.recreate_fit_pseudodata(), which retrieves the pseudodata information that we are interested in. Below is an example usage:
from validphys.api import API
API.recreate_fit_pseudodata(fit="pseudodata_test_fit_n3fit")
If instead we wish to account for the postfit reshuffling of the replicas that make it through the postfit selection, we must use the closely related validphys.pseudodata.recreate_pdf_pseudodata() API method:
from validphys.api import API
pseudodata = API.recreate_pdf_pseudodata(fit="pseudodata_test_fit_n3fit")
The return type of both functions is a list of validphys.pseudodata.DataTrValSpec objects. Each is a namedtuple containing the entire dataset, alongside the training and validation indices:
>>> type(pseudodata)
list
>>> type(pseudodata[0])
validphys.pseudodata.DataTrValSpec
>>> replica1 = pseudodata[0]
>>> replica1_tr = replica1.pseudodata.loc[replica1.tr_idx]
>>> replica1_tr
                            replica 1
group dataset           id
NMC   NMC               16   0.336004
                        22   0.349966
                        27   0.385452
                        29   0.361615
                        36   0.430297
...                               ...
ATLAS ATLASZPT8TEVMDIST 56  22.123374
                        59   7.284467
                        61   2.204524
                        62   0.671212
                        63   0.023891

[223 rows x 1 columns]
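To make the shape of the returned objects more concrete, here is a minimal stand-in built with the standard library only. ToyDataTrValSpec is a hypothetical name, the data are made up, and a plain dictionary plays the role of the pandas DataFrame. The field names pseudodata and tr_idx appear in the session above; val_idx is an assumed name for the validation indices.

```python
from collections import namedtuple

# Toy stand-in for the namedtuple described above; val_idx is an assumed name.
ToyDataTrValSpec = namedtuple("ToyDataTrValSpec", ["pseudodata", "tr_idx", "val_idx"])

# Made-up data: keys mimic the (group, dataset, id) index, values are data points.
toy_data = {
    ("GROUP_A", "DATASET_A", 0): 1.10,
    ("GROUP_A", "DATASET_A", 1): 0.95,
    ("GROUP_B", "DATASET_B", 0): 20.3,
    ("GROUP_B", "DATASET_B", 1): 7.1,
}

toy_replica = ToyDataTrValSpec(
    pseudodata=toy_data,
    tr_idx=[("GROUP_A", "DATASET_A", 0), ("GROUP_B", "DATASET_B", 1)],
    val_idx=[("GROUP_A", "DATASET_A", 1), ("GROUP_B", "DATASET_B", 0)],
)

# Select each split by index, analogous to the .loc selection shown above.
training = {key: toy_replica.pseudodata[key] for key in toy_replica.tr_idx}
validation = {key: toy_replica.pseudodata[key] for key in toy_replica.val_idx}

print(len(training), len(validation))
```

With the real objects, the validation points would presumably be selected in the same way as the training ones, by indexing the pseudodata with the validation indices instead of tr_idx.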
Note
When running this action from a runcard, it may be worthwhile to use the --parallel flag when calling validphys. This flag parallelizes the dependencies, so the pseudodata replicas are computed asynchronously. This is advantageous because the MC replica generation is computationally intensive.