While the methodology used up to the 3.1 release of NNPDF considerably reduced the dependency on the functional form of the PDFs compared to other collaborations, there existed a bias regarding the choice of hyperparameters that define the NNPDF neural network and optimization strategy.
One of the main advantages introduced by the
n3fit framework with respect to
nnfit is the
possibility of running the fits in a fraction of the time. This allows us to reduce the dependence on the
hyperparameters by running a grid scan over the relevant parameters. Together with an appropriate
figure of merit, this grid search or hyperparameter scan minimizes the bias of the network
by finding the best hyperparameter combination for each possible situation.
The final goal is for the methodology to be robust enough that a change in the physics (fitted experiments, choice of basis, choice of constraints, …) requires only a new run of the hyperparameter scan.
It is important to remember that the best hyperparameter combination is not necessarily the one that produces the minimal training/validation \(\chi^2\). In fact, looking for the minimal \(\chi^2\) is known to produce overlearning even when optimizing on the validation loss, as can be seen here.
Despite producing a very good \(\chi^2\), such a fit will fail when challenged with new, unseen data. This needs to be accounted for in the figure of merit of the hyperoptimization.
The desired features of this figure of merit can be summarized as:
Produce a low \(\chi^2\) for both fitted experiments and non-fitted experiments.
Be stable upon random fluctuations.
Be reliable even when the number of points is not very large.
A good compromise between all the previous points is the use of the cross-validation technique usually known as k-folding.
In its most general form, we take all data points that enter the fit and break them down into k partitions. Then, for every combination of hyperparameters, we do k fits leaving out a different partition each time. We then use this partition to evaluate the goodness of the fit for each of the k fits and construct, with these results, a reward function for the combination of hyperparameters.
In the NNPDF implementation of k-folding, the role of the data points of the general algorithm is played by whole datasets, i.e., the partitions are built out of datasets. Note that during the fit we still perform the usual training-validation split within each dataset and use it for stopping.
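As a schematic illustration of this procedure (a minimal sketch with toy partitions and a fake fitting function, not the n3fit implementation), the reward for a given hyperparameter combination is built from the losses of the held-out partitions:

import numpy as np

# Toy stand-ins for the real partitions and the real fitting code.
partitions = [["data_1", "data_2"], ["data_3"], ["data_4", "data_5"]]

def toy_fit_and_evaluate(fitted, held_out, rng):
    # Pretend to fit the `fitted` datasets and return the chi2 on the held-out partition.
    return 1.0 + 0.2 * rng.random() + 0.05 * len(held_out)

rng = np.random.default_rng(0)
fold_losses = []
for k, held_out in enumerate(partitions):
    fitted = [ds for j, part in enumerate(partitions) if j != k for ds in part]
    fold_losses.append(toy_fit_and_evaluate(fitted, held_out, rng))

# The reward (to be minimized) for this hyperparameter combination,
# here simply the average of the held-out chi2 values.
reward = np.mean(fold_losses)
print(fold_losses, reward)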
The choice of this method for selecting the hyperparameters of the NNPDF fitting methodology
has been discussed in the mailing list.
Some public discussion about the different hyperoptimization techniques that have been used and
tested during the development of
n3fit can be found in public slides
as well as in internal presentations.
The choice of figure of merit is still under development, but we have several possibilities.
By default we take the combination that produces the best average for the partitions’ \(\chi^2\).
An example of a DIS fit using this loss function can be found here: [best average]. It can be selected in the runcard using the target average.
We can take the combination that produces the best worst loss.
An example of a DIS fit using this loss function can be found here: [best worst]. It can be selected in the runcard using the target best_worst.
We can take the most stable combination which gets the loss under a certain threshold.
An example of a DIS fit using this loss function with the threshold \(\chi^2\) set to 2.0
can be found here: [best std].
It can be selected in the runcard using the target std.
As observed, for DIS fits we obtain fits of similar quality using these losses. This is not unexpected but it is a good test of the robustness of the method.
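Schematically, these three figures of merit can be computed from the list of per-fold \(\chi^2\) values as follows (a toy sketch, not the actual n3fit reward functions):

import numpy as np

fold_chi2 = np.array([1.2, 1.5, 1.1, 1.8])  # toy per-fold chi2 values

# Best average: minimize the mean of the per-fold losses.
best_average = np.mean(fold_chi2)

# Best worst: minimize the largest (worst) per-fold loss.
best_worst = np.max(fold_chi2)

# Best std: among combinations whose losses stay below a threshold,
# prefer the most stable one (smallest spread across folds).
threshold = 2.0
best_std = np.std(fold_chi2) if np.all(fold_chi2 < threshold) else np.inf

print(best_average, best_worst, best_std)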
While this method is much more robust than the previously used “test set” (which corresponds to the limit \(k\rightarrow 1\)), we can still find overfitting configurations. For instance, if one of the metrics favours a much more complicated network structure, overfitting is expected. Here is an example where, after 10000 hyperparameter trials, the network structure had an order of magnitude more free parameters than usual, in the case of the best average loss function: [best avg overlearned].
The k-folding method is based on the creation of several partitions such that we can evaluate how the fit would behave on completely unseen data. The choice of these partitions is in principle arbitrary, but it completely defines the method. Here we list some important considerations to be taken into account when constructing these partitions.
The reward function of the partitions must be comparable.
All loss functions implemented in
n3fit for the optimization of hyperparameters use the rewards
of all partitions as if they were equivalent.
When they are not equivalent, the
weight flag should be used (see Practical Usage).
Not all datasets should enter a partition: beware of extrapolation.
Beyond the last dataset that has entered the fit we find ourselves in what is usually known as the extrapolation region. The behaviour of the fit in this region is not controlled by any data but rather by the choice of preprocessing exponents (\(\alpha\) at small x, \(\beta\) at large x). For this reason, if a dataset included in a partition falls in the extrapolation region of the fit, its loss function will be determined by these exponents (which are randomly chosen) rather than by the hyperparameter combination.
The general rule that we follow is to always include in the fit the lowest-x dataset that determines each of the PDF functions. This means that no partition contains datasets which fall in the extrapolation region. As a practical proxy rule we can classify the datasets by process type and exclude from the partitioning the ones that reach the lowest value of x, as sketched below.
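The proxy rule can be illustrated as follows, with hypothetical dataset names, process types and minimum-x values (not the real dataset metadata):

# Toy sketch of the proxy rule: hypothetical datasets with their process type
# and the lowest x value they reach.
datasets = [
    {"name": "data_1", "process": "DIS", "min_x": 1e-5},
    {"name": "data_2", "process": "DIS", "min_x": 1e-3},
    {"name": "data_3", "process": "DY", "min_x": 1e-4},
    {"name": "data_4", "process": "DY", "min_x": 1e-2},
]

# For each process type, keep the dataset reaching the lowest x always in the fit.
always_fitted = set()
for process in {d["process"] for d in datasets}:
    lowest = min((d for d in datasets if d["process"] == process), key=lambda d: d["min_x"])
    always_fitted.add(lowest["name"])

# Only the remaining datasets are available for the k-fold partitions.
partitionable = [d["name"] for d in datasets if d["name"] not in always_fitted]
print("always in the fit:", sorted(always_fitted))
print("available for partitions:", partitionable)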
Interpretation of results
While performing the hyperparameter scan we found that optimizing by only looking at the validation
loss produced results which would usually be considered overfitted: very low training and validation
\(\chi^2\) but very complex replica patterns. Due to the high performance of the
n3fit procedure, the
usual within-dataset cross-validation algorithm used in the NNPDF framework was not enough to prevent overlearning
for all architectures.
The cross-validation implemented in NNPDF is successful in avoiding the learning of the noise within
a dataset. However, we observe that this choice is not enough to prevent overfitting due to
correlations between points in the same dataset when running hyperopt with n3fit.
For hyperopt we have implemented k-folding cross-validation. This method works by refitting with the same set of parameters several times (k times) each time leaving out a partition of the datasets. By using this method we reduce the bias associated with a particular choice of the datasets to leave out, while at the same time, refitting with the same set of parameters allows us to assess the stability of the particular combination of hyperparameters.
The hyperparameter scan capabilities are implemented using the hyperopt framework which
systematically scans over a selection of parameters using Bayesian optimization and measures model
performance to select the best architecture.
A Jupyter Notebook is provided
with a practical example of the usage of the hyperopt framework. This example is a simplified version
of the hyperparameter scan used in n3fit.
The hyperopt library implements the tree-structured Parzen estimator algorithm
which is a robust sequential-model-based optimization approach [SMBO].
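A minimal, self-contained example of driving the tree-structured Parzen estimator through the hyperopt library (with a toy objective function standing in for a full fit, and made-up hyperparameter names):

from hyperopt import fmin, hp, tpe, Trials

# Toy search space: two hypothetical hyperparameters.
space = {
    "learning_rate": hp.loguniform("learning_rate", -9, -3),
    "nodes": hp.quniform("nodes", 10, 50, 5),
}

def objective(params):
    # Stand-in for a full fit: return a figure of merit for this combination.
    return (1e3 * params["learning_rate"] - 0.05) ** 2 + (params["nodes"] - 25) ** 2 / 100

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)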
We optimize on a combination of the best validation loss and the stability of the fits. In other words, we select the architecture that produces the lowest validation loss after we trim those combinations which are deemed to be unstable.
The fits done for hyperoptimization are one-replica fits. We take advantage of the
stability of the Gradient Descent and of the fact that different replicas fitted with the same set of hyperparameters
behave similarly. This is a trade-off: we accept a loss of “accuracy” (as some very ill-behaved replicas
might destroy good sets of parameters) in exchange for being able to test many more hyperparameter combinations in
the same amount of time. Once a multireplica
n3fit is implemented we can hyperoptimize without having to
rely on the one-replica proxy and without a loss of performance.
From the fitting point of view, the implementation of the k-folding is done by setting all experimental
data points from the fold to 0 and by masking the respective predictions from the Neural Network to 0.
In the code this means that during the data-reading phase
n3fit also creates one mask per k-fold
per experiment to apply to the experimental data before compiling the Neural Network.
Note that this is not a boolean mask that drops the points but rather it just sets the data to 0.
The reason for doing it in this way is to minimize the number of things that change when doing a
hyperparameter scan with respect to a fit.
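Schematically (a numpy sketch, not the actual Keras-level code), the fold mask multiplies both the experimental data and the predictions, so that the held-out points contribute zero to the loss while the shapes of all arrays, and hence the compiled model, stay unchanged:

import numpy as np

def apply_fold_mask(exp_data, predictions, fold_mask):
    # fold_mask is 1 for points kept in the fit and 0 for points in the held-out fold.
    return exp_data * fold_mask, predictions * fold_mask

data = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 1.9, 3.2, 3.8])
mask = np.array([1.0, 1.0, 0.0, 0.0])  # last two points belong to the held-out fold
print(apply_fold_mask(data, pred, mask))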
A hyperscan object is also available from
validphys, which behaves as a special case of a fit.
It can be accessed and inspected through the validphys API (see Using the validphys API).
The products of a hyperparameter scan are
tries.json files, which can be accessed with the validphys API:

from validphys.api import API
hyperscan = API.hyperscan(hyperscan="test_hyperopt_fit_300621")
It is also possible to access a
hyperscan by using the
validphys loader with:
from validphys.loader import Loader
l = Loader()
hyperscan = l.check_hyperscan("test_hyperopt_fit_300621")
Positivity and integrability
Since positivity is a hard constraint of the fit (i.e., a replica fit will not be marked as good unless it passes the positivity constraints), it enters the hyperoptimization in a similar way. There is no threshold: either the replica passes positivity or it does not, and if it does not, hyperopt receives a failure instead of a fit result (so the trial is discarded).
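In hyperopt terms this can be sketched as follows (a toy objective, not the n3fit driver): a trial that fails positivity reports STATUS_FAIL and is therefore discarded rather than ranked by its loss.

from hyperopt import STATUS_FAIL, STATUS_OK

def toy_fit(hyperparameters):
    # Stand-in for a real fit: pretend that large learning rates break positivity.
    passes_positivity = hyperparameters["learning_rate"] < 0.1
    loss = 10 * hyperparameters["learning_rate"]
    return loss, passes_positivity

def objective(hyperparameters):
    loss, passes_positivity = toy_fit(hyperparameters)
    if not passes_positivity:
        return {"status": STATUS_FAIL}
    return {"loss": loss, "status": STATUS_OK}

print(objective({"learning_rate": 0.01}))
print(objective({"learning_rate": 0.5}))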
Integrability is instead implemented as a penalty.
In order to activate it, it is necessary to add
integrability to the penalties section
of the hyperoptimization namespace (see below).
In this case the integrability is implemented as an exponential penalty: as
the “integrability number” grows, the test loss grows as well, favouring replicas with
an “integrability number” below the chosen threshold.
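A possible sketch of such an exponential penalty (the precise functional form used in n3fit may differ):

import numpy as np

def integrability_penalty(integrability_number, threshold=0.5):
    # Negligible below the threshold, grows rapidly once the number exceeds it.
    return np.expm1(max(0.0, integrability_number - threshold))

for number in (0.1, 0.5, 1.0, 2.0):
    print(number, integrability_penalty(number))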
For consistency, the threshold used during hyperoptimization is read directly from the runcard.
An example runcard can be found at
The partitions can be chosen by adding a
kfold::partitions key to the runcard.
kfold:
  target: average
  verbosity:
    training: True
    kfold: True
  threshold: 5.0
  penalties:
    - saturation
    - patience
    - integrability
  partitions:
    - overfit: True
      datasets:
        - data_1
        - data_2
    - weight: 2.0
      datasets:
        - data_3
    - datasets:
        - data_4
        - data_5
The overfit flag, when applied to one of the partitions, includes this partition in the
fitted data, i.e., the training and validation always include that partition and proceed normally.
This is useful for very broad scans where we want to find an architecture which is able to
fit, without worrying about things like overlearning which might be a second-order problem.
The weight flag (default 1.0) is multiplied with the loss function of the partition for which it is set.
Note that the weight is applied before the threshold check.
The threshold_loss flag will make the fit stop if any of the partitions produces a loss greater
than the given threshold. This is useful for quickly discarding hyperparameter subspaces without
needing to perform all k fits.
The verbosity dictionary allows fine control over what is reported every 100 epochs. When both
training and kfold are set to
False, nothing is printed until the end of the fit of the fold.
When set to
True, the losses for the training (training and validation) and for the partition are printed.
During hyperoptimization we might want to search for specific features, such as fitting quickly
(giving an incentive to quicker runs) or avoiding saturation (increasing the loss for models that
produce saturation after a fit). New penalties can easily be added in the penalties module of n3fit, as sketched below.
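Schematically, each active penalty adds an extra term on top of the loss of the trial; for example, a hypothetical penalty (not the actual n3fit interface) that rewards fits which stop early could look like:

# Hypothetical penalty, not the actual n3fit interface: discourage slow fits by
# adding a term proportional to the fraction of the epoch budget that was used.
def speed_penalty(epochs_used, max_epochs, strength=1.0):
    return strength * epochs_used / max_epochs

fold_loss = 1.3  # toy loss of the held-out partition
total_loss = fold_loss + speed_penalty(epochs_used=800, max_epochs=1000)
print(total_loss)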
The target function for minimization can be selected with the target key.
By default, and if no
target is chosen,
n3fit defaults to
the average of the loss function over the partition sets (average).
New target functions can be easily added in the rewards module of n3fit.
The hyperoptimization procedure performed in hep-ph/1907.05075 used a slightly different approach in order to avoid overfitting, by leaving out a number of datasets to compute a “testing set”. The loss function was then computed as a combination of the validation and testing \(\chi^2\).
The group of datasets that were left out followed the algorithm mentioned above with only one fold;
they were chosen according to their process type as defined in their commondata files.