Hyperoptimization algorithm
Motivation
While the methodology used up to the 3.1 release of NNPDF considerably reduced the dependency on the functional form of the PDFs compared to other collaborations, there existed a bias regarding the choice of hyperparameters that define the NNPDF neural network and optimization strategy.
The hyperparameters of the model used by n3fit
are determined by running a grid scan on the
relevant parameters. Together with an appropriate
figure of merit this grid search or hyperparameter scan will minimize the bias of the network
by finding the best one for each possible situation.
The final goal is for the methodology to be robust enough that a change in the physics (fitted experiments, choice of basis, choice of constraints, …) depends only on a new run of the hyperparameter scan to be functional.
It is important to remember that the best hyperparameter combination is not necessarily the one that produces the minimal training/validation \(\chi^2\). In fact, looking for the minimal \(\chi^2\) is known to produce overlearning even when optimizing on the validation loss, as can be seen here.
Despite producing a very good \(\chi^2\), the previous fit will fail when challenged with new unseen data. This needs to be accounted for in the figure of merit of the hyperoptimization.
The desired features of this figure of merit can be summarized as:
Produce a low \(\chi^2\) for both fitted experiments and non-fitted experiments.
Be stable upon random fluctuations.
Be reliable even when the number of points is not very large.
K-folding cross-validation
A good compromise between all previous points is the usage of the cross-validation technique usually known as k-folds.
In its most general form, we take all data points that enter the fit and break them down into k partitions. Then, for every combination of hyperparameters, we do k fits leaving out a different partition each time. We then use this partition to evaluate the goodness of the fit for each of the k fits and construct, with these results, a reward function for the combination of hyperparameters.
In the NNPDF implementation of k-folding, each of the data points can be identified with a dataset. Note that during the fit we perform the usual training-validation split within each dataset and use it for stopping.
The choice of this method for selecting the hyperparameters of the NNPDF fitting methodology
has been discussed in the mailing list.
Some public discussion about the different hyperoptimization techniques that have been used and
tested during the development of n3fit
can be found in public slides
as well as in internal presentations.
The choice of figure of merit is still under development, but we have several possibilities.
By default we take the combination that produces the best average for the partitions’ \(\chi^2\).
An example of a DIS fit using this loss function can be found here: [best average]. It can be selected in the runcard using the target average
.
We can take the combination that produces the best worst loss.
An example of a DIS fit using this loss function can be found here: [best worst]. It can be selected in the runcard using the target best_worst
.
We can take the most stable combination which gets the loss under a certain threshold.
An example of a DIS fit using this loss function with the threshold \(\chi^2\) set to 2.0
can be found here: [best std].
It can be selected in the runcard using the target std
.
As observed, for DIS fits we obtain fits of similar quality using these losses. This is not unexpected but it is a good test of the robustness of the method.
While this method is much more robust that the previously used “test set” (which is similar to doing the limit \(k\rightarrow 1\)) we can still find overfitting configurations. For instance, if one of the metrics gives a much more complicated network structure, overfitting is expected. Here’s an example where, after 10000 hyperparameter trials, the network structure had an order of magnitude more free parameters than normal, in the case of the best average loss function: [best avg overlearned].
Creating partitions
The K-folding method is based on the creation of several partitions such that we can evaluate how the fit would behave on completely unseen data. The choice of this partitions is completely arbitrary, but defines the method completely. Here we list some important considerations to be taken into account when constructing these partitions.
The reward function of the partitions must be comparable.
All loss functions implemented in n3fit
for the optimization of hyperparameters use the reward
of all partitions as if they were equivalent.
When they are not equivalent the weight
flag should be used (see Practical Usage)
Not all datasets should enter a partition: beware of extrapolation.
Beyond the last dataset that has entered the fit we find ourselves in what is usually known as the extrapolation region. The behaviour of the fit in this region is not controlled by any data but rather by the choice of preprocessing exponents (\(\alpha\) at small x, \(\beta\) at large x). For this reason, if a dataset is included in a partition which however falls in the extrapolation region of the fit, its loss function will be determined by these exponents (which are randomly chosen) rather than by the hypeparameter combination.
The general rule that we follow is to always include in the fit the lowest-x dataset that determines each of the PDF functions. This means that no partition has datasets which falls in the extrapolation regions. As a practical proxy-rule we can classify the datasets by process type and exclude from the partitioning the ones that reach the lowest value of x.
Interpretation of results
While performing the hyperparameter scan we found that optimizing by only looking at the validation
loss produced results which would usually be considered overfitted: very low training and validation
\(\chi^2\) but very complex replica patterns. Thanks to the high performance of the n3fit
procedure the
usual within-dataset cross-validation algorithm used in the NNPDF framework was not enough to prevent overlearning
for all architectures.
The cross-validation implemented in NNPDF is successful in avoiding the learning of the noise within
a dataset. However, we observe that this choice is not enough to prevent overfitting due to
correlations between points in the same dataset when using hyperopt with n3fit
.
For hyperopt we have implemented k-folding cross-validation. This method works by refitting with the same set of parameters several times (k times) each time leaving out a partition of the datasets. By using this method we reduce the bias associated with a particular choice of the datasets to leave out, while at the same time, refitting with the same set of parameters allows us to assess the stability of the particular combination of hyperparameters.
Implementation in n3fit
The hyperparameter scan capabilities are implemented using the hyperopt framework which
systematically scans over a selection of parameter using Bayesian optimization and measures model
performance to select the best architecture.
A Jupyter Notebook is provided
with a practical example of the usage of the hyperopt framework. This example is a simplified version
of the hyperparameter scan used in n3fit
.
The hyperopt library implements the tree-structured Parzen estimator algorithm
which is a robust sequential-model-based optimization approach [SMBO].
We optimize on a combination of the best validation loss and the stability of the fits. In other words, we select the architecture that produces the lowest validation loss after we trim those combinations which are deemed to be unstable.
Note
The fits done for hyperoptimization are one-replica fits. We take advantage of the
stability of the Gradient Descent and of the fact that the difference between set of hyperparameters
is small. This is a trade-off as we sustain a loss of “accuracy” (as some very ill-behaved replicas
might destroy good sets of parameters) in exchange for being able to test many more parameters in
the same time. Once a multireplica n3fit
is implemented we can hyperoptimize without having to
rely on the one-replica proxy and without a loss of performance.
From the fitting point of view, the implementation of the k-folding is done by setting all experimental
data points from the fold to 0 and by masking the respective predictions from the Neural Network to 0.
In the code this means that during the data-reading phase n3fit
also creates one mask per k-fold
per experiment to apply to the experimental data before compiling the Neural Network.
Note that this is not a boolean mask that drops the points but rather it just sets the data to 0.
The reason for doing it in this way is to minimize the number of things that change when doing a
hyperparameter scan with respect to a fit.
Implementation in validphys
A hyperscan
object is also available from validphys
which behaves as a special case of fit
.
It can be accessed and inspected through the validphys API (see Using the validphys API).
The product of a hyperparameter scan are tries.json
files which can be acccessed with the
tries_files
attribute.
from validphys.api import API
hyperscan = API.hyperscan(hyperscan="test_hyperopt_fit_300621")
It is also possible to access a hyperscan
by using the validphys
loader with:
from validphys.loader import Loader
l = Loader()
hyperscan = l.check_hyperscan("test_hyperopt_fit_300621")
Positivity and integrability
Since positivity is a hard constraint of the fit (i.e., a replica fit will not be marked as good unless it passes the positivity constraints), it enters the hyperoptimization in a similar way. There is no threshold, either the replica passes positivity or it doesn’t, and if it doesn’t hyperopt will receive a failure instead of a fit (so the run will be discarded).
Integrability instead is implemented as a penalty.
In order to activate it it is necessary to add integrability
to the penalties section
of the hyperoptimization namespace (see below).
In this case the integrability is implemented as an exponential penalty, this means that as
the “integrability number” grows, the test loss will grow as well, favouring replicas with
an “integrability number” below the chosen threshold.
For consistency the threshold used during hyperoptimization is read directly from the fitveto.py
variable.
Practical Usage
Note
An example runcard can be found at n3fit/runcards/Basic_hyperopt.yml
.
The partitions can be chosen by adding a kfold::partitions
key to the runcard.
kfold:
target: average
verbosity:
training: True
kfold: True
threshold: 5.0
penalties:
- saturation
- patience
- integrability
partitions:
- overfit: True
datasets:
- data_1
- data_2
- weight: 2.0
datasets:
- data_3
- datasets:
- data_4
- data_5
The overfit
flag, when applied to one of the partitions, introduces this partition in the
fitted data, i.e., the training and validation always include that partition and will work normally.
This is useful for very broad scans where we want to find an architecture which is able to
fit, without worrying about things like overlearning which might be a second-order problem.
The weight
(default 1.0) is multiplied with the loss function of the partition for which it is set.
Note that the weight is applied before the threshold check.
The threshold_loss
flag will make the fit stop if any of the partitions produces a loss greater
than the given threshold. This is useful for quickly discarding hyperparameter subspaces without
needing to do all k
fits.
The verbosity
dictionary allows fine control over what to report each 100 epochs. When both training
and kfold
are set to False
, nothing is printed until the end of the fit of the fold.
When set to True
, the losses for the training (training and validation) and for the partition are printed.
During hyperoptimization we might want to search for specific features, such as quickly fitting
(giving an incentive to quicker runs) or avoiding saturation (increasing the loss for models that
have produce saturation after a fit). New penalties can easily be added in the src/n3fit/hyper_optimization/penalties.py
file.
The target function for minimization can be selected with the target
key.
By default, and if no target
is chosen, n3fit
defaults to
the average of the loss function over the partition sets (average
).
New target functions can be easily added in the src/n3fit/hyper_optimization/rewards.py
file.
The hyperoptimization procedure performed in hep-ph/1907.05075 used a slightly different approach in order to avoid overfitting, by leaving out a number of datasets to compute a “testing set”. The loss function was then computed as
The group of datasets that were left out followed the algorithm mentioned above with only one fold:
NMC
BCDMSP
BCDMSD
HERACOMBNCEP460
H1HERAF2B
D0ZRap
CDFR2KT
D0WMASY
ATLASZHIGHMASS49FB
CMSZDIFF12
ATLASTTBARTOT
These were chosen attending to their process type as defined in their commondata files.
Changing the hyperoptimization target
Beyond the usual \(\chi2\)-based optimization figures above, it is possible to utilize other measures as the target for hyperoptimization.
Future tests
One possibility is to use a future test-based metric for which the goal is not to get the minimum \(\chi2\) but to get the same \(\chi2\) (with PDF errors considered) for different datasets. The idea is that this way we select models of which the prediction is stable upon variations in the dataset.
In order to obtain the PDF errors used in the figure of merit it is necessary to run multiple replicas, luckily n3fit
provides such a possibility also during hyperoptimization.
Take the following modifications to a normal hyperopt runcard (note that for convenience we take the trials directly from a previous run, so we don’t have to create a new hyperopt configuration dictionary).
dataset_inputs:
- {dataset: NMC_NC_NOTFIXED_DW_EM-F2, frac: 0.75, variant: legacy}
- {dataset: NMC_NC_NOTFIXED_P_EM-SIGMARED, frac: 0.75, variant: legacy}
- {dataset: SLAC_NC_NOTFIXED_P_DW_EM-F2, frac: 0.75, variant: legacy}
- {dataset: SLAC_NC_NOTFIXED_D_DW_EM-F2, frac: 0.75, variant: legacy}
- {dataset: BCDMS_NC_NOTFIXED_P_DW_EM-F2, frac: 0.75, variant: legacy}
- {dataset: BCDMS_NC_NOTFIXED_D_DW_EM-F2, frac: 0.75, variant: legacy}
- {dataset: HERA_NC_251GEV_EP-SIGMARED, frac: 0.75, variant: legacy}
- {dataset: HERA_CC_318GEV_EM-SIGMARED, frac: 0.75, variant: legacy}
- {dataset: HERA_CC_318GEV_EP-SIGMARED frac: 0.75, variant: legacy}
hyperscan_config:
use_tries_from: 210508-hyperopt_for_paper
kfold:
target: fit_future_tests
partitions:
- datasets:
- HERA_CC_318GEV_EP-SIGMARED
- HERA_CC_318GEV_EM-SIGMARED
- HERA_NC_251GEV_EP-SIGMARED
- datasets:
parallel_models: true
same_trvl_per_replica: true
We can run this hyperparameter scan for 10 parallel replicas for 20 trials with:
n3fit runcard.yml 1 -r 10 --hyperopt 20
The above runcard will, for a sample of 20 trials in 210508-hyperopt_for_paper
(according to their rewards),
run two fits of 10 replicas each.
The first fit will hide the data from HERA and the second one (an empty fold) will take into consideration all data.
In order to properly set up a future test the last fold (the future) is recommended to be left as an empty fold such that no data is masked out.
The figure of merit will be the difference between the \(\chi2\) of the second fit to the folded data and the \(\chi2\) of the first fit to the folded data including pdf errors
(the future test \(\chi2\)).
New hyperoptimization metrics with fold and replica statistics
The hyperopt measures discussed above are all based on performing a single replica fit per fold.
However, one may also wish to run the hyperoptimization algorithm on fits consisting of many
replicas per fold. This is a feasible option in n3fit
, since it has been optimised to
efficiently run many replica fits in parallel on GPU.
The combination of k-folding and multi-replica experiments opens several possibilities for the choice of figure of merit. The simplest option would be to minimize the average of \(\chi^2\) across both replica and k folds, i.e.,
In NNPDF, this hyperoptimisation metrics is selected via the following generic runcard:
dataset_inputs:
...
kfold:
loss_type: chi2
replica_statistic: average_best
fold_statistic: average
penalties_in_loss: False
partitions:
- datasets:
...
- datasets:
...
parallel_models: true
The key replica_statistic
defines how to combine all replicas when perform a multireplica hyperopt.
With average
a simple average will be taken, average_best
instead will take the 90% best replicas,
mimicking what is done in a real post-fit selection.
The fold_statistic
instead defines how to combine the loss of the different folds.
While the values for the penalties
are always saved during the hyperopt run, by default they are not
considered by the hyoperoptimizaton algorithm.
If they are to be considered the key penalties_in_loss
needs to be set to True
.
By combining the average
, best_worst
, and std
figures of merit discussed in K-folding cross-validation,
several alternatives may arise. For example, one approach could involve minimizing
the maximum value of the set of averaged-over-replicas \(\chi^2\),
with correspond runcard kfold
settings:
dataset_inputs:
...
kfold:
loss_type: chi2
replica_statistic: average
fold_statistic: best_worst
partitions:
- datasets:
...
- datasets:
...
An alternative metric that is sensitive to higher moments of the probability distribution has been defined in NNPDF3.0 [see Eq. (4.6) therein], namely, the \(\varphi\) estimator. In the context of hyperopt, \(\varphi^{2}\) can be calculated for each k-fold as
where the first term represents the usual averaged-over-replicas \(\left< \chi^2_{k} \right>_{\rm rep}\) calculated based on the dataset used in the fit (\(\mathcal{D}\)) and the theory predictions from each fitted PDF (\(f_{\rm fit}\)) replica. The second term involves the calculation of \(\chi2\) but now with respect to the theory predictions from the central PDF.
On the basis of \(\varphi\), we define the loss as
which selects hyperparameters that favor the most conservative extrapolation. In NNPDF, this figure of merit is chosen using the following settings:
kfold:
loss_type: phi2
fold_statistic: average
...
Alternatively, it is currently also possible to combine \(L_1\) (which is only sensitive to the first moment)
with \(L_3\) (which provides information on the second moment).
For example, one might minimize \(L_1\) while monitoring the values of \(L_3\) for a final model selection.
The optimal approach for this combination is still under development.
All the above options are implemented in the HyperLoss
class
which is instantiated and monitored within hyperparametrizable()
method.
Restarting hyperoptimization runs
In addition to the tries.json
files, hyperparameter scans also produce tries.pkl
pickle files,
which are located in the same directory as the corresponding tries.json
file.
The generated tries.pkl
file stores the complete history of a previous hyperoptimization run, making it possible to resume the process using the hyperopt
framework.
To achieve this, you can use the --restart
option within the n3fit
command, e.g.,:
n3fit runcard.yml 1 -r 10 --hyperopt 20 --restart
The above command example is effective when the number of saved trials in the test_run/nnfit/replica_1/tries.pkl
is
less than 20
. If there are 20
or more saved trials, n3fit
will simply terminate, displaying the best results.
Running hyperoptimizations in parallel with MongoDB
It is possible to run hyperoptimization scans in parallel using MongoDB.
This functionality is provided by the MongoFileTrials
class,
which extends the capabilities of hyperopt’s MongoTrials and enables the
simultaneous evaluation of multiple trials.
To run a parallelized hyperopt search, use the following command:
n3fit hyper-quickcard.yml 1 -r N_replicas --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N
Here, N
represents the number of MongoDB workers you wish to launch in parallel.
Each mongo worker handles one trial in Hyperopt. So, launching more workers allows for the simultaneous calculation of a greater number of trials.
Note that there is no need to manually launch MongoDB databases or mongo workers prior to using n3fit
,
as the mongod
and hyperopt-mongo-worker
commands are automatically executed
by start()
and
start_mongo_workers()
methods, respectivelly.
By default, the host
and port
arguments are set to localhost
and 27017
. The database is named hyperopt-db-output_name
, where
output_name
is set to the name of the runcard. If the n3fit -o OUTPUT
option is provided, output_name
is set to OUTPUT
, with the database being referred to as hyperopt-db-OUTPUT
.
If necessary, it is possible to modify all the above settings using the n3fit --db-host
, n3fit --db-port
and n3fit --db-name
options.
To resume a hyperopt experiment, add the --restart
option to the n3fit
command:
n3fit hyper-quickcard.yml 1 -r N_replicas --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N --restart
Note that, unlike in serial execution, parallel hyperoptimization runs do not generate tries.pkl
files.
Instead, MongoDB databases are saved as hyperopt-db-output_name.tar.gz
files inside replica_path
directory.
These are conveniently extracted for reuse in restart runs.