The design of validphys 2
Introduction
The objective of this page is to explain the high-level considerations underpinning the design of validphys 2, and of reportengine [Kas19], the code it is based on. It should be useful for anyone aiming to understand the general philosophy and design goals of the project, or to take it in qualitatively different directions. More concrete information on how to use the code can be found in the sections under Using validphys.
Some specific issues of scientific code
There are some characteristics of typical scientific programs that are rather specific to them. A program aiming to optimize the workflow of a scientific collaboration should be designed with those in mind. Some of these characteristics make work easier compared to other types of programs like text editors or games. In general the runtime execution can be made simpler because user interactivity is not required. On the other hand we need easy access to all parts of the code, since we never know which parts might be subject to changes. That’s the whole point of research!
We now proceed to list these characteristics in more detail.
Lack of interactivity
Programs such as report generators, fitting codes, or simulations are typically determined entirely by the initial user input, together with perhaps a random source and environment events such as which replicas failed. We typically don't make network connections in the middle of a fit, and we don't wait for the user to push buttons. In many instances the amount of resources, such as memory, that a scientific program will require can be known shortly after it starts running.
Changing structure
It often happens that the hardcoded parameter you are optimizing becomes the next object of research. Scientific code should make as few assumptions as possible about how it is supposed to run, while still managing to get by and produce results. This somewhat upsets the balance of trade-offs compared to other kinds of programs: we are willing to trade some increased complexity for the ability to look under the hood.
Complexity
Non-trivial scientific programs can rarely be assembled exclusively by putting together external components without adding anything of our own. The code we add typically consists of unspecified, or at best underspecified, algorithms that are difficult to test in isolation.
Design considerations
Look before you leap
A scientific program usually requires a large set of complex input parameters and can take a very long time to produce results. Results are frequently unexpected even when everything is working correctly. It is essential to check that the inputs are correct and make sense. Thus good error reporting capabilities are an important design consideration.
Also, the inputs of a given function should be checked as early as possible, which is not necessarily at the point of the program where the function is to be called. For example, something like this:
def check_parameter_is_correct(parameter):
    ...

def plot(complex_calculation, parameter):
    check_parameter_is_correct(parameter)
    # make plot
    ...
has the disadvantage that in order to get to the check, we presumably need to compute complex_calculation first, and that could be a waste of time if it turns out that parameter is not correct for some reason and we can't do the plot anyway. Instead we should arrange to call check_parameter_is_correct(parameter) as early as possible (outside the plotting function) and show the user an informative error message.
All inputs should be checked as early as possible. If the checks pass, the program should not fail.
Of course it is impossible to guarantee that this is always going to happen. For example it is possible that we check that a folder exists and we have permissions to write to it, but by the time we actually need to write, the folder has been deleted. However in such cases the state of the program can be considered broken and it’s OK to make it fail completely.
There is an API for early checks in reportengine. We would write something like:
@make_argcheck
def check_parameter_is_correct(parameter):
    ...

@check_parameter_is_correct
def plot(complex_calculation, parameter):
    # make plot
    ...
The checking function will now be called as soon as the program realizes that the plot function will eventually be required (at "compile time"). The user is immediately shown an error message explaining why the parameter is wrong.
The fancy term for this style of coding is Contract Programming.
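A toy sketch of how such a decorator could work (assumed mechanics for illustration only, not the real reportengine implementation): the check is recorded on the function rather than executed, so a framework can run all recorded checks before any expensive computation.

```python
import inspect

def make_argcheck(check):
    # Toy version: turn a checking function into a decorator that
    # records the check on the provider instead of running it.
    def decorator(provider):
        provider._checks = getattr(provider, "_checks", []) + [check]
        return provider
    return decorator

@make_argcheck
def check_parameter_is_positive(parameter):
    if parameter <= 0:
        raise ValueError("parameter must be positive")

@check_parameter_is_positive
def plot(complex_calculation, parameter):
    # The expensive work only happens after every check has passed.
    return f"plot of {complex_calculation} with {parameter}"

def run_checks(provider, **inputs):
    # What a framework would do at "compile time": feed each recorded
    # check only the inputs it declares.
    for check in getattr(provider, "_checks", []):
        kwargs = {name: inputs[name] for name in inspect.signature(check).parameters}
        check(**kwargs)
```

The key point is that `run_checks` needs nothing but the input values, so it can run long before `complex_calculation` is ever computed.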
Declarative input
It is convenient to be able to specify what the program should do without any knowledge of how that is achieved by the underlying implementation. The primary input of validphys is YAML run cards. A very simple one looks like this:
pdfs:
  - NNPDF40_nlo_as_01180
  - NNPDF40_nnlo_as_01180
  - NNPDF40_nnlo_as_01180_hessian

norm:
  normalize_to: NNPDF40_nlo_as_01180

first:
  Q: 1
  flavours: [up, down, gluon]

second:
  Q: 100
  scale: linear

actions_:
  - first::norm plot_pdfreplicas
  - first plot_pdfs
  - second plot_pdfreplicas
This has a number of advantages:
Correct by definition
A declarative input specifies what you want. It is up to the underlying code to try to provide it (or fail with an informative message).
Clear meaning
It is easy for a human to verify that the input is indeed what was intended. Even without any explanation it should be easy enough to guess what the runcard above does.
Implementation independent
The input is very loosely coupled with the underlying implementation, and therefore it is likely to remain valid even after big changes in the code are made. For example, in the runcard above, we didn’t have to concern ourselves with how LHAPDF grids are loaded, and how the values of the PDFs are reused to produce the different plots. Therefore the underlying mechanism could change easily without breaking the runcard.
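To make the decoupling concrete, here is a minimal sketch (hypothetical names and logic, not the actual validphys machinery) of how a declarative runcard can be mapped onto code: a driver resolves each requested action by name and supplies only the inputs it declares, so the implementation behind each name can change freely.

```python
import inspect

def plot_pdfs(pdfs, Q):
    # Stand-in provider: a real one would return a matplotlib figure.
    return f"pdf plot of {len(pdfs)} PDFs at Q={Q}"

def plot_pdfreplicas(pdfs, Q):
    return f"replica plot of {len(pdfs)} PDFs at Q={Q}"

ACTIONS = {"plot_pdfs": plot_pdfs, "plot_pdfreplicas": plot_pdfreplicas}

def run(runcard):
    # The runcard says *what* to produce; this driver decides *how*.
    results = []
    for name in runcard["actions_"]:
        action = ACTIONS[name]  # unknown actions fail before any work is done
        params = {p: runcard[p] for p in inspect.signature(action).parameters}
        results.append(action(**params))
    return results

runcard = {
    "pdfs": ["NNPDF40_nlo_as_01180", "NNPDF40_nnlo_as_01180"],
    "Q": 100,
    "actions_": ["plot_pdfs", "plot_pdfreplicas"],
}
```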
Usage as a programmatic API
While the goal of reportengine is to allow simple and easily repeatable batch actions, sometimes it is far simpler to get the work done with a raw Python script, or one needs to explore the outcomes using something like a Jupyter notebook. It would be good to be able to use all the tools that already exist in validphys for that, without needing to reinvent the wheel or to alter functions so that, for example, they don't write data to some preconfigured path.
Therefore the various computing and plotting tools should work well when included in a normal script that doesn't use the reportengine graph compiler.
This is implemented by making sure that, as much as possible, all the validphys functions are pure. That is, the output is a deterministic function of the inputs, and the function has no side effects (e.g. no global state of the program is altered, nothing is written to disk). There are some exceptions to this, though. For example, the function that produces a reweighted PDF set needs to write the result to disk. The paths and side effects for other more common results like figures are managed by reportengine. For example, the @figure decorator applied to a function that returns a Python (matplotlib) figure will make sure that the figure is saved in the output path, with a nice filename, while having no effect at all outside the reportengine loop. The same goes for the check functions described above.
Easy to loop
Very frequently there is the need to compare. It is easy enough to write simple scripts that loop over the required configurations, but that cannot scale well when requirements change rapidly (and it is also easy to make trivial mistakes). Therefore reportengine allows configurations to easily loop over different sets of inputs. For example, the following runcard:
pdfs:
  - id: NNPDF40_nlo_as_01180
    label: NLO
  - id: NNPDF40_nnlo_as_01180
    label: NNLO

theoryids:
  - 208
  - 200

use_cuts: nocuts

experiments:
  - experiment: LHCb
    datasets:
      - { dataset: LHCB_DY_7TEV_MUON_Y, cfac: [NRM], variant: legacy }
      - { dataset: LHCB_DY_8TEV_MUON_Y, cfac: [NRM], variant: legacy }
  - experiment: ATLAS
    datasets:
      - { dataset: ATLAS_DY_7TEV_36PB_ETA, variant: legacy }

actions_:
  - theoryids::pdfs::experiments::experiment plot_fancy
will produce a separate plot for each combination of the two theories (208 and 200), the two PDFs at the top, and each of the three datasets in the two experiments (2 × 2 × 3 = 12 plots in total). This syntax is discussed in more detail in the [Usage] section.
It should be trivial to repeat an action for different sets of inputs.
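The namespace specification in the action above behaves like a cartesian product over the declared lists. A hand-rolled equivalent, with stand-in values mirroring the runcard, would be:

```python
from itertools import product

# Stand-ins mirroring the runcard above.
theoryids = [208, 200]
pdfs = ["NNPDF40_nlo_as_01180", "NNPDF40_nnlo_as_01180"]
experiments = {
    "LHCb": ["LHCB_DY_7TEV_MUON_Y", "LHCB_DY_8TEV_MUON_Y"],
    "ATLAS": ["ATLAS_DY_7TEV_36PB_ETA"],
}

datasets = [ds for sets in experiments.values() for ds in sets]
# theoryids::pdfs::experiments::experiment expands to one invocation of
# plot_fancy per element of this cartesian product.
combinations = list(product(theoryids, pdfs, datasets))
```

The point of the reportengine syntax is that this loop never has to be written by hand, so adding a theory or a dataset is a one-line change to the runcard.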