Defining custom pipelines

Here we discuss what is needed to go from user-entered strings in the YAML file to plots and reports.

The basic code flow is as follows:

The actions_ key is parsed to obtain a list of requirements with their associated fuzzyspec.

Each requirement may in turn depend on other requirements. These can be:

Providers: Other functions with requirements of their own.

User input processed by the configuration, which is immediately tested for correctness.

Production rules, also derived from the configuration.

Once the requirements are satisfied for a given provider, the checks of the provider are executed.

If all the checks pass, all the runtime requirements are executed in such an order that the dependencies are resolved.
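The ordering in the last step can be pictured as a topological sort of the requirement graph. The following is only a minimal sketch using Python's standard graphlib module, not the way reportengine is implemented; the resource names in the graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical requirement graph: each key depends on the names in its set.
requirements = {
    "plot_positivity": {"positivity_predictions"},
    "positivity_predictions": {"pdf", "posdataset"},
    "posdataset": {"theoryid"},
    "pdf": set(),
    "theoryid": set(),
}


def execution_order(graph):
    """Return the nodes so that every dependency precedes its dependents."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(requirements)
print(order)
```

Any order produced this way runs theoryid before posdataset, and both inputs of positivity_predictions before the provider itself.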

Configuration

A configuration class derived from reportengine.ConfigParser is used to parse the user input. In validphys, it is defined in validphys.config.

The parsing in reportengine is context dependent. Because we want to specify resources as much as possible before computing anything (at “compile time”), we need to have some information about other resources (e.g. theory folders) in order to do any meaningful processing.

The config class takes the user input and the dependencies and:

Returns a valid resource if the user input is valid.
Raises a ConfigError if the user input is invalid.

To parse a given user-entered key (e.g. posdataset), simply define a parse_posdataset function. The first argument after self will be the raw value in the configuration file. Any other arguments correspond to dependencies that are already resolved at the point where they are passed to the function (reportengine takes care of that).

For example, we might have:

def parse_posdataset(self, posset: dict, *, theoryid):
    ...

The type specification (dict above) makes sure that the user input is of that type before it is seen by the function (which avoids a bunch of repetitive error checking). A positivity dataset requires a theory ID in order to be meaningfully processed (i.e. to find the folder where the FK tables are) and therefore the theoryid will be looked for and processed first.

We need to document what the resource we are providing does. The docstring will be seen in validphys --help config:

def parse_posdataset(self, posset: dict, *, theoryid):
    """An observable used as positivity constraint in the fit.
    It is a mapping containing 'dataset' and 'poslambda'."""
    ...
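To make the parse_ convention more concrete, here is a toy dispatcher that mimics the behaviour described above: it looks up parse_<key>, checks the type annotation of the first argument, resolves the remaining arguments as dependencies, and raises an error otherwise. This is a sketch under stated assumptions, not reportengine.ConfigParser; the ToyConfig class, its resolve method, the stand-in ConfigError and the input values are all hypothetical.

```python
import inspect


class ConfigError(Exception):
    """Stand-in for the configuration error type (illustration only)."""


class ToyConfig:
    """Toy parser: NOT reportengine.ConfigParser, only the same naming idea."""

    def __init__(self, user_input):
        self.user_input = user_input

    def resolve(self, key):
        func = getattr(self, f"parse_{key}", None)
        if func is None:
            raise ConfigError(f"Don't know how to parse key '{key}'")
        if key not in self.user_input:
            raise ConfigError(f"Missing key '{key}'")
        raw = self.user_input[key]
        # For a bound method the signature no longer includes self.
        params = list(inspect.signature(func).parameters.values())
        first, deps = params[0], params[1:]
        expected = first.annotation
        # Check the annotated type before the function sees the raw value.
        if expected is not inspect.Parameter.empty and not isinstance(raw, expected):
            raise ConfigError(f"'{key}' must be of type {expected.__name__}")
        # The remaining parameters are dependencies, resolved first.
        resolved = {p.name: self.resolve(p.name) for p in deps}
        return func(raw, **resolved)

    def parse_theoryid(self, theoryid: int):
        return theoryid

    def parse_posdataset(self, posset: dict, *, theoryid):
        """A mapping containing 'dataset' and 'poslambda'."""
        return (posset["dataset"], float(posset["poslambda"]), theoryid)


# Made-up input, standing in for the parsed YAML file.
config = ToyConfig({
    "theoryid": 1,
    "posdataset": {"dataset": "SOME_POSSET", "poslambda": 1e6},
})
result = config.resolve("posdataset")
print(result)
```

In this sketch a list given under posdataset is rejected before parse_posdataset is ever called, which is the repetitive checking that the type annotation saves.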

Production rules

Apart from parse_ functions, which take an explicit user input from the corresponding key (and optionally a set of dependencies), there are the produce_ functions, which take only the dependencies. Other than not taking the user input, the produce_ functions work in a very similar way to the parse_ functions: they are resolved at “compile time”, before any provider function is executed, and they should raise a ConfigError if they fail.

In general, production rules should be preferred to parse functions that bundle together various dependencies (e.g. data, cuts and theory), because by having more granular elements, we can iterate over them in different ways: for example, we might want to generate a separate report page for each of the positivity datasets, where they are compared for multiple theories. We could break the parse function above into:

def parse_posdataset_input(self, posset: dict):
    ...

def produce_posdataset(self, posdataset_input, *, theoryid):
    ...

Now the user has to enter a key called “posdataset_input”, from which some Python object will be obtained as the return value of parse_posdataset_input. Then produce_posdataset combines that object with a theory ID to obtain an object representing the positivity set and the corresponding FK tables in the given theory.

Automatically parsing lists

It is possible to easily process a list of elements once the parsing for a single element has been defined. Simply add an element_of decorator to the parsing function defined in the Config class:

@element_of('posdatasets')
def parse_posdataset(self, posset: dict, *, theoryid):
    ...

Now posdatasets is parsed as a list of positivity datasets, which can be passed together to a provider, or iterated over (for example with a with tag in the report, see Generating reports).

Note that you can also put together results from evaluating providers using the collect function, which can be used to map computations over the lists described here.
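The idea behind parsing a list from a single-element parser can be sketched with a toy decorator. This is not the reportengine implementation of element_of: the decorator below, the _list_key attribute and the parse_list method are hypothetical, and the theoryid dependency is left out to keep the example short.

```python
def element_of(list_key):
    """Toy decorator: tag a single-element parser with the name of its list key."""
    def decorator(func):
        func._list_key = list_key
        return func
    return decorator


class ToyConfig:
    def __init__(self, user_input):
        self.user_input = user_input

    @element_of("posdatasets")
    def parse_posdataset(self, posset: dict):
        return (posset["dataset"], float(posset["poslambda"]))

    def parse_list(self, list_key):
        """Find the parser tagged with list_key and map it over the input list."""
        for name in dir(self):
            func = getattr(self, name)
            if callable(func) and getattr(func, "_list_key", None) == list_key:
                return [func(item) for item in self.user_input[list_key]]
        raise KeyError(list_key)


# Made-up input with two positivity datasets.
config = ToyConfig({
    "posdatasets": [
        {"dataset": "POS_A", "poslambda": 1},
        {"dataset": "POS_B", "poslambda": 2},
    ]
})
parsed = config.parse_list("posdatasets")
print(parsed)
```

The single-element parser is written once and the list key reuses it unchanged.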

Validphys loaders

In validphys, we use a Loader class to load resources from various folders. A common interface is useful because it can also be used to list the available resources of a given type or even to download a missing resource. The functions of type check_<resource> should take the information processed by the Config class and verify that a given resource is correct. If so, they should return a “resource specification”: an object typically containing metadata, such as the paths needed to load the final commondata or fktable.

In the case of the positivity set, this is entirely given in terms of existing check functions:

def check_posset(self, theoryID, setname, poslambda):
    cd = self.check_commondata(setname, 0)
    fk = self.check_fktable(theoryID, setname, [])
    th = self.check_theoryID(theoryID)
    return PositivitySetSpec(cd, fk, poslambda, th)

A more complicated example should raise the appropriate loader errors (see the other examples in the class).
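As an illustration of a check function that raises its own errors, the following sketch verifies that a theory folder exists and returns a lightweight spec. It is a toy model, not the validphys Loader: the error classes, the TheorySpec dataclass and the theory_<ID> folder layout are assumptions made for this example. The errors derive from FileNotFoundError so that a caller catching FileNotFoundError also catches them.

```python
import pathlib
import tempfile
from dataclasses import dataclass


class LoadFailedError(FileNotFoundError):
    """Hypothetical base class for loader errors in this sketch."""


class TheoryNotFound(LoadFailedError):
    """Hypothetical error raised when a theory folder is missing."""


@dataclass(frozen=True)
class TheorySpec:
    """Lightweight spec: holds metadata (ID and path), not loaded data."""
    id: int
    path: pathlib.Path


class ToyLoader:
    def __init__(self, datapath):
        self.datapath = pathlib.Path(datapath)

    def check_theoryID(self, theoryID):
        # The folder layout 'theory_<ID>' is an assumption of this sketch.
        path = self.datapath / f"theory_{theoryID}"
        if not path.is_dir():
            raise TheoryNotFound(f"Could not find theory {theoryID} at {path}")
        return TheorySpec(theoryID, path)


with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "theory_1").mkdir()
    loader = ToyLoader(tmp)
    spec = loader.check_theoryID(1)      # folder exists: returns a spec
    try:
        loader.check_theoryID(2)         # folder missing: raises
    except FileNotFoundError as e:
        missing_message = str(e)

print(spec.id, missing_message)
```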

In the code, PositivitySetSpec inherits from DataSetSpec, but one could roughly define it as:

class PositivitySetSpec:
    def __init__(self, commondataspec, fkspec, poslambda, thspec):
        self.commondataspec = commondataspec
        self.fkspec = fkspec
        self.poslambda = poslambda
        self.thspec = thspec

    @property
    def name(self):
        return self.commondataspec.name

    def __str__(self):
        return self.name

This contains all necessary information for validphys to be able to load the relevant fktable. It is generally better to pass around the spec objects because they are lighter and have more information (e.g. the theory in the above example).

With this, our parser method could look like this:

def parse_posdataset(self, posset: dict, *, theoryid):
    """An observable used as positivity constraint in the fit.
    It is a mapping containing 'dataset' and 'poslambda'."""
    bad_msg = ("posset must be a mapping with a name ('dataset') and "
               "a float multiplier ('poslambda')")

    theoryno, theopath = theoryid
    try:
        name = posset['dataset']
        poslambda = float(posset['poslambda'])
    except KeyError as e:
        raise ConfigError(bad_msg, e.args[0], posset.keys()) from e
    except ValueError as e:
        raise ConfigError(bad_msg) from e

    try:
        return self.loader.check_posset(theoryno, name, poslambda)
    except FileNotFoundError as e:
        raise ConfigError(e) from e

The first part makes sure that the user input is of the expected form (a mapping with a string and a number). The ConfigError has support for suggesting that something could be mistyped. The syntax is ConfigError(message, bad_key, available_keys). For example, if the user enters “poslanda” instead of “poslambda”, the error message would suggest the correct key.

Note that all possible error paths must end by raising a ConfigError.
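How a mistyped key can be detected from the (message, bad_key, available_keys) arguments may be sketched with the standard difflib module. This is an assumption about the general technique, not a description of how ConfigError is implemented; ToyConfigError and its suggestions method are hypothetical.

```python
import difflib


class ToyConfigError(Exception):
    """Toy error mimicking the (message, bad_key, available_keys) call pattern."""

    def __init__(self, message, bad_item=None, alternatives=None):
        super().__init__(message)
        self.bad_item = bad_item
        self.alternatives = list(alternatives) if alternatives else []

    def suggestions(self, n=1):
        """Return up to n alternatives that closely match the bad item."""
        if self.bad_item is None:
            return []
        return difflib.get_close_matches(self.bad_item, self.alternatives, n=n)


posset = {"dataset": "SOME_POSSET", "poslanda": 1e6}  # note the typo
try:
    poslambda = float(posset["poslambda"])
except KeyError as e:
    # e.args[0] is the key that was looked up; posset.keys() are the keys
    # that the user actually entered.
    err = ToyConfigError("posset needs 'dataset' and 'poslambda'",
                         e.args[0], posset.keys())

# The entered key that most resembles the expected one.
hint = err.suggestions()
print(hint)
```

Here the lookup of 'poslambda' fails and the sketch pairs it with the entered key 'poslanda', which is enough to tell the user which key was probably mistyped.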

Computing PDF-dependent quantities

Now that we can receive positivity sets as input, let’s do something with them. We can start by defining a class to produce and hold the results:

class PositivityResult(StatsResult):
    @classmethod
    def from_convolution(cls, pdf, posset):
        loaded_pdf = pdf.load()
        loaded_pos = posset.load()
        data = loaded_pos.GetPredictions(loaded_pdf)
        stats = pdf.stats_class(data.T)
        return cls(stats)

    @property
    def rawdata(self):
        return self.stats.data

pdf.stats_class allows the results of the convolution to be interpreted according to the PDF error type (e.g. to use the different formulas for the uncertainty of Hessian and Monte Carlo sets), which abstracts away the differences between error types. One constructs an object inheriting from validphys.core.Stats that is appropriate for a given error type by calling pdf.stats_class(data), where data is an array in which the entries along the first dimension are the results from each member (and the other dimensions are arbitrary). Stats has methods that appropriately collapse along the first axis. For example, central_value computes the mean along the first axis for Monte Carlo PDFs and yields the first member for Hessian PDFs.
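The difference between the two error types can be illustrated with a small standard-library sketch of the central_value behaviour just described. The classes below are toy stand-ins, not validphys.core.Stats or its subclasses, and the numbers are made up.

```python
from statistics import fmean


class ToyStats:
    """Holds per-member results: rows are PDF members, columns are data points."""

    def __init__(self, data):
        self.data = [list(row) for row in data]


class ToyMCStats(ToyStats):
    def central_value(self):
        # Monte Carlo: average over the members (the first axis).
        return [fmean(column) for column in zip(*self.data)]


class ToyHessianStats(ToyStats):
    def central_value(self):
        # Hessian: the first member is the central one.
        return list(self.data[0])


# Three members, two data points each (made-up numbers).
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

mc_central = ToyMCStats(data).central_value()
hessian_central = ToyHessianStats(data).central_value()
print(mc_central, hessian_central)
```

Code that receives either object can call central_value() without knowing which error type it is dealing with.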

And then define a simple provider function:

def positivity_predictions(pdf, positivityset):
    return PositivityResult.from_convolution(pdf, positivityset)