Filtering data

Introduction

In PDF fits, not all the data provided by the experimental collaborations are useful. For example, we may wish to discard certain datapoints for which we know small-x resummation or electroweak corrections are important. These effects are problematic since we know them to be important, but we cannot account for them.

We therefore apply cuts to the data, keeping only the data points which we know are free of these and other problems.

In validphys, the cuts are handled by the validphys.filters module, alongside the filter definitions and defaults found within validphys.cuts.

Cuts as declarative filters

Due to the nature of data cuts, it is important to be transparent about which cuts are being applied to which dataset and/or process. Moreover, the rules defining a data cut should be readable, so that a user who is not a developer can read and understand them. This is achieved by making the rules functions of kinematic variables such as p_T or Q2.

In much the same vein, it is useful for any default values used in the rules to be readily accessible. For example, suppose there is a minimum value for the squared momentum transfer in DIS processes, q2min, that is used by many different rules. It is important for this variable to be in an obvious and easily accessed location.

Defaults

There are certain values which are commonly used by many rules. For example, the value q2min usually takes the value 3.49 or w2min is usually set to 12.5.

It is thus useful to define these default values in one place. They can be found within validphys.cuts inside the defaults.yaml file. These values can be overwritten, as discussed in the section on overwriting filters and default values.

Filters

In validphys 2 the default filter rules can be found in the validphys.cuts module within the filters.yaml file. This file is read by validphys and is interpreted as a list of dictionaries.

By default, these filters can have several entries:

  1. dataset: The dataset this rule applies to

  2. process_type: The process type this rule applies to

  3. rule: The Python code defining the rule for this filter

  4. reason: (optional) The reason this rule was needed

  5. local_variables: (optional) Any additional, non-standard local variables the user wishes to add for this rule only.

Note

At least one of dataset or process_type is required. Additionally, a rule entry is always required.

The rule entry in the rule definition is evaluated as Python code. If the rule does not apply to a particular datapoint (say the dataset names don’t match) then we return None, indicating that this rule had nothing to do with the datapoint, and we move on to the next rule. However, if the process type or dataset defined in the rule matches that of the datapoint, we evaluate the rule. If the rule evaluates to False we discard the point; if instead it evaluates to True we move on to the next rule. If all the rules have been evaluated without any of them returning False, the datapoint passes and is kept.
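The control flow described above can be sketched in a few lines of Python. This is a minimal illustration and not the validphys implementation: the helper names evaluate_rule and passes_cuts, the dictionary layout of a rule, the way dataset and process_type are combined, and the sample kinematic values are all assumptions made for the example.

```python
def evaluate_rule(rule, dataset, process_type, kinematics):
    """Return None if the rule does not apply, else the truth value of the rule.

    Assumption: a rule applies when either its dataset or its process_type
    matches the datapoint.
    """
    applies = (
        rule.get("dataset") == dataset
        or rule.get("process_type") == process_type
    )
    if not applies:
        return None
    # The rule string is evaluated with the kinematic variables as local names.
    return bool(eval(rule["rule"], {"__builtins__": {}}, dict(kinematics)))


def passes_cuts(rules, dataset, process_type, kinematics):
    """Keep the point unless some applicable rule evaluates to False."""
    for rule in rules:
        result = evaluate_rule(rule, dataset, process_type, kinematics)
        if result is None:
            continue  # this rule has nothing to do with the datapoint
        if not result:
            return False  # first failing rule discards the point
    return True


# Illustrative rules and kinematics (hypothetical values).
rules = [
    {"dataset": "NMC", "rule": "x > 0.2"},
    {"process_type": "DIS_NCP_CH", "rule": "Q2 > 8"},
]
print(passes_cuts(rules, "NMC", "DIS_NCE", {"x": 0.3, "Q2": 5.0}))  # True
print(passes_cuts(rules, "NMC", "DIS_NCE", {"x": 0.1, "Q2": 5.0}))  # False
```

A point that no rule applies to is kept, since no rule ever returns False for it.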

In addition, the user can add any theory parameter they wish. For example, one could add PTO: NNLO, which means the rule is evaluated only if the theory is NNLO. These are discussed further in the section on theory parameters and perturbative orders. One can see a full list of possible theory parameters using: vp-checktheory <theory id>

Important

The rule entry should be interpreted as a str type within Python. A rule such as rule: True is therefore not valid, since it is read in as a boolean, whereas rule: "True" is perfectly valid notation. Moreover, the string itself should be valid Python code.
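The two requirements above (the rule is a string, and the string is valid Python) can be checked with the standard library alone. The function name validate_rule below is hypothetical and the sketch is not taken from validphys; it only illustrates what kind of input would be rejected.

```python
def validate_rule(rule):
    """Check that a rule is a string containing a valid Python expression."""
    if not isinstance(rule, str):
        # e.g. YAML `rule: True` is parsed as a bool, not a str
        raise TypeError(f"rule must be a string, got {type(rule).__name__}")
    try:
        # Only parse the expression; nothing is evaluated here.
        compile(rule, "<rule>", "eval")
    except SyntaxError as exc:
        raise ValueError(f"rule is not valid Python: {rule!r}") from exc
    return rule


validate_rule("True")        # accepted: a string holding a valid expression
validate_rule("x > 0.2")     # accepted
# validate_rule(True)        # would raise TypeError
# validate_rule("x >")       # would raise ValueError
```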

By default the user can use the following non-builtin mathematical functions in their rules: sqrt, log or fabs (floating point absolute value). In addition, one can use any numpy function using np.<function> in their rule definition. For example:

rule: "np.exp(x) > 0.1"

The kinematic variables that can be used within the rule depend on the process type. A full list of available parameters can be found by running:

In [1]: from NNPDF import CommonData

In [2]: print(dict(CommonData.kinLabel))

The user may additionally define their own variables by adding the local_variables field to their rule. For example, I can use w2 in my rule, so long as I define what I mean by w2:

local_variables:
  w2: Q2 * (1 - x) / x

Danger

Defining local_variables is non-commutative. The order of definition is important. If a local variable depends on other local variables, then the user must ensure all other dependencies have already been defined.

The following would raise an error

local_variables:
  w: sqrt(w2)
  w2: Q2 * (1 - x) / x

The following would not

local_variables:
  w2: Q2 * (1 - x) / x
  w: sqrt(w2)
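The reason the order matters can be seen in a small sketch in which each local variable is evaluated in turn and added to the namespace seen by the following ones. The helper build_namespace and the sample kinematics are assumptions for illustration; the error raised by validphys itself may differ from the NameError shown here.

```python
import math


def build_namespace(kinematics, local_variables):
    """Evaluate local variables in definition order on top of the kinematics."""
    namespace = {"sqrt": math.sqrt, "log": math.log, "fabs": math.fabs}
    namespace.update(kinematics)
    # dicts preserve insertion order, so this follows the order in the YAML
    for name, expression in local_variables.items():
        namespace[name] = eval(str(expression), {"__builtins__": {}}, namespace)
    return namespace


kinematics = {"x": 0.5, "Q2": 4.0}  # hypothetical datapoint

good = {"w2": "Q2 * (1 - x) / x", "w": "sqrt(w2)"}
ns = build_namespace(kinematics, good)
print(ns["w2"], ns["w"])  # 4.0 2.0

bad = {"w": "sqrt(w2)", "w2": "Q2 * (1 - x) / x"}
try:
    build_namespace(kinematics, bad)
except NameError as exc:
    print("error:", exc)  # w2 is not yet defined when w is evaluated
```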

Note

local_variables have a local scope. They apply to only the rule within which they are defined.

Theory parameters and perturbative orders

There are particular situations in which we only want to evaluate a rule if the theory input for the PDF matches certain conditions. For example, it may be the case we only keep the datapoint provided the theory includes intrinsic charm or is evaluated at NNLO.

Suppose for example I wish the rule to only be evaluated if the theory includes intrinsic charm. In the output of theory.get_description() the relevant entry is 'IC': 1 (we use here theory 53 for demonstration purposes). Thus if I want my rule to be applied only if the theory has intrinsic charm, I simply add to my rule:

IC: True

Similarly I can condition on flavour number scheme. I again check theory.get_description() and note that the relevant key is 'FNS'. Thus to only evaluate my rule if the FNS is FONLL-C, simply add:

FNS: FONLL-C

Similarly, one can add any such theory description key into their rule.
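As a rough sketch of the idea, one can think of every key in a rule that is not one of the standard rule entries as a condition on the theory description. The names RULE_FIELDS and matches_theory, and the reduced theory dictionaries, are assumptions made for this example rather than the validphys implementation, and only plain equality is handled here.

```python
# Entries that describe the rule itself rather than a theory condition.
RULE_FIELDS = {"dataset", "process_type", "rule", "reason", "local_variables"}


def matches_theory(rule, theory_description):
    """True if every theory key in the rule equals the value in the theory."""
    conditions = {k: v for k, v in rule.items() if k not in RULE_FIELDS}
    return all(theory_description.get(k) == v for k, v in conditions.items())


rule = {
    "process_type": "DIS_NCP_CH",
    "FNS": "FONLL-C",
    "IC": True,
    "rule": "Q2 > 8",
}

# Hypothetical subsets of a theory description.
with_ic = {"IC": 1, "FNS": "FONLL-C"}
without_ic = {"IC": 0, "FNS": "FONLL-C"}

print(matches_theory(rule, with_ic))     # True, since True == 1 in Python
print(matches_theory(rule, without_ic))  # False
```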

Tip

Sometimes, we may want to evaluate a rule provided the perturbative order is within a certain range. For example, we may want a rule to be evaluated if the perturbative order is strictly less than NLO. This can be done by using directives succeeding the PTO declaration.

In the above example, one would thus simply use:

PTO: NLO-

The following directives can succeed a PTO declaration:

  * +: Evaluate this rule if the theory PTO is greater than or equal to the preceding PTO

  * -: Evaluate this rule if the theory PTO is strictly less than the preceding PTO

  * !: Evaluate this rule if the theory PTO is not equal to the preceding PTO

Examples are:

PTO: NNLO!
PTO: N3LO-
PTO: LO+

If the user doesn’t specify a directive then that implies the rule will only be evaluated if the declared PTO matches exactly with the PTO of the theory.
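The semantics of the directives can be summarised in a short sketch. It assumes the perturbative orders map to integers as LO = 0, NLO = 1, NNLO = 2, N3LO = 3, and the names parse_pto and pto_matches are hypothetical; the actual parsing in validphys may be organised differently.

```python
import operator

# Assumed integer encoding of the perturbative order.
PTO_ORDERS = {"LO": 0, "NLO": 1, "NNLO": 2, "N3LO": 3}

# Directive character -> comparison applied as compare(theory_pto, declared_pto).
DIRECTIVES = {"+": operator.ge, "-": operator.lt, "!": operator.ne}


def parse_pto(declaration):
    """Split a declaration such as 'NLO-' into (order, comparison)."""
    declaration = declaration.strip()
    if declaration[-1] in DIRECTIVES:
        return PTO_ORDERS[declaration[:-1]], DIRECTIVES[declaration[-1]]
    # No directive: the theory PTO must match exactly.
    return PTO_ORDERS[declaration], operator.eq


def pto_matches(declaration, theory_pto):
    """True if a theory with integer order theory_pto satisfies the declaration."""
    declared, compare = parse_pto(declaration)
    return compare(theory_pto, declared)


print(pto_matches("NLO-", 0))   # True: LO is strictly less than NLO
print(pto_matches("NNLO!", 2))  # False: the theory is exactly NNLO
print(pto_matches("LO+", 3))    # True: N3LO is greater than or equal to LO
print(pto_matches("NNLO", 1))   # False: no directive means exact match
```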

Overwriting filters and default values

One can overwrite the default behaviour by adding to the fit runcard.

Custom rules can be added by adding a filter_rules: namespace in the fit runcard. This should be a list of rules in the format outlined above. For example:

filter_rules:
  - dataset: NMC
    rule: x > 0.2

Warning

Adding a filter_rules section to the runcard overwrites the default behaviour and does not append to the default behaviour. By adding the above code snippet, this would be the only rule used by vp-setupfit. Use the added_filter_rules option to append rules when needed.

Similarly the defaults can be overwritten by adding a filter_defaults namespace to the runcard. For example:

filter_defaults:
  q2min: 5
  w2min: 10

As in the case of the rules, this overwrites the original defaults and does not append to them.

Attention

To ensure backwards compatibility with old style runcards, if q2min and w2min are defined under the datacuts namespace within the runcard, these values are read in and override the default values. However, if this overriding occurs, a warning is displayed in standard output.

Adding filters to the default ones

An added_filter_rules key may be specified in the runcard. Its effect is to append a list of filter rules to the rules obtained by the mechanisms described above. It is particularly useful when one wishes to analyze the effect of a sliding cut:

fit: mm_sm_hllhc_seed1_221222

pdf:
  from_: fit


# Retrieve default filters
use_cuts: "internal"

theoryid: 200

dataset_inputs:
  from_: fit


dataspecs:
  - speclabel: "Filter: 50"
    added_filter_rules:
      - process_type: EWK_MLL
        local_variables:
            mass_threshold: 50
        reason: "Variable mass filter"
        rule: "M_ll < mass_threshold"

  - speclabel: "Filter: 500"
    added_filter_rules:
      - process_type: EWK_MLL
        local_variables:
            mass_threshold: 500
        reason: "Variable mass filter"
        rule: "M_ll < mass_threshold"


template_text: |
  # χ² as a function of sliding cut
  {@dataspecs_chi2_table@}


actions_:
  - report(main=True)

The value of added_filter_rules should be a list of rules with the same format as filter_rules.
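The difference between filter_rules (which replaces the defaults) and added_filter_rules (which appends to whatever rules are in effect) can be summarised as follows. The function resolve_rules and the sample rules are hypothetical and serve only to illustrate the semantics described in this section.

```python
def resolve_rules(default_rules, filter_rules=None, added_filter_rules=None):
    """Return the list of rules in effect for a given runcard configuration."""
    # filter_rules, when present, replaces the defaults entirely.
    rules = list(default_rules if filter_rules is None else filter_rules)
    # added_filter_rules is appended to whichever list was selected above.
    rules.extend(added_filter_rules or [])
    return rules


defaults = [{"dataset": "ATLASZPT7TEV", "rule": "p_T2 >= 30**2"}]
custom = [{"dataset": "NMC", "rule": "x > 0.2"}]
added = [{"process_type": "EWK_MLL", "rule": "M_ll < 50"}]

print(len(resolve_rules(defaults)))                            # 1
print(len(resolve_rules(defaults, filter_rules=custom)))       # 1
print(len(resolve_rules(defaults, added_filter_rules=added)))  # 2
```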

Examples

Consider the following filter from the filters.yaml file:

- dataset: ATLASZPT7TEV
  reason: Avoid the region where resummation effects become important.
  rule: "p_T2 >= 30**2"

This rule applies only to the ATLASZPT7TEV dataset and keeps all datapoints with a transverse momentum greater than or equal to 30 GeV. The reason for the rule is also provided: datapoints with smaller transverse momentum are affected by resummation effects.

Now consider the slightly more complicated example:

- dataset: CMSDY2D12
  reason: Remove data points for which electroweak corrections are large.
  PTO: NNLO-
  local_variables:
    M: sqrt(M2)
    min_M: 30.0
    max_rapidity: 2.2
  rule: M >= min_M and etay <= max_rapidity

This rule only applies to CMSDY2D12. I wish for the rule to only be evaluated provided the theory perturbative order is strictly less than NNLO (i.e. LO or NLO). I check what the process type of CMSDY2D12 is:

In [1]: from validphys.loader import Loader

In [2]: l = Loader()

In [3]: cd = l.check_commondata("CMSDY2D12")

In [4]: cd.process_type
Out[4]: 'EWK_RAP'

Then cross check this against NNPDF.CommonData.kinLabel to see that the relevant kinematic variables are:

'EWK_RAP': ('etay', 'M2', 'sqrts'),

I choose to define custom local_variables in the form of M which is the square root of the invariant mass squared, i.e. just the invariant mass. Moreover, I define a value for minimum M and maximum rapidity which I use in my rule as cutoff values.

The rule itself is then self-explanatory. Notice, however, that it is written in valid Python syntax. Finally, the reason for the rule is given, which is to cut datapoints that are affected by electroweak corrections.
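To make the effect of this rule concrete, it can be written out as an ordinary Python function acting on the three EWK_RAP kinematic variables. The function name keep_point and the sample kinematic values are made up for illustration; the thresholds are those given in local_variables above.

```python
import math


def keep_point(etay, M2, sqrts, min_M=30.0, max_rapidity=2.2):
    """Evaluate the CMSDY2D12 rule for a single datapoint.

    sqrts is part of the EWK_RAP kinematics but is not used by this rule.
    """
    M = math.sqrt(M2)  # invariant mass from the invariant mass squared
    return M >= min_M and etay <= max_rapidity


# Hypothetical datapoints: (etay, M2, sqrts)
print(keep_point(1.0, 2500.0, 8000.0))  # True: M = 50 and |y| within range
print(keep_point(2.3, 2500.0, 8000.0))  # False: rapidity above 2.2
print(keep_point(1.0, 400.0, 8000.0))   # False: M = 20 is below 30
```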

As a final example consider the following rule:

- process_type: DIS_NCP_CH
  reason: |
    Missing higher order corrections to Delta F_IC, the piece that needs
    to be added to the FONLL-C calculation in the case of fitted charm.
  FNS: FONLL-C
  IC: True
  rule: "Q2 > 8"

Instead of this rule applying to one particular dataset, we see it is applicable to all datasets that have process type DIS_NCP_CH. The reason for the rule is rather involved and so yaml’s multiline string syntax is used.

Finally, the user wishes for the rule to be evaluated only if the theory input has the FONLL-C flavour number scheme and if the theory uses intrinsic charm. The rule itself is trivial.