Filtering data
Introduction
In PDF fits, not all the data provided by the experimental collaborations are useful. For example, we may wish to discard certain datapoints for which we know small-x resummation or electroweak corrections are important. These effects are problematic since we know them to be important, but we cannot account for them.
In this light, we produce cuts of the data by keeping only those data points which we know are free of the above and other problems.
In validphys, the cuts are handled by the validphys.filters module, alongside filter definitions and defaults found within validphys.cuts.
Cuts as declarative filters
Due to the nature of data cuts, it is important to be transparent about
which cuts are being applied to which dataset and/or process. Moreover,
it is useful for the rules defining the data cuts to be readable, so
that a user who is not a developer can read and understand the nature of
each rule. This is achieved by making the rules functions of kinematic
variables such as p_T or Q2.
In much the same vein, it is useful for any default values used in the
rules to be readily accessible. For example, suppose there is a minimum
value for the squared momentum transfer in the DIS process, q2min, that
is used widely by many different rules. It is important for this
variable to be in an obvious and easily accessed location.
Defaults
There are certain values which are commonly used by many rules. For
example, q2min usually takes the value 3.49 and w2min is usually set to
12.5. It is thus useful to define these default values somewhere. These
values can be found within validphys.cuts inside the defaults.yaml file.
One can overwrite these values, as discussed later.
Filters
In validphys 2 the default filter rules can be found in the
validphys.cuts module within the filters.yaml file. This file is read by
validphys and is interpreted as a list of dictionaries.
By default, these filters can have several entries:
* dataset: The dataset this rule applies to
* process_type: The process type this rule applies to
* rule: The Python code defining the rule for this filter
* reason: (optional) The reason this rule was needed
* local_variables: (optional) Any additional, non-standard local variables the user wishes to add for this rule only.
Note
At least one of dataset or process_type is required. Additionally, a rule entry is always required.
The rule entry in the rule definition is evaluated as Python code. If
the rule does not apply to a particular datapoint (say the dataset names
don’t match) then we return None, indicating that this rule had nothing
to do with the datapoint. In this case, we move on to the next rule.
However, if the process type or dataset defined in the rule matches that
of the datapoint, we evaluate the rule. If the rule evaluates to False
we discard the point; if instead it returns True we move on to the next
rule. If all the rules have been evaluated and none of them has returned
False, then the datapoint passes and it is kept.
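The decision flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the validphys implementation: the Rule class, its fields and the passes_cuts function are hypothetical names, and the kinematics of a point are passed in as a plain dictionary.

```python
from dataclasses import dataclass
from math import fabs, log, sqrt
from typing import Optional


@dataclass
class Rule:
    """Hypothetical stand-in for one entry of the filter list."""

    rule: str
    dataset: Optional[str] = None
    process_type: Optional[str] = None

    def __call__(self, dataset, process_type, kinematics):
        # A rule that targets neither this dataset nor this process type
        # has nothing to say about the point.
        if self.dataset != dataset and self.process_type != process_type:
            return None
        namespace = {"sqrt": sqrt, "log": log, "fabs": fabs, **kinematics}
        return bool(eval(self.rule, {"__builtins__": {}}, namespace))


def passes_cuts(rules, dataset, process_type, kinematics):
    """Keep the point unless some applicable rule evaluates to False."""
    for rule in rules:
        if rule(dataset, process_type, kinematics) is False:
            return False
    return True


rules = [
    Rule(dataset="NMC", rule="x > 0.2"),  # returns None for other datasets
    Rule(dataset="CMSDY2D12", rule="sqrt(M2) >= 30.0"),
    Rule(process_type="EWK_RAP", rule="etay <= 2.2"),
]

kept = passes_cuts(rules, "CMSDY2D12", "EWK_RAP", {"etay": 1.0, "M2": 2500.0})
cut = passes_cuts(rules, "CMSDY2D12", "EWK_RAP", {"etay": 2.4, "M2": 2500.0})
print(kept, cut)
```

The first point survives every applicable rule, while the second is discarded as soon as the rapidity rule evaluates to False. The NMC rule returns None in both cases and is simply skipped.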
In addition, the user can add any theory parameter they wish. For
example, one could add PTO: NNLO, which means the rule is evaluated
only if the theory is NNLO. These are discussed further in the section
on theory parameters and perturbative orders.
One can see a full list of possible theory parameters using:
vp-checktheory <theory id>
Important
The rule entry should be interpreted as a str type within Python. As such,
a rule such as rule: True is not valid since this is read in as a boolean;
however, rule: "True" is perfectly valid notation. Moreover, the string
itself should be valid Python code.
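The two requirements on the rule entry (it must be a string, and that string must be valid Python) can be checked up front. The helper below is a hypothetical sketch using only the standard library; check_rule_entry is not a validphys function.

```python
def check_rule_entry(rule):
    """Return the compiled rule, or raise if it is not a usable string."""
    # A YAML loader turns an unquoted True into a bool, which is rejected here.
    if not isinstance(rule, str):
        raise TypeError(f"rule must be a string, got {type(rule).__name__}")
    # compile() raises SyntaxError when the string is not a valid expression.
    return compile(rule, "<rule>", "eval")


valid = check_rule_entry("True")  # the quoted form is accepted

for bad_entry in (True, "x >"):
    try:
        check_rule_entry(bad_entry)
    except (TypeError, SyntaxError) as error:
        print(f"rejected {bad_entry!r}: {type(error).__name__}")
```

The boolean is rejected with a TypeError and the incomplete expression with a SyntaxError, so both problems surface before any datapoint is evaluated.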
By default the user can use the following non-builtin mathematical
functions in their rules: sqrt, log or fabs (floating point absolute
value). In addition, one can use any numpy function using np.<function>
in their rule definition. For example:
rule: "np.exp(x) > 0.1"
The kinematic variables that can be used within the rule depend on the process type. A full list of available parameters can be found by running:
In [1]: from NNPDF import CommonData
In [2]: print(dict(CommonData.kinLabel))
The user may additionally define their own variables by adding the
local_variables field to their rule. For example, I can use w2 in my
rule, so long as I define what I mean by w2:
local_variables:
  w2: Q2 * (1 - x) / x
Danger
Defining local_variables is non-commutative. The order of definition is important.
If a local variable depends on other local variables, then the user must ensure all other
dependencies have already been defined.
The following would raise an error:
local_variables:
  w: sqrt(w2)
  w2: Q2 * (1 - x) / x
The following would not:
local_variables:
  w2: Q2 * (1 - x) / x
  w: sqrt(w2)
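The ordering requirement follows naturally if the definitions are evaluated one after another, each seeing only the names defined before it. The sketch below assumes this sequential evaluation; resolve_local_variables is a hypothetical helper and the kinematic values are made up for illustration.

```python
from math import sqrt


def resolve_local_variables(local_variables, kinematics):
    """Evaluate each definition in order, so later ones may use earlier ones."""
    namespace = {"sqrt": sqrt, **kinematics}
    # Python dictionaries preserve insertion order, mirroring the YAML order.
    for name, expression in local_variables.items():
        namespace[name] = eval(str(expression), {"__builtins__": {}}, namespace)
    return namespace


point = {"Q2": 20.0, "x": 0.5}

# w2 is defined before w, so sqrt(w2) can be evaluated.
good = resolve_local_variables({"w2": "Q2 * (1 - x) / x", "w": "sqrt(w2)"}, point)

# w is defined first, so w2 is still unknown when sqrt(w2) is evaluated.
try:
    resolve_local_variables({"w": "sqrt(w2)", "w2": "Q2 * (1 - x) / x"}, point)
    wrong_order_failed = False
except NameError:
    wrong_order_failed = True
```

In the second call the lookup of w2 fails with a NameError, which is the kind of error the wrongly ordered block above would produce in this sketch.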
Note
local_variables have a local scope. They apply only to the rule within which they are defined.
Theory parameters and perturbative orders
There are particular situations in which we only want to evaluate a rule if the theory input for the PDF matches certain conditions. For example, a rule may be relevant only if the theory includes intrinsic charm or is evaluated at NNLO.
Suppose for example I wish the rule to only be evaluated if the theory
includes intrinsic charm. We note that in theory.get_description() the
relevant entry is 'IC': 1 (we use here theory 53 for demonstration
purposes). Thus if I want my rule to be applied only if the theory has
intrinsic charm, I simply add to my rule:
IC: True
Similarly I can condition on flavour number scheme. I again check
theory.get_description() and note that the relevant key is 'FNS'. Thus
to only evaluate my rule if the FNS is FONLL-C, I simply add:
FNS: FONLL-C
Similarly, one can add any such theory description key into their rule.
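One way to picture this matching is as a key-by-key comparison between the conditions attached to a rule and the theory description. The sketch below is an assumption about the logic, not the validphys code: theory_matches is a hypothetical helper and the description dictionary is illustrative rather than the actual contents of any theory.

```python
def theory_matches(conditions, theory_description):
    """True when every requested key equals the value in the theory description."""
    return all(
        theory_description.get(key) == value for key, value in conditions.items()
    )


# Illustrative description containing only the two keys discussed above.
description = {"IC": 1, "FNS": "FONLL-C"}

# In Python True == 1, so IC: True matches an entry stored as 'IC': 1.
with_charm = theory_matches({"IC": True}, description)
with_both = theory_matches({"IC": True, "FNS": "FONLL-C"}, description)
other_scheme = theory_matches({"FNS": "FONLL-B"}, description)
```

Here with_charm and with_both are True, while other_scheme is False, so a rule carrying that last condition would not be evaluated for this theory.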
Tip
Sometimes, we may want to evaluate a rule provided the perturbative order is within
a certain range. For example, we may want a rule to be evaluated if the perturbative
order is strictly less than NLO. This can be done by using directives succeeding the
PTO declaration. In the above example, one would thus simply use:
PTO: NLO-
The following is a list of possible directives which can succeed a PTO declaration:
* +: Evaluate this rule if the theory PTO is greater than or equal to the preceding PTO
* -: Evaluate this rule if the theory PTO is strictly less than the preceding PTO
* !: Evaluate this rule if the theory PTO is not equal to the preceding PTO
Examples are:
PTO: NNLO!
PTO: N3LO-
PTO: LO+
If the user doesn’t specify a directive then that implies the rule will
only be evaluated if the declared PTO matches exactly with the PTO of
the theory.
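The directive semantics listed above can be expressed as a small lookup from the trailing character to a comparison. This is a sketch under assumptions: pto_applies is a hypothetical function, and the ordering LO < NLO < NNLO < N3LO is encoded by hand for the purpose of the example.

```python
import operator

# Assumed ordering of perturbative orders, used only for this sketch.
PTO_ORDER = {"LO": 0, "NLO": 1, "NNLO": 2, "N3LO": 3}

# Each directive compares the theory PTO (left) with the declared PTO (right).
DIRECTIVES = {
    "+": operator.ge,  # greater than or equal to
    "-": operator.lt,  # strictly less than
    "!": operator.ne,  # not equal to
}


def pto_applies(declaration, theory_pto):
    """Decide whether a rule with this PTO declaration applies to theory_pto."""
    if declaration[-1] in DIRECTIVES:
        compare = DIRECTIVES[declaration[-1]]
        declared = declaration[:-1]
    else:
        # No directive: the declared PTO must match the theory exactly.
        compare = operator.eq
        declared = declaration
    return compare(PTO_ORDER[theory_pto], PTO_ORDER[declared])


for declaration in ("NNLO!", "N3LO-", "LO+", "NLO"):
    print(declaration, pto_applies(declaration, "NNLO"))
```

For an NNLO theory the loop reports that NNLO! and NLO do not apply, while N3LO- and LO+ do.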
Overwriting filters and default values
One can overwrite the default behaviour by adding to the fit runcard.
Custom rules can be added by adding a filter_rules namespace in the
fit runcard. This should be a list of rules in the format outlined
above. For example:
filter_rules:
  - dataset: NMC
    rule: x > 0.2
Warning
Adding a filter_rules section to the runcard overwrites the default behaviour and does not append to it. With the above code snippet, this would be the only rule used by vp-setupfit. Use the added_filter_rules option to append rules when needed.
Similarly, the defaults can be overwritten by adding a filter_defaults namespace to the runcard. For example:
filter_defaults:
  q2min: 5
  w2min: 10
As in the case of the rules, this overwrites the original defaults and does not append to them.
Attention
To ensure backwards compatibility with old style runcards, if q2min and w2min are defined
under the datacuts namespace within the runcard, these values are read in and override the default
values. However, if this overriding occurs, a warning is displayed in standard output.
Adding filters to the default ones
An added_filter_rules key may be specified in the runcard. Its effect is to
append a list of filter rules to the rules obtained by the mechanisms described above. It is particularly useful when one wishes to analyze the effect of a sliding cut:
fit: mm_sm_hllhc_seed1_221222
pdf:
  from_: fit
# Retrieve default filters
use_cuts: "internal"
theoryid: 200
dataset_inputs:
  from_: fit
dataspecs:
  - speclabel: "Filter: 50"
    added_filter_rules:
      - process_type: EWK_MLL
        local_variables:
          mass_threshold: 50
        reason: "Variable mass filter"
        rule: "M_ll < mass_threshold"
  - speclabel: "Filter: 500"
    added_filter_rules:
      - process_type: EWK_MLL
        local_variables:
          mass_threshold: 500
        reason: "Variable mass filter"
        rule: "M_ll < mass_threshold"
template_text: |
  # χ² as a function of sliding cut
  {@dataspecs_chi2_table@}
actions_:
  - report(main=True)
The value of added_filter_rules should be a list of rules with the same format as filter_rules.
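The difference between replacing and appending can be summarised in a short sketch. It assumes only what is stated above (filter_rules replaces the defaults, added_filter_rules is appended to whatever list results); resolve_filter_rules is a hypothetical name and the rule dictionaries are abbreviated stand-ins.

```python
def resolve_filter_rules(default_rules, filter_rules=None, added_filter_rules=None):
    """Return the list of rules that would be applied.

    filter_rules, when given, replaces the defaults entirely;
    added_filter_rules is appended to whichever list was selected.
    """
    base = default_rules if filter_rules is None else filter_rules
    return list(base) + list(added_filter_rules or [])


# Illustrative stand-ins for the packaged defaults and the runcard entries.
defaults = [{"dataset": "ATLASZPT7TEV", "rule": "pT2 >= 30**2"}]
custom = [{"dataset": "NMC", "rule": "x > 0.2"}]
sliding = [{"process_type": "EWK_MLL", "rule": "M_ll < 50"}]

only_custom = resolve_filter_rules(defaults, filter_rules=custom)
defaults_plus_sliding = resolve_filter_rules(defaults, added_filter_rules=sliding)
```

With filter_rules alone the default rule disappears, whereas with added_filter_rules alone the default rule is kept and the sliding cut is evaluated after it.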
Examples
Consider the following filter from the filters.yaml file:
- dataset: ATLASZPT7TEV
  reason: Avoid the region where resummation effects become important.
  rule: "pT2 >= 30**2"
This rule applies only to the ATLASZPT7TEV dataset and keeps all
datapoints with a transverse momentum greater than or equal to 30 GeV.
The reason for the rule is also provided: datapoints with smaller
transverse momentum will be affected by resummation effects.
Now consider the slightly more complicated example:
- dataset: CMSDY2D12
  reason: Remove data points for which electroweak corrections are large.
  PTO: NNLO-
  local_variables:
    M: sqrt(M2)
    min_M: 30.0
    max_rapidity: 2.2
  rule: M >= min_M and etay <= max_rapidity
This rule only applies to CMSDY2D12. I wish for the rule to only be
evaluated provided the theory perturbative order is strictly less than
NNLO (i.e. LO or NLO). I check what the process type of CMSDY2D12 is:
In [1]: from validphys.loader import Loader
In [2]: l = Loader()
In [3]: cd = l.check_commondata("CMSDY2D12")
In [4]: cd.process_type
Out[4]: 'EWK_RAP'
Then cross check this against NNPDF.CommonData.kinLabels to see that the relevant kinematic variables are:
'EWK_RAP': ('etay', 'M2', 'sqrts'),
I choose to define custom local_variables in the form of M, which is
the square root of the invariant mass squared, i.e. just the invariant
mass. Moreover, I define values for the minimum M and the maximum
rapidity, which I use in my rule as cutoff values.
The rule itself is then self-explanatory. Notice, however, that it is
written in valid Python syntax. Finally, the reason for the rule is
given, which is to cut datapoints which are affected by electroweak
corrections.
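To make the example concrete, the rule can be applied by hand to a few points. The sketch below assumes the theory order is below NNLO, so that the rule is evaluated at all; keep_point is a hypothetical helper, and the kinematic values are invented for illustration rather than taken from the CMSDY2D12 data.

```python
from math import sqrt

# The rule from the example above, transcribed as a Python dictionary.
cmsdy2d12_rule = {
    "local_variables": {"M": "sqrt(M2)", "min_M": 30.0, "max_rapidity": 2.2},
    "rule": "M >= min_M and etay <= max_rapidity",
}


def keep_point(rule, kinematics):
    """Evaluate the local variables in order, then the rule itself."""
    namespace = {"sqrt": sqrt, **kinematics}
    for name, definition in rule["local_variables"].items():
        # Numbers are used as they are; strings are evaluated as expressions.
        if isinstance(definition, str):
            definition = eval(definition, {"__builtins__": {}}, namespace)
        namespace[name] = definition
    return bool(eval(rule["rule"], {"__builtins__": {}}, namespace))


# Hypothetical kinematics for three EWK_RAP points (etay, M2, sqrts).
low_mass = {"etay": 1.0, "M2": 20.0**2, "sqrts": 8000.0}
central = {"etay": 1.0, "M2": 50.0**2, "sqrts": 8000.0}
forward = {"etay": 2.3, "M2": 50.0**2, "sqrts": 8000.0}

results = {
    "low_mass": keep_point(cmsdy2d12_rule, low_mass),
    "central": keep_point(cmsdy2d12_rule, central),
    "forward": keep_point(cmsdy2d12_rule, forward),
}
```

Only the central point is kept: the low mass point fails M >= min_M and the forward point fails etay <= max_rapidity.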
As a final example consider the following rule:
- process_type: DIS_NCP_CH
  reason: |
    Missing higher order corrections to Delta F_IC, the piece that needs
    to be added to the FONLL-C calculation in the case of fitted charm.
  FNS: FONLL-C
  IC: True
  rule: "Q2 > 8"
Instead of this rule applying to one particular dataset, we see it is
applicable to all datasets that have process type DIS_NCP_CH. The
reason for the rule is rather involved and so yaml’s multiline string
syntax is used.
Finally, the user wishes for the rule to be evaluated only if the theory
input has the FONLL-C flavour number scheme and if the theory uses
intrinsic charm. The rule itself is trivial.