Abstract
distr6 is an object-oriented (OO) probability distributions interface leveraging the extensibility and scalability of R6 and the speed and efficiency of Rcpp. Over 50 probability distributions are currently implemented in the package with ‘core’ methods, including density, distribution, and generating functions, and more ‘exotic’ ones, including hazards and distribution function anti-derivatives. In addition to simple distributions, distr6 supports compositions such as truncation, mixtures, and product distributions. This paper presents the core functionality of the package and demonstrates examples for key use-cases. In addition, this paper provides a critical review of the object-oriented programming paradigms in R and describes some novel implementations for design patterns and core object-oriented features introduced by the package for supporting distr6 components.Probability distributions are an essential part of data science, underpinning models, simulations, and inference. Hence, they are central to computational data science. With the advent of modern machine learning and AI, it has become increasingly common to adopt a conceptual model where distributions are considered objects in their own right, as opposed to primarily represented through distribution-defining functions (e.g., cdf, pdf) or random samples.
An important distinction to keep in mind is between random variables (that can be sampled from) and probability distributions. distr6 is an interface for probability distributions and supports construction, manipulation, composition, and querying of parameterized simple and composite distributions. distr6 is not an interface for random variables, and therefore, procedures such as sampling and inference are out of scope.
We continue by explaining our conceptual model of probability distributions underpinning the design of distr6 and delineate it from the common conceptualization of random variables. A full mathematical definition of the conceptual model is given in the next section. This section contains an intuitive introduction.
First, we invite the reader to recall some common mathematical objects and recognize that these are related but conceptually distinct:
Critically, we highlight that random variables and distributions are neither identical objects nor concepts. A random variable \(X\) has distribution \(d\), and multiple random variables may be distributed according to \(d\). Further, random variables are sampled from, while the distribution is only a description of probabilities for \(X\). Thus, \(X\) and \(d\) are not identical objects. Figure 1 visually summarizes these differences.
As a possible logical consequence of the above, we adopt the conceptual model that distribution is an abstract object, which:
Abstracting distributions as objects from multiple, non-identical, representations (random variables), introduces major consequences for the conceptual model:
It lends itself naturally to a class-object representation in the computer scientific sense of object-oriented programming. Abstract distributions become classes, concrete distributions are objects, and distribution defining functions are methods of these classes. Random variables are a separate type of object.
It strongly suggests adoption of mathematical conceptualization and notation which cleanly separates distributions from random variables and distribution defining functions - in contrast to common convention, where random variables or random sampling takes conceptual primacy above all.
It allows clean formulation of algorithmic manipulations involving distributions, especially higher-order constructs (truncation, huberization, etc.), as well as clean mathematical definitions.
|
|
|
|
|
In distr6, distributions are first-class objects subject to
an object-oriented class-object representation. For example, a discrete
uniform distribution (fig. 1b)
is a ‘class’ with traits such as type (Naturals) and variate form
(univariate). With a given parameterization, this becomes an ‘object’
with properties including symmetry and support. An alternative
definition to the conceptual model of distributions is now
provided.
On the mathematical level, we again consider distributions as objects in
their own right, not being identical with a cdf, pdf, or measure, but
instead ‘having’ these as properties.
For a set \(\mathcal{Y}\) (endowed with suitable topology), we define Distr\((\mathcal{Y})\) as a set containing formal objects \(d\) which are in bijection to (but not identical with) probability measures over \(\mathcal{Y}\). Elements of Distr\((\mathcal{Y})\) are called distributions over \(\mathcal{Y}\). We further define formal symbols which, in case of existence, denote ‘aspects’ that such elements have, in the following way: the symbol \(d.F\), for example, denotes the cdf of \(d\), which is to be read as the ‘\(F\)’ of \(d\), with \(F\), in this case, to be read as a modifier to a standard symbol \(d\), rather than a fixed, bound, or variable symbol. In this way, we can define:
\(d.F\) for the cdf of \(d\). This typically exists if \(\mathcal{Y}\subseteq \mathbb{R}^n\) for some \(n\), in which case \(d.F\) is a function of type \(d.F: \mathbb{R}^n \rightarrow [0,1]\).
\(d.f\) for the pdf of \(d\). This exists if \(\mathcal{Y}\subseteq \mathbb{R}^n\), and the distribution \(d\) is absolutely continuous over \(\mathcal{Y}\). In this case, \(d.f\) is a function of type \(d.f: \mathbb{R}^n \rightarrow [0,\infty)\).
\(d.P\) for the probability measure that is in bijection with \(d\). This is a function \(d.P: \mathcal{F} \rightarrow [0,1]\), where \(\mathcal{F}\) is the set of measurable sub-sets of \(\mathcal{Y}\).
We would like to point out that the above is indeed a full formal mathematical definition of our notion of distribution. While distributions, defined this way, are not identical with any of the conventional mathematical objects that define them (cdf, pdf, measures), they are conceptually, formally, and notationally well-defined. Similarly, the aspects (\(d.F\), \(d.f\), etc.) are also well-defined since they refer to one of the conventional mathematical objects which are well-specified in the dependence of the distribution (in case of existence).
This notation provides a more natural and clearer separation of distribution and random variables and allows us to talk about and denote concepts such as ‘the cdf of any random variable following the distribution \(d\)’ with ease (\(d.F\)), unlike classical notation that would see one define \(X\sim d\) and then write \(F_X\). Our notation more clearly follows the software implementation of distributions.
For example, in distr6, the code counterpart to defining a distribution \(d\) which is Gaussian with mean \(1\) and variance \(2\) is
> d <- Normal$new(1, 2)
The pdf and cdf of this Gaussian distribution evaluated at \(2\) are obtained in code as
> d$pdf(2)
> d$cdf(2)
which evaluates to ‘numerics’ that represent the real numbers \(d.f(2)\) and \(d.F(2)\).
The consideration of distributions as objects, and their conceptual
distinction from random variables as objects, notably differs from
conceptualization in R stats, which implements both
distribution and random variable methods by the ‘dpqr’
functions. Whilst this may allow a very fast generation of probabilities
and values, there is no support for querying and inspection of
distributions as objects. By instead treating the dpqr
functions as methods that belong to a distribution object,
distr6 encapsulates all the information in R stats as
well as distribution properties, traits, and other important
mathematical methods. The object orientation principle that defines the
architecture of distr6 is further discussed throughout this
manuscript.
Treating distributions as objects is not unique to this package. Possibly the first instance of the object-oriented conceptualization is the distr (Ruckdeschel et al. 2006) family of packages. distr6 was designed alongside the authors of distr in order to port some of their functionality from S4 to R6.
distr6 is the first such package to use the ‘class’
object-oriented paradigm R6 (Chang 2018), with other distribution related
packages using S3 or S4. The choice of R6 over S3 and S4 is discussed in
detail in section 5.1. This choice
allows distr6 to fully leverage the conceptual model and make
use of core R6 functionality. As well as introducing fundamental
object-oriented programming (OOP) principles such as abstract classes
and tried and tested design patterns (Gamma et
al. 1996) including decorators, wrappers, and compositors (see
section 5.3).
Besides an overview of distr6’s novel approach to probability
distributions in R, this paper also presents a formal comparison of the
different OOP paradigms while detailing the use of design patterns
relevant to the package.
The strength of the object-oriented approach, both on the algorithmic
and mathematical side, lies in its ability to efficiently express
higher-order constructs and operations: actions between distributions,
resulting in new distributions. One such example is mixture
distributions (also known as spliced distributions). In the
distr6 software interface, a MixtureDistribution
is a higher-order distribution depending on two or more other
distributions. For example, take a uniform mixture of two distributions
distr1 and distr2:
> my_mixt <- MixtureDistribution$new(list(distr1, distr2))
Internally, the dependency of the constructs on the components is
remembered so that my_mixt is not only evaluable for
cdf (and other methods), but also carries a symbolic
representation of its construction and definition history in terms of
distr1 and distr2.
On the mathematical side, the object-oriented formalism allows clean
definitions of otherwise more obscure concepts. For example, the mixture
distribution is now defined as follows:
For distributions \(d_1,\dots,d_m\) over \(\mathbb{R}^n\) and weights \(w_1,\dots, w_m\), we define the mixture of \(d_1,\dots, d_m\) with weights \(w_1,\dots, w_m\) to be the unique distribution \(\tilde{d}\) such that \(\tilde{d}.F(x) = \sum_{i=1}^m w_i\cdot d_i.F(x)\) for any \(x\in \mathbb{R}^n\). Note the added clarity by defining the mixture on the distribution \(d_i\), i.e., a first-order concept in terms of distributions.
This section provides a review of other related software that implement probability distributions. This is focused on, but not limited to, software in R.
The core R programming language consists of packages for basic coding
and maths as well as the stats package for statistical
functions. stats contains 17 common probability distributions
and four lesser-known distributions. Each distribution consists of (at
most) four functions: dX, pX, qX, rX where X
represents the distribution name. These correspond to the probability
density/mass, cumulative distribution, quantile (inverse cumulative
distribution), and simulation functions, respectively. Each is
implemented as a separate function, written in C, with both inputs and
outputs as numerics. The strength of these functions lies in their speed
and efficiency. There is no quicker way to find, say, the pdf of a
Normal distribution than to run the dnorm function from
stats. However, this is the limit of the package in terms of
probability distributions. As there is no designated distribution
object, there is no way to query results from the distributions outside
of the ‘dpqr’ functions.
Several R packages implement dpqr functions for extra
probability distributions. Of particular note are the extraDistr
(Wolodzko 2019) and actuar
(Dutang, Goulet, and Pigeon 2008) packages
that add over 60 distributions between them. Both of these packages are
limited to dpqr functions and therefore have the same
limits as R stats.
The distr package was the first package in R to implement an object-oriented interface for distributions, using the S4 object-oriented paradigm. distr tackles the two fundamental problems of stats by introducing distributions as objects that can be stored and queried. These objects include important statistical results, for example, the expectation, variance, and moment generating functions of a distribution. The distr family of packages includes a total of five packages for object-oriented distributions in R. distr has two weaknesses that were caused by using the S4 paradigm. Firstly, the package relies on inheritance, which means that large class trees exist for every object, and extensibility is therefore non-trivial. The second weakness is that S4 objects are not referred to by ‘pointers’ but instead copies. This means that a simple mixture of two distributions is just under 0.5Mb in size (relatively quite large).
The distributions3
package (Hayes and Moller-Trane 2019)
defines distributions as objects using the S3 paradigm. However, whilst
distributions3 treats probability distributions as S3 objects,
it does not add any properties, traits, or methods and instead uses the
objects solely for dpqr dispatch. In comparison to
distr, the distributions3 package provides fewer
features for inspection or composition. More recently, distributional
(O’Hara-Wild and Hayes 2020) builds on the
distributions3 framework (common authors exist between the two)
to focus on the vectorization of probability distributions coded as S3
objects. Similarly to distr6, the primary use-case of this
package is for predictive modeling of distributions as objects.
The mistr package (Sablica and Hornik 2020) is another recent distributions package, which is also influenced by distr. The sole focus of mistr is to add a comprehensive and flexible framework for composite models and mixed distributions. Similarly, to the previous two packages, mistr implements distributions as S3 objects.
Despite not being a package written in R, the Julia Distributions.jl (Lin et al. 2019) package provided inspiration for distr6. Distributions.jl implements distributions as objects with statistical properties including expectation, variance, moment generating and characteristic functions, and many more. This package uses multiple inheritance for ‘valueSupport’ (discrete/continuous) and ‘variateForm’ (univariate/multivariate/matrixvariate). Every distribution inherits from both of these, e.g., a distribution can be ‘discrete-univariate’, ‘continuous-multivariate’, ‘continuous-matrixvariate’, etc. The package provides a unified and user-friendly interface, which was a helpful starting point for distr6.
distr6 was designed and built around the following principles.
Unified interface The package is designed such that all distributions, no matter how complex, have an identical user-facing interface. This helps make the package easy to navigate and the documentation simple to read. Moreover, it minimizes any confusion resulting from using multiple distributions. A clear inheritance structure also allows wrappers and decorators to have the same methods as distributions, which means even complex composite distributions should be intuitive to use. Whether a user constructs a simple Uniform distribution or a mixture of 100 Normal distributions, the same methods and fields are seen in both objects.
Separation of core/exotic and numerical/analytic Via abstraction and encapsulation, core statistical results (common methods such as mean and variance) are separated from ‘exotic’ ones (less common methods such as anti-derivatives and p-norms). Similarly, implemented distributions only contain analytic results; users can impute numerical results using decorators. This separation has several benefits, including: 1) for predictive modeling with/of distributions, numerical results can take longer to compute than analytical results, and the difference between precision of analytical and numerical results can be substantial in the context of automated modeling, separation allows these differences to be highlighted and controlled; 2) separating numerical results allows an expanded interface for users to fine-tune and set their own parameters for how numerical results are computed; 3) a less-technical user can guarantee the precision of results as they are unlikely to use numerical decorators; 4) a user has access to the most important distribution methods immediately after construction but is not overwhelmed by many ‘exotic’ methods that they may never use. Use of decorators and wrappers allows the user to manually expand the interface at any time. For example, a user can choose between an undecorated Binomial distribution, with common methods such as mean and variance, or they can decorate the distribution to additionally gain access to survival and hazard functions.
Inheritance without over-inheritance The class
structure stems from a series of a few abstract classes with concrete
child classes, which allows for a sensible, but not over-complicated,
inheritance structure. For example, all implemented distributions
inherit from a single parent class, so common methods can be unified and
only coded once; note there is no separation of distributions into
‘continuous’ and ‘discrete’ classes. By allowing the extension of
classes by decorators and wrappers, and not solely inheritance, the
interface is highly scalable and extensible. By ‘scalability’, we refer
to the interface’s ability to grow to a large scale without additional
overheads. The decorator and wrapper patterns on top of the R6 paradigm
allow an (theoretically) unlimited number of distributions, wrappers,
and methods without computational difficulty. By ‘extensibility’, we
refer to the ability to extend the interface. Again this is made
possible by clean abstraction of distributions, wrappers, core methods,
and extra methods in decorators. All decorators and wrappers in
distr6 stem from abstract classes, which in turn inherit from
the Distribution super-class. In doing so, any method of
expanding an object’s interface in distr6 (i.e., via
decorators, wrappers, or inheritance) will automatically lead to an
interface that inherits from the top-level class, maintaining the
principle of a unified interface (D1).
Inspection and manipulation of multiple parameterizations The design process identified that use of distributions in R stats is inflexible in that in the majority of cases, only one parameterization of each distribution is allowed. This can lead to isolating users who may be very familiar with one parameterization but completely unaware of another. For example, the use of the precision parameter in the Normal distribution is typically more common in Bayesian statistics, whereas using the variance or standard deviation parameters is more common in frequentist statistics. distr6 allows the user to choose from multiple parameterizations for all distributions (where more than one parameterization is possible/known). Furthermore, querying and updating of any parameter in the distribution is allowed, even if it was not specified in construction (section 4). This allows for a flexible parameter interface that can be fully queried and modified at any time.
Flexible interfacing for technical and non-technical
users Throughout the design process, it was required that
distr6 be accessible to all R users. This was a challenge as R6
is a very different paradigm from S3 and S4. To reduce the learning
curve, the interface is designed to be as user-friendly and flexible as
possible. This includes: 1) a ‘sensible default principle’ such that all
distributions have justified default values; 2) an ‘inspection
principle’ with functions to list all distributions, wrappers, and
decorators. As discussed in (D2), abstraction and encapsulation allow
technical users to expand any distribution’s interface to be as
arbitrarily complex as they like, whilst maintaining a minimal
representation by default. Where possible defaults are ‘standard’
distributions, i.e. with location \(0\)
and scale \(1\), otherwise sensible
defaults are identified as realistic scenarios, for example
Binomial(n = 10, p = 0.5).
Flexible OO paradigms Following from (D5), we identified that R6 is still relatively new in R with only \(314\) out of \(16, 050\) packages depending on it (as of July 2020). Therefore this was acknowledged and taken into account when building the package. R6 is also the first paradigm in R with the dollar-sign notation (though S4 uses ‘@’ notation) and with a proper construction method. Whilst new users are advised to learn the basics of R6, S3 compatibility is available for all common methods via R62S3 (Sonabend 2019). Users can therefore decide on calling a method via dollar-sign notation or dispatch. The example below demonstrates ‘piping’ and S3. As the core package is built on R6, the thin-wrappers provided by R62S3 do not compromise the above design principles.
> library(magrittr)
> N <- Normal$new(mean = 2)
> N %>%
+ setParameterValue(mean = 1) %>%
+ getParameterValue("mean")
[1] 1
> pdf(N, 1:4)
[1] 0.398942280 0.241970725 0.053990967 0.004431848
distr6 1.4.3 implements 56 probability distributions,
including 11 probability kernels. Individual distributions are modeled
via classes that inherit from a common interface, implemented in the
abstract Distribution parent class. The
Distribution class specifies the abstract distribution
interface for parameter access, properties, traits, and methods, such as
a distribution’s pdf or cdf. The most important interface points are
described in Section 4.1
Concrete distributions, kernels, and wrappers are the grandchildren
of Distribution, and children of one of the mid-layer
abstract classes:
SDistribution, which models abstract, generic
distributions. Concrete distributions, such as Normal,
which models the normal distribution, inherit from
SDistribution.Kernel, which models probability kernels, such as
Triangular and Epanechnikov. Probability
kernels are absolutely continuous distributions over the Reals, with
assumed mean 0 and variance 1.DistributionWrapper, which is an abstract parent for
higher-order operations on distributions, including compositions, that
is, operations that create distributions from other distributions, such
as truncation or mixture.DistributionDecorator, whose purpose is supplementing
methods to distributions in the form of a decorator design pattern. This
includes methods such as integrated cdf or squared integrals of
distribution defining functions.Figure 3 visualizes the key class structure
of distr6, including the concrete Distribution
parent class, from which all other classes in the package inherit from
(with the exception of the ParameterSet). These abstract
classes allow simple extensibility for concrete sub-classes.
The base, or top-level, class in distr6 is the
Distribution class. Its primary function is to act as a
parent class for the implemented probability distributions and
higher-order compositions. It is also utilized for the creation of
custom distributions. By design, any distribution already implemented in
distr6 will have the same interface as a user-specified custom
distribution, ensuring (D1) is upheld. The most important methods for a
distribution are shown in Table 1 alongside
their meaning and definitions (mathematical if possible). The two
use-cases for the Distribution class are discussed
separately.
| Method | Description/Definition |
|---|---|
pdf/cdf/quantile/rand |
dpqr functions. |
mean |
\(d.\mu = \mathbb{E}[X]\) |
variance |
\(d.\sigma^2 = \mathbb{E}[(X - d.\mu)^2]\) |
traits |
List including value support (discrete/continuous/mixed); variate form (uni-/multi-/matrixvariate); type (mathematical domain). |
properties |
List including skewness (\(\mathbb{E}[((X - d.\mu)/d.\sigma)^3]\)) and symmetry (boolean). |
get/setParameterValue |
Getters and setters for parameter values. |
parameters |
Returns the internal parameterization set. |
print/summary |
Representation functions, summary includes distribution properties and traits. |
It is anticipated that the majority of distr6 users will be
using the package for the implemented distributions and kernels. With
this in mind, the Distribution class defines all variables
and methods common to all child classes. The most important of these are
the common analytical expressions and the dpqr public
methods. Every concrete implemented distribution/kernel has identical
public dpqr methods that internally call private
dpqr methods. This accounts for inconsistencies occurring
from packages returning functions in different formats and handling
errors differently, a problem most prominent in multivariate
distributions. Another example is the handling of non-integer values for
discrete distributions. In some packages, this returns \(0\), or the value is rounded down, or an
error is returned. The dpqr functions for all distributions
have unified validation checks and return types (numeric or
data.table). In line with base R and other distribution
packages, distr6 implements a single pdf function
to cover both probability mass and probability density functions.
> Normal$new()$pdf(1:2)
[1] 0.24197072 0.05399097
> Binomial$new()$cdf(1:2, lower.tail = FALSE, log.p = TRUE, simplify = FALSE)
Binom
1: -0.01080030
2: -0.05623972
A key design principle in the package is the separation of analytical
and numerical results (D2), which is ensured by only including
analytical results in implemented distributions. Missing methods in a
distribution, therefore, signify that no closed-form expression for the
method is available. However, all can be numerically estimated with the
CoreStatistics decorator (see section 4.2). Ideally, all distributions
will include analytical methods for the following: probability
density/mass function (pdf), cumulative distribution
function (cdf), inverse cumulative distribution
function/quantile function (quantile), simulation function
(rand), mean, variance, skewness, (excess) kurtosis, and
entropy of the distribution
(mean, variance, skewness, kurtosis, entropy), as well as
the moment generating function (mgf), characteristic
function (cf), and probability generating function
(pgf). Speed is currently a limitation in distr6,
but the use of Rcpp (Eddelbuettel and Francois 2011) in all
dpqr functions helps mitigate against this.
The fourth design principle of distr6 ensures that multiple parameterizations of a given distribution can be both provided and inspected at all times. For example, the Normal distribution can be parameterized in terms of variance, standard deviation, or precision. Any of which can be called in construction with other parameters updated accordingly. If conflicting parameterizations are provided, then an error is returned. By example,
# set precision, others updated automatically
> Normal$new(prec = 4)
Norm(mean = 0, var = 0.25, sd = 0.5, prec = 4)
# try and set both precision and variance, results in error
> Normal$new(var = 1, prec = 2)
Error in FUN(X[[i]], ...) :
Conflicting parametrisations detected. Only one of {var, sd, prec} should be given.
The same principle is used for parameter setting. The methods
getParameterValue and
setParameterValue are utilized for getting and setting
parameter values, respectively. The former takes a single argument, the
parameter name, and the second a named list of arguments corresponding
to the parameter name and the value to set. The example below
demonstrates this for a Gamma distribution. Here, the distribution is
constructed, the shape parameter is queried, both shape and rate
parameters are updated, and the latter queried, finally, the scale
parameter is set, which auto-updates the rate parameter.
> G <- Gamma$new(shape = 1, rate = 1)
> G$getParameterValue("shape")
[1] 1
> G$setParameterValue(shape = 1, rate = 2)
> G$getParameterValue("rate")
[1] 2
> G$setParameterValue(scale = 2)
> G$getParameterValue("rate")
[1] 0.5
Distribution and parameter domains and types are represented by
mathematical sets implemented in set6 (Sonabend and Kiraly 2020). This allows for a
clear representation of infinite sets and, most importantly, for
internal containedness checks. For example, all public dpqr
methods first call the $contains method in their respective
type and return an error if any points are outside the
distribution’s domain. As set6 uses Rcpp for this
method, these come at minimal cost to speed.
> B <- Binomial$new()
> B$pdf(-1)
Error in B$pdf(-1) :
Not all points in {-1} lie in the distribution domain (N0).
These domains and types are returned along with other important
properties and traits in a call to properties and
traits, respectively. This is demonstrated below for the
Arcsine distribution.
> A <- Arcsine$new()
> A$properties
$support
[0,1]
$symmetry
[1] "symmetric"
> A$traits
$valueSupport
[1] "continuous"
$variateForm
[1] "univariate"
$type
R
Users of distr6 can create temporary custom distributions
using the constructor of the Distribution class directly.
Permanent extensions, e.g., as part of an R package, should create a new
concrete distribution as a child of the SDistribution
class.
The Distribution constructor is given by
Distribution$new(name = NULL, short_name = NULL, type = NULL, support = NULL,
+ symmetric = FALSE, pdf = NULL, cdf = NULL, quantile = NULL, rand = NULL,
+ parameters = NULL, decorators = NULL, valueSupport = NULL, variateForm = NULL,
+ description = NULL)
The name and short_name arguments are
identification for the custom distribution used for printing.
type is a trait corresponding to scientific type (e.g.,
Reals, Integers,...), and support is the property of the
distribution support. Distribution parameters are passed as a
ParameterSet object. This defines each parameter in the
distribution, including the parameter default value and support. The
pdf/cdf/quantile/rand arguments define the corresponding
methods and are passed to the private
.pdf/.cdf/.quantile/.rand methods. As above, the public
methods are already defined and ensure consistency in each function. At
a minimum, users have to supply the distribution name,
type, and either pdf or cdf. All
other information can be numerically estimated with decorators (see
section 4.2).
> d <- Distribution$new(name = "Custom Distribution", type = Integers$new(),
+ support = Set$new(1:10),
+ pdf = function(x) rep(1/10, length(x)))
> d$pdf(1:3)
[1] 0.1 0.1 0.1
Decorators add functionality to classes in object-oriented
programming. These are not natively implemented in R6, and this novel
implementation is therefore discussed further in section 5.3. Decorators in distr6
are only ‘allowed’ if they have at least three methods and cover a clear
use case. This prevents too many decorators from bloating the interface.
However, by their nature, they are lightweight classes that will only
increase the methods in a distribution if explicitly requested by a
user. Decorators can be applied to a distribution in one of three
ways.
In construction:
> N <- Normal$new(decorators = c("CoreStatistics", "ExoticStatistics"))
Using the decorate() function:
> N <- Normal$new()
> decorate(N, c("CoreStatistics", "ExoticStatistics"))
Using the $decorate method inherited from the
DistributionDecorator super-class:
> N <- Normal$new()
> ExoticStatistics$new()$decorate(N)
The first option is the quickest if decorators are required
immediately. The second is the most efficient once a distribution is
already constructed. The third is the closest method to true OOP but
does not allow adding multiple decorators simultaneously.
Three decorators are currently implemented in distr6. These are
briefly described.
This decorator imputes numerical functions for common statistical
results that could be considered core to a distribution, e.g., the mean
or variance. The decorator additionally adds generalized expectation
(genExp) and moments (kthmoment) functions,
which allow numerical results for functions of the form \(\mathbb{E}[f(X)]\) and for
crude/raw/central \(K\) moments. The
example below demonstrates how the decorate function
exposes methods from the CoreStatistics decorator to the
Normal distribution object.
> n <- Normal$new(mean = 2, var = 4)
> n$kthmoment(3, type = "raw")
Error: attempt to apply non-function
> decorate(n, CoreStatistics)
> n$kthmoment(3, type = "raw")
[1] 32
This decorator adds more ‘exotic’ methods to distributions, i.e., those that are unlikely to be called by the majority of users. For example, this includes methods for the p-norm of survival and cdf functions, as well as anti-derivatives for these functions. Where possible, analytic results are exploited. For example, this decorator can implement the survival function in one of two ways: either as i) \(1\) minus the distribution cdf if an analytic expression for the cdf is available, or ii) via numerical integration of the distribution.
This decorator imputes numerical expressions for the
dpqr methods. This is the most useful for custom
distributions in which only the pdf or cdf is
provided. Numerical imputation is implemented via Rcpp.
Composite distributions - that is, distributions created from other distributions - are common in advanced usage. Examples for composites are truncation, mixture, or transformation of domain. In distr6, a number of such composites are supported. Implementation-wise, this uses the wrapper OOP pattern, which is not native to R6 but part of our extensions to R6, discussed in section 5.3.
As discussed above, wrapped distributions inherit from
Distribution thus have an identical interface to any child
of SDistribution, with the following minor differences:
wrappedModels method provides a unified interface
to access any component distribution.ParameterSetCollection object instead of a
ParameterSet, thus allowing efficient representation of
composite and nested parameter sets.The composition can be iterated and nested any number of times, consider the following example where a mixture distribution is created from two distributions that are in turn composites - a truncated Student T, and a Huberized exponential - note the parameter inspection and automatic prefixing of distribution ‘short names’ to the parameters for identification.
> M <- MixtureDistribution$new(list(
+ truncate(StudentT$new(), lower = -1, upper = 1),
+ huberize(Exponential$new(), upper = 4)
+ ))
> M$parameters()
id value support description
1: mix_T_df 1 R+ Degrees of Freedom
2: mix_trunc_lower -1 R U {-Inf, +Inf} Lower limit of truncation
3: mix_trunc_upper 1 R U {-Inf +Inf} Upper limit of truncation
4: mix_Exp_rate 1 R+ Arrival Rate
5: mix_Exp_scale 1 R+ Scale
6: mix_hub_lower 0 R U {-Inf, +Inf} Lower limit of huberization
7: mix_hub_upper 4 R U {-Inf, +Inf} Upper limit of huberization
8: mix_weights uniform {uniform} U [0,1] Mixture weights
We summarize some important implemented compositors (tables 2 and 3) to illustrate the way composition is handled and implemented.
| Class | Parameters | Type of \(d\) | Components of \(d\) |
|---|---|---|---|
TruncatedDistribution |
\(a,b\in \mathbb{R}\) | \(\mathbb{R}\) | \(d'\), type \(\mathbb{R}\) |
HuberizedDistribution |
\(a,b\in \mathbb{R}\) | \(\mathbb{R},\) mixed | \(d'\), type \(\mathbb{R}\) |
MixtureDistribution |
\(w_i \in \mathbb{R}, \sum^n_{i=1} w_i = 1\) | \(\mathbb{R}^n\) | \(d'_i\), type \(\mathbb{R}^n\) |
ProductDistribution |
- | \(\mathbb{R}^N, N = \sum_{i=1}^n n_i\) | \(d'_i\) type \(\mathbb{R}^{n_i}\) |
| Class | \(d.F(x)\) | \(d.f(x)\) |
|---|---|---|
TruncatedDistribution |
\(\frac{d'.F(x) - d'.F(a)}{d'.F(b) - d'.F(a)}\) | \(\frac{d'.f(x)}{d'.P([a,b])}\) |
HuberizedDistribution |
\(d'.F(x) + {\mathbf{\mathbb{1}}}[x = b]\cdot d'.P(b)\) | (no pdf, since mixed) |
MixtureDistribution |
\(\sum_{i=1}^N w_i\cdot d'_i.F(x)\) | \(\sum_{i=1}^N w_i\cdot d'_i.f(x)\) (if exists) |
ProductDistribution |
\(\prod_{i=1}^N d'_i.F(x)\) | \(\prod_{i=1}^N d'_i.f(x)\) (if exists) |
Example code to obtain a truncated or Huberized distribution is below. Here, we construct a truncated normal with truncation parameters -1 and 1, and a Huberized Binomial with bounding parameters 2 and 5.
> TN <- truncate(Normal$new(), lower = -1, upper = 1)
> TN$cdf(-2:2)
[1] 0.0 0.0 0.5 1.0 1.0
> class(TN)
[1] "TruncatedDistribution" "DistributionWrapper" "Distribution" "R6"
> HB <- huberize(Binomial$new(), lower = 2, upper = 5)
> HB$cdf(1:6)
[1] 0.0000000 0.0546875 0.1718750 0.3769531 1.0000000 1.0000000
> HB$median()
[1] 5
A special feature of distr6 is that it allows vectorization
of distributions, i.e., vectorized representation of multiple
distributions in an array-like structure. This is primarily done for
computational efficiency with the general best R practice of
vectorization. Vectorization of distr6 distributions is
implemented via the VectorDistribution, which is logically
treated as a compositor.
Mathematically, a VectorDistribution is simply a vector
of component distributions \(d_1,\dots,
d_N\) that allows vectorized evaluation. Two kinds of vectorized
evaluation are supported paired and product-wise vectorization, which we
illustrate in the case of cdfs.
cdf method.In practical terms, paired evaluation is the evaluation of \(N\) distributions at \(N\) points (which may be unique or different). So, by example, for three distributions \(d_1,d_2,d_3\), paired evaluation of their cdfs at \((x_1, x_2, x_3) = (4,5,6)\) respectively results in \((d_1.F(x_1), d_2.F(x_2), d_3.F(x_3)) = (d_1.F(4), d_2.F(5), d_3.F(6))\). In distr6:
> V <- VectorDistribution$new(distribution = "Normal", params = data.frame(mean = 1:3))
> V$cdf(4, 5, 6)
Norm1 Norm2 Norm3
1: 0.9986501 0.9986501 0.9986501
In contrast, product-wise evaluation evaluates \(N\) distributions at the same \(M\) points. Product-wise evaluation of the cdfs of \(d_1,d_2,d_3\) at \((x_1, x_2, x_3) = (4,5,6)\) results in \[\begin{pmatrix} d_1.F(x_1) & d_1.F(x_2) & d_1.F(x_3) \\ d_2.F(x_1) & d_2.F(x_2) & d_2.F(x_3) \\ d_3.F(x_1) & d_3.F(x_2) & d_3.F(x_3) \end{pmatrix} = \begin{pmatrix} d_1.F(4) & d_1.F(5) & d_1.F(6) \\ d_2.F(4) & d_2.F(5) & d_2.F(6) \\ d_3.F(4) & d_3.F(5) & d_3.F(6) \end{pmatrix}\]
In distr6:
> V <- VectorDistribution$new(distribution = "Normal", params = data.frame(mean = 1:3))
> V$cdf(4:6)
Norm1 Norm2 Norm3
1: 0.9986501 0.9772499 0.8413447
2: 0.9999683 0.9986501 0.9772499
3: 0.9999997 0.9999683 0.9986501
The VectorDistribution wrapper allows for efficient
vectorization across both the distributions and points
to evaluate, which we believe is a feature unique to distr6
among distribution frameworks in R. By combing product and paired modes,
users can evaluate any distribution in the vector at any point. In the
following example, Normal(1, 1) is evaluated at (1,2), and Normal(2, 1)
is evaluated at (3,4):
> V <- VectorDistribution$new(distribution = "Normal", params = data.frame(mean = 1:2))
> V$pdf(1:2, 3:4)
Norm1 Norm2
1: 0.3989423 0.24197072
2: 0.2419707 0.05399097
Further, common composites such as ProductDistribution
and MixtureDistribution inherit from
VectorDistribution, allowing for efficient vector dispatch
of pdf and cdf methods. Inheriting from VectorDistribution
results in identical constructors and methods. Thus, a minor caveat is
that users could evaluate a product or mixture at different points for
each distribution, which is not a usual use case in practice.
Two different choices of constructors are provided. The first
‘distlist’ constructor passes distribution objects
into the constructor, whereas the second passes a reference to the
distribution class along with the parameterizations. Therefore,
the first allows different types of distributions but is vastly slow as
the various methods have to be calculated individually, whereas the
second only allows a single class of distribution at a time but is much
quicker in evaluation. In the example below, the mixture uses the second
constructor, and the product uses the first.
> M <- MixtureDistribution$new(distribution = "Degenerate",
+ params = data.frame(mean = 1:10))
> M$cdf(1:5)
[1] 0.1 0.2 0.3 0.4 0.5
> class(M)
[1] "MixtureDistribution" "VectorDistribution" "DistributionWrapper" "Distribution"
[5] "R6"
> P <- ProductDistribution$new(list(Normal$new(), Exponential$new(), Gamma$new()))
> P$cdf(1:5)
[1] 0.3361815 0.7306360 0.9016858 0.9636737 0.9865692
This paper has so far discussed the API and functionality in distr6. This section discusses object-oriented programming (OOP). Firstly, a brief introduction to OOP and OOP in R and then the package’s contributions to the field.
R has four major paradigms for object-oriented programming: S3, S4, reference classes (R5), and most recently, R6. S3 and S4 are known as functional object-oriented programming (FOOP) paradigms, whereas R5 and R6 move towards class object-oriented programming (COOP) paradigms (R6) (Chambers 2014). One of the main differences (from a user perspective) is that methods in COOP are associated with a class, whereas in FOOP, methods are associated with generic functions. In the first case, methods are called by first specifying the object, and in the second, a dispatch registry is utilized to find the correct method to associate with a given object.
S3 introduces objects as named structures, which in other languages are often referred to as ‘typed lists’. These can hold objects of any type and can include meta-information about the object itself. S3 is the dominant paradigm in R for its flexibility, speed, and efficiency. As such, it is embedded deep in the infrastructure of R, and single dispatch is behind a vast majority of the base functionality, which is a key part of making R easily readable. S3 is a FOOP paradigm in which functions are part of a dispatch system and consist of a generic function that is external to any object and a specific method registered to a ‘class’. However, the term ‘class’ is slightly misleading as no formal class structure exists (and by consequence, no formal construction or inheritance) and as such, S3 is not a formal OOP language1.
S4 formalizes S3 by introducing: class-object separation, a clear notion of construction, and multiple inheritances (Chambers 2014). S4 has more syntax for the user to learn and a few more steps in class and method definitions. As a result, S4 syntax is not overly user-friendly, and S3 is used vastly more than S4 (Chambers 2014).
There is a big jump from S3 and S4 to R6 as they transition from
functional- to class-object-oriented programming. This means new
notation, semantics, syntax, and conventions. The key changes are: 1)
introducing methods and fields that are associated with classes not,
functions; 2) mutable objects with copy-on-modify semantics; and 3) new
dollar-sign notation. In the first case, this means that when a class is
defined, all the methods are defined as existing within the class, and
these can be accessed at any time after construction. Methods are
further split into public and private, as well as
active bindings, which incorporate the abstraction part of OOP.
The mutability of objects and change to copy-on-modify means that to
create an independent copy of an object, the new method
clone(deep = TRUE) has to be used, which would be familiar
to users who know more classical OOP but very different from most R
users. Finally, methods are accessed via the dollar-sign, and not by
calling a function on an object.
Below, the three paradigms are contrasted semantically with a toy example to create a ‘duck’ class with a method ‘quack’.
S3
> quack <- function(x) UseMethod("quack", x)
> duck <- function(name) return(structure(list(name = name), class = "duck"))
> quack.duck <- function(x) cat(x$name, "QUACK!")
> quack(duck("Arthur"))
Arthur QUACK!
S4
> setClass("duck", slots = c(name = "character"))
> setGeneric("quack", function(x) {
+ standardGeneric("quack")
+ })
> setGeneric("duck", function(name) {
+ standardGeneric("duck")
+ })
> setMethod("duck", signature(name = "character"),
+ definition = function(name){
+ new("duck", name = name)
+ })
> setMethod("quack",
+ definition = function(x) {
+ cat(x@name, "QUACK!")
+ })
> quack(duck("Ford"))
Ford QUACK!
R6
> duck <- R6::R6Class("duck", public = list(
+ initialize = function(name) private$.name = name,
+ quack = function() cat(private$.name, "QUACK!")),
+ private = list(.name = character(0)))
> duck$new("Zaphod")$quack()
Zaphod QUACK!
The example clearly highlights the extra code introduced by S4 and the difference between the S3 dispatch and R6 method system.
There is no doubt that R6 is the furthest paradigm from conventional R usage, and as such, there is a steep learning curve for the majority of R users. However, R6 will be most natural for users coming to R from more traditional OOP languages. In contrast, S3 is a natural FOOP paradigm that will be familiar to all R users (even if they are not aware that S3 is being used). S4 is an unfortunate midpoint between the two, which whilst being very useful, is not particularly user-friendly in terms of programming classes and objects.
distr was developed soon after S4 was released and is arguably one of the best case studies for how well S4 performs. Whilst S4 formalizes S3 to allow for a fully OO interface to be developed, its dependence on inheritance forces design decisions that quickly become problematic. This is seen in the large inheritance trees in distr in which one implemented distribution can be nested five child classes deep. This is compounded by the fact that S4 does not use pointer objects but instead nests objects internally. Therefore, distr has problems with composite distributions in that they quickly become very large in size. For example, a mixture of two distributions can easily be around 0.5Mb, which is relatively large. In contrast, R6 introduces pointers, which means that a wrapped object simply points to its wrapped component and does not copy it needlessly.
Whilst a fully object-oriented interface can be developed in S3 and
S4, they do not have the flexibility of R6, which means that in the long
run, extensibility and scalability can be problematic. R6 forces R users
to learn a paradigm that they may not be familiar with, but packages
like R62S3 allow users to become acquainted with R6 on a
slightly shallower learning curve. Speed differences for the three
paradigms are formally compared on the example above using microbenchmark
(Mersmann 2019); the results are in table
4. The R6 example is compared both including
the construction of the class, duck$new("Zaphod")$quack(),
and without construction, d$quack(), where d
is the object constructed before comparison. A significant ‘bottleneck’
is noted when construction is included in the comparison. However,
despite this, S4 is still significantly the slowest.
| Paradigm | mean \((\mu s)\) | cld |
|---|---|---|
| S3 | 73.44 | a |
| S4 | 276.17 | c |
| R6 | 187.70 | b |
| R6* | 38.32 | a |
In the simplest definition, ‘design patterns’ are abstract solutions to common coding problems. They are probably most widely known due to the book ‘Design Patterns Elements of Reusable Object-Oriented Software’ (Design Patterns) (Gamma et al. 1996). distr6 primarily makes use of the following design patterns
The strategy pattern is common in modeling toolboxes in which
multiple algorithms can be used to solve a problem. This pattern defines
an abstract class for a given problem and concrete classes that each
implement different strategies, or algorithms, to solve the problem. For
example, in the context of mathematical integration (a common problem in
R), one could use Simpson’s rule, Kronrod’s, or many others. These can
be specified by an integrate abstract class with concrete
sub-classes simpson and kronrod (figure 4).
The composite pattern defines a collection of classes with an identical interface when treated independently or when composed into a single class with constituent parts. To the user, this means that only one interface needs to be learned in order to interact with composite or individual classes. A well-built composite pattern allows users to construct complex classes with several layers of composition and yet still be able to make use of a single interface. By inheriting from a parent class, each class and composite share a common interface. Composition is a powerful design principle that allows both modification of existing classes and reduction of multiple classes (Király et al. 2021).
Decorators add additional responsibilities to an object without
making any other changes to the interface. An object that has been
decorated will be identical to its un-decorated counter-part except with
additional methods. This provides a useful alternative to inheritance.
Whereas inheritance can lead to large tree structures in which each
sub-class inherits from the previous and contains all previous methods,
decorators allow the user to pick and choose with responsibilities to
add. Figure 5 demonstrates how this is useful
in a shopping cart example. The top of the figure demonstrates using
inheritance in which each sub-class adds methods to the
Cart parent class. By the Tax child class,
there are a total of five methods in the interface. At the bottom of the
figure, the decorator pattern demonstrates how the functionality for
adding items and tax is separated and can be added separately.
In order to implement distr6, several contributions were made to the R6 paradigm to extend its abilities and to implement the design patterns discussed above.
R6 did not have a concept of abstract classes, which meant that
patterns such as adapters, composites, and decorators, could not be
directly implemented without problems. This is produced in
distr6 with the abstract function, which is placed
in the first line of all abstract classes. In the example below,
obj expects the self argument from R6 classes,
and class is the name of the class, getR6Class
is a custom function for returning the name of the class of the given
object.
abstract <- function(obj, class) {
if (getR6Class(obj) == class) {
stop(sprintf("%s is an abstract class that can't be initialized.", class))
}
}
For example, in decorators, the following line is placed at the top
of the initialize function:
abstract(self, "DistributionDecorator")
The typical implementation of decorators is to have an abstract
decorator class with concrete decorators inheriting from this, each with
their own added responsibilities. In distr6, this is made
possible by defining the DistributionDecorator abstract
class (see above) with a public decorate method. Concrete
decorators are simply R6 classes where the public methods are the ones
to ‘copy’ to the decorated object.
> DistributionDecorator
<DistributionDecorator> object generator
Public:
packages: NULL
initialize: function ()
decorate: function (distribution, ...)
clone: function (deep = FALSE)
> CoreStatistics
<CoreStatistics> object generator
Inherits from: <DistributionDecorator>
Public:
mgf: function (t)
cf: function (t)
pgf: function (z)
When the $decorate method from a constructed decorator
object is called, the methods are simply copied from the decorator
environment to the object environment. The decorator()
function simplifies this for the user.
The composite pattern is made use of in what distr6 calls
‘wrappers’. Again, this is implemented via an abstract class
(DistributionWrapper) with concrete sub-classes.
> DistributionWrapper
<DistributionWrapper> object generator
Inherits from: <Distribution>
Public:
initialize: function (distlist = NULL, name, short_name, description, support,
wrappedModels: function (model = NULL)
setParameterValue: function (..., lst = NULL, error = "warn")
Private:
.wrappedModels: list
> TruncatedDistribution
<TruncatedDistribution> object generator
Inherits from: <DistributionWrapper>
Public:
initialize: function (distribution, lower = NULL, upper = NULL)
setParameterValue: function (..., lst = NULL, error = "warn")
Private:
.pdf: function (x, log = FALSE)
.cdf: function (x, lower.tail = TRUE, log.p = FALSE)
.quantile: function (p, lower.tail = TRUE, log.p = FALSE)
.rand: function (n)
Wrappers in distr6 alter objects by modifying either their
public or private methods. Therefore, an ‘unwrapped’ distribution looks
identical to a ‘wrapped’ one, despite inheriting from different classes.
This is possible via two key implementation strategies: 1) on the
construction of a wrapper, parameters are prefixed with a unique ID,
meaning that all parameters can be accessed at any time; 2) the
wrappedModels public field allows access to the original
wrapped distributions. These two factors allow any new method to be
called either by reference to wrappedModels or by using
$getParameterValue with the newly prefixed parameter ID.
This is demonstrated in the .pdf private method of the
TruncatedDistribution wrapper (slightly abridged).
.pdf = function(x, log = FALSE) {
dist <- self$wrappedModels()[[1]]
lower <- self$getParameterValue("trunc_lower")
upper <- self$getParameterValue("trunc_upper")
pdf <- numeric(length(x))
pdf[x > lower & x <= upper] <- dist$pdf(x[x > lower & x <= upper]) /
(dist$cdf(upper) - dist$cdf(lower))
return(pdf)
}
As the public pdf is the same for all distributions, and
this is inherited by wrappers, only the private .pdf method
needs to be altered.
This final section looks at concrete short examples for four key use cases.
The primary use case for the majority of users will be in
constructing distributions in order to query their results and visualize
their shape.
Below, a distribution (Binomial) is constructed and queried for its
distribution-specific traits and parameterization-specific
properties.
> b <- Binomial$new(prob = 0.1, size = 5)
> b$setParameterValue(size = 6)
> b$getParameterValue("size")
> b$parameters()
> b$properties
> b$traits
Specific methods from the distribution are queried as well.
> b$mean()
> b$entropy()
> b$skewness()
> b$kurtosis()
> b$cdf(1:5)
The distribution is visualized by plotting it’s density, distribution, inverse distribution, hazard, cumulative hazard, and survival function; the output is in figure 6.
> plot(b, fun = "all")
distr6 can also serve as a toolbox for analysis of empirical
data by making use of the three ‘empirical’ distributions:
Empirical, EmpricalMV, and
WeightedDiscrete.
First, an empirical distribution is constructed with samples from a
standard exponential distribution.
> E <- Empirical$new(samples = rexp(10000))
The summary function is used to quickly obtain key
information about the empirical distribution.
> summary(E)
Empirical Probability Distribution.
Quick Statistics
Mean: 0.105954
Variance: 1.140673
Skewness: 0.05808027
Ex. Kurtosis: -0.473978
Support: (-2.50, -2.19,...,2.27, 2.66) Scientific Type: R
Traits: discrete; univariate
Properties: asymmetric; platykurtic; positive skew
The distribution is compared to a (standard) Normal distribution and then (standard) Exponential distribution; output in figure 7.
> qqplot(E, Normal$new(), xlab = "Empirical", ylab = "Normal")
> qqplot(E, Exponential$new(), xlab = "Empirical", ylab = "Exponential")
The CDF of a bivariate empirical distribution is visualized; output in figure 8.
> plot(EmpiricalMV$new(data.frame(rnorm(100, mean = 3), rnorm(100))), fun = "cdf")
Whilst empirical distributions are useful when data samples have been
generated, custom distributions can be used to build an entirely new
probability distribution. Though, here, we use a simple discrete uniform
distribution. This example highlights the power of decorators to
estimate distribution results without manual computation of every
possible method. The output demonstrates the precision and accuracy of
these results.
Below, a custom distribution is created, and by including the
decorators argument, all further methods are imputed
numerically. The distribution is summarized for properties, traits, and
common results (this is possible with the ‘CoreStatistics’ decorator).
The summary is identical to the analytic DiscreteUniform
distribution.
> U <- Distribution$new(
+ name = "Discrete Uniform",
+ type = set6::Integers$new(), support = set6::Set$new(1:10),
+ pdf = function(x) ifelse(x < 1 | x > 10, 0, rep(1/10,length(x))),
+ decorators = c("CoreStatistics", "ExoticStatistics", "FunctionImputation"))
> summary(U)
Discrete Uniform
Quick Statistics
Mean: 5.5
Variance: 8.25
Skewness: 0
Ex. Kurtosis: -1.224242
Support: {1, 2,...,9, 10} Scientific Type: Z
Traits: discrete; univariate
Properties: asymmetric; platykurtic; no skew
Decorated with: CoreStatistics, ExoticStatistics, FunctionImputation
The CDF and simulation function are called (numerically imputed with
the FunctionImputation decorator), the hazard function from
the ExoticStatistics decorator, and the
kthmoment function from the CoreStatistics
decorator.
> U$cdf(1:10)
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> U$rand(10)
[1] 8 10 5 8 5 10 6 7 1 4
> U$hazard(2)
[1] 0.125
> U$kthmoment(2)
[1] 8.25
Composite distributions are an essential part of any distribution
software. The following example demonstrates two types of composites:
composition via distribution transformation (truncation) and composition
via mixtures and vectors.
First, a Binomial distribution is constructed and truncated between
\(1\) and \(5\), the CDF of the new distribution is
queried.
> TB <- truncate(
Binomial$new(size = 20, prob = 0.5),
lower = 1,
upper = 5
)
> round(TB$cdf(0:6), 4)
[1] 0.0000 0.0000 0.0088 0.0613 0.2848 1.0000 1.0000
Next, a vector distribution is constructed of two Normal distributions, with respective means \(1\) and \(2\) and unit standard deviation. The parameters are queried (some columns suppressed).
> V <- VectorDistribution$new(distribution = "Normal",
+ params = data.frame(mean = 1:2))
> V$parameters()
id value support
1: Norm1_mean 1 R
2: Norm1_var 1 R+
3: Norm1_sd 1 R+
4: Norm1_prec 1 R+
5: Norm2_mean 2 R
6: Norm2_var 1 R+
7: Norm2_sd 1 R+
8: Norm2_prec 1 R+
Vectorization is possible across distributions, samples, and both. In
the example below, the first call to $pdf evaluates both
distributions at (1, 2), the second call evaluates the first at (1) and
the second at (2), and the third call evaluates the first at (1, 2) and
the second at (3, 4).
> V$pdf(1:2)
Norm1 Norm2
1: 0.3989423 0.2419707
2: 0.2419707 0.3989423
> V$pdf(1, 2)
Norm1 Norm2
1: 0.3989423 0.3989423
> V$pdf(1:2, 3:4)
Norm1 Norm2
1: 0.3989423 0.24197072
2: 0.2419707 0.05399097
Finally, a mixture distribution with uniform weights is constructed from a \(Normal(2, 1)\) distribution and an \(Exponential(1)\).
> MD <- MixtureDistribution$new(
+ list(Normal$new(mean = 2, sd = 1),
+ Exponential$new(rate = 1)
+ )
+ )
> MD$pdf(1:5)
[1] 0.304925083 0.267138782 0.145878896 0.036153303 0.005584898
> MD$cdf(1:5)
[1] 0.3953879 0.6823324 0.8957788 0.9794671 0.9959561
> MD$rand(5)
[1] 3.6664473 0.1055126 0.6092939 0.8880799 3.4517465
b, with plot(b, fun = "all").
Functions clockwise from top-left: probability mass, cumulative
distribution, quantile (inverse cdf), cumulative hazard, hazard, and
survival.Whilst distr6 fulfils its primary purpose as an R6 interface for probability distributions with basic features, it is not consider ‘feature-complete’, as it currently lacks many of the important features included in distr and other related software. distr6 is in constant development and has an active GitHub with open issues and projects. Some concrete short-term goals include:
FunctionImputation decorator to work on
higher-order distributions as well as to improve speed and
accuracy.distr6 introduces a robust and scalable object-oriented
interface for probability distributions to R and aims to be the
first-stop for class object-oriented probability distributions in R. By
making use of R6, every implemented distribution is clearly defined with
properties, traits, and analytic results. Whilst R stats is
limited to very basic dpqr functions for representing
evaluated distributions, distr6 ensures that probability
distributions are treated as complex mathematical objects.
Future updates of the package will include adding further numerical approximation strategies in the decorators to allow users to choose different methods (instead of being forced to use one). Additionally, the extensions to R6 could be abstracted into an independent package in order to better benefit the R community.
distr6 is released under the MIT license on GitHub and CRAN. Extended documentation, tutorials, and examples are available at https://alan-turing-institute.github.io/distr6/. Code quality is monitored and maintained by an extensive suite of unit tests on multiple continuous integration systems.
We would like to thank and acknowledge Prof. Dr. Peter Ruckdeschel and Prof. Dr. Matthias Kohl for their work on the distr package and for extensive discussions, planning, and design decisions that were utilised in the development of distr6. RS receives a PhD stipend from EPSRC (EP/R513143/1).
E, to theoretical Normal (left) and
Exponential (right) distributions with
qqplot(E, Normal$new() and
qqplot(E, Exponential$new() respectively.E, with
plot("E", fun = "cdf").