Introduction
Pattern classification is an important task nowadays, in use
everywhere: from our e-mail client, which is able to separate spam from
legitimate messages, to credit institutions, which rely on it to detect
fraud and to grant or deny loans. These cases operate either on binary
datasets, since a message is either spam or legitimate, or on multiclass
datasets, where a loan is safe, medium, risky or highly risky, for
instance. In both cases the user expects only one output.
The huge growth in the amount of information stored on the web in
recent years, such as blog posts, pictures taken from cameras and
phones, videos hosted on YouTube, and messages on social networks, has
led to more complex classification tasks. A blog post can be classified into
several non-exclusive categories, for instance news, economy and
politics simultaneously. A picture can be assigned a set of labels, such
as landscape, sky and forest. A video can be labeled into several music
genres at once, etc. All of these are examples of problems in need of
multilabel classification.
Binary and multiclass datasets can be managed in R by using data
frames. Usually the last attribute (column of the “data.frame”) is the
output class, which might contain only
TRUE/FALSE values or values belonging to a
finite set (a factor). Multilabel datasets (MLDs) can also be stored in
an R “data.frame”, but an additional structure indicating which
attributes are output labels is needed. Moreover, this kind of dataset
has many specific characteristics that do not exist in traditional
ones. The average number of labels per instance, the imbalance ratio
for each label, the number of labelsets (sets of labels assigned to
each row) and their frequencies, and the level of concurrence among
imbalanced labels are some of the traits that differentiate MLDs from
traditional datasets.
Until now, most of the software to work with MLDs has been written in
Java. The two best known frameworks are MULAN (Tsoumakas et al. 2011) and MEKA (J. Read and Reutemann 2012). Both
implementations rely on WEKA (Hall et al.
2009), which offers a large variety of binary and multiclass
classifiers, as well as functions needed to deal with ARFF
(Attribute-Relation File Format) files. Most of the existing
MLDs are stored in ARFF format. MULAN and MEKA provide the specialized
tools needed to deal with multilabel ARFFs, and the infrastructure to
build multilabel classifiers (MLCs). Although R can access WEKA
functionality through the RWeka (Hornik, Buchta, and Zeileis 2009) package,
handling MLDs is far from an easy task in R. This has been the main
motivation behind the development of the mldr
package. To the best of our knowledge, mldr is the first R
package aimed at easing the work with multilabel data.
The mldr package aims to provide the user with the functions
needed to perform exploratory analysis of MLDs, determining their main
traits both statistically and visually. Moreover, it also brings the
proper tools to manipulate this kind of datasets, including the
application of the most common transformation methods, BR (Binary
Relevance) and LP (Label Powerset), that will be described
in the following section. These transformations are the foundation for
processing MLDs with traditional classifiers, as well as for developing
new multilabel algorithms.
The mldr package does not depend on the RWeka
package, and it is not linked to MULAN nor MEKA. It has been designed to
allow reading both MULAN and MEKA MLDs, but without any external
dependencies. In fact, it is possible to load MLDs stored in other
file formats, as well as to create them from scratch. When loaded, MLDs
are wrapped in an S3 type object with class “mldr”, which allows for the
use of methods. The object will contain the data in the MLD and also a
large set of measures obtained from it. The functions provided by the
package ease the access to this information, produce some specific
plots, and make possible the manipulation of its content. A web-based
graphical user interface, developed using the shiny (Chang et al. 2015) package, puts the
exploratory analysis tools of the mldr package at the
fingertips of all users, even those who have little experience using
R.
In the following section the foundations related to MLDs and MLCs will
be briefly introduced. After that, the structure of the mldr
package, and the operations it provides will be explained. Finally, the
user interface provided by mldr to ease exploratory analysis
tasks over MLDs will be shown. All code displayed in this paper is
available in a vignette, accessible by entering
vignette("mldr", package = "mldr").
Working with multilabel datasets
MLDs are generated from text documents (Klimt
and Yang 2004), sets of images (Duygulu et
al. 2002), music collections, and protein attributes (Diplaris et al. 2005), among other sources. For
each sample a set of features (input attributes) is collected, and a set
of labels (the output labelset) is assigned. Usually there are several
hundreds or even thousands of attributes, and it is not rare that an
MLD has more labels than features. Some MLDs have only a few labels per
instance, while others have dozens of them. In some MLDs the number of
label combinations (labelsets) is quite small, whereas in others it can
be very large. Most MLDs are imbalanced, which means that some labels
are very frequent while others are scarcely represented. The labels in
an MLD can be correlated or not. Moreover, frequent labels and rare
labels can appear together in the same instances.
As can be seen, a lot of different scenarios can be found depending
on the MLD characteristics. This is the reason why several specific
measures have been designed to assess MLD traits (Tsoumakas, Katakis, and Vlahavas 2010), since
they can have a serious impact on the MLC’s performance. The following
two subsections introduce several of these measures and some of the
approaches pursued to face multilabel classification.
Multilabel dataset traits
The most common characterization measures for MLDs can be grouped
into four categories, as depicted in Figure 1.
Figure 1: Characterization measures
taxonomy.
The most basic information that can be obtained from an MLD is the
number of instances, attributes and labels. For any MLD containing \(\lvert D\rvert\) instances, any instance
\(D_i, i \in \{1..\lvert D\rvert \}\)
will be the union of a set of attributes and a set of labels (\(X_i\), \(Y_i\)), \(X_i \in
X^1\times X^2\times \dots\times X^f, Y_i \subseteq L\), where
\(f\) is the number of input features
and \(X^j\) is the space of possible
values for the \(j\)-th attribute,
\(j \in \{1..f\}\). \(L\) being the full set of labels used in
\(D\), \(Y_i\) could be any subset of items in \(L\). Therefore, theoretically the number of
potential labelsets could be \(2^{\lvert
L\rvert}\). In practice this number tends to be limited by \(\lvert D\rvert\).
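As a toy illustration (made-up data, not taken from any package), the gap between the theoretical and the observed number of labelsets can be checked directly over a 0/1 label matrix:

```r
# Toy 0/1 label matrix: 20 instances, 4 labels (made-up data)
set.seed(42)
Y <- matrix(rbinom(20 * 4, 1, 0.4), nrow = 20, ncol = 4)

potential <- 2 ^ ncol(Y)                     # 2^|L| theoretical labelsets
observed  <- nrow(unique(as.data.frame(Y)))  # labelsets actually present
# 'observed' can never exceed min(2^|L|, |D|)
c(potential = potential, observed = observed)
```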
Each instance \(D_i\) has an
associated labelset, whose length (number of active labels) can be in
the range {0..\(\lvert L\rvert\)}. The
average number of active labels per instance is the most basic measure
of any MLD, usually known as Card (standing for cardinality).
It is calculated as shown in Equation @ref(eq:Card). Dividing this
measure by the number of labels in \(L\), as shown in Equation @ref(eq:Dens),
results in a dimension-less measure, known as Dens (standing
for label density).
\[\begin{aligned}
\label{eq:Card}
Card\left(D\right) &= \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \lvert Y_i\rvert,
\end{aligned} (\#eq:Card) \]
\[\begin{aligned}
\label{eq:Dens}
Dens\left(D\right) &= \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \frac{\lvert
Y_i\rvert}{\lvert L\rvert}.
\end{aligned} (\#eq:Dens) \]
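As a minimal sketch (toy data), Card and Dens can be computed directly from a 0/1 label matrix in which rows are instances and columns are labels:

```r
# Toy 0/1 label matrix: 4 instances, 3 labels (made-up data)
Y <- matrix(c(1, 0, 1,
              0, 1, 0,
              1, 1, 1,
              0, 0, 1), nrow = 4, byrow = TRUE)

card <- mean(rowSums(Y))  # average active labels per instance: (2+1+3+1)/4
dens <- card / ncol(Y)    # Card divided by |L|
c(Card = card, Dens = dens)  # Card = 1.75, Dens = 1.75/3
```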
Most multilabel datasets are imbalanced, meaning that some of the
labels are very frequent whereas others are quite rare. The level of
imbalance of a label can be measured by the imbalance ratio,
IRLbl, defined in Equation @ref(eq:IRLbl). To know how much
imbalance there is in \(D\), the
MeanIR measure (Charte et al.
2015) is calculated as the mean imbalance ratio among all labels,
as shown in Equation @ref(eq:MeanIR). In order to know the significance
of this last measure, the standard CV (Coefficient of Variation,
Equation @ref(eq:CV)) can be used.
\[\begin{aligned}
\label{eq:IRLbl}
\textit{IRLbl}\left(y\right) &=
\frac{
\max\limits_{y'\in L}
\left(\displaystyle\sum\limits_{i=1}^{\lvert
D\rvert}{h\left(y', Y_i\right)}\right)
}
{
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert}{h\left(y,
Y_i\right)}}
\quad h\left(y, Y_i\right) =
\begin{cases}
1 & y \in Y_i \\
0 & y \notin Y_i
\end{cases},
\end{aligned} (\#eq:IRLbl) \]
\[\begin{aligned}
\label{eq:MeanIR}
\textit{MeanIR} &= \frac{1}{\lvert L\rvert}
\displaystyle\sum\limits_{y\in L}\textit{IRLbl}\left(y\right),
\end{aligned} (\#eq:MeanIR) \]
\[\begin{aligned}
\label{eq:CV}
\textit{CV} &= \frac{\textit{IRLbl}\sigma}{\textit{MeanIR}}\quad
\textit{IRLbl}\sigma = \sqrt{
\displaystyle\sum\limits_{y\in L}^{}{
\frac{\left(\mathit{IRLbl\left(y\right) -
MeanIR}\right)^2}{\lvert L\rvert-1}
}
}.
\end{aligned} (\#eq:CV) \]
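The same kind of sketch works for the imbalance measures. Assuming a toy 0/1 label matrix, IRLbl, MeanIR and CV follow the equations above:

```r
# Toy 0/1 label matrix: label 1 appears 4 times, labels 2 and 3 once each
Y <- matrix(c(1, 1, 0,
              1, 0, 0,
              1, 0, 0,
              1, 0, 1), nrow = 4, byrow = TRUE)

counts <- colSums(Y)            # occurrences of each label: 4, 1, 1
IRLbl  <- max(counts) / counts  # 1, 4, 4: rarer labels get higher ratios
MeanIR <- mean(IRLbl)           # 3
CV     <- sd(IRLbl) / MeanIR    # sd() already divides by |L| - 1
```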
The number of different labelsets, as well as the number of them
that are unique (i.e., appearing only once in \(D\)), give us an idea
of how sparsely the labels are distributed. The labelsets by themselves
indicate how the
labels in \(L\) are related. A very
frequent labelset implies that the labels in it tend to appear jointly
in \(D\). The SCUMBLE measure,
introduced in Charte et al. (2014) and
shown in Equation @ref(eq:SCUMBLE), is used to assess the concurrence
level among frequent and infrequent labels.
\[\begin{aligned}
\label{eq:SCUMBLEIns}
\textit{SCUMBLE}_{ins}\left(i\right) &=
1 -
\frac{1}{\overline{\textit{IRLbl}_i}}\left(\prod\limits_{l=1}^{\lvert
L\rvert} \textit{IRLbl}_{il}\right)^{\left(1/\lvert L\rvert\right)},
\end{aligned} (\#eq:SCUMBLEIns) \]
\[\begin{aligned}
\label{eq:SCUMBLE}
\textit{SCUMBLE}\left(D\right) &= \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert}
\textit{SCUMBLE}_{ins}\left(i\right).
\end{aligned} (\#eq:SCUMBLE) \]
Besides the aforementioned insights, there are some other interesting
traits that can be indirectly obtained from the previous measures, such
as the ratio between input features and output labels, the maximum
IRLbl, or the coefficient of variation in the imbalance
levels, among others.
Although the raw numbers given by these calculations describe the
nature of any multilabel dataset to a good level, in general a
visualization of its characteristics is desirable to ease its
interpretation by researchers.
The information obtained from the previous measures depicts the
characteristics of the dataset. These insights, along with other factors
such as the loss function used by the classifier, help in choosing the
most appropriate algorithm to learn from it and, in the future, make
predictions on new data. Traditional classification models, such as
trees and support vector machines, are designed to give only one output
as result. Multilabel classification can mainly be faced through two
different approaches discussed in the following.
Multilabel classification
Algorithm adaptation: The goal is to modify
existing algorithms taking into account the multilabel nature of the
samples, for instance hosting more than one class in the leaves of a
tree instead of only one.
Problem transformation: This approach transforms
the original data to make it suitable to traditional classification
algorithms, then combines the obtained predictions to build the
labelsets given as output result.
Although several transformation methods have been defined in the
specialized literature, there are two among them that stand out because
they are the foundation for many others:
Binary Relevance (BR): Introduced by Godbole and Sarawagi (2004) as an adaptation of
OVA (One-vs-All) to the multilabel scenario, this method
transforms the original multilabel dataset into several binary datasets,
as many as there are different labels. In this way any binary classifier
can be used by joining the individual predictions to generate the final
output.
Label Powerset (LP): Introduced by Boutell et al. (2004), this method transforms
the multilabel dataset into a multiclass dataset by using the labelset
of each instance as class identifier. Any multiclass classifier can be
used, transforming back the predicted class into a labelset.
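A quick base-R sketch may help to visualize both transformations. The toy data frame and label names below are made up; the mldr package itself provides the mldr_transform() function for this purpose, as described later:

```r
# Toy multilabel data: two features (f1, f2) and two labels (L1, L2)
df <- data.frame(f1 = c(1.2, 0.7, 3.1), f2 = c(0, 1, 1),
                 L1 = c(1, 0, 1), L2 = c(0, 0, 1))
labels   <- c("L1", "L2")
features <- setdiff(names(df), labels)

# BR: one binary dataset per label, each using that label as the class
br <- lapply(labels, function(l) cbind(df[, features], class = df[[l]]))

# LP: a single multiclass dataset, with the labelset acting as the class
lp <- cbind(df[, features],
            class = factor(apply(df[, labels], 1, paste, collapse = "")))
```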
BR and LP have been used not only as a direct technique to implement
multilabel classifiers, but also as a base method to build more
sophisticated algorithms. Several ensembles of binary classifiers
relying on BR have been proposed, such as CC (Classifier
Chains) or ECC (Ensemble of Classifier Chains), both by
Jesse Read et al. (2011). The same is
applicable to the LP transformation, the foundation of ensemble
multilabel classifiers such as RAkEL (Random k-Labelsets for Multi-Label
Classification, Tsoumakas and Vlahavas
(2007)) and EPS (Ensemble of Pruned Sets, J. Read, Pfahringer, and Holmes (2008)).
For the readers interested in more details, a recent review on
multilabel classification has been published by Zhang and Zhou (2014).
The mldr package
R is among the most used tools when it comes to performing data
mining tasks, including binary and multiclass classification. However,
the work with MLDs in R is not as easy as it is with classic datasets.
This is the main motivation behind the development of the mldr
package, whose goals and functionality are described in this
section.
Main goals of the mldr package
When we planned the development of this package, our main objective
was to ease the exploration of MLDs in R. This included loading existing
MLDs in different formats, as well as obtaining from them all the
available information. These functions should be accessible to
everyone, even to users who are not familiar with the R command line
but with GUIs (Graphical User Interfaces) such as those provided by
packages Rcmdr (aka
R Commander, Fox (2005)) or rattle
(Williams 2011).
At the same time, we aimed to include the tools needed to manipulate
the MLDs, to apply filters and transformations, as well as to create
MLDs from scratch. This functionality, directed to more experienced R
users, opens the doors to implement other algorithms on top of
mldr, for instance preprocessing methods or multilabel
classifiers.
Installing and loading the mldr package
The mldr package is available from the Comprehensive R
Archive Network (CRAN), therefore it can be installed as any other
package, by simply typing:
> install.packages("mldr")
mldr depends on three R packages: XML (Lang and the CRAN Team 2015), circlize
(Gu et al. 2014) and shiny. The
first one allows reading XML (eXtensible Markup Language)
files, the second one is used to generate a specific type of plot
(described below), and the third one is the base of its user
interface.
Older releases of mldr, as well as the development version,
are available at http://github.com/fcharte/mldr. It is possible to
install the development version using the install_github()
function from devtools
(Wickham and Chang 2015).
Once installed, the package has to be loaded before it can be used.
This can be done through the library() or
require() functions, as usual. After loading the package
three sample MLDs will be available: birds,
emotions and genbase. These are contained in
the birds.rda, emotions.rda and
genbase.rda files, which are lazily loaded along with the
package.
The mldr package uses its own internal representation for
MLDs, which are assigned the “mldr” class. Inside an “mldr” object,
such as the previously mentioned emotions or birds,
both the data in the MLD and all the information obtained from this data
can be found.
Loading and creating MLDs
Besides the three sample MLDs included in the package, the
mldr() function allows loading any MLD stored in MULAN or
MEKA file formats. Assuming that the files corel5k.arff and
corel5k.xml, which hold the Corel5k (Duygulu et al. 2002) MLD in MULAN format, are
in the current directory, the loading is done as follows:
> corel5k <- mldr("corel5k")
If the XML file is not available, it is possible to indicate just the
number of labels in the MLD instead. In this case, the function assumes
that the labels are at the end of the list of features. For
instance:
> corel5k <- mldr("corel5k", label_amount = 374)
Loading an MLD in MEKA file format is equally easy. In this case
there is no XML file with label information; instead, a special header
inside the ARFF file provides it, a fact that is indicated to
mldr() through the use_xml argument:
> imdb <- mldr("imdb", use_xml = FALSE)
In all cases the result, as long as the MLD can be correctly loaded
and parsed, will be a new “mldr” object ready to use.
If the MLD we are interested in is not in MULAN or MEKA format,
first it will have to be loaded into a “data.frame”, for instance
using functions such as read.csv(),
read.table() or a more specialized reader; then
this “data.frame” and an integer vector stating the indices of the
labels inside it are given to the mldr_from_dataframe()
function. This is a general function for creating an “mldr” object from
any “data.frame”, so it can also be used to generate new MLDs on the
fly, as shown in the following example:
> df <- data.frame(matrix(rnorm(1000), ncol = 10))
> df$Label1 <- c(sample(c(0,1), 100, replace = TRUE))
> df$Label2 <- c(sample(c(0,1), 100, replace = TRUE))
> mymldr <- mldr_from_dataframe(df, labelIndices = c(11, 12), name = "testMLDR")
This will assign to mymldr an MLD, named
testMLDR, with 10 input attributes and 2 labels.
Plotting functions
Exploratory analysis of MLDs can be tedious, since most of them have
thousands of attributes and hundreds of labels. The mldr
package provides a plot() function specific for dealing
with “mldr” objects, allowing the generation of several specific types
of plots. The first argument given to plot() must be an
“mldr” object, while the second one specifies the type of plot to be
produced.
> plot(emotions, type = "LH")
There are seven different types of plots available: three histograms
showing relations between instances and labels, two bar plots with
similar purpose, a circular plot indicating types of attributes and a
concurrence plot for labels. All of them are shown in Figure 2, generated by the following code:
> layout(matrix(c(1, 1, 7, 1, 1, 4, 5, 5, 4, 2, 6, 3), 4, 3, byrow = TRUE))
> plot(emotions, type = c("LC", "LH", "LSH", "LB", "LSB", "CH", "AT"))
The concurrence plot, with type "LC", is the default one.
It addresses the need to explore interactions among labels, and
specifically between majority and minority ones. This plot has a
circular shape, with the circumference partitioned into several disjoint
arcs representing labels. Each arc has length proportional to the number
of instances where the label is present. Each arc is in turn divided
into bands joining it with other arcs, showing the relations between
pairs of labels. The width of each band indicates the strength of
the relation, since it is proportional to the number of instances in
which both labels appear simultaneously. In this manner, a concurrence
plot can show whether imbalanced labels appear frequently together, a
situation which could limit the possible improvement of a preprocessing
technique (Charte et al. 2014).
Since drawing interactions among a lot of labels can produce a
confusing result, this last type of plot accepts more arguments:
labelCount, which accepts an integer that will be used to
generate the plot with that number of labels chosen at random; and
labelIndices, which allows specifying exactly the indices
of the labels to be displayed in the plot. For example, in order to plot
the first ten labels of genbase:
> plot(genbase, labelIndices = genbase$labels$index[1:10])
The label histogram (type "LH") relates labels and
instances in a way that shows how well-represented labels are in
general. The X axis represents the number of instances and the Y axis
the amount of labels. This means that if a large number of labels
appear in very few instances, all data will be concentrated on the
left side of the plot. On the contrary, if labels are generally present
in many instances, data will tend to accumulate on the right side. This
plot shows imbalance of labels when there is data accumulated on both
sides of the plot, which implies that many labels are underrepresented,
and a large amount are overrepresented as well.
The labelset histogram (named "LSH") is similar to the
former. However, instead of representing the number of instances in
which each label appears, it shows the amount of labelsets. This
indicates quantitatively whether labelsets repeat consistently or not
among instances.
The label and labelset bar plots display exactly the number of
instances for each one of the labels and labelsets, respectively. Their
codes are "LB" for the label bar plot and
"LSB" for the labelset one.
The cardinality histogram (type "CH") represents the
number of labels instances have in general. Data accumulating
on the right side of the plot indicates that instances have a notable
number of labels, whereas data concentrating on the left side shows the
opposite situation.
The attribute types plot (named "AT") is a pie chart
displaying the number of labels, numeric attributes and finite set
(character) attributes, thus showing the proportions between these
types of attributes and easing the understanding of how much input
and output information the MLD contains.
Additionally, plot() accepts coloring arguments,
col and color.function. The former can be used
on all plot types except for the label concurrence plot, and must be a
vector of colors. The latter is only used on the label concurrence plot
and accepts a coloring function, such as rainbow or
heat.colors, as can be seen in the following example:
> plot(emotions, type = "LC", color.function = heat.colors)
> plot(emotions, type = "LB", col = terrain.colors(emotions$measures$num.labels))
Transforming and filtering functions
Manipulation of datasets is a crucial task in multilabel
classification. Since transformation is one of the main approaches to
tackle the problem, both BR and LP transformations are implemented in
package mldr. They can be obtained using the
mldr_transform function, which accepts an “mldr” object as
first argument, the type of transformation, "BR" or
"LP", as second, and an optional vector of label indices to
be included in the transformation as last argument:
> emotionsbr <- mldr_transform(emotions, type = "BR")
> emotionslp <- mldr_transform(emotions, type = "LP", emotions$labels$index[1:4])
The BR transformation will return a list of “data.frame” objects,
each one of them using one of the labels as class, whereas the LP
transformation will return a single “data.frame” representing a
multiclass dataset using each labelset as a class. Both of these
transformations can be directly used in order to apply binary and
multiclass classification algorithms, or even implement new ones.
> emo_lp <- mldr_transform(emotions, "LP")
> library(RWeka)
> classifier <- IBk(classLabel ~ ., data = emo_lp, control = Weka_control(K = 10))
> evaluate_Weka_classifier(classifier, numFolds = 5)
=== 5 Fold Cross Validation ===
=== Summary ===
Correctly Classified Instances          205               34.57   %
Incorrectly Classified Instances        388               65.43   %
Kappa statistic                           0.2695
Mean absolute error                       0.057
Root mean squared error                   0.1748
Relative absolute error                  83.7024 %
Root relative squared error              94.9069 %
Coverage of cases (0.95 level)           75.3794 %
Mean rel. region size (0.95 level)       19.574  %
Total Number of Instances               593
A filtering utility is included in the package as well. Using it is
intuitive, since it can be called with the square bracket operator
[. This allows partitioning an MLD or filtering it according
to a logical condition.
> emotions$measures$num.instances
[1] 593
> emotions[emotions$dataset$.SCUMBLE > 0.01]$measures$num.instances
[1] 222
Combined with the joining operator, +, this enables
users to implement new preprocessing techniques that modify information
in the MLD in order to improve classification results. For example,
assuming mld holds an “mldr” object, the following would be an
implementation of an algorithm disabling majority
labels on instances with highly imbalanced labels:
> mldbase <- mld[.SCUMBLE <= mld$measures$scumble]
> # Samples with co-occurrence of highly imbalanced labels
> mldhigh <- mld[.SCUMBLE > mld$measures$scumble]
> majIndexes <- mld$labels[mld$labels$IRLbl < mld$measures$meanIR, "index"]
> # Deactivate majority labels
> mldhigh$dataset[, majIndexes] <- 0
> mldbase + mldhigh # Join the instances without changes with the filtered ones
In this last example, the first two commands filter the MLD,
separating instances whose SCUMBLE is lower than the mean from
those with a higher value. Then, the third line obtains the indices of
the labels whose IRLbl is lower than the mean; these are the
majority labels of the dataset. Finally, these labels are set to 0 in
the instances with high SCUMBLE, and then the two partitions
are joined again.
Lastly, another useful feature included in the mldr package
is MLD comparison with the == operator. This indicates
whether the two compared MLDs share the same structure, meaning that
they have the same attributes and these have the same
types.
> emotions[1:10] == emotions[20:30]
[1] TRUE
> emotions == birds
[1] FALSE
Assessing multilabel predictive performance
Assuming that a set of predictions has been obtained for an MLD, e.g.,
through a set of binary classifiers, a multiclass classifier or any
other algorithm, the next step would be to evaluate the classification
performance. In the literature there exist more than 20 metrics for this
task, and some of them are quite complex to calculate. The mldr
package provides the mldr_evaluate function to accomplish
this task, supplying both example based and label based metrics.
Multilabel evaluation metrics are grouped into two main categories:
example based and label based metrics. Example based metrics are
computed individually for each instance, then averaged to obtain the
final value. Label based metrics are computed per label, instead of per
instance. There are two approaches called micro-averaging and
macro-averaging (described below). The output of the classifier
can be a bipartition (i.e., a set of 0s and 1s denoting the predicted
labels) or a ranking (i.e., a set of real values denoting the relevance
of each label). For this reason, there are bipartition based and ranking
based evaluation metrics for each one of the two previous
categories.
\(D\) being the MLD, \(L\) the full set of labels used in \(D\), \(Y_i\) the subset of predicted labels for
the i-th instance, and \(Z_i\)
the true subset of labels, the example/bipartition based metrics
returned by mldr_evaluate are the following:
Accuracy: It is defined (see Equation
@ref(eq:Accuracy)) as the proportion of correctly predicted labels with
respect to the total number of labels for each instance.
\[\label{eq:Accuracy}
Accuracy = \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \frac{\lvert Y_i \cap
Z_i\rvert}{\lvert Y_i \cup
Z_i\rvert}. (\#eq:Accuracy) \]
Precision: This metric is computed as indicated
in Equation @ref(eq:Precision), giving as result the ratio of relevant
labels predicted by the classifier.
\[\label{eq:Precision}
Precision = \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i \cap
Z_i\rvert }{\lvert Z_i\rvert }. (\#eq:Precision) \]
Recall: It is a metric (see Equation
@ref(eq:Recall)) commonly used along with the previous one, measuring
the proportion of predicted labels which are relevant.
\[\label{eq:Recall}
Recall = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i \cap
Z_i\rvert}{\lvert Y_i\rvert}. (\#eq:Recall) \]
F-Measure: As can be seen in Equation
@ref(eq:F1), this metric is the harmonic mean between Precision
(see Equation @ref(eq:Precision)) and Recall (see Equation
@ref(eq:Recall)), providing a balanced assessment between precision and
sensitivity.
\[\label{eq:F1}
\textit{FMeasure} = 2 * \frac{Precision \cdot Recall}{Precision
+ Recall}. (\#eq:F1) \]
Hamming Loss: It is the most common evaluation
metric in the multilabel literature, computed (see Equation @ref(eq:HL))
as the symmetric difference between predicted and true labels and
divided by the total number of labels in the MLD.
\[\label{eq:HL}
HammingLoss = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i
\triangle Z_i\rvert}{\lvert L\rvert}. (\#eq:HL) \]
Subset Accuracy: This metric is also known as
0/1 Subset Accuracy and Classification Accuracy, and
it is the most strict evaluation metric. The \(\left[\!\!\left[ expr \right]\!\!\right]\)
operator (see Equation @ref(eq:SubsetAccuracy)) returns 1 when \(expr\) is true and 0 otherwise. In this
case its value is 1 only if the predicted set of labels equals the true
one.
\[\label{eq:SubsetAccuracy}
SubsetAccuracy = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert }\left[\!\!\left[ Y_i =
Z_i \right]\!\!\right] . (\#eq:SubsetAccuracy) \]
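As a sketch of how these example/bipartition based metrics operate, consider two toy matrices, with Y as predictions and Z as ground truth, playing the same roles as in the equations above:

```r
# Toy bipartitions: rows are instances, columns are labels
Y <- matrix(c(1, 0, 1,
              0, 1, 0), nrow = 2, byrow = TRUE)  # predicted labelsets
Z <- matrix(c(1, 0, 0,
              0, 1, 0), nrow = 2, byrow = TRUE)  # true labelsets

inter <- rowSums(Y & Z)  # |Y_i intersection Z_i| per instance
union <- rowSums(Y | Z)  # |Y_i union Z_i| per instance

accuracy  <- mean(inter / union)                 # 0.75
precision <- mean(inter / rowSums(Z))            # denominators as in the
recall    <- mean(inter / rowSums(Y))            #   equations above
fmeasure  <- 2 * precision * recall / (precision + recall)
hamming   <- mean(rowSums(xor(Y, Z)) / ncol(Y))  # symmetric difference / |L|
subsetacc <- mean(apply(Y == Z, 1, all))         # exact labelset matches
```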
Let \(rank\left(x_i, y\right)\) be a
function returning the position of \(y\), a certain label, in the \(x_i\) instance. The example/ranking based
evaluation metrics returned by the mldr_evaluate function
are the following ones:
Average Precision: This metric (see Equation
@ref(eq:AveragePrecision)) computes, for each relevant label, the
proportion of relevant labels ranked at or above its position,
averaging the result over labels and instances.
\[\label{eq:AveragePrecision}
\textit{AveragePrecision} = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{1}{\lvert
Y_i\rvert} \displaystyle\sum\limits_{y \in Y_i}
\frac{\lvert \left\{y'\in Y_i : rank\left(x_i,
y'\right) \leq rank\left(x_i, y\right)
\right\}\rvert}{rank\left(x_i,
y\right)}. (\#eq:AveragePrecision) \]
Coverage: Defined as indicated in Equation
@ref(eq:Coverage), this metric calculates how far it is
necessary to go down the ranking to cover all relevant labels.
\[\label{eq:Coverage}
\textit{Coverage} = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert }
\displaystyle\max\limits_{y \in Y_i} rank\left(x_i,
y\right) - 1. (\#eq:Coverage) \]
One Error: It is a metric (see Equation
@ref(eq:OneError)) which determines how many times the best ranked label
given by the classifier is not part of the true label set of the
instance.
\[\label{eq:OneError}
\textit{OneError} = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \left[\!\!\left[
\mathop{argmax}\limits_{y \in Z_i} rank\left(x_i, y\right) \notin Y_i
\right]\!\!\right] . (\#eq:OneError) \]
Ranking Loss: This metric (see Equation
@ref(eq:RankingLoss)) compares each pair of labels in \(L\), computing how many times a relevant
label (member of the true labelset) appears ranked lower than a
non-relevant label. In the equation, \(\overline{Y_i}\) denotes \(L\backslash Y_i\).
\[\label{eq:RankingLoss} \small
\textit{RankingLoss} = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{1}{\lvert
Y_i\rvert\lvert \overline{Y_i}\rvert}
\lvert \left\{\left(y_a, y_b\right) \in Y_i \times
\overline{Y_i}: rank\left(x_i, y_a\right) > rank\left(x_i, y_b\right)
\right\}\rvert. (\#eq:RankingLoss) \]
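These ranking based metrics can be sketched from a matrix of scores, one per instance and label, where a higher score means more relevant; the data below are made up:

```r
# Toy scores and true labelsets; rank 1 denotes the best ranked label
scores <- matrix(c(0.9, 0.2, 0.4,
                   0.1, 0.8, 0.6), nrow = 2, byrow = TRUE)
Y <- matrix(c(1, 0, 1,
              0, 1, 0), nrow = 2, byrow = TRUE)

rnk <- t(apply(-scores, 1, rank))  # position of each label in the ranking

# OneError: how often the top ranked label is not a true label
one_error <- mean(sapply(seq_len(nrow(Y)),
                         function(i) Y[i, which.min(rnk[i, ])] == 0))
# Coverage: steps down the ranking needed to cover all true labels
coverage  <- mean(sapply(seq_len(nrow(Y)),
                         function(i) max(rnk[i, Y[i, ] == 1]) - 1))
```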
Regarding the label based metrics, there are two different ways to
aggregate the values of the labels. The macro-averaging approach (see
Equation @ref(eq:MacroB)) computes the metric independently for each
label and then averages the obtained values to get the final measure. On
the contrary, the micro-averaging approach (see Equation
@ref(eq:MicroB)) first aggregates the counters for all the labels and
then computes the metric only once. In the following equations
TP, FP, TN and FN stand for True
Positives, False Positives, True Negatives and
False Negatives, respectively.
\[\begin{aligned}
\label{eq:MacroB}
MacroMetric &= \frac{1}{\lvert L\rvert}
\sum\limits_{l=1}^{\lvert
L\rvert}evalMetric\left(TP_l,FP_l,TN_l,FN_l\right).
\end{aligned} (\#eq:MacroB) \]
\[\begin{aligned}
\label{eq:MicroB}
MicroMetric &= evalMetric\left(\sum\limits_{l=1}^{\lvert
L\rvert}TP_l,\sum\limits_{l=1}^{\lvert
L\rvert}FP_l,\sum\limits_{l=1}^{\lvert
L\rvert}TN_l,\sum\limits_{l=1}^{\lvert L\rvert}FN_l\right).
\end{aligned} (\#eq:MicroB) \]
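The difference between the two aggregation strategies can be seen with a
small numeric sketch, taking Precision as the evaluated metric. The
counter values below are made up for illustration:

```r
# Hypothetical TP/FP counters for three labels; the third label is
# rare and poorly predicted
TP <- c(10, 50, 5)
FP <- c(10, 10, 45)

# Macro-averaging: compute the metric per label, then average
macro_precision <- mean(TP / (TP + FP))           # (0.500 + 0.833 + 0.100) / 3
# Micro-averaging: aggregate the counters first, then compute once
micro_precision <- sum(TP) / (sum(TP) + sum(FP))  # 65 / 130

c(macro_precision, micro_precision)
# 0.4778  0.5000
```

The rare, badly handled label weighs as much as any other in the macro
average, dragging it down, while it barely influences the micro average,
which is dominated by the frequent labels. This is why both averages are
usually reported together.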
All the bipartition based metrics, such as Precision,
Recall or FMeasure, can be computed as label based
measures following these two approaches. This category also includes
some ranking based metrics, such as MacroAUC (see
Equation @ref(eq:MacroAUC)) and MicroAUC (see Equation
@ref(eq:MicroAUC)).
\[\label{eq:MacroAUC}
\begin{split}
MacroAUC = \frac{1}{\lvert L\rvert} \sum\limits_{l=1}^{\lvert
L\rvert} \frac{\lvert \left\{x', x'' : rank\left(x',
y_l\right) \ge rank\left(x'', y_l\right), \left(x',
x''\right) \in X_l \times \overline{X_l} \right\}\rvert}{\lvert
X_l\rvert\lvert \overline{X_l}\rvert},\\
X_l = \left\{ x_i : y_l \in Y_i\right\},\ \overline{X_l} =
\left\{x_i : y_l \notin Y_i\right\}.
\end{split} (\#eq:MacroAUC) \]
\[\label{eq:MicroAUC}
\begin{split}
MicroAUC = \frac{\lvert \left\{x', x'', y',
y'' : rank\left(x', y'\right) \ge rank\left(x'',
y''\right), \left(x', y'\right) \in S^+ ,
\left(x'', y''\right) \in S^- \right\}\rvert}{\lvert
S^+\rvert\lvert S^-\rvert},\\
S^+ = \left\{ \left(x_i, y\right) : y \in Y_i\right\},\ S^- =
\left\{ \left(x_i, y\right) : y \notin Y_i\right\}.
\end{split} (\#eq:MicroAUC) \]
When the partition of the MLD for which the predictions have been
obtained, along with the predictions themselves, is given to the
mldr_evaluate function, a list of 20 measures is returned.
For instance:
> # Get the true labels in emotions
> predictions <- as.matrix(emotions$dataset[, emotions$labels$index])
> # and introduce some noise
> predictions[sample(1:593, 100), sample(1:6, 100, replace = TRUE)] <-
+ sample(0:1, 100, replace = TRUE)
> # then evaluate the predictive performance
> res <- mldr_evaluate(emotions, predictions)
> str(res)
List of 20
$ Accuracy : num 0.917
$ AUC : num 0.916
$ AveragePrecision: num 0.673
$ Coverage : num 2.71
$ FMeasure : num 0.952
$ HammingLoss : num 0.0835
$ MacroAUC : num 0.916
$ MacroFMeasure : num 0.87
$ MacroPrecision : num 0.829
$ MacroRecall : num 0.915
$ MicroAUC : num 0.916
$ MicroFMeasure : num 0.872
$ MicroPrecision : num 0.834
$ MicroRecall : num 0.914
$ OneError : num 0.116
$ Precision : num 0.938
$ RankingLoss : num 0.518
$ Recall : num 0.914
$ SubsetAccuracy : num 0.831
$ ROC :List of 15
...
> plot(res$ROC, main = "ROC curve for emotions") # Plot ROC curve
If the pROC (Robin et al. 2011) package is available, this
list will include non-null AUC (Area Under the ROC Curve)
measures and also an element called ROC. The latter holds
the information needed to plot the ROC (Receiver Operating
Characteristic) curve, as shown in the last line of the previous
example. The result would be a plot similar to that in Figure 3.
The mldr user interface
This package provides the user with a web-based graphical user
interface built on top of the shiny package, allowing the
measures, plots and other results to be explored interactively. Once
mldr is loaded, this GUI can be launched from the
R console with a single command:
> mldrGUI()
This will cause the user’s default browser to start or open a new tab
displaying the GUI, which is organized into a tab bar and a content
pane. The tab bar allows switching between sections, so that different
information is shown in the pane.
The GUI will initially display the Main section, as shown in Figure
4. It contains controls to select an
MLD from those available, and to load a new one by uploading its ARFF
and XML files onto the application. On the right side, several plots are
stacked. These show the number of attributes of each type (numeric,
character or label), the number of labels per instance, the number of
instances associated with each label and the number of instances
associated with each labelset. Each plot can be saved as an image on the
file system. Right
below these graphics, some tables containing basic measures are shown.
The first one lists generic measures related to the entire MLD, and is
followed by measures specific to labels, such as Card or
Dens. The last table shows a summary of measures for
labelsets.
Figure 4: Main page of the shiny based
graphical user interface.
The Labels section contains a table enumerating each label of the MLD
with its relevant details and measures: its index in the attribute list,
its count and frequency, its IRLbl and its SCUMBLE.
Labels in this table can be reordered using the headers, and filtered by
the Search field. Furthermore, if the list is longer than the number
specified in the Show field, it will be split into several pages. The
data shown in all tables can be exported to files in several formats. On
the right side, a plot shows the amount of instances that have each
label. This is an interactive plot, and allows the range of labels to be
manipulated.
Since relationships between labels can influence how classifiers behave
on new data, studying labelsets is important in multilabel classification.
Thus, the section named Labelsets provides information about them,
listing each labelset along with its count. This list can be filtered
and split into pages as well, and is accompanied by a bar plot showing
the count of instances per labelset.
In order to obtain statistical measures about input attributes, the
Attributes section organizes all of them into a paged table, displaying
each attribute’s type along with data or measures appropriate to that
type. If the attribute is numeric, a table containing its minimum and
maximum values, its quartiles and its mean is shown. If, by contrast,
the attribute takes values from a finite set, each possible value is
shown along with its count in the MLD.
Lastly, concurrence among labels has proven to be a factor to take into
account when applying preprocessing techniques to MLDs. For this reason,
the Concurrence section provides an easy way of visualizing concurrence
among labels (see Figure 5): a label concurrence plot displays the
labels selected in the left-side table, with their co-occurrences
represented by bands in the circle. By default, the ten
labels with highest SCUMBLE are selected. The user is able to
select and deselect other labels by clicking their corresponding row on
the table.