Introduction
Pattern classification is an important task nowadays, in use
everywhere: from our e-mail client, which is able to separate spam from
legitimate messages, to credit institutions, which rely on it to detect
fraud and to grant or deny loans. These cases operate either on binary
datasets, since a message is either spam or legitimate, or on multiclass
datasets, where a loan is safe, medium, risky or highly risky, for
instance. In both cases the user expects only one output.
The huge growth in the amount of information stored on the web in
recent years, such as blog posts, pictures taken from cameras and
phones, videos hosted on YouTube, and messages on social networks, has
led to more complex classification tasks. A blog post can be classified into
several non-exclusive categories, for instance news, economy and
politics simultaneously. A picture can be assigned a set of labels, such
as landscape, sky and forest. A video can be labeled into several music
genres at once, etc. All of these are examples of problems in need of
multilabel classification.
Binary and multiclass datasets can be managed in R by using data
frames. Usually the last attribute (column of the “data.frame”) is the
output class, which might contain only
TRUE/FALSE values or values belonging to a
finite set (a factor). Multilabel datasets (MLDs) can also be stored in
an R “data.frame”, but an additional structure indicating which
attributes are output labels is needed. Moreover, this kind of dataset
has many specific characteristics that do not exist in traditional
ones. The average number of labels per instance, the imbalance ratio
for each label, the number of labelsets (sets of labels assigned to
each row) and their frequencies, and the level of concurrence among
imbalanced labels are some of the traits that differentiate MLDs from
traditional datasets.
Until now, most of the software to work with MLDs has been written in
Java. The two best known frameworks are MULAN (Tsoumakas et al. 2011) and MEKA (J. Read and Reutemann 2012). Both
implementations rely on WEKA (Hall et al.
2009), which offers a large variety of binary and multiclass
classifiers, as well as functions needed to deal with ARFF
(Attribute-Relation File Format) files. Most of the existing
MLDs are stored in ARFF format. MULAN and MEKA provide the specialized
tools needed to deal with multilabel ARFFs, and the infrastructure to
build multilabel classifiers (MLCs). Although R can access WEKA
functionality through the RWeka (Hornik, Buchta, and Zeileis 2009) package,
handling MLDs is far from an easy task in R. This has been the main
motivation behind the development of the mldr
package. To the best of our knowledge, mldr is the first R
package aimed at easing the work with multilabel data.
The mldr package aims to provide the user with the functions
needed to perform exploratory analysis of MLDs, determining their main
traits both statistically and visually. Moreover, it also brings the
proper tools to manipulate this kind of datasets, including the
application of the most common transformation methods, BR (Binary
Relevance) and LP (Label Powerset), that will be described
in the following section. These transformations are the foundation for
processing MLDs with traditional classifiers, as well as for developing
new multilabel algorithms.
The mldr package does not depend on the RWeka
package, and it is not linked to MULAN nor MEKA. It has been designed to
allow reading both MULAN and MEKA MLDs, but without any external
dependencies. In fact, it is possible to load MLDs stored in other
file formats, as well as to create them from scratch. When loaded, MLDs
are wrapped in an S3 type object with class “mldr”, which allows for the
use of methods. The object will contain the data in the MLD and also a
large set of measures obtained from it. The functions provided by the
package ease the access to this information, produce some specific
plots, and make possible the manipulation of its content. A web-based
graphical user interface, developed using the shiny (Chang et al. 2015) package, puts the
exploratory analysis tools of the mldr package at the
fingertips of all users, even those who have little experience using
R.
In the following section the foundations related to MLDs and MLCs will
be briefly introduced. After that, the structure of the mldr
package, and the operations it provides will be explained. Finally, the
user interface provided by mldr to ease exploratory analysis
tasks over MLDs will be shown. All code displayed in this paper is
available in a vignette, accessible by entering
vignette("mldr", package = "mldr").
Working with multilabel datasets
MLDs are generated from text documents (Klimt
and Yang 2004), sets of images (Duygulu et
al. 2002), music collections, and protein attributes (Diplaris et al. 2005), among other sources. For
each sample a set of features (input attributes) is collected, and a set
of labels (the output labelset) is assigned. Usually there are several
hundreds or even thousands of attributes, and it is not rare that an
MLD has more labels than features. Some MLDs have only a few labels per
instance, while others have dozens of them. In some MLDs the number of
label combinations (labelsets) is quite small, whereas in others it can
be very large. Most MLDs are imbalanced, which means that some labels
are very frequent while others are scarcely represented. The labels in
an MLD can be correlated or not. Moreover, frequent labels and rare
labels can appear together in the same instances.
As can be seen, a lot of different scenarios can be found depending
on the MLD characteristics. This is the reason why several specific
measures have been designed to assess MLD traits (Tsoumakas, Katakis, and Vlahavas 2010), since
they can have a serious impact on the MLC’s performance. The following
two subsections introduce several of these measures and some of the
approaches pursued to face multilabel classification.
Multilabel dataset traits
The most common characterization measures for MLDs can be grouped
into four categories, as depicted in Figure 1.
Figure 1: Characterization measures
taxonomy.
The most basic information that can be obtained from an MLD is the
number of instances, attributes and labels. For any MLD containing \(\lvert D\rvert\) instances, any instance
\(D_i, i \in \{1..\lvert D\rvert \}\)
will be the union of a set of attributes and a set of labels (\(X_i\), \(Y_i\)), \(X_i \in
X^1\times X^2\times \dots\times X^f, Y_i \subseteq L\), where
\(f\) is the number of input features
and \(X^j\) is the space of possible
values for the \(j\)-th attribute,
\(j \in \{1..f\}\). \(L\) being the full set of labels used in
\(D\), \(Y_i\) could be any subset of items in \(L\). Therefore, theoretically the number of
potential labelsets could be \(2^{\lvert
L\rvert}\). In practice this number tends to be limited by \(\lvert D\rvert\).
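As a toy illustration (made-up data, not taken from any package), the gap between the theoretical and the observed number of labelsets can be checked directly over a 0/1 label matrix:

```r
# Toy 0/1 label matrix: 20 instances, 4 labels (made-up data)
set.seed(42)
Y <- matrix(rbinom(20 * 4, 1, 0.4), nrow = 20, ncol = 4)

potential <- 2 ^ ncol(Y)                     # 2^|L| theoretical labelsets
observed  <- nrow(unique(as.data.frame(Y)))  # labelsets actually present
# 'observed' can never exceed min(2^|L|, |D|)
c(potential = potential, observed = observed)
```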
Each instance \(D_i\) has an
associated labelset, whose length (number of active labels) can be in
the range {0..\(\lvert L\rvert\)}. The
average number of active labels per instance is the most basic measure
of any MLD, usually known as Card (standing for cardinality).
It is calculated as shown in Equation @ref(eq:Card). Dividing this
measure by the number of labels in \(L\), as shown in Equation @ref(eq:Dens),
results in a dimension-less measure, known as Dens (standing
for label density).
\[\begin{aligned}
\label{eq:Card}
Card\left(D\right) &= \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \lvert Y_i\rvert,
\end{aligned} (\#eq:Card) \]
\[\begin{aligned}
\label{eq:Dens}
Dens\left(D\right) &= \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \frac{\lvert
Y_i\rvert}{\lvert L\rvert}.
\end{aligned} (\#eq:Dens) \]
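As a minimal sketch (toy data), Card and Dens can be computed directly from a 0/1 label matrix in which rows are instances and columns are labels:

```r
# Toy 0/1 label matrix: 4 instances, 3 labels (made-up data)
Y <- matrix(c(1, 0, 1,
              0, 1, 0,
              1, 1, 1,
              0, 0, 1), nrow = 4, byrow = TRUE)

card <- mean(rowSums(Y))  # average active labels per instance: (2+1+3+1)/4
dens <- card / ncol(Y)    # Card divided by |L|
c(Card = card, Dens = dens)  # Card = 1.75, Dens = 1.75/3
```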
Most multilabel datasets are imbalanced, meaning that some of the
labels are very frequent whereas others are quite rare. The level of
imbalance of a label can be measured by the imbalance ratio,
IRLbl, defined in Equation @ref(eq:IRLbl). To know how much
imbalance there is in \(D\), the
MeanIR measure (Charte et al.
2015) is calculated as the mean imbalance ratio among all labels,
as shown in Equation @ref(eq:MeanIR). In order to know the significance
of this last measure, the standard CV (Coefficient of Variation,
Equation @ref(eq:CV)) can be used.
\[\begin{aligned}
\label{eq:IRLbl}
\textit{IRLbl}\left(y\right) &=
\frac{
\max\limits_{y'\in L}
\left(\displaystyle\sum\limits_{i=1}^{\lvert
D\rvert}{h\left(y', Y_i\right)}\right)
}
{
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert}{h\left(y,
Y_i\right)}}
\quad h\left(y, Y_i\right) =
\begin{cases}
1 & y \in Y_i \\
0 & y \notin Y_i
\end{cases},
\end{aligned} (\#eq:IRLbl) \]
\[\begin{aligned}
\label{eq:MeanIR}
\textit{MeanIR} &= \frac{1}{\lvert L\rvert}
\displaystyle\sum\limits_{y\in L}\textit{IRLbl}\left(y\right),
\end{aligned} (\#eq:MeanIR) \]
\[\begin{aligned}
\label{eq:CV}
\textit{CV} &= \frac{\textit{IRLbl}\sigma}{\textit{MeanIR}}\quad
\textit{IRLbl}\sigma = \sqrt{
\displaystyle\sum\limits_{y\in L}^{}{
\frac{\left(\mathit{IRLbl\left(y\right) -
MeanIR}\right)^2}{\lvert L\rvert-1}
}
}.
\end{aligned} (\#eq:CV) \]
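The same kind of sketch works for the imbalance measures. Assuming a toy 0/1 label matrix, IRLbl, MeanIR and CV follow the equations above:

```r
# Toy 0/1 label matrix: label 1 appears 4 times, labels 2 and 3 once each
Y <- matrix(c(1, 1, 0,
              1, 0, 0,
              1, 0, 0,
              1, 0, 1), nrow = 4, byrow = TRUE)

counts <- colSums(Y)            # occurrences of each label: 4, 1, 1
IRLbl  <- max(counts) / counts  # 1, 4, 4: rarer labels get higher ratios
MeanIR <- mean(IRLbl)           # 3
CV     <- sd(IRLbl) / MeanIR    # sd() already divides by |L| - 1
```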
The number of different labelsets, as well as the number of them
that are unique (i.e., appearing only once in \(D\)), give us an idea
of how sparsely the labels are distributed. The labelsets by themselves
indicate how the
labels in \(L\) are related. A very
frequent labelset implies that the labels in it tend to appear jointly
in \(D\). The SCUMBLE measure,
introduced in Charte et al. (2014) and
shown in Equation @ref(eq:SCUMBLE), is used to assess the concurrence
level among frequent and infrequent labels.
\[\begin{aligned}
\label{eq:SCUMBLEIns}
\textit{SCUMBLE}_{ins}\left(i\right) &=
1 -
\frac{1}{\overline{\textit{IRLbl}_i}}\left(\prod\limits_{l=1}^{\lvert
L\rvert} \textit{IRLbl}_{il}\right)^{\left(1/\lvert L\rvert\right)},
\end{aligned} (\#eq:SCUMBLEIns) \]
\[\begin{aligned}
\label{eq:SCUMBLE}
\textit{SCUMBLE}\left(D\right) &= \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert}
\textit{SCUMBLE}_{ins}\left(i\right).
\end{aligned} (\#eq:SCUMBLE) \]
Besides the aforementioned insights, there are some other interesting
traits that can be indirectly obtained from the previous measures, such
as the ratio between input features and output labels, the maximum
IRLbl, or the coefficient of variation in the imbalance
levels, among others.
Although the raw numbers given by these calculations describe the
nature of any multilabel dataset to a good level, in general a
visualization of its characteristics is desirable to ease its
interpretation by researchers.
The information obtained from the previous measures depicts the
characteristics of the dataset. These insights, along with other factors
such as the loss function used by the classifier, help in choosing the
most appropriate algorithm to learn from it and, in the future, make
predictions on new data. Traditional classification models, such as
trees and support vector machines, are designed to give only one output
as result. Multilabel classification can mainly be faced through two
different approaches discussed in the following.
Multilabel classification
Algorithm adaptation: The goal is to modify
existing algorithms taking into account the multilabel nature of the
samples, for instance hosting more than one class in the leaves of a
tree instead of only one.
Problem transformation: This approach transforms
the original data to make it suitable to traditional classification
algorithms, then combines the obtained predictions to build the
labelsets given as output result.
Although several transformation methods have been defined in the
specialized literature, there are two among them that stand out because
they are the foundation for many others:
Binary Relevance (BR): Introduced by Godbole and Sarawagi (2004) as an adaptation of
OVA (One-vs-All) to the multilabel scenario, this method
transforms the original multilabel dataset into several binary datasets,
as many as there are different labels. In this way any binary classifier
can be used by joining the individual predictions to generate the final
output.
Label Powerset (LP): Introduced by Boutell et al. (2004), this method transforms
the multilabel dataset into a multiclass dataset by using the labelset
of each instance as class identifier. Any multiclass classifier can be
used, transforming back the predicted class into a labelset.
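A quick base-R sketch may help to visualize both transformations. The toy data frame and label names below are made up; the mldr package itself provides the mldr_transform() function for this purpose, as described later:

```r
# Toy multilabel data: two features (f1, f2) and two labels (L1, L2)
df <- data.frame(f1 = c(1.2, 0.7, 3.1), f2 = c(0, 1, 1),
                 L1 = c(1, 0, 1), L2 = c(0, 0, 1))
labels   <- c("L1", "L2")
features <- setdiff(names(df), labels)

# BR: one binary dataset per label, each using that label as the class
br <- lapply(labels, function(l) cbind(df[, features], class = df[[l]]))

# LP: a single multiclass dataset, with the labelset acting as the class
lp <- cbind(df[, features],
            class = factor(apply(df[, labels], 1, paste, collapse = "")))
```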
BR and LP have been used not only as a direct technique to implement
multilabel classifiers, but also as a base method to build more
sophisticated algorithms. Several ensembles of binary classifiers
relying on BR have been proposed, such as CC (Classifier
Chains) or ECC (Ensemble of Classifier Chains), both by
Jesse Read et al. (2011). The same is
applicable to the LP transformation, the foundation of ensemble
multilabel classifiers such as RAkEL (Random k-Labelsets for Multi-Label
Classification, Tsoumakas and Vlahavas
(2007)) and EPS (Ensemble of Pruned Sets, J. Read, Pfahringer, and Holmes (2008)).
For the readers interested in more details, a recent review on
multilabel classification has been published by Zhang and Zhou (2014).
The mldr package
R is among the most used tools when it comes to performing data
mining tasks, including binary and multiclass classification. However,
the work with MLDs in R is not as easy as it is with classic datasets.
This is the main motivation behind the development of the mldr
package, whose goals and functionality are described in this
section.
Main goals of the mldr package
When we planned the development of this package, our main objective
was to ease the exploration of MLDs in R. This included loading existing
MLDs in different formats, as well as obtaining from them all the
available information. These functions should be accessible to
everyone, even to users who are not familiar with the R command line
but with GUIs (Graphical User Interfaces) such as those provided by
packages Rcmdr (aka
R Commander, Fox (2005)) or rattle
(Williams 2011).
At the same time, we aimed to include the tools needed to manipulate
the MLDs, to apply filters and transformations, as well as to create
MLDs from scratch. This functionality, directed to more experienced R
users, opens the doors to implement other algorithms on top of
mldr, for instance preprocessing methods or multilabel
classifiers.
Installing and loading the mldr package
The mldr package is available from the Comprehensive R
Archive Network (CRAN), therefore it can be installed as any other
package, by simply typing:
> install.packages("mldr")
mldr depends on three R packages: XML (Lang and the CRAN Team 2015), circlize
(Gu et al. 2014) and shiny. The
first one allows reading XML (eXtensible Markup Language)
files, the second one is used to generate a specific type of plot
(described below), and the third one is the base of its user
interface.
Older releases of mldr, as well as the development version,
are available at http://github.com/fcharte/mldr. It is possible to
install the development version using the install_github()
function from devtools
(Wickham and Chang 2015).
Once installed, the package has to be loaded before it can be used.
This can be done through the library() or
require() functions, as usual. After loading the package
three sample MLDs will be available: birds,
emotions and genbase. These are contained in
the birds.rda, emotions.rda and
genbase.rda files, which are lazily loaded along with the
package.
The mldr package uses its own internal representation for
MLDs, which are assigned the “mldr” class. Inside an “mldr” object,
such as the previously mentioned emotions or birds,
both the data in the MLD and all the information obtained from this data
can be found.
Loading and creating MLDs
Besides the three sample MLDs included in the package, the
mldr() function allows loading any MLD stored in MULAN or
MEKA file formats. Assuming that the files corel5k.arff and
corel5k.xml, which hold the Corel5k (Duygulu et al. 2002) MLD in MULAN format, are
in the current directory, the loading is done as follows:
> corel5k <- mldr("corel5k")
If the XML file is not available, it is possible to indicate just the
number of labels in the MLD instead. In this case, the function assumes
that the labels are at the end of the list of features. For
instance:
> corel5k <- mldr("corel5k", label_amount = 374)
Loading an MLD in MEKA file format is equally easy. In this case
there is no XML file with label information; instead, a special header
inside the ARFF file provides it, a fact that is indicated to
mldr() through the use_xml argument:
> imdb <- mldr("imdb", use_xml = FALSE)
In all cases the result, as long as the MLD can be correctly loaded
and parsed, will be a new “mldr” object ready to use.
If the MLD we are interested in is not in MULAN or MEKA format,
first it will have to be loaded into a “data.frame”, for instance
using functions such as read.csv(),
read.table() or a more specialized reader; then
this “data.frame” and an integer vector stating the indices of the
labels inside it are given to the mldr_from_dataframe()
function. This is a general function for creating an “mldr” object from
any “data.frame”, so it can also be used to generate new MLDs on the
fly, as shown in the following example:
> df <- data.frame(matrix(rnorm(1000), ncol = 10))
> df$Label1 <- c(sample(c(0,1), 100, replace = TRUE))
> df$Label2 <- c(sample(c(0,1), 100, replace = TRUE))
> mymldr <- mldr_from_dataframe(df, labelIndices = c(11, 12), name = "testMLDR")
This will assign to mymldr an MLD, named
testMLDR, with 10 input attributes and 2 labels.
Plotting functions
Exploratory analysis of MLDs can be tedious, since most of them have
thousands of attributes and hundreds of labels. The mldr
package provides a plot() function specific for dealing
with “mldr” objects, allowing the generation of several specific types
of plots. The first argument given to plot() must be an
“mldr” object, while the second one specifies the type of plot to be
produced.
> plot(emotions, type = "LH")
There are seven different types of plots available: three histograms
showing relations between instances and labels, two bar plots with
similar purpose, a circular plot indicating types of attributes and a
concurrence plot for labels. All of them are shown in Figure 2, generated by the following code:
> layout(matrix(c(1, 1, 7, 1, 1, 4, 5, 5, 4, 2, 6, 3), 4, 3, byrow = TRUE))
> plot(emotions, type = c("LC", "LH", "LSH", "LB", "LSB", "CH", "AT"))
The concurrence plot, with type "LC", is the default one.
It addresses the need to explore interactions among labels, and
specifically between majority and minority ones. This plot has a
circular shape, with the circumference partitioned into several disjoint
arcs representing labels. Each arc has length proportional to the number
of instances where the label is present. Each arc is in turn divided
into bands joining it with other arcs, showing the relations between
pairs of labels. The width of each band indicates the strength of
the relation, since it is proportional to the number of instances in
which both labels appear simultaneously. In this manner, a concurrence
plot can show whether imbalanced labels appear frequently together, a
situation which could limit the possible improvement of a preprocessing
technique (Charte et al. 2014).
Since drawing interactions among a lot of labels can produce a
confusing result, this last type of plot accepts more arguments:
labelCount, which accepts an integer that will be used to
generate the plot with that number of labels chosen at random; and
labelIndices, which allows specifying exactly the indices
of the labels to be displayed in the plot. For example, in order to plot
the first ten labels of genbase:
> plot(genbase, labelIndices = genbase$labels$index[1:10])
The label histogram (type "LH") relates labels and
instances in a way that shows how well-represented labels are in
general. The X axis represents the number of instances and the Y axis
the amount of labels. This means that if a large number of labels
appear in very few instances, all data will be concentrated on the
left side of the plot. On the contrary, if labels are generally present
in many instances, data will tend to accumulate on the right side. This
plot shows imbalance of labels when there is data accumulated on both
sides of the plot, which implies that many labels are underrepresented,
and a large amount are overrepresented as well.
The labelset histogram (named "LSH") is similar to the
former. However, instead of representing the number of instances in
which each label appears, it shows the amount of labelsets. This
indicates quantitatively whether labelsets repeat consistently or not
among instances.
The label and labelset bar plots display exactly the number of
instances for each one of the labels and labelsets, respectively. Their
codes are "LB" for the label bar plot and
"LSB" for the labelset one.
The cardinality histogram (type "CH") represents the
number of labels instances have in general. Data accumulating
on the right side of the plot indicates that instances have a notable
number of labels, whereas data concentrating on the left side shows the
opposite situation.
The attribute types plot (named "AT") is a pie chart
displaying the number of labels, numeric attributes and finite set
(character) attributes, thus showing the proportions between these
types of attributes and easing the understanding of how much input
and output information the MLD contains.
Additionally, plot() accepts coloring arguments,
col and color.function. The former can be used
on all plot types except for the label concurrence plot, and must be a
vector of colors. The latter is only used on the label concurrence plot
and accepts a coloring function, such as rainbow or
heat.colors, as can be seen in the following example:
> plot(emotions, type = "LC", color.function = heat.colors)
> plot(emotions, type = "LB", col = terrain.colors(emotions$measures$num.labels))
Transforming and filtering functions
Manipulation of datasets is a crucial task in multilabel
classification. Since transformation is one of the main approaches to
tackle the problem, both BR and LP transformations are implemented in
package mldr. They can be obtained using the
mldr_transform function, which accepts an “mldr” object as
first argument, the type of transformation, "BR" or
"LP", as second, and an optional vector of label indices to
be included in the transformation as last argument:
> emotionsbr <- mldr_transform(emotions, type = "BR")
> emotionslp <- mldr_transform(emotions, type = "LP", emotions$labels$index[1:4])
The BR transformation will return a list of “data.frame” objects,
each one of them using one of the labels as class, whereas the LP
transformation will return a single “data.frame” representing a
multiclass dataset using each labelset as a class. Both of these
transformations can be directly used in order to apply binary and
multiclass classification algorithms, or even implement new ones.
> emo_lp <- mldr_transform(emotions, "LP")
> library(RWeka)
> classifier <- IBk(classLabel ~ ., data = emo_lp, control = Weka_control(K = 10))
> evaluate_Weka_classifier(classifier, numFolds = 5)
=== 5 Fold Cross Validation ===
=== Summary ===
Correctly Classified Instances          205               34.57   %
Incorrectly Classified Instances        388               65.43   %
Kappa statistic                           0.2695
Mean absolute error                       0.057
Root mean squared error                   0.1748
Relative absolute error                  83.7024 %
Root relative squared error              94.9069 %
Coverage of cases (0.95 level)           75.3794 %
Mean rel. region size (0.95 level)       19.574  %
Total Number of Instances               593
A filtering utility is included in the package as well. Using it is
intuitive, since it can be called with the square bracket operator
[. This allows partitioning an MLD or filtering it according
to a logical condition.
> emotions$measures$num.instances
[1] 593
> emotions[emotions$dataset$.SCUMBLE > 0.01]$measures$num.instances
[1] 222
Combined with the joining operator, +, this enables
users to implement new preprocessing techniques that modify information
in the MLD in order to improve classification results. For example,
assuming mld holds an “mldr” object, the following would be an
implementation of an algorithm disabling majority
labels on instances with highly imbalanced labels:
> mldbase <- mld[.SCUMBLE <= mld$measures$scumble]
> # Samples with co-occurrence of highly imbalanced labels
> mldhigh <- mld[.SCUMBLE > mld$measures$scumble]
> majIndexes <- mld$labels[mld$labels$IRLbl < mld$measures$meanIR, "index"]
> # Deactivate majority labels
> mldhigh$dataset[, majIndexes] <- 0
> mldbase + mldhigh # Join the instances without changes with the filtered ones
In this last example, the first two commands filter the MLD,
separating instances whose SCUMBLE is lower than the mean from
those with a higher value. Then, the third line obtains the indices of
the labels whose IRLbl is lower than the mean; these are the
majority labels of the dataset. Finally, these labels are set to 0 in
the instances with high SCUMBLE, and then the two partitions
are joined again.
Lastly, another useful feature included in the mldr package
is MLD comparison with the == operator. This indicates
whether the two compared MLDs share the same structure, meaning that
they have the same attributes and these have the same
types.
> emotions[1:10] == emotions[20:30]
[1] TRUE
> emotions == birds
[1] FALSE
Assessing multilabel predictive performance
Assuming that a set of predictions has been obtained for an MLD, e.g.,
through a set of binary classifiers, a multiclass classifier or any
other algorithm, the next step would be to evaluate the classification
performance. In the literature there exist more than 20 metrics for this
task, and some of them are quite complex to calculate. The mldr
package provides the mldr_evaluate function to accomplish
this task, supplying both example based and label based metrics.
Multilabel evaluation metrics are grouped into two main categories:
example based and label based metrics. Example based metrics are
computed individually for each instance, then averaged to obtain the
final value. Label based metrics are computed per label, instead of per
instance. There are two approaches called micro-averaging and
macro-averaging (described below). The output of the classifier
can be a bipartition (i.e., a set of 0s and 1s denoting the predicted
labels) or a ranking (i.e., a set of real values denoting the relevance
of each label). For this reason, there are bipartition based and ranking
based evaluation metrics for each one of the two previous
categories.
\(D\) being the MLD, \(L\) the full set of labels used in \(D\), \(Y_i\) the subset of predicted labels for
the i-th instance, and \(Z_i\)
the true subset of labels, the example/bipartition based metrics
returned by mldr_evaluate are the following:
Accuracy: It is defined (see Equation
@ref(eq:Accuracy)) as the proportion of correctly predicted labels with
respect to the total number of labels for each instance.
\[\label{eq:Accuracy}
Accuracy = \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \frac{\lvert Y_i \cap
Z_i\rvert}{\lvert Y_i \cup
Z_i\rvert}. (\#eq:Accuracy) \]
Precision: This metric is computed as indicated
in Equation @ref(eq:Precision), giving as result the ratio of relevant
labels predicted by the classifier.
\[\label{eq:Precision}
Precision = \frac{1}{\lvert D\rvert}
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i \cap
Z_i\rvert }{\lvert Z_i\rvert }. (\#eq:Precision) \]
Recall: It is a metric (see Equation
@ref(eq:Recall)) commonly used along with the previous one, measuring
the proportion of predicted labels which are relevant.
\[\label{eq:Recall}
Recall = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i \cap
Z_i\rvert}{\lvert Y_i\rvert}. (\#eq:Recall) \]
F-Measure: As can be seen in Equation
@ref(eq:F1), this metric is the harmonic mean between Precision
(see Equation @ref(eq:Precision)) and Recall (see Equation
@ref(eq:Recall)), providing a balanced assessment between precision and
sensitivity.
\[\label{eq:F1}
\textit{FMeasure} = 2 * \frac{Precision \cdot Recall}{Precision
+ Recall}. (\#eq:F1) \]
Hamming Loss: It is the most common evaluation
metric in the multilabel literature, computed (see Equation @ref(eq:HL))
as the symmetric difference between predicted and true labels and
divided by the total number of labels in the MLD.
\[\label{eq:HL}
HammingLoss = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i
\triangle Z_i\rvert}{\lvert L\rvert}. (\#eq:HL) \]
Subset Accuracy: This metric is also known as
0/1 Subset Accuracy and Classification Accuracy, and
it is the most strict evaluation metric. The \(\left[\!\!\left[ expr \right]\!\!\right]\)
operator (see Equation @ref(eq:SubsetAccuracy)) returns 1 when \(expr\) is true and 0 otherwise. In this
case its value is 1 only if the predicted set of labels equals the true
one.
\[\label{eq:SubsetAccuracy}
SubsetAccuracy = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert }\left[\!\!\left[ Y_i =
Z_i \right]\!\!\right] . (\#eq:SubsetAccuracy) \]
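As a sketch of how these example/bipartition based metrics operate, consider two toy matrices, with Y as predictions and Z as ground truth, playing the same roles as in the equations above:

```r
# Toy bipartitions: rows are instances, columns are labels
Y <- matrix(c(1, 0, 1,
              0, 1, 0), nrow = 2, byrow = TRUE)  # predicted labelsets
Z <- matrix(c(1, 0, 0,
              0, 1, 0), nrow = 2, byrow = TRUE)  # true labelsets

inter <- rowSums(Y & Z)  # |Y_i intersection Z_i| per instance
union <- rowSums(Y | Z)  # |Y_i union Z_i| per instance

accuracy  <- mean(inter / union)                 # 0.75
precision <- mean(inter / rowSums(Z))            # denominators as in the
recall    <- mean(inter / rowSums(Y))            #   equations above
fmeasure  <- 2 * precision * recall / (precision + recall)
hamming   <- mean(rowSums(xor(Y, Z)) / ncol(Y))  # symmetric difference / |L|
subsetacc <- mean(apply(Y == Z, 1, all))         # exact labelset matches
```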
Let \(rank\left(x_i, y\right)\) be a
function returning the position of \(y\), a certain label, in the \(x_i\) instance. The example/ranking based
evaluation metrics returned by the mldr_evaluate function
are the following ones:
Average Precision: This metric (see Equation
@ref(eq:AveragePrecision)) computes, for each relevant label, the
proportion of relevant labels ranked at or above its position,
averaging the result over labels and instances.
\[\label{eq:AveragePrecision}
\textit{AveragePrecision} = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{1}{\lvert
Y_i\rvert} \displaystyle\sum\limits_{y \in Y_i}
\frac{\lvert \left\{y'\in Y_i : rank\left(x_i,
y'\right) \leq rank\left(x_i, y\right)
\right\}\rvert}{rank\left(x_i,
y\right)}. (\#eq:AveragePrecision) \]
Coverage: Defined as indicated in Equation
@ref(eq:Coverage), this metric calculates how far it is
necessary to go down the ranking to cover all relevant labels.
\[\label{eq:Coverage}
\textit{Coverage} = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert }
\displaystyle\max\limits_{y \in Y_i} rank\left(x_i,
y\right) - 1. (\#eq:Coverage) \]
One Error: It is a metric (see Equation
@ref(eq:OneError)) which determines how many times the best ranked label
given by the classifier is not part of the true label set of the
instance.
\[\label{eq:OneError}
\textit{OneError} = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \left[\!\!\left[
\mathop{argmax}\limits_{y \in Z_i} rank\left(x_i, y\right) \notin Y_i
\right]\!\!\right] . (\#eq:OneError) \]
Ranking Loss: This metric (see Equation
@ref(eq:RankingLoss)) compares each pair of labels in \(L\), computing how many times a relevant
label (member of the true labelset) appears ranked lower than a
non-relevant label. In the equation, \(\overline{Y_i}\) denotes \(L\backslash Y_i\).
\[\label{eq:RankingLoss} \small
\textit{RankingLoss} = \frac{1}{\lvert D\rvert }
\displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{1}{\lvert
Y_i\rvert\lvert \overline{Y_i}\rvert}
\lvert \left\{\left(y_a, y_b\right) \in Y_i \times
\overline{Y_i}: rank\left(x_i, y_a\right) > rank\left(x_i, y_b\right)
\right\}\rvert. (\#eq:RankingLoss) \]
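These ranking based metrics can be sketched from a matrix of scores, one per instance and label, where a higher score means more relevant; the data below are made up:

```r
# Toy scores and true labelsets; rank 1 denotes the best ranked label
scores <- matrix(c(0.9, 0.2, 0.4,
                   0.1, 0.8, 0.6), nrow = 2, byrow = TRUE)
Y <- matrix(c(1, 0, 1,
              0, 1, 0), nrow = 2, byrow = TRUE)

rnk <- t(apply(-scores, 1, rank))  # position of each label in the ranking

# OneError: how often the top ranked label is not a true label
one_error <- mean(sapply(seq_len(nrow(Y)),
                         function(i) Y[i, which.min(rnk[i, ])] == 0))
# Coverage: steps down the ranking needed to cover all true labels
coverage  <- mean(sapply(seq_len(nrow(Y)),
                         function(i) max(rnk[i, Y[i, ] == 1]) - 1))
```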
Regarding the label based metrics, there are two different ways to
aggregate the values of the labels. The macro-averaging approach (see
Equation @ref(eq:MacroB)) computes the metric independently for each
label and then averages the obtained values to get the final measure. On
the contrary, the micro-averaging approach (see Equation
@ref(eq:MicroB)) first aggregates the counters for all the labels and
then computes the metric only once. In the following equations
TP, FP, TN and FN stand for True
Positives, False Positives, True Negatives and
False Negatives, respectively.
\[\begin{aligned}
\label{eq:MacroB}
MacroMetric &= \frac{1}{\lvert L\rvert}
\sum\limits_{l=1}^{\lvert
L\rvert}evalMetric\left(TP_l,FP_l,TN_l,FN_l\right).
\end{aligned} (\#eq:MacroB) \]
\[\begin{aligned}
\label{eq:MicroB}
MicroMetric &= evalMetric\left(\sum\limits_{l=1}^{\lvert
L\rvert}TP_l,\sum\limits_{l=1}^{\lvert
L\rvert}FP_l,\sum\limits_{l=1}^{\lvert
L\rvert}TN_l,\sum\limits_{l=1}^{\lvert L\rvert}FN_l\right).
\end{aligned} (\#eq:MicroB) \]
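The difference between the two aggregation strategies can be seen with a
small numeric sketch, taking Precision as the evaluated metric. The
counter values below are made up for illustration:

```r
# Hypothetical TP/FP counters for three labels; the third label is
# rare and poorly predicted
TP <- c(10, 50, 5)
FP <- c(10, 10, 45)

# Macro-averaging: compute the metric per label, then average
macro_precision <- mean(TP / (TP + FP))           # (0.500 + 0.833 + 0.100) / 3
# Micro-averaging: aggregate the counters first, then compute once
micro_precision <- sum(TP) / (sum(TP) + sum(FP))  # 65 / 130

c(macro_precision, micro_precision)
# 0.4778  0.5000
```

The rare, badly handled label weighs as much as any other in the macro
average, dragging it down, while it barely influences the micro average,
which is dominated by the frequent labels. This is why both averages are
usually reported together.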
All the bipartition based metrics, such as Precision,
Recall or FMeasure, can be computed as label based
measures following these two approaches. This category also includes
some ranking based metrics, such as MacroAUC (see
Equation @ref(eq:MacroAUC)) and MicroAUC (see Equation
@ref(eq:MicroAUC)).
\[\label{eq:MacroAUC}
\begin{split}
MacroAUC = \frac{1}{\lvert L\rvert} \sum\limits_{l=1}^{\lvert
L\rvert} \frac{\lvert \left\{x', x'' : rank\left(x',
y_l\right) \ge rank\left(x'', y_l\right), \left(x',
x''\right) \in X_l \times \overline{X_l} \right\}\rvert}{\lvert
X_l\rvert\lvert \overline{X_l}\rvert},\\
X_l = \left\{ x_i : y_l \in Y_i\right\},\ \overline{X_l} =
\left\{x_i : y_l \notin Y_i\right\}.
\end{split} (\#eq:MacroAUC) \]
\[\label{eq:MicroAUC}
\begin{split}
MicroAUC = \frac{\lvert \left\{x', x'', y',
y'' : rank\left(x', y'\right) \ge rank\left(x'',
y''\right), \left(x', y'\right) \in S^+ ,
\left(x'', y''\right) \in S^- \right\}\rvert}{\lvert
S^+\rvert\lvert S^-\rvert},\\
S^+ = \left\{ \left(x_i, y\right) : y \in Y_i\right\},\ S^- =
\left\{ \left(x_i, y\right) : y \notin Y_i\right\}.
\end{split} (\#eq:MicroAUC) \]
When the partition of the MLD for which the predictions have been
obtained, along with the predictions themselves, is given to the
mldr_evaluate function, a list of 20 measures is returned.
For instance:
> # Get the true labels in emotions
> predictions <- as.matrix(emotions$dataset[, emotions$labels$index])
> # and introduce some noise
> predictions[sample(1:593, 100), sample(1:6, 100, replace = TRUE)] <-
+ sample(0:1, 100, replace = TRUE)
> # then evaluate the predictive performance
> res <- mldr_evaluate(emotions, predictions)
> str(res)
List of 20
$ Accuracy : num 0.917
$ AUC : num 0.916
$ AveragePrecision: num 0.673
$ Coverage : num 2.71
$ FMeasure : num 0.952
$ HammingLoss : num 0.0835
$ MacroAUC : num 0.916
$ MacroFMeasure : num 0.87
$ MacroPrecision : num 0.829
$ MacroRecall : num 0.915
$ MicroAUC : num 0.916
$ MicroFMeasure : num 0.872
$ MicroPrecision : num 0.834
$ MicroRecall : num 0.914
$ OneError : num 0.116
$ Precision : num 0.938
$ RankingLoss : num 0.518
$ Recall : num 0.914
$ SubsetAccuracy : num 0.831
$ ROC :List of 15
...
> plot(res$ROC, main = "ROC curve for emotions") # Plot ROC curve
If the pROC (Robin et al. 2011) package is available, this
list will include non-null AUC (Area Under the ROC Curve)
measures and also an element called ROC. The latter holds
the information needed to plot the ROC (Receiver Operating
Characteristic) curve, as shown in the last line of the previous
example. The result would be a plot similar to that in Figure 3.
The mldr user interface
This package provides the user with a web-based graphical user
interface built on top of the shiny package, allowing the
measures, plots and other results to be explored interactively. Once
mldr is loaded, this GUI can be launched from the
R console with a single command:
> mldrGUI()
This will cause the user’s default browser to start or open a new tab
displaying the GUI, which is organized into a tab bar and a content
pane. The tab bar allows switching between sections, so that different
information is shown in the pane.
The GUI will initially display the Main section, as shown in Figure
4. It contains controls to select an
MLD from those available, and to load a new one by uploading its ARFF
and XML files onto the application. On the right side, several plots are
stacked. These show the number of attributes of each type (numeric,
character or label), the number of labels per instance, the number of
instances associated with each label and the number of instances
associated with each labelset. Each plot can be saved as an image on the
file system. Right
below these graphics, some tables containing basic measures are shown.
The first one lists generic measures related to the entire MLD, and is
followed by measures specific to labels, such as Card or
Dens. The last table shows a summary of measures for
labelsets.
Figure 4: Main page of the shiny based
graphical user interface.
The Labels section contains a table enumerating each label of the MLD
with its relevant details and measures: its index in the attribute list,
its count and frequency, its IRLbl and its SCUMBLE.
Labels in this table can be reordered using the headers, and filtered by
the Search field. Furthermore, if the list is longer than the number
specified in the Show field, it will be split into several pages. The
data shown in all tables can be exported to files in several formats. On
the right side, a plot shows the amount of instances that have each
label. This is an interactive plot, and allows the range of labels to be
manipulated.
Since relationships between labels can influence how classifiers behave
on new data, studying labelsets is important in multilabel classification.
Thus, the section named Labelsets provides information about them,
listing each labelset along with its count. This list can be filtered
and split into pages as well, and is accompanied by a bar plot showing
the count of instances per labelset.
In order to obtain statistical measures about input attributes, the
Attributes section organizes all of them into a paged table, displaying
each attribute’s type along with data or measures appropriate to that
type. If the attribute is numeric, a table containing its minimum and
maximum values, its quartiles and its mean is shown. If, by contrast,
the attribute takes values from a finite set, each possible value is
shown along with its count in the MLD.
Lastly, concurrence among labels has proven to be a factor to take into
account when applying preprocessing techniques to MLDs. For this reason,
the Concurrence section provides an easy way of visualizing concurrence
among labels (see Figure 5): a label concurrence plot displays the
labels selected in the left-side table, with their co-occurrences
represented by bands in the circle. By default, the ten
labels with highest SCUMBLE are selected. The user is able to
select and deselect other labels by clicking their corresponding row on
the table.