Abstract
Learning classification tasks in which each instance is associated with one or more labels is known as multi-label learning. The implementations of multi-label algorithms, developed by different researchers, differ in several respects, such as input/output format, internal functions, and programming language, to mention just a few. As a result, current machine learning tools include only a small subset of multi-label decomposition strategies. The utiml package is a framework for the application of classification algorithms to multi-label data. Like the well-known MULAN library used with Weka, it provides a set of multi-label procedures such as sampling methods, transformation strategies, threshold functions, pre-processing techniques and evaluation metrics. The package was designed to allow users to easily perform complete multi-label classification experiments in the R environment. This paper describes the utiml API and illustrates its use in different multi-label classification scenarios.
Multi-label classification (MLC) is a classification task where an instance can be simultaneously classified in more than one of the existing classes. Labeled data extracted from several domains, like text, web pages, multimedia (audio, image, videos), and biology are intrinsically multi-labeled. Additionally, the number of application domains with MLC data is growing fast.
Many current, real-world data science applications are MLC by nature. They are problems from very diverse domains, like labeling newspaper articles by subject and classification of proteins according to their functions. MLC algorithms have been successfully used in these and other MLC tasks (Diplaris et al. 2005). In a recent application, MLC algorithms were used to recommend food truck cuisines (Rivolli, Parker, and de Carvalho 2017), assuming that a person can have more than one cuisine preference, and with the same level of preference.
Despite the growing relevance of MLC, there is a lack of comprehensive and easy-to-use MLC tools for the R environment. A tool frequently used in MLC experiments is MULAN (G. Tsoumakas et al. 2011), which is a Java library built on top of Weka (Hall et al. 2009) to allow Weka users to deal with MLC data. Its popularity in the research community can be attributed to its ease of use and to the number and variety of its functionalities. The MLC alternative for Python users is scikit-multilearn (Szymański 2017), which provides a set of MLC algorithms and an interface for the MULAN library. Although other simpler tools, like MEKA (Jesse Read et al. 2016) and general data mining software (Gibaja and Ventura 2015), include good functionalities to deal with MLC tasks, they address few MLC features and are not available in R.
It is important to mention that there are packages that offer some level of support for MLC in R. The most complete is the mldr package, an exploratory tool for the manipulation and analysis of MLC datasets (Charte and Charte 2015). Although it does not contain MLC strategies, it supports the ARFF variation for MLC data, largely used for data mining and machine learning (ML) experiments, and has useful features, such as dataset characterization, MLC evaluation measures, and a rich user interface for the data exploration.
Some works use the mlr package, which was not specifically designed for MLC. As a result, it provides only a few multi-label strategies (Probst et al. 2017) and does not support the MLC ARFF format. In fact, it is a general purpose package, with an interface to more than one hundred algorithms that supports several ML tasks (Bischl et al. 2016). Another related package, MLPUGS, is a simple MLC package that contains only the implementation of the classifier chains (CC) strategy (Jesse Read et al. 2009).
Although the previous packages make it easier to perform some procedures related to MLC learning, their adoption in MLC experiments requires more effort from the developer/researcher than MULAN, available for Weka users. This motivated the authors to design utiml, a more comprehensive, specific, easy-to-use, and extensible solution. The main features of the utiml package include:
Pre-processing techniques: a set of techniques for the preparation and pre-processing of MLC data to be used in experiments. These techniques deal with simple tasks, like removal of predictive attributes, instances and labels, replacement of nominal attribute values by numerical values, and data normalization.
Sampling: a set of methods used to split MLC data through the holdout and k-fold methodologies. Random or stratified strategies can be used for data partitioning.
Classification/Ranking: the main MLC strategies. The transformation strategies support several base algorithms and the result can be seen as bipartition, probability/score, and ranking.
Threshold: score-based and ranking-based threshold functions to be employed after the label prediction, so that bipartition values can be changed.
Evaluation: traditional MLC evaluation measures and an MLC confusion matrix for the summarization of classification results.
This paper describes the main aspects and resources of the utiml package. The current version is 0.1.4 and an updated list with all resources available will be maintained in the vignette document and the reference manual. The following section provides a brief review of MLC learning. Next, the package API is detailed, its resources are presented and some illustrative examples are provided. Finally, the main issues regarding the package use are highlighted in the summary section.
MLC tasks have attracted growing attention in the ML community (de Carvalho and Freitas 2009; Grigorios Tsoumakas, Katakis, and Vlahavas 2010; Gibaja and Ventura 2014). While in multi-class classification only a single class label is predicted, in MLC, more than one class label can be simultaneously predicted. Just as multi-class classification can be seen as a generalization of binary classification, which restricts the number of classes to two, MLC can be seen as a generalization of multi-class classification, which restricts the number of predicted classes to one (de Carvalho and Freitas 2009). The main MLC approaches are prediction of multiple labels, label ranking, and multi-label ranking (Grigorios Tsoumakas, Katakis, and Vlahavas 2010).
Multi-Label Classification (MLC), the most common task (Grigorios Tsoumakas, Katakis, and Vlahavas 2010), induces a predictive model \(h(x) \to Y\) from a set of training data, which later assigns one or more labels to each new example. This task can be formally defined as: let \(D\) be a set of labeled instances \(E\), such that \(D = \left\{E_1, E_2, ..., E_n\right\}\). Every labeled instance \(E_i = (x_i, Y_i)\) is composed of \(x_i = (x_{i1}, x_{i2}, ..., x_{id})\), which describes its position in a \(\mathbb{R}^d\) input space, and \(Y_i \subseteq L \mid L = \left\{\lambda_1, \lambda_2, ..., \lambda_q\right\}\), which describes a position in a \(\left\{0,1\right\}^q\) output space.
The Label Ranking (LR) task can be characterized by a function \(f(x, \lambda_i)\), which, for each class label, outputs a score value in the interval \([0.0, 1.0]\), indicating the relevance, confidence, or probability of instance \(x\) belonging to the class whose label is \(\lambda_i\). The higher the score value, the better the ranking position. While MLC predicts bipartitions and LR predicts scores, Multi-label Ranking (MLR) generates both. Since MLC can be derived from the LR formulation (Gibaja and Ventura 2015), if a strategy can be used in the LR task, it can also be used in the two other tasks.
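As a minimal illustration of how a bipartition and a ranking can both be derived from label scores, the sketch below uses base R only and makes no utiml calls. The score matrix, the 0.5 cut-off, and all object names are assumptions made for this example.

```r
# Hypothetical scores f(x, lambda_j) for 3 instances and q = 4 labels.
scores <- matrix(
  c(0.9, 0.2, 0.6, 0.1,
    0.3, 0.8, 0.4, 0.7,
    0.1, 0.2, 0.3, 0.4),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("x", 1:3), paste0("lambda", 1:4))
)

# MLC view: a bipartition in {0,1}^q, obtained here with an assumed
# cut-off of 0.5 applied to every score.
bipartition <- (scores >= 0.5) * 1L

# LR view: rank 1 goes to the label with the highest score in each row.
ranking <- t(apply(scores, 1, function(s) rank(-s, ties.method = "first")))

print(bipartition)
print(ranking)
```

Note that instance x3 receives no label under this cut-off, which is the empty-prediction situation that a threshold choice has to account for.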
These models can be obtained by two approaches (Grigorios Tsoumakas, Katakis, and Vlahavas 2010), problem transformation and algorithm adaptation. Problem transformation converts the original MLC task into a set of binary or multi-class classification subtasks. Afterwards, any classification algorithm, here called base algorithm, can be used to induce models for the subtasks. In the algorithm adaptation approach, the multi-label support is embedded into the algorithm structure. Thus, while transformation fits data to algorithms, adaptation fits algorithms to data (M.-L. Zhang and Zhou 2014).
The transformation approach can be performed in three different ways: binary, pairwise, and powerset. Binary transformation generates at least one dataset per label, as in the one-versus-all multiclass strategy. Pairwise transformation, instead, creates one dataset for each pair of labels, similarly to one-versus-one multiclass strategy. Finally, powerset is a multi-class transformation that uses labelsets as classes. The adaptation approach, on the other hand, modifies conventional ML algorithms, like Decision Tree Induction Algorithms (DT), K-Nearest Neighbors (KNN), Random Forest (RF), and Support Vector Machines (SVM).
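The three transformations can be sketched on a toy dataset with base R alone. The data values and object names below are made up for illustration, and the pairwise filter (keeping only instances in which exactly one label of the pair is relevant) is a common convention assumed here, not something stated in the text above.

```r
# Toy data: 5 instances, 2 attributes, 3 labels (all values are made up).
X <- data.frame(a1 = c(0.1, 0.4, 0.5, 0.9, 0.3),
                a2 = c(1.0, 0.2, 0.7, 0.6, 0.8))
Y <- data.frame(y1 = c(1, 0, 1, 0, 1),
                y2 = c(0, 1, 1, 0, 0),
                y3 = c(0, 0, 1, 1, 0))

# Binary transformation: one binary dataset per label.
binary_sets <- lapply(names(Y), function(label) {
  cbind(X, class = factor(Y[[label]], levels = c(0, 1)))
})
names(binary_sets) <- names(Y)

# Pairwise transformation: one dataset per pair of labels. Assumption:
# only instances with exactly one relevant label of the pair are kept.
pairs <- combn(names(Y), 2, simplify = FALSE)
pairwise_sets <- lapply(pairs, function(p) {
  keep <- Y[[p[1]]] != Y[[p[2]]]
  cbind(X[keep, , drop = FALSE],
        class = factor(ifelse(Y[[p[1]]][keep] == 1, p[1], p[2]), levels = p))
})

# Powerset transformation: each distinct labelset becomes one class.
labelset <- apply(Y, 1, paste, collapse = "")
powerset_set <- cbind(X, class = factor(labelset))
```

Any conventional classifier could then be trained on each element of `binary_sets` or `pairwise_sets`, or on `powerset_set` as a single multi-class problem.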
Other steps required for the application of ML algorithms need to be adapted to deal with MLC tasks. For example, stratified sampling for MLC data must take into account multiple targets and the predictive performance evaluation must consider situations like partially correct results and ranking accuracy. A complete overview of the alternatives to deal with these issues can be seen in M.-L. Zhang and Zhou (2014) and Gibaja and Ventura (2015).
The predictive performance of MLC tasks can be strongly affected by the use of data pre-processing techniques. For this purpose, utiml relies on the mldr package (Charte and Charte 2015), which provides support for data pre-processing. Moreover, when utiml is installed/loaded, the mldr package is automatically installed/loaded. In particular, mldr supports the MLC ARFF format, which has an additional XML file describing the label columns.
By default, the mldr package handles categorical data as
"character", instead of "factor", which is not
supported by the implementation of some traditional machine learning
algorithms available in R, like Random Forest from the randomForest
package. To address this limitation, the mldata function
converts all text columns of an "mldr" dataset to factors.
For example, the function mldata should be used to load the
flags dataset, which contains categorical attributes:
> flags <- mldata(mldr("flags"))
After a dataset is loaded, pre-processing techniques can be applied
to it. Table 1 shows the pre-processing
techniques available in the utiml package. All these functions
receive an "mldr" dataset as argument and return a
pre-processed version of this dataset.
| Pre-processing function | Description |
|---|---|
| fill_sparse_mldata(mdata) | Exchanges the NA values present in the dataset to 0 or "", according to the attribute type. |
| normalize_mldata(mdata) | Re-scales all numerical attribute values to values between 0 and 1 according to the min-max transformation. The lowest value is modified to 0.0 and the highest value is converted to 1.0. |
| remove_attributes(mdata, attributes) | Removes the specified attributes from the dataset. |
| remove_labels(mdata, labels) | Removes the specified labels from the dataset. |
| remove_unique_attributes(mdata) | Removes from the dataset attributes whose values are the same for all instances. |
| remove_unlabeled_instances(mdata) | Removes from the dataset instances without class labels. |
| remove_skewness_labels(mdata, t) | Removes from the dataset highly infrequent or highly frequent labels, according to a specific threshold value. The threshold \(t\) indicates the minimum number of positive and negative instances associated with each label. |
| replace_nominal_attributes(mdata) | Replaces categorical attributes by binary attributes. An attribute with \(n\) different values will be mapped to \(n - 1\) new columns containing binary values. |
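To make the min-max transformation and the \(n - 1\) binary mapping from Table 1 concrete, here is a small base R sketch. It is not the utiml implementation: the helper name `min_max`, the handling of constant columns, and the use of `model.matrix` as a stand-in for the nominal replacement are all assumptions made for illustration.

```r
# Min-max rescaling of one numeric column to [0, 1].
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Assumption: a constant column is mapped to 0 to avoid dividing by zero.
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(10, 15, 20, 30)
scaled <- min_max(x)   # lowest value becomes 0, highest becomes 1

# A nominal attribute with n = 3 values mapped to n - 1 = 2 binary columns.
# model.matrix() with default treatment contrasts drops the first level.
colour <- factor(c("red", "green", "blue", "green"))
dummies <- model.matrix(~ colour)[, -1, drop = FALSE]

print(scaled)
print(dummies)
```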
The utiml package also supports the main methodologies for
data sampling, as shown in Table 2. The
holdout and k-fold sampling can partition a dataset randomly or in a
stratified way. The strategy is selected by a parameter named
method, which determines the sampling algorithm that
creates the partitions. The accepted values are
"random", "iterative", and
"stratified", where the latter two are different
stratification options (Sechidis,
Tsoumakas, and Vlahavas 2011). The "iterative" process stratifies
an MLC dataset considering each label independently, while
"stratified" is based on the different combinations of
labels, also known as labelsets.
| Sampling function | Description |
|---|---|
| create_holdout_partition(mdata, partitions, method) | Splits the data into at least two distinct parts. The second parameter defines the name and size of the partitions and method defines the type of sampling. |
| create_kfold_partition(mdata, k, method) | Creates an object that contains the \(k\) distinct parts of the dataset using the method for splitting the folds. It should be used in combination with partition_fold(object, fold), which provides the training and testing data for a specific fold. The parameter object is the result of create_kfold_partition. |
| create_random_subset(mdata, instances, attributes, replacement) | Creates a random subset of the dataset based on the proportion of instances and attributes. When replacement=TRUE, the same instance can appear more than once in the training data. |
| create_subset(mdata, rows, cols) | Creates a specific subset of the dataset based on the instances (rows) and attributes (cols) specified. |
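The idea behind labelset-based stratification can be sketched in base R as follows. This is only an illustration under stated assumptions, not the algorithm used by utiml: the toy label matrix, the 70/30 split, and the per-labelset sampling rule are all made up for the example.

```r
set.seed(42)

# Toy label matrix: 40 instances, 2 labels (values are made up).
Y <- cbind(y1 = rep(c(1, 0), times = 20),
           y2 = rep(c(1, 1, 0, 0), times = 10))

# Each distinct combination of labels is one labelset.
labelset <- apply(Y, 1, paste, collapse = "")

# Within every labelset, send 70% of its instances to the training part,
# so that both parts keep the same labelset proportions.
train_idx <- unlist(lapply(split(seq_len(nrow(Y)), labelset), function(idx) {
  n_train <- (length(idx) * 7) %/% 10
  idx[sample.int(length(idx), size = n_train)]
}), use.names = FALSE)
test_idx <- setdiff(seq_len(nrow(Y)), train_idx)

print(table(labelset[train_idx]))
print(table(labelset[test_idx]))
```

A purely random split gives no such guarantee, which matters most when some labelsets are rare.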
These techniques were designed to improve the mldr package and to simplify the data preparation for the learning step. Concerning the analysis of the MLC data, utiml does not provide additional resources in this current version. However, it is possible to use the mldr package, which enables the understanding and exploration of several data aspects through an interactive interface (Charte and Charte 2015).
The utiml package also provides two MLC datasets:
toyml, a synthetic dataset generated by the Mldatagen tool
(Tomás et al. 2014); and
foodtruck, a dataset in which several food truck cuisines
are mapped as labels (Rivolli, Parker, and de
Carvalho 2017).
The classification strategies are the heart of the utiml
package. Table 3 shows the strategies
available in the current version of the package. Some of the implemented
strategies, such as brplus, ctrl,
dbr, lift, prudent and
rdbr were not found by the authors in other tools.
| Strategy function | Description | Approach\(^a\) | Reference |
|---|---|---|---|
| baseline | Baseline | - | Metz et al. (2012) |
| br | Binary Relevance | BR | Grigorios Tsoumakas, Katakis, and Vlahavas (2010) |
| brplus | BR+ | BR, STA | Cherman, Metz, and Monard (2012) |
| cc | Classifier Chains | BR, CC | Jesse Read et al. (2009) |
| clr | Calibrated Label Ranking | PW | Brinker, Fürnkranz, and Hüllermeier (2006) |
| ctrl | ConTRolled Label correlation | BR, ENS | Li and Zhang (2014) |
| dbr | Dependent Binary Relevance | BR, STA | Montañes et al. (2014) |
| ebr | Ensemble of Binary Relevance | BR, ENS | Jesse Read et al. (2009) |
| ecc | Ensemble of Classifier Chains | BR, CC, ENS | Jesse Read et al. (2009) |
| eps | Ensemble of Pruned Set | ENS, PS | Jesse Read, Pfahringer, and Holmes (2008) |
| homer | Hierarchy Of Multi-label classifiER | HIE | Grigorios Tsoumakas, Katakis, and Vlahavas (2008) |
| lift | Learning with Label specIfic FeaTures | BR, CLU | M.-L. Zhang and Wu (2015) |
| lp | Label Powerset | PS | Grigorios Tsoumakas and Katakis (2007) |
| mbr | Meta-BR, 2BR or stacking | BR, STA | Grigorios Tsoumakas et al. (2009) |
| mlknn | Multi-label kNN | AD | M.-L. L. Zhang and Zhou (2007) |
| ns | Nested Stacking | BR, CC | Senge, del Coz, and Hüllermeier (2013) |
| ppt | Pruned Problem Transformation | PS | Jesse Read, Pfahringer, and Holmes (2008) |
| prudent | PRUned and confiDENT Stacking | BR, STA | Alali and Kubat (2015) |
| ps | Pruned Set | PS | J. Read (2008) |
| rakel | Random k-labelsets | ENS, PS | Grigorios Tsoumakas and Vlahavas (2007) |
| rdbr | Recursive Dependent Binary Relevance | BR, ENS, STA | Rauber et al. (2014) |
| rpc | Ranking by Pairwise Comparison | PW | Hüllermeier et al. (2008) |

\(^a\) AD = Adaptation; BR = Binary transformation; CC = Chain of classifiers; CLU = Clustering based; ENS = Ensemble; HIE = Hierarchy; PS = Powerset transformation; PW = Pairwise transformation; STA = Stacking
Transformation strategies build the multi-label models by using an ML
base algorithm. The base.algorithm parameter defines the
base algorithm employed to create the internal models. Table 4 shows the ML algorithms currently supported by
the package. Their use requires additional packages, as indicated in
the column "R function Called". For example, the "C5.0" algorithm
requires the C50 package
to be installed. Only the "MAJORITY" and "RANDOM"
algorithms require no additional packages.
| base.algorithm value | Description | R function Called |
|---|---|---|
| "C5.0" | C5.0 Decision Trees | C50::C5.0 |
| "CART" | Classification and regression trees | rpart::rpart |
| "KNN" | K Nearest Neighbor | kknn::kknn |
| "NB" | Naive Bayes | e1071::naiveBayes |
| "RF" | Random Forest | randomForest::randomForest |
| "SMO" | Sequential Minimal Optimization | RWeka::SMO |
| "SVM" | Support Vector Machine | e1071::svm |
| "XGB" | eXtreme Gradient Boosting | xgboost::xgboost |
| "MAJORITY" | Majority class prediction | - |
| "RANDOM" | Random prediction | - |
The J48 algorithm has no support for task
parallelization.
The arguments of the transformation strategies follow the pattern:
mdata: an "mldr" dataset
object.
base.algorithm: a base algorithm, as listed in Table
4.
additional strategy parameters: specific parameters for
each strategy. While the BR strategy contains no additional parameters,
the ensemble ECC receives 4 specific parameters, namely m,
subsample, attr.space and
replacement.
...: extra parameters used by the
base.algorithm selected. As illustration, if
base.algorithm = "SVM", the extra parameters can be those
defined in the svm function of the e1071
package, such as kernel, gamma, cost, among
others.
cores: number of cores used for the parallelization
of the training phase. Note that some classification strategies, such as
lp, ignore this parameter because their tasks cannot be
parallelized.
seed: a seed that ensures reproducibility. This is
particularly important when the task is parallelized. If
cores = 1, the effect of seed is similar to that
of set.seed(seed). However, if cores is
greater than 1, calling set.seed(seed) beforehand does not
guarantee that the same result is obtained, since the task is
performed in parallel.
After an MLC model is created, it can be applied to new
data through the S3 predict method. The arguments of
predict are:
object: a multi-label classifier.
newdata: a "matrix",
"data.frame" or "mldr" object, containing the
data to be classified.
additional model parameters: specific parameters for
each model. For example, the vote scheme to be used in the
ecc prediction function can be defined.
probability: a logical value that indicates if the
prediction result should be probability/score or bipartition. If
TRUE, a probability result is returned; otherwise, the
bipartition is obtained. The result can be changed, as observed
next.
...: extra parameters based on the
predict method related to the base.algorithm
selected.
cores: number of cores used for the parallelization
of the prediction phase. Some models, like CC, ignore this parameter
because the tasks cannot be parallelized.
seed: a seed that ensures reproducibility. This is
particularly important when the task is parallelized and the base
algorithm is not deterministic.
The prediction result is an "mlresult" type object and
can be used directly as a matrix, where each column is a label and each
row is an instance. To change the type of result to bipartition,
probability/score or a ranking matrix, the functions
as.bipartition, as.probability, and
as.ranking, respectively, can be used.
Threshold functions adjust the bipartition result according to the score/probabilities predicted by the predictive models. In MLC learning, these functions can be score-based or rank-based, depending on the type of data used to define the threshold values. A single threshold value for all labels is named global threshold, the use of one value per label is named label-wise, and the use of one value per instance is named instance-wise (Al-Otaibi, Flach, and Kull 2014).
Table 5 shows the threshold functions
available in the utiml package. All of them receive a
probability/score matrix or an "mlresult" object as input
and return a new "mlresult" object with different
bipartitions as output. The only exception is
scut_threshold, which returns threshold values, instead of
bipartitions, and should be combined with the
fixed_threshold function. Additionally, the
subset_correction can be used as a threshold function (Senge, del Coz, and Hüllermeier 2013). It
changes the bipartition based on the labelsets present in the training
data and outputs only the known labelsets.
| Threshold function | Description | Approach |
|---|---|---|
| fixed_threshold(prediction, threshold) | Applies a fixed global or label-wise threshold. | score-based |
| lcard_threshold(prediction, cardinality) | Applies an instance-wise threshold using the cardinality measure. | rank-based |
| mcut_threshold(prediction) | Applies an instance-wise threshold and selects the subset of labels of the highest interval between two sorted scores. | score-based |
| pcut_threshold(prediction, ratio) | Applies a global or label-wise threshold using the ratio value to define the proportion of instances that will be relevant. | score-based |
| rcut_threshold(prediction, k) | Applies an instance-wise threshold and defines the \(k\) labels with highest scores as relevant. | rank-based |
| scut_threshold(prediction, expected, loss.function) | Returns a label-wise threshold using a loss function that minimizes the difference between the value predicted and the expected prediction value. | score-based |
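The difference between global, label-wise, and instance-wise rank-based thresholds can be illustrated with a few lines of base R. The sketch below does not call the utiml threshold functions; the scores, cut-off values, and helper names are assumptions chosen for the example.

```r
# Hypothetical scores for 2 instances and 3 labels.
scores <- rbind(a = c(y1 = 0.9, y2 = 0.6, y3 = 0.2),
                b = c(y1 = 0.4, y2 = 0.3, y3 = 0.1))

# Global score-based threshold: the same cut-off for every label.
fixed_global <- (scores >= 0.5) * 1L

# Label-wise score-based threshold: one cut-off per label (column).
cutoffs <- c(y1 = 0.5, y2 = 0.7, y3 = 0.15)
fixed_labelwise <- sweep(scores, 2, cutoffs, ">=") * 1L

# Rank-based, instance-wise threshold: the k best-scored labels per row
# are marked as relevant, whatever their absolute scores.
top_k <- function(s, k) as.integer(rank(-s, ties.method = "first") <= k)
rank_based <- t(apply(scores, 1, top_k, k = 1))
colnames(rank_based) <- colnames(scores)

print(fixed_global)
print(fixed_labelwise)
print(rank_based)
```

Instance b gets no label from either score-based rule but always gets one from the rank-based rule, which is one practical reason to choose between the two families.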
Finally, utiml also supports the evaluation of MLC models.
The multilabel_evaluate and
multilabel_confusion_matrix functions can be used during
the evaluation. The first calculates the traditional evaluation measures
also available in mldr and MULAN, whereas the second generates
a multi-label confusion matrix ("mlconfmat" object)
detailing labels and instances.
The multilabel_evaluate function receives an
"mlresult" or an "mlconfmat" object and the
desired evaluation measures. One or more measures, as well as one or more
groups of measures, can be specified. Figure 1
shows the measure names currently supported. A complete
review of MLC evaluation measures can be found in M.-L. Zhang and Zhou (2014) and Gibaja and Ventura (2015). Moreover, if the
parameter labels=TRUE is set, the function returns a list that
contains both the multi-label results and the detailed per-label results.
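For readers unfamiliar with bipartition-based measures, the following base R sketch computes three of them by hand on made-up data. It follows the usual textbook definitions and is not taken from the utiml source; the matrices and variable names are assumptions.

```r
# Toy ground truth and predicted bipartitions (values are made up):
# rows are instances, columns are labels.
truth <- rbind(c(1, 0, 1), c(0, 1, 0), c(1, 1, 0), c(0, 0, 1))
pred  <- rbind(c(1, 0, 0), c(0, 1, 0), c(1, 0, 1), c(0, 0, 1))

# Hamming loss: fraction of instance-label pairs predicted incorrectly.
hamming_loss <- mean(truth != pred)

# Subset accuracy: fraction of instances whose whole labelset is correct.
subset_accuracy <- mean(rowSums(truth != pred) == 0)

# Example-based F1: harmonic mean of precision and recall per instance,
# averaged over instances (this toy data has no empty label sets).
f1_per_instance <- 2 * rowSums(truth & pred) / (rowSums(truth) + rowSums(pred))
example_f1 <- mean(f1_per_instance)

print(c(hamming_loss = hamming_loss,
        subset_accuracy = subset_accuracy,
        example_f1 = example_f1))
```

Subset accuracy gives no credit for the partially correct first and third instances, whereas Hamming loss and F1 do, which is why several measures are usually reported together.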
The utiml package uses the options function to
customize some default parameters. For example, the default base
algorithm for all transformation strategies is "SVM". The
utiml.base.algorithm option can be used to change this
parameter value. Table 6 shows the option
names, a brief description of each option, and their
default values. To illustrate the setting of the options, the following
code defines "Random Forest" as the default
base algorithm and sets the default number of cores to 8.
> options(utiml.base.algorithm = "RF", utiml.cores=8)
| Option parameter values | Description | Default value |
|---|---|---|
| utiml.base.algorithm | Default base algorithm used in the transformation strategies. | "SVM" |
| utiml.cores | Default number of cores used to parallelize the tasks. | 1 |
| utiml.seed | Default seed used by the MLC strategies. | NA |
| utiml.use.probs | Default type of the expected prediction results. If TRUE, the expected prediction is in the probability/score format, otherwise, a bipartition prediction is expected. | TRUE |
| utiml.empty.prediction | Default option concerning empty predictions. If TRUE, predictions can contain instances without labels, otherwise, the most important label is considered relevant and no empty predictions are obtained. | FALSE |
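Because these are ordinary R options, they can be set, read, and restored with base R alone. The sketch below only exercises `options()` and `getOption()`; how utiml consumes the values internally is not shown, and the fallback defaults passed to `getOption()` simply mirror Table 6.

```r
# Set two options and keep the previous values so they can be restored.
old <- options(utiml.base.algorithm = "RF", utiml.cores = 8)

# Read the values back; the second argument is the fallback used when
# the option is not set (here mirroring the defaults listed in Table 6).
algorithm <- getOption("utiml.base.algorithm", "SVM")
cores <- getOption("utiml.cores", 1)
seed <- getOption("utiml.seed", NA)

print(list(algorithm = algorithm, cores = cores, seed = seed))

# Restore whatever was configured before this example ran.
options(old)
```

Saving the return value of `options()` and restoring it afterwards keeps an experiment script from silently changing the defaults of later code in the same session.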
The utiml.empty.prediction option defines whether the
MLC strategies can predict no labels for one or more instances. Among
the alternatives to avoid an empty prediction (Liu and Chen 2015), utiml outputs the
labels with the highest probability/score. It must be observed that this
option may directly interfere with the result of the bipartition
evaluation measures. Thus, it must be set according to the
characteristics of the experiment being carried out.
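One way to picture the effect of disallowing empty predictions is the base R sketch below. It is an assumed repair rule written for illustration (mark the top-scored label when no label passes the cut-off), not code taken from utiml, and the scores are made up.

```r
# Hypothetical scores for 2 instances and 3 labels.
scores <- rbind(a = c(y1 = 0.7, y2 = 0.2, y3 = 0.6),
                b = c(y1 = 0.3, y2 = 0.4, y3 = 0.1))

# With a 0.5 cut-off, instance b receives no label at all.
bipartition <- (scores >= 0.5) * 1L

# Assumed repair rule: for rows without labels, mark the label with the
# highest score as relevant so that no prediction is empty.
empty <- rowSums(bipartition) == 0
best <- max.col(scores[empty, , drop = FALSE], ties.method = "first")
repaired <- bipartition
repaired[cbind(which(empty), best)] <- 1L

print(bipartition)
print(repaired)
```

Since the repair adds one positive prediction per empty row, measures such as precision and Hamming loss can move in either direction, which is why the setting has to match the goals of the experiment.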
The toyml dataset was used in the examples illustrated
in this section. As toyml has two irrelevant attributes
("iatt8" and "iatt9") and one redundant
("ratt10") attribute, the pre-processing
remove_attributes function can be applied to remove
them.
> new.toyml <- remove_attributes(toyml, c("iatt8", "iatt9", "ratt10"))
> pre.process <- function (mdata) {
+ aux <- remove_skewness_labels(mdata, 5) # Remove infrequent labels (less than 5)
+ aux <- remove_unlabeled_instances(aux) # Remove instances without labels
+ aux <- remove_unique_attributes(aux) # Remove constant attributes
+ return(aux) # Return the pre-processed dataset
+ }
As toyml is already normalized and the dataset has a
small number of instances, no other pre-processing technique is
required. Thus, the pre.process function has no effect if
applied in this case. For other datasets, the same procedure can be
useful for their preparation for MLC experiments. Two scenarios that
illustrate the use of utiml for the development of MLC
experiments are presented next. Finally, a simple experimental analysis
is performed using the foodtruck dataset, illustrating the
use of the package in a more realistic scenario.
This example shows an MLC experiment using holdout, in which 70% of the dataset instances are used for training and 30% for testing. A BR model that uses "Random Forest" as a base algorithm is induced and applied to the test instances. Next, predictions are assessed using MLC evaluation measures.
> set.seed(123)
> ds <- create_holdout_partition(new.toyml, c(train=0.7, test=0.3))
> model <- br(ds$train, "RF")
> predictions <- predict(model, ds$test)
> results <- multilabel_evaluate(ds$test, predictions, c("example-based", "macro-F1"))
> round(results, 4)
## accuracy F1 hamming-loss macro-F1 precision recall subset-accuracy
## 0.6444 0.7411 0.1933 0.4015 0.7833 0.7722 0.3000
An MLC baseline can be included among the strategies being
experimentally compared. In the following code, the general baseline
(Metz et al. 2012) is induced. A subtle
difference is observed in the "hamming-loss" and
"F1" measures in favor of the BR model. The small number of
labels and the very common combinations of them found in
toyml favor the general baseline.
> base.preds <- predict(baseline(ds$train, "general"), ds$test)
> base.res <- multilabel_evaluate(ds$test, base.preds, c("hamming-loss", "F1"))
> round(base.res, 4)
## F1 hamming-loss
## 0.7311 0.2000
In both examples, the test set predictions compose an
"mlresult" object and, by default, show the
score/probability produced by the base algorithm. For example:
> head(predictions)
## y1 y2 y3 y4 y5
## 23 0.336 0.640 0.178 0.922 0.122
## 19 0.106 0.860 0.366 0.670 0.280
## 62 0.094 0.776 0.526 0.688 0.090
## 1 0.196 0.618 0.204 0.758 0.162
## 67 0.090 0.964 0.210 0.700 0.234
## 92 0.060 0.856 0.162 0.598 0.454
The as.bipartition and as.ranking functions
can be used to change, respectively, the probability/score to a
bipartition matrix or to the ranking values, as illustrated next.
Optionally, a threshold function can be applied to change the
bipartitions.
> head(as.bipartition(predictions))
## y1 y2 y3 y4 y5
## 23 0 1 0 1 0
## 19 0 1 0 1 0
## 62 0 1 1 1 0
## 1 0 1 0 1 0
## 67 0 1 0 1 0
## 92 0 1 0 1 0
> head(as.ranking(predictions))
## y1 y2 y3 y4 y5
## 23 3 2 4 1 5
## 19 5 1 3 2 4
## 62 4 1 3 2 5
## 1 4 2 3 1 5
## 67 5 1 4 2 3
## 92 5 1 4 2 3
> head(mcut_threshold(predictions))
## y1 y2 y3 y4 y5
## 23 0 1 0 1 0
## 19 0 1 0 1 0
## 62 0 1 1 1 0
## 1 0 1 0 1 0
## 67 0 1 0 1 0
## 92 0 1 0 1 1
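The mcut_threshold output above can be reproduced by hand from the six score rows shown earlier. The sketch below is a base R reading of the maximum-gap rule (cut in the middle of the widest gap between consecutive sorted scores); the helper name `mcut_row` and the midpoint choice are assumptions, not the utiml source.

```r
# Scores copied from the head(predictions) output shown above.
scores <- matrix(
  c(0.336, 0.640, 0.178, 0.922, 0.122,
    0.106, 0.860, 0.366, 0.670, 0.280,
    0.094, 0.776, 0.526, 0.688, 0.090,
    0.196, 0.618, 0.204, 0.758, 0.162,
    0.090, 0.964, 0.210, 0.700, 0.234,
    0.060, 0.856, 0.162, 0.598, 0.454),
  ncol = 5, byrow = TRUE,
  dimnames = list(c("23", "19", "62", "1", "67", "92"), paste0("y", 1:5))
)

# For one instance: sort the scores, find the widest gap between
# consecutive values, and cut in the middle of that gap.
mcut_row <- function(s) {
  sorted <- sort(s, decreasing = TRUE)
  gaps <- -diff(sorted)
  i <- which.max(gaps)
  cutoff <- (sorted[i] + sorted[i + 1]) / 2
  as.integer(s > cutoff)
}

mcut_like <- t(apply(scores, 1, mcut_row))
colnames(mcut_like) <- colnames(scores)
print(mcut_like)
```

For instance 92 the widest gap lies between 0.454 and 0.162, so y5 is kept as relevant even though its score is below 0.5, which is exactly where the result differs from as.bipartition.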
Three different ECC models are created in the following code to illustrate the use of different parameters and base algorithms. Each model uses a specific base algorithm and configuration, which, consequently, results in different models for the same data and MLC strategy.
> # Using KNN with k = 5 and changing ECC parameters
> model1 <- ecc(ds$train, "KNN", m=7, subsample=0.8, k=5)
> # Using C5.0 and changing ECC parameters
> model2 <- ecc(ds$train, "C5.0", subsample=0.6, attr.space=1)
> # Using SVM with cost = 10 and gamma = 0.5 and default ECC parameters
> model3 <- ecc(ds$train, "SVM", cost=10, gamma=0.5)
By default, the create_holdout_partition function
creates two random partitions (train and test) with 70% and 30% of the
dataset instances, respectively. The number of partitions, sizes and
sampling method can be modified. The following code shows how to create
three, label-stratified partitions, named "train",
"test", and "val" with 70%, 20%, and 10% of
the instances, respectively. The "val" partition can be
used in a validation step for model selection or hyperparameter
tuning.
> partitions <- c(train=0.7, test=0.2, val=0.1)
> strat <- create_holdout_partition(new.toyml, partitions, "iterative")
This section shows some examples of how to perform cross-validation
MLC experiments. The cv method can be used to encapsulate
the whole procedure, which simplifies the respective task, such that a
10-fold stratified cross-validation can be performed with few lines of
code. For instance, the RAkEL strategy using the "SVM" base
algorithm can be evaluated in the following way:
> # Defining the evaluation measures
> measures <- c("hamming-loss", "subset-accuracy", "one-error")
> # Running 10-fold cross validation
> results <- cv(new.toyml, method="rakel", base.algorithm="SVM", cv.folds=10,
+ cv.sampling="stratified", cv.measures=measures, cv.seed=123)
> round(results, 4)
## hamming-loss one-error subset-accuracy
## 0.212 0.160 0.240
To obtain detailed results by folds and/or labels, the argument
cv.results=TRUE can be set. In this case, a list is
returned from which the multi-label and per-label results can be obtained, as
illustrated in the next example.
> results <- cv(new.toyml, method="rakel", base.algorithm="SVM", cv.results=TRUE,
+ cv.sampling="stratified", cv.measures=measures, cv.seed=123)
> t(results$multilabel)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## hamming-loss 0.2 0.18 0.18 0.2 0.22 0.22 0.22 0.2 0.24 0.26
## one-error 0.3 0.20 0.20 0.2 0.10 0.00 0.00 0.2 0.20 0.20
## subset-accuracy 0.3 0.30 0.30 0.3 0.20 0.20 0.20 0.2 0.20 0.20
> round(sapply(results$labels, colMeans), 4)
## y1 y2 y3 y4 y5
## accuracy 0.83 0.78 0.81 0.6900 0.83
## balacc 0.50 0.50 0.50 0.5054 0.50
## TP 0.00 7.80 0.00 6.8000 0.00
## TN 8.30 0.00 8.10 0.1000 8.30
## FP 0.00 2.20 0.00 3.0000 0.00
## FN 1.70 0.00 1.90 0.1000 1.70
Any MLC strategy can be used in the cv method, as well
as its specific hyperparameters. Additionally, the procedure can be
parallelized using cv.cores. The next example shows the
ECC algorithm with specific hyperparameters being executed using 5 folds
and parallelized on 4 cores.
> results <- cv(new.toyml, method="ecc", base.algorithm="RF", subsample=0.9,
+ attr.space=0.9, cv.folds=5, cv.cores=4)
Finally, to perform a cross-validation procedure manually, the
functions create_kfold_partition and
partition_fold can be used to create the folds and to obtain
the training and test datasets for each of them, respectively. A good
example is the code used inside the cv method:
...
> cvdata <- create_kfold_partition(mdata, cv.folds, cv.sampling)
> results <- parallel::mclapply(seq(cv.folds), function (k){
+ ds <- partition_fold(cvdata, k)
+ model <- do.call(method, c(list(mdata=ds$train), ...))
+ pred <- predict(model, ds$test, ...)
+ multilabel_evaluate(ds$test, pred, cv.measures, labels=TRUE)
+ }, mc.cores=cv.cores)
...
In order to show how the package can be used in a real-world problem,
this section illustrates the use of utiml to perform an
exploratory analysis of the food truck dataset (Rivolli, Parker, and de Carvalho 2017). First,
the br strategy is evaluated with different ML base
algorithms ("C5.0", "RF", "SVM" and "XGB") to identify the algorithm
that produces the best macro-F1 and micro-F1 results.
> measures <- c("macro-F1", "micro-F1")
> algorithms <- c("C5.0", "RF", "SVM", "XGB")
> res <- sapply(algorithms, function(alg) {
+ cv(foodtruck, "br", base.algorithm=alg, cv.measures=measures, cv.seed=1)
+ })
> round(res, 4)
##            C5.0     RF    SVM    XGB
## macro-F1 0.1764 0.1824 0.1188 0.1827
## micro-F1 0.4856 0.5340 0.4835 0.5130
Regarding the macro-F1 measure, XGB obtained the best result,
although its advantage over RF and C5.0 was small. For
micro-F1, RF obtained the best result, followed closely by XGB. The
gap between the macro and micro measures, regardless
of the base algorithm, may indicate that some infrequent labels had a
poor F1 performance. To investigate this hypothesis, br was run
again with RF, which obtained the best micro-F1 in the previous
cross-validation procedure, on a new holdout partition of the data. Some
patterns can be observed in the confusion matrix of the induced model.
The following code shows that six labels (mexican_food,
chinese_food, japanese_food,
arabic_food, healthy_food and
fitness_food) had neither True Positive (TP) nor False Positive
(FP) predictions. Thus, for these labels, all instances were predicted
as negative. This explains the gap observed between the macro-F1 and
micro-F1 results, since macro-F1 is the average of the per-label F1
values, which are 0 for these labels.
> set.seed(1)
> ds <- create_holdout_partition(foodtruck, method="iterative")
> model <- br(ds$train, "RF")
> pred <- predict(model, ds$test)
> cm <- multilabel_confusion_matrix(ds$test, pred)
> as.matrix(cm)
##                 TP  TN FP FN
## street_food     88   5 32  0
## gourmet         13  81  8 23
## italian_food     1 113  0 11
## brazilian_food   3 104  0 18
## mexican_food     0 113  0 12
## chinese_food     0 121  0  4
## japanese_food    0 115  0 10
## arabic_food      0 118  0  7
## snacks           7 103  2 13
## healthy_food     0 116  0  9
## fitness_food     0 116  0  9
## sweets_desserts 11  66 13 35
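The effect of the six all-negative labels on the two averages can be checked by hand from the counts above. The base R sketch below recomputes macro-F1 and micro-F1 using the standard definitions, with the assumption that a label's F1 is taken as 0 when it has no true positives. The counts are copied from the printed matrix, and the names cm_counts, label_f1, macro_f1 and micro_f1 are illustrative and not part of utiml.

```r
# Confusion-matrix counts copied from the as.matrix(cm) output above.
cm_counts <- rbind(
  street_food     = c(TP = 88, TN =   5, FP = 32, FN =  0),
  gourmet         = c(TP = 13, TN =  81, FP =  8, FN = 23),
  italian_food    = c(TP =  1, TN = 113, FP =  0, FN = 11),
  brazilian_food  = c(TP =  3, TN = 104, FP =  0, FN = 18),
  mexican_food    = c(TP =  0, TN = 113, FP =  0, FN = 12),
  chinese_food    = c(TP =  0, TN = 121, FP =  0, FN =  4),
  japanese_food   = c(TP =  0, TN = 115, FP =  0, FN = 10),
  arabic_food     = c(TP =  0, TN = 118, FP =  0, FN =  7),
  snacks          = c(TP =  7, TN = 103, FP =  2, FN = 13),
  healthy_food    = c(TP =  0, TN = 116, FP =  0, FN =  9),
  fitness_food    = c(TP =  0, TN = 116, FP =  0, FN =  9),
  sweets_desserts = c(TP = 11, TN =  66, FP = 13, FN = 35)
)
tp <- cm_counts[, "TP"]
fp <- cm_counts[, "FP"]
fn <- cm_counts[, "FN"]

# Per-label F1 = 2TP / (2TP + FP + FN); defined as 0 if the denominator is 0.
denom <- 2 * tp + fp + fn
label_f1 <- ifelse(denom == 0, 0, 2 * tp / denom)

# Macro-F1 averages the per-label values, so every label weighs the same.
macro_f1 <- mean(label_f1)

# Micro-F1 pools the counts first, so frequent labels dominate.
micro_f1 <- 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))

round(c(macro = macro_f1, micro = micro_f1), 4)
```

On this single holdout split the sketch gives a macro-F1 of about 0.209 and a micro-F1 of about 0.544. These differ from the cross-validated averages reported earlier because they come from one partition, but they show the same gap: the six labels with F1 equal to 0 pull the macro average down, while the frequent street_food label dominates the pooled counts.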
Note that the cm object is a list
containing several pieces of information about the predictions, such as
the confusion matrix values summarized by instance and by label. Any evaluation measure
can be computed using only the information provided by this object. As
an example, the next code summarizes the proportion of test instances
according to the number of labels correctly predicted for them in the
previous example. The
results show that the BR model did not predict any correct label
for 20% of the test instances; around 65% of the instances had
a single label correctly predicted; 12% had 2 labels correctly
predicted; and about 3% had 3 labels correctly predicted.
> prop.table(table(cm$TPi))
##     0     1     2     3
## 0.200 0.648 0.120 0.032
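To make the meaning of this table concrete, the sketch below derives per-instance true-positive counts from a pair of small 0/1 matrices in base R. The matrices truth and pred are hypothetical toy data, and the assumption is that cm$TPi holds the same kind of per-instance count. The name tp_per_instance is illustrative.

```r
# Toy data: 4 instances (rows) and 3 labels (columns), hypothetical values.
truth <- rbind(c(1, 0, 1),
               c(0, 1, 0),
               c(1, 1, 0),
               c(0, 0, 1))
pred  <- rbind(c(1, 0, 0),
               c(0, 0, 0),
               c(1, 1, 0),
               c(0, 0, 1))

# A true positive is a label that is both relevant and predicted;
# summing over the columns gives the count for each instance.
tp_per_instance <- rowSums(truth == 1 & pred == 1)

# Proportion of instances with 0, 1, 2, ... correctly predicted labels.
tp_props <- prop.table(table(tp_per_instance))
tp_props
```

Here one of the four instances has no correctly predicted label, two have one, and one has two, so the proportions are 0.25, 0.50 and 0.25.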
These results show that a researcher can simulate new scenarios and explore different solutions in order to improve the predictive performance in an MLC task. The utiml package offers several resources that simplify the most basic and recurrent procedures adopted in the MLC domain.
Data classification is one of the main ML tasks. Although ML classification algorithms are usually designed and employed for single-label classification tasks, in several application domains an instance can have more than one class label. This paper introduced the utiml package, which provides several functions for MLC experiments in R. Similarly to MULAN, one of the most popular MLC tools, utiml offers a wide set of functionalities. The provided functions implement procedures that cover several MLC-related tasks, which include data pre-processing, data sampling, model induction, optimization, and evaluation of MLC models. The utiml package also supports the intrinsic parallelization of tasks and allows the reproducibility of MLC experiments.
To the best of the authors' knowledge, some of the features present in utiml are not available in any other R tool, such as the implementation of MLC stratification (Sechidis, Tsoumakas, and Vlahavas 2011), baselines (Metz et al. 2012), thresholds (Al-Otaibi, Flach, and Kull 2014), and an option that allows users to avoid the empty prediction problem (Liu and Chen 2015). Moreover, just as MULAN enables users to take advantage of the resources available in the Weka environment, utiml users can benefit from the many libraries available in R.
The most important limitation of this package is that some common MLC procedures, like feature selection, the handling of imbalanced data, and classification strategies based on the algorithm adaptation approach, are not available yet. They will be implemented in the future as a natural progression of this work and will be included in the next versions of the utiml package. The authors encourage other developers to integrate their own algorithms in the utiml package (see footnote 7), so that it becomes a more robust and complete MLC package.
1. Where \(h(x) = \{\lambda \mid f(x, \lambda) \geq \tau(x), \lambda \in L\}\), and \(\tau(x)\) is a threshold function.
2. The complete specification is available at http://mulan.sourceforge.net/format.html.
3. This dataset is available at http://mulan.sourceforge.net/datasets-mlc.html.
4. These packages are not installed together with utiml.
5. These parameters denote, respectively, the number of models in the ensemble, the proportion of instances, the proportion of attributes, and whether instances are sampled with replacement. The specific parameters of each strategy are reported in the reference manual.
6. The specific parameters of each predict method are reported in the reference manual.
7. The source and project details are available at https://github.com/rivolli/utiml.