Introduction
Subgroup discovery (SD) is a data mining field that aims to describe
data using supervised learning techniques. The goal is to find simple,
general and interesting patterns with respect to a given variable of
interest. Throughout the literature, SD has been applied with success to
different real-world problems in areas such as marketing (Jesus et al. 2007; Berlanga et al. 2006),
medicine (Cristobal Jose Carmona et al. 2015;
Cristóbal J. Carmona, Chrysostomou, et al. 2013; Stiglic and Kokol 2012;
Cristóbal J. Carmona et al. 2011; D. Gamberger, Lavra??, and Krsta??i??
2003) and e-learning (Poitras et al. 2016;
Lemmerich, Ifland, and Puppe 2011; C. J. Carmona et al. 2010),
amongst others (M. Atzmueller, Doerfel, and
Mitzlaff 2016; Jin et al. 2014; Cristóbal J. Carmona, González, et al.
2013; Rodriguez et al. 2013; C. J. Carmona et al. 2012).
SD is an useful rule learning process for complex search spaces.
Therefore, the search strategy used becomes a key factor in the
efficiency of the method. Different strategies can be found in the
literature such as beam search in the algorithm CN2-SD (Lavrač, Kavšek, et al. 2004) and Apriori-SD
(Kavšek and Lavrač 2006), exhaustive
algorithms such as SDMap (Martin Atzmueller and
Puppe 2006) or Evolutionary Algorithms (EAs), for example.
EAs are stochastic algorithms for optimising and searching tasks,
based on the natural evolution process (John
1992). There are different paradigms within EAs: genetic
algorithms (John 1992; Goldberg 1989),
evolution strategies (Schwefel 1995),
evolutionary programming (Fogel 2006) and
genetic programming (Koza 1992). With
these methods the use of rules to represent the knowledge is known as
evolutionary rule-based systems (Freitas
2003) and has the advantage of allowing the inclusion of domain
knowledge, also returning better rules. The use of EAs is very well
suited for SD because these algorithms perform a global search in the
space in a convenient way, stimulating the obtaining of simple,
interesting and precise subgroups.
Fuzzy logic is an extension of traditional set theory whose main aim
is to model imprecise knowledge (Zadeh
1975). The main element is the fuzzy set, which allows belonging
degrees in the range \[0,1\] where
zero means not belonging at all and one means absolute belonging. A
ling??istic variable is a set of overlapped fuzzy sets which define
ling??istic labels that cover all the range of a numeric variable. The
main advantage of using fuzzy logic on SD is obtaining a knowledge
representation for numeric variables more understandable for experts. It
improves the interpretability of rules and the knowledge representation
is very close to human reasoning (E. Hüllermeier
2011). In addition, it avoids the possible loss of information in
variables with continuous domains due to a previous discretisation
stage.
Nowadays, there are several frameworks that allow one to perform data
mining tasks, but only a few of them have implementations of SD
algorithms. The best known frameworks with SD algorithms are KEEL (Alcalá-Fdez et al. 2011) and VIKAMINE (Martin Atzmueller and Lemmerich 2012), but
ORANGE (Demšar et al. 2013), and CORTANA
(Meeng and Knobbe 2011) also provide some
implementations. In fact, VIKAMINE also provides an R package called rsubgroup
(Martin Atzmueller 2014) which is an
interface for R to VIKAMINE Java algorithms.
In this contribution the SDEFSR
package is introduced. It provides the user with the most important
evolutionary fuzzy rule-based methods for SD documented in the
literature, being a major contribution to the R community. In addition,
it also brings the capability of reading datasets in the KEEL data
format (Alcalá-Fdez et al. 2011). This
file format is not natively supported by R. Similarly, this package
provides a Graphical User Interface (GUI) to make this task easier for
the user, especially the unexperienced one.
This contribution is organized according to the following structure:
The first section presents SD concepts, main properties and features. In
the second section, the structure of the SDEFSR package and its
operations are described. In the third section, a complete example of
use of the package is presented. Finally, the GUI of SDEFSR is shown in
the last section.
Subgroup discovery
SD was defined by (Wrobel 2001) as:
“In subgroup discovery, we assume we are given a so-called
population of individuals (objects, customers, \(\ldots\)) and a property of those
individuals we are interested in. The task of subgroup discovery is then
to discover the subgroups of the population that are statistically”most
interesting”, i.e. are as large as possible and have the most unusual
statistical (distributional) characteristics with respect to the
property of interest.”
SD tries to find relationships between different properties of a set
with respect to one interesting or target variable. Such relations must
be statistically interesting, so it is not necessary to find complete
relations, partial relations could be interesting too.
Usually these relations are represented by means of rules, one per
relation. A rule is defined as (Lavrač, Cestnik,
et al. 2004; Dragan Gamberger and Lavrac 2002): \[R: Cond \rightarrow Target_{value}\] where
\(Target_{value}\) is the value for the
variable of interest (target variable) for the SD task and \(Cond\) is normally a conjunction of
attribute-value pairs which describe the characteristics of the induced
subgroup.
SD is a data mining task halfway between description and classification,
because it has a target variable but its objective is not to predict but
rather to describe. The use of a target variable is not possible in
description, because description simply finds relationships between
unlabeled objects.
A key point to fully understanding the goal of SD is how it
differentiates from the classification objective. Classification is a
predictive task that tries to split the entire search space, usually in
a complex way, aiming to find the correct value for the variable in new
incoming instances. On the other hand, SD aims to find interesting
relationships among these instances regarding the variable of
interest.
For instance, assuming there is a group of patients, the variable of
interest is whether they have heart disease or not. The predictive data
mining objective is to predict if new patients will have heart disease.
However, SD tries to find which subgroup of patients are more likely to
have heart disease according to certain characteristics. This could be
relevant to develop a treatment against this disease.
Main elements of subgroup discovery algorithms
Below, the most relevant aspects of SD algorithms are presented (Martin Atzmueller, Puppe, and Buscher
2004):
Type of the target variable: This is the kind of
information the interest variable can hold. The target variable could be
binary (two possible values), categorical (\(n\) posible values) or numerical (a real
value within a range). Nevertheless, the majority of SD algorithms can
only deal with binary or categorical target variables.
Description language: Knowledge representation is a key
factor in SD due to its descriptive nature. In this way, rules must be
as simple as possible. Rules are usually represented by conjunctions of
attribute-value pairs or in disjunctive normal form. Fuzzy logic could
also be included in the rules in order to improve the interpretability
of the knowledge (Zadeh 1975; Eyke Hüllermeier
2005).
Quality measures: This is the most important aspect in
the design of SD algorithms. The quality measures must guide the
learning process and must show the quality of the extracted knowledge.
They are briefly described below.
Search strategy: The search space grows exponentially
with the number of variables.The use of a search strategy able to find a
good solution, or the optimal one, by searching efficiently through the
whole search space is very important.
Quality measures for subgroup discovery
A quality measure tries to measure the interestingness of a given
rule or subgroup, but there is not a formal definition of what
interestingness is. However, the interestingness could be defined as a
concept that emphasises conciseness, coverage, reliability, peculiarity,
diversity, novelty, surprisingness, utility, and actionability (Geng and Hamilton 2006). For SD, the most used
criteria to measure the interestingness of a rule are conciseness,
generality or coverage, reliability, and novelty (C. J. Carmona et al. 2014).
Quality measures that accomplish this criteria available in the
SDEFSR package are:
Measures for conciseness (or complexity). It measures the
complexity of the induced rules. Rules with a few number of
attribute-value pairs are easy to remember and add to the expert’s
knowledge. The quality measures associated to this criterion are:
\(N_r\): The number of rules
generated. A rule set with a large number of rules is much more
difficult to remember than other that has less rules. Additionally, the
lower the number of rules, the easier for the expert to filter those
rules that are interesting.
\(N_v\): The number of variables
of the rules generated. Rules with less variables are easier to
understand and to remember; also they tend to be more general. Thus,
rules with low number of variables are interesting.
Measures for generality (or coverage). It measures the capacity
of a rule to match with a great number of examples in the dataset. Also,
the capacity to generalise the rule to other instances that are not in
the training dataset is greater. The quality measures associated to this
criterion are:
Support: It measures the frequency of correctly
classified examples covered by the rule: \[\label{support}
Sup\left(R\right)= \frac{n\left(Cond
\wedge Target_{value}\right)}{n_s} (\#eq:support)\] where
\(n\left(Cond
\wedge Target_{value}\right)\) means the number of examples that
satisfy the antecedent and consequent part of the rule and \(n_s\) is the number of examples in the
dataset.
Coverage: It measures the percentage of examples covered
by the rule related to the total number of examples: \[\label{coverage}
Cov\left(R\right)=
\frac{n\left(Cond\right)}{n_s} (\#eq:coverage)\] where \(n\left(Cond\right)\) means the number of
examples that satisfy the antecedent part of the rule.
Measures for reliability. A rule is reliable when the relation
described in the rule occurs in a high percentage of cases where the
rule can be applied. The quality measures associated to this criterion
are:
- Confidence: It measure the percentage of examples correctly
covered of the total of covered examples: \[\label{confidence}
Conf\left(R\right) = \frac{n\left(Cond \wedge
Target_{value}\right)}{n\left(Cond\right)} (\#eq:confidence)\]
Measures for novelty. A rule is novel if the knowledge obtained
from this one is unknown by the user or it is unable to infer such
knowledge from other rules. For this kind of criterion, the quality
measures availables in the package are:
- Significance: It reflects the novelty in the distribution
of the examples covered by the rule regarding the whole dataset: \[\label{Significance}
Sign\left(R\right) = 2 \cdot \sum_{k=1}^{n_c}n\left(Cond
\wedge Target_{value_k}\right) \cdot log\left( \frac{n\left(Cond \wedge
Target_{value_k}\right)}{n\left(Cond \wedge Target_{value}\right) \cdot
p\left(Cond\right)} \right) (\#eq:Significance)\] where \(p\left(Cond\right) =
\frac{n\left(Cond\right)}{n_s}\), \(n_c\) is the number of possible values of
the target variable and \(Target_{value_k}\) is the \(k\)-th value of the target variable.
Hybrid measures. These measures try to maximise more than one
criterion with a single expression that finds a good trade-off between
the criteria used. The hybrid quality measures implemented are:
Unusualness: It is defined as the weigthed relative
accuracy of a rule and tries to maximise generality and realiability:
\[\label{Unusualness}
WRAcc\left(R\right) =
\frac{n\left(Cond\right)}{n_s}\left(\frac{n\left(Cond \wedge
Target_{value}\right)}{n\left(Cond\right)} -
\frac{n\left(Target_{value}\right)}{n_s}\right) (\#eq:Unusualness)\]
True Positive Rate (TPR) or Sensitivity: It measures the
proportion of covered examples that has been correctly classified.
\[TPR\left(R\right) = \frac{n\left(Cond
\wedge
Target_{value}\right)}{n\left(Target_{value}\right)}\]
False Positive Rate (FPR): It measures the proportion of
examples that are covered that do not belong to the target variable.
\[FPR\left(R\right) = \frac{n\left(Cond
\wedge
\overline{Target_{value}}\right)}{n\left(\overline{Target_{value}}\right)}\]
where \(\overline{Target_{value}}\)
means the negation of \(Target_{value}\), i.e., the examples that
not belong to the target class.
A more complete classification and definition of quality measures for
SD is available in (Franciso Herrera et al.
2011) and (Martin Atzmueller
2015).
Evolutionary fuzzy systems
Evolutionary fuzzy systems (EFSs) are the union of two powerful tools
for aproximate reasoning: EAs and fuzzy logic.
In one side, EAs (Eiben and Smith 2003)
are stochastic algorithms based on the natural evolution to solve
complex optimisation problems. They are based on a population of
representations of possible solutions, called chromosomes. A basic EA
scheme is:
Generate the initial population
Evaluate the chromosomes of the population. This is the most
important and expensive part of the EA. In the algorithms of this
package, quality measures described above are used as evaluation
function.
Select the chromosomes which the genetic operators will be
applied.
Apply the genetic operators. The most used are:
Crossover operator. Which takes two chromosomes and generates two
descendants as a combination of the elements of the parents.
Mutation operator. Which takes a chromosome and changes randomly
a gene (a value of the possible solution).
Replace the population with the new generated chromomes.
Go to step 2 until a stopping criterion is reached. Normally this
criterion is a number of evaluations or generations.
These algorithms perform efficiently a global stochastic search
through a huge search space. However, it is possible that these
algorithms can not find an optimal solution (a global optimum), but a
good one (a local optimum) that can solve the problem too. They are well
suited for SD because the problem of finding subgroups can be formulated
from the optimisation point of view as coding rules as a parameter
structure that optimise some measures. Additionally, different kinds of
rules exist in SD (with inequality, with intervals, fuzzy rules...).
This can change dramatically the size of the search space and EAs can
adapt these structures easily without performance degradation. Likewise,
the selection of the genetic operators can make EAs great candidates to
introduce expert knowledge in the search process (C. J. Carmona et al. 2014).
On the other side, fuzzy logic (Zadeh
1975) is an extension of the traditional set theory. Its main
objective is to model and deal with imprecise knowledge. The main
difference with traditional set theory is that belonging degree is not
zero or one, but a real value in \[0,1\]. This possibility allow one to
define fuzzy limits and the chance of overlapping between fuzzy
sets.
A fuzzy variable is a set of linguistic labels, e.g. low, medium and
high, which are defined by overlapped fuzzy sets (E. Hüllermeier 2011). This information is
closer to human reasoning and it is possible to calculate with
precission the value of each belonging degree. This expressivity allows
one to obtain simpler rules because continuous variables are more
understandable for humans. A rule can be represented by means of a set
of fuzzy variables. To determine if the rule covers an example it is
neccesary to calculate the belonging degree of each variable in the rule
with respect to the example. If all the variables have a belonging
degree grater than zero, the example is covered.
EFSs are the union of both techniques, and work three ways(Francisco Herrera 2008):
EAs that evolve the fuzzy rules (changing the number of variables
and their values) and use fuzzy set definitions defined by the user.
This way of work is used by all the algorithms implemented in the
SDEFSR package.
EAs that evolve the fuzzy sets, changing the number of fuzzy sets
for each variable, its shapes, etc.
EAs that evolve both rules and fuzzy sets.
The SDEFSR package
SDEFSR is a package entirely written on R from scratch. To
the best of our knowledge, this package includes all the EFSs for SD
presented throughout the literature. In addition, SDEFSR has
the capacity to read data in different standard and well-known formats
such as ARFF (Weka), KEEL, CSV and data.frame objects.
Similarly, all functions included in the SDEFSR package have
default parameters with values recommended by the authors. This allows
the algorithms to be executed in an easy way without the necessity of
knowing the parameters for final users.
Algorithms included in the package
SDEFSR implements the following SD algorithms (Ventura and J. M. Luna 2016):
SDIGA (Jesus et al. 2007). This is
a mono-objective EA (Back, Fogel, and Michalewicz
1997) based on an Iterative Rule Learning (IRL) approach in which
only the best rule is extracted from an execution of the EA. This EA is
executed iteratively until a stopping criterion is reached.
MESDIF (Berlanga et al. 2006). A
multi-objective EA (Deb 2001) based on the
SPEA2 (Zitzler, Laumanns, and Thiele 2002)
algorithm. It returns the best \(n\)
rules (where \(n\) is an user
parameter) in the pareto front.
NMEEF-SD (Cristóbal José Carmona et al.
2010). Another multi-objective EA, based on the NSGA-II (Deb et al. 2000) algorithm which returns rules
in the pareto front with a confidence greater than a given threshold. It
has a reinitialisation operator to promote diversity and maximise the
covering of all examples for target variable.
FuGePSD (Cristobal Jose Carmona et al.
2015). An algorithm that uses genetic programming in which a
competitive-cooperative scheme is implemented in order to obtain the
best global rules. The key operation of this algorithm is the Token
Competition (Leung et al. 1992), which
promotes the extraction of precise, general and also diverse rules from
the evolutionary process.
All these methods share the following characteristics:
They use fuzzy logic to improve the interpretability of results,
making them robust when working with noisy data (Luengo, A. Garc??a-Vico, M. D. P??rez-Godoy, and
C. Carmona, n.d.) and also allowing one to include expert
knowledge on the evolutionary learning process.
Rules can be represented by a conjunction of attribute-value
pairs (canonical form) or in disjunctive normal form (DNF). In (Cristóbal José Carmona et al. 2009) there is an
analysis justifying not using the DNF rule representation on some SD
algorithms.
It is possible to specify the quality measures used to guide the
evolutionary process.
All of the algorithms can deal with categorical (or multi-class)
target variables.
Package architecture, extensibility, limitations and comparison with
similar packages
The main advantage of the SDEFSR package is that it provides
all EFSs for SD that exist in the literature. These algorithms are not
included in R at the moment. Therefore, this package provides to the R
community a brand new possibility for data mining and data analysis.
The base of the package is defined by two S3 classes. These classes
are:
"SDEFSR_Dataset". This object defines a dataset and
contains information about it. Such information are stored in the
following fields:
relation. Defines the name of the relation that this
dataset belongs to, e.g. "german credit".
attributeNames. Stores the names of the
attributes.
attributeTypes. A character that defines the data
type of the attribute, i.e. ‘r’ for real values, ‘e’ for integers and
‘c’ for categoricals variables.
min. A vector with the minimum value for numerical
variables, for categorical variables, this value is zero.
max. A vector with the maximum value for numerical
variables or the number of different categorical values that has a
categorical variable.
nVars. The number of variables in the dataset
(excluding the target class).
data. A list that contains each example of the
dataset. The categorical variables in data are codified. This means that
a categorical value is represented as a value in \[0, max $-1$\] that represents the position
of the value. For example, in the german-credit dataset,
the variable ForeignWorker has two values: "A201" and
"A202", these values are represented in the data field of a
"SDEFSR_Dataset" class as 0 or 1, respectively.
class_names. A vector with the categorical values of
the target class. By default, the target variable is the last one. If it
is not categorical, the method that reads the dataset fails. The user
can select other target class when executing the algorithms.
examplesPerClass. A list with the number of examples
belonging to each class.
lostData. A logical that indicates the presence of
lost data.
covered. A logical vector with length the number of
examples that indicates which examples are covered by the generated
rules. This value by default is NA, but it is used in some
algorithms like SDIGA.
fuzzySets. A list that indicates the fuzzy sets
definitions for each variable. This value is NA by
default.
crispSets. A list that indicates the crisp sets
obtained from the fuzzy sets. As this value is infered from
fuzzySets, this is NA by default.
sets. A vector that defines the number of fuzzy sets
that each variable has or the number of categorical values.
categoricalValues. A list that contains a vector of
names of each categorical variable or NA if the variable is
numerical.
Ns. The number of examples in the dataset.
This class also exports the well-known S3 methods
print() and summary() that show the data
structure without codification and a summary with basic information
about the dataset respectively.
"SDEFSR_Rules". This class is a list that contains
the rules generated by an algorithm. To know the number of rules
generated, it is possible to use
length(SDEFSR_RulesObject). Each rule has the following
fields:
This object must be returned for all the SD algorithms of this
package, and it is neccesary to make an analysis of the generated rules.
This object is necessary for the exported functions
plotRules() that plots an FPR vs TPR graph that allows the
visualisation of rules, and the well-known method sort()
that return other "SDEFSR_Rules" object with the rules
sorted by a given quality measure in descendant order. Likewise, this
object overloads the subset operator (’[’) to allow
filtering operations easily.
Additionally, the package has a general function that reads datasets
in ARFF, KEEL or CSV format called read.dataset() and
SDEFSR_DatasetFromDataFrame() to transform a
data.frame into a "SDEFSR_Dataset". In
summary, exported functions and S3 objects are presented in Table 1.
Table 1: Methods and rules exported by the SDEFSR
package.
"SDEFSR_Dataset" |
SDIGA() |
read.dataset() |
print.SDEFSR_Dataset() |
MESDIF() |
SDEFSR_DatasetFromDataFrame() |
summary.SDEFSR_Dataset() |
NMEEF_SD() |
plotRules() |
"SDEFSR_Rules" |
FUGEPSD() |
|
"[.SDEFSR_Rules" |
|
|
"sort.SDEFSR_Rules()" |
|
|
A potential future extension of the package will be the inclusion of
the confusion matrix for each rule. With this matrix it would be
possible to infer the rest of the quality measures. Also, additional
quality measures could be easily added. Statistical validation of the
results are now implemented in package rsubgroup, which is
already available on CRAN. Therefore, SDEFSR delegates this
task to that package.
The rsubgroup package contains SD algorithms that can
complement the algorithms available in the SDEFSR package.
rsubgroup includes more established algorithms for SD like beam
search (Lowerre 1976) or SD-Map (Martin Atzmueller and Puppe 2006). This opens a
wide range of SD algorithms that a user can execute in R. Nevertheless,
both packages have great differences when calling SD methods and the
results. This means that a future package, which joins both into one
standard framework would be interesting.
Below, the neccesary methodology to perform the inclusion of new
algorithms in the SDEFSR package is shown:
The SD algorithm must use an "SDEFSR_Dataset" object
as input and return an "SDEFSR_Rules" object with the
results. Optionally, these results can be displayed on the console in a
human-readable way.
The method can be executed with a parameter that contains the
path of a parameter file. This file must contain the same parameters as
the algorithm. This means that the algorithm must be executed either by
a parameter file or by writting all neccesary parameters.
The majority of the parameters must have default values to ease
the use of the algorithm.
The SD algorithm must not depend on other packages that are not
in the "base" set of packages.
The source code of the algorithm must be added in a separate file
with the name of such method.
As many others packages in R, this package does not have any
automated test that control the quality of the code or results obtained.
Instead, as the algorithms implemented exists in other platforms (KEEL),
we check the quality of the methods comparing the results with the
original implementation against a significant number of datasets with a
5-fold cross validation schema. This test showed that the results of the
methods in SDEFSR are very close or equal to the results
obtained from the reference implementations. Developers who want to add
a new method to the package, must demonstrate the validity of the
results obtained.
An example of use
This section describes an example of use of the package, covering the
installation and loading of the package to the execution of a SD
algorithm and the analysis of the rules generated.
Installing and loading the SDEFSR package
The SDEFSR package is now available at CRAN, so it can be
installed like any other package by simply typing:
> install.packages("SDEFSR")
The development version is available on GitHub at https://github.com/SIMIDAT/SDEFSR. To install and use
the development version you need to install the devtools
(Wickham and Chang 2015) package and then
use the command:
> devtools::install_github("SIMIDAT/SDEFSR")
The package can be loaded using either the library() or
require() functions. Once the package has been loaded,
there are six sample datasets stored as "SDEFSR_Dataset"
objects available: carTra, carTst,
germanTra, germanTst, habermanTra
and habermanTst that correspond to the car,
german and haberman (Alcalá-Fdez et al. 2011) training and test
datasets respectively. These are contained in the
carTra.rda, carTst.rda,
germanTra.rda, germanTst.rda,
habermanTra.rda and habermanTst.rda files
respectively, which are lazily loaded with the package. Also, rules
generated by the SDIGA algorithm with default parameters over the
haberman dataset are loaded as habermanRules
as a "SDEFSR_Rules" object. These rules are stored in the
habermanRules.rda file.
Loading a dataset
In order to use SD algorithms available in the SDEFSR
package, a "SDEFSR_Dataset" object is necessary. This
object can be generated using the read.dataset() function.
This function reads ".dat" files with the KEEL data mining tool format,
ARFF files (".arff") from WEKA or even CSV files (".csv"). The source
code for reading ARFF files has been taken from the mldr
package(Charte and D. Charte 2015).
Assuming the files iris.dat, iris.arff and
iris.csv corresponding to the classical iris dataset in
KEEL, ARFF and CSV formats respectively in the working directory, the
loading of these files will be as follows:
> irisFromKEEL <- read.dataset("iris.dat")
> irisFromARFF <- read.dataset("iris.arff")
> irisFromCSV <- read.dataset("iris.csv")
Note that the function detects the type of data by the extension. To
read csv files, the function has optional parameters that defines the
separator used between fields, the decimal separator, the quote
indicator and the NA identifier as parameters. By default,
these options and values are
sep = ",", quote = ""̈, dec = "." and
na.strings = "?" respectively. It is important to remark
that sparse ARFF data is not supported.
If the dataset is not available in these formats, it is possible to
obtain a "SDEFSR_Dataset" object from a
data.frame. This data.frame could be loaded by
read.table() or similar functions. Eventually, the
resulting data.frame has to be given to the
SDEFSR_DatasetFromDataFrame() function. As we can see, this
function allows the creation of datasets on the fly, as in this
example:
> df <- data.frame(matrix(data = runif(1000), ncol = 10))
#Add class attribute (It must be the last attribute and it must be categorical)
> df[,11] <- c("Class 0", "Class 1", "Class 2", "Class 3")
> SDEFSR_DatasetObject <- SDEFSR_DatasetFromDataFrame(df, relation = "random")
This will assign to SDEFSR_DatasetObject a new dataset
created randomly with 100 examples and 11 attributes. Note that the
target variable must be categorical, because the SD algorithms can only
deal with categorical target variables.
The SDEFSR_DatasetFromDataFrame() function has three
additional parameters: names, types, and
classNames. These allow the manual assignment of attribute
names, their types, and a vector with values of target variable,
respectively. Leaving the default values (NA), the function
automatically retrieves these values through the information found on
the dataset. However, if the information in the dataset is not accurate,
it could cause unexpected results for the SD algorithms.
Obtaining information from a loaded dataset
Once the dataset is loaded, it is possible to view a simple summary
of its content by using the usual summary() function:
> summary(irisFromKEEL)
Summary of the SDEFSR_Dataset object: 'irisFromKEEL'
- relation: iris
- nVars: 4
- Ns: 150
- attributeNames: SepalLength, SepalWidth, PetalLength, PetalWidth, Class
- class_names: Iris-setosa, Iris-versicolor, Iris-virginica
- examplesPerClass: 50, 50, 50
Any of these values can be obtained individually using the
$ operator on the "SDEFSR_Dataset" object:
> irisFromKEEL$nVars
[1] 4
> irisFromKEEL$attributeNames
[1] "SepalLength" "SepalWidth" "PetalLength" "PetalWidth" "Class"
Also, it is possible to print the dataset as a
data.frame with the print() function.
Executing subgroup discovery algorithms
It is possible to execute a SD algorithm in two ways: through a
parameter file, specifying as argument the path to such file, or by
entering all the parameter names and values at the command line. You can
find the structure of the parameter file, among other useful
information, on the help pages of each function.
Assuming the "SDEFSR_Dataset" object
irisFromKEEL that has been loaded, and that the
params.txt parameter file is stored in the working
directory, the easiest way to run the MESDIF() algorithm,
for example, will be:
> ruleSet <- MESDIF(paramFile = "param.txt")
#or
> ruleSet <- MESDIF(training = irisFromKEEL)
The first way will execute the algorithm with the parameters and
datasets defined in the parameter file. The second one will execute the
algorithm with the specified "SDEFSR_Dataset" object, and
default values for remainder parameters. By default, the target variable
used is the last defined in the dataset and the algorithm searches rules
for all the values of the target variable.
An example of an execution with all parameters and the result
obtained could be:
> ruleSet <- MESDIF(paramFile = NULL, training = irisFromKEEL, test = NULL,
+ output = c("optionsFile.txt", "rulesFile.txt", "testQM.txt"),
+ seed = 0, nLabels = 3, nEval = 300, popLength = 100,
+ eliteLength = 2, crossProb = 0.6, mutProb = 0.01,
+ RulesRep = "can", Obj1 = "CSUP", Obj2 = "CCNF", Obj3 = "null",
+ Obj4 = "null", targetVariable = "Class",
+ targetClass = "Iris-virginica")
--------------------------------
Algorithm: MESDIF
Relation: iris
Training dataset: training
Test dataset: test
Rules Representation: CAN
Number of evaluations: 300
Number of fuzzy partitions: 3
Population Length: 100
Elite Population Length: 2
Crossover Probability: 0.6
Mutation Probability: 0.01
Obj1: CSUP (Weigth: )
Obj2: CCNF (Weigth: )
Obj3: null (Weigth: )
Obj4: null
Number of examples in training: 150
Number of examples in test: 150
Target Variable: Class
Target Class: Iris-virginica
--------------------------------
Searching rules for only one value of the target class...
GENERATED RULE 1
Variable SepalWidth = Label 1 ( 2 , 3.2 , 4.4 )
THEN Iris-virginica
GENERATED RULE 2
Variable PetalWidth = Label 2 ( 1.3 , 2.5 , 3.7 )
THEN Iris-virginica
Testing rules...
Rule 1 :
- N_vars: 2
- Coverage: 0.8
- Significance: 0.602743
- Unusualness: 0.02
- Accuracy: 0.357724
- CSupport: 0.286667
- FSupport: 0.245
- CConfidence: 0.358333
- FConfidence: 0.351955
- True Positive Rate: 0.86
- False Positive Rate: 0.77
Rule 2 :
- N_vars: 2
- Coverage: 0.193333
- Significance: 27.673033
- Unusualness: 0.128889
- Accuracy: 0.9375
- CSupport: 0.193333
- FSupport: 0.201667
- CConfidence: 1
- FConfidence: 0.889706
- True Positive Rate: 0.58
- False Positive Rate: 0
Global:
- N_rules: 2
- N_vars: 2
- Coverage: 0.496666
- Significance: 14.137888
- Unusualness: 0.074444
- Accuracy: 0.647612
- CSupport: 0.3
- FSupport: 0.223334
- FConfidence: 0.620831
- CConfidence: 0.679166
- True Positive Rate: 0.72
- False Positive Rate: 0.385
The output has three defined sections:
The first one provides to the user information about the current
execution, i.e., the values given to the parameters.
After that, the rules obtained are shown one by one. These rules
are numbered, starting at 1.
Finally, the quality measures applied over the test (or training
if test = NULL) dataset for each rule are shown. At the end
of results, the "Global" section shows the average results for the
quality measures analysed,
As this output could be extremely large, the function also saves it
to three files, one for each of the above sections. The name of these
files by default are optionsFile.txt,
rulesFile.txt and testQM.txt and being saved
into the working directory, overwriting existing files. The format of
these files is identical to the output generated by the algorithm, but
divided into the sections described above.
The output parameter must be included if the authors
desire the modification of the names of paths in the stored files. It
accepts a vector with the names or paths to be used instead of the
default ones. Additionally to this output, the function returns the
"SDEFSR_Rules" object which contains the rules generated
and the quality measures associated to each rule.
Analysing the rules obtained
After the execution of a SD algorithm, it returns a
"SDEFSR_Rules" object that contains the rules obtained.
Following the example, with the ruleSet object obtained we
can plot a TPR vs FPR plot to view the reliability and generality of the
rule (Kralj, N. Lavrac, and B. Zupan
2005). Reliable rules have low values of FPR and high TPR values,
and too general variables have high values for both TPR and FPR. To plot
the rules, we can use the function plotRules(). (This
function depends on the package ggplot2. If this is not
installed, the user will be queried to install it.) The resulting plot
is shown in Figure 1.
Figure 1: Plot generated after executing
plotRules() on the ``"SDEFSR_Rules"`` object obtained in the example.
Additionally, we can directly order the rule set by a quality measure
with the sort() function which returns another
"SDEFSR_Rules" object with the rules sorted. By default,
the ordering is done by confidence.
rulesOrderedBySignificance <- sort(x = ruleSet, decreasing = TRUE, by = "Significance")
Filtering rules by number of attribute-value pairs or keeping those
rules with a quality measure greater than a given threshold are
interesting functionalities to extract only high-quality rules. Such
filtering operations are quite simple to apply in SDEFSR. Using
the subset operator (’[]’) and introducing the filtering
conditions will generate a new "SDEFSR_Rules" object with
the result:
Apply filter by unusualness:
> filteredRules <- ruleSet[Unusualness > 0.05]
We check only if the number of rules decrease. In this case, this
value must be 1.
> length(filteredRules)
[1] 1
Also, you can make the filter as complex as you can Filter by
Unusualness, TPr and number of variables:
> filteredRules <- ruleSet[Unusualness > 0.05 & TPr > 0.9 & nVars == 3]
In this case, there are not rules that match the conditions.
Therefore, the number of rules must be 0.
> length(filteredRules)
[1] 0
The user interface
The SDEFSR package provides the user with a GUI to
ease the use of SD algorithms. It also allows basic exploratory analisys
of the data to be performed. This GUI is accessible by calling the
function:
> SDEFSR_GUI()
The GUI was generated using the shiny package, therefore
this function depends on this package. It depends on package
ggplot2 too. If shiny or ggplot2 are not
installed in the system, the user is given the option to install
them.
> SDEFSR_GUI()
Package 'shiny' is not installed and must be installed to run this GUI.
Do you want to install it? (Y/n): Y
\(\vdots\)
Once the package has been installed, the GUI is launched. In Figure
2 the initial state of the GUI is shown.
It is structured into two areas. On the left the user can select the
training and test files to be used, the target variable, and the value
to be used by the SD algorithm as target variable. Finally, the group of
radio buttons allows the user to change between different graphics to
perform a basic exploratory analysis. On the right there is a tab panel
where tabs are organised by the logical process of execution and
visualisation of results of a SD algorithm.
Figure 2: Initial screen of the SDEFSR user
interface.
At this moment, the user only can perform the loading of datasets as
a training or test file. The in Figure 2
within the red box intitiate the reading of files with the the same file
formats as the read.dataset() function. Once a dataset has
been loaded, an initial plot appears. This will be a pie chart if the
last variable is categorical or a histogram otherwise. Figure 3 shows an example of such a graph. This graph
could show information about all the variables in the dataset. For
example, the pie chart shows the value and the number of examples that
belongs to this value. To the right of this plot, a table provides some
information about the distribution of the data samples. This becomes
interesting with numerical variables where basic statistical information
is displayed. To change the variable being visualised, use the "Select
the target variable" dropdown menu.
The Keep this data button brings an interesting
function. This button allows to filter examples whose values contain
unselected values from the Select attributes field for
categorical variables or values from numerical variables that are not
within the range specified on the Show range slider.
Figure 3: Screenshot of the GUI once a dataset
has been loaded.
Another functionality is the Variable vs Variable
dataset visualisation. As shown in Figure 4 , it
is possible to select two variables of the dataset and visualise their
behaviour with respect to the target class. In this case, the target
class is the last variable of the dataset. The plot shown depends on the
types of variables choosen. If both are numerical, a scatter plot like
in Figure 4 is shown and if both are
categorical, a bar plot is shown. A numerical variable versus a
categorical variable is an undefined functionality, thus, no plot is
shown.
Figure 4: Screenshot of the
Variable vs Variable functionality.
On the "Algorithm Selection" tab, the user can choose a SD algorithm
to execute, and easily modify all the available parameters.
After the execution of a SD algorithm the "Rules Generated" tab is
automatically selected as shown in Figure 5.
Here the user can see the rules over a DataTable. The most
important function is the "Search" field where the user can find rules
with a specific variable. For example, with the results obtained
executing MESDIF with the banana (Alcalá-Fdez et al. 2011) dataset, which is an
artificial dataset whose classes form a banana shape, and leaving the
default parameters of the GUI. Typing "At1" on the search box filter
rules with the variable At1 on the antecedent. Similarly,
typing "THEN 1.0" filter rules with the value 1.0 on the
consequent.
Figure 5: Screen of the "Rules Generated" tab
with generated rules.
Tab "Test Quality Measures" shows another DataTable with
the quality measures for each rule. The filter options are similar to
the DataTable of "Rules Generated". As shown in Figure 6, the button Show/Hide Graph
shows a FPR vs TPR plot of the generated rules, similar to the
plotRules() function.
Figure 6: Screen of the "Quality Measures" tab
with the FPR vs TPR graph displayed.
Finally, "Execution Info" shows information about the current
execution, i.e., the parameters used in this execution. This information
is like the execution information of a console execution (See Sec. 4.4).
Summary
In this paper the SDEFSR package has been introduced. This
package implements all the EFS-based algorithms for SD that exist in the
specialised literature. The package can use datasets in ARFF, ".dat" of
KEEL and CSV formats or a data.frame object. The main
contribution of this package is the ease of use of the algorithms by
means of functions with recommended parameters by default and different
ways of execution, saving the user from the need to know the names of
all the parameters of each algorithm. Also, a GUI is presented in order
to make this task even easier.
The development of the package will continue in the future, including
more functionality to work with datasets in more different formats,
adding new SD algorithms, improving the performance of existing ones,
and also bringing all this functionality to the GUI, which will be
extended with more advanced tools for exploratory analysis.
Acknowledgments
This paper has been partially supported by the project
TIN2015-68854-R (FEDER Founds) of the Spanish Ministry of Economy and
Competitiveness.
The code of this package, available at https://github.com/SIMIDAT/SDEFSR is MIT-licensed.
Alcalá-Fdez, J., A Fernandez, J. Luengo, J. Derrac, S. García, L.
Sánchez, and F. Herrera. 2011. “KEEL Data-Mining Software Tool:
Data Set Repository, Integration of Algorithms and Experimental Analysis
Framework.” Journal of Multiple-Valued Logic and Soft
Computing 2-3 (17): 255–87.
Atzmueller, Martin. 2014.
Rsubgroup: Subgroup Discovery and
Analytics.
http://CRAN.R-project.org/package=rsubgroup.
———. 2015. “Subgroup Discovery.” Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5
(1): 35–49.
Atzmueller, Martin, and Florian Lemmerich. 2012.
“VIKAMINE–Open-Source Subgroup Discovery, Pattern Mining, and
Analytics.” In Machine Learning and Knowledge Discovery in
Databases, 842–45. Springer.
Atzmueller, Martin, and Frank Puppe. 2006. “SD-Map–a Fast
Algorithm for Exhaustive Subgroup Discovery.” In European
Conference on Principles of Data Mining and Knowledge Discovery,
6–17. Springer.
Atzmueller, Martin, Frank Puppe, and Hans-Peter Buscher. 2004.
“Towards Knowledge-Intensive Subgroup Discovery.” In
Proceedings of the Lernen - Wissensentdeckung - Adaptivit??t -
Fachgruppe Maschinelles Lernen, 111–17.
Atzmueller, M., S. Doerfel, and F. Mitzlaff. 2016.
“Description-Oriented Community Detection Using Exhaustive
Subgroup Discovery.” Information Sciences 329: 965–84.
Back, Thomas, David B Fogel, and Zbigniew Michalewicz. 1997.
Handbook of Evolutionary Computation. IOP Publishing Ltd.
Berlanga, Francisco, Marı́a José Del Jesus, Pedro González, Francisco
Herrera, and Mikel Mesonero. 2006. “Multiobjective Evolutionary
Induction of Subgroup Discovery Fuzzy Rules: A Case Study in
Marketing.” Advances in Data Mining. Applications in
Medicine, Web Mining, Marketing, Image and Signal Mining, 337–49.
Carmona, C. J., P. González, M. J. del Jesus, and F. Herrera. 2014.
“Overview on Evolutionary Subgroup Discovery: Analysis of the
Suitability and Potential of the Search Performed by Evolutionary
Algorithms.” WIREs Data Mining and Knowledge Discovery 4
(2): 87–103.
Carmona, C. J., P González, M José del Jesus, C Romero, and S Ventura.
2010. “Evolutionary Algorithms for Subgroup Discovery Applied to
e-Learning Data.” In Proceedings of Education Engineering
Conference (EDUCON), 983–90. IEEE.
Carmona, C. J., S. Ramirez-Gallego, F. Torres, E. Bernal, M. J. del
Jesus, and S. Garcia. 2012. “Web Usage Mining to Improve the
Design of an e-Commerce Website: OrOliveSur.com.” Expert
Systems with Applications, 11243–49.
Carmona, Cristobal Jose, V Ruiz-Rodado, Maria Jose del Jesus, A Weber, M
Grootveld, P González, and D Elizondo. 2015. “A Fuzzy Genetic
Programming-Based Algorithm for Subgroup Discovery and the Application
to One Problem of Pathogenesis of Acute Sore Throat Conditions in
Humans.” Information Sciences 298: 180–97.
Carmona, Cristóbal J, Charalambos Chrysostomou, Huseyin Seker, and M. J.
del Jesus. 2013. “Fuzzy Rules for Describing Subgroups from
Influenza a Virus Using a Multi-Objective Evolutionary
Algorithm.” Applied Soft Computing 13 (8): 3439–48.
Carmona, Cristóbal J, Pedro González, MJ Del Jesus, Mercedes
Navı́o-Acosta, and Luis Jiménez-Trevino. 2011. “Evolutionary Fuzzy
Rule Extraction for Subgroup Discovery in a Psychiatric Emergency
Department.” Soft Computing 15 (12): 2435–48.
Carmona, Cristóbal J, Pedro González, B Garcı́a-Domingo, MJ Del Jesus,
and J Aguilera. 2013. “MEFES: An Evolutionary Proposal for the
Detection of Exceptions in Subgroup Discovery. An Application to
Concentrating Photovoltaic Technology.” Knowledge-Based
Systems 54: 73–85.
Carmona, Cristóbal José, Pedro González, Maria J del Jesus, and
Francisco Herrera. 2010. “NMEEF-SD: Non-Dominated Multiobjective
Evolutionary Algorithm for Extracting Fuzzy Rules in Subgroup
Discovery.” IEEE Transactions on Fuzzy Systems 18 (5):
958–70.
Carmona, Cristóbal José, Pedro González, Maria Jose del Jesus, and
Francisco Herrera. 2009. “An Analysis of Evolutionary Algorithms
with Different Types of Fuzzy Rules in Subgroup Discovery.” In
Fuzzy Systems, 2009. FUZZ-IEEE 2009. IEEE International Conference
on, 1706–11. IEEE.
Charte and D. Charte, F. 2015.
Working with
multilabel datasets in R: The mldr package. The R Journal 7
(2): dec.
http://journal.r-project.org/archive/2015-2/charte-charte.pdf.
Deb, Kalyanmoy. 2001. Multi-Objective Optimization Using
Evolutionary Algorithms. Vol. 16. John Wiley & Sons.
Deb, Kalyanmoy, Samir Agrawal, Amrit Pratap, and Tanaka Meyarivan. 2000.
“A Fast Elitist Non-Dominated Sorting Genetic Algorithm for
Multi-Objective Optimization: NSGA-II.” Lecture Notes in
Computer Science 1917: 849–58.
Demšar, Janez, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar
Milutinovič, Martin Možina, et al. 2013. “Orange: Data Mining
Toolbox in Python.” The Journal of Machine Learning
Research 14 (1): 2349–53.
Eiben, AE, and JE Smith. 2003. “Introduction to Evolutionary
Algorithms.” Springer, Berlin.
Fogel, David B. 2006. Evolutionary Computation: Toward a New
Philosophy of Machine Intelligence. Vol. 1. John Wiley & Sons.
Freitas, Alex A. 2003. “A Survey of Evolutionary Algorithms for
Data Mining and Knowledge Discovery.” In Advances in
Evolutionary Computing, 819–45. Springer.
Gamberger, D., N. Lavra??, and G. Krsta??i?? 2003. “Active
Subgroup Mining: A Case Study in Coronary Heart Disease Risk Group
Detection.” Artificial Intelligence in Medicine 28 (1):
27–57.
Gamberger, Dragan, and Nada Lavrac. 2002. “Expert-Guided Subgroup
Discovery: Methodology and Application.” Journal of
Artificial Intelligence Research, 501–27.
Geng, Liqiang, and Howard J Hamilton. 2006. “Interestingness
Measures for Data Mining: A Survey.” ACM Computing Surveys
(CSUR) 38 (3): 9.
Goldberg, DE. 1989. “Genetic Algorithm in Search, Optimization and
Machine Learing.” MA: Addison-Wesley Longman Publishing Co.,
Inc.
Herrera, Francisco. 2008. “Genetic Fuzzy Systems: Taxonomy,
Current Research Trends and Prospects.” Evolutionary
Intelligence 1 (1): 27–46.
Herrera, Franciso, Cristóbal José Carmona, Pedro González, and Marı́a
José del Jesus. 2011. “An Overview on Subgroup Discovery:
Foundations and Applications.” Knowledge and Information
Systems 29 (3): 495–525.
Hüllermeier, E. 2011. “Fuzzy sets in machine
learning and data mining.” Applied Soft
Computing 11 (2): 1493–1505.
Hüllermeier, Eyke. 2005. “Fuzzy Methods in Machine Learning and
Data Mining: Status and Prospects.” Fuzzy Sets and
Systems 156 (3): 387–406.
Jesus, M. J. del, P. González, F. Herrera, and M. Mesonero. 2007.
“Evolutionary Fuzzy Rule Induction Process for Subgroup Discovery:
A Case Study in Marketing.” IEEE Transactions on Fuzzy
Systems 15 (4): 578–92.
Jin, N., P. Flach, T. Wilcox, R. Sellman, J. Thumim, and A. Knobbe.
2014. “Subgroup Discovery in Smart Electricity Meter Data.”
IEEE Transactions on Industrial Informatics 10 (2): 1327–36.
John, Holland. 1992. “Adaptation in Natural and Artificial
Systems.” MIT Press, Cambridge, MA.
Kavšek, Branko, and Nada Lavrač. 2006. “APRIORI-SD: Adapting
Association Rule Learning to Subgroup Discovery.” Applied
Artificial Intelligence 20 (7): 543–83.
Koza, John R. 1992. Genetic Programming: On the Programming of
Computers by Means of Natural Selection. Vol. 1. MIT press.
Kralj, N. Lavrac, and B. Zupan, P. 2005. Subgroup visualization. In 8th International
Multiconference Information Society(IS-05) pages.
Lavrač, Nada, Bojan Cestnik, Dragan Gamberger, and Peter Flach. 2004.
“Decision Support Through Subgroup Discovery: Three Case Studies
and the Lessons Learned.” Machine Learning 57 (1-2):
115–43.
Lavrač, Nada, Branko Kavšek, Peter Flach, and Ljupčo Todorovski. 2004.
“Subgroup Discovery with CN2-SD.” The Journal of
Machine Learning Research 5: 153–88.
Lemmerich, F., M. Ifland, and F. Puppe. 2011. “Identifying and
Presenting Influence Factors on Student Drop-Outs by Subgroup
Discovery.”
Leung, Kwong Sak, Yee Leung, Leo So, and Kim Fai Yam. 1992. “Rule
Learning in Expert Systems Using Genetic Algorithm: 1, Concepts.”
In Proceedings of the 2nd International Conference on Fuzzy Logic
and Neural Networks (IIZUKA???92), 1:201–4.
Lowerre, B. T. 1976. “The Harpy speech
recognition system.” PhD thesis, Carnegie-Mellon Univ.,
Pittsburgh, PA.
Luengo, A. Garc??a-Vico, M. D. P??rez-Godoy, and C. Carmona, J. n.d.
The influence of noise on the evolutionary
fuzzy systems for subgroup discovery. Soft Computing pages
(In Press).
Meeng, Marvin, and Arno Knobbe. 2011. “Flexible Enrichment with
Cortana–Software Demo.” In Proceedings of BeneLearn,
117–19.
Poitras, E. G., S. P. Lajoie, T. Doleck, and A. Jarrell. 2016.
“Subgroup Discovery with User Interaction Data: An Empirically
Guided Approach to Improving Intelligent Tutoring Systems.”
Educational Technology and Society 19 (2): 204–14.
Rodriguez, Daniel, Roberto Ruiz, Jose C. Riquelme, and Rachel Harrison.
2013. “A Study of Subgroup Discovery Approaches for Defect
Prediction.” Information and Software Technology 55
(10): 1810–22.
Schwefel, Hans-Paul. 1995. “Sixth-Generation Computer Technology
Series.” Evolution and Optimum Seeking. Wiley, New York.
Stiglic, G., and P. Kokol. 2012. “Discovering Subgroups Using
Descriptive Models of Adverse Outcomes in Medical Care.”
Methods of Information in Medicine 51 (4): 348–52.
Ventura and J. M. Luna, S. 2016. Pattern mining with evolutionary
algorithms.
Wrobel, Stefan. 2001. “Inductive Logic Programming for Knowledge
Discovery in Databases.” In Relational Data Mining,
74–101. Springer.
Zadeh, Lotfi A. 1975. “The Concept of a Linguistic Variable and
Its Application to Approximate Reasoning???i.” Information
Sciences 8 (3): 199–249.
Zitzler, Eckart, Marco Laumanns, and Lothar Thiele. 2002. “SPEA2:
Improving the Strength Pareto Evolutionary Algorithm for Multiobjective
Optimization.” In International Congress on Evolutionary
Methods for Design Optimization and Control with Applications to
Industrial Problems, 95–100.