Abstract
Knowledge classification is a widely used and practical approach in domain knowledge management. Automatically extracting and organizing knowledge from unstructured textual data is desirable and appealing in various circumstances. In this paper, a tidy framework for automatic knowledge classification, supported by the akc package, is introduced. With powerful support from the R ecosystem, the framework can handle multiple procedures in the data science workflow, including text cleaning, keyword extraction, synonym consolidation and data presentation. While focusing on bibliometric analysis, the package is designed to be extensible to other contexts. This paper introduces the framework and its features in detail. Specific examples are given to guide potential users and developers to participate in open science of text mining.
Co-word analysis has long been used for knowledge discovery, especially in library and information science (Callon, Rip, and Law 1986). Based on co-occurrence relationships between words or phrases, this method provides quantitative evidence of information linkages, mapping the association and evolution of knowledge over time. In conjunction with social network analysis (SNA), co-word analysis can be extended to yield more informative results, such as topic popularity (Huang and Zhao 2019) and knowledge grouping (Khasseh et al. 2017). Meanwhile, in the area of network science, many community detection algorithms have been proposed to unveil the topological structure of networks (Fortunato 2010; Javed et al. 2018). These methods have been incorporated into co-word analysis, helping to group components in the co-word network. Currently, co-word analysis based on community detection is flourishing across various fields, including information science, social science and medical science (C.-P. Hu et al. 2013; J. Hu and Zhang 2015; Leung, Sun, and Bai 2017; Baziyad et al. 2019).
For implementation, interactive software applications, such as CiteSpace (Chen 2006) and VOSviewer (Van Eck and Waltman 2010), have provided freely available toolkits for automatic co-word analysis, making this technique even more popular. Interactive software applications are generally friendlier to users, but they might not be flexible enough for the whole data science workflow. In addition, manual adjustments can vary between analysts, bringing additional risks to research reproducibility. In this paper, we design a flexible framework for automatic knowledge classification and present an open-source software package, akc, built on the R ecosystem for implementation. Based on community detection in the co-occurrence network, the package can conduct unsupervised classification of the knowledge represented by extracted keywords. Moreover, the framework can handle tasks such as data cleaning and keyword merging in the upstream of the data science workflow, whereas in the downstream it provides both summarized tables and visualizations of the knowledge grouping. While the package was first designed for academic knowledge classification in bibliometric analysis, the framework is general enough to benefit a broader audience interested in text mining, network science and knowledge discovery.
Classification can be regarded as a meaningful clustering of experience, turning information into structured knowledge (Kwasnik 1999). In bibliometric research, this method has frequently been used to group domain knowledge represented by author keywords, usually as a part of co-word analysis, keyword analysis or knowledge mapping (He 1999; C.-P. Hu et al. 2013; Leung, Sun, and Bai 2017; Li, Ma, and Qu 2017; Wang and Chai 2018). While all are named (unsupervised) classification or clustering, the algorithms behind them vary widely. For instance, some studies have utilized hierarchical clustering to group keywords into different themes (J. Hu and Zhang 2015; Khasseh et al. 2017), whereas studies applying VOSviewer have adopted a weighted variant of modularity-based clustering with a resolution parameter to identify smaller clusters (Van Eck and Waltman 2010). In the akc framework, we utilize the modularity-based clustering method known as community detection in network science (Newman 2004; Murata 2010). These functions are supported by the igraph package (Csardi, Nepusz, et al. 2006). The main detection algorithms implemented in akc include Edge betweenness (Girvan and Newman 2002), Fastgreedy (Clauset, Newman, and Moore 2004), Infomap (Rosvall and Bergstrom 2007; Rosvall, Axelsson, and Bergstrom 2009), Label propagation (Raghavan, Albert, and Kumara 2007), Leading eigenvector (Newman 2006), Multilevel (Blondel et al. 2008), Spinglass (Reichardt and Bornholdt 2006) and Walktrap (Pons and Latapy 2005). The details of these algorithms and their comparisons have been discussed in previous studies (Sousa and Zhao 2014; Yang, Algesheimer, and Tessone 2016; Garg and Rani 2017; Amrahov and Tugrul 2018).
In practical application, the classification result is susceptible to data variation. The upstream procedures, such as information retrieval, data cleaning and word sense disambiguation, play vital roles in automatic knowledge classification. For bibliometric analysis, the author keyword field provides a valuable source of scientific knowledge. It is a good representation of domain knowledge and can be used directly for analysis. In addition, such collections of keywords from papers published in specific fields can provide a professional dictionary for information retrieval, such as keyword extraction from the raw text of the title, abstract and full text of the literature. Beyond automatic knowledge classification based on community detection in the keyword co-occurrence network, the akc framework also provides utilities for keyword-based knowledge retrieval, text cleaning, synonym merging and data visualization in the data science workflow. These tasks might have different requirements in specific contexts. Currently, akc concentrates on keyword-based bibliometric analysis of scientific literature. Nonetheless, the R ecosystem is versatile, and the popular tidy data framework is flexible enough to extend to various data science tasks in other fields (Wickham et al. 2014; Wickham and Grolemund 2016; Silge and Robinson 2017), which benefits both end-users and software developers. In addition, when users have more specific needs in their tasks, they can easily seek other powerful facilities from the R community. For instance, akc provides functions to extract keywords using an n-gram model (utilizing facilities provided by tidytext), but skip-gram modelling is not currently supported. This functionality, on the other hand, is provided by the tokenizers (Mullen et al. 2018) and quanteda (Benoit et al. 2018) packages in R. A greater picture of natural language processing (NLP) in R can be found in the CRAN Task View: Natural Language Processing.
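To illustrate how such gaps can be filled from the wider ecosystem, the sketch below uses the tokenizers package to produce skip-grams; the tokenize_skip_ngrams function and its n and k parameters belong to that package, and the sample sentence is invented for illustration.
library(tokenizers)
# Skip-gram tokenization (not available in akc): bigrams that may skip up to
# one intervening word; the input sentence is purely illustrative.
tokenize_skip_ngrams("community detection in keyword networks", n = 2, k = 1)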
An overview of the framework is given in Figure @ref(fig:fig1). Note that the name akc refers both to the overall framework for automatic keyword classification and to the released R package presented in this paper. The whole workflow can be divided into four procedures: (1) Keyword extraction (optional); (2) Keyword preprocessing; (3) Network construction and clustering; (4) Results presentation.
The design of akc framework. Generally, the framework includes four steps, namely: (1) Keyword extraction (optional); (2) Keyword preprocessing; (3) Network construction and clustering; (4) Results presentation.
In bibliometric meta-data entries, the textual information of title, abstract and keywords is usually provided for each paper. If author keywords are available, there is no need for information retrieval, and this procedure can be skipped, starting directly from keyword preprocessing. However, when the keyword field is missing, the keywords need to be extracted from the raw text of the title, abstract or full text with an external dictionary. At other times, one might want to obtain more keywords and their co-occurrence relationships from each entry. In such cases, the keyword field can serve as an internal dictionary for information retrieval in the provided raw text.
Figure @ref(fig:fig2) displays an example of the keyword extraction procedure. First, the raw text is split into sub-sentences (clauses), which suppresses the generation of cross-clause n-grams. Then the sub-sentences are tokenized into n-grams. The n can be specified by the users; inspecting the average number of words in keyword phrases might help decide the maximum value of n. Finally, a filter is applied: only tokens that appear in the user-defined dictionary are retained for further analysis. The whole keyword extraction procedure can be implemented automatically with the keyword_extract function in akc.
An example of the keyword extraction procedure. The raw text is first divided into sub-sentences, then tokenized into n-grams, and the target keywords are extracted based on a dictionary. The letters are automatically turned to lower case.
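To make the procedure concrete, a minimal sketch is given below. The two-entry text table and the tiny dictionary are invented for illustration, and keyword_extract is assumed to take the id, text and dict arguments used in the worked example later in this paper.
library(akc)
library(tibble)
# Toy input: two short documents and a two-term dictionary (both invented here)
toy_text = tibble(id = 1:2,
                  abstract = c("Community detection in keyword networks.",
                               "Keyword networks support knowledge discovery."))
toy_dict = make_dict(c("community detection", "keyword networks"))
# Extract only the n-grams that appear in the dictionary
toy_text %>%
  keyword_extract(id = "id", text = "abstract", dict = toy_dict)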
In practice, the textual contents are seldom clean enough to be analysed directly. Therefore, the upstream data cleaning process is inevitable. In the keyword preprocessing procedure of the akc framework, the cleaning part takes care of details such as converting letters to lower case and removing parentheses and the contents inside them (optional). In the merging part, akc helps merge synonymous phrases according to their lemmas or stems. While using lemmatization or stemming directly might yield abnormal knowledge tokens, in akc we have designed a conversion rule to tackle this problem: we first obtain the lemmatized or stemmed form of each keyword, group the keywords by their lemma or stem, and then use the most frequent keyword in each group to represent the original keywords. This step is realized by the keyword_merge function in the akc package. An example can be found in Table @ref(tab:tab1-1), with a code sketch following it. After keyword merging, there might still be too many keywords included in the analysis, which poses a great burden for computation in the subsequent procedures. Therefore, a filter should be applied here: it could exclude infrequent terms, extract top TF-IDF terms, or use any criterion that meets the need. Last, a manual validation should be carried out to ensure the final data quality.
| ID | Original form | Lemmatized form | Merged form |
|---|---|---|---|
| 1 | higher education | high education | higher education |
| 2 | higher education | high education | higher education |
| 3 | high educations | high education | higher education |
| 4 | higher educations | high education | higher education |
| 5 | high education | high education | higher education |
| 6 | higher education | high education | higher education |
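A minimal sketch of the merging step is shown below; the toy table mirrors Table @ref(tab:tab1-1), and the default lemma-based merging of keyword_merge is assumed.
library(akc)
library(tibble)
# Toy keyword table mirroring Table 1 (values invented for illustration)
toy_keywords = tibble(
  id = 1:6,
  keyword = c("higher education", "higher education", "high educations",
              "higher educations", "high education", "higher education")
)
# Variants sharing the lemma "high education" are replaced by the most
# frequent original form, "higher education"
toy_keywords %>%
  keyword_merge()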
Based on the keyword co-occurrence relationships, the keyword pairs form an edge list for the construction of an undirected network. Then the facilities provided by the igraph package automatically group the nodes (representing the keywords). This procedure can be achieved using the keyword_group function in akc.
Currently, there are two kinds of output presented by akc. One is a summarized result, namely a table with the group number and keyword collections (with frequencies attached). The other is network visualization, which has two modes. The local mode provides a keyword co-occurrence network by group (using facets), whereas the global mode displays the whole network structure. Note that one might include a huge number of keywords and build a vast network, but for presentation the users can choose how many keywords from each group are displayed. More details can be found in the following sections.
The framework could never have been built without the powerful support provided by the R community. The akc package was developed in the R environment, and the main packages imported by the framework include data.table (Dowle and Srinivasan 2021) for high-performance computing, dplyr (Wickham et al. 2022) for tidy data manipulation, ggplot2 (Wickham 2016) for data visualization, ggraph (Pedersen 2021) for network visualization, ggwordcloud (Le Pennec and Slowikowski 2019) for word cloud visualization, igraph (Csardi, Nepusz, et al. 2006) for network analysis, stringr (Wickham 2019) for string operations, textstem (Rinker 2018) for lemmatizing and stemming, tidygraph (Pedersen 2022) for network data manipulation and tidytext (Silge and Robinson 2016) for tidy tokenization. A better understanding of these R packages can help users utilize alternative functions, so as to complete more specific and complex tasks. Hopefully, the users might also become developers of the framework in the future.
This section shows how akc can be used in a real case. A collection of bibliometric data from the R Journal, covering 2009 to 2021, is used in this example. The data can be accessed in the GitHub repository. Only the akc package is used in this workflow. First, we load the package and import the data into the R environment.
library(akc)
rj_bib = readRDS("./rj_bib.rds")
rj_bib
## # A tibble: 568 × 4
## id title abstract year
## <int> <chr> <chr> <dbl>
## 1 1 Aspects of the Social Organization and Trajectory of the… Based p… 2009
## 2 2 asympTest: A Simple R Package for Classical Parametric S… asympTe… 2009
## 3 3 ConvergenceConcepts: An R Package to Investigate Various… Converg… 2009
## 4 4 copas: An R package for Fitting the Copas Selection Model This ar… 2009
## 5 5 Party on! Random … 2009
## # ℹ 563 more rows
rj_bib is a data frame with four columns: id (paper ID), title (title of the paper), abstract (abstract of the paper) and year (publication year of the paper). Papers in the R Journal do not contain a keyword field, thus we have to extract the keywords from the title or abstract field (first step in Figure @ref(fig:fig1)). Here in our case, we use the abstract field as our data source. In addition, we need a user-defined dictionary to extract the keywords; otherwise all the n-grams (meaningful or meaningless) would be extracted and the results would include redundant noise.
# import the user-defined dictionary
rj_user_dict = readRDS("./rj_user_dict.rds")
rj_user_dict
## # A tibble: 627 × 1
## keyword
## <chr>
## 1 seasonal-adjustment
## 2 unit roots
## 3 transformations
## 4 decomposition
## 5 combination
## # ℹ 622 more rows
Note that the dictionary should be a data.frame with only one column named "keyword". The user can also use the make_dict function to build the dictionary data.frame from a character vector. This function removes duplicated phrases, converts them to lower case and sorts them, which potentially improves the efficiency of the following processes.
rj_dict = make_dict(rj_user_dict$keyword)
With the bibliometric data (rj_bib) and dictionary data
(rj_dict), we could start the workflow provided in Figure
@ref(fig:fig1).
In this step, we need a bibliometric data table with only two informative columns, namely the paper ID (id) and the raw text field (in our case, abstract). The dict parameter is also specified so that only keywords appearing in the user-defined dictionary are extracted. The implementation is very simple.
rj_extract_keywords = rj_bib %>%
  keyword_extract(id = "id", text = "abstract", dict = rj_dict)
By default, only phrases of one to four words are extracted as keywords. The user can change this range using the n_min and n_max parameters of the keyword_extract function. There is also a stopword parameter, allowing users to exclude specific keywords from the extracted phrases. The output of keyword_extract is a data.frame (a tibble, the tbl_df class provided by the tibble package) with two columns, namely the paper ID (id) and the extracted keyword (keyword).
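For instance, a hedged sketch of the same call with an explicit n-gram range and a stopword list might look as follows; the parameter values are illustrative only.
# Restrict extraction to bigrams and trigrams and drop two illustrative terms
rj_extract_keywords_ngram = rj_bib %>%
  keyword_extract(id = "id", text = "abstract", dict = rj_dict,
                  n_min = 2, n_max = 3,
                  stopword = c("r package", "data"))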
For the preprocessing part, keyword_clean and keyword_merge are applied in the cleaning part and the merging part respectively. In the cleaning part, the keyword_clean function would: 1) split the text at separators (if no separators exist, skip); 2) remove the contents in parentheses (including the parentheses, optional); 3) remove white spaces from the start and end of the strings and reduce repeated white spaces inside the strings; 4) remove all empty character strings and pure number sequences (optional); 5) convert all letters to lower case; 6) lemmatize the keywords (optional). The merging part has been illustrated in the previous section (see Table @ref(tab:tab1-1)) and thus is not explained again. In the tidy workflow, the preprocessing is implemented via:
rj_cleaned_keywords = rj_extract_keywords %>%
  keyword_clean() %>%
  keyword_merge()
No parameters are used in these functions because akc has been designed to input and output tibbles with consistent column names. If the users have data tables with different column names, they can specify them in the arguments (id and keyword) provided by the functions. More details can be found in the help documents (use ?keyword_clean and ?keyword_merge in the console).
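As a hedged sketch, if the input table used different column names (say paper_id and term, both hypothetical), they could be passed explicitly:
# Hypothetical table `my_keywords` with columns `paper_id` and `term`
my_cleaned_keywords = my_keywords %>%
  keyword_clean(id = "paper_id", keyword = "term") %>%
  keyword_merge(id = "paper_id", keyword = "term")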
To construct a keyword co-occurrence network, only a data table with two columns (paper ID and keyword) is needed. All the details are taken care of by the keyword_group function. However, the user can specify: 1) the community detection function (via the com_detect_fun argument); 2) the filter rule of keywords according to frequency (via the top or min_freq argument, or both). In our example, we use the default settings (utilizing the Fastgreedy algorithm; only the top 200 keywords by frequency are included).
rj_network = rj_cleaned_keywords %>%
  keyword_group()
The output object rj_network is of the tbl_graph class supported by tidygraph, a tidy data format containing the network data. Based on this data, we can present the results in various forms in the next section.
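If a non-default configuration is preferred, the arguments mentioned above can be adjusted. The following is a hedged sketch, assuming the grouping functions exported by tidygraph (here group_infomap) can be passed to com_detect_fun; the values are illustrative only.
# Use the Infomap algorithm and keep only the top 100 keywords by frequency
rj_network_infomap = rj_cleaned_keywords %>%
  keyword_group(com_detect_fun = tidygraph::group_infomap, top = 100)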
Currently, there are two major ways to display the classified results in akc, namely network and table. A fast way to obtain the network visualization is the keyword_vis function:
rj_network %>%
  keyword_vis()
Network visualization for knowledge classification of R Journal (2009-2021). The keywords were automatically classified into three groups based on Fastgreedy algorithm. Only the top 10 keywords by frequency are displayed in each group.
In Figure @ref(fig:fig4), the keyword co-occurrence network is clustered into three groups. The size of the nodes is proportional to keyword frequency, while the transparency of the edges is proportional to the co-occurrence strength between keywords. For each group, only the top 10 keywords by frequency are shown in each facet. If the user wants to dig into Group 1, keyword_network can be applied. Also, the max_nodes parameter can be used to control how many nodes are shown (in our case, we show 20 nodes in the visualization displayed in Figure @ref(fig:fig5)).
rj_network %>%
  keyword_network(group_no = 1, max_nodes = 20)
Focus on one cluster of the knowledge network of the R Journal (2009-2021). The top 20 keywords by frequency are shown in the displayed group.
The other form of display is the table, which can be obtained with keyword_table via:
rj_table = rj_network %>%
  keyword_table()
This would return a data.frame with two columns (see Table @ref(tab:tab2-1)), namely the group number and the keywords (by default, only the top 10 keywords by frequency would be displayed, and the frequency information is attached).
| Group | Keywords (TOP 10) |
|---|---|
| 1 | r package (238); algorithms (117); time (109); software (93); regression (75); number (72); features (60); sets (45); selection (41); simulation (40) |
| 2 | parameters (98); inference (65); framework (58); information (51); distributions (48); performance (47); probability (45); design (44); likelihood (41); optimization (31) |
| 3 | package (505); model (310); tools (140); tests (48); errors (46); multivariate (42); system (41); hypothesis (18); maps (16); assumptions (15) |
Word cloud visualization is also supported by akc via the ggwordcloud package, and can be implemented with the keyword_cloud function.
could be implemented by using keyword_cloud function.
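A minimal sketch is given below, assuming keyword_cloud accepts the grouped network object produced by keyword_group.
# Word cloud of the classified keywords (default settings assumed)
rj_network %>%
  keyword_cloud()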
In our example, we observe that the R Journal has a large focus on introducing R packages (Group 1 and Group 3 contain "r package" and "package" respectively). Common statistical subjects mentioned in the R Journal include "regression" (in Group 1), "optimization" (in Group 2) and "multivariate" (in Group 3). While our example provides a preliminary analysis of knowledge classification in the R Journal, an in-depth exploration could be carried out with a more professional dictionary containing more relevant keywords, and more preprocessing could be applied according to the application scenario (e.g. "r package" and "package" could be merged into one keyword, and unigrams could be excluded if we consider them to carry indistinct information).
The core functionality of the akc framework is to automatically group the knowledge pieces (keywords) using modularity-based clustering. Because this process is unsupervised, it can be difficult to evaluate the outcome of classification. Nevertheless, the default setting of the community detection algorithm was selected after empirical benchmark tests. It was found that: 1) the Edge betweenness and Spinglass algorithms are the most time-consuming; 2) the Edge betweenness and Walktrap algorithms could potentially find more local clusters in the network; 3) Label propagation could hardly divide the keywords into groups; 4) Infomap has a high standard deviation of node number across groups. In the end, Fastgreedy was chosen as the default community detection algorithm in akc, because its performance is relatively stable and the number of groups increases proportionally with the network size.
Though akc currently focuses on automatic knowledge classification based on community detection in keyword co-occurrence networks, the framework is rather general for many natural language processing problems. One could utilize part of the framework to complete specific tasks, such as word consolidation (using keyword merging) and n-gram tokenization (using keyword extraction with a null dictionary), then export the tidy table and work in another environment; a sketch is given below. As long as the data follows the rule of the tidy data format (Wickham et al. 2014; Silge and Robinson 2017), the framework can easily be decomposed and applied in various circumstances. For instance, by considering the nationalities of authors as keywords, the akc framework could also investigate international collaboration behavior in a specific domain.
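As a hedged sketch of this decomposed usage, keyword extraction can be run without a dictionary (dict left as NULL) to obtain a plain tidy table of n-grams that can then be exported:
# Pure n-gram tokenization: no dictionary filter, all extracted n-grams kept
rj_all_ngrams = rj_bib %>%
  keyword_extract(id = "id", text = "abstract", dict = NULL)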
In the meantime, the akc framework is still in active development, trying new algorithms to carry out better unsupervised knowledge classification in the R environment. Expected new directions include more community detection functions, new clustering methods, better visualization settings, etc. Note that besides the topology-based community detection approach, which considers the graph structure of the network, there is also a topic-based approach that considers the textual information of the network nodes (Ding 2011), such as hierarchical clustering (Newman 2003), latent semantic analysis (Landauer, Foltz, and Laham 1998) and Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003). These methods are also accessible in R; the relevant packages can be found in the CRAN Task View: Natural Language Processing. With the tidy framework, akc can draw on more resources from the modern R ecosystem and move forward to create better reproducible open science schemes in the future.
In this paper, we have proposed a tidy framework for automatic knowledge classification supported by a collection of R packages integrated by akc. While focusing on data mining based on keyword co-occurrence networks, the framework also supports other procedures in the data science workflow, such as text cleaning, keyword extraction and synonym consolidation. Though at the current stage it aims to support bibliometric research, the framework is flexible enough to extend to various tasks in other fields. Hopefully, this work can attract more participants from both the R community and academia, so as to contribute to flourishing open science in text mining.
This study is funded by The National Social Science Fund of China “Research on Semantic Evaluation System of Scientific Literature Driven by Big Data” (21&ZD329). The source code and data for reproducing this paper can be found at: https://github.com/hope-data-science/RJ_akc.