Abstract
MessagePack, or MsgPack for short, or when referring to the implementation, is an efficient binary serialization format for exchanging data between different programming languages. The RcppMsgPack package provides R with both the MessagePack C++ header files, and the ability to access, create and alter MessagePack objects directly from R. The main driver functions of the R interface are two functionsmsgpack_pack
and msgpack_unpack. The function msgpack_pack
serializes R objects to a raw MessagePack message. The function
msgpack_unpack de-serializes MessagePack messages back into
R objects. Several helper functions are available to aid in
processing and formatting data including msgpack_simplify,
msgpack_format and msgpack_map.
MessagePack (or MsgPack for short, or when referring to the actual implementation) is a binary serialization format made for exchanging data between different programming languages (Furuhashi 2018). Unlike other related formats such as JSON, MsgPack is a binary format—which makes it more efficient in terms of (disk or memory) space, transfer speeds (which is increasingly important for large data sets across networks) and potentially also precision (as textual representation rarely goes to the length of binary precision). As shown on the project homepage at https://msgpack.org, several major projects including Redis, Pinterest, Fluentd and Treasure Data utilize MsgPack to transfer data or to represent internal data structures (Furuhashi 2018). Other binary serialization formats similar to MsgPack include BSON (MongoDB 2018) and ProtoBuf (Google 2018) which have their own advantages and disadvantages, such as serialization speed, memory usage, compression and requirement of descriptive schemas (Hamida, Hamida, and Ahmed 2015; Dawborn and Curran 2014). R support for these formats is available via the packages mongolite (Ooms 2014) and RProtoBuf (Eddelbuettel, Stokely, and Ooms 2016); Redis is also implemented in R through the RcppRedis package (Eddelbuettel 2018). RcppMsgPack (Ching et al. 2018) brings support for the MsgPack specification to R.
The MsgPack specification describes a number of common data type: Booleans, Integers, Floats, Strings, Binary data, Arrays, Maps, and user-defined extension types, and has been implemented in most major programming languages. RcppMsgPack aims to provide an efficient, and easy to use implementation by relying on the official C++ MsgPack code and the Rcpp package (Eddelbuettel 2013; Eddelbuettel et al. 2018). The package provides users with the MsgPack header files which can be used to more directly integrate MsgPack into R projects through C++ code. It also provides the ability to serialize and de-serialize data directly to and from R (e.g., through pipes, file handlers, sockets or binary object files). These functionalities can be used to efficiently transfer data between various programming languages and between separate R instances.
In this manuscript, we describe the main interface functions used to serialize and deserialize MsgPack messages, and the conversion between R data types and MsgPack data types. We describe helper functions contained in RcppMsgPack and describe several use cases and examples of how RcppMsgPack can be used in practice, benchmarking several common approaches for transferring data between processes.
The functions msgpack_pack and
msgpack_unpack allow serialization and de-serialization of
R objects respectively (Figure 1). Here,
msgpack_pack takes in any number of R objects and
generates a single serialized message in the form of a raw vector.
Conversely, msgpack_unpack takes a serialized message as
input, and returns the R object(s) contained in the message.
Moreover, msgpack_format is a helper function to properly
format R objects for input, and msgpack_simplify
is a helper function to simplify output from a MsgPack
conversion. One of the main goals of MsgPack is the transfer of
data across processes and/or hosts. Therefore, we also define two helper
functions msgpack_write and msgpack_read which
facilitate writing and reading of MsgPack objects to files,
pipes or any connection object.
The data types in MsgPack do not directly map on to
R data types, more so than other languages. For example, basic
R “atomic” types such as integers and strings are inherently
vectorized, which is not true in C++, Python or most other
languages. I.e., in R there is no distinction between
a single integer and a vector of integers of length 1. R also
has multiple non-value types, such as NULL or
NA for integer, string, numeric, etc. Because of these
complexities, the conversion processes using these interface functions
are described in detail below.
R integers are converted into MsgPack integers,
which are automatically reduced in size, depending on the value of the
integer. MsgPack integers are converted back into R
integers. Because R does not natively support 64 bit integers,
whereas MsgPack supports integers up to 64 unsigned bits in
value, MsgPack integers exceeding signed 32 bits supported by
R are coerced to R numeric values, with potential loss
of precision. The integer NA value in R is
represented by its bit value in C++ (0x80000000), and requires
no special treatment.
R numeric (i.e., doubles) variables
conform to IEEE 754 double-precision standards (IEEE Standards Committee 2008), and also
require no special treatment. The numeric NA value is a special case of
NaN values, and is serialized by its bit representation.
R strings (i.e., objects of class
character) are converted to MsgPack strings.
Because C++ and MsgPack do not have missing values for
strings, NA characters are converted into MsgPack Nil (similar
to NULL in R).
R logical values are converted into
MsgPack bool. Again, because NA logical values do not exist in
C++ or MsgPack, NA logical values are converted into
MsgPack Nil.
R raw vectors are converted into
MsgPack bin. Raw vectors with the “EXT” integer attribute are
converted into MsgPack extension types. The EXT attribute
should be a positive integer, as negative values are reserved for
official extensions.
Currently, the MsgPack specifications includes one official
extension type: timestamps. Timestamps are a MsgPack extension
type with extension value -1 and can be converted to and from R
POSIXct objects using msgpack_timestamp_decode
and msgpack_timestamp_encode respectively. MsgPack
timestamps can encode nanosecond precision. R
POSIXct objects rely on numeric, and therefore
conversion may have some loss of precision (unless a package such as nanotime
(Eddelbuettel and Silvestri 2018) is used,
which is left as a future extension).
MsgPack specifications define two container objects: arrays and maps. MsgPack arrays are a sequential container object. The length of the array is defined in its message header. Arrays can contain any other MsgPack types, including other arrays or maps.
MsgPack arrays are naturally analogous to R
unnamed list objects. However, because lists have a large
memory footprint, R atomic vectors (with length of 0 or greater
than or equal to 2) are also allowed as input for serialization to
arrays.
MsgPack maps are an ordered sequence of key and value pairs,
where each key and value can be any MsgPack object. There is no
requirement for unique keys. Maps do not have an analogous data type in
R . Therefore, maps are implemented by creating an object of
class map, which is also a data.frame with
key and value columns. As input to
serialization, these columns can also be lists, and can therefore
contain any other R object, and not only a single type. The
function msgpack_map is a simple helper function that takes
two lists and returns a map which can be serialized into a
MsgPack object with msgpack_pack.
In order to support as much generality as possible in serialization
and deserialization, the use of lists to represent arrays and maps is
necessary. However, it is often the case in R that one would
want to deal with large vectors or matrices of a single type without the
computational and memory overhead of lists. Two approaches are given to
deal with this type of scenario. msgpack_simplify can be
used after a call to msgpack_unpack to recursively simplify
lists to vectors when only a single type is included within a list. (For
lists of characters or logicals, this may also include NULLs.) Secondly,
msgpack_unpack can be called with the
simplify=TRUE parameter, which performs the same task as
msgpack_simplify within C++, and is therefore much
faster. The second approach can drastically improve speed and memory
usage compared to the first approach.
Complex objects or data structures, such as trees, often do not fit into R data types because a tree data structure does not map nicely to an R vector, data.frame, matrix, etc. Storing such a complex object as a MsgPack message will be more performant in terms of serialization speed and memory usage.
The example below demonstrates how MsgPack headers can be
integrated into a standard Rcpp workflow. In this example, a
prefix tree is created for nucleotide sequences, and is serialized
through MsgPack to create a persistant tree object in the form
of a raw vector in R. The stored tree can be saved to disk,
unpacked within R directly using msgpack_unpack or
it can be reconstructed into the prefix tree within C++ using
the MsgPack C++ interface. The code below defines a
structure for storing the Prefix tree data and a function for
constructing the tree using sequence data input from R and
saving it as a MsgPack object:
struct Node {
std::shared_ptr<Node> parent;
std::set<int> sequence_idx;
std::map< std::string, std::shared_ptr<Node> > children;
};
struct std::shared_ptr<Node>
NewNode(std::shared_ptr<Node> parent,
std::string name,
std::set<int> sequence_idx) {
std::shared_ptr<Node> node = std::shared_ptr<Node>(new Node);
node->sequence_idx = sequence_idx;
if (parent) {
parent->children.insert(std::pair<std::string,
std::shared_ptr<Node> >(name, node));
}
return node;
}
void packTree(std::shared_ptr<Node> node,
msgpack::packer<std::stringstream>& pkr) {
pkr.pack_array(2);
std::set<int> sequence_idx = node->sequence_idx;
std::vector<int> vs(sequence_idx.begin(), sequence_idx.end());
pkr.pack(vs);
std::map< std::string, std::shared_ptr<Node> > children = node->children;
pkr.pack_map(children.size());
for (auto const& x : children) {
pkr.pack(x.first);
packTree(x.second, pkr);
}
}
Rcpp::RawVector create_prefix_tree(std::vector<std::string> clone_sequences) {
std::shared_ptr<Node> root;
root = NewNode(std::shared_ptr<Node>(nullptr), "^", {});
for(int i=0; i<clone_sequences.size(); i++) {
std::shared_ptr<Node> current_node = root;
for(int j=0; j < clone_sequences[i].size(); j++) {
std::string nuc = clone_sequences[i].substr(j,1);
if(current_node->children.count(nuc) == 1) {
current_node = current_node->children[nuc];
if(j == clone_sequences[i].size() - 1) {
current_node->sequence_idx.insert(i);
}
} else {
if(j == clone_sequences[i].size() - 1) {
current_node = NewNode(current_node, nuc, {i});
} else {
current_node = NewNode(current_node, nuc, {});
}
}
}
}
std::stringstream buffer;
msgpack::packer<std::stringstream> pk(&buffer);
packTree(root, pk);
std::string bufstr = buffer.str();
Rcpp::RawVector rawbuffer(bufstr.begin(), bufstr.end());
return rawbuffer;
}
The create_prefix_tree function is an C++
function that returns a raw vector, which is a serialization of the
prefix tree. From R, the prefix tree can be initialized and
serialized through calling the Rcpp function.
tree <- create_prefix_tree(c("AGCT", "AGCC", "AGC", "ATG", "GACC", "GTCT"))
Because the resulting message is a standard MsgPack object,
the tree can be unpacked directly in R using
msgpack_unpack. Additionally, the following code
reconstructs the prefix tree from the MsgPack message from
within C++ using the C++ interface:
void print_node(msgpack::object & node_obj, std::string name) {
std::vector<msgpack::object> node_obj_array;
node_obj.convert(node_obj_array);
msgpack::object_kv* p = node_obj_array[1].via.map.ptr;
msgpack::object_kv* pend = node_obj_array[1].via.map.ptr +
node_obj_array[1].via.map.size;
std::cout << name;
if(p != pend) {
std::cout << "(";
for (; p < pend; ++p) {
std::string name_child = p->key.as<std::string>();
msgpack::object node_child_obj = p->val.as<msgpack::object>();
print_node(node_child_obj, name_child);
if(p < (pend-1)) std::cout << ",";
}
std::cout << ")";
}
}
void print_newick_tree(std::vector<unsigned char> packed_tree) {
std::string message(packed_tree.begin(), packed_tree.end());
msgpack::object_handle oh;
msgpack::unpack(oh, message.data(), message.size());
msgpack::object obj = oh.get();
print_node(obj, "Root");
}
Called from within R, the tree is printed to the console:
tree_nested_list <- msgpack_unpack(tree, simplify=F)
print_newick_tree(tree)
[1] Root(A(G(C(C,G,T),G)),G(A(C(C)),T(C(A,T))))
Following is an example of how one can create a MsgPack
serialized message as a binary file, and read it from the internet into
R. The first step would be to read in the data to R as
normal, and serialize the data using the msgpack_pack
function.
dat <- read.csv(paste0("https://raw.githubusercontent.com/vincentarelbundock/",
"Rdatasets/master/csv/mosaicData/Birthdays.csv"))
mp <- with(dat, msgpack_pack(X, state, year, month, day, date, wday, births))
The msgpack_pack function returns a raw vector, which
can be written to disk using the writeBin function or the
helper function msgpack_write.
msgpack_write(mp, file="birthdays_msgpack.mp")
A MsgPack object can be read directly from the internet,
e.g., using the GET function from the httr package
(Wickham 2018). Subsequently, the
msgpack_unpack function can be used to unpack the object to
its original values.
md <- GET("http://travers.im/birthdays_msgpack.mp")
mp <- md$content
mu <- msgpack_unpack(mp)
To evaluate the performance of MsgPack serialization, we benchmarked the transfer of the MNIST (LeCun, Cortes, and Burges 2010) and CIFAR-10 (Krizhevsky 2009) datasets to and from Python and R. We compared this approach to writing and reading in CSV format, and to writing and reading using feather (Wickham et al. 2016), a cross-platform library and specification for efficiently handling tabular data. (The feather package is no longer actively developed, and will be superseded by Apache Arrow (Apache Arrow Developers 2018), a more general cross-language development platform for working with in-memory data in a standardized language-independent columnar memory format. However, Arrow is not yet available for R.) In Python, serialization of the data was performed using the msgpack package and written to disk in binary format:
xb = msgpack.packb(x)
with open("/tmp/dataset.mp", "wb") as file:
file.write(xb)
Conversely, the unpackb function was used when
benchmarking the transfer time from R to Python:
with open("/tmp/dataset.mp", "rb") as file:
file.write(xb)
x = msgpack.unpackb(file.read())
In R, we used the readBin function followed by
msgpack_unpack function to deserialize the dataset.
Alternatively, one could use the helper function
msgpack_read.
xb <- readBin(con = "/tmp/dataset.mp", "raw",
n=file.info("/tmp/dataset.mp")$size)
x <- msgpack_unpack(xb, simplify=T)
Subsequently, the data can be re-serialized and written to disk using
the msgpack_pack function followed by
writeBin:
xb <- msgpack_pack(x)
writeBin(xb,"/tmp/dataset.mp", useBytes=T)
For CSV format, we used the fread and
fwrite functions in the data.table
package (Dowle et al. 2018) to write and
read CSV tables in R. In Python, we used the package
numpy savetxt and loadtxt functions
(Walt, Colbert, and Varoquaux 2011). For
the feather format, we used the write_feather and
read_feather functions within the feather package
in R and Python. All approaches were benchmarked after clearing
the system cache and replicated 5 times (Figure 2).
Serialization and de-serialization of objects from MsgPack objects was similar between R and Python, with a slight speed advantage going to Python. Reading MsgPack objects was generally considerably faster than reading CSV format in both R and Python. Comparing MsgPack to feather, feather was generally faster. As feather is designed for columnar binary data, it does not use a header for every element and therefore has an advantage over the (more general) MsgPack in this context.
One surprising result is in writing of CSVs in R. Using data.table, writing the MNIST dataset was faster than MsgPack, although not quite as fast as feather. The MNIST dataset is formatted as a single floating point value in per pixel by TensorFlow (Abadi et al. 2016). When writing floating point data as CSV, both numpy and the data.table package truncate floating point numbers by default, which causes loss of precision.
To compare the memory usage of MsgPack to native R
serialization, we serialized the datasets found in the
datasets R package (Figure 3). For R
serialization, this was done by calling the serialize
function for native R serialization or calling the
msgpack_pack function for MsgPack serialization.
For compression, the memCompress function was called on the
results. Some datasets contained formulas, or other R-specific
attributes that can’t be stored in MsgPack; these attributes
were removed prior to serialization and compression.
Memory usage was generally very similar between MsgPack and native R. Although there were some differences in raw serialization, the difference seems to be less apparent after compression. For some uncompressed serializations, MsgPack significantly outperformed R in memory usage. This is attributable to efficient integer storage: MsgPack stores variable size integers depending on the integer magnitude, and can be as low as 8 bits. Native R serialization uses the modified External Data Representation (XDR) standard (Srinivasan 1995), which uses a fixed 32 bits for integers. Furthermore, attributes of each dataset are stored as lists internally, which can increase the relative overhead for small datasets, as R lists are not memory-efficient objects.
On the other hand, every MsgPack element has a short header of several bits, which increases its memory overhead, particularly for large vectors. This overhead is apparent for larger datasets, where R typically is more memory efficient.
As mentioned, one situation where MsgPack serialization is much more efficient than R serialization is how it handles list objects. Because R lists have a large memory overhead, both compression times and writing/reading to disk are faster using the MsgPack format. To show this, we generated a contrived list of DNA sequences as follows:
x <- replicate(10000, {
replicate(sample(5,1), {
paste(sample(c("G","C","A","T"), size=sample(50,1), replace=T), collapse="")
})
})
If the data structure of the object is known, using the C++ headers directly can speed up serialization and de-serialization by avoiding the logic involved in serializing generic data structures, although the speed-up is not large. Compression and writing to disk can also be performed directly within C++, for example, by using the zlib C++ library (Gailly and Adler 2018) to perform standard DEFLATE compression. The example below illustrates MsgPack serialization and subsequent compression using the zlib library:
void list_pack_gzip(List x, std::string file) {
std::stringstream buffer;
msgpack::packer<std::stringstream> pk(&buffer);
pk.pack_array(x.size());
for(int i=0; i<x.size(); i++) {
CharacterVector xi = x[i];
if(xi.size() == 1) {
pk.pack(Rcpp::as<std::string>(xi[0]));
} else {
pk.pack_array(xi.size());
for(int j=0; j<xi.size(); j++) {
pk.pack(Rcpp::as<std::string>(xi[j]));
}
}
}
std::string bufstr = buffer.str();
gzFile fi = gzopen(file.c_str(),"wb");
gzwrite(fi, bufstr.c_str(), bufstr.size());
gzclose(fi);
}
Conversely, uncompression and deserialization would need to be performed to recover the original R object. The following C++ function illustrates how this could be done:
Rcpp::List list_unpack_gzip(std::string file) {
gzFile fi = gzopen(file.c_str(),"rb");
char buf[8192];
std::vector<char> dat;
uint b = gzread(fi, buf, 8192);
while (b) {
dat.insert(dat.end(), buf, buf + b);
b = gzread(fi, buf, 8192);
}
gzclose(fi);
std::string message = std::string(dat.data(), dat.size());
msgpack::object_handle oh;
msgpack::unpack(oh, message.data(), message.size());
msgpack::object obj = oh.get();
std::vector<msgpack::object> obj_vector;
obj.convert(obj_vector);
Rcpp::List L = Rcpp::List(obj_vector.size());
std::vector< std::string > temp_str_vec;
std::string temp_str;
for (int i=0; i< obj_vector.size(); i++) {
if (obj_vector[i].type == msgpack::type::ARRAY) {
obj_vector[i].convert(temp_str_vec);
L[i] = Rcpp::wrap(temp_str_vec);
} else {
obj_vector[i].convert(temp_str);
L[i] = Rcpp::wrap(temp_str);
}
}
return(L);
}
To illustrate the potential benefit of this approach, we compared
MsgPack serialization and zlib compression to standard
R serialization using the saveRDS function (Figure 4).
Using MsgPack, serialization is slightly faster for the example
above. The timing of these two approaches can be broken down by process:
both serialization and compression to disk (and vice versa) are faster
using the MsgPack approach described above for certain types of
data structures, such as lists. However, for most other types of data
structures, native R serialization is usually faster. Most of the
time-saving comes from the fact that the serialized MsgPack
message is considerably smaller, leading to faster compresion time
and/or faster reading and writing to disk.
The examples given above show how working with the R interface
functions the C++ headers in RcppMsgPack can be useful in
transferring data and integrated within data analysis workflows. The
package allows fast and flexible serialization of generic data
structures in R, and equivalent de-serialization of objects from other
langauages. The RcppMsgPack package is hosted on CRAN, and can
be installed via the standard
install.packages("RcppMsgPack") command.