Abstract

The goal of record linkage and deduplication is to detect which records belong to the same object in data sets where the identifiers of the objects contain errors and missing values. The main design considerations of reclin2 are: modularity/flexibility, speed and the ability to handle large data sets. The first points makes it easy for users to extend the package with custom process steps. This flexibility is obtained by using simple data structures and by following as close as possible common interfaces in R. For large problems it is possible to distribute the work over multiple worker nodes. A benchmark comparison to other record linkage packages for R, shows that for this specific benchmark, the fastLink package performs best. However, this package only performs one specific type of record linkage model. The performance of reclin2 is not far behind the of fastLink while allowing for much greater flexibility.

Introduction

Combining different data sets is often an important step in many data analysis projects. Sometimes the data sets will contain high quality linkage keys, especially when the data sets are based on a common register. For example, the samples for (nearly) all social surveys performed at Statistics Netherlands are drawn from the population register and therefore can be linked to each other (Bakker, Rooijen, and Toor 2014). In these cases exact linkage can be used. In exact linkage, records are linked when they agree exactly on the linkage keys used. Exact linkage can be performed in R using base functions such as merge. However, it is not uncommon that data sets have to be linked on keys such as ‘first name’, ‘last name’ and ‘address’. Often these variables contain errors and/or missing values and, therefore, exact linkage is not possible. That is where probabilistic record linkage methods come into play (Herzog, Scheuren, and Winkler 2007; Christen 2012). These methods will calculate some sort of likelihood that two records belong to the same object (person, company, …). This will be called a match. Only record pairs with a high enough likelihood are linked to each other. The goal is to minimise the number of false links (linking two records that do not belong to the same object) and the number of missed links (not linking two records that do belong to the same object).

The process of probabilistic record linkage generally consists of the following steps: (1) Generate pairs of records from each of the two data sets that are to be linked; (2) Compare the two records of the pair and generate a comparison vector (in the simplest case this a vector of ones and zeros coding agreement/disagreement on each of linkage keys); (3) Estimate a model that predicts based on the comparison vector a likelihood that the two records belong to the same object; (4) Select pairs with a high enough likelihood; (5) Using the selected pairs, generate the final linked dataset. reclin2 offers different methods for most of these steps and by mixing the different methods a custom linkage process can be developed. This is discussed in more detail with examples in the section on the record linkage process.

A variant of record linkage is deduplication. Here there is only one data set and one wants to determine which records belong to the same object. For example, a customer database can contain the same customer multiple times with slightly different information (e.g. different email addresses). Deduplication is usually performed by linking a dataset to itself. Matches are then duplicate records. The principles are, therefore, the same as with regular record linkage and in the remainder of paper we will focus on regular record linkage of two data sets.

Record linkage can be computationally and memory intensive. In principle each record from a data set has to be compared to each record in the other data set. Therefore, when the two data sets are of size \(N_1\) and \(N_2\) respectively the computational complexity and memory requirements are of order \(O(N_1 N_2)\). For example, at Statistics Netherlands one common data set is the population register containing in the order of \(10^7\) records; other data sets can be in the order of \(10^3\)–\(10^5\), resulting in \(10^{10}\)-\(10^{12}\) possible comparisons.

reclin2 is a package that provides a set of tools to perform probabilistic record linkage. It is the successor of the reclin package. The reason for the update was to be able to provide better support for the core design considerations of the reclin/reclin2 package. Unfortunately this was not possible while keeping backward compatibility, therefore it was decided to continue with a new package. The core design considerations are:

Modularity/flexibility.
Speed.
Ability to handle large datasets.

The last two points are important because of the aforementioned issues with the size of the problem. The first point is important, as in practice no record linkage project is the same and, therefore, a common need is to vary on the default procedure. The next section will discuss how we tried to address the points above. Besides reclin2 other packages exist for probabilistic record linkage. There is the RecordLinkage (Sariyar and Borg 2022) package that implements various methods such as classic probabilistic record linkage based on the Fellegi and Sunter (1969) model and methods based on machine learning. Furthermore, there is the fastLink (Enamorado, Fifield, and Imai 2020; Enamorado, Fifield, and Imai 2019) package that focuses on a fast and flexible implementation of the Fellegi-Sunter model. The main difference of reclin2 with these packages is the focus on the previous three points: fastLink scores well on points 2 and 3, but only supports one type of model while RecordLinkage scores well on point 3 and better than fastLink on point 1, but lacks some flexibility and speed. Points 2 and 3 are investigated in a later section using a benchmark.

Design considerations

One of the main considerations when designing the package was flexibility. Therefore, the package has been designed as a set of functions that operate on data.table objects (Dowle and Srinivasan 2021). The main object of the package is the pairs object which is a subclass of data.table. The pairs object contains pairs of records from the two datasets that are to be linked (called x and y). The first two columns of the pairs object contain the indices of the corresponding records from the two data sets. Most functions of the package accept a pairs object and return a pairs object. The package has functions for different steps in the linkage process (as described in the next section). By combining the different available functions a custom data linkage process can be built. Furthermore, as the pairs object is a data.table it is also easy for the user to manipulate it. For example, new columns can be derived and pairs can be filtered. Functions that do not manipulate the pairs object are designed to follow as closely as possible the common interfaces of base R functions. For example, the function problink_em that can be used to estimate the parameters of the Fellegi-Sunter model accepts similar input as other modelling functions in R: e.g. a formula to specify the model and a data argument to pass in the data on which to estimate the model. The corresponding predict function can be used to calculate likelihoods for pairs being a match. It is therefore also easy to use models from other packages, such as machine learning methods, to estimate the likelihoods. Where a package such as RecordLinkage has functions for a number of machine learning methods, reclin2 does not need these as the user is free to call these themselves as demonstrated in the section on the record linkage process below.

The other two design considerations, speed and being able to handle large datasets, are obtained in two ways. First, by using a data.table as the main object. Most methods have an inplace argument (default value is FALSE). When set to TRUE the pairs object is modified using the [, :=] operation of a data.table. This prevents unnecessary copies, decreasing memory consumption and increasing speed. Second, there is the option to create a cluster and distribute the computational load over multiple cores. Using functionality from the parallel or snow (Tierney et al. 2021) packages, multiple R processes are started and the data is distributed over these processes. Each process then generates a subset of the pairs which are kept within the process. Subsequent operations on the pairs, such as comparison, are also distributed over the processes where each process applies the operation to its subset of pairs. One of the more computationally intensive operations during linkage is comparing the records from the two data sets to each other. This problem scales well when parallelising. Therefore parallelization can lead to a significant speed up. Furthermore, when using a snow cluster the computation can also be distributed over multiple machines. This can not only lead to a speed up, but also means that the memory of multiple machines can be utilized allowing for larger problems than could be handled on a single machine. To work with a cluster, special functions with the cluster_ prefix are offered. The cluster functions generating the pairs expect as on of their inputs a valid cluster created for example using makeCluster from the parallel package. When using the cluster variant, the object is no longer a data.table and it becomes more difficult to manually manipulate the object. The package has a few functions to help with this which will be discussed at the end of the next section.

The record linkage process

This section section will give an overview of the linkage process and show how the functions in reclin2 can be used for this. The discussion will be brief. An overview of the main steps of a record linkage process has already been given in the introduction section of the paper. A more extensive description can be found in the package vignettes and documentation. Also, we will not go into detail into the methods used as these are well described in, for example, Herzog, Scheuren, and Winkler (2007) and Christen (2012).

Generating pairs

The first step in the linkage process is to generate pairs of records from the two data sets x and y. There are a number of functions for this: the function pair generates all possible pairs. However, this can lead to impractically large numbers of pairs. Therefore, often methods are applied to reduce the total number of pairs. One commonly used method is blocking where only pairs are generated that agree on some key. This, of course, only works when a good enough quality key is available, otherwise true matches are lost. Another method in the package, pairs_simsum, is to generate pairs that agree on a given number of variables (e.g. they have to agree on either the postcode or the town name) (Christen 2012). In the example below we use blocking on ‘postcode’, e.g. pairs are only generated when they agree exactly on ‘postcode’ ([...] in the examples indicate removed output).

> library(reclin2)
[...]
> data("linkexample1", "linkexample2")
> (pairs <- pair_blocking(linkexample1, linkexample2, "postcode"))
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

    .x .y
 1:  1  1
 2:  1  2
 3:  1  3
 4:  2  1
 5:  2  2
 6:  2  3
 7:  3  1
[...]

A data.table is returned with the added class pairs. The columns .x and .y contain the row indices into the two data sets. A copy (when the original data sets are not modified this is only a reference) of the two data sets is stored in the attributes x and y. This makes some of the next function calls easier.

There also exist cluster variants of these functions that return a cluster_pairs object:

> library(parallel)
> cl <- makeCluster(2)
> cpairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")

When calling the cluster variants of the pair generating algorithms, the records from x are randomly distributed over the nodes of the cluster and y is copied to each cluster node. On each node the corresponding pair function is called. The resulting pair object is stored on each node in an environment in the environment reclin2:::reclin_env (the default name of this environment is "default"). The cluster_pairs object is a list with a copy of the cluster object and the name of the environment on the cluster nodes in which the pairs are stored.

Comparing pairs

The next step in the linkage process is to compare the pair of records on a set of common variables in both data sets. For this the package contains various comparison functions. The default function checks for exact agreement. However, for text fields such as names and addresses, it often better to allow for spelling errors. For this some of the functions from the stringdist (van der Loo 2014) package are imported. For classic record linkage using the Fellegi-Sunter model is necessary that these are translated into a similarity score between 0 and 1 where 1 is complete agreement which is what the functions included in reclin2 do. In the example below, we provide a comparison function for ‘firstname’, ‘lastname’ and ‘address’:

> (compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+   comparators = list(lastname = jaro_winkler(0.9), firstname = jaro_winkler(0.9),
+      address = jaro_winkler(0.9) ), inplace = TRUE))
[...]
    .x .y lastname firstname   address   sex
 1:  1  1 1.000000 0.4722222 0.9230769    NA
 2:  1  2 0.000000 0.5833333 0.8641026  TRUE
 3:  1  3 0.447619 0.4642857 0.9333333  TRUE
[...]

The Jaro-Winkler string similarity score is used: a value of one indicates complete agreement, a value of zero indicated complete disagreement (no overlap in letters) and values in between indicate partial agreement. The 0.9 in the function call is a threshold used, among others, by the EM-algorithm discussed below as this method only handles complete agreement or disagreement: values above 0.9 are considered to agree completely. We see that the first record from x agrees exactly only on ‘lastname’ with the first record of y, while ‘sex’ cannot be compared as it is missing in at least one of the data sets.

The compare_pairs method is also implemented for the cluster_pairs object. For more flexibility there is also the compare_vars method. This function only compares one variable at the time, but it allows for different names of the variables in the two data sets, generating multiple output columns out of one comparison and for more complex comparisons where multiple variables are taken into account. As an example of the latter, the code below compares records on first name and last name allowing for the two parts of a name to be swapped:

> comp_name <- function(x, y) {
+   equal <- identical()
+   regular <- equal(x[[1]], y[[1]]) & equal(x[[2]], y[[2]])
+   swapped <- equal(x[[1]], y[[2]]) & equal(x[[2]], y[[1]])
+   regular | swapped
+ }
> compare_vars(pairs, "name_swap", on_x = c("firstname", "lastname"),
+   comparator = comp_name)
[...]
    .x .y lastname firstname   address   sex name_swap
 1:  1  1 1.000000 0.4722222 0.9230769    NA     FALSE
 2:  1  2 0.000000 0.5833333 0.8641026  TRUE     FALSE
 3:  1  3 0.447619 0.4642857 0.9333333  TRUE     FALSE
[...]

When records are compared on multiple columns, the comparison function receives two data.table objects as its inputs.

Scoring pairs

The goal of probabilistic record linkage is to generate a likelihood for each pair that the two records in the pair belong to the same record. This likelihood is based on the comparison vector. The traditional method is the model by Fellegi and Sunter (1969). The parameters of this model are usually estimated using a EM-algorithm (Winkler 2000). However, reclin2 considers this just a model as any other model and uses the same interface as any other model that can be estimated in R:

> m <- problink_em(~ lastname + firstname + address + sex, data = pairs)
> (pairs <- predict(m, pairs = pairs, add = TRUE))
[...]
    .x .y lastname firstname   address   sex    weights
 1:  1  1 1.000000 0.4722222 0.9230769    NA  7.7103862
 2:  1  2 0.000000 0.5833333 0.8641026  TRUE -5.9463949
 3:  1  3 0.447619 0.4642857 0.9333333  TRUE  0.8042090
[...]

The range of the weights depends on the number of variables and the estimated parameters in the model. They are log-likelihood ratios (Fellegi and Sunter 1969). The values of the weights themselves are not directly of use, except that a higher weight indicates that a pairs if more likely a match. In principle, a weight above zero indicates that the pair is more likely a match than not. However, in practice, a threshold higher than zero is often used in order to reduce the likelihood of false links. The predict function of the EM-model also has the option to estimate posterior probabilities. Thresholds for the weights (or probabilities) are often determined by manually inspecting pairs around potential threshold values (Herzog, Scheuren, and Winkler 2007). These methods can also be used for cluster_pairs objects.

As the pairs object is a regular data.table object, it is also relatively easy to estimate other models on the data set. In principle this is a classification problem: the pairs need to be divided into two categories: matches and non-matches. For example, when for a part of the pairs the true match status is known, a supervised learning method can be used. In the example below the ‘id’ field is used to derive the true match status for the dataset (in practice this would probably only be available for a subset) and predict a linkage probability using logistic regression:

> compare_vars(pairs, "true", on_x = "id", on_y = "id", inplace = TRUE)
> mglm <- glm(true ~ lastname + firstname, data = pairs,
+   family = binomial())
> pairs[, pglm := predict(mglm, type = "response")]

In the first line of this example, a column named ‘true’ is added to the dataset (in place). This column is a comparison of the ‘id’ from the first data set (on_x = "id") to the ‘id’ column of the second data set (on_y = "id"). This column ‘true’ contains the true match status. In the second line, a logistic regression model is estimated that predicts the match status using the two columns ‘firstname’ and ‘lastname’. Using this model the probability of a true match is estimated in the final line of the example and added to the data set.

Creating the linked data set

In order to link the pairs a suitable threshold needs to be determined for the weights. Records with a weight above this threshold are classified as a match. Also, we generally know that each person only has one record in each data set. So, generally we will want to enforce one-to-one linkage. This will also generally improve the quality of the linked data set. reclin2 has two methods for enforcing one-to-one linkage. The method select_n_to_m tries to select the pairs in such a way that the total weight of the selected pairs is maximised while linking each record from each data set to at most one record from the other data set (using its arguments it is also possible to enforce n-to-one or one-to-n linkage). A faster method that can lead to less links is select_greedy that will try to select the pair with the highest weight for each record. Below the first method is applied; records with a weight below 0 are not considered (threshold = 0):

> (pairs <- select_n_to_m(pairs, "weights", variable = "select", threshold = 0))
[...]
    .x .y lastname firstname   address   sex    weights select
 1:  1  1 1.000000 0.4722222 0.9230769    NA  7.7103862  FALSE
 2:  1  2 0.000000 0.5833333 0.8641026  TRUE -5.9463949  FALSE
 3:  1  3 0.447619 0.4642857 0.9333333  TRUE  0.8042090  FALSE
 4:  2  1 1.000000 0.8888889 0.9230769    NA  8.6064218   TRUE
[...]

The method creates a logical column (name given by the ‘variable’ argument) in the pairs object with the selected pairs. The first pair has a high enough weight to be selected, but there is another candidate for the first record of y that is more likely, namely record 2 from x (see fourth row of the output above).

Up until now we are still working with a set of pairs. The goal is to get an actually linked dataset containing matched records from both data sets. This can be done using the link method. This function takes the pairs and the name of a logical column indicating which pairs are selected and it will generate the final linked data set. The output is similar to that of merge. The method also has arguments all_x and all_y that function the same as the corresponding all.x and all.y arguments of merge.

> (linked_data_set <- link(pairs, selection = "select"))
  Total number of pairs: 4 pairs

   .y .x id.x lastname.x firstname.x  address.x sex.x postcode.x id.y
1:  1  2    2      Smith      George 12 Mainstr     M    1234 AB    2
2:  2  3    3    Johnson        Anna 61 Mainstr     F    1234 AB    3
3:  3  4    4    Johnson     Charles 61 Mainstr     M    1234 AB    4
4:  4  6    6   Schwartz         Ben  1 Eaststr     M    6789 XY    6
   lastname.y firstname.y     address.y sex.y postcode.y
1:      Smith      Gearge 12 Mainstreet  <NA>    1234 AB
2:     Jonson          A. 61 Mainstreet     F    1234 AB
3:    Johnson     Charles    61 Mainstr     F    1234 AB
4:   Schwartz         Ben        1 Main     M    6789 XY

For the cluster_pairs the steps above need to change a little bit as select_n_to_m needs to consider all pairs and, therefore, does not work with objects of type cluster_pair where the pairs are distributed over the cluster nodes. Therefore, we first need to copy the relevant pairs to the main R process. We can use a selection variable for this only returning the pairs with a weight above zero:

> cpairs <- predict(m, pairs = cpairs, add = TRUE)
> select_threshold(cpairs, "weights", variable = "initial", threshold = 0)
> local_cpairs <- cluster_collect(cpairs, "initial")
> local_cpairs <- select_n_to_m(local_cpairs, "weights", variable = "select")

The first line calculates weights for the cpairs object. In the second line a logical column ‘initial’ is created which is TRUE for records with a weight higher than 0. Using cluster_collect, we collect the pairs from the worker processes into the main R process. Using the second argument we only collect pairs for which the column ‘initial’ is TRUE. The local_cpairs object is a regular pairs object (and, therefore, also a data.table) on which we can use the regular select_n_to_m method.

Helper functions for `cluster_pair` objects

It is easy to do manual manipulations on the regular pairs object. For the cluster_pairs the pairs are distributed over the worker nodes. There are a couple of functions to help with this. The already mentioned cluster_collect function copies the pairs locally. The cluster_call function accepts the cluster_pairs and a function. It will call the function on each node and pass it the pairs, x and y. The results of the functions are copied back locally. For example to get the number of pairs on each node:

> unlist(cluster_call(cpairs, \(p, ...) nrow(p)))
[1] 9 8

The cluster_modify_pairs function can be used to modify the pairs. The arguments are the cluster_pairs and a function with the same arguments as for cluster_call. The result of that function overwrites the pairs object on the worker node (except when NULL). In the example below, this is used to remove pairs with a weight of zero or lower.

> (cluster_modify_pairs(cpairs, \(p, ...) p[weights > 0, ]))
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 15 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
    .x .y lastname firstname   address   sex    weights initial
 1:  3  1 0.447619 0.4722222 0.8641026    NA  0.6017106    TRUE
 2:  4  3 1.000000 1.0000000 1.0000000 FALSE 15.4915816    TRUE
[...]

Note, that the original cpairs object has been modified. Using the new_name argument it is also possible to generate a new set of pairs.

Benchmark

In this section the performance of the packages for data linkage will be investigated using example data from the Eurostat financed ESSnet (European Statistical System Centres and Networks of Excellence) project on Data Integration (Eurostat 2011). The two data sets ‘PDR’ and ‘CIS’ were linked to each other. The datasets have 24,750 and 24,613 records respectively resulting in 612,562,500 possible pairs when linking the complete data sets. To study the effect of the size of the problem on the performance, samples were drawn from the two data sets based on the postcode where care was taken to sample the same postcodes in the two datasets. The sample fraction was varied from 0.1 to 1.0 in steps of 0.1.

Figure 1: Comparison of computation times (in minutes) for the different packages (lines) as function of the number of potential pairs (the product of the sizes of the two data sets). For reclin2 also different numbers of worker nodes were investigated; these are denoted by ‘cluster’ and the number of worker nodes.

Figure 2: Comparison of memory usage (in gigabytes) for the different packages (lines) as function of the number of potential pairs (the product of the sizes of the two data sets). For reclin2 no reliable estimates could be obtained for the runs with multiple workers. Therefore results are only presented for the single threaded version of reclin2. In principle the memory usage should not depend on the number of workers.

The methods are compared on total computation time and memory use. These were measured using the ‘time’ program on a Linux server. The virtual server has 16 3.2GHz Intel Xeon Gold 6146 cores and 512GB of memory and runs on Ubuntu 20.04. For the time the reported ‘elapsed (wall clock) time’ is used and for the memory usage the ‘maximum resident set size’. For reclin2 also the effect of different numbers of worker nodes was investigated using the cluster functions. It was attempted to keep the methods used a close as possible to each other. No blocking was applied. The EM-algorithm was used. For comparing the names the Jaro-Winkler similarity score was used with a threshold of 0.85. The quality of the resulting record linkage was recorded, but as the methods were not optimally tuned, it is difficult to compare these results and these results are, therefore, not reported. The complete code for the benchmark and all results can be found on Github (Laan 2022).

Figures 1 and 2 show the computation times and the memory usage respectively as a function of the size of the problem. For the RecordLinkage package larger problem sizes than reported here were not investigated as the system started running out of memory. RecordLinkage has the option to also work from disk for large problems. The performance of this was not investigated as this would only lead to longer computation times which were already longer than those of the other packages. For this specific problem the fastLink package performs better than the other packages. Especially the memory usage is substantially lower because of their use of specialised data structures (Enamorado, Fifield, and Imai 2019). The difference in computation time between fastLink and reclin2 using multiple cores is limited (approx. factor 1.7 for the largest problem). However, fastLink is tailored for use with the EM-algorithm while the other packages are more general.

The speed-up of the reclin2 benchmark is not proportional to the number of cores used. This is caused mainly by the fact that some of the steps take place in the main process: reading and sampling the data, starting the worker nodes, and importantly, finalizing the record linkage using one-to-one matching. We are unfortunately limited by Amdal’s law (Amdahl 1967) although the generation, comparison and (in case of the EM-algorithm) tabulation of pair and calculating the predictions will take an increasingly larger part of the running time as the size of problem increases.

Conclusion

By using simple data structures, namely data.table objects, and providing a set of functions that operate on these structures, we have built a flexible and well performing toolkit for record linkage. For users, it is easy to extend on the methods present in the package. Either by manipulating the data structures directly, by writing custom functions or by using existing functions. An example of the latter, is the relative ease with which existing machine learning methods can be used in the record linkage process.

The package manages to keep memory use limited and on machines with 256GB of memory it should handle problems up to approximately \(10^9\) pairs. By combining the memory of multiple machines this can of course be extended. When one is only interested in using the Fellegi-Sunter model of record linkage with an EM-algorithm to estimate the parameters of that model, the fastLink package is probably the best choice. It performs better and the model and EM-algorithm used is more flexible than that currently present in reclin2. The main advantage of reclin2 over fastLink is the flexibility reclin2 provides. Especially for non standard problems this is important.

The package is still being developed. One of the things that is being worked on is the option to take into account the uniqueness of a certain attribute. For example, agreement on a rare family name is a stronger indication of a match than agreement on a common family name. However, we hope that making it easy for users to extend and modify the record linkage processes also lowers the threshold for contributing to the package.

reclin2: a Toolkit for Record Linkage and Deduplication

D. Jan van der Laan

2022-10-11

Introduction

Design considerations

The record linkage process

Generating pairs

Comparing pairs

Scoring pairs

Creating the linked data set

Helper functions for `cluster_pair` objects

Benchmark

Conclusion

reclin2: a Toolkit for Record Linkage and Deduplication

D. Jan van der Laan

2022-10-11

Introduction

Design considerations

The record linkage process

Generating pairs

Comparing pairs

Scoring pairs

Creating the linked data set

Helper functions for cluster_pair objects

Benchmark

Conclusion

Helper functions for `cluster_pair` objects