stEM {FlexRL} | R Documentation |
Stochastic Expectation Maximisation (StEM) for Record Linkage
Description
Stochastic Expectation Maximisation (StEM) for Record Linkage
Usage
stEM(
data,
StEMIter,
StEMBurnin,
GibbsIter,
GibbsBurnin,
musicOn = TRUE,
newDirectory = NULL,
saveInfoIter = FALSE
)
Arguments
data |
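A list with the elements shown in the Examples: A, B, Nvalues, PIVs_config, controlOnMistakes, sameMistakes, phiMistakesAFixed, phiMistakesBFixed, phiForMistakesA, phiForMistakesB |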
StEMIter |
An integer with the total number of iterations of the Stochastic EM algorithm (including the period to discard as burn-in) |
StEMBurnin |
An integer with the number of Stochastic EM iterations to discard as burn-in |
GibbsIter |
An integer with the total number of iterations of the Gibbs sampler run within each iteration of the StEM (including the period to discard as burn-in) |
GibbsBurnin |
An integer with the number of Gibbs sampler iterations to discard as burn-in |
musicOn |
A boolean value; if TRUE, music is played when the algorithm finishes, which is useful as an alarm if you have to wait for a long record linkage run to complete |
newDirectory |
NULL, or a string with the name of (or path to) an existing directory in which to save the environment variables at the end of each iteration (useful when record linkage takes very long, so that you do not lose everything and have to restart from scratch if your computer shuts down before record linkage is finished) |
saveInfoIter |
A boolean value indicating whether the environment variables should be saved at the end of each iteration (useful when record linkage takes very long, so that you do not lose everything and have to restart from scratch if your computer shuts down before record linkage is finished) |
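Since newDirectory has to exist before stEM is called, and each burn-in is counted within the corresponding total number of iterations, it can help to prepare both beforehand. The following is a minimal sketch and not part of FlexRL: the directory name and the helper check_iterations are hypothetical, and requiring the burn-in to be strictly smaller than the total is an assumption derived from the argument descriptions.

```r
# Hypothetical checkpoint directory (any existing, writable path would do).
checkpointDir <- file.path(tempdir(), "stEM_checkpoints")
if (!dir.exists(checkpointDir)) {
  dir.create(checkpointDir, recursive = TRUE)
}

# Hypothetical helper, not part of FlexRL: the burn-in is counted within the
# total number of iterations, so it is assumed to be strictly smaller.
check_iterations <- function(total, burnin) {
  is.numeric(total) && is.numeric(burnin) &&
    length(total) == 1 && length(burnin) == 1 &&
    burnin >= 0 && burnin < total
}

stemOK  <- check_iterations(total = 50, burnin = 30)  # StEMIter, StEMBurnin
gibbsOK <- check_iterations(total = 50, burnin = 30)  # GibbsIter, GibbsBurnin
badOK   <- check_iterations(total = 30, burnin = 50)  # burn-in exceeds total
```

checkpointDir could then be passed as newDirectory, together with saveInfoIter = TRUE.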
Value
A list with:
Delta, the summary of a sparse matrix, i.e. a data frame with 3 columns: the indices from the first data source A, the indices from the second data source B, and the non-zero posterior probability that the records associated with this pair of indices are linked. Select the pairs where this probability exceeds 0.5 to get a valid set of linked records; this threshold on the linkage probability is necessary to enforce the one-to-one assignment constraint of record linkage, which states that a record in one file can be linked to at most one record in the other file.
gamma, a vector with the chain of the parameter gamma representing the proportion of linked records as a fraction of the smaller file,
eta, a vector with the chain of the parameter eta representing the distribution of the PIVs,
alpha, a vector with the chain of the parameter alpha representing the hazard coefficient of the model for instability,
phi, a vector with the chain of the parameter phi representing the registration error parameters.
More details on the method are available in our paper, in the experiments repository of the paper, and in the vignettes.
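The selection of linked pairs from Delta described above can be sketched as follows. The data frame below is a toy stand-in for the returned Delta, with made-up values; its column names are assumptions, so the code selects columns by position only. Under the one-to-one constraint the linkage probabilities of one record's candidate pairs should sum to at most 1, so at most one of them can exceed 0.5.

```r
# Toy stand-in for fit$Delta; the column names are assumptions, so the code
# below selects columns by position only.
Delta <- data.frame(
  idxA  = c(1, 1, 2, 3, 4),
  idxB  = c(10, 11, 12, 13, 13),
  proba = c(0.70, 0.25, 0.95, 0.40, 0.55)
)

# Keep the pairs whose posterior linkage probability exceeds 0.5.
linkedPairs <- Delta[Delta[, 3] > 0.5, 1:2]

# Each record should appear at most once on either side.
oneToOne <- !any(duplicated(linkedPairs[, 1])) &&
  !any(duplicated(linkedPairs[, 2]))
```

With a real fit, replacing the toy Delta by the Delta element of the returned list should work the same way, provided the third column holds the probabilities as described.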
Examples
PIVs_config = list( V1 = list(stable = TRUE),
                    V2 = list(stable = TRUE),
                    V3 = list(stable = TRUE),
                    V4 = list(stable = TRUE),
                    V5 = list( stable = FALSE,
                               conditionalHazard = FALSE,
                               pSameH.cov.A = c(),
                               pSameH.cov.B = c()) )
PIVs = names(PIVs_config)
PIVs_stable = sapply(PIVs_config, function(x) x$stable)
Nval = c(6, 7, 8, 9, 15)
NRecords = c(500, 800)
Nlinks = 300
PmistakesA = c(0.02, 0.02, 0.02, 0.02, 0.02)
PmistakesB = c(0.02, 0.02, 0.02, 0.02, 0.02)
PmissingA = c(0.007, 0.007, 0.007, 0.007, 0.007)
PmissingB = c(0.007, 0.007, 0.007, 0.007, 0.007)
moving_params = list(V1=c(),V2=c(),V3=c(),V4=c(),V5=c(0.28))
enforceEstimability = TRUE
DATA = DataCreation( PIVs_config,
                     Nval,
                     NRecords,
                     Nlinks,
                     PmistakesA,
                     PmistakesB,
                     PmissingA,
                     PmissingB,
                     moving_params,
                     enforceEstimability)
A = DATA$A
B = DATA$B
Nvalues = DATA$Nvalues
encodedA = A
encodedB = B
encodedA[,PIVs][ is.na(encodedA[,PIVs]) ] = 0
encodedB[,PIVs][ is.na(encodedB[,PIVs]) ] = 0
data = list( A = encodedA,
             B = encodedB,
             Nvalues = Nvalues,
             PIVs_config = PIVs_config,
             controlOnMistakes = c(TRUE,TRUE,FALSE,FALSE,FALSE),
             sameMistakes = TRUE,
             phiMistakesAFixed = FALSE,
             phiMistakesBFixed = FALSE,
             phiForMistakesA = c(NA,NA,NA,NA,NA),
             phiForMistakesB = c(NA,NA,NA,NA,NA))
fit = stEM( data = data,
            StEMIter = 50,
            StEMBurnin = 30,
            GibbsIter = 50,
            GibbsBurnin = 30,
            musicOn = TRUE,
            newDirectory = NULL,
            saveInfoIter = FALSE )
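After a run such as the one above, the chain of gamma can be summarised by a point estimate. The sketch below uses a simulated chain with hypothetical values in place of the gamma element of the result, so that it runs without FlexRL. It assumes the chain holds one value per StEM iteration with the burn-in included; if the returned chain already excludes the burn-in, the whole chain should be averaged instead.

```r
# Simulated stand-in for fit$gamma (hypothetical values, for illustration).
set.seed(1)
StEMIter   <- 50
StEMBurnin <- 30
gammaChain <- c(seq(0.2, 0.6, length.out = StEMBurnin),
                0.6 + rnorm(StEMIter - StEMBurnin, sd = 0.01))

# Assumption: the chain holds one value per StEM iteration, burn-in included.
keep <- seq(StEMBurnin + 1, StEMIter)

# Point estimate: average of the iterations kept after the burn-in.
gammaHat <- mean(gammaChain[keep])
```

Plotting gammaChain against the iteration index is a simple way to judge whether the chosen burn-in is long enough.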