simDic {smdc} | R Documentation |
Calculates the similarity between two sets of documents, using a dictionary that maps features to numeric scores.
simDic(docMatrix1, docMatrix2, scoreDict, breaks = seq(-1, 1, length = 11), norm = FALSE, method = "cosine", scoreFunc = mean)
docMatrix1 |
Document matrix whose columns are the feature vectors of individual documents. It must satisfy the following: rownames(docMatrix1) are feature names, colnames(docMatrix1) are document names, and every element is numeric. |
docMatrix2 |
Document matrix whose columns are the feature vectors of individual documents. It must satisfy the following: rownames(docMatrix2) are feature names, colnames(docMatrix2) are document names, and every element is numeric. |
scoreDict |
Dictionary matrix that maps features to numbers. It must be a k * 2 matrix: the 1st column holds the features and the 2nd column holds the corresponding numbers. Similarity is calculated from these numbers. |
breaks |
Vector of break points for the frequency distribution. Its elements must be in ascending order. |
norm |
Whether to normalize the similarity matrix. |
method |
Method used to calculate similarity. It is passed to simil from the proxy package. |
scoreFunc |
Function that combines the dictionary numbers of a feature into a single score. It is called with na.rm = TRUE. |
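The following is a minimal sketch of how the inputs might look and how the dictionary lookup and binning steps could work, using base R only. All feature names, document names, and scores are made up for illustration, and the dictionary is stored as a data frame (an assumption, chosen so that the score column stays numeric).

```r
# Toy document matrix: rows are features, columns are documents (made-up data).
docMatrix1 <- matrix(c(2, 0, 1,
                       0, 3, 1),
                     nrow = 3,
                     dimnames = list(c("bad", "good", "great"),
                                     c("docA", "docB")))

# Dictionary: column 1 holds features, column 2 the corresponding numbers.
# "good" appears twice, so its numbers are combined by the scoring function.
scoreDict <- data.frame(feature = c("good", "good", "bad", "great"),
                        score   = c(0.5, 0.8, -0.75, 0.95),
                        stringsAsFactors = FALSE)

# Break points in ascending order (the default of simDic).
breaks <- seq(-1, 1, length.out = 11)

# One score per feature: combine all matching dictionary rows with mean();
# features missing from the dictionary get NA.
words <- sort(rownames(docMatrix1))
wordScores <- vapply(words, function(w) {
  value <- scoreDict[scoreDict[, 1] == w, 2]
  if (length(value) == 0) NA_real_ else mean(value, na.rm = TRUE)
}, numeric(1))

# Assign each feature to a score class defined by the break points.
wordClass <- cut(wordScores, breaks)
names(wordClass) <- words
```

Here `wordClass` is a factor with one level per interval between consecutive break points, so features with similar scores end up in the same class.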
Similarity matrix whose rows correspond to the documents of docMatrix1 and whose columns correspond to the documents of docMatrix2. It is an n * m matrix where n = ncol(docMatrix1) and m = ncol(docMatrix2), and it satisfies the following: rownames(returnValue) = colnames(docMatrix1), colnames(returnValue) = colnames(docMatrix2).
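The shape of the return value can be illustrated with a small base R sketch. It assumes two hypothetical class-frequency matrices (rows are score classes, columns are documents) and computes cosine similarity directly; `cosineSim` is a helper written for this illustration and is not part of smdc, which delegates the computation to the proxy package.

```r
# Hypothetical class-frequency matrices: rows are score classes,
# columns are documents. The values are made up for illustration.
docFreq1 <- matrix(c(2, 0, 1,
                     0, 3, 1),
                   nrow = 3,
                   dimnames = list(c("neg", "neu", "pos"),
                                   c("docA", "docB")))
docFreq2 <- matrix(c(1, 0, 0,
                     0, 1, 1,
                     2, 0, 1),
                   nrow = 3,
                   dimnames = list(c("neg", "neu", "pos"),
                                   c("docX", "docY", "docZ")))

# Cosine similarity between every column of a and every column of b.
cosineSim <- function(a, b) {
  normA <- sqrt(colSums(a^2))
  normB <- sqrt(colSums(b^2))
  crossprod(a, b) / outer(normA, normB)
}

# Rows follow the documents of docFreq1, columns those of docFreq2.
sim <- cosineSim(docFreq1, docFreq2)
dim(sim)
dimnames(sim)
```

The result has one row per document in the first matrix and one column per document in the second, which matches the n * m layout described above.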
Masaaki TAKADA
## The function is currently defined as
function (docMatrix1, docMatrix2, scoreDict, breaks = seq(-1, 1, length = 11),
    norm = FALSE, method = "cosine", scoreFunc = mean)
{
    library("proxy")
    words <- unique(rbind(matrix(rownames(docMatrix1)), matrix(rownames(docMatrix2))))
    words <- words[order(words)]
    wordScores <- rep(NA, length(words))
    for (i in 1:length(words)) {
        cond <- (scoreDict[, 1] == words[i])
        value <- scoreDict[cond, 2]
        if (length(value) != 0) {
            wordScores[i] <- scoreFunc(value, na.rm = TRUE)
        }
    }
    names(breaks) <- cut(breaks, breaks)
    wordClass <- cut(wordScores, breaks)
    names(wordClass) <- words
    docFreq1 <- conv2Freq(docMatrix1, wordClass, breaks)
    docFreq2 <- conv2Freq(docMatrix2, wordClass, breaks)
    colnames(docFreq1) <- paste("r_", colnames(docMatrix1), sep = "")
    colnames(docFreq2) <- paste("c_", colnames(docMatrix2), sep = "")
    sim <- as.matrix(simil(t(cbind(docFreq1, docFreq2)), method = method))[colnames(docFreq1),
        colnames(docFreq2)]
    rownames(sim) <- colnames(docMatrix1)
    colnames(sim) <- colnames(docMatrix2)
    if (norm) {
        sim <- normalize(sim)
    }
    return(sim)
}