How do I build a correlation matrix with a zero diagonal from two paired ID columns in R and SAS?

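Problem description:

I have a data frame that looks like the one below. I tried to reshape the two ID columns into a matrix in R, but R cannot produce the matrix. (My expected matrix is about 700 × 700.) R stops with the message "Reached total allocation of 12213Mb: see help(memory.size)".

I would like to do the same thing in SAS. How can this be done? Or do I need different code to get it done in R?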

ID_r ID_c SCORE 
A1 A2 0.2 
A1 A3 0.2 
A1 A4 0.3 
A1 A5 0.2 
A1 A6 0.2 
A2 A3 0.6 
A2 A4 0.2 
A2 A5 0.2 
A2 A6 0.2 
A3 A4 0.2 
A3 A5 0.2 
A3 A6 0.2 
A4 A5 0.2 
A4 A6 0.9 
A5 A6 0.2 

ID_r <- c('A1','A1','A1','A1','A1','A2','A2','A2','A2','A3','A3','A3','A4','A4','A5')
ID_c <- c('A2','A3','A4','A5','A6','A3','A4','A5','A6','A4','A5','A6','A5','A6','A6')
SCORE <- c(0.2,0.2,0.3,0.2,0.2,0.6,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.9,0.2)
# Assemble the three vectors into the data frame used below
df <- data.frame(ID_r, ID_c, SCORE)

library(dplyr); library(tidyr)
df$ID_r <- as.character(df$ID_r)
df$ID_c <- as.character(df$ID_c)
ID <- unique(c(df$ID_r, df$ID_c))
# Add the diagonal pairs (A1-A1, A2-A2, ...) with a score of zero
diagDf <- data.frame(ID_r = ID, ID_c = ID, SCORE = "0.0")
newDf <- rbind(df, diagDf) %>% arrange(ID_r, ID_c)

# Pivot to wide format; pairs that are not listed are filled with "."
resultDf <- spread(newDf, ID_r, SCORE, fill = ".")
names(resultDf)[1] <- ""
resultDf

The sample SAS data is as follows.

data score_data; 
    input ID_r $ ID_c $ SCORE; 
    datalines; 
    A1 A2 0.2 
    A1 A3 0.2 
    A1 A4 0.3 
    A1 A5 0.2 
    A1 A6 0.2 
    A2 A3 0.6 
    A2 A4 0.2 
    A2 A5 0.2 
    A2 A6 0.2 
    A3 A4 0.2 
    A3 A5 0.2 
    A3 A6 0.2 
    A4 A5 0.2 
    A4 A6 0.9 
    A5 A6 0.2 
; 
run; 

proc print data=score_data ; 
run; 

I want to use the two ID columns to generate the following matrix (with a zero diagonal).

    A1  A2  A3  A4  A5  A6
A1  0.0 0.2 0.2 0.3 0.2 0.2
A2  0.2 0.0 0.6 0.2 0.2 0.2
A3  0.2 0.6 0.0 0.2 0.2 0.2
A4  0.3 0.2 0.2 0.0 0.2 0.9
A5  0.2 0.2 0.2 0.2 0.0 0.2
A6  0.2 0.2 0.2 0.9 0.2 0.0

Thanks in advance!

R solution:

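library(plyr)  # provides join()

ID_r = c('A1','A1','A1','A1','A1','A2','A2','A2','A2','A3','A3','A3','A4','A4','A5')
ID_c = c('A2','A3','A4','A5','A6','A3','A4','A5','A6','A4','A5','A6','A5','A6','A6')
SCORE = c(0.2,0.2,0.3,0.2,0.2,0.6,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.9,0.2)

# Stack the pairs on top of their mirror images so the result is symmetric
df1 = data.frame(ID_r, ID_c, SCORE, stringsAsFactors = FALSE)
df2 = data.frame(ID_r = ID_c, ID_c = ID_r, SCORE, stringsAsFactors = FALSE)
df = rbind(df1, df2)
ID <- unique(c(ID_r, ID_c))

# Every (row, column) combination; pairs missing from df (the diagonal) get 0
grid = expand.grid(ID_r = ID, ID_c = ID, stringsAsFactors = FALSE)
d = join(grid, df, by = c("ID_r","ID_c"))
d$SCORE[is.na(d$SCORE)] <- 0

# Empty matrix with the IDs as row and column names
a = matrix(0, nrow = length(ID), ncol = length(ID))
rownames(a) <- ID
colnames(a) <- ID

# Index with a two-column character matrix of (row name, column name) pairs.
# Assigning d$SCORE directly keeps the matrix numeric; going through
# as.matrix(d) would turn every score into a string.
a[as.matrix(d[, c("ID_r","ID_c")])] <- d$SCORE
a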
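For a matrix of roughly 700 × 700, the join can probably be skipped altogether. The following is a minimal sketch, not part of the original answer, that fills a zero matrix directly by row and column name. The names `ids` and `m` are illustrative only, and the sketch assumes each ID pair appears at most once in the input.

```r
# Pair list from the question
ID_r  <- c('A1','A1','A1','A1','A1','A2','A2','A2','A2','A3','A3','A3','A4','A4','A5')
ID_c  <- c('A2','A3','A4','A5','A6','A3','A4','A5','A6','A4','A5','A6','A5','A6','A6')
SCORE <- c(0.2,0.2,0.3,0.2,0.2,0.6,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.9,0.2)

ids <- sort(unique(c(ID_r, ID_c)))

# Start from an all-zero matrix, so the diagonal is already 0
m <- matrix(0, nrow = length(ids), ncol = length(ids),
            dimnames = list(ids, ids))

# A two-column character matrix indexes cells by (row name, column name).
# Fill the listed pairs first, then their mirror images.
m[cbind(ID_r, ID_c)] <- SCORE
m[cbind(ID_c, ID_r)] <- SCORE

m
```

Because only the final matrix and the three input vectors are held in memory, this should stay far below the allocation limit mentioned in the question for 700 IDs.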

'join' requires the 'plyr' package. – Divi

By default 'join' uses the 'left' type, which is what you need for this problem. What error are you getting? – Divi

I edited the answer. – Divi

PROC TRANSPOSE is your friend here.

proc transpose data=score_data out=score_matrix; 
    by id_r; 
    id id_c; *this makes variable names; 
    var score; 
run; 

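This gives you the upper triangle. A second PROC TRANSPOSE can give you the lower triangle (by swapping id_r and id_c, I imagine), or you can do it in a data step. You still have to create the six 0.0 rows in a data step, but that shouldn't be particularly difficult.

An example of this: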

data pre_transpose; 
    set score_data end=eof; 
    by id_r id_c; 
    output; 

    *Swap R and C; 
    _idtemp = id_r; 
    id_r=id_c; 
    id_c=_idtemp; 
    output; 

    *If EOF, then need that last 0,0 combo which never gets an R; 
    if eof then do; 
        id_c = id_r; 
        score=0; 
        output; 
        id_c = _idtemp; 
    end; 

    *If first line of a new ID, then need the R=C row; 
    if first.id_r then do; 
        id_r=id_c; 
        score=0; 
        output; 
    end; 

run; 

proc sort data=pre_transpose; 
    by id_r id_c; 
run; 
proc transpose data=pre_transpose out=score_matrix; 
    by id_r; 
    id id_c; *this makes variable names; 
    var score; 
run; 
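For readers who want to cross-check the SAS output in R, the same mirror-then-pivot idea can be sketched in base R. This is an illustrative sketch rather than part of the original answer: `long`, `wide`, and `m2` are hypothetical names, and `xtabs()` stands in for PROC TRANSPOSE. Note that `xtabs()` sums SCORE per cell, which equals the score itself only because each pair occurs exactly once here.

```r
# Pair list from the question
ID_r  <- c('A1','A1','A1','A1','A1','A2','A2','A2','A2','A3','A3','A3','A4','A4','A5')
ID_c  <- c('A2','A3','A4','A5','A6','A3','A4','A5','A6','A4','A5','A6','A5','A6','A6')
SCORE <- c(0.2,0.2,0.3,0.2,0.2,0.6,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.9,0.2)
ids   <- sort(unique(c(ID_r, ID_c)))

# Long table: original pairs, swapped pairs, and the zero diagonal,
# mirroring what the data step writes to pre_transpose
long <- data.frame(
  ID_r  = c(ID_r, ID_c, ids),
  ID_c  = c(ID_c, ID_r, ids),
  SCORE = c(SCORE, SCORE, rep(0, length(ids)))
)

# Pivot long to wide: one row per ID_r, one column per ID_c
wide <- xtabs(SCORE ~ ID_r + ID_c, data = long)

# Drop the table class to get a plain numeric matrix
m2 <- matrix(wide, nrow = nrow(wide), dimnames = dimnames(wide))
m2
```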

Thank you! It works perfectly! Thanks a lot. I learned a lot of SAS coding from your answer. –