“Loop through” data.table to calculate conditional averages

Posted 2020-08-13 07:20

I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:

  1. Look up the identifier ID in row i (ID(i))
  2. Look up the value of T2 in row i (T2(i))
  3. Calculate the average of the Data1 values over all rows j that meet both criteria: ID(j) = ID(i) and T1(j) = T2(i)
  4. Enter the calculated average in the column Data2 of row i

     library(data.table)
     DF = data.frame(ID=rep(c("a","b"),each=6),
                     T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
     DT = data.table(DF)
     DT[ , Data2 := NA_real_]
           ID T1 T2 Data1 Data2
      [1,]  a  1  1     1    NA
      [2,]  a  1  2     2    NA
      [3,]  a  1  3     3    NA
      [4,]  a  2  1     4    NA
      [5,]  a  2  2     5    NA
      [6,]  a  2  3     6    NA
      [7,]  b  1  1     7    NA
      [8,]  b  1  2     8    NA
      [9,]  b  1  3     9    NA
     [10,]  b  2  1    10    NA
     [11,]  b  2  2    11    NA
     [12,]  b  2  3    12    NA
    

For this simple example the result should look like this:

      ID T1 T2 Data1 Data2
 [1,]  a  1  1     1     2
 [2,]  a  1  2     2     5
 [3,]  a  1  3     3    NA
 [4,]  a  2  1     4     2
 [5,]  a  2  2     5     5
 [6,]  a  2  3     6    NA
 [7,]  b  1  1     7     8
 [8,]  b  1  2     8    11
 [9,]  b  1  3     9    NA
[10,]  b  2  1    10     8
[11,]  b  2  2    11    11
[12,]  b  2  3    12    NA

I think one way of doing this would be to loop through the rows, but that seems inefficient. I've had a look at the apply() function, but I'm not sure whether it would solve my problem. I could also use data.frame instead of data.table if that would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
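
For reference, a literal row-by-row translation of the mechanism above might look like the sketch below (a naive baseline on the data.frame version; scanning every row once per row is quadratic, so it will crawl on a million rows):

     # Naive baseline: apply the four steps literally, one row at a time
     DF$Data2 <- NA_real_
     for (i in seq_len(nrow(DF))) {
       j <- DF$ID == DF$ID[i] & DF$T1 == DF$T2[i]  # rows j with ID(j)=ID(i), T1(j)=T2(i)
       if (any(j)) DF$Data2[i] <- mean(DF$Data1[j])
     }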

Tags: r data.table
3 Answers
霸刀☆藐视天下
Post #2 · 2020-08-13 07:51

A somewhat faster alternative to iterating over rows is a vectorized solution.

R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12)) 
R> d
   ID T1 T2 Data1
1   a  1  1     1
2   a  1  2     2
3   a  1  3     3
4   a  2  1     4
5   a  2  2     5
6   a  2  3     6
7   b  1  1     7
8   b  1  2     8
9   b  1  3     9
10  b  2  1    10
11  b  2  2    11
12  b  2  3    12

R> # for each row i: mean of Data1 over rows with the same ID whose T1 equals T2[i]
R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
   ID T1 T2 Data1 Data2
1   a  1  1     1     2
2   a  1  2     2     5
3   a  1  3     3   NaN
4   a  2  1     4     2
5   a  2  2     5     5
6   a  2  3     6   NaN
7   b  1  1     7     8
8   b  1  2     8    11
9   b  1  3     9   NaN
10  b  2  1    10     8
11  b  2  2    11    11
12  b  2  3    12   NaN

Also, I'd prefer to preprocess the data before getting it into R. For example, if you are retrieving the data from an SQL server, it might be better to let the server calculate the averages, as it will very likely do a better job at this.
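
With the DBI package, the aggregation could be pushed to the server along these lines (a sketch only: con, the table name mytable, and the exact SQL are assumptions):

# Sketch: let the database compute the per-(ID, T1) averages.
# 'con' is an open DBI connection and 'mytable' a stand-in table name.
library(DBI)
agg <- dbGetQuery(con,
  "SELECT ID, T1, AVG(Data1) AS avg_data1 FROM mytable GROUP BY ID, T1")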

R is actually not very good at raw number crunching, for several reasons, but it is excellent for doing statistics on already-preprocessed data.

Fickle 薄情
Post #3 · 2020-08-13 07:59

The rule of thumb is to aggregate first, and then join to that.

# 1. Aggregate first: mean of Data1 for each (ID, T1) group (the column is named V1)
agg = DT[,mean(Data1),by=list(ID,T1)]
setkey(agg,ID,T1)
# 2. Join: look up each row's (ID, T2) in the keyed agg table; [[3]] extracts the means
DT[,Data2:={JT=J(ID,T2);agg[JT,V1][[3]]}]
      ID T1 T2 Data1 Data2
 [1,]  a  1  1     1     2
 [2,]  a  1  2     2     5
 [3,]  a  1  3     3    NA
 [4,]  a  2  1     4     2
 [5,]  a  2  2     5     5
 [6,]  a  2  3     6    NA
 [7,]  b  1  1     7     8
 [8,]  b  1  2     8    11
 [9,]  b  1  3     9    NA
[10,]  b  2  1    10     8
[11,]  b  2  2    11    11
[12,]  b  2  3    12    NA

As you can see, it's a bit ugly in this case (but it will be fast). It's planned to add drop, which will avoid the [[3]] part, and maybe we could provide a way to tell [.data.table to evaluate i in the calling scope (i.e. no self-join), which would avoid the JT= part; that part is needed here because ID exists in both agg and DT.

keyby has been added in v1.8.0 on R-Forge, so that avoids the need for the setkey call, too.
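
On current data.table versions (the on= join syntax, available since v1.9.6), the same aggregate-then-join idea can be written without setkey or the [[3]] extraction; a sketch using the example's column names:

# Aggregate once per (ID, T1), then update-join each row's (ID, T2) against it;
# rows of DT with no match keep Data2 = NA
agg = DT[, .(avg = mean(Data1)), keyby = .(ID, T1)]
DT[agg, Data2 := i.avg, on = .(ID, T2 = T1)]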

地球回转人心会变
Post #4 · 2020-08-13 08:00

Using tapply and part of another recent post:

DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))

EDIT: Actually, most of the original function was redundant and intended for something else. Here is the simplified version:

# Mean of Data1 for each (ID, T1) combination: a matrix with ID as rows, T1 as columns
ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)

# Two-column index matrix: for each row of DF, the (ID, T2) cell to read from ansMat
i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))

# Matrix indexing returns one looked-up mean per row of DF
DF <- cbind(DF, Data2 = ansMat[i])


# ansMat<-tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
#   curSub <- DF[x, ]
#   myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
#   meanData1 <- mean(curSub$Data1)
#   return(meanData1 = meanData1)
# })

The trick was doing tapply over ID and T1 instead of ID and T2. Anything speedier?
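
One way to find out on the real data is a quick microbenchmark sketch comparing this against an aggregate-and-join (assuming DT is the data.table version of DF, as in the question):

library(data.table)
library(microbenchmark)
microbenchmark(
  tapply = {
    m <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)
    m[cbind(match(DF$ID, rownames(m)), match(DF$T2, colnames(m)))]
  },
  join = {
    agg <- DT[, .(avg = mean(Data1)), keyby = .(ID, T1)]
    agg[DT, on = .(ID, T1 = T2), x.avg]  # one looked-up mean per row of DT
  },
  times = 100
)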
