-->

R delete rows in data frame where nrow of index is

2020-04-18 01:12发布

问题:

I want to delete certain rows in a data frame when the number of rows with the same index is smaller than a pre-specified value.

> fof.6.5[1:15, 1:3]
   draw Fund.ID Firm.ID
1     1    1667     666
2     1    1572     622
3     1    1392     553
4     1     248      80
5     1    3223     332
6     2    2959    1998
7     2    2659    1561
8     2   14233    2517
9     2   10521   12579
10    2    3742    1045
11    3    9093   10121
12    3   15681   21626
13    3   26371   70170
14    4   27633   52720
15    4   13751     656

In this example, I want each index to have 5 rows. The third draw (which is my index) has fewer than 5 rows. How can I delete the draws like the third one if they have fewer than 5 rows?

回答1:

You could do this using dplyr (assuming your data is in a data frame called dt:

dt %>% group_by(draw) %>% filter(n() >= 5) %>% ungroup()

Or you could use table or xtabs:

tab <- xtabs(~ draw, dt)

dt[!dt$draw %in% as.numeric(names(which(tab < 5))), ]


回答2:

Here is another option using data.table. Convert the 'data.frame' to 'data.table' (setDT(df1), grouped by 'draw', if the nrows (.N) are greater than or equal to 'n' then get the Subset of Data.table (.SD)

library(data.table)
n <- 5
setDT(df1)[, if(.N >= n) .SD, by = draw]  
#    draw Fund.ID Firm.ID
# 1:    1    1667     666
# 2:    1    1572     622
# 3:    1    1392     553
# 4:    1     248      80
# 5:    1    3223     332
# 6:    2    2959    1998
# 7:    2    2659    1561
# 8:    2   14233    2517
# 9:    2   10521   12579
#10:    2    3742    1045

If we want only 'n' number of rows, use the head

setDT(df1)[, if(.N >= n) head(.SD, n), by = draw]  


回答3:

Code below makes that:

library(dplyr)
fof.6.5 <- data.frame(draw = rep(1:4, c(5, 5, 3, 2)),
                      Fun.ID = rnorm(15),
                      Firm.ID = rnorm(15))
fof_filter <- fof.6.5 %>% group_by(draw) %>% mutate(freq = n()) %>%
  filter(freq >= 5) %>% select(-freq)
fof_filter
  • group_by join (in someway) profiles with the same value of draw
  • mutate add a new variable "freq" that give the number of profiles per each group of draw and repeats it.
  • filter selects only those profiles with "freq" greater or equals to 5
  • select delete the "freq" column because is not needed anymore