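Notably, data.table introduces a new indexing syntax that greatly simplifies subsetting a data frame. It offers an interface close to native matrix indexing, and its internals are implemented in C, which keeps operations fast.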
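Operation comparison

The tables below compare equivalent operations in data.table and dplyr:

| Operation | data.table | dplyr |
| --- | --- | --- |
| Slice rows | `DT[1:2, ]` | `DF[1:2, ]` |
| Slice columns | `DT[, 1:2, with = FALSE]` | `DF[, 1:2]` |
| Grouped summarise | `DT[, sum(y), by = z]` | `DF %>% group_by(z) %>% summarise(sum(y))` |
| Grouped mutate | `DT[, y := cumsum(y), by = z]` | `ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))` |
| Filter, then grouped summarise | `DT[x > 2, sum(y), by = z]` | `DF %>% filter(x > 2) %>% group_by(z) %>% summarise(sum(y))` |
| Filter, then grouped update | `DT[x > 2, y := cumsum(y), by = z]` | `ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y[x > 2])))` |
| Group, then conditional summarise | `DT[, if (any(x > 5L)) y[1L] - y[2L] else y[2L], by = z]` | `DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])` |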
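As a minimal sketch of the `DT[i, j, by]` pattern compared above, the following runs the "filter, then grouped summarise" and "filter, then grouped update" rows on a small made-up table. The column names `x`, `y`, `z` follow the comparison; the values and the result name `total` are assumptions chosen for illustration.

```r
library(data.table)

# Toy data: the values are invented, only the column names come from the comparison.
DT <- data.table(
  x = 1:6,
  y = c(10, 20, 30, 40, 50, 60),
  z = c("a", "a", "a", "b", "b", "b")
)

# i filters rows, j computes, by groups:
# sum of y per z, restricted to rows where x > 2.
res <- DT[x > 2, .(total = sum(y)), by = z]
print(res)

# := updates the column by reference, only on the filtered rows,
# so no copy of DT is made and no assignment is needed.
DT[x > 2, y := cumsum(y), by = z]
print(DT)
```

Rows that fail the filter (`x <= 2`) keep their original `y`, which is the behaviour the dplyr `replace()` idiom has to emulate explicitly.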
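The apply family

| Operation | data.table | dplyr |
| --- | --- | --- |
| Grouped per-column results added as columns | `DT[, (cols) := lapply(.SD, sum), by = z]` | `ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))` |
| Grouped summary of each column | `DT[, lapply(.SD, sum), by = z]` | `DF %>% group_by(z) %>% summarise_each(funs(sum))` |
| Grouped summary of each column, several functions | `DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]` | `DF %>% group_by(z) %>% summarise_each(funs(sum, mean))` |
| Grouped summary of each column, plus group size | `DT[, c(.N, lapply(.SD, sum)), by = z]` | `DF %>% group_by(z) %>% summarise_each(funs(n(), sum))` |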
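The `.SD` idioms above can be sketched on a small table as follows. The data, the explicit `.SDcols`, the named `n` column, and the target names in `cols` are assumptions added for illustration; the comparison itself leaves `cols` undefined.

```r
library(data.table)

# Toy data: z is the grouping column, u and v are the value columns.
DT <- data.table(z = c("a", "a", "b"), u = c(1, 2, 3), v = c(10, 20, 30))

# .SD is the subset of data for each group (all columns except those in by).
# lapply(.SD, sum) therefore sums every value column within each group.
sums <- DT[, lapply(.SD, sum), by = z]

# .N is the group size; wrapping it in a named list gives the column a name.
counts <- DT[, c(list(n = .N), lapply(.SD, sum)), by = z]

# Add the per-group sums back as new columns, by reference.
# cols holds hypothetical names for the new columns.
cols <- c("u_sum", "v_sum")
DT[, (cols) := lapply(.SD, sum), by = z, .SDcols = c("u", "v")]

print(sums)
print(counts)
print(DT)
```

Passing `.SDcols` explicitly keeps the number of computed columns aligned with `cols` even if the table later gains extra columns.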
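Join operations

The data.table examples below assume the key has been set first:

    setkey(DT1, x, y)

| Operation | data.table | dplyr |
| --- | --- | --- |
| Plain join | `DT1[DT2]` | `left_join(DT2, DT1)` |
| Join with column selection | `DT1[DT2, .(z, i.mul)]` | `left_join(select(DT2, x, y, mul), select(DT1, x, y, z))` |
| Aggregating join | `DT1[DT2, .(sum(z) * i.mul), by = .EACHI]` | `DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% inner_join(DF2) %>% mutate(z = z * mul) %>% select(-mul)` |
| Updating join | `DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]` | join, then `group_by` + `mutate` |
| Rolling join | `DT1[DT2, roll = -Inf]` | (none given) |
| Other arguments controlling the output | `DT1[DT2, mult = "first"]` | (none given) |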
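A minimal sketch of the keyed joins listed above, on made-up data. `DT1`, `DT2`, and the columns `x`, `y`, `z`, `mul` follow the comparison; the values, the result name `total`, and the separate `prices`/`asks` tables used for the rolling join are assumptions for illustration (the rolling join needs a numeric last key column, so it gets its own tables here).

```r
library(data.table)

DT1 <- data.table(x = c(1, 1, 2), y = c("a", "a", "b"), z = c(1, 2, 3))
DT2 <- data.table(x = c(1, 2), y = c("a", "b"), mul = c(10, 100))
setkey(DT1, x, y)

# Plain join: for each row of DT2, look up matching rows of DT1 on the key.
joined <- DT1[DT2]

# Aggregating join: by = .EACHI evaluates j once per row of DT2;
# the i. prefix refers to columns of DT2.
agg <- DT1[DT2, .(total = sum(z) * i.mul), by = .EACHI]

# mult = "first" keeps only the first matching row of DT1 per row of DT2.
first_match <- DT1[DT2, mult = "first"]

# Rolling join: roll = -Inf carries the next observation backward,
# so each query time picks up the next available price.
prices <- data.table(time = c(1, 5, 10), price = c(100, 105, 110), key = "time")
asks <- data.table(time = c(3, 7))
rolled <- prices[asks, roll = -Inf]

print(joined)
print(agg)
print(first_match)
print(rolled)
```

The aggregating join avoids materialising the full join result before summarising, which is the main difference from the dplyr pipeline in the comparison.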
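Combining results

| Operation | data.table | dplyr |
| --- | --- | --- |
| Group, then aggregate several columns (one row per group) | `DT[, list(x[1], y[1]), by = z]` | `DF %>% group_by(z) %>% summarise(x[1], y[1])` |
| Group, then combine several columns (multiple rows per group) | `DT[, list(x[1:2], y[1]), by = z]` | `DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))` |
| Grouped quantile, single value | `DT[, quantile(x, 0.25), by = z]` | `DF %>% group_by(z) %>% summarise(quantile(x, 0.25))` |
| Grouped quantiles, multiple values | `DT[, quantile(x, c(0.25, 0.75)), by = z]` | `DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))` |
| Grouped summary spread across columns | `DT[, as.list(summary(x)), by = z]` | `DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))` |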
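The difference between these rows lies in the shape of what `j` returns: a length-one value gives one row per group, a longer vector gives several rows per group, and a list gives one column per element. A small sketch on invented data (the values and the names `q25` and `q` are assumptions for illustration):

```r
library(data.table)

DT <- data.table(
  z = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 10, 20, 30)
)

# j returns a single value per group: one row per group.
one_row <- DT[, .(q25 = quantile(x, 0.25)), by = z]

# j returns two values per group: the results are stacked, two rows per group.
two_rows <- DT[, .(q = quantile(x, c(0.25, 0.75))), by = z]

# j returns a list: each element becomes its own column, one row per group.
wide <- DT[, as.list(summary(x)), by = z]

print(one_row)
print(two_rows)
print(wide)
```

In data.table the same `j` slot handles all three shapes, whereas the dplyr versions in the comparison switch from `summarise()` to `do()` once a group yields more than one row.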
For more operations, see the data.table cheat sheet.
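DataFrame visualization: DT

The DT package, written by Yihui Xie, renders data frames as interactive tables and supports querying operations such as filtering, pagination, sorting, and searching.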
    library(DT)
    datatable(iris)

In addition, the DT package offers extensive UI customization, allowing deep customization through HTML, CSS, and JavaScript. For example:
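    m = matrix(c(
      '<b>Bold</b>', '<em>Emphasize</em>',
      '<a href="http://rstudio.com">RStudio</a>',
      '<a href="#" onclick="alert(\'Hello World\');">Hello</a>'
    ), 2)
    colnames(m) = c('<span style="color:red">Column 1</span>', '<em>Column 2</em>')
    datatable(m)                  # escape = TRUE by default: HTML is shown as literal text
    datatable(m, escape = FALSE)  # render the HTML inside the cells

    # raw_matrix is not defined in this article; it is assumed to hold
    # numeric data with columns named "A" and "B".
    raw_matrix %>%
      DT::datatable(options = list(pageLength = 30, dom = 'tip')) %>%
      DT::formatStyle(columns = c("A", "B"),
                      # draw a bar in each cell, scaled from 0 to the overall maximum
                      background = styleColorBar(c(0, max(raw_matrix, na.rm = TRUE)), 'steelblue'),
                      backgroundSize = '100% 50%',
                      backgroundRepeat = 'no-repeat',
                      backgroundPosition = 'center')

Distributed DataFrame: DDF

DDF stands for Distributed Data Frame. DDF provides a unified, cross-engine API that simplifies analysis across multiple data sources, and further hides the distributed machinery underneath the data frame from the user.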