data.table - A list of examples I like to reference

I wanted to create a post here with rather selfish intentions, honestly. This is a list of tips and code snippets that I reference often, either from my own code or from Stack Overflow or my own code based on Stack Overflow posts. I’m planning to potentially add to this post as time goes on. It is my “Cookbook” so-to-speak.


Warning: This is not particularly intended for beginner data.table users

If you do happen to be a beginner, I highly suggest taking a look at their training on datacamp.com, which is really the best way to get a feel for the syntax and brings you into ‘advanced’ data.table. There is also a handy cheat sheet, although I wish the cheat sheet was an RMarkdown document! Even after completing this, I do not feel like an expert, which is why I often find myself on Stack Overflow.

Tips and Tricks List:

Ordering data by multiple columns using frank() and setorder()

In my work I often find myself with a collection of records where a single record can have multiple values. Sometimes you want to rank your whole dataset or sometimes you might want to rank by group. First tip is if you do need the rank values for your entire table, the frank() function provided with the data.table package will do the job. It is pretty fast, even when ranking by groups over a million rows. In the exa

An example that I might face can be seen in the below snippet of code. Even if I increased the sample to 100M rows, I am able to run the below code on my bottom of the line MacBook Pro.

library(data.table)
set.seed(1234)
vals <- sample(1000000, replace = T)
dt1 <- data.table(A = rep(LETTERS[1:10], 1000000),
                 B = vals)
dt1
##           A      B
##        1: A 761680
##        2: B 630678
##        3: C 304108
##        4: D 932745
##        5: E 295846
##       ---         
##  9999996: F 895094
##  9999997: G 367164
##  9999998: H 493172
##  9999999: I 454539
## 10000000: J  50808
system.time(dt1[, B_rank := frank(B), A])
##    user  system elapsed 
##   0.664   0.099   0.857
# Increase to 10M
set.seed(1234)
vals <- sample(10000000, replace = T)
dt2 <- data.table(A = rep(LETTERS[1:10], 10000000),
                 B = vals)
dt2
##            A       B
##         1: A 1810256
##         2: B 6595564
##         3: C 7224201
##         4: D 8684454
##         5: E 9407364
##        ---          
##  99999996: F 1262113
##  99999997: G 9788484
##  99999998: H 1672373
##  99999999: I 1824994
## 100000000: J 3949637
system.time(dt2[, B_rank := frank(B), A])
##    user  system elapsed 
##   9.123   2.502  15.627
        #    A       B  B_rank
        # 1: A 1137035 1139316
        # 2: B 6222995 6218296
        # 3: C 6092748 6094866
        # 4: D 6233795 6227806

With that said, I often time find myself not needing a column to rank things on but rather just need to order by or extract rows with the highest one or two values of a group in a column. An example might be that I have a lot of duplication in my dataset, but there is one column that might have a value that may determine the ‘best’ selection of the duplicates. That’s where I find setorder() to be a better choice for me.

set.seed(1234)
vals <- sample(1000000, replace = T)
dt1 <- data.table(A = rep(LETTERS[1:10], 1000000),
                 B = vals)
system.time(setorder(dt1, A, B))
##    user  system elapsed 
##   0.814   0.044   0.973
# Increase to 10M
set.seed(1234)
vals <- sample(10000000, replace = T)
dt2 <- data.table(A = rep(LETTERS[1:10], 10000000),
                 B = vals)
system.time(setorder(dt2, A, B))
##    user  system elapsed 
##  11.397   1.305  17.241

I just used simple benchmarking here but you can see particularly for the bigger dataset that the ordering is much faster than creating the column with rank values. From here you can add your own primitive rank or you could extract the top row(s) by group.

# You can add a primitive version of a rank, and altogether the operation would be faster than frank()
# that said, it does not handle ties in any particularly good way.
dt2 <- data.table(A = rep(LETTERS[1:10], 10000000),
                 B = vals)
system.time(setorder(dt2, A, B)[, B_rank := seq(1:.N), A])
##    user  system elapsed 
##  12.322   1.348  15.630
dt2

# Extract top 1 row
dt2 <- data.table(A = rep(LETTERS[1:10], 10000000),
                 B = vals)
dt2[dt2[, .I[1], A]$V1]
##     A       B
##  1: A 1810256
##  2: B 6595564
##  3: C 7224201
##  4: D 8684454
##  5: E 9407364
##  6: F 6417511
##  7: G 4539471
##  8: H  871806
##  9: I  249732
## 10: J 9759272
system.time(dt2[dt2[, .I[1], A]$V1])
##    user  system elapsed 
##   2.643   0.830   3.705
# OR
system.time(dt2[dt2[, head(.I, 1), A]$V1]) #better if picking more than 1 (in my opinion!)
##    user  system elapsed 
##   2.904   1.773   7.370

For a quick explanation: the dt2[, .I[1], A]$V1 expression which is inside the dt2[] i argument is returning the first row number for each group in column A. dt2[dt2[, head(.I, 1), A]$V1] is a good option, PARTICULARLY if you are not sure if choosing the top 5 rows per group and you are not sure if each group has 5 values. BTW, the .I notation I will go over in a tip further down.


Fill NAs in a data.table using set()

There are times where you need to fill the NAs in a data.table with a certain value, like a 0 or an empty string. I often refer to this SO answer .

f_dowle3 = function(DT) {
  # either of the following for loops

  # by name :
  for (j in names(DT))
    set(DT,which(is.na(DT[[j]])),j,0)

  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}

This is also a general reminder that set() is a great function in data.table, especially in cases where the := notation is tricky for some reason or when the data.table is particularly large.


Writing functions/procedures containing data.tables – Sometimes modifying the input data.table is OK, but be sure you know exactly what you’re doing!

There are times in my work where I find myself applying some series of transformations on multiple data.table objects (you might wonder why I have multiple data.tables with similar information, but that is usually a side effect of me generally getting datasets ready for Shiny dashboards). Sometimes, I can put the data.tables through a for loop, sometimes I might write functions that operate on the original data.table – where the “output” is basically a modified version of the input. I’m still not sure which is the best, but to me it has been relieving to know that it’s not out of question to have some side effect on the input data


What the heck is .I?

Coming soon


What the heck is CJ()?

Coming soon

 Share!