My colleague Chung-hong Chan started a new package in our teams GitHub organization. An issue there caught my attention. The performance was very slow of the main function. The issue lay somewhere in the auxiliary functions. This lead me down quite a rabbit hole to optimize the calculation of row minimums (you can skip the prelude, if you are not interested in the backstory).
Prelude
My goal initially was just to try to work around the identified bottleneck of the auxiliary functions I isolated the bottleneck part into a separate function, which looked like this.
# helper functions taken from the package
.cal_dist <- function(y, poss) {
return(abs(y - poss))
}
.get_min <- function(pos, x) {
min(purrr::map_dbl(x, pos))
}
target_idx <- c(4,7)
poss <- 1:1547
count_from <- 1
purrr_min <- function(){
res <- lapply(target_idx, .cal_dist, poss = poss)
purrr::map_dbl(poss, .get_min, x = res) + count_from
}
head(purrr_min())
## [1] 4 3 2 1 2 2
There is a lot of code here that is specific to the original structure of the package but in essence
purrr_min
calculates the distance for each element in poss
to the indices target_idx
. The output should be
the minimum distance to any index in target_idx
(incremented by count_from
).
bench::mark(
purrr_min()
)
## Warning: Some expressions had a GC in every iteration; so filtering is
## disabled.
## # A tibble: 1 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 purrr_min() 79.5ms 81.4ms 12.3 64.8KB 14.1
The performance of one call is not too bad, but this has to be done many times over and runtime accumulates quite quickly.
The original code was working with lists and purrr, but it is possible to also see this as a matrix problem: Build a matrix
with length(target_idx)
columns and length(poss)
rows and an entry (i,j) of the matrix is the distance of item i in poss
to item j in target_idx
. All that is left to do, is calculate the minimum in each row to get the same output as above. This can be a “simple” apply call.
apply_min <- function(){
res <- sapply(target_idx, .cal_dist, poss = poss)
apply(res,1,min)+count_from
}
bench::mark(
purrr_min(),
apply_min()
)
## Warning: Some expressions had a GC in every iteration; so filtering is
## disabled.
## # A tibble: 2 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 purrr_min() 82.99ms 83.71ms 11.7 48.5KB 15.6
## 2 apply_min() 1.74ms 1.84ms 504. 151.3KB 20.0
The speed up is insane (it was actually quite surprising to me!). But can this be even faster?
Calculating row minimum fast
The prelude established the following optimization problem: Calculate the minimum in each row of an (integer!) matrix fast. We will work with the following matrix.
set.seed(654)
m <- matrix(sample(1:20, 50000, replace = TRUE), ncol = 5)
Pure R solutions
The function derived in the prelude is based on apply
.
rowmin_apply <- function(x){
apply(x,1,min)
}
bench::mark(
rowmin_apply(m)
)
## # A tibble: 1 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 rowmin_apply(m) 11.9ms 14.3ms 68.7 391KB 22.0
Can this be improved? As a matter of fact, it can and the solution might surprise a little.
rowmin_pmin <- function(x){
do.call(pmin, as.data.frame(x))
}
The function converts the matrix x to a data frame (pmin
doesn’t work with matrices) and then applies pmin
across it. The pmin
function takes multiple vectors as input and returns a vector of the minimum values at each position. do.call
is used to apply pmin
across all columns of the data frame (which are the rows of your original matrix).
This method is more efficient because pmin
is vectorized and do.call
efficiently passes the columns of the data frame as arguments to pmin
.
bench::mark(
apply = rowmin_apply(m),
pmin = rowmin_pmin(m)
)
## # A tibble: 2 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 apply 13.2ms 14.6ms 67.9 391KB 26.6
## 2 pmin 355.2µs 446.2µs 2050. 430KB 17.7
That is quite a speedup for a function that looks totally off. You can squeeze out a little bit more by using the fact the matrix has only integer values.
rowmin_pmin.int <- function(x){
do.call(pmin.int, as.data.frame(x))
}
bench::mark(
apply = rowmin_apply(m),
pmin = rowmin_pmin(m),
pmin.int = rowmin_pmin.int(m),
)
## # A tibble: 3 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 apply 13.2ms 14.1ms 68.3 391KB 21.8
## 2 pmin 356.9µs 406.4µs 2418. 430KB 19.5
## 3 pmin.int 354.4µs 411.7µs 2371. 433KB 19.6
At this point, I was sure that I will not be able to squeeze out more in pure R (maybe there is a way?).
C++/C solutions
An obvious way to keep optimizing the solution is to switch to Rcpp. Now my C++ skills are still not the best, but I gave a naïve implementation a shot.
Rcpp::cppFunction(
"
NumericVector rowmin_cpp_naive(NumericMatrix mat) {
int nRows = mat.nrow();
int nCols = mat.ncol();
NumericVector mins(nRows);
for(int i = 0; i < nRows; ++i) {
double minVal = mat(i, 0);
for(int j = 1; j < nCols; ++j) {
if(mat(i, j) < minVal) {
minVal = mat(i, j);
}
}
mins[i] = minVal;
}
return mins;
}
"
)
bench::mark(
apply = rowmin_apply(m),
pmin = rowmin_pmin(m),
pmin.int = rowmin_pmin.int(m),
cpp_naive = rowmin_cpp_naive(m)
)
## # A tibble: 4 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 apply 12.1ms 12.5ms 77.0 391KB 38.5
## 2 pmin 403.7µs 495.2µs 1772. 430KB 13.6
## 3 pmin.int 383.8µs 424.8µs 2241. 430KB 19.2
## 4 cpp_naive 103.8µs 120.9µs 7717. 471KB 70.7
Nice, so a straightforward C++ implementation is faster. This can probably be optimized even more, but I didn’t get much further than this.
Out of pure curiosity, I also ventured into C. My C is really bad, but I got something to work.
rowmin_c_naive <- inline::cfunction(
signature(mat = "integer", nRows = "integer", nCols = "integer"),
body = "
int nrows = INTEGER(nRows)[0];
int ncols = INTEGER(nCols)[0];
SEXP mins = PROTECT(allocVector(INTSXP, nrows));
int *pmat = INTEGER(mat);
int *pmins = INTEGER(mins);
for(int i = 0; i < nrows; i++) {
int minVal = pmat[i];
for(int j = 1; j < ncols; j++) {
int currentVal = pmat[i + j * nrows];
if (currentVal < minVal) {
minVal = currentVal;
}
}
pmins[i] = minVal;
}
UNPROTECT(1);
return mins;
",
language = "C"
)
bench::mark(
apply = rowmin_apply(m),
pmin = rowmin_pmin(m),
pmin.int = rowmin_pmin.int(m),
cpp_naive = rowmin_cpp_naive(m),
c_naive = rowmin_c_naive(m,nrow(m), ncol(m))
)
## # A tibble: 5 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 apply 11.9ms 12.8ms 77.1 390.9KB 33.9
## 2 pmin 346.9µs 400.2µs 2422. 430.2KB 22.4
## 3 pmin.int 325.8µs 364.5µs 2622. 430.2KB 21.4
## 4 cpp_naive 90.2µs 106.3µs 8468. 471.3KB 79.7
## 5 c_naive 29.6µs 31µs 29544. 39.1KB 20.7
Again, probably still room for improvement, but I was not expecting to squeeze out that much with C.