I spent a lot of time this summer messing around with R and manipulating data in a variety of forms (matrices, data frames etc.).

I was working with a lot of gene expression data. In most of these cases, either a data frame or a matrix can be used. A matrix seems simpler: each row corresponds to a gene, and each column holds the expression levels in a given sample. Since all the values are floats, a matrix works, with the gene names stored in row.names.

R has an easy built-in function for merging matrices (or data frames). For example, it’s simple to merge by row (i.e. to aggregate samples of the expression level of the same gene). If using a matrix, then the gene names are stored in row.names (i.e. `merge(mat1, mat2, by = "row.names")`).
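As a minimal sketch of the matrix case (the gene names, sample names and values below are made up for illustration):

```r
# Two small expression matrices that share gene names as row names.
# All names and values are hypothetical.
genes <- c("geneA", "geneB", "geneC")
mat1 <- matrix(c(1.5, 2.0, 3.5, 4.0, 5.5, 6.0), nrow = 3,
               dimnames = list(genes, c("sample1", "sample2")))
mat2 <- matrix(c(7.5, 8.0, 9.5), nrow = 3,
               dimnames = list(genes, "sample3"))

# merge() coerces the matrices to data frames, so the result is a
# data frame with the shared row names in a "Row.names" column.
merged <- merge(mat1, mat2, by = "row.names")
print(merged)
```

Note that the result is a data frame rather than a matrix, so it needs converting back if a matrix is what you want downstream.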

The other option is to use a data frame, and store one column for string entries (which would store the gene name). Then when merging two data frames, the call would be `merge(df1, df2, by = "GeneName")`.
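A small sketch of the data frame version, again with made-up gene names and values. Only the `GeneName` column name comes from the call above; everything else is an assumption for illustration:

```r
# Two small data frames with an explicit GeneName column.
# Gene names and expression values are hypothetical.
df1 <- data.frame(GeneName = c("geneA", "geneB", "geneC"),
                  sample1  = c(1.5, 2.0, 3.5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(GeneName = c("geneB", "geneC", "geneD"),
                  sample2  = c(4.0, 5.5, 6.0),
                  stringsAsFactors = FALSE)

# By default merge() keeps only the genes present in both data frames.
merged_df <- merge(df1, df2, by = "GeneName")
print(merged_df)
```

Here only geneB and geneC appear in both inputs, so those are the rows that survive the merge.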

Since most expression data were simply made up of floats, it seemed natural to store them as matrices. However, the merging of matrices by row.names was incredibly slow! Surprisingly, using data frames and merging by an explicitly stored “GeneName” column was orders-of-magnitude faster.

I’ve posted an example script showing this behavior on GitHub —

- Creates two random matrices (and gives them the same row names)
- Samples some of these row names (7000 of the 8000 in my standard example)
- Merges the two matrices (or data frames after conversion) and reports how long it took.
- Using row.names, I get about 20 seconds; using an explicit string column, I get 0.04 seconds!
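The steps above can be sketched roughly as follows. This is not the script from the repository, just an approximation under some assumptions: the column count (10), the variable names, and sampling each matrix independently are all my guesses. Timings will vary with machine and R version, so no expected numbers are shown.

```r
set.seed(1)            # reproducible random data
n_genes   <- 8000      # total rows, as in the example above
n_sampled <- 7000      # rows kept from each matrix
n_cols    <- 10        # assumed; the column count is not stated

genes <- paste0("gene", seq_len(n_genes))
mat1 <- matrix(rnorm(n_genes * n_cols), nrow = n_genes,
               dimnames = list(genes, paste0("a", seq_len(n_cols))))
mat2 <- matrix(rnorm(n_genes * n_cols), nrow = n_genes,
               dimnames = list(genes, paste0("b", seq_len(n_cols))))

# Keep a random subset of rows from each matrix (sampled independently).
sub1 <- mat1[sample(genes, n_sampled), ]
sub2 <- mat2[sample(genes, n_sampled), ]

# Option 1: merge the matrices on their row names.
time_rownames <- system.time(
  by_rownames <- merge(sub1, sub2, by = "row.names")
)

# Option 2: convert to data frames with an explicit GeneName column.
df1 <- data.frame(GeneName = rownames(sub1), sub1, stringsAsFactors = FALSE)
df2 <- data.frame(GeneName = rownames(sub2), sub2, stringsAsFactors = FALSE)
time_column <- system.time(
  by_column <- merge(df1, df2, by = "GeneName")
)

print(time_rownames["elapsed"])
print(time_column["elapsed"])
```

Both merges should return the same set of genes, so comparing `nrow(by_rownames)` and `nrow(by_column)` is a quick sanity check before trusting the timings.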

Check it out at: https://github.com/nrjones8/R-Examples. I’d love any feedback/comments on this behavior (i.e. what the internals of R are doing in these two cases). In the documentation for the `merge` function, it mentions “The complexity of the algorithm used is proportional to the length of the answer.” Anyone know what algorithm is being used? Or why it’s so much slower than using an explicit column?

(Note: the merges use 70% of the total number of rows in each matrix.)