Slow Merging in R (using row.names)

I spent a lot of time this summer messing around with R and manipulating data in a variety of forms (matrices, data frames etc.).

I was working with a lot of gene expression data. In most of these cases, either a data frame or a matrix can be used. A matrix seems simpler: each row corresponds to a gene, and each column holds the expression levels in a given sample. All of the values are floats, so we can use a matrix and store the gene names in row.names.

R has an easy built-in function, merge, for merging matrices (or data frames). For example, it’s simple to merge by row (i.e. to aggregate samples of the expression level of the same gene). If using matrices, the gene names are stored in row.names, so the call is merge(mat1, mat2, by = "row.names").

The other option is to use a data frame with one string column that stores the gene name. Then, when merging two data frames, the call would be merge(df1, df2, by = "GeneName").
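
The two merge calls described above can be tried side by side on toy data. This is only a sketch: the gene names, sample names, and values below are made up for illustration.

```r
# Toy data: three genes, two samples per matrix (all names are made up).
genes <- c("geneA", "geneB", "geneC")
mat1 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
               dimnames = list(genes, c("s1", "s2")))
mat2 <- matrix(c(7, 8, 9, 10, 11, 12), nrow = 3,
               dimnames = list(genes, c("s3", "s4")))

# Option 1: merge on row names. merge() coerces the matrices to data frames
# and returns the key in an extra column called "Row.names".
by_rownames <- merge(mat1, mat2, by = "row.names")

# Option 2: keep the gene name in an explicit column and merge on it.
df1 <- data.frame(GeneName = genes, mat1, stringsAsFactors = FALSE)
df2 <- data.frame(GeneName = genes, mat2, stringsAsFactors = FALSE)
by_column <- merge(df1, df2, by = "GeneName")

print(by_rownames)
print(by_column)
```

Both results hold the same values; the only difference is whether the key ends up in a Row.names column or in GeneName.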

Since most expression data were simply made up of floats, it seemed natural to store them as matrices. However, the merging of matrices by row.names was incredibly slow! Surprisingly, using data frames and merging by an explicitly stored “GeneName” column was orders-of-magnitude faster.

I’ve posted an example script showing this behavior on GitHub. The script:

  • Creates two random matrices (and gives them the same row names)
  • Samples some of these row names (7000 of the 8000 in my standard example)
  • Merges the two matrices (or data frames after conversion) and reports how long it took

Using row.names, the merge takes about 20 seconds; using an explicit string column, it takes 0.04 seconds!
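
The steps above can be sketched roughly as follows. This is not the script from the repository, just a minimal reconstruction under assumptions: the column count, the gene-name scheme, and the helper names (make_matrix, to_df) are invented here, and the sizes are scaled down from 8000 rows so it runs quickly. Timings will vary with machine and R version.

```r
set.seed(1)  # make the random data reproducible

n_rows   <- 2000  # the post's example uses 8000; smaller here so it runs quickly
n_cols   <- 10    # number of samples per matrix (assumed)
n_sample <- 1750  # same 7/8 fraction as sampling 7000 of 8000

# Hypothetical gene names shared by both matrices.
gene_names <- paste0("gene", seq_len(n_rows))

make_matrix <- function() {
  matrix(rnorm(n_rows * n_cols), nrow = n_rows,
         dimnames = list(gene_names, NULL))
}

# Each matrix keeps its own random subset of the shared row names.
mat1 <- make_matrix()[sample(gene_names, n_sample), ]
mat2 <- make_matrix()[sample(gene_names, n_sample), ]

# Same data as data frames, with the gene name in an explicit column.
to_df <- function(m) {
  data.frame(GeneName = rownames(m), m, row.names = NULL,
             stringsAsFactors = FALSE)
}
df1 <- to_df(mat1)
df2 <- to_df(mat2)

# Time the merge keyed on row names.
time_rownames <- system.time(
  merged_rn <- merge(mat1, mat2, by = "row.names")
)[["elapsed"]]

# Time the merge keyed on the explicit GeneName column.
time_column <- system.time(
  merged_col <- merge(df1, df2, by = "GeneName")
)[["elapsed"]]

cat("row.names merge:", time_rownames, "seconds\n")
cat("GeneName merge: ", time_column, "seconds\n")
```

Both merges keep only the genes present in both samples, so the two results should have the same number of rows; only the elapsed times are expected to differ.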

Check it out at https://github.com/nrjones8/R-Examples and send along any feedback/comments on this behavior (e.g. what the internals of R are doing in these two cases). The documentation for the merge function mentions that “The complexity of the algorithm used is proportional to the length of the answer.” Anyone know what algorithm is being used? Or why merging by row.names is so much slower than using an explicit column?

[Figure: merge_times (note: using 70% of the total number of rows in each matrix we are merging)]

RStudio: The Good and the Bad

I’ve been working with RStudio on and off for about a year now, and it really provides some necessary functionality for working with R. It’s a huge step up from the standard R app, and does a fantastic job of integrating projects that require visualization, text, and R code (e.g. Markdown, Sweave, Shiny). Having separate panes for plots, the console, and source makes the development process much smoother.

With that said, there are definitely some improvements that could be made. Perhaps I’m spoiled as a Sublime Text user for Python/Ruby development, but I’d love to see some of Sublime’s functionality incorporated into RStudio. Here are a few things that frustrate me about RStudio, and please let me know if you have suggestions about fixing these gripes!

  • Key bindings. Why are these not customizable!? For example, I find switching between tabs in the source pane to be a huge pain (no pun intended): control-option-left/right is not the most comfortable keystroke for someone used to cmd-shift-leftbracket/rightbracket, even if I were to map my Caps Lock to control.
  • Why restrict what can be shown in each pane? For example, I like to look at the data (via View) as I’m writing a function/plot in the source pane, but currently can’t do that since the “View” function takes up the source pane. If my data frame foo is too big to use head(foo) in the console pane, then I’m stuck. Using names(foo) is an OK workaround, but doesn’t provide as much information as actually seeing some of the data would. The workspace tab is great for seeing the names of data frames etc., but it would be even better to be able to View data in that same pane.
  • Splitting panes, as in Sublime. Perhaps this would be too much (> 4 panes is getting crowded), but I often want to be able to reference other files as I’m writing a new script. For example, I often write all my plotting functions in one file and my data-cleaning functions in another file. When I’m writing a function to plot data (usually with ggplot2) from a data frame I just cleaned (usually via ddply, or some other plyr function), I need to reference the names of the columns present in the cleaned data frame. Tab-completion doesn’t help, since either (1) I’m writing a generic function about a data frame that isn’t in my environment or (2) the R environment doesn’t know that a data frame foo has a column named bar, unless I’m using the $ operator (which is rare in the case of ggplot2).
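
On the second gripe above (a data frame too big for head(foo) in the console), one partial workaround is base R’s str plus a column-limited head. A minimal sketch, with a made-up foo standing in for a real data frame:

```r
# Hypothetical wide data frame standing in for `foo`.
foo <- as.data.frame(matrix(rnorm(1000 * 50), nrow = 1000))

# Compact per-column summary: type plus the first few values of each column.
str(foo, list.len = 10)  # list.len caps how many columns are printed

# First rows of just a handful of columns, so the output fits the console.
head(foo[, 1:5], n = 3)

dim(foo)  # number of rows and columns
```

This still isn’t a substitute for seeing the data next to the source pane, but it shows more than names(foo) does.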

Any other common things that bug people about RStudio? Or suggestions for fixing my complaints above?

CS in K-12 Education

When should kids start learning to code? With the rise in available resources, MOOCs, and the need for more people with technical backgrounds in industry, this is a popular question today. As a CS student at a liberal arts school, I’ve heard it stressed over and over that “computer science” is really the study of algorithms; that is, given a problem, how can we go about solving it? With this in mind, it seems that a different, perhaps more important, question is “when should kids start learning computer science?”

CS is unlike just about any other discipline I can think of — it’s most closely related to math in its problem-solving nature. However, starting to learn CS is pretty much a prerequisite-free endeavor. Whereas math has a very hierarchical structure, the first parts of CS don’t. What are the prerequisites for an intro CS course? Nothing. For Calc I? Some algebra, some trig, a “familiarity” with math, and so forth. Since CS is rooted in problem solving, there really is no reason not to teach children CS from an early age, i.e. in K-12 education. 

That’s where companies like Tynker can come into play. A “new computing platform designed specifically to teach children computational learning and programming skills in a fun and imaginative way,” Tynker appears to address exactly the above problem of teaching computer science. It seems that many people are put off by CS’s “scary” syntax, mysterious symbols, and so forth. If we teach children from a young age how to “think like a computer scientist,” the translation from ideas to implementation (i.e. learning to program in Python, C++, or whatever) becomes much smoother and more natural. When students are first learning how to actually code, they can then focus on the details/nuances of the language, having already built the background knowledge of “thinking like a computer scientist.”

Better yet, Tynker is completely browser-based and therefore doesn’t require a ton of fancy equipment to use. While some schools still lack the resources (i.e. sufficient computers to let students use Tynker via a browser), it’s certainly a step in the right direction. Learn more about Tynker and check out their blog: http://www.tynker.com/blog/