Comparison with R / R libraries

Since pandas aims to provide much of the data manipulation and analysis functionality that people use R for, this page takes a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:

  • Functionality / flexibility: what can/cannot be done with each tool
  • Performance: how fast operations are. Hard numbers/benchmarks are preferable
  • Ease-of-use: is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)

This page is also here to offer a bit of a translation guide for users of these R packages.

Base R

subset

New in version 0.13.

The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than or equal to another column’s values:

df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,]  # note the comma

In pandas, there are a few ways to perform subsetting. You can use query(), or pass a boolean expression to [] or .loc, as in standard boolean indexing:

In [1]: from pandas import DataFrame

In [2]: from numpy.random import randn

In [3]: df = DataFrame({'a': randn(10), 'b': randn(10)})

In [4]: df.query('a <= b')

          a         b
2 -1.950301  0.173875
3 -1.478332 -0.798063
5 -0.806934  0.141070
8  0.084343  0.879800
9 -0.590813  0.465165

In [5]: df[df.a <= df.b]

          a         b
2 -1.950301  0.173875
3 -1.478332 -0.798063
5 -0.806934  0.141070
8  0.084343  0.879800
9 -0.590813  0.465165

In [6]: df.loc[df.a <= df.b]

          a         b
2 -1.950301  0.173875
3 -1.478332 -0.798063
5 -0.806934  0.141070
8  0.084343  0.879800
9 -0.590813  0.465165

For more details and examples see the query documentation.
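
The three subsetting forms above can be checked against each other in a standalone script. The following is a minimal sketch, not part of the original session: the seed, the variable names, and the `threshold` value are assumptions made for illustration. It also shows the `@` prefix, which lets a query() expression refer to a local Python variable.

```python
import numpy as np
import pandas as pd

# Seeded generator so the script is reproducible (seed chosen arbitrarily).
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.standard_normal(10), "b": rng.standard_normal(10)})

# Three spellings of the same row filter.
by_query = df.query("a <= b")
by_mask = df[df["a"] <= df["b"]]
by_loc = df.loc[df["a"] <= df["b"]]

# All three select the same rows, in the same order.
assert by_query.equals(by_mask)
assert by_mask.equals(by_loc)

# Inside a query string, @name refers to a local Python variable.
threshold = 0.0  # hypothetical cutoff, for illustration only
below = df.query("a <= @threshold")
assert below.equals(df[df["a"] <= threshold])
```

The bracket form `df["a"]` is used here instead of attribute access (`df.a`) only because it also works for column names that are not valid Python identifiers.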

with

New in version 0.13.

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b  # same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:

In [7]: df = DataFrame({'a': randn(10), 'b': randn(10)})

In [8]: df.eval('a + b')

0   -0.316408
1    2.764941
2    2.079059
3   -0.149641
4    1.708174
5   -0.695574
6   -0.513258
7    0.543637
8    1.373293
9    0.466815
dtype: float64

In [9]: df.a + df.b  # same as the previous expression

0   -0.316408
1    2.764941
2    2.079059
3   -0.149641
4    1.708174
5   -0.695574
6   -0.513258
7    0.543637
8    1.373293
9    0.466815
dtype: float64

For large DataFrames, eval() can be much faster than evaluating the same expression in pure Python, particularly when the numexpr library is installed. For more details and examples see the eval documentation.
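
As a quick check that the two spellings agree, the comparison can be written as a standalone script. This is a minimal sketch under assumed inputs (the seed and variable names are not from the original session), and it makes no claim about relative speed.

```python
import numpy as np
import pandas as pd

# Seeded generator so the script is reproducible (seed chosen arbitrarily).
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.standard_normal(10), "b": rng.standard_normal(10)})

# Column names are resolved inside the expression string, much like R's with().
via_eval = df.eval("a + b")

# The explicit form spells out the frame for each column, like df$a + df$b in R.
via_columns = df["a"] + df["b"]

# Both produce the same Series; names are ignored in the comparison.
pd.testing.assert_series_equal(via_eval, via_columns, check_names=False)
```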

zoo

xts

plyr

reshape / reshape2