Today is my day of sprinting here at SciPy 2010. I decided to hang out with the stats crew and harass them (“what’s statistical entropy?”) while reading up on datarray and the various semi-common implementations of the same idea.
My goal is to compare and contrast different implementations of this sort of thing. I’d like to run through the following:
Another major concern is performance. This is scientific computing after all. However, I think it would be a bit premature to worry about this right now, so I’m going to worry about that one later.
Datarray is designed to be very simple, a la the numpy ndarray itself. This means that, so far, its features are pretty much limited to labeling and slicing/dicing.
Its API is still quite embryonic, and is most likely going to Change Completely. However, the mental model seems fairly well-thought-out. In three dimensions, it looks something like this:
Datarray works with what Fernando called “axes” and “ticks.” Axes refer to the dimensions of the n-dimensional object, i.e., i, j, and k in ye traditional 3-d tensor, or “row” and “column” in ye traditional 2-d table. Ticks are the names of various slices along these dimensions. For example, in ye traditional 3-d tensor, the ticks of i would be 1, 2 and 3. In the much more interesting case of a typical spreadsheet, you might have “time” and “temperature” as ticks.
I think for scientists, this is a pretty good data model, and we should probably stick fairly close to it, save for some name changes. A lot of statisticians, on the other hand, never work with anything >2-d a priori, and tend to use categories and “group-by’s” to hack in more axes.
Fer wrote an axis class, amongst a bunch of other things, and there seems to be a lot of validation in there. I just got to looking at it, so it doesn’t completely fit in my head yet (so to speak). However, I did find this line:
ax._tick_dict = dict( zip(ticks, xrange( len(ticks) )) )
That is, datarray uses the ndarray to store everything, and then uses key-value pairs in some dictionaries (one per axis) for slicing arrays by-tick.
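To make the tick-dictionary idea concrete, here is a minimal sketch in plain Python. It is not datarray's actual class structure: the `data` table and the `column_by_tick` helper are made up for illustration, and only the `dict(zip(...))` construction mirrors the line quoted above (written with `range`, since `xrange` only exists in Python 2).

```python
# Hypothetical sketch, not datarray's real classes.
ticks = ["time", "temperature", "pressure"]

# Same construction as the quoted line: each tick maps to its position.
tick_dict = dict(zip(ticks, range(len(ticks))))

# A plain 2-d table stored positionally; columns follow the order of `ticks`.
data = [
    [0.0, 20.5, 101.3],
    [1.0, 21.0, 101.1],
    [2.0, 21.4, 100.9],
]


def column_by_tick(table, tick):
    """Return the column labelled `tick` by translating the label to a position."""
    j = tick_dict[tick]
    return [row[j] for row in table]


temperatures = column_by_tick(data, "temperature")
```

The lookup by label is a single dictionary access, after which everything falls back to ordinary positional indexing on the underlying storage.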
Record arrays are basically ndarrays which hold tuples as datatypes instead of numbers. In my opinion, they kinda suck. Why? Because they’re not very flexible.
See, this comes from a limitation of ndarray itself, due to its C-level implementation. The ndarray is stored in memory not in n dimensions, but as a single flat block, coupled with “strides” and knowledge of how much memory each element of the datatype takes up.
For example, if you were to do this in python:
What you would actually see is something like this at the bit-level:
data=#01020304 dtype=int32 strides=(32,64)
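For anyone who wants to poke at this directly, the sketch below inspects the same pieces on a real array. The 2x2 array is my guess at the kind of example meant here, not a recovered snippet. Note that NumPy reports strides in bytes rather than bits, so the numbers differ from the schematic above.

```python
import numpy as np

# Assumed example array; the values are chosen to match the 1, 2, 3, 4 schematic.
a = np.array([[1, 2], [3, 4]], dtype=np.int32)

flat = a.ravel().tolist()   # elements in memory order for a C-contiguous array
strides = a.strides         # bytes to step along each axis
itemsize = a.itemsize       # bytes per element

# Element (i, j) lives at byte offset i*strides[0] + j*strides[1].
offset_of_1_0 = 1 * strides[0] + 0 * strides[1]
```

Moving one step along the last axis skips one 4-byte element, and moving one step along the first axis skips a whole row of two elements.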
This implementation is nice and fast and works awesome for arrays of all floats (or whatever). However, things get more complicated when you try to throw in something akin to a tuple: you have to specify the C datatype of every field in the tuple up front, just as you would for any other ndarray datatype. Which is fine, except that doing the things we might like to do with a dataframe-like object then becomes a pain.
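Here is a small sketch of what that up-front typing looks like, using a NumPy structured dtype (the mechanism record arrays are built on). The field names and values are invented for illustration.

```python
import numpy as np

# Each field needs an explicit, fixed-size type declared up front.
# Field names and values are made up for this example.
dt = np.dtype([("time", np.int32), ("temperature", np.float64), ("site", "U4")])

rec = np.array([(0, 20.5, "lab"), (1, 21.0, "roof")], dtype=dt)

temps = rec["temperature"]   # pulling out a field by name works well
first = rec[0]               # a single record behaves like a typed tuple

# Strings are fixed-width: anything longer than 4 characters gets truncated.
rec2 = np.array([(2, 19.8, "basement")], dtype=dt)
truncated = str(rec2["site"][0])
```

Field access by name is pleasant enough. The rigidity shows up once the data stops fitting the declared layout, as with the truncated string.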
Still, this is the closest thing numpy has to a dataframe at the moment, so it’s worth keeping in mind.
Larry’s documentation is pretty nice, and worth looking at for those interested.
Larry labels rows and columns. In other words, it only rocks ticks, not axes. For some people this is probably fine (especially if you’re only using rows and columns), but there are definitely use cases for labeled axes, especially when there are lots of them. So, Larry doesn’t solve this problem.
However, Larry does have some extra features that let you do things with the arrays that you would want to do with them, such as aligning, merging, and some mild statistics.
You create a Larry pretty much the same way you do an ndarray. Its labels are just range(n) by default, but you can also pass ticks_0,ticks_1 as an optional argument.
When indexing, Larry wraps label-indices in a list in order to distinguish them from standard, numpy-style indexing. For example:
I’m not sure I like this.
Larry holds an ndarray for the data, and holds lists of labels. Instead of holding key-value pairs, Larry uses the ordering of the list to keep things in check.
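To illustrate the general idea (ordered label lists, plus list-wrapping to tell labels apart from positions), here is a toy sketch. `LabeledTable` is a hypothetical class of my own and is not Larry's API. It only shows why some marker is needed when labels can themselves be integers.

```python
class LabeledTable:
    """Hypothetical 2-d container for illustration; NOT the Larry API."""

    def __init__(self, rows, row_labels, col_labels):
        self.rows = rows
        # Labels are plain ordered lists; position in the list is the index.
        self.labels = [list(row_labels), list(col_labels)]

    def _position(self, axis, key):
        # A key wrapped in a list is treated as a label, anything else as a position.
        if isinstance(key, list):
            return self.labels[axis].index(key[0])
        return key

    def __getitem__(self, key):
        i, j = key
        return self.rows[self._position(0, i)][self._position(1, j)]


t = LabeledTable(
    rows=[[1.0, 2.0], [3.0, 4.0]],
    row_labels=["a", "b"],
    col_labels=[1, 0],  # integer labels that disagree with the positions
)

by_position = t[0, 1]        # row 0, column at position 1
by_label = t[["a"], [1]]     # row labelled "a", column labelled 1
```

With integer labels, `t[0, 1]` and `t[["a"], [1]]` land on different cells, which is the ambiguity the list-wrapping resolves. Note that `list.index` scans the list, whereas a per-axis dictionary finds the position in one lookup.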
Pandas is designed for quantitative finance in particular, meaning that it has a lot of features. Most of these will be out-of-scope for datarray. However, pandas is an important use case for a datarray-ish datatype.
Pandas actually has a number of datatypes meant for handling these sorts of data:
Clearly, Pandas is much less abstracted than datarray, with very concrete notions of dimension (1,2 or 3). This works very well for Pandas, but probably not very well for a datarray successor.
Pandas tends to keep datatypes immutable, or at least unchanged.
Unlike Larry, Pandas uses an ndarray to hold ticks, probably for a speed increase.
Pandas also has a lot of concern for merging and alignment of data. This may be worth having as a base feature.
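As a rough picture of what “alignment” means here, the sketch below lines up two labelled series on the union of their labels, filling the holes. This is plain Python written for illustration, not how Pandas implements it. The `align` helper and the sample data are hypothetical.

```python
def align(left, right, fill=None):
    """Align two label->value mappings on the sorted union of their labels."""
    labels = sorted(set(left) | set(right))
    left_vals = [left.get(k, fill) for k in labels]
    right_vals = [right.get(k, fill) for k in labels]
    return labels, left_vals, right_vals


# Made-up data: two series that only partly overlap in their date labels.
prices = {"2010-06-28": 10.0, "2010-06-29": 10.5}
volumes = {"2010-06-29": 300, "2010-06-30": 250}

labels, p, v = align(prices, volumes)
```

After alignment both value lists share one set of ticks, so position `i` means the same label in each, and elementwise operations become safe.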
Tabular is less datatype-centric and more use-centric. For example, tabular can eat CSV, lists of tuples and lists of columns to create tabulars. Tabular also has tools for exporting data (as HDF, CSV and HTML), and tries to offer a nice API for pulling out slices for various columns (whether it succeeds or not, I’m not sure).
I may be wrong, but tabular seems to still operate on a 2-d level. The data model tries to mirror concepts from spreadsheets, and seems to have a SQL-style join method. Like Larry, it seems to label ticks only, and not axes.
The datatype itself seems to be almost an afterthought, though I’m pretty sure it’s Larry-esque in nature. The goal of tabular was more to provide some nice manipulation functions and some slick input/output action. Some of these are meant to be fast, and may be worth incorporating in the future.
R’s dataframe seems a lot like many of the other implementations of tabular data in Python, in that it’s strictly 2-d. There are labeled columns, and the rows are “observations.” To see more, maybe check this out. Basically, Pandas seems to have the same ideas.
Datarray is doing something that nobody else has, which is to come up with something that “scales well” to n dimensions. Most other cases either don’t really allow for >2 (or maybe >3) dimensions, or they will work for n dimensions but still count on using unlabeled, ordered axes.
However, datarray may seem incomplete in terms of feature set. Datarray mostly allows for slicing and dicing like a bandit, but it seems like most people want to do things like joins/merges, sorting and analyses, a la SQL.
Based on what I’ve seen, I think there are a few lessons to be learned:
Datarray’s axis/ticks abstraction is awesome, and in my opinion crucial. However, many users are going to be coming from a background where everything is shoehorned into two dimensions, and lots of input data arrives shoehorned this way too. Hence, it is crucial to be able to take a tick, along with the data values along that tick, and turn it into a new axis that uses those values as its ticks. Being able to “go back” is just as important: not only will people want to save and/or display data in 2-d structures, but many people also think in terms of 2-d datasets only. Hopefully, this latter case will not hamper use of datarray.
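A minimal sketch of that round trip, in plain Python: a flat 2-d table where “city” is just another column gets turned into a grid with “city” as its own axis, and then flattened back. The `pivot` and `unpivot` helpers and the sample records are hypothetical, and nothing here is datarray API.

```python
def pivot(records, row_col, tick_col, value_col):
    """Turn the distinct values of `tick_col` into the ticks of a new axis."""
    row_ticks = sorted({r[row_col] for r in records})
    col_ticks = sorted({r[tick_col] for r in records})
    row_pos = {t: i for i, t in enumerate(row_ticks)}
    col_pos = {t: j for j, t in enumerate(col_ticks)}
    grid = [[None] * len(col_ticks) for _ in row_ticks]
    for r in records:
        grid[row_pos[r[row_col]]][col_pos[r[tick_col]]] = r[value_col]
    return row_ticks, col_ticks, grid


def unpivot(row_ticks, col_ticks, grid, row_col, tick_col, value_col):
    """Go back to flat records, skipping empty cells."""
    return [
        {row_col: rt, tick_col: ct, value_col: grid[i][j]}
        for i, rt in enumerate(row_ticks)
        for j, ct in enumerate(col_ticks)
        if grid[i][j] is not None
    ]


# Made-up "shoehorned" input: one row per (year, city) observation.
records = [
    {"year": 2009, "city": "Austin", "temp": 20.1},
    {"year": 2009, "city": "Boston", "temp": 10.7},
    {"year": 2010, "city": "Austin", "temp": 20.9},
    {"year": 2010, "city": "Boston", "temp": 11.2},
]

years, cities, grid = pivot(records, "year", "city", "temp")
flat = unpivot(years, cities, grid, "year", "city", "temp")
```

The same move generalizes to more axes by promoting more columns, at the cost of a sparser grid when not every combination of ticks was observed.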
There are a lot of things people will want to do with these. Some of these should be coupled with datarray, but others should probably be in a separate bundle—say, “datarraytools”.
It may be worth looking to SQL, sqlalchemy, etc., for ideas on how to work with data.
In the end, speed will be crucial. Honestly, it will likely be the most important aspect of anything that gets included in numpy.