Rodolfo Orozco - 3 years ago
Python Question

When to use pandas series, numpy ndarrays or simply python dictionaries?

I am new to Python and to some of its libraries (numpy, pandas).

I have found a lot of documentation on how numpy ndarrays, pandas series and python dictionaries work.

But because of my inexperience in Python, I have had a really hard time knowing when to use each one of them, and I haven't found any best practices that would help me understand and decide when each type of data structure is the better choice.

As a general matter, are there any best practices for deciding which of these three data structures a specific data set should be loaded into?


Answer Source

The rule of thumb that I usually apply: use the simplest data structure that still satisfies your needs. If we rank the data structures from simplest to most complex, it usually ends up like this:

  1. Dictionaries / lists
  2. Numpy arrays
  3. Pandas series / dataframes

So first consider dictionaries / lists. If these let you do all the data operations you need, you are done. If not, start considering numpy arrays. Some typical reasons for moving to numpy arrays are:

  • Your data is 2-dimensional (or higher). Although nested dictionaries/lists can be used to represent multi-dimensional data, in most situations numpy arrays will be more efficient.
  • You have to perform a lot of numerical calculations. As already pointed out by zhqiat, numpy will give a significant speed-up in this case. Furthermore, numpy arrays come bundled with a large number of mathematical functions.
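To make the two reasons above concrete, here is a minimal sketch comparing nested lists with a numpy array. The data (`readings_list`, a small table of made-up measurements) and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical 2-D data: each inner list is one row of measurements.
readings_list = [[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]]

# With plain lists, a per-column mean needs an explicit transpose and loop.
col_means_list = [sum(col) / len(col) for col in zip(*readings_list)]

# With numpy, the same data becomes one 2-D array and the calculation
# is a single vectorized call along an axis.
readings = np.array(readings_list)
col_means = readings.mean(axis=0)

# Element-wise arithmetic applies to the whole array at once.
scaled = readings * 2.0
```

For a table this small, the list version is perfectly adequate; the numpy version tends to pay off as the data grows or the calculations pile up.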

Then there are also some typical reasons for moving beyond numpy arrays to the more complex but also more powerful pandas series/dataframes:

  • You have to merge multiple data sets with each other, or do reshaping/reordering of your data. This diagram gives a nice overview of all the 'data wrangling' operations that pandas allows you to do.
  • You have to import data from or export data to a specific file format like Excel, HDF5 or SQL. Pandas comes with convenient import/export functions for this.
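As a rough illustration of the two pandas reasons above, the sketch below merges two small tables, reshapes the result, and round-trips it through SQL. The tables (`people`, `scores`), the table name `results`, and the in-memory SQLite database are all assumptions made up for this example.

```python
import sqlite3

import pandas as pd

# Hypothetical data sets that share an "id" column.
people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
scores = pd.DataFrame({"id": [1, 1, 2],
                       "test": ["a", "b", "a"],
                       "score": [90, 80, 70]})

# Merge: keep only the ids that appear in both tables.
merged = people.merge(scores, on="id", how="inner")

# Reshape: one row per name, one column per test.
wide = merged.pivot(index="name", columns="test", values="score")

# Export to and import from SQL, here an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
merged.to_sql("results", conn, index=False)
loaded = pd.read_sql("SELECT * FROM results", conn)
conn.close()
```

Doing the same with dictionaries or numpy arrays would mean writing the join and pivot logic by hand, which is usually the point at which pandas becomes worth the extra complexity.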