Treading on Python Series
Learning Pandas
Python Tools for Data Munging, Data Analysis, and Visualization
Matt Harrison
Technical Editor:
Copyright © 2016
While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Table of Contents
From the Author
Introduction
Installation
Data Structures
Series
Series CRUD
Series Indexing
Series Methods
Series Plotting
Another Series Example
DataFrames
Data Frame Example
Data Frame Methods
Data Frame Statistics
Grouping, Pivoting, and Reshaping
Dealing With Missing Data
Joining Data Frames
Avalanche Analysis and Plotting
Summary
About the Author
Also Available
One more thing
From the Author
Python is easy to learn. You can learn the basics in a day and be productive with it. With only an understanding of Python, though, moving to pandas can be difficult or confusing. This book is meant to aid you in mastering pandas.
I have taught Python and pandas to many people over the years, in large corporate environments, at small startups, and at Python and Data Science conferences. I have seen what hangs people up and confuses them. With the correct background, an attitude of acceptance, and a deep breath, much of this confusion evaporates.
Having said this, pandas is an excellent tool. Many are using it around the world to great success. I hope you do as well.
Cheers!
Matt
Introduction
I have been using Python in some professional capacity since the turn of the century. One of the trends that I have seen in that time is the uptake of Python for various aspects of "data science": gathering data, cleaning data, analysis, machine learning, and visualization. The pandas library has seen much uptake in this area.
pandas 1 is a data analysis library for Python that has exploded in popularity over the past few years. The website describes it thusly:
“pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.”
-pandas.pydata.org
My description of pandas is: pandas is an in-memory NoSQL database that has SQL-like constructs, basic statistical and analytic support, as well as graphing capability. Because it is built on top of Cython, it has less memory overhead and runs quicker.
Many people are using pandas to replace Excel, perform ETL, process tabular data, load CSV or JSON files, and more. Though it grew out of the financial sector (for analysis of time series data), it is now a general-purpose data manipulation library.
Because pandas has some lineage back to NumPy, it adopts some NumPy-isms that normal Python programmers may not be aware of or familiar with. Certainly, one could go out and use Cython to perform fast typed data analysis with a Python-like dialect, but with pandas, you don't need to. This work is done for you. If you are using pandas and the
vectorized operations, you are getting close to C-level speeds, but writing Python.
Who this book is for
This guide is intended to introduce pandas to Python programmers. It covers many (but not all) aspects, as well as some gotchas or details that may be counter-intuitive or even non-pythonic to longtime users of Python.
This book assumes basic knowledge of Python. The author has written Treading on Python Vol 1 2, which provides all the background necessary.
Data in this Book
Some might complain that the datasets in this book are small. That is true, and in some cases (as in plotting a histogram), that is a drawback. On the other hand, every attempt has been made to use real data that illustrates pandas and the features found in it. As a visual learner, I appreciate seeing where data is coming from and where it is going. As such, I try to shy away from just showing tables of random numbers that have no meaning.
Hints, Tables, and Images
The hints, tables, and graphics found in this book have been collected over almost five years of using pandas. They are derived from hangups, notes, and cheatsheets that I have developed after using pandas and teaching others how to use it. Hopefully, they are useful to you as well.
In the physical version of this book is an index that has also been battle-tested during development. Inevitably, when I was doing analysis not related to the book, I would check that the index had the information I needed. If it didn't, I added it. Let me know if you find any omissions!
Finally, having been around the publishing block and releasing content to the world, I realize that I probably have many omissions that others might consider required knowledge. Many will enjoy the content; others might have the opposite reaction. If you have feedback or suggestions for
improvement, please reach out to me. I love to hear back from readers! Your comments will improve future versions.
1 - pandas (http://pandas.pydata.org) refers to itself in lowercase, so this book will follow suit.
2 - http://hairysun.com/books/tread/
Installation
Python 3 has been out for a while now, and people claim it is the future. As an attempt to be modern, this book will use Python 3 throughout! Do not despair; the code will run in Python 2 as well. In fact, review versions of the book neglected to list the Python version, and there was a single complaint about a superfluous list(range(10)) call. The lone line of (Python 2) code required for compatibility is:

>>> from __future__ import print_function

Having gotten that out of the way, let's address installation of pandas.
The easiest and least painful way to install pandas on most platforms is to use the Anaconda distribution 3. Anaconda is a meta-distribution of Python that contains many additional packages that have traditionally been annoying to install unless you have the toolchains to compile Fortran and C code. Anaconda allows you to skip the compile step and provides binaries for most platforms. The Anaconda distribution itself is freely available, though commercial support is available as well.
After installing the Anaconda package, you should have a conda executable. Running:

$ conda install pandas

will install pandas and any dependencies. To verify that this works, simply try to import the pandas package:

$ python
>>> import pandas
>>> pandas.__version__
'0.18.0'

If the library successfully imports, you should be good to go.
Other Installation Options
The pandas library will install on Windows, Mac, and Linux via pip 4. Mac and Windows users wishing to install binaries may download them from the pandas website. Most Linux distributions also have native packages pre-built and available in their repos. On Ubuntu and Debian, apt-get will install the library:

$ sudo apt-get install python-pandas

pandas can also be installed from source. I feel the need to advise you that you might spend a bit of time going down this rabbit hole if you are not familiar with getting compiler toolchains installed on your system. It may be necessary to prep the environment for building pandas from source by installing dependencies and the proper header files for Python. On Ubuntu this is straightforward; other environments may be different:

$ sudo apt-get install build-essential python-all-dev

Using virtualenv 5 will alleviate the need for superuser access during installation. Because virtualenv uses pip, it can download and install newer releases of pandas if the version found in the distribution is lagging. On Mac and Linux platforms, the following commands create a virtualenv sandbox and install the latest pandas in it (assuming that the prerequisite files are also installed):

$ virtualenv pandas-env
$ source pandas-env/bin/activate
$ pip install pandas

After a while, pandas should be ready for use. Try to import the library and check the version:

$ source pandas-env/bin/activate
$ python
>>> import pandas
>>> pandas.__version__
'0.18.0'

scipy.stats
Some nicer plotting features require scipy.stats. This library is not required, but pandas will complain if the user tries to perform an action
that has this dependency. scipy.stats has many non-Python dependencies and in practice turns out to be a little more involved to install. For Ubuntu, the following packages are required before a pip install scipy will work:

$ sudo apt-get install libatlas-base-dev gfortran

Installation of these dependencies is sufficiently annoying that it has led to “complete scientific Python offerings”, such as Anaconda 6. These installers bundle many libraries, are available for Linux, Mac, and Windows, and have optional support contracts. They are a great way to quickly get an environment up.
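If you are unsure whether your environment already has this optional dependency, a quick check is to try importing it. This is just a minimal sketch; the printed messages are my own and which one you see depends on your environment:

>>> try:
...     import scipy.stats
...     print('scipy.stats is available')
... except ImportError:
...     print('scipy.stats is missing; install scipy to enable the extra plotting features')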
Summary
Unlike "pure" Python modules, pandas is not just a pip install away unless you have an environment configured to build it. The easiest way to get going is to use the Anaconda Python distribution. Having said that, it is certainly possible to install pandas using other methods.
3 - https://www.continuum.io/downloads
4 - http://pip-installer.org/
5 - http://www.virtualenv.org
6 - https://store.continuum.io/cshop/anaconda/

Data Structures
One of the keys to understanding pandas is to understand the data model. At the core of pandas are three data structures:

Different dimensions of pandas data structures

DATA STRUCTURE    DIMENSIONALITY    SPREADSHEET ANALOG
Series            1D                Column
DataFrame         2D                Single Sheet
Panel             3D                Multiple Sheets

The most widely used data structures are the Series and the DataFrame, which deal with array data and tabular data respectively. An analogy with the spreadsheet world illustrates the basic differences between these types. A DataFrame is similar to a sheet with rows and columns, while a Series is similar to a single column of data. A Panel is a group of sheets. Likewise, in pandas a Panel can have many DataFrames, each of which in turn may have multiple Series.
Figure showing the relation between the main data structures in pandas: a data frame can have multiple series, and a panel has multiple data frames.

Diving into these core data structures a little more is useful because a bit of understanding goes a long way towards better use of the library. This book will ignore the Panel, because I have yet to see anyone use it in the real world. On the other hand, we will spend a good portion of time discussing the Series and DataFrame.
Both the Series and DataFrame share features. For example, they both have an index, which we will need to examine to really understand how pandas works. Also, because the DataFrame can be thought of as a collection of columns that are really Series objects, it is imperative that we have a
comprehensive study of the Series first. Additionally, we see this when we iterate over rows: the rows are represented as Series.
Some have compared the data structures to Python lists or dictionaries, and I think this is a stretch that doesn't provide much benefit. Mapping the list and dictionary methods on top of pandas' data structures just leads to confusion.
Summary
The pandas library includes three main data structures and associated functions for manipulating them. This book will focus on the Series and DataFrame. First, we will look at the Series, as the DataFrame can be thought of as a collection of Series.
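To make the "collection of Series" idea concrete, here is a minimal sketch (the column names and values are invented for illustration): pulling a single column out of a DataFrame, or a single row, gives back a Series:

>>> import pandas as pd
>>> df = pd.DataFrame({'artist': ['Paul', 'John'],
...                    'songs': [145, 142]})
>>> type(df['songs'])     # a single column is a Series
<class 'pandas.core.series.Series'>
>>> type(df.loc[0])       # a single row also comes back as a Series
<class 'pandas.core.series.Series'>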
Series
A Series is used to model one-dimensional data, similar to a list in Python. The Series object also has a few more bits of data, including an index and a name. A common idea throughout pandas is the notion of an axis. Because a series is one dimensional, it has a single axis—the index.
Below is a table of counts of songs artists composed:

ARTIST    DATA
0         145
1         142
2         38
3         13

To represent this data in pure Python, you could use a data structure similar to the one that follows. It is a dictionary that has a list of the data points, stored under the 'data' key. In addition to an entry in the dictionary for the actual data, there is an explicit entry for the corresponding index values for the data (in the 'index' key), as well as an entry for the name of the data (in the 'name' key):

>>> ser = {
...     'index':[0, 1, 2, 3],
...     'data':[145, 142, 38, 13],
...     'name':'songs'
... }

The get function defined below can pull items out of this data structure based on the index:

>>> def get(ser, idx):
...     value_idx = ser['index'].index(idx)
...     return ser['data'][value_idx]

>>> get(ser, 1)
142

NOTE
The code samples in this book are generally shown as if they were typed directly into an interpreter. Lines starting with >>> and ... are interpreter markers for the input prompt and continuation prompt respectively. Lines that are not prefixed by one of those sequences are the output from the interpreter after running the code. The Python interpreter automatically prints the return value of the last invocation (even if the print statement is missing). To use the code samples found in this book, leave the interpreter markers out.

The index abstraction
This double abstraction of the index seems unnecessary at first glance—a list already has integer indexes. But there is a trick up pandas' sleeve. By allowing non-integer values, the data structure actually supports other index types such as strings and dates, as well as arbitrarily ordered indices or even duplicate index values.
Below is an example that has string values for the index:

>>> songs = {
...     'index':['Paul', 'John', 'George', 'Ringo'],
...     'data':[145, 142, 38, 13],
...     'name':'counts'
... }

>>> get(songs, 'John')
142

The index is a core feature of pandas’ data structures given the library’s history in the analysis of financial and time series data. Many of the operations performed on a Series operate directly on the index or by index lookup.
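The same pure Python model also accommodates dates if we use date strings as the index. This is just an illustrative sketch (the dates and counts are invented) reusing the get function defined above:

>>> dated = {
...     'index':['2016-01-01', '2016-01-02'],
...     'data':[145, 142],
...     'name':'counts'
... }
>>> get(dated, '2016-01-02')
142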
The pandas Series
With that background in mind, let’s look at how to create a Series in pandas. It is easy to create a Series object from a list:

>>> import pandas as pd
>>> songs2 = pd.Series([145, 142, 38, 13],
...     name='counts')

>>> songs2
0    145
1    142
2     38
3     13
Name: counts, dtype: int64

When the interpreter prints our series, pandas makes a best effort to format it for the current terminal size. The leftmost column is the index column, which contains entries for the index. The generic name for an index is an axis, and the values of the index—0, 1, 2, 3—are called axis labels. The two-dimensional structure in pandas—a DataFrame—has two axes, one for the rows and another for the columns.

Figure showing the parts of a Series.

The rightmost column in the output contains the values of the series. In this case, they are integers (the console representation says dtype: int64, dtype meaning data type, and int64 meaning 64-bit integer), but in
general, values of a Series can hold strings, floats, booleans, or arbitrary Python objects. To get the best speed (such as vectorized operations), the values should be of the same type, though this is not required.
It is easy to inspect the index of a series (or data frame), as it is an attribute of the object:

>>> songs2.index
RangeIndex(start=0, stop=4, step=1)

The default values for an index are monotonically increasing integers. songs2 has an integer-based index.

NOTE
The index can be string-based as well, in which case pandas indicates that the datatype for the index is object (not string):

>>> songs3 = pd.Series([145, 142, 38, 13],
...     name='counts',
...     index=['Paul', 'John', 'George', 'Ringo'])

Note that the dtype that we see when we print a Series is the type of the values, not of the index:

>>> songs3
Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64

When we inspect the index attribute, we see that the dtype is object:

>>> songs3.index
Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')

The actual data for a series does not have to be numeric or homogeneous. We can insert Python objects into a series:

>>> class Foo:
...     pass
>>> ringo = pd.Series(
...     ['Richard', 'Starkey', 13, Foo()],
...     name='ringo')

>>> ringo
0                             Richard
1                             Starkey
2                                  13
3    <__main__.Foo instance at 0x...>
Name: ringo, dtype: object

In the above case, the dtype (datatype) of the Series is object (meaning a Python object). This can be good or bad.
The object data type is used for strings. But it is also used for values of heterogeneous types. If you have numeric data, you wouldn't want it to be stored as a Python object, but rather as an int64 or float64, which allow you to do vectorized numeric operations.
If you have time data and it says it has the object type, you probably have strings for the dates. This is bad, as you don't get the date operations that you would get if the type were datetime64[ns]. Strings, on the other hand, are stored in pandas as object. Don't worry; we will see how to convert types later in the book.

The NaN value
A value that may be familiar to NumPy users, but not Python users in general, is NaN. When pandas determines that a series holds numeric values but cannot find a number to represent an entry, it will use NaN. This value stands for Not A Number and is usually ignored in arithmetic operations (similar to NULL in SQL).
Here is a series that has NaN in it:

>>> nan_ser = pd.Series([2, None],
...     index=['Ono', 'Clapton'])

>>> nan_ser
Ono        2.0
Clapton    NaN
dtype: float64

NOTE
One thing to note is that the type of this series is float64, not int64! This is because the only numeric type that supports NaN is the float type. When pandas sees numeric data (2) as well as the None, it coerces the 2 to a float value.
Below is an example of how pandas ignores NaN. The .count method, which counts the number of values in a series, disregards NaN. In this case, it indicates that the count of items in the Series is one: the value 2 at index location Ono, ignoring the NaN value at index location Clapton:

>>> nan_ser.count()
1

NOTE
If you load data from a CSV file, an empty value for an otherwise numeric column will become NaN. Later, methods such as .fillna and .dropna will explain how to deal with NaN.
None, NaN, nan, and null are synonyms in this book when referring to empty or missing data found in a pandas series or data frame.
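As a small preview of that later material, here is a minimal sketch of the two methods using the nan_ser series from above (the fill value of 0 is an arbitrary choice for illustration):

>>> nan_ser.fillna(0)    # replace missing values with 0
Ono        2.0
Clapton    0.0
dtype: float64

>>> nan_ser.dropna()     # drop entries that are missing
Ono    2.0
dtype: float64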
Similar to NumPy
The Series object behaves similarly to a NumPy array. As shown below, both types respond to index operations:

>>> import numpy as np
>>> numpy_ser = np.array([145, 142, 38, 13])
>>> songs3[1]
142
>>> numpy_ser[1]
142

They both have methods in common:

>>> songs3.mean()
84.5
>>> numpy_ser.mean()
84.5

They also both have a notion of a boolean array. This is a boolean expression that is used as a mask to filter out items. Normal Python lists do not support such fancy index operations:

>>> mask = songs3 > songs3.median()  # boolean array

>>> mask
Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool

Once we have a mask, we can use it to filter out items of the sequence by performing an index operation. If the mask has a True value for a given index, the value is kept. Otherwise, the value is dropped. The mask above represents the locations that have a value greater than the median value of the series.

>>> songs3[mask]
Paul    145
John    142
Name: counts, dtype: int64

NumPy also has filtering by boolean arrays, but lacks the .median method on an array. Instead, NumPy provides a median function in the NumPy namespace:

>>> numpy_ser[numpy_ser > np.median(numpy_ser)]
array([145, 142])

NOTE
Both NumPy and pandas have adopted the convention of using import statements in combination with an as statement to rename their imports to two-letter abbreviations:

>>> import pandas as pd
>>> import numpy as np

This removes some typing while still allowing the user to be explicit with their namespaces.
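Beyond filtering, both objects also support vectorized (element-wise) arithmetic, which is where the near C-level speeds mentioned in the introduction come from. Here is a minimal sketch reusing songs3 and numpy_ser from above (doubling the counts is just an arbitrary illustration):

>>> songs3 * 2    # element-wise, no explicit Python loop
Paul      290
John      284
George     76
Ringo      26
Name: counts, dtype: int64

>>> numpy_ser * 2
array([290, 284,  76,  26])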