pandas cheatsheet
1 Pandas Overview
Will probably make you leave MS Excel, once you get good at it.
But there is a read.excel function (as many other types of read capabilities)
By convention, pandas is imported as 'pd', so you will see the following:
import pandas as pd
Likewise the convention is to import numpy as 'np'
import numpy as np
And finally you will need mapplotlib.pyplot, which by convention is imported as plt
import matplotlib.pyplot as plt
1.1 Reading in .csv or excel files. n
You can tell pandas to read various file types: excel, csv, pdf, numbers etc.. You read those files into what pandas calls a "dataframe"
covid_df = pd.read_excel('covid-stats.xls', 'covidbydate', index_col=None, na_values=['NA']) covid_df.head() covid_df.describe()
Could also pd.readcsv('file.tsf', sep='\t') where we override the default ',' separator with a tab separator \t
2 dataframes
A pandas dataframe consists of three components.
- the index can be just a whole number, but can also be one of the columns as long as the index is unique.
- columns (appear at the top, and name each column)
- rows are the actual data values.
2.1 df.columns
This just selects the columns
2.2 df.values
This is just the rows of data
2.3 df.index
This is just the index.
The last three are each one of the three compoenents that every dataframe has.
Some of these examples assume you are looking at the titanic database, a common public database Note to self, change this to covid examples.
Example: data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df name reports year Cochice Jason 4 2012 Pima Molly 24 2012 Santa Cruz Tina 31 2013 Maricopa Jake 2 2014 Yuma Amy 3 2014
2.4 df.describe
coviddf.describe()
You can also just enter a method, without any arguments, to see what the
method will do. For instance.
pd.read_csv()
with no other arguments will describe the readcsv pandas function.
2.5 df.shape
coviddf.shape Returns a count of the rows, and columns, so 1000,3 is 1000 rows of 3 columns. Note that this is an attribute, not a method, so coviddf.shape() will give an error
2.6 df.info()
Shows more than just the row and column counts.
In [19]: russ2.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 687 entries, 0 to 686 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 zone 687 non-null object 1 date 687 non-null object 2 cases 687 non-null int64 3 todayCases 687 non-null int64 4 deaths 687 non-null int64 5 todayDeaths 687 non-null int64 6 recovered 687 non-null int64 7 active 687 non-null int64 8 cases/1M 687 non-null object 9 deaths/1M 687 non-null float64 10 tests 687 non-null int64 11 tests/1M 687 non-null int64 dtypes: float64(1), int64(8), object(3) memory usage: 64.5+ KB In [20]:
In the above, it is typical that 'object' types are strings when importing external datafiles, but they can be any pythyon object, including another dataframe.
2.7 df.drop
2.7.1 dropping columns
df.drop(['recovered', 'deaths']) Inside the () is a list, so square brackets. This will drop both columns, recovered, and deaths.
Drop a variable (column) Note: axis=1 denotes that we are referring to a column, not a row. As of pandas 0.23, the axis number can be 'index' or 'column' so axis = 0 same as axis='index' & axis = 1 same as axis='columns'
default is axis=0, so we are dropping rows by default.
df.drop(['', ''], axis='columns') pvactive.drop(['world', 'USA'], axis='columns')
df.drop('reports', axis='columns') Name year Cochice Jason 2012 Pima Molly 2012 Santa Cruz Tina 2013 Maricopa Jake 2014 Yuma Amy 2014
2.7.2 dropping rows by name
df.drop(['Cochice', 'Pima']) # axis=index is the default so left off.
name reports year Santa Cruz Tina 31 2013 Maricopa Jake 2 2014 Yuma Amy 3 2014
2.7.3 dropping rows by number
df.drop(df.index[2]) # drops row 3
name reports year Santa Cruz Tina 31 2013 Maricopa Jake 2 2014 Yuma Amy 3 2014
2.7.4 conditionally dropping rows
Conditionally drop a row based on a value Specifically: Create a new dataframe called df that includes all rows where the value of a cell in the name column does not equal “Tina”
df[df.name != 'Tina'] name reports year Cochice Jason 4 2012 Pima Molly 24 2012 Maricopa Jake 2 2014 Yuma Amy 3 2014
2.7.5 dropping rows from the end
df.drop(df.index[-2]) # drops the second last row.
2.7.6 keeping top 3
df[:3]
2.7.7 drop bottom 3
df[:-3]
2.8 df.head
coviddf.head() shows the first 10 entries in the data frame
2.9 df.tail
coviddf.tail()
2.10 df.columns
This just selects the columns
2.11 df.index
This is just the index.
The last three are each one of the three compoenents that every dataframe has.
2.12 df.values
This is just the rows of data. Each row is actually a numpy array. Remember that pandas was developed from numpy, and numpy arrarys.
When you are pasdsing dataframes to an external program that is older and does not understand panda dataframe formats, you can use df.values as that passes numpy arrays, a much older form.
So typically df.values or columns.values
.values is usually only used when handing off data to external programs. Internally, pandas will always understand
2.13 df.dtypes
reminds you of the types for each column. Good to check if you have dirty data.
3 Dataframe subsetting
Subsetting is selecting a subset of the dataframe. You can subset by row or by column.
3.1 selecting columns (a.k.a. subsetting our dataframes)
Can be done in three ways:
3.2 pandas.core.series.Series
To get a dataframe back from a column rather than this series, pandas.core.series.Series, we use double square brakets.
- The first (outer) brackets tell pandas that we will be doing
subsetting,
- The second (inner) brackets are a list, of one or more columns.
So recovered=covid_df[ ['recovered'] ]
will give back a datatype
pandas.dataframe
3.3 selecting rows .loc[] and .iloc[]
df.loc[0] is the first row
Can also pass a list so: df.loc[[0,1] ] is the first two rows.
If you subset a single row, you get back a numpy array, or a pandas series. If you subset multiple rows, you get back a dataframe. Just like for columns.
- loc[rows, columns] (rows and columns are their labels not their index location.)
means "all"
loc is for selecting by row and column.
For instance
ufo.loc[0, :]
selects row 0, all columns.ufo.loc[0, 5]
selects row 0, column 5.ufo.loc[0, [5:7]]
selects row 0, columns 5 to 7 inclusive, i.e. 5,6,& 7ufo.loc[0, :]
selects row 0, all columns.ufo.loc[[0,1,2], 4]
selects rows 0, 1 and 2 and only column 4ufo.loc[[0;2], 4]
selects rows 0, 1 and 2 and only column 4 i.e. same thing.ufo.loc[[0:2], :]
selects rows 0, 1 and 2 and all columnsufo.loc['can', 'bc']
select row 'can', and columns 'bc' - ufo.loc[0:2] is bad form. It is equivalent to ufo.loc[0:2, :]
- ufo.loc[:, 'City']
This returns the series from the City columns only, but ALL rows.
ufo.loc[:, ["City", "Province"] ]
selects all rows, but only two columns, namely 'City' and 'Province'ufo.loc[:, ["City":"Province"] ]
selects all rows, and all columns between 'City' and 'Province'ufo.loc[:, -1 ]
This does NOT work. as there is no columns labelled '-1' - ufo[ufo.City='Toronto'] is the same as ufo.loc[ufo.City=='Toronto', :]
You could also select two or more columns like this: ufo.loc[ufo.City=='Oakland', 'State'] That is also equivalent to ufo[ufo.City=='Oakland'].State but it is considered bad form and may cause problems. Preferred approach is
In [86]: russ2 Out[86]: zone date cases todayCases deaths todayDeaths recovered active cases/1M deaths/1M tests tests/1M 0 A 0 0 0 0 0 0 0 0 0.0 0 0 1 world mar30-13 785777 1062 37815 32 165607 582355 100.82 0.0 0 0 2 USA mar30-13 164253 409 3165 9 5506 155582 496 10.0 0 0 3 Italy mar30-13 101739 0 11591 0 14620 75528 1683 192.0 0 0 4 Spain mar30-13 87956 0 7716 0 16780 63460 1881 165.0 0 0 .. ... ... ... ... ... ... ... ... ... ... ... ... 682 Michigan apr25-11 36641 0 3085 0 0 30284 0 0.0 140130 14073 683 Florida apr25-11 30839 306 1055 9 0 29098 0 0.0 334974 16262 684 Louisiana apr25-11 26140 0 1660 0 0 9553 0 0.0 142056 30460 685 Texas apr25-11 23170 0 601 0 0 14544 0 0.0 242547 8698 686 Washington apr25-11 13176 0 723 0 0 10646 0 0.0 160324 21979 [687 rows x 12 columns] ================================== russ2.loc[russ2['zone'] == 'Canada', ['zone', 'date', 'cases', 'active']] In [85]: russ2.loc[russ2['zone'] == 'Canada', ['zone', 'date', 'cases', 'active']] Out[85]: zone date cases active 13 Canada mar30-13 7474 6268 33 Canada apr15-21 28205 18262 51 Canada apr16-21 28379 18390 68 Canada apr17-09 30106 19182 86 Canada apr17-09 31927 20074 120 Canada apr17-18 31642 20004 154 Canada apr18-20 33383 20706 188 Canada apr19-13 33951 21235 222 Canada apr19-18 34813 21423 256 Canada apr20-11 35056 21626 290 Canada apr20-22 36829 22553 324 Canada apr21-10 36831 22555 358 Canada apr21-23 38422 23400 392 Canada apr22-07 38422 23400 426 Canada apr22-13 38967 23447 460 Canada apr23-10 40190 24230 494 Canada apr24-09 42110 25202 528 Canada apr24-10 42110 25202 562 Canada apr24-20 43888 26117 596 Canada apr25-11 43888 26117 630 Canada apr25-11 43888 26117 664 Canada apr25-11 43888 26117
** NOTE loc[0] is actually comparing the index to the string '0' NOT an index. That is why loc[-1] for the last row will NOT work, as a key '-1' does not exist.
That is why we have iloc, because that IS in fact the index position, NOT an index label numeric indexing and the iloc selector (subsetting by row) coviddf.iloc[:,6] # for the seventh row coviddv.iloc[-1] is the last row.
Result is a panda series. Panda series are 1-dimensional arrays. Panda series can be summed, averaged, counted, medianed, and have missing values replaced!
Assume x is the series, then, This is done with the x.sum(), x.mean(), x.count(), x.median() is the syntax.
Here is a real example: coviddf.['recovered'].mean or coviddf.['cases'].sum
3.4 combining subsetting
The sytanx for subsetting rows and subsetting columsn can be combined. subst = df.loc[, ] to the left of the column is how to specifiy rows to the right of the comma, is how to specify columns.
So subset = df.loc[:, ['date', 'zone']]
so before the comma we have a : which means everything after the comma we have a list so we subset columns with heading "date" and "zone"
Note: again if you use iloc in place of loc, you will get an error, because 'date' and 'zone' are NOT valid index integers.
3.5 combining with boolean conditions.
df.loc[df['country'] == 'Canada']
== is an equality check (i.e. comparing)
So, this says, pick out rows from your dataset that have Canada as the value for the country column.
3.6 iloc[rows, columns] Where now rows and columns are the interger number.
With iloc, it is exclusive of the second number, inclusive of the first number ufo.iloc[:, 0:4] is for all rows, and columsn 0,1,2,&3 (NOT 4)
ufo.iloc[0:3, : is for rows 0, 1, and 2, and all columns.
3.7 shortcuts that are bad form (but good to know if you see it online)
ufo[['City, 'Province' ] ]
don't do this, what is better is:
ufo.loc[:, ['City', 'Province'] ]
i.e. be explicit is the pythonic way.
ufo[0:2]
is also bad form. It is rows, by integer position, but just avoid this.
ufo.iloc[0:2, :]
is the better form.
3.8 ilex
Allows you to mix integer references with label references.
drinks = pd.readcsv('http://bit.ly/drinksbycountry', indexcol='country')
drinks.ix['Albania', 0] row matches index 'Albania', and first column (i.e. column 0)
drinks.ix[1, 'beerservings']
4 appending
You can join two dataframes even if they have different columns. First the looking at the case when the two dataframes have the SAME columns:
zint = pd.DataFrame.fromdict({'A': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}, orient='index', columns=labels) Results in:
date cases todayCases deaths todayDeaths recovered active cases/1M deaths/1M tests tests/1M A 0 0 0 0 0 0 0 0 0 0 0
Then if russ was a dataframe like this:
date cases todayCases deaths todayDeaths recovered active cases/1M deaths/1M tests tests/1M world apr24-10 2752899 29855 192267 1348 760900 1799138 353.21 0 0 0 USA apr24-10 887622 1180 50283 47 89141 748198 2682 152 4785759 14458 Spain apr24-10 219764 6740 22524 367 92355 104885 4700 482 930230 19896 Italy apr24-10 189973 0 25549 0 57576 106848 3142 423 1579909 26131
Then, zint = zint.append(russ)
would result in, zint being this:
date cases todayCases deaths todayDeaths recovered active cases/1M deaths/1M tests tests/1M A 0 0 0 0 0 0 0 0 0 0 0 world apr24-10 2752899 29855 192267 1348 760900 1799138 353.21 0 0 0 USA apr24-10 887622 1180 50283 47 89141 748198 2682 152 4785759 14458 Spain apr24-10 219764 6740 22524 367 92355 104885 4700 482 930230 19896 Italy apr24-10 189973 0 25549 0 57576 106848 3142 423 1579909 26131
4.1 .groupby
titanicdf.groupby('sex').mean() Will group into men and women, and take the mean ages.
Can say titanicdf.groupby(['sex', 'age']) When you group by a field, you specify the field "title". That "title" name comes from __ ??
4.2 conditions
titanicdf[titanicdf['age']<18].groupby(['sex','pclass']).mean Will only consider data where age is < 18, then group them into boys and girls and class, and take the mean survival
5 Running panda commands on the dataframes:
5.1 valuecounts
pd.valuecounts(titanicdf['survived']).plot.bar() This works will if data has a boolean, (yes/no, true/false) field and you want to count who many of each.
6 Pandas as time series
Read the docs for version 25.3 at:version 0.25.3 as well as pydata.org home.
panda.Series( data, index, dtype, copy)
- data takes various forms, like ndarray, list, constans
- index values must be unique and hashable, same length as data. default np.arrange(n) if no index is passwd
- dtype is for data type. If None, data type will be inferred
- copy data, Default is false.
6.1 For example, looking at the stock prices.
ms is the dataframe name, then: ms['Adj. Close'].plot() 3
7 Pandas indexing
Without specifying an index, pandas will generate a list of whole numbers as
the index. You can reset and index and Pandas will add the whole numbers as
a column that becomes the index. df.reset_index()
For example, I had my covid stats dataframe, called 'zint' as this:
In [61]: zint Out[61]: date cases todayCases deaths todayDeaths recovered active cases/1M deaths/1M tests tests/1M A 0 0 0 0 0 0 0 0 0.0 0 0 world apr20-22 2481287 784 170436 39 646260 1663997 318.36 0.0 0 0 USA apr20-22 792759 0 42514 0 72389 677856 2395 128.0 4026360 12164 Spain apr20-22 200210 0 20852 0 80587 98771 4282 446.0 930230 19896 Italy apr20-22 181228 0 24114 0 48877 108237 2997 399.0 1398024 23122 ... ... ... ... ... ... ... ... ... ... ... ... Sweden apr16-21 12540 613 1333 130 381 10826 1242 132.0 74600 7387 Finland apr16-21 3369 132 75 3 300 2994 608 14.0 48828 8813 Estonia apr16-21 1434 34 36 1 133 1265 1081 27.0 36024 27156 Lithuania apr16-21 1128 37 30 0 178 920 414 11.0 48537 17829 Latvia apr16-21 675 9 5 0 57 613 358 3.0 31302 16595 [551 rows x 11 columns]
The command: zint.reset_index()
resulted in zint being this:
In [62]: russ2=zint.reset_index() Out[62]: index date cases todayCases deaths todayDeaths recovered active cases/1M deaths/1M tests tests/1M 0 A 0 0 0 0 0 0 0 0 0.0 0 0 1 world apr20-22 2481287 784 170436 39 646260 1663997 318.36 0.0 0 0 2 USA apr20-22 792759 0 42514 0 72389 677856 2395 128.0 4026360 12164 3 Spain apr20-22 200210 0 20852 0 80587 98771 4282 446.0 930230 19896 4 Italy apr20-22 181228 0 24114 0 48877 108237 2997 399.0 1398024 23122 .. ... ... ... ... ... ... ... ... ... ... ... ... 546 Sweden apr16-21 12540 613 1333 130 381 10826 1242 132.0 74600 7387 547 Finland apr16-21 3369 132 75 3 300 2994 608 14.0 48828 8813 548 Estonia apr16-21 1434 34 36 1 133 1265 1081 27.0 36024 27156 549 Lithuania apr16-21 1128 37 30 0 178 920 414 11.0 48537 17829 550 Latvia apr16-21 675 9 5 0 57 613 358 3.0 31302 16595 [551 rows x 12 columns]
7.1 renaming a column
The simplest way to rename a column is by editing the .column attribute. df.columns = [list that has as many entries as columns]
In [71]: labels_12 = ['zone', 'date', 'cases', 'todayCases', 'deaths', 'todayDeaths', 'recovered', 'active', ...: 'cases/1M', 'deaths/1M', 'tests', 'tests/1M'] In [72]: russ2.columns = labels_12 In [73]: russ2.head() Out[73]: zone date cases todayCases deaths todayDeaths recovered active cases/1M deaths/1M tests tests/1M 0 A 0 0 0 0 0 0 0 0 0.0 0 0 1 world apr20-22 2481287 784 170436 39 646260 1663997 318.36 0.0 0 0 2 USA apr20-22 792759 0 42514 0 72389 677856 2395 128.0 4026360 12164 3 Spain apr20-22 200210 0 20852 0 80587 98771 4282 446.0 930230 19896 4 Italy apr20-22 181228 0 24114 0 48877 108237 2997 399.0 1398024 23122 In [74]:
7.2 multilevel indexing:
From geeksforgeeks: In this example, 2 columns(First name and Gender) are added to the index column and later one level is removed by using resetindex() method.
import pandas as pd
empdatafr = pd.readcsv("employees.csv")
empdatafr.setindex(["First Name", "Gender"], inplace = True, append = True, drop= True)
empdatafr.resetindex(level = 2, inplace = True, collevel = 1)
data.head()
Output: The gender column in the index column was replaced as it’s level was 2.
Before reset:
Salary Bonus Management Dept First Name Gender 0 Douglas male 50000 20000 False Electrical 1 Peter male 51000 20000 False Electrical 2 Susan female 50000 20000 False Electrical 3 Jack male 65000 22000 True Marketting 4 Ramesh male 48000 20000 False Marketting 5 Gertrude female 94000 11000 True Electrical
After reset:
Gender Salary Bonus Management Dept First Name 0 Douglas male 50000 20000 False Electrical 1 Peter male 51000 20000 False Electrical 2 Susan female 50000 20000 False Electrical 3 Jack male 65000 22000 True Marketting 4 Ramesh male 48000 20000 False Marketting 5 Gertrude female 94000 11000 True Electrical
8 Panda reading in Dictionaries
A python dict can be passed as input, and if no index is specified, then the dictionary keys are taken in a sorted order to construct the pandas index. If index is passed, the values in data corresponding to the labels in the index will be pulled out.
9 plotting
9.1 .plot.bar()
For bar plot coviddf.plot.bar() coviddf.drop(['world','USA']).plot(kind='bar', y='deathsPerOneMillion', color='blue')
coviddf.drop(['world']).plot(kind='bar', y='deaths/1M', color='blue')
9.2 .plot.mean()
coviddf.plot.mean()
Old way of doing it as to import pylab, or from pylab import plot then running pylab.show() (after some plot object has been created with a message like: "<matplotlib.axes.subplots.AxesSubplot at 0x124e32610>"
pylab.show()
This would open up a new window showing the plot graphically
But, since pylab has been depracated, the proper way is to use matplotlib directly as so:
import matplotlib.pyplot as plt plt.plot([1,2,3]) plt.show()
So basically plt.show()
in place of pylab.show()
10 .plot
10.1 Simple plotting
- df.plot(kind='scatter',x='column2',y='column3') plt.show()
- df.plot(kind='scatter',x='numchildren',y='numpets',color='red') plt.show()
- df.plot(kind='bar', y='deaths', color='red')
- coviddf.drop(['world','USA']).plot(kind='bar', y='deathsPerOneMillion', color='blue')
- russ.drop(['world']).plot(kind='bar', y='deaths/1M', color='blue')
- russ.drop(['world']).plot(kind='line',
11 df.pivot
if your data frame is:
0 c1 c2 c3 1 c1.1 c2.1 c3.1 2 c1.2 c2.2 c3.2 3 c1.3 c2.3 c3.3 4 c1.4 c2.4 c3.4 5 c1.5 c2.5 c3.5 6 c1.6 c2.6 c3.6 7 c1.7 c2.7 c3.7 8 c1.8 c2.8 c3.8 9 c1.9 c2.9 c3.9
11.1 Examples from covid
df2 = toplot.pivot_table(values='cases', index='date', columns='zone') In [84]: df2.drop('world', axis=1) Out[84]: zone Austria Belgium Brazil California Canada China Estonia Finland Florida France ... New York Pennsylvania Portugal Russia Spain Sweden Switzerland Texas USA Washington date ... apr15-21 NaN 33573.0 28320.0 NaN 28205.0 82295.0 1400.0 3237.0 NaN 147863.0 ... NaN NaN NaN 24490.0 177633.0 11927.0 NaN NaN 635843.0 NaN apr16-21 NaN 34809.0 29015.0 NaN 28379.0 82341.0 1434.0 3369.0 NaN 147863.0 ... NaN NaN NaN 27938.0 182816.0 12540.0 NaN NaN 644806.0 NaN apr17-09 14595.0 36138.0 32556.0 29175.0 31016.5 82705.5 1459.0 3489.0 24753.0 156498.0 ... 233951.0 29921.0 19022.0 32008.0 187893.5 12878.0 27078.0 17760.0 694115.5 11445.0 apr17-18 14595.0 36138.0 33682.0 28899.0 31642.0 82692.0 1459.0 3489.0 24753.0 147969.0 ... 226593.0 29441.0 19022.0 32008.0 188167.0 13216.0 27078.0 17371.0 700234.0 11152.0 apr18-20 14671.0 37183.0 36722.0 30718.0 33383.0 82719.0 1512.0 3681.0 25492.0 151793.0 ... 241041.0 31731.0 19685.0 36793.0 194416.0 13822.0 27404.0 18679.0 738792.0 11802.0 apr19-13 14749.0 38496.0 36925.0 30831.0 33951.0 82735.0 1528.0 3783.0 25996.0 152578.0 ... 247215.0 32734.0 20206.0 42853.0 195944.0 14385.0 27740.0 18923.0 754263.0 11802.0 apr19-18 14749.0 38496.0 38654.0 30965.0 34813.0 82735.0 1528.0 3783.0 26314.0 152894.0 ... 247215.0 32734.0 20206.0 42853.0 198674.0 14385.0 27740.0 18923.0 762896.0 11802.0 apr20-11 14795.0
In [48]: current Out[48]: zone date active 1 world mar30-13 582355 2 USA mar30-13 155582 3 Italy mar30-13 75528 4 Spain mar30-13 63460 5 China mar30-13 2161 .. ... ... ... 682 Michigan apr25-11 30284 683 Florida apr25-11 29098 684 Louisiana apr25-11 9553 685 Texas apr25-11 14544 686 Washington apr25-11 10646 [686 rows x 3 columns] In [49]: # active-pivot = current.pivot_table(values='', index='', columns='')