Overview

What is the purpose of these notes?

  • Provide a few small examples of things you can do with the Pandas package in Python.

Note: the code chunks in this document are all Python chunks, except the first one.

This is a basic guide to working with pandas in Python.

In this file we are going to use pandas to:

  • make/import DataFrames (using dictionaries and indexes)
  • index the DataFrames
  • access specific rows, columns, or entries in a DataFrame
  • drop rows, columns, and null values

Reticulate for Pandas in .Rmd

To use Python in .Rmd files, we need to load the reticulate library:

# note this is the only R code chunk in this document; the rest are Python
library(reticulate)

Once reticulate is loaded, pandas can be imported inside a Python chunk:

import pandas as pd
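
For reference, in the .Rmd source these are two separate chunks, roughly like the following (a minimal sketch; chunk names and options are omitted):

```{r}
library(reticulate)
```

```{python}
import pandas as pd
```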

To create a DataFrame from values and column names, we can do the following:

  1. Create a dictionary based on the column names and values, for example:
data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98] }
  2. Create the DataFrame using pandas and the dictionary:
df = pd.DataFrame(data)
print(df)
##         country    capital    area  population
## 0        Brazil   Brasilia   8.516      200.40
## 1        Russia     Moscow  17.100      143.50
## 2         India  New Delhi   3.286     1252.00
## 3         China    Beijing   9.597     1357.00
## 4  South Africa   Pretoria   1.221       52.98

Importing data

You can also import DataFrames by using the read_csv() method from pandas.

For this example, let's import the bank_marketing_training dataset you used on an earlier homework set.

import pandas as pd

path="https://campuspro-uploads.s3-us-west-2.amazonaws.com/a9d789c2-6b5e-4020-a941-69984947f1ee/d2c0b7ab-df96-4891-b40f-392d348c30dc/bank_marketing_training"

bank_training=pd.read_csv(path)
bank_training.head()
##    age          job  marital  ... euribor3m nr.employed response
## 0   56    housemaid  married  ...     4.857        5191       no
## 1   57     services  married  ...     4.857        5191       no
## 2   41  blue-collar  married  ...     4.857        5191       no
## 3   25     services   single  ...     4.857        5191       no
## 4   29  blue-collar   single  ...     4.857        5191       no
## 
## [5 rows x 21 columns]
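
read_csv() also takes many optional parameters for less tidy files. A minimal sketch of a few common ones (these are illustrative only; "some_file.csv" is a placeholder, and the bank_marketing_training file above does not need any of them):

# hypothetical examples of common read_csv options
df_semicolon = pd.read_csv("some_file.csv", sep=";")          # non-comma delimiter
df_no_header = pd.read_csv("some_file.csv", header=None)      # file without a header row
df_indexed   = pd.read_csv("some_file.csv", index_col=0)      # use the first column as the index
df_missing   = pd.read_csv("some_file.csv", na_values=["?"])  # treat "?" as a missing value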

Basic cleanup and data access

Now, to adjust the index, let's do it on the first DataFrame (df) that we defined.

#setting index for df
df.index = ["BR", "RU", "IN", "CH", "SA"]

# Print out df with new index values
print(df)
##          country    capital    area  population
## BR        Brazil   Brasilia   8.516      200.40
## RU        Russia     Moscow  17.100      143.50
## IN         India  New Delhi   3.286     1252.00
## CH         China    Beijing   9.597     1357.00
## SA  South Africa   Pretoria   1.221       52.98
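
Assigning to df.index, as above, is one option. A couple of equivalent alternatives (a sketch using the same data dictionary; df2 and df3 are just illustrative names) are to pass the index at construction time or to promote an existing column with set_index():

# pass the row labels when building the DataFrame
df2 = pd.DataFrame(data, index=["BR", "RU", "IN", "CH", "SA"])

# or use an existing column (here, country) as the index
df3 = df.set_index("country")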

Accessing rows, columns, and entries…

# Print out the first nine rows of the job column; bank_training['job'] returns a pandas Series
print(bank_training['job'][0:9])
## 0      housemaid
## 1       services
## 2    blue-collar
## 3       services
## 4    blue-collar
## 5      housemaid
## 6    blue-collar
## 7     management
## 8     unemployed
## Name: job, dtype: object

# Print out the head of the job column as a pandas DataFrame
print(bank_training[['job']].head())
##            job
## 0    housemaid
## 1     services
## 2  blue-collar
## 3     services
## 4  blue-collar

# Print out the head of the DataFrame with the job and marital columns
print(bank_training[['job', 'marital']].head())
##            job  marital
## 0    housemaid  married
## 1     services  married
## 2  blue-collar  married
## 3     services   single
## 4  blue-collar   single

Square brackets can also be used to access observations (rows) from a DataFrame. For example:


# Print out the first 4 observations
print(bank_training[0:4])
##    age          job  marital  ... euribor3m nr.employed response
## 0   56    housemaid  married  ...     4.857        5191       no
## 1   57     services  married  ...     4.857        5191       no
## 2   41  blue-collar  married  ...     4.857        5191       no
## 3   25     services   single  ...     4.857        5191       no
## 
## [4 rows x 21 columns]

# Print out the fifth and sixth observations
print(bank_training[4:6])
##    age          job   marital  ... euribor3m nr.employed response
## 4   29  blue-collar    single  ...     4.857        5191       no
## 5   57    housemaid  divorced  ...     4.857        5191       no
## 
## [2 rows x 21 columns]

Simple data selections

You can also use loc and iloc to perform just about any data selection operation. loc is label-based: you specify rows and columns by their row and column labels. iloc is integer-position-based: you specify rows and columns by their integer positions, as in the previous examples.
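
For example, on the df defined earlier, whose index holds the string labels BR, RU, IN, CH, SA, the label-based and position-based versions of the same selection look like this (a small sketch; output omitted):

# label-based selection with loc
print(df.loc["RU", "capital"])                          # a single entry ("Moscow")
print(df.loc[["BR", "IN"], ["country", "population"]])  # several rows and columns by label

# the same selections by integer position with iloc
print(df.iloc[1, 1])                                    # row 1, column 1 ("Moscow")
print(df.iloc[[0, 2], [0, 3]])                          # rows 0 and 2, columns 0 and 3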


# Print out the observation at integer position 2 (the third row)
print(bank_training.iloc[2])
## age                             41
## job                    blue-collar
## marital                    married
## education                  unknown
## default                    unknown
## housing                         no
## loan                            no
## contact                  telephone
## month                          may
## day_of_week                    mon
## duration                       217
## campaign                         1
## days_since_previous            999
## previous                         0
## previous_outcome       nonexistent
## emp.var.rate                   1.1
## cons.price.idx              93.994
## cons.conf.idx                -36.4
## euribor3m                    4.857
## nr.employed                   5191
## response                        no
## Name: 2, dtype: object

# Print out the first 10 observations of job (label slicing with loc includes the endpoint)
print(bank_training.loc[0:9,'job'])
## 0      housemaid
## 1       services
## 2    blue-collar
## 3       services
## 4    blue-collar
## 5      housemaid
## 6    blue-collar
## 7     management
## 8     unemployed
## 9        retired
## Name: job, dtype: object

# Print out the first eight records of marital and education (column positions 2 and 3)
print(bank_training.iloc[0:8,[2,3]])
##     marital    education
## 0   married     basic.4y
## 1   married  high.school
## 2   married      unknown
## 3    single  high.school
## 4    single  high.school
## 5  divorced     basic.4y
## 6   married     basic.6y
## 7    single     basic.9y

Dropping rows, columns

Dropping rows by index: Note that the inplace parameter determines whether the transformation is applied to the original DataFrame. If it is False, the result is not saved unless you assign it to a variable; if it is True, the DataFrame itself is transformed. The axis parameter selects the direction: axis=0 --> rows, axis=1 --> columns.

ind=[4,5]
bank_training.drop(ind, axis=0, inplace=True)
bank_training.head()
##    age          job  marital  ... euribor3m nr.employed response
## 0   56    housemaid  married  ...     4.857        5191       no
## 1   57     services  married  ...     4.857        5191       no
## 2   41  blue-collar  married  ...     4.857        5191       no
## 3   25     services   single  ...     4.857        5191       no
## 6   35  blue-collar  married  ...     4.857        5191       no
## 
## [5 rows x 21 columns]

# dropping the marital column and saving the new DataFrame as "dfnew"
dfnew=bank_training.drop("marital", axis=1, inplace=False)
dfnew.head()
##    age          job    education  ... euribor3m nr.employed response
## 0   56    housemaid     basic.4y  ...     4.857        5191       no
## 1   57     services  high.school  ...     4.857        5191       no
## 2   41  blue-collar      unknown  ...     4.857        5191       no
## 3   25     services  high.school  ...     4.857        5191       no
## 6   35  blue-collar     basic.6y  ...     4.857        5191       no
## 
## [5 rows x 20 columns]
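
Notice that after dropping rows 4 and 5 the index keeps its gap (it jumps from 3 to 6). If you want consecutive integer labels again, one option is reset_index; a sketch (not applied here, so the outputs below keep the original labels, and the new variable name is just illustrative):

# renumber the rows 0, 1, 2, ... and discard the old index labels
bank_training_renumbered = bank_training.reset_index(drop=True)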

Dropping Null values using pandas

The dropna method in pandas lets you remove rows or columns that contain at least one null value; for more detail, visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

Similar to the last chunk, this lets us choose axis and inplace too. The following code removes all rows that have at least one null value in their entries:

bank_training.dropna(axis=0, inplace=True)
bank_training.head()
##    age          job  marital  ... euribor3m nr.employed response
## 0   56    housemaid  married  ...     4.857        5191       no
## 1   57     services  married  ...     4.857        5191       no
## 2   41  blue-collar  married  ...     4.857        5191       no
## 3   25     services   single  ...     4.857        5191       no
## 6   35  blue-collar  married  ...     4.857        5191       no
## 
## [5 rows x 21 columns]
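
A few other common variants of dropna, as a sketch (not run here; the column and threshold choices are purely illustrative):

# drop columns (axis=1) that contain at least one null value
no_null_cols = bank_training.dropna(axis=1)

# only consider nulls in specific columns when deciding which rows to drop
no_null_jobs = bank_training.dropna(subset=["job", "marital"])

# keep only rows that have at least 20 non-null entries
mostly_complete = bank_training.dropna(thresh=20)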

Other information and references

What else can be done in Python using Pandas? Lots of things!

Basically, any preprocessing step can be applied using Pandas and NumPy; a few of these are sketched right after this list:

  • Applying functions to rows and columns (e.g., with lambda functions)
  • Summary statistics and a description of the dataset
  • Merging datasets, transformations, and normalization
  • Outlier detection and removal
  • Removing duplicates
  • Feature engineering
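
As a taste of a few of the items above, here is a short sketch using standard pandas calls (the lambda, the new age_decade column, and the column choices are just illustrative):

# summary statistics for the numeric columns
print(bank_training.describe())

# apply a function to a column with a lambda: age rounded down to decades
bank_training["age_decade"] = bank_training["age"].apply(lambda a: a // 10)

# remove duplicate rows
bank_training = bank_training.drop_duplicates()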

To learn more about Pandas, visit: https://www.w3schools.com/python/pandas/default.asp

License

This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License


  1. Sonja Petrović, Associate Professor of Applied Mathematics, College of Computing, Illinois Tech. Homepage, Email.↩︎