Overview of some important Python packages: Pandas
Topic: Basics of coding, R&Python
Overview
What is the purpose of these notes?
- Provide a few small examples of things you can do with the
Pandas
package inPython
.
Note, the code chunks in this document are all python
chunks, except the first one.
This is a basic guide on working with pandas in Python
In this file we are going to use Pandas to
-make/import dataframes (using dictionaries and indexes)
-Index the data frames
-Access specific rows, columns or entries in the dataframe
-Dropp rows,columns, null values
Reticulate
for Pandas
in .Rmd
For using Python in rmd files we need to call the Reticulate library:
For creating a dataframe using values and columns name we can use the following:
- Create a dictionary based on the column names and values, something like this for example:
dict = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
"capital": ["Brasilia", "Moscow", "New Dehli", "Beijing", "Pretoria"],
"area": [8.516, 17.10, 3.286, 9.597, 1.221],
"population": [200.4, 143.5, 1252, 1357, 52.98] }
- Create the dataframe using pandas and the dictionary:
## country capital area population
## 0 Brazil Brasilia 8.516 200.40
## 1 Russia Moscow 17.100 143.50
## 2 India New Dehli 3.286 1252.00
## 3 China Beijing 9.597 1357.00
## 4 South Africa Pretoria 1.221 52.98
Importing data
You can import dataframes as well too by using read_csv() method from pandas.
For this example lets import the bank_training
dataset you’ve used on an early homework set.
import pandas as pd
path="https://campuspro-uploads.s3-us-west-2.amazonaws.com/a9d789c2-6b5e-4020-a941-69984947f1ee/d2c0b7ab-df96-4891-b40f-392d348c30dc/bank_marketing_training"
bank_training=pd.read_csv(path)
bank_training.head()
## age job marital ... euribor3m nr.employed response
## 0 56 housemaid married ... 4.857 5191 no
## 1 57 services married ... 4.857 5191 no
## 2 41 blue-collar married ... 4.857 5191 no
## 3 25 services single ... 4.857 5191 no
## 4 29 blue-collar single ... 4.857 5191 no
##
## [5 rows x 21 columns]
Basic cleanup and data access
Now for adjusting the index, lets do it on the first dataframe (df) that we defined.
#setting index for df
df.index = ["BR", "RU", "IN", "CH", "SA"]
# Print out df with new index values
print(df)
## country capital area population
## BR Brazil Brasilia 8.516 200.40
## RU Russia Moscow 17.100 143.50
## IN India New Dehli 3.286 1252.00
## CH China Beijing 9.597 1357.00
## SA South Africa Pretoria 1.221 52.98
Accessing rows, columns, and entries…
# Print out first nine rows of job column. the bank_training['job'] will make a Pandas Series
print(bank_training['job'][0:9])
# Print out the head of the job column as Pandas DataFrame
## 0 housemaid
## 1 services
## 2 blue-collar
## 3 services
## 4 blue-collar
## 5 housemaid
## 6 blue-collar
## 7 management
## 8 unemployed
## Name: job, dtype: object
## job
## 0 housemaid
## 1 services
## 2 blue-collar
## 3 services
## 4 blue-collar
## job marital
## 0 housemaid married
## 1 services married
## 2 blue-collar married
## 3 services single
## 4 blue-collar single
Square brackets can also be used to access observations (rows) from a DataFrame. For example:
## age job marital ... euribor3m nr.employed response
## 0 56 housemaid married ... 4.857 5191 no
## 1 57 services married ... 4.857 5191 no
## 2 41 blue-collar married ... 4.857 5191 no
## 3 25 services single ... 4.857 5191 no
##
## [4 rows x 21 columns]
## age job marital ... euribor3m nr.employed response
## 4 29 blue-collar single ... 4.857 5191 no
## 5 57 housemaid divorced ... 4.857 5191 no
##
## [2 rows x 21 columns]
Simple data selections
You can also use loc and iloc to perform just about any data selection operation. loc is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc is integer index based, so you have to specify rows and columns by their integer index like you did in the previous exercise.
# Print out observation for the second row
print(bank_training.iloc[2])
# Print out first 10 observations for job
## age 41
## job blue-collar
## marital married
## education unknown
## default unknown
## housing no
## loan no
## contact telephone
## month may
## day_of_week mon
## duration 217
## campaign 1
## days_since_previous 999
## previous 0
## previous_outcome nonexistent
## emp.var.rate 1.1
## cons.price.idx 93.994
## cons.conf.idx -36.4
## euribor3m 4.857
## nr.employed 5191
## response no
## Name: 2, dtype: object
print(bank_training.loc[0:9,'job'])
#Print out first nine records of marital and job in the bank_training dataset
## 0 housemaid
## 1 services
## 2 blue-collar
## 3 services
## 4 blue-collar
## 5 housemaid
## 6 blue-collar
## 7 management
## 8 unemployed
## 9 retired
## Name: job, dtype: object
## marital education
## 0 married basic.4y
## 1 married high.school
## 2 married unknown
## 3 single high.school
## 4 single high.school
## 5 divorced basic.4y
## 6 married basic.6y
## 7 single basic.9y
Dropping rows, columns
Dropping rows with index: Note: the inplace parameter will determine to apply the transformation to the original dataset or no, if it’s false it won’t save the output without assigning it to a value but if it’s true the dataframe will be transfomed: axis=0 --> rows ,, axis=1 --> columns
.
ind=[4,5]
bank_training.drop(ind, axis=0,inplace=True)
bank_training.head()
#dropping marital column and save the new dataframe as "dfnew"
## age job marital ... euribor3m nr.employed response
## 0 56 housemaid married ... 4.857 5191 no
## 1 57 services married ... 4.857 5191 no
## 2 41 blue-collar married ... 4.857 5191 no
## 3 25 services single ... 4.857 5191 no
## 6 35 blue-collar married ... 4.857 5191 no
##
## [5 rows x 21 columns]
## age job education ... euribor3m nr.employed response
## 0 56 housemaid basic.4y ... 4.857 5191 no
## 1 57 services high.school ... 4.857 5191 no
## 2 41 blue-collar unknown ... 4.857 5191 no
## 3 25 services high.school ... 4.857 5191 no
## 6 35 blue-collar basic.6y ... 4.857 5191 no
##
## [5 rows x 20 columns]
Dropping Null
values using pandas
The “dropna” method in pandas will let you to remove rows or columns having at least one null values, for more detail visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
Similar to the last chunk, this will let us to choose the axis and inplace too. The following code will remove all the rows which have at least one null values in their entries:
## age job marital ... euribor3m nr.employed response
## 0 56 housemaid married ... 4.857 5191 no
## 1 57 services married ... 4.857 5191 no
## 2 41 blue-collar married ... 4.857 5191 no
## 3 25 services single ... 4.857 5191 no
## 6 35 blue-collar married ... 4.857 5191 no
##
## [5 rows x 21 columns]
Other information and references
What else can be done in Python
using Pandas
? Lots of things!!
Basically, any preprocessing method can be applied using Pandas
and Numpy
:
- Manipulating and applying functions in each row and columns (lambda function)
- Summary statistics and description of the dataset
- Merging datasets, transformations and normalization
- Outlier detection, removing
- Removing duplicates
- Feature engineering
- …
To learn more about
Pandas
, visit: https://www.w3schools.com/python/pandas/default.asp
License
This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.