Loading Querying Joining And Filtering Data Using Pandas
pandas
pandas is a library that provides a lot of data analysis functionality.
Loading data
pandas allows you to interact with databases and Excel files.
Here's how you'd load an Excel file:
import pandas as pd
xls = pd.ExcelFile('file.xlsx')
df = xls.parse('sheet_name') # creates a DataFrame for a specific sheet
Inspecting data
Here are some useful attributes and methods to give you details about your data:
len(df) # number of rows
df.shape # size of rows and columns
df.count() # count of values in each column
df.columns # access column headers
df.dtypes # view data types for each column
df.describe() # built-in summary statistics for numerical values
df.head() # first 5 rows
df.head(100) # first 100 rows
df = df.drop_duplicates() # removes all duplicate rows (ALL cells identical)
Querying data
Here are some useful operations to query data:
# Returns column values for every row
df["column_name"]
# Returns specified column values for every row
atts = ["col1", "col2", "col3"]
df[atts]
Casting data
Sometimes the data type for a DataFrame column is not what we expect. So we may need to cast the values into another data type.
pd.to_numeric method:
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
# The errors argument determines how to handle data that fails casting
astype method:
df['name'] = df['name'].astype(str) # casts to string
Cleaning data
If your data in a column is dirty and you need to clean it up, you can use data type-related methods to mutate the values.
# This example removes trailing whitespace
df['name'] = df['name'].str.strip()
If, on the other hand, you have missing values in a column, you can either drop those rows...
df.dropna(inplace = True)
or you can fill in those missing values with some default.
df.fillna('N/A')
Joining data
Given two DataFrames, you can join data together based on some shared key:
In the example above, we inner join (only keeps data where match is found) df2 into df1 using a shared key. That means the data in df2 gets merged intodf1.
You can also filter a DataFrame using boolean indexing:
# Creates a Series containing True/False values for each row
# to determine which passes the logic check
conditionA = df["age"] >= 30
conditionB = df["city"].isin(["Toronto", "Edmonton"])
# Returns only those rows where the row was True in the Series
df[conditionA]
df[conditionB]
# Setting multiple conditions
cond1 = df["name"] = "Dan"
cond2 = df["city"] = "Toronto"
df[conditionA | conditionB] # "|" represents bitwise OR
df[conditionA & conditionB] # "&" represents bitwise AND
Computing data
A column gives you access to the following methods:
df["income"].sum() # returns total income
df["rating"].mean() # returns average rating
df["city"].unique() # filters out duplicates
df["city"].value_counts() # returns number of rows for each city
df["city"].nunique() # returns number of unique values?
Updating/creating data
You can mutate a DataFrame by adding/updating columns: