Loading Querying Joining And Filtering Data Using Pandas

pandas

pandas is a library that provides a lot of data analysis functionality.

Loading data

pandas allows you to interact with databases and Excel files.

Here's how you'd load an Excel file:

import pandas as pd
xls = pd.ExcelFile('file.xlsx')
df = xls.parse('sheet_name') # creates a DataFrame for a specific sheet

Inspecting data

Here are some useful attributes and methods to give you details about your data:

len(df) # number of rows
df.shape # size of rows and columns
df.count() # count of values in each column
df.columns # access column headers
df.dtypes # view data types for each column

df.describe() # built-in summary statistics for numerical values
df.head() # first 5 rows
df.head(100) # first 100 rows

df = df.drop_duplicates() # removes all duplicate rows (ALL cells identical)

Querying data

Here are some useful operations to query data:

# Returns column values for every row
df["column_name"]

# Returns specified column values for every row
atts = ["col1", "col2", "col3"]
df[atts]

Casting data

Sometimes the data type for a DataFrame column is not what we expect. So we may need to cast the values into another data type.

pd.to_numeric method:

df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
# The errors argument determines how to handle data that fails casting

astype method:

df['name'] = df['name'].astype(str) # casts to string

Cleaning data

If your data in a column is dirty and you need to clean it up, you can use data type-related methods to mutate the values.

# This example removes trailing whitespace
df['name'] = df['name'].str.strip()

If, on the other hand, you have missing values in a column, you can either drop those rows...

df.dropna(inplace = True)

or you can fill in those missing values with some default.

df.fillna('N/A')

Joining data

Given two DataFrames, you can join data together based on some shared key:

pd.merge(left=df1, right=df2, how="inner", left_on="key1", right_on="key2")

In the example above, we inner join (only keeps data where match is found) df2 into df1 using a shared key. That means the data in df2 gets merged into df1.

Filtering data

You can get a slice of a DataFrame's rows:

second_100_rows = df[100:200]
last_row = df[-1 : len(df)-11]
last_row_again = df.iloc[-1]

You can also filter a DataFrame using boolean indexing:

# Creates a Series containing True/False values for each row
# to determine which passes the logic check
conditionA = df["age"] >= 30
conditionB = df["city"].isin(["Toronto", "Edmonton"])

# Returns only those rows where the row was True in the Series
df[conditionA]
df[conditionB]

# Setting multiple conditions
cond1 = df["name"] = "Dan"
cond2 = df["city"] = "Toronto"
df[conditionA | conditionB] # "|" represents bitwise OR
df[conditionA & conditionB] # "&" represents bitwise AND

Computing data

A column gives you access to the following methods:

df["income"].sum() # returns total income
df["rating"].mean() # returns average rating
df["city"].unique() # filters out duplicates
df["city"].value_counts() # returns number of rows for each city
df["city"].nunique() # returns number of unique values?

Updating/creating data

You can mutate a DataFrame by adding/updating columns:

df["new_column"] = <some_value>

df["concat_column"] = df["col1"].str.concat(df["col2"], sep=",")

def add_two(x):
  return x + 2
df["amount"] = df["amount"].apply(add_two)

Last updated