Loading Querying Joining And Filtering Data Using Pandas

pandas

pandas is a library that provides a lot of data analysis functionality.

Loading data

pandas allows you to interact with databases and Excel files.

Here's how you'd load an Excel file:

import pandas as pd
xls = pd.ExcelFile('file.xlsx')
df = xls.parse('sheet_name') # creates a DataFrame for a specific sheet

Inspecting data

Here are some useful attributes and methods to give you details about your data:

len(df) # number of rows
df.shape # size of rows and columns
df.count() # count of values in each column
df.columns # access column headers
df.dtypes # view data types for each column

df.describe() # built-in summary statistics for numerical values
df.head() # first 5 rows
df.head(100) # first 100 rows

df = df.drop_duplicates() # removes all duplicate rows (ALL cells identical)

Querying data

Here are some useful operations to query data:

Casting data

Sometimes the data type for a DataFrame column is not what we expect. So we may need to cast the values into another data type.

pd.to_numeric method:

astype method:

Cleaning data

If your data in a column is dirty and you need to clean it up, you can use data type-related methods to mutate the values.

If, on the other hand, you have missing values in a column, you can either drop those rows...

or you can fill in those missing values with some default.

Joining data

Given two DataFrames, you can join data together based on some shared key:

In the example above, we inner join (only keeps data where match is found) df2 into df1 using a shared key. That means the data in df2 gets merged into df1.

Filtering data

You can get a slice of a DataFrame's rows:

You can also filter a DataFrame using boolean indexing:

Computing data

A column gives you access to the following methods:

Updating/creating data

You can mutate a DataFrame by adding/updating columns:

Last updated