Loading Querying Joining And Filtering Data Using Pandas
pandas
pandas is a library that provides a lot of data analysis functionality.
Loading data
pandas allows you to interact with databases and Excel files.
Here's how you'd load an Excel file:
import pandas as pd
xls = pd.ExcelFile('file.xlsx')
df = xls.parse('sheet_name') # creates a DataFrame for a specific sheetInspecting data
Here are some useful attributes and methods to give you details about your data:
len(df) # number of rows
df.shape # size of rows and columns
df.count() # count of values in each column
df.columns # access column headers
df.dtypes # view data types for each column
df.describe() # built-in summary statistics for numerical values
df.head() # first 5 rows
df.head(100) # first 100 rows
df = df.drop_duplicates() # removes all duplicate rows (ALL cells identical)Querying data
Here are some useful operations to query data:
Casting data
Sometimes the data type for a DataFrame column is not what we expect. So we may need to cast the values into another data type.
pd.to_numeric method:
astype method:
Cleaning data
If your data in a column is dirty and you need to clean it up, you can use data type-related methods to mutate the values.
If, on the other hand, you have missing values in a column, you can either drop those rows...
or you can fill in those missing values with some default.
Joining data
Given two DataFrames, you can join data together based on some shared key:
In the example above, we inner join (only keeps data where match is found) df2 into df1 using a shared key. That means the data in df2 gets merged into df1.
Filtering data
You can get a slice of a DataFrame's rows:
You can also filter a DataFrame using boolean indexing:
Computing data
A column gives you access to the following methods:
Updating/creating data
You can mutate a DataFrame by adding/updating columns:
Last updated