Summarizing And Visualizing Data
Summarizing Data
Aggregation
groupby is the pandas equivalent of SQL's GROUP BY: it groups rows by the values in a column:
groups = df.groupby('city').groups # the groups attribute of the GroupBy object maps each city to the index labels of its rows
groups['Las Vegas'] # returns the index labels of all rows with "Las Vegas" as city
Once you have your groups, you can then perform aggregate computations on the data in each group:
import numpy as np
# You pass aggregate functions to perform the aggregations (numpy has built-in ones)
df.groupby(['city']).agg([np.sum, np.mean])
Pivot table
Pivot tables are a useful approach to aggregation: they group data along one or more indices in order to show useful summaries along those indices.
When creating a pivot table, you are basically creating a new DataFrame from your original DataFrame:
pivot_city = pd.pivot_table(df, index=["city"])
Note: By default, pandas' pivot tables aggregate via mean for all numeric columns. But you can specify which aggregate functions you want to use, and for which columns, through the aggfunc argument.
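As a minimal sketch of per-column aggregation, the example below passes a dict to aggfunc. The sample DataFrame and its income and age columns are assumptions made up for illustration; only the city column comes from the text above.

```python
import pandas as pd

# Hypothetical sample data; "income" and "age" are illustrative column names.
df = pd.DataFrame({
    "city": ["Las Vegas", "Las Vegas", "Toronto", "Toronto"],
    "income": [40000, 60000, 50000, 70000],
    "age": [30, 40, 25, 35],
})

# Default behaviour: the mean of every numeric column, one row per city.
pivot_mean = pd.pivot_table(df, index=["city"])

# A dict maps each column to the aggregate function to apply to it.
pivot_custom = pd.pivot_table(
    df,
    index=["city"],
    aggfunc={"income": "sum", "age": "max"},
)

# For comparison, the same summary expressed with groupby and agg.
grouped = df.groupby("city").agg({"income": "sum", "age": "max"})

print(pivot_mean)
print(pivot_custom)
```

Both pivot_custom and grouped hold one row per city with the total income and the maximum age, which shows that a pivot table with a single index is essentially a grouped aggregation.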
Visualizing Data
To get started visualizing data, initialize a visualization tool using the following Jupyter notebook magic function:
%pylab inline
Loading Pylab imports Matplotlib, a very popular data visualization library, along with NumPy, into the notebook's namespace.
Histograms
As a recap, a histogram shows the distribution of values in a data set.
Here's the basic API to create histograms using Matplotlib:
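A minimal sketch of a histogram call, assuming randomly generated values in place of a real column (the data, bin count, and labels are illustrative choices). It uses explicit imports, so it does not depend on %pylab having been run.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 1,000 draws from a normal distribution.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=1000)

fig, ax = plt.subplots()
# hist returns the bin counts, the bin edges, and the drawn bar patches.
counts, bin_edges, patches = ax.hist(values, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Distribution of values")
```

In a notebook with inline plotting enabled, the figure renders when the cell finishes; in a plain script, call plt.show() at the end. With a DataFrame, you would pass a column such as df["income"] (a hypothetical name) instead of values.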
Scatterplots
As a recap, a scatterplot shows the spread of data along 2 variables by plotting each value as a point on an x-y plane.
Here's the basic API to create scatterplots using Matplotlib:
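A minimal sketch of a scatterplot call, assuming two made-up variables, age and income, generated at random for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: income loosely increasing with age, plus noise.
rng = np.random.default_rng(seed=0)
age = rng.integers(18, 65, size=200)
income = 20000 + 1000 * age + rng.normal(0, 5000, size=200)

fig, ax = plt.subplots()
# Each (age, income) pair becomes one point; s sets the marker size
# and alpha makes overlapping points easier to see.
points = ax.scatter(age, income, s=10, alpha=0.6)
ax.set_xlabel("age")
ax.set_ylabel("income")
```

The first argument supplies the x-coordinates and the second the y-coordinates, so the two arrays must have the same length.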
Pro tip: When you have outliers in your scatterplot, they can make the bulk of the data look squashed. For example, maybe some citizens of Toronto or Mississauga have a far higher income than everyone else. To fix this, we can scale the y-axis logarithmically, so the y-axis increments in powers of 10 (10^1, 10^2, etc.).
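One way to apply the logarithmic scaling described above is Matplotlib's set_yscale("log"). The sketch below plots the same made-up data twice, once per scale. The data, including the three injected outliers, is an assumption for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: typical incomes plus three extreme outliers.
rng = np.random.default_rng(seed=0)
age = rng.integers(18, 65, size=200)
# A log axis cannot display zero or negative values, so clip to a positive floor.
income = rng.normal(50000, 10000, size=200).clip(min=1000)
income[:3] = [5e6, 8e6, 1e7]

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Linear scale: the outliers stretch the axis and squash everything else.
ax_linear.scatter(age, income, s=10)
ax_linear.set_title("linear y-axis")

# Log scale: the y-axis increments in powers of 10.
ax_log.scatter(age, income, s=10)
ax_log.set_yscale("log")
ax_log.set_title("logarithmic y-axis")
```

Note that a logarithmic axis only suits strictly positive data, which is why the sketch clips the generated incomes.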