Summarizing And Visualizing Data

Summarizing Data

Aggregation

Like SQL's GROUP BY, groupby groups rows by the values in a column:

groups = df.groupby(['city']).groups # .groups is a dict mapping each city to the index labels of its rows

groups['Las Vegas'] # returns the index labels of all rows with "Las Vegas" as city

Once you have your groups, you can then perform aggregate computations on the data in each group:

# Pass the names of the aggregate functions to apply. NumPy functions such as
# np.sum also work, but recent pandas versions prefer the string names.
# This assumes the remaining columns are numeric.
df.groupby(['city']).agg(['sum', 'mean'])
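To make the two steps above concrete, here is a minimal self-contained sketch. The sample DataFrame and its income and age columns are made up for illustration; only the city column comes from the text above.

```python
import pandas as pd

# Hypothetical sample data; the income and age columns are assumptions.
df = pd.DataFrame({
    "city": ["Las Vegas", "Las Vegas", "Toronto", "Toronto", "Mississauga"],
    "income": [40000, 60000, 55000, 65000, 50000],
    "age": [30, 40, 35, 45, 50],
})

grouped = df.groupby("city")

# .groups maps each city to the index labels of its rows.
vegas_labels = grouped.groups["Las Vegas"]

# get_group returns the matching rows themselves as a DataFrame.
vegas_rows = grouped.get_group("Las Vegas")

# Aggregate every remaining (numeric) column with both sum and mean.
summary = grouped.agg(["sum", "mean"])
print(summary)
```

The result has one row per city and a two-level column index, so a single value is addressed as summary.loc["Toronto", ("income", "mean")].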

Pivot table

Pivot tables are another useful way to aggregate: they group data along one or more index columns and show a summary for each group.

When creating a pivot table, you are basically creating a new DataFrame from your original DataFrame:

pivot_city = pd.pivot_table(df, index=["city"])

Note: By default, pandas' pivot tables aggregate every column with the mean. You can choose which aggregate functions to use, and for which columns, through the aggfunc parameter.
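As a minimal sketch of per-column aggregation, the example below passes a dict to aggfunc. The sample data and the income and age columns are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the income and age columns are assumptions.
df = pd.DataFrame({
    "city": ["Las Vegas", "Las Vegas", "Toronto", "Toronto", "Mississauga"],
    "income": [40000, 60000, 55000, 65000, 50000],
    "age": [30, 40, 35, 45, 50],
})

# Map each column to the aggregate function that should summarize it.
pivot_city = pd.pivot_table(
    df,
    index=["city"],
    values=["income", "age"],
    aggfunc={"income": "mean", "age": "max"},
)
print(pivot_city)
```

Here income is averaged per city while age is reduced to its maximum. Columns left out of values do not appear in the result.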

Visualizing Data

To get started visualizing data, initialize a visualization tool using the following Jupyter notebook magic function:

%pylab inline

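Loading pylab imports Matplotlib, a very popular data visualization library, together with NumPy into the notebook's namespace. The Matplotlib documentation now discourages pylab; the usual alternative is the %matplotlib inline magic plus explicit imports.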

Histograms

As a recap, a histogram shows the distribution of values in a data set.

Here's the basic API to create histograms using Matplotlib:
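A minimal sketch of a histogram, assuming Matplotlib's explicit pyplot interface rather than the names pylab injects. The ages array is synthetic and stands in for a real column such as a DataFrame's age values.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, made-up values used only for illustration.
rng = np.random.default_rng(seed=0)
ages = rng.normal(loc=40, scale=10, size=500)

fig, ax = plt.subplots()

# hist returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, patches = ax.hist(ages, bins=20)

ax.set_xlabel("Age")
ax.set_ylabel("Number of people")
ax.set_title("Distribution of ages")
# With inline plotting the figure renders at the end of the cell;
# in a plain script, call plt.show() to display it.
```

The bins argument controls how finely the range is divided: fewer bins smooth the shape, more bins expose detail but also noise.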

Scatterplots

As a recap, a scatterplot shows how two variables relate by plotting each observation as a point on an x-y plane.

Here's the basic API to create scatterplots using Matplotlib:
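A minimal sketch of a scatterplot, again assuming the explicit pyplot interface. The ages and incomes arrays are synthetic and only illustrate the call.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, made-up data: income loosely increasing with age.
rng = np.random.default_rng(seed=0)
ages = rng.uniform(20, 65, size=200)
incomes = 20000 + 1000 * ages + rng.normal(0, 5000, size=200)

fig, ax = plt.subplots()

# The first argument is the x value of each point, the second its y value;
# alpha makes overlapping points easier to see.
points = ax.scatter(ages, incomes, alpha=0.6)

ax.set_xlabel("Age")
ax.set_ylabel("Income")
ax.set_title("Income versus age")
# With inline plotting the figure renders at the end of the cell;
# in a plain script, call plt.show() to display it.
```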

Pro tip: Outliers in a scatterplot can make the bulk of the data look squashed. For example, a few citizens of Toronto or Mississauga might have a far higher income than everyone else. To fix this, scale the y-axis logarithmically, so that it increments in powers of 10 (10^1, 10^2, etc.).
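A minimal sketch of the logarithmic y-axis, using Matplotlib's set_yscale. The data is synthetic, with three extreme incomes added by hand to play the role of the outliers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, made-up data with a few extreme incomes added on purpose.
rng = np.random.default_rng(seed=0)
ages = rng.uniform(20, 65, size=200)
incomes = rng.normal(50000, 10000, size=200).clip(min=1000)
incomes[:3] = [5_000_000, 8_000_000, 12_000_000]

fig, ax = plt.subplots()
ax.scatter(ages, incomes, alpha=0.6)

# Switch the y-axis to a base-10 logarithmic scale so the outliers
# no longer compress the rest of the points.
ax.set_yscale("log")

ax.set_xlabel("Age")
ax.set_ylabel("Income (log scale)")
```

A log scale only works for positive values, which is why the synthetic incomes are clipped to a positive minimum here.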
