Dan Fitz's Notes
Summarizing And Visualizing Data

Summarizing Data

Aggregation

groupby is the pandas equivalent of SQL's GROUP BY. It groups rows by the values in one or more columns:

groups = df.groupby('city').groups # dict mapping each city to the index labels of its rows

groups['Las Vegas'] # index labels of all rows with "Las Vegas" as city

df.loc[groups['Las Vegas']] # the rows themselves, as a DataFrame
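As a minimal sketch of the grouping step, the snippet below runs groupby on a small made-up DataFrame (the rows and the variable names grouped, vegas_labels and vegas_rows are invented for illustration, not taken from the original data set). It shows that .groups holds row index labels, while get_group returns the rows themselves:

```python
import pandas as pd

# Hypothetical sample data, not from the original data set
df = pd.DataFrame({
    "city": ["Las Vegas", "Toronto", "Las Vegas", "Toronto"],
    "stars": [4.0, 3.5, 5.0, 2.5],
})

grouped = df.groupby("city")

# .groups maps each city to the index labels of its rows
vegas_labels = list(grouped.groups["Las Vegas"])

# get_group returns the matching rows as a DataFrame
vegas_rows = grouped.get_group("Las Vegas")

print(vegas_labels)
print(vegas_rows)
```

Grouping by a single column name (a string rather than a one-element list) keeps the group keys as plain values such as "Las Vegas".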

Once you have your groups, you can then perform aggregate computations on the data in each group:

import numpy as np

# Pass the aggregate functions to apply to each group (NumPy has built-in ones;
# recent pandas versions also accept the names "sum" and "mean" as strings)
df.groupby(['city']).agg([np.sum, np.mean])
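To make the shape of the result concrete, here is a small sketch on made-up data (the values and the name summary are assumptions for illustration). It selects the numeric columns first and uses the string names of the aggregates, which pandas accepts in place of the NumPy functions:

```python
import pandas as pd

# Hypothetical sample data, not from the original data set
df = pd.DataFrame({
    "city": ["Las Vegas", "Toronto", "Las Vegas", "Toronto"],
    "review_count": [10, 20, 30, 40],
    "stars": [4.0, 3.0, 5.0, 2.0],
})

# Select the numeric columns so every aggregate is well defined
summary = df.groupby("city")[["review_count", "stars"]].agg(["sum", "mean"])

# The result has one row per city and two-level columns:
# (original column, aggregate name)
print(summary)
print(summary.loc["Las Vegas", ("stars", "mean")])
```

Each cell is addressed by the city plus a (column, aggregate) pair, which is why the last line indexes with a tuple.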

Pivot table

Pivot tables are a useful approach to aggregation: they group data along one or more index columns and show a summary for each group.

When creating a pivot table, you are basically creating a new DataFrame from your original DataFrame:

pivot_city = pd.pivot_table(df, index=["city"])

Note: By default, pandas pivot tables aggregate every column using the mean, but you can specify which aggregate functions to use and for which columns:

pivot_city = pd.pivot_table(
  df,
  index=["city"],
  aggfunc=[np.sum], # aggregates via sum
  values=["review_count", "stars"] # only aggregates these columns
)

pivot_city2 = pd.pivot_table(
  df,
  index=["city"],
  # uses different aggregate methods for different columns
  aggfunc={ "review_count": np.sum, "stars": np.mean }
)
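The per-column form can be tried end to end with a short sketch on made-up data (the values and the name pivot are assumptions for illustration). It uses the string names "sum" and "mean", which pandas accepts in place of np.sum and np.mean:

```python
import pandas as pd

# Hypothetical sample data, not from the original data set
df = pd.DataFrame({
    "city": ["Las Vegas", "Toronto", "Las Vegas", "Toronto"],
    "review_count": [10, 20, 30, 40],
    "stars": [4.0, 3.0, 5.0, 2.0],
})

# One row per city; each column is summarized with its own aggregate
pivot = pd.pivot_table(
    df,
    index="city",
    values=["review_count", "stars"],
    aggfunc={"review_count": "sum", "stars": "mean"},
)

print(pivot)
```

The result is an ordinary DataFrame indexed by city, so a lookup such as pivot.loc["Toronto", "review_count"] works as usual.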

Visualizing Data

To get started visualizing data, initialize a visualization tool using the following Jupyter notebook magic function:

%pylab inline

Loading Pylab makes Matplotlib, a very popular data visualization library, available in the notebook and renders plots inline.

Histograms

As a recap, a histogram shows the distribution of values in a data set.

Here's the basic API to create histograms using Matplotlib:

# Returns series containing ages for relevant cities
ages_toronto = df[df["city"] == "Toronto"]["age"]
ages_sauga = df[df["city"] == "Mississauga"]["age"]

import matplotlib.pyplot as plt

# Show overlapping histogram:
plt.hist(
  ages_toronto,
  alpha = 0.3, # opacity of bars
  color = "blue",
  label = "Toronto",
  bins = "auto" # let NumPy choose the bin edges automatically
)
plt.hist(
  ages_sauga,
  alpha = 0.3, # opacity of bars
  color = "red",
  label = "Mississauga",
  bins = "auto" # let NumPy choose the bin edges automatically
)

plt.xlabel("Ages")
plt.ylabel("Number of Items")
plt.legend(loc = "best") # automatic legend positioning
plt.title("Distribution of Ages in Toronto and Mississauga")

plt.show()

# Show side-by-side histogram:
plt.hist(
  [ages_toronto, ages_sauga],
  color = ["blue", "red"],
  label = ["Toronto", "Mississauga"],
  # everything else the same...
)

plt.show()
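To see what a histogram actually computes, the sketch below counts values per bin with NumPy, which is what plt.hist uses to bin the data before drawing the bars. The ages are made up, and the names counts, edges and auto_edges are chosen for illustration:

```python
import numpy as np

# Hypothetical ages, not from the original data set
ages = np.array([21, 22, 25, 31, 34, 35, 38, 47, 52, 64])

# Five equal-width bins between 20 and 70:
# [20, 30), [30, 40), [40, 50), [50, 60), [60, 70]
counts, edges = np.histogram(ages, bins=5, range=(20, 70))

print(counts)  # number of ages that fall in each bin
print(edges)   # the six boundaries of the five bins

# bins="auto" asks NumPy to choose the edges from the data
auto_edges = np.histogram_bin_edges(ages, bins="auto")
print(auto_edges)
```

Each bar in the plotted histogram corresponds to one entry of counts, drawn between two neighboring entries of edges.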

Scatterplots

As a recap, a scatterplot shows the spread of data along 2 variables by plotting each value as a point on an x-y plane.

Here's the basic API to create scatterplots using Matplotlib:

df_toronto = df[df["city"] == "Toronto"]
df_sauga = df[df["city"] == "Mississauga"]

plt.scatter(
  df_toronto["income"], # x-axis
  df_toronto["age"], # y-axis
  marker = "o", # points shaped as circles
  color = "green",
  alpha = 0.7, # opacity of points
  s = 124, # marker area in points squared
  label = "Toronto" # a plain string, so the legend shows Toronto
)

plt.scatter(
  df_sauga["income"], # x-axis
  df_sauga["age"], # y-axis
  marker = "^", # points shaped as triangles
  color = "blue",
  alpha = 0.7, # opacity of points
  s = 124, # marker area in points squared
  label = "Mississauga" # a plain string, so the legend shows Mississauga
)

plt.xlabel("Income")
plt.ylabel("Age")
plt.legend(loc = "upper left")
plt.title("Scatter for income and age in cities")

plt.show()

Pro tip: Outliers in a scatterplot can make the rest of the data look squashed together. For example, maybe a few citizens of Toronto or Mississauga have a far higher income than everyone else. To fix this, scale the affected axis logarithmically so it increments in powers of 10 (10^1, 10^2, etc.). In the scatterplot above income is on the x-axis, so that is the axis to scale; use set_yscale instead when the skewed variable is on the y-axis.

axes = plt.gca()
axes.set_xscale("log") # call before plt.show()
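As a self-contained sketch of the log-scale tip, the snippet below plots made-up incomes and ages with one extreme income (the data and variable names are assumptions for illustration). Income sits on the x-axis here, so set_xscale is used; set_yscale works the same way for the y-axis. The Agg backend line only lets the sketch run as a script without a display and is not needed in a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import matplotlib.pyplot as plt

# Hypothetical data: the last income is an extreme outlier
incomes = [30_000, 45_000, 52_000, 61_000, 5_000_000]
ages = [25, 34, 41, 29, 58]

fig, axes = plt.subplots()
axes.scatter(incomes, ages)

# Scale the axis that holds the outlier logarithmically
axes.set_xscale("log")
axes.set_xlabel("Income (log scale)")
axes.set_ylabel("Age")

print(axes.get_xscale(), axes.get_yscale())
```

With the log scale, the four ordinary incomes spread out along the axis instead of being squeezed against the left edge by the outlier.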

Last updated 3 years ago