Week 2 Lab: Intro to Vega-Lite Data Transformations, Working with Real Datasets

CS-GY 6313 - Information Visualization

Ryan Kim

New York University

2025-09-12

Logistics

TA Office Hours:
- Physical Location: Wednesdays @ 2PM-3PM, 8th floor common area @ 370 Jay Street, Brooklyn
- Online Zoom: (https://nyu.zoom.us/j/92815268504)
Feedback:
- Difficulty of the assignment?
- Submission problems?
- Discord + Brightspace notifications?

Week 2 Lab Overview

User Interface	Graphics Library	Notebook
observablehq.com	Vega-Lite	Week 2 Lab Notebook

Today’s Lab Activities

All about data transformations with movies!

Binning
Aggregation
Filtering
~~Normalization~~

Food 4 Thought: “Data” vs. “Information”

A Question:

What’s the difference between data and information?

Data vs. Information

Data:

Facts and statistics collected together for reference or analysis. Can be structured/unstructured, quantitative/qualitative, temporal/static.
E.g. Census data, stock prices, sensor readings, survey responses, click streams.
Alone, it lacks context and meaning.

Information:

Processed and/or organized form of data.
E.g. Sales reports, news articles, graphs & figures.
Analyzed, structured, and given context through a narrative established by its handlers.

Meaning-Making: Data -> Information

As engineers, designers, and researchers, we must do the work to find meaning within the raw data and interpret them for the benefit of others.

Dataset: Movies

Vega-Lite contains several datasets available to us. We’ll be using a dataset that describes movies.

movies = (await require('vega-datasets@1'))['movies.json']()

Features/Columns:

Title
US_Gross
Worldwide_Gross
US_DVD_Sales

Production_Budget
Release_Date
MPAA_Rating
Running_Time_min

Distributor
Source
Major_Genre
Creative_Type

Director
Rotten_Tomatoes_Rating
IMDB_Rating
IMDB_Votes

Review: 4 Major Data Transformations

Aggregation
- Purpose: Summarize groups of data
- Methods: Sum, mean, median, count, min, max
- Example: Daily sales → Monthly totals
Filtering
- Purpose: Focus on relevant subset
- Types: Range, categorical, conditional
Binning
- Purpose: Convert continuous to discrete
- Methods: Equal width, equal frequency, custom
- Example: Dividing age into groups (<18, 18-65, >65)
Normalization
- Purpose: Enable fair comparison
- Methods: Min-max, z-score, percentage

Step 1: Binning

Grouping continuous data into discrete groups.
What are some common examples?
- Age groups
- NYC Boroughs
- Years
- Grades/Scores
- Any kind of continuous data can be binned, in theory.
Binning discards some detail, but in exchange it makes patterns easier to spot and new meaning easier to derive.
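To make the idea concrete, here is a minimal sketch of equal-width binning in plain JavaScript, independent of Vega-Lite. The helper name `binEqualWidth` and the sample ratings are made up for illustration; Vega-Lite's own `bin` chooses its boundaries automatically and may group values differently.

```javascript
// Hypothetical helper: count how many values fall into each equal-width bin.
// Each bin covers [start, end); a value equal to max is placed in the last bin.
function binEqualWidth(values, min, max, step) {
  const binCount = Math.ceil((max - min) / step);
  const bins = Array.from({ length: binCount }, (_, i) => ({
    start: min + i * step,
    end: min + (i + 1) * step,
    count: 0
  }));
  for (const v of values) {
    // Skip missing or out-of-range values.
    if (v == null || Number.isNaN(v) || v < min || v > max) continue;
    const index = Math.min(Math.floor((v - min) / step), binCount - 1);
    bins[index].count += 1;
  }
  return bins;
}

// Made-up ratings on a 0-100 scale, not taken from the movies dataset.
const sampleRatings = [12, 35, 38, 51, 55, 59, 72, 88, 91, 100];
const ratingBins = binEqualWidth(sampleRatings, 0, 100, 20);
console.log(ratingBins.map(b => `${b.start}-${b.end}: ${b.count}`).join('\n'));
```

Ten individual ratings collapse into five counts: the exact values are gone, but the shape of the distribution is easier to read.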

Rotten Tomatoes vs. IMDb Ratings

To better understand the importance of aggregation, let’s look at raw, unaggregated data of movie ratings across Rotten Tomatoes and IMDb. We’ll use Vega-Lite to produce a scatter plot using the circle mark.

[TO-DO]: Generate a scatter plot with the circle mark, with the X-axis representing the Rotten Tomatoes ratings (Rotten_Tomatoes_Rating) and the Y-axis representing the IMDb ratings (IMDB_Rating).

vl.markCircle()
  .data(movies)
  .encode(
    vl.x().fieldQ("Rotten_Tomatoes_Rating"),
    vl.y().fieldQ("IMDB_Rating")
  )
  .render()

Your Turn (~5 min):

In the Lab 2 notebook, complete Step 1, from 1b to 1d. You should eventually end up with the following two histograms:

Rotten Tomatoes Counts per Rating (Binned)

IMDb Counts per Rating (Binned)

Common Problem: Overplotting

vl.markCircle()
  .data(movies)
  .encode(
    vl.x().fieldQ('Rotten_Tomatoes_Rating').bin({maxbins: 20}),
    vl.y().fieldQ('IMDB_Rating').bin({maxbins: 20})
  )
  .render()

Plotting too much data can make it hard to actually understand what’s going on with the data.
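One way to reason about overplotting is to count how many points land in each two-dimensional bin cell; that count could then drive a size or color encoding. The sketch below does this in plain JavaScript. The helper name `countPerCell` and the sample points are assumptions made for illustration, not part of the lab notebook.

```javascript
// Hypothetical helper: count points per (x-bin, y-bin) cell.
function countPerCell(points, xStep, yStep) {
  const counts = new Map(); // "xBin,yBin" -> number of points in that cell
  for (const { x, y } of points) {
    if (x == null || y == null) continue; // skip records with a missing rating
    const xBin = Math.floor(x / xStep) * xStep;
    const yBin = Math.floor(y / yStep) * yStep;
    const key = `${xBin},${yBin}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Made-up (Rotten Tomatoes, IMDb) pairs for illustration only.
const samplePoints = [
  { x: 81, y: 7.2 }, { x: 84, y: 7.4 }, { x: 89, y: 7.9 },
  { x: 23, y: 4.1 }, { x: 27, y: 4.8 }, { x: 55, y: null }
];
const cellCounts = countPerCell(samplePoints, 10, 1);
for (const [cell, count] of cellCounts) {
  console.log(`cell ${cell}: ${count} point(s)`);
}
```

Five overlapping dots become two cells with counts, so dense regions stay visible instead of hiding behind one another.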

Benefits of Bins

Bins aren’t just restricted to histograms. They are compatible with other chart types.
Bins can alleviate overplotting issues.
Bins can emphasize outliers in data distributions.

Step 2: Aggregation

Another common data transformation is aggregation. We use aggregation to summarize groups of data (e.g. with the mean, median, or min/max).

The Vega-Lite documentation includes the full set of available aggregation functions, which may be worth reading through.
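As a rough illustration of what a grouped aggregation does, the plain-JavaScript sketch below groups records by one field and averages another. The helper name `meanBy` and the sample records are hypothetical; only the field names mirror the movies dataset.

```javascript
// Hypothetical helper: group records by keyField and average valueField.
function meanBy(records, keyField, valueField) {
  const groups = new Map(); // key -> { sum, count }
  for (const record of records) {
    const value = record[valueField];
    // Skip missing values so they do not distort the average.
    if (typeof value !== 'number' || Number.isNaN(value)) continue;
    const key = record[keyField];
    const group = groups.get(key) || { sum: 0, count: 0 };
    group.sum += value;
    group.count += 1;
    groups.set(key, group);
  }
  return [...groups].map(([key, { sum, count }]) => ({ key, mean: sum / count, count }));
}

// Made-up records that mimic two columns of the movies dataset.
const sampleMovies = [
  { Major_Genre: 'Drama', Rotten_Tomatoes_Rating: 80 },
  { Major_Genre: 'Drama', Rotten_Tomatoes_Rating: 60 },
  { Major_Genre: 'Comedy', Rotten_Tomatoes_Rating: 40 },
  { Major_Genre: 'Comedy', Rotten_Tomatoes_Rating: null },
  { Major_Genre: 'Horror', Rotten_Tomatoes_Rating: 30 }
];
const genreMeans = meanBy(sampleMovies, 'Major_Genre', 'Rotten_Tomatoes_Rating');
console.log(genreMeans);
```

Keeping the per-group `count` next to the mean is a deliberate choice in this sketch: an average over one movie deserves less trust than an average over hundreds.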

Averages (Mean) Across Genres

vl.markBar()
  .data(movies)
  .encode(
    vl.x().average('Rotten_Tomatoes_Rating'),
    vl.y().fieldN('Major_Genre')
  )
  .render()

There may be some interesting variation, but it’s mentally taxing to work out the overall ranking of the genres.

Sorting

Rather than sort the genres alphabetically, let’s try to sort them in descending order of rating (i.e. the genres with the higher ratings are at the top, while the genres with the lower ratings are at the bottom).

vl.markBar()
  .data(movies)
  .encode(
    vl.x().average('Rotten_Tomatoes_Rating'),
    vl.y().fieldN('Major_Genre')
      .sort(vl.average('Rotten_Tomatoes_Rating').order('descending'))
  )
  .render()
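Conceptually, the sort above ranks the genres by their aggregated value. A minimal plain-JavaScript sketch of that ranking step, using made-up averages rather than values from the dataset:

```javascript
// Made-up per-genre averages for illustration, listed alphabetically.
const genreAverages = [
  { genre: 'Action', average: 45 },
  { genre: 'Comedy', average: 40 },
  { genre: 'Documentary', average: 85 },
  { genre: 'Drama', average: 65 }
];

// Array.prototype.sort mutates in place, so sort a copy to keep the original order.
// Subtracting a from b puts larger averages first (descending order).
const rankedGenres = [...genreAverages].sort((a, b) => b.average - a.average);
console.log(rankedGenres.map(g => g.genre).join(' > '));
```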

Food 4 Thought: Averages (Mean) vs. Median

Two Questions:

What’s the difference between Averages (Mean) and Median?
Why does it matter?

Img source: https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
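One way to explore these questions is to compute both statistics on a small list with and without an outlier. The sketch below uses plain JavaScript; the helper names and the ratings are made up for illustration.

```javascript
// Arithmetic mean: the sum of the values divided by how many there are.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Median: the middle value of the sorted list
// (or the average of the two middle values when the length is even).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Made-up ratings: the second list swaps one typical value for an extreme one.
const typicalRatings = [60, 62, 65, 68, 70];
const ratingsWithOutlier = [60, 62, 65, 68, 5];

console.log(mean(typicalRatings), median(typicalRatings));
console.log(mean(ratingsWithOutlier), median(ratingsWithOutlier));
```

In this example a single extreme value pulls the mean from 65 down to 52, while the median only moves from 65 to 62. That sensitivity to outliers is the main practical difference between the two.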

From Mean to Median

vl.markBar()
  .data(movies)
  .encode(
    vl.x().median('Rotten_Tomatoes_Rating'),
    vl.y().fieldN('Major_Genre')
      .sort(vl.median('Rotten_Tomatoes_Rating').order('descending'))
  )
  .render()

Even with this data, we should still be a bit skeptical. What if, within the genres themselves, there’s some skew caused by outliers and such? Observing the variation within each genre is a good way to extend our analysis.

Inter-Quartile Range (IQR)

Let’s add some nuance to our bar chart by considering the “Inter-Quartile Range (IQR)” of each genre.

The IQR is the range that contains the middle half of a set of values. Each quartile holds 25% of the data values, so the IQR spans the two middle quartiles, i.e. the middle 50% of the data.

Img. src: https://en.wikipedia.org/wiki/Interquartile_range
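The definition above can be sketched in plain JavaScript as follows. This assumes linear interpolation between sorted values, which is only one of several quartile conventions, so other tools may report slightly different numbers. The helper names and sample values are hypothetical.

```javascript
// Hypothetical helper: quantile of an already-sorted array,
// using linear interpolation between neighboring values.
function quantile(sortedValues, p) {
  const position = (sortedValues.length - 1) * p;
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  const fraction = position - lower;
  return sortedValues[lower] + (sortedValues[upper] - sortedValues[lower]) * fraction;
}

// IQR = Q3 - Q1, the width of the middle 50% of the data.
function interquartileRange(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  return { q1, q3, iqr: q3 - q1 };
}

// Made-up ratings for a single genre.
const genreRatings = [90, 10, 50, 30, 70, 20, 80, 40, 60];
const ratingSpread = interquartileRange(genreRatings);
console.log(ratingSpread);
```

For these nine values, Q1 is 30 and Q3 is 70, so the IQR is 40: half of the ratings fall between 30 and 70.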

Your Turn (~5 min):

In the Lab 2 notebook, complete Steps 2d and 2e. You should eventually end up with the following two charts:

IQR of Rotten Tomatoes Ratings, by Genres

IQR of IMDb Ratings, by Genres

Core Concepts of Data Transformations

We Covered in the Lab:

Aggregation
- Purpose: Summarize groups of data
- Methods: Sum, mean, median, count, min, max
- Example: Daily sales → Monthly totals
Binning
- Purpose: Convert continuous to discrete
- Methods: Equal width, equal frequency, custom
- Example: Dividing age into groups (<18, 18-65, >65)

Covered in Assignment #2:

Filtering
- Purpose: Focus on relevant subset
- Types: Range, categorical, conditional
Normalization
- Purpose: Enable fair comparison
- Methods: Min-max, z-score, percentages
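Since filtering and normalization are not demonstrated in this lab, here is a minimal plain-JavaScript sketch of both ideas. The helper names and the sample records are made up for illustration; only the field names mirror the movies dataset, and the z-score here assumes the population standard deviation.

```javascript
// Made-up records (budgets in millions of dollars) for illustration only.
const sampleBudgets = [
  { Title: 'A', MPAA_Rating: 'PG', Production_Budget: 10 },
  { Title: 'B', MPAA_Rating: 'R', Production_Budget: 20 },
  { Title: 'C', MPAA_Rating: 'PG', Production_Budget: 30 },
  { Title: 'D', MPAA_Rating: 'PG', Production_Budget: 50 }
];

// Filtering (categorical): keep only the records that match a condition.
const pgOnly = sampleBudgets.filter(m => m.MPAA_Rating === 'PG');

// Min-max normalization: rescale values to the range [0, 1].
function minMax(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (max === min) return values.map(() => 0); // avoid dividing by zero
  return values.map(v => (v - min) / (max - min));
}

// Z-score: distance from the mean in units of (population) standard deviation.
function zScores(values) {
  const avg = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - avg) ** 2, 0) / values.length;
  const std = Math.sqrt(variance);
  return values.map(v => (std === 0 ? 0 : (v - avg) / std));
}

const pgBudgets = pgOnly.map(m => m.Production_Budget);
const scaledBudgets = minMax(pgBudgets);
const budgetZScores = zScores(pgBudgets);
console.log(scaledBudgets, budgetZScores);
```

Min-max keeps values on a fixed 0-to-1 scale, while z-scores center them around zero; which one gives the fairer comparison depends on the question being asked.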

End of Lab

Assignment #2 will be posted no later than September 13, 2025.
Assignment #2 is due on September 18th, 2025 @ 11:59pm!
Where do I ask questions?
- TA Office Hours:
  - Physical Location: Wednesdays @ 2PM-3PM, 8th floor common area @ 370 Jay Street, Brooklyn
  - Online Zoom: (https://nyu.zoom.us/j/92815268504)
- Our course Discord!
