Week 2 Lab: Intro to Vega-Lite Data Transformations, Working with Real Datasets

CS-GY 6313 - Information Visualization

Ryan Kim

New York University

2025-09-12

Logistics

  • TA Office Hours:
  • Feedback:
    • Difficulty of the assignment?
    • Submission problems?
    • Discord + Brightspace notifications?

Week 2 Lab Overview

  • User Interface: observablehq.com
  • Graphics Library: Vega-Lite
  • Notebook: Week 2 Lab Notebook

Today’s Lab Activities

All about data transformations with movies!

  1. Binning
  2. Aggregation
  3. Filtering
  4. Normalization

Food 4 Thought: “Data” vs. “Information”

A Question:

What’s the difference between data and information?

Data vs. Information

Data:

  • Facts and statistics collected together for reference or analysis. Can be structured/unstructured, quantitative/qualitative, temporal/static.
  • E.g. Census data, stock prices, sensor readings, survey responses, click streams.
  • Alone, it lacks context and meaning.

Information:

  • Processed and/or organized form of data.
  • E.g. Sales reports, news articles, graphs & figures.
  • Analyzed, structured, and given context through a narrative established by its handlers.

Meaning-Making: Data -> Information

As engineers, designers, and researchers, we must do the work to find meaning within the raw data and interpret them for the benefit of others.

Dataset: Movies

The vega-datasets package, which accompanies Vega-Lite, provides several example datasets. We’ll be using the one that describes movies.

movies = (await require('vega-datasets@1'))['movies.json']()


Features/Columns:

  • Title
  • US_Gross
  • Worldwide_Gross
  • US_DVD_Sales
  • Production_Budget
  • Release_Date
  • MPAA_Rating
  • Running_Time_min
  • Distributor
  • Source
  • Major_Genre
  • Creative_Type
  • Director
  • Rotten_Tomatoes_Rating
  • IMDB_Rating
  • IMDB_Votes

Review: 4 Major Data Transformations

  • Aggregation

    • Purpose: Summarize groups of data
    • Methods: Sum, mean, median, count, min, max
    • Example: Daily sales → Monthly totals
  • Filtering

    • Purpose: Focus on relevant subset
    • Types: Range, categorical, conditional
  • Binning

    • Purpose: Convert continuous to discrete
    • Methods: Equal width, equal frequency, custom
    • Example: Dividing age into groups (<18, 18-65, >65)
  • Normalization

    • Purpose: Enable fair comparison
    • Methods: Min-max, z-score, percentage
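
As a rough illustration of the three filtering types listed above (range, categorical, conditional), here is a minimal sketch in plain JavaScript, independent of Vega-Lite. The sample rows are made up; only the field names (MPAA_Rating, IMDB_Rating) come from the movies dataset. The helper names filterRange and filterCategory are hypothetical.

```javascript
// Hypothetical sample rows: field names match the movies dataset, values are invented.
const sampleMovies = [
  { Title: "A", MPAA_Rating: "PG", IMDB_Rating: 7.1 },
  { Title: "B", MPAA_Rating: "R", IMDB_Rating: 5.4 },
  { Title: "C", MPAA_Rating: "PG", IMDB_Rating: null },
  { Title: "D", MPAA_Rating: "G", IMDB_Rating: 8.3 }
];

// Range filter: keep rows whose numeric field lies in [lo, hi]; missing values are dropped.
function filterRange(rows, field, lo, hi) {
  return rows.filter(
    (r) => typeof r[field] === "number" && r[field] >= lo && r[field] <= hi
  );
}

// Categorical filter: keep rows whose field is one of the allowed values.
function filterCategory(rows, field, allowed) {
  const allowedSet = new Set(allowed);
  return rows.filter((r) => allowedSet.has(r[field]));
}

// Conditional filter: combine several conditions in one predicate.
const familyAndWellRated = sampleMovies.filter(
  (r) =>
    ["G", "PG"].includes(r.MPAA_Rating) &&
    typeof r.IMDB_Rating === "number" &&
    r.IMDB_Rating >= 7
);
```

Note that the range filter and the categorical filter treat the row with a missing rating differently: deciding what to do with missing values is part of the filtering choice.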

Step 1: Binning

  • Grouping continuous data into discrete groups.
  • What are some common examples?
    • Age groups
    • NYC Boroughs
    • Years
    • Grades/Scores
    • Any kind of continuous data can be binned, in theory.
  • We lose some detail in the process, but the coarser view often makes patterns in the data easier to see.
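
To make the idea concrete, here is a minimal sketch of equal-width binning in plain JavaScript. It is not how Vega-Lite implements binning internally; the function name binCounts, the step parameter, and the sample ratings are all assumptions made for illustration.

```javascript
// Assign each value to an equal-width bin and count the values per bin.
// `step` is the bin width, chosen by the caller (an illustrative parameter).
function binCounts(values, step) {
  const counts = new Map();
  for (const v of values) {
    // Skip missing entries, which real datasets often contain.
    if (typeof v !== "number" || Number.isNaN(v)) continue;
    const start = Math.floor(v / step) * step; // lower edge of the bin containing v
    counts.set(start, (counts.get(start) || 0) + 1);
  }
  // Return bins sorted by lower edge; each bin covers the range [start, end).
  return [...counts.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([start, count]) => ({ start, end: start + step, count }));
}

// Made-up ratings on a 0-100 scale, with one missing value.
const sampleRatings = [12, 18, 35, 39, 40, 77, 95, null];
const ratingBins = binCounts(sampleRatings, 20);
// ratingBins holds five bins: [0,20) and [20,40) with two values each,
// and [40,60), [60,80), [80,100) with one value each.
```

The individual ratings are no longer recoverable from ratingBins, which is the loss of detail mentioned above; what we gain is a compact summary of how the values are distributed.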

Rotten Tomatoes vs. IMDb Ratings

To better understand the importance of aggregation, let’s look at raw, unaggregated data of movie ratings across Rotten Tomatoes and IMDb. We’ll use Vega-Lite to produce a scatter plot using the circle mark.

  • [TO-DO]: Generate a scatter plot with the circle mark, with the X-axis representing the Rotten Tomatoes ratings (Rotten_Tomatoes_Rating) and the Y-axis representing the IMDb ratings (IMDB_Rating).

vl.markCircle()
  .data(movies)
  .encode(
    vl.x().fieldQ("Rotten_Tomatoes_Rating"),
    vl.y().fieldQ("IMDB_Rating")
  )
  .render()

Your Turn (~5 min):

In the Lab 2 notebook, complete Step 1, from 1b to 1d. You should eventually end up with the following two histograms:

Rotten Tomatoes Counts per Rating (Binned)

IMDb Counts per Rating (Binned)

Common Problem: Overplotting

vl.markCircle()
  .data(movies)
  .encode(
    vl.x().fieldQ('Rotten_Tomatoes_Rating').bin({maxbins: 20}),
    vl.y().fieldQ('IMDB_Rating').bin({maxbins: 20})
  )
  .render()

When too many points are drawn on top of one another, it becomes hard to see how the data is actually distributed.
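
One way to reason about this: if both axes are binned, many points collapse into the same (x, y) cell, so a count per cell carries more information than the overlapping marks do. The sketch below computes such counts in plain JavaScript. The function name binCounts2D, the step sizes, and the sample points are assumptions for illustration, not part of the lab notebook.

```javascript
// Count how many (x, y) points fall into each 2D bin cell.
// xStep and yStep are the bin widths along each axis (illustrative parameters).
function binCounts2D(points, xStep, yStep) {
  const cells = new Map();
  for (const [x, y] of points) {
    // Skip points with a missing coordinate.
    if (typeof x !== "number" || typeof y !== "number") continue;
    const bx = Math.floor(x / xStep) * xStep; // lower edge of the x bin
    const by = Math.floor(y / yStep) * yStep; // lower edge of the y bin
    const key = `${bx},${by}`;
    const cell = cells.get(key) || { x: bx, y: by, count: 0 };
    cell.count += 1;
    cells.set(key, cell);
  }
  return [...cells.values()];
}

// Made-up (Rotten Tomatoes, IMDb) rating pairs.
const samplePoints = [[81, 7.2], [83, 7.9], [88, 7.5], [15, 3.1], [52, 6.0]];
const ratingCells = binCounts2D(samplePoints, 10, 1);
// Three of the five points land in the cell whose lower edges are (80, 7).
```

A count like this could then be mapped to a visual channel such as size or color, so that dense cells stand out instead of hiding behind a single mark.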

Benefits of Bins

  • Bins aren’t restricted to histograms. They are compatible with other chart types.
  • Bins can alleviate overplotting issues.
  • Bins can emphasize outliers in data distributions.

Step 2: Aggregation

Another common data transformation is aggregation. We use aggregation to summarize groups of data (e.g. mean, median, min/max).

The Vega-Lite documentation includes the full set of available aggregation functions, which may be worth reading through.
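
To see what an aggregation does under the hood, here is a minimal group-by-and-average sketch in plain JavaScript. It is an illustration under stated assumptions, not Vega-Lite's implementation: the helper name meanBy and the sample rows are made up, and only the field names come from the movies dataset.

```javascript
// Group rows by `groupField` and compute the mean of `valueField` within each group.
function meanBy(rows, groupField, valueField) {
  const totals = new Map();
  for (const row of rows) {
    const v = row[valueField];
    // Rows with a missing value do not contribute to the mean.
    if (typeof v !== "number" || Number.isNaN(v)) continue;
    const group = row[groupField];
    const acc = totals.get(group) || { sum: 0, n: 0 };
    acc.sum += v;
    acc.n += 1;
    totals.set(group, acc);
  }
  return [...totals].map(([group, { sum, n }]) => ({ group, mean: sum / n, count: n }));
}

// Made-up rows using the dataset's field names.
const genreRows = [
  { Major_Genre: "Drama", Rotten_Tomatoes_Rating: 80 },
  { Major_Genre: "Drama", Rotten_Tomatoes_Rating: 60 },
  { Major_Genre: "Comedy", Rotten_Tomatoes_Rating: 50 },
  { Major_Genre: "Comedy", Rotten_Tomatoes_Rating: null }
];
const genreMeans = meanBy(genreRows, "Major_Genre", "Rotten_Tomatoes_Rating");
// Drama averages 70 over two rated movies; Comedy averages 50 over one.
```

Keeping the count next to the mean is a deliberate choice here: a mean computed from one movie deserves less trust than a mean computed from hundreds.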

Averages (Mean) Across Genres

vl.markBar()
  .data(movies)
  .encode(
    vl.x().average('Rotten_Tomatoes_Rating'),
    vl.y().fieldN('Major_Genre')
  )
  .render()


There may be some interesting variation, but it’s mentally taxing to work out the overall ranking of genres from an unsorted chart.

Sorting

Rather than sort the genres alphabetically, let’s try to sort them in descending order of rating (i.e. the genres with the higher ratings are at the top, while the genres with the lower ratings are at the bottom).

vl.markBar()
  .data(movies)
  .encode(
    vl.x().average('Rotten_Tomatoes_Rating'),
    vl.y().fieldN('Major_Genre')
      .sort(vl.average('Rotten_Tomatoes_Rating').order('descending'))
  )
  .render()

Food 4 Thought: Averages (Mean) vs. Median

Two Questions:

  • What’s the difference between Averages (Mean) and Median?
  • Why does it matter?
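
One way to explore the two questions above is to compute both statistics on a small list and then swap in a single outlier. The sketch below uses plain JavaScript with made-up ratings; the function names mean and median are local helpers, not Vega-Lite calls.

```javascript
// Arithmetic mean: the sum of the values divided by how many there are.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Median: the middle value of the sorted list
// (or the average of the two middle values when the length is even).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up ratings for one genre, and the same list with one very low score swapped in.
const typicalRatings = [60, 62, 65, 68, 70];
const ratingsWithOutlier = [60, 62, 65, 68, 5];

// For typicalRatings, mean and median are both 65.
// For ratingsWithOutlier, the mean drops to 52 while the median only moves to 62.
```

A single extreme value pulls the mean much further than the median, which is why the two charts can rank genres differently.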

From Mean to Median

vl.markBar()
  .data(movies)
  .encode(
    vl.x().median('Rotten_Tomatoes_Rating'),
    vl.y().fieldN('Major_Genre')
      .sort(vl.median('Rotten_Tomatoes_Rating').order('descending'))
  )
  .render()


Even with this chart, we should still be a bit skeptical. A single number per genre hides how the ratings are spread: two genres with the same median can have very different ranges, skew, or outliers. Observing the variation within each genre is a good way to extend our analysis.

Inter-Quartile Range (IQR)

Let’s add some nuance to our bar chart by considering the “Inter-Quartile Range (IQR)” of each genre.

The IQR is the range that contains the middle half of a set of values. The quartiles divide the sorted values into four parts, each holding 25% of the data. The IQR spans the two middle parts, from the first quartile (Q1) to the third quartile (Q3), and therefore covers the middle 50% of the data.


Img. src: https://en.wikipedia.org/wiki/Interquartile_range
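
As a worked example, the sketch below computes Q1, Q3, and the IQR in plain JavaScript. Several quartile conventions exist; this one assumes linear interpolation between neighboring sorted values, which may differ slightly from what Vega-Lite or other tools report. The function names quantile and iqr and the sample values are assumptions for illustration.

```javascript
// Quantile of an already-sorted array, using linear interpolation
// between the two nearest positions (one of several common conventions).
function quantile(sortedValues, p) {
  const pos = (sortedValues.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sortedValues[lo] + (sortedValues[hi] - sortedValues[lo]) * (pos - lo);
}

// Q1, Q3, and the inter-quartile range (Q3 - Q1) of a list of numbers.
function iqr(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  return { q1, q3, iqr: q3 - q1 };
}

// Made-up ratings: nine evenly spaced values.
const spreadExample = iqr([10, 20, 30, 40, 50, 60, 70, 80, 90]);
// Here Q1 is 30, Q3 is 70, and the IQR is 40.
```

In a chart, Q1 and Q3 can serve as the two ends of a bar per genre, so the bar's length shows how spread out the middle half of that genre's ratings is.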

Your Turn (~5 min):

In the Lab 2 notebook, complete Steps 2d and 2e. You should eventually end up with the following two charts:


IQR of Rotten Tomatoes Ratings, by Genres

IQR of IMDb Ratings, by Genres

Core Concepts of Data Transformations

We Covered in the Lab:

  • Aggregation

    • Purpose: Summarize groups of data
    • Methods: Sum, mean, median, count, min, max
    • Example: Daily sales → Monthly totals
  • Binning

    • Purpose: Convert continuous to discrete
    • Methods: Equal width, equal frequency, custom
    • Example: Dividing age into groups (<18, 18-65, >65)

Covered in Assignment #2:

  • Filtering

    • Purpose: Focus on relevant subset
    • Types: Range, categorical, conditional
  • Normalization

    • Purpose: Enable fair comparison
    • Methods: Min-max, z-score, percentages
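
As a preview of normalization, the sketch below implements the three methods listed above in plain JavaScript. A motivating case from this lab: Rotten Tomatoes ratings run from 0 to 100 while IMDb ratings run from 0 to 10, so the raw numbers are not directly comparable. The function names (minMax, zScores, percentages) and the sample values are assumptions for illustration, and the z-score here uses the population standard deviation, which is one of two common choices.

```javascript
// Min-max: rescale values linearly so the smallest becomes 0 and the largest becomes 1.
function minMax(values) {
  const lo = Math.min(...values);
  const hi = Math.max(...values);
  if (hi === lo) return values.map(() => 0); // all values equal: avoid dividing by zero
  return values.map((v) => (v - lo) / (hi - lo));
}

// Z-score: distance from the mean, measured in standard deviations.
// Assumption: population standard deviation (divide by n, not n - 1).
function zScores(values) {
  const n = values.length;
  const mu = values.reduce((sum, v) => sum + v, 0) / n;
  const variance = values.reduce((sum, v) => sum + (v - mu) ** 2, 0) / n;
  const sd = Math.sqrt(variance);
  return values.map((v) => (sd === 0 ? 0 : (v - mu) / sd));
}

// Percentage: each value's share of the total, on a 0-100 scale.
function percentages(values) {
  const total = values.reduce((sum, v) => sum + v, 0);
  return values.map((v) => (100 * v) / total);
}

// Made-up values.
const scaled = minMax([10, 20, 30, 40]);           // runs from 0 to 1
const standardized = zScores([2, 4, 4, 4, 5, 5, 7, 9]); // mean 5, standard deviation 2
const shares = percentages([1, 1, 2]);             // shares of the total
```

Which method fits depends on the comparison: min-max preserves relative spacing within a fixed range, z-scores express how unusual a value is within its own distribution, and percentages describe parts of a whole.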

End of Lab

  • Assignment #2 will be posted no later than September 13, 2025.
  • Assignment #2 is due on September 18th, 2025 @ 11:59pm!
  • Where do I ask questions?
    • TA Office Hours:
    • Our course Discord!