Deep Learning Visualization

CS-GY 9223 - Fall 2025

Claudio Silva

NYU Tandon School of Engineering

2025-10-27

Deep Learning Fundamentals

Agenda


Goal: Grasp foundational DL concepts essential for understanding network visualization techniques.

  1. Deep Learning Terminology and Foundations

  2. Linear Models and Loss Functions

  3. Shallow Neural Networks and Activation Functions

  4. Deep Neural Networks and Composition

  5. Interactive Visualization Tools

Acknowledgments:

Materials adapted from:

Understanding Deep Learning

Understanding Deep Learning by Simon J.D. Prince

Published by MIT Press, 2023

Available free online: https://udlbook.github.io/udlbook

Why this book?

  • Modern treatment (includes transformers, diffusion models)
  • Excellent visual explanations
  • Free and accessible
  • Strong mathematical foundations with intuitive explanations

Deep Learning Terminology

The Supervised Learning Framework:

\[y = f[x, \Phi]\]

| Symbol | Meaning | Example |
|--------|---------|---------|
| \(y\) | Prediction (model output) | House price: $450,000 |
| \(x\) | Input (features) | Square footage: 2000 sq ft, bedrooms: 3 |
| \(\Phi\) | Model parameters (weights, biases) | Millions of numbers learned from data |
| \(f[\cdot]\) | Model function (architecture) | Neural network with multiple layers |

Key Insight: Deep learning learns the parameters \(\Phi\) from training data pairs \(\{x_i, y_i\}\) to minimize prediction errors.

The Learning Process

Training Data:

Pairs of inputs and outputs: \(\{x_i, y_i\}\)

Loss Function:

Quantifies prediction accuracy: \(L[\Phi]\)

  • Lower loss = better fit to training data
  • Guides parameter updates during training

Goal:

Find parameters \(\Phi\) that minimize \(L[\Phi]\)

\[\Phi^* = \arg\min_{\Phi} L[\Phi]\]

Generalization:

Test on separate data not seen during training

The Challenge:

We don’t want to just memorize training data!

We want models that generalize to new, unseen examples.

→ This is why we split data into train/validation/test sets.
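A minimal sketch of what such a split might look like in practice (the data, split ratios, and variable names below are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)            # hypothetical inputs
y = 2.0 * x + 1.0 + rng.normal(0, 1, 200)   # hypothetical noisy targets

# Illustrative ratios: 70% train, 15% validation, 15% test
idx = rng.permutation(len(x))               # shuffle before splitting
n_train, n_val = int(0.7 * len(x)), int(0.15 * len(x))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

x_train, y_train = x[train_idx], y[train_idx]   # used to fit parameters
x_val, y_val = x[val_idx], y[val_idx]           # used to tune and monitor
x_test, y_test = x[test_idx], y[test_idx]       # touched only once, at the end
```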

Linear Models: Building Intuition

1-D Linear Regression Model

The simplest supervised learning model:

\[y = f[x, \Phi] = \Phi_0 + \Phi_1 x\]

  • \(\Phi_0\): Intercept (bias term)
  • \(\Phi_1\): Slope (weight)
  • Only 2 parameters to learn

Linear Regression: Measuring Error

How do we quantify “good fit”?

Loss Function: Sum of squared errors

\[L[\Phi] = \sum_{i=1}^{N} (y_i - f[x_i, \Phi])^2\]

Vertical distance from each data point to the line → squared → summed = total error
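A minimal NumPy sketch of this loss for the 1-D linear model (the toy data points are made up for illustration):

```python
import numpy as np

def linear_model(x, phi0, phi1):
    """1-D linear model: y = phi0 + phi1 * x."""
    return phi0 + phi1 * x

def sse_loss(phi0, phi1, x, y):
    """Sum of squared errors between predictions and targets."""
    residuals = y - linear_model(x, phi0, phi1)
    return np.sum(residuals ** 2)

# Toy data (illustrative): roughly y = 2x + 1 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

print(sse_loss(1.0, 2.0, x, y))   # small loss: line close to the data
print(sse_loss(0.0, 0.0, x, y))   # large loss: flat line at zero
```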

Loss Surface

Visualizing all possible parameter combinations:

  • X-axis: Slope \(\Phi_1\)
  • Y-axis: Intercept \(\Phi_0\)
  • Z-axis (color): Loss \(L[\Phi]\)

Goal: Find the lowest point (dark blue valley)

Key Observations:

  1. Single global minimum - bowl-shaped surface
  2. Smooth - we can use gradients to navigate
  3. Convex - any path downhill leads to optimum

For linear models, optimization is easy! Deep networks have much more complex loss landscapes…
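To build such a loss surface yourself, one can simply evaluate the loss on a grid of intercept/slope values. This sketch (same kind of toy data as above, grid ranges chosen arbitrarily) computes the array that a contour or surface plot would display:

```python
import numpy as np

# Toy 1-D regression data (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Grid over intercept (phi0) and slope (phi1)
phi0_grid = np.linspace(-2, 4, 200)
phi1_grid = np.linspace(-1, 5, 200)
P0, P1 = np.meshgrid(phi0_grid, phi1_grid)

# Loss at every grid point: sum over data points of squared residuals
# (broadcast so residuals has shape (n_data, 200, 200))
residuals = y[:, None, None] - (P0[None] + P1[None] * x[:, None, None])
L = np.sum(residuals ** 2, axis=0)

# The minimum of the grid approximates the best-fit line
i, j = np.unravel_index(np.argmin(L), L.shape)
print("approx. best intercept:", P0[i, j], "slope:", P1[i, j])
# A contour plot of L over (P0, P1), e.g. with matplotlib, shows the convex bowl.
```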

Optimization: Gradient Descent

How do we find the minimum?

Algorithm: Iteratively move downhill

  1. Start at random position
  2. Compute gradient (slope direction)
  3. Take small step opposite to gradient
  4. Repeat until convergence

\[\Phi_{\text{new}} = \Phi_{\text{old}} - \alpha \, \nabla L[\Phi_{\text{old}}]\]

(\(\alpha\) = learning rate)
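A minimal gradient-descent sketch for the 1-D linear model; the toy data, starting point, learning rate, and step count are illustrative, and the gradient expressions come from differentiating the sum-of-squared-errors loss:

```python
import numpy as np

# Toy data (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

phi0, phi1 = 0.0, 0.0   # arbitrary starting point
alpha = 0.01            # learning rate (illustrative choice)

for step in range(2000):
    residuals = y - (phi0 + phi1 * x)
    # Gradients of the sum-of-squared-errors loss w.r.t. each parameter
    grad_phi0 = -2.0 * np.sum(residuals)
    grad_phi1 = -2.0 * np.sum(residuals * x)
    # Step opposite to the gradient
    phi0 -= alpha * grad_phi0
    phi1 -= alpha * grad_phi1

print(phi0, phi1)   # approaches intercept ≈ 1, slope ≈ 2 for this toy data
```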

Shallow Neural Networks

From Linear to Non-linear

Linear models are limited - they can only fit straight lines (hyperplanes in higher dimensions)!

Solution: Add non-linearity through activation functions

Transform linear combinations with a non-linear function → enables learning complex patterns

Shallow Neural Network (1 hidden layer):

\[y = f[x, \Phi] = \Phi_0 + \sum_{i=1}^{3} \Phi_i \cdot a[ \Theta_{i0} + \Theta_{i1} x]\]

| Component | Description | Count |
|-----------|-------------|-------|
| \(\Theta_{ij}\) | First-layer parameters | 6 parameters |
| \(\Phi_i\) | Second-layer parameters | 4 parameters |
| \(a[\cdot]\) | Activation function | Non-linearity! |

Total: 10 parameters (vs 2 for linear model)
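A direct transcription of this formula in NumPy; the parameter values are arbitrary placeholders, chosen only to make the 10-parameter structure concrete:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# First-layer parameters Theta: 3 hidden units x (offset, slope) = 6 parameters (arbitrary values)
theta = np.array([[-0.2,  1.0],
                  [ 0.5, -1.0],
                  [ 0.1,  2.0]])

# Second-layer parameters Phi: offset + 3 weights = 4 parameters (arbitrary values)
phi = np.array([0.1, 1.0, -2.0, 0.5])

def shallow_net(x):
    """y = Phi_0 + sum_i Phi_i * a[Theta_i0 + Theta_i1 * x], with a = ReLU."""
    h = relu(theta[:, 0] + theta[:, 1] * x)   # three hidden activations
    return phi[0] + phi[1:] @ h

# Evaluating over a range of inputs gives a piecewise-linear function with up to 3 "joints"
for x in (-1.0, 0.0, 0.5, 1.0):
    print(x, shallow_net(x))
```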

Activation Functions: ReLU

ReLU (Rectified Linear Unit): The most popular activation function

\[a[z] = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}\]

Why ReLU?

Simple: Easy to compute and differentiate

Effective: Mitigates the vanishing gradient problem (the gradient is 1 for all positive inputs)

Sparse: Many activations are exactly zero

Biological: Loosely inspired by neurons that either fire or don't
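A minimal sketch of ReLU and its derivative (assigning gradient 0 at exactly z = 0 is one common convention):

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative: 1 where z > 0, else 0 (the kink at z = 0 gets 0 by convention here)."""
    return (z > 0).astype(float)

z = np.linspace(-2, 2, 5)    # [-2, -1, 0, 1, 2]
print(relu(z))               # [0. 0. 0. 1. 2.]
print(relu_grad(z))          # [0. 0. 0. 1. 1.]
```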

Building Intuition: Composing ReLUs

How do multiple ReLU activations combine to approximate complex functions?

Each hidden unit:

  1. Computes linear function of input
  2. Applies ReLU → bent line
  3. Gets weighted and summed

Combining multiple units:

  • Different slopes and bends
  • Sum creates complex shapes
  • More units → more flexibility

Neural Network Computation

Step-by-step: How a shallow network processes an input

Process:

  1. Input \(x\) (left) enters the network

  2. Each hidden unit computes: \(h_i = a[\Theta_{i0} + \Theta_{i1} x]\)

  3. Weighted combination: \(y = \Phi_0 + \sum_i \Phi_i h_i\)

  4. Final output \(y\) (right)

Neural Network Diagram

Standard visualization: Network architecture

Components:

  • Input layer: Raw features
  • Hidden layer: Learned representations
  • Output layer: Prediction
  • Connections: Weighted parameters

Terminology:

  • Hidden units/neurons: Computed values in middle
  • Pre-activations: Before ReLU
  • Activations: After ReLU
  • Fully connected: Every unit connects to all units in next layer

Universal Approximation Theorem

Theoretical Foundation: Shallow networks can approximate any continuous function!

Theorem (Cybenko 1989, Hornik 1991):

A shallow neural network with enough hidden units can approximate any continuous function to arbitrary accuracy on a compact domain.

But…

  • May require exponentially many hidden units
  • Doesn’t tell us how to find the parameters
  • Deep networks are often more efficient

Deep Neural Networks

Why Go Deep?

Deep networks compose simple transformations to build complex representations

Shallow network limitations:

  • Requires many hidden units
  • Doesn’t exploit structure
  • Inefficient representation

Deep network advantages:

  • Hierarchical learning
  • Compositional structure
  • Parameter efficiency
  • Feature reuse across layers

Intuition from vision:

Layer 1: Edges, colors

Layer 2: Textures, simple shapes

Layer 3: Object parts

Layer 4: Object categories

Each layer builds on previous representations!

Composing Networks

Building deep networks: Stack multiple hidden layers

Each layer:

\[h^{(k)} = a[W^{(k)} h^{(k-1)} + b^{(k)}]\]

  • \(h^{(k)}\): Activations at layer \(k\)

  • \(W^{(k)}\), \(b^{(k)}\): Parameters for layer \(k\)

  • Composition: \(f = f_K \circ f_{K-1} \circ \ldots \circ f_1\)
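A minimal sketch of this layer-by-layer composition; the layer sizes and random parameter values are illustrative, and the output layer is left linear, which is a common convention for regression:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Illustrative architecture: 2 inputs -> 4 -> 4 -> 1 output (three weight layers)
sizes = [2, 4, 4, 1]
weights = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    """Compose the layers: h^(k) = a[W^(k) h^(k-1) + b^(k)]; no ReLU on the output layer."""
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        pre = W @ h + b                                   # pre-activation
        h = pre if k == len(weights) - 1 else relu(pre)   # activation on hidden layers only
    return h

print(forward(np.array([0.5, -1.0])))   # 1-D output
```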

How Deep Networks Transform Space

Geometric intuition: Each layer performs a non-linear transformation of the representation space

Layer 1:

Stretches, rotates, bends space with ReLU

Layer 2:

Further transforms the already-bent space

Result:

Complex folding of input space

→ Can separate classes that were originally intertwined
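A small sketch of this folding (random, illustrative parameters): push a straight line of 2-D points through two ReLU layers; plotting the intermediate results would show the line being bent and folded.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(0, 1, (2, 2)), rng.normal(0, 1, 2)   # layer 1 (illustrative)
W2, b2 = rng.normal(0, 1, (2, 2)), rng.normal(0, 1, 2)   # layer 2 (illustrative)

def layer(h, W, b):
    """One ReLU layer applied to a batch of 2-D points."""
    return np.maximum(0.0, h @ W.T + b)

# A straight line of points in input space
t = np.linspace(-2, 2, 100)
line = np.stack([t, 0.5 * t + 0.2], axis=1)   # shape (100, 2)

h1 = layer(line, W1, b1)   # after layer 1: the line is bent at ReLU boundaries
h2 = layer(h1, W2, b2)     # after layer 2: further folded
# Scatter-plotting line, h1, and h2 (e.g. with matplotlib) shows the progressive folding.
print(h2.shape)            # (100, 2)
```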

Two Hidden Layers

Adding depth: 2 hidden layers → more complex functions

Key difference from shallow networks:

  • First layer creates intermediate representations
  • Second layer operates on those representations, not raw inputs
  • Can capture compositional structure

Example: First layer detects edges, second layer combines edges into shapes

K Hidden Layers: Deep Architecture

Modern deep learning: Many layers stacked together

Deep Network Characteristics:

  • Input layer: Raw features (e.g., pixel values)
  • Hidden layer 1: Low-level features (edges, textures)
  • Hidden layer 2: Mid-level features (parts, patterns)
  • Hidden layer K: High-level features (concepts, objects)
  • Output layer: Final prediction

Modern architectures: ResNet (152 layers), GPT-3 (96 layers), Vision Transformers (24+ layers)

Interactive Visualization Tools

TensorFlow Playground

Interactive tool for understanding neural networks

https://playground.tensorflow.org

CNN Explainer

Interactive visualization for understanding Convolutional Neural Networks

https://poloclub.github.io/cnn-explainer/

Further Exploration

Recommended interactive visualizations and resources:


Understanding Deep Learning:

📖 udlbook.github.io/udlbook

Free online textbook with:

  • Interactive Python notebooks
  • Video lectures
  • Extensive visualizations
  • Modern coverage (transformers, diffusion models)

Summary: Deep Learning Foundations

Key Takeaways:

  1. Supervised Learning: Learn parameters \(\Phi\) from data pairs \(\{x_i, y_i\}\) to minimize loss \(L[\Phi]\)

  2. From Linear to Non-linear: Activation functions (ReLU) enable learning complex patterns

  3. Shallow Networks: A single hidden layer with enough units can approximate any continuous function (Universal Approximation Theorem)

  4. Deep Networks: Multiple layers learn hierarchical representations more efficiently

  5. Geometric View: Networks transform input space through non-linear folding to separate classes

Next lectures: We’ll explore specialized visualizations for CNNs, attention mechanisms, activation analysis, and network interpretability

Next Week: Topological Data Analysis

Preview of upcoming topic:

Topology meets machine learning

  • Persistence diagrams and barcodes
  • Mapper algorithm for visualization
  • Reeb graphs
  • Applications in ML and deep learning

Why it matters:

Topological Data Analysis (TDA) provides robust methods for understanding the shape and structure of high-dimensional data.

Essential for analyzing neural network representations, clustering, and feature spaces!