CS-GY 9223 - Fall 2025
NYU Tandon School of Engineering
2025-10-27
Deep Learning Terminology and Foundations
Linear Models and Loss Functions
Shallow Neural Networks and Activation Functions
Deep Neural Networks and Composition
Interactive Visualization Tools
Acknowledgments:
Materials adapted from:
Understanding Deep Learning by Simon J.D. Prince
Published by MIT Press, 2023
Available free online: https://udlbook.github.io/udlbook
Why this book?

Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press.
The Supervised Learning Framework:
\[y = f[x, \Phi]\]
| Symbol | Meaning | Example |
|---|---|---|
| \(y\) | Prediction (model output) | House price: $450,000 |
| \(x\) | Input (features) | Square footage: 2000 sq ft, Bedrooms: 3 |
| \(\Phi\) | Model parameters (weights, biases) | Millions of numbers learned from data |
| \(f[\cdot]\) | Model function (architecture) | Neural network with multiple layers |
Key Insight: Deep learning learns the parameters \(\Phi\) from training data pairs \(\{x_i, y_i\}\) to minimize prediction errors.
Training Data:
Pairs of inputs and outputs: \(\{x_i, y_i\}\)
Loss Function:
Quantifies prediction accuracy: \(L[\Phi]\)
Goal:
Find parameters \(\Phi\) that minimize \(L[\Phi]\)
\[\Phi^* = \arg\min_{\Phi} L[\Phi]\]
Generalization:
Test on separate data not seen during training
The Challenge:
We don’t want to just memorize training data!
We want models that generalize to new, unseen examples.
→ This is why we split data into train/validation/test sets.
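As a concrete illustration of such a split, here is a minimal NumPy sketch; the 60/20/20 proportions and the synthetic data are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Synthetic (x_i, y_i) pairs -- purely illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=200)

# Shuffle, then carve out 60% train / 20% validation / 20% test.
idx = rng.permutation(len(x))
n_train, n_val = int(0.6 * len(x)), int(0.2 * len(x))
x_train, y_train = x[idx[:n_train]], y[idx[:n_train]]
x_val, y_val = x[idx[n_train:n_train + n_val]], y[idx[n_train:n_train + n_val]]
x_test, y_test = x[idx[n_train + n_val:]], y[idx[n_train + n_val:]]

print(len(x_train), len(x_val), len(x_test))  # 120 40 40
```

Parameters are fit on the training set, hyperparameters are chosen on the validation set, and the test set is used only once, to estimate generalization.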
The simplest supervised learning model:
\[y = f[x, \Phi] = \Phi_0 + \Phi_1 x\]
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 2: Supervised Learning.
How do we quantify “good fit”?
Loss Function: Sum of squared errors
\[L[\Phi] = \sum_{i=1}^{N} (y_i - f[x_i, \Phi])^2\]
Vertical distance from each data point to the line → squared → summed = total error
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 2: Supervised Learning.
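As a minimal sketch, the linear model and its sum-of-squared-errors loss translate directly into NumPy; the data points and the candidate parameters below are invented for illustration.

```python
import numpy as np

def f_linear(x, phi_0, phi_1):
    """Linear model: y = phi_0 + phi_1 * x."""
    return phi_0 + phi_1 * x

def sse_loss(phi_0, phi_1, x, y):
    """L[Phi] = sum_i (y_i - f[x_i, Phi])^2."""
    residuals = y - f_linear(x, phi_0, phi_1)
    return np.sum(residuals ** 2)

# Tiny illustrative dataset.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])

print(sse_loss(1.0, 2.0, x, y))  # loss for one candidate parameter setting (about 0.06)
```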
Visualizing all possible parameter combinations:
Goal: Find the lowest point (dark blue valley)
Key Observations:
For linear models with a squared-error loss, the surface is convex (a single bowl-shaped valley), so optimization is easy. Deep networks have much more complex, non-convex loss landscapes…
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 2: Supervised Learning.
How do we find the minimum?
Algorithm: Iteratively move downhill
\[\Phi_{new} = \Phi_{old} - \alpha \nabla L[\Phi]\]
(\(\alpha\) = learning rate)
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 2: Supervised Learning.
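A minimal sketch of this update rule applied to the linear least-squares problem above; the learning rate, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

# Same toy data as in the loss example (illustrative).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])

phi_0, phi_1 = 0.0, 0.0   # initial parameters
alpha = 0.01              # learning rate

for step in range(2000):
    residuals = y - (phi_0 + phi_1 * x)
    # Gradient of L[Phi] = sum_i (y_i - phi_0 - phi_1 * x_i)^2
    grad_0 = -2.0 * np.sum(residuals)
    grad_1 = -2.0 * np.sum(residuals * x)
    # Update: Phi_new = Phi_old - alpha * grad L[Phi]
    phi_0 -= alpha * grad_0
    phi_1 -= alpha * grad_1

print(phi_0, phi_1)  # approaches the least-squares fit (about 1.05 and 2.0)
```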
Linear models are limited: they can only fit straight lines!
Solution: Add non-linearity through activation functions
Transform linear combinations with a non-linear function → enables learning complex patterns
Shallow Neural Network (1 hidden layer):
\[y = f[x, \Phi] = \Phi_0 + \sum_{i=1}^{3} \Phi_i \cdot a[ \Theta_{i0} + \Theta_{i1} x]\]
| Component | Description | Count |
|---|---|---|
| \(\Theta_{ij}\) | First layer parameters | 6 parameters |
| \(\Phi_i\) | Second layer parameters | 4 parameters |
| \(a[\cdot]\) | Activation function | Non-linearity! |
Total: 10 parameters (vs 2 for linear model)
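To make the parameter count concrete, here is a small sketch with made-up parameter values; Theta holds one (offset, slope) pair per hidden unit, and Phi holds the output bias plus one weight per unit.

```python
import numpy as np

# First-layer parameters Theta: 3 hidden units x (offset, slope) = 6 parameters.
Theta = np.array([[ 0.5, -1.0],
                  [-0.2,  0.8],
                  [ 1.0,  0.3]])

# Second-layer parameters Phi: 1 bias + 3 weights = 4 parameters.
Phi = np.array([0.1, 1.5, -0.7, 2.0])

print(Theta.size + Phi.size)  # 10 parameters, versus 2 for the linear model
```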
ReLU (Rectified Linear Unit): The most popular activation function
\[a[z] = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}\]
Why ReLU?
✓ Simple: Easy to compute and differentiate
✓ Gradient-friendly: Mitigates the vanishing gradient problem (the gradient is exactly 1 for positive inputs)
✓ Sparse: Many activations are exactly zero
✓ Biologically inspired: Loosely analogous to neurons that either fire or stay silent
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 3: Shallow Neural Networks.
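As a one-line sketch, here are ReLU and its (sub)gradient in NumPy; taking the gradient to be 0 at exactly z = 0 is a common convention, not something the slide specifies.

```python
import numpy as np

def relu(z):
    """a[z] = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative: 1 where z > 0, 0 otherwise (0 chosen at z = 0)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # 0, 0, 0, 0.5, 2
print(relu_grad(z))  # 0, 0, 0, 1, 1
```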
How do multiple ReLU activations combine to approximate complex functions?
Each hidden unit: produces a clipped line, a piecewise linear function with a single joint where its pre-activation \(\Theta_{i0} + \Theta_{i1} x\) crosses zero
Combining multiple units: the weighted sum of these clipped lines is itself piecewise linear, with up to one joint per hidden unit; more units allow finer approximations (a numeric sketch follows after this slide)
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 3: Shallow Neural Networks.
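Here is that numeric sketch, using invented parameters for three hidden units: each unit's pre-activation crosses zero at one input value, and those crossings are exactly the joints of the resulting piecewise linear output.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Illustrative parameters: 3 hidden units (offset, slope) plus the output layer.
Theta = np.array([[ 0.5, -1.0],
                  [-0.2,  0.8],
                  [ 1.0,  0.3]])
Phi = np.array([0.1, 1.5, -0.7, 2.0])

x = np.linspace(-5.0, 5.0, 11)                       # grid of inputs
pre = Theta[:, :1] + Theta[:, 1:] * x[None, :]       # pre-activations, shape (3, 11)
h = relu(pre)                                        # hidden activations
y = Phi[0] + Phi[1:] @ h                             # weighted combination

# Joints sit where each unit's pre-activation crosses zero.
print(-Theta[:, 0] / Theta[:, 1])  # joint locations: 0.5, 0.25, about -3.33
print(y)                           # piecewise linear output values on the grid
```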
Step-by-step: How a shallow network processes an input

Process:
Input \(x\) (left) enters the network
Each hidden unit computes: \(h_i = a[\Theta_{i0} + \Theta_{i1} x]\)
Weighted combination: \(y = \Phi_0 + \sum_i \Phi_i h_i\)
Final output \(y\) (right)
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 3: Shallow Neural Networks.
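The same steps as a runnable sketch for one input value; all parameter values are invented for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

Theta = np.array([[ 0.5, -1.0],    # unit 1: offset, slope
                  [-0.2,  0.8],    # unit 2
                  [ 1.0,  0.3]])   # unit 3
Phi = np.array([0.1, 1.5, -0.7, 2.0])

x = 2.0                                # 1. input enters the network
pre = Theta[:, 0] + Theta[:, 1] * x    # 2. pre-activations Theta_i0 + Theta_i1 * x
h = relu(pre)                          #    hidden activations h_i = a[...]
y = Phi[0] + np.dot(Phi[1:], h)        # 3. weighted combination Phi_0 + sum_i Phi_i h_i

print(pre)  # pre-activations: -1.5, 1.4, 1.6
print(h)    # hidden activations: 0.0, 1.4, 1.6
print(y)    # 0.1 + 1.5*0.0 - 0.7*1.4 + 2.0*1.6 = 2.32
```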
Standard visualization: Network architecture
Components: nodes (circles) for the input, hidden units, and output; arrows for the weights connecting them, with a bias at each unit
Terminology: input layer, hidden layer, output layer; weights and biases; pre-activations (before \(a[\cdot]\)) and activations (after)
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 3: Shallow Neural Networks.
Theoretical Foundation: Shallow networks can approximate any continuous function!
Theorem (Cybenko 1989, Hornik 1991):
A shallow neural network with enough hidden units can approximate any continuous function to arbitrary accuracy on a compact domain.
But… the theorem says nothing about how many hidden units are needed (it can be impractically many) or how to find the parameters.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303-314.
Deep networks compose simple transformations to build complex representations
Shallow network limitations: approximating some functions can require an enormous number of hidden units in a single layer
Deep network advantages: later layers reuse what earlier layers compute, so the same functions can often be represented with far fewer parameters, and features are learned hierarchically
Intuition from vision:
Layer 1: Edges, colors
↓
Layer 2: Textures, simple shapes
↓
Layer 3: Object parts
↓
Layer 4: Object categories
Each layer builds on previous representations!
Building deep networks: Stack multiple hidden layers

Each layer:
\[h^{(k)} = a[W^{(k)} h^{(k-1)} + b^{(k)}]\]
\(h^{(k)}\): Activations at layer \(k\)
\(W^{(k)}\), \(b^{(k)}\): Parameters for layer \(k\)
Composition: \(f = f_K \circ f_{K-1} \circ \ldots \circ f_1\)
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 4: Deep Neural Networks.
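A minimal sketch of this composition as a loop over layers; the layer widths and random parameter values are arbitrary assumptions, and the final layer is kept linear, as is common for regression outputs.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Illustrative widths: 2 inputs -> three hidden layers -> 1 output.
widths = [2, 4, 4, 3, 1]
params = [(rng.normal(size=(widths[k + 1], widths[k])),  # W^(k)
           rng.normal(size=widths[k + 1]))               # b^(k)
          for k in range(len(widths) - 1)]

def forward(x, params):
    """h^(k) = a[W^(k) h^(k-1) + b^(k)], composed over all layers."""
    h = x
    for k, (W, b) in enumerate(params):
        pre = W @ h + b
        h = relu(pre) if k < len(params) - 1 else pre  # linear final layer
    return h

print(forward(np.array([0.5, -1.2]), params))  # network output y
```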
Geometric intuition: Each layer performs a non-linear transformation of the representation space
Layer 1:
Stretches, rotates, bends space with ReLU
Layer 2:
Further transforms the already-bent space
Result:
Complex folding of input space
→ Can separate classes that were originally intertwined
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 4: Deep Neural Networks.
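A small sketch of the folding idea with one made-up layer: two distinct inputs whose pre-activations differ only in a coordinate that ReLU clips to zero end up at exactly the same point in the next layer's representation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One illustrative layer mapping 2-D inputs to 2-D activations.
W = np.array([[1.0, -1.0],
              [0.5,  1.0]])
b = np.array([0.0, -1.0])

def layer(x):
    return relu(W @ x + b)

x1 = np.array([1.0, 2.0])
x2 = np.array([-1.0, 3.0])

print(W @ x1 + b, "->", layer(x1))  # pre-activation [-1, 1.5] -> activation [0, 1.5]
print(W @ x2 + b, "->", layer(x2))  # pre-activation [-4, 1.5] -> activation [0, 1.5]
# Different inputs collapse onto the same activation: the layer has
# folded part of the input space onto itself.
```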
Adding depth: 2 hidden layers → more complex functions
Key difference from shallow networks: each layer operates on the previous layer's outputs, so simple pieces are composed rather than merely added side by side, yielding far more complex functions for the same parameter budget
Example: First layer detects edges, second layer combines edges into shapes
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 4: Deep Neural Networks.
Modern deep learning: Many layers stacked together
Deep Network Characteristics: many stacked layers, hierarchical feature learning, and parameter counts ranging from millions to billions
Modern architectures: ResNet (152 layers), GPT-3 (96 layers), Vision Transformers (24+ layers)
Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 4: Deep Neural Networks.
Interactive tool for understanding neural networks
https://playground.tensorflow.org
Smilkov, D., & Carter, S. (2016). TensorFlow Playground. Google Brain.
Interactive visualization for understanding Convolutional Neural Networks
https://poloclub.github.io/cnn-explainer/
Wang, Z. J., et al. (2020). CNN Explainer: Learning Convolutional Neural Networks with Interactive Visualization. IEEE VIS.
Recommended interactive visualizations and resources:
Neural Network Visualization:
Distill.pub: The Building Blocks of Interpretability (comprehensive visual explanations)
TensorFlow Embedding Projector (explore high-dimensional embeddings)
ML4A: Looking Inside Neural Nets (interactive tutorials and visualizations)
Understanding Deep Learning:
Free online textbook with:
- Interactive Python notebooks
- Video lectures
- Extensive visualizations
- Modern coverage (transformers, diffusion models)
Key Takeaways:
Supervised Learning: Learn parameters \(\Phi\) from data pairs \(\{x_i, y_i\}\) to minimize loss \(L[\Phi]\)
From Linear to Non-linear: Activation functions (ReLU) enable learning complex patterns
Shallow Networks: A single hidden layer with enough units can approximate any continuous function (Universal Approximation Theorem)
Deep Networks: Multiple layers learn hierarchical representations more efficiently
Geometric View: Networks transform input space through non-linear folding to separate classes
Next lectures: We’ll explore specialized visualizations for CNNs, attention mechanisms, activation analysis, and network interpretability
Preview of upcoming topic:
Topology meets machine learning
Why it matters:
Topological Data Analysis (TDA) provides robust methods for understanding the shape and structure of high-dimensional data.
Essential for analyzing neural network representations, clustering, and feature spaces!