Exercise 5: Dimensionality Reduction Dashboard
Exercise 5: Dimensionality Reduction Dashboard
CS-GY 9223: Visualization for Machine Learning - Fall 2025
Released: October 20, 2025
Due: October 27, 2025 (11:59 PM EST)
Weight: 8.33% of final grade
Overview
High-dimensional data analysis is fundamental to modern machine learning. This assignment focuses on building an interactive dashboard for exploring dimensionality reduction techniques including PCA, t-SNE, and UMAP. You will create tools that help users understand how these methods work, compare their results, and tune parameters effectively.
Learning Objectives
By completing this exercise, you will:
- Understand the strengths and weaknesses of different dimensionality reduction methods
- Implement interactive parameter tuning interfaces for PCA, t-SNE, and UMAP
- Create effective visualizations for high-dimensional data exploration
- Build comparison tools for evaluating reduction quality
- Design interfaces for understanding embedding neighborhoods and clusters
Assignment Tasks
Task 1: Multi-Method Comparison Dashboard (40%)
Build a comprehensive interface comparing multiple dimensionality reduction techniques:
Side-by-Side Method Comparison
- PCA Implementation: Interactive 2D/3D principal component projections
- t-SNE Integration: Real-time t-SNE with perplexity parameter tuning
- UMAP Visualization: Interactive UMAP with n_neighbors and min_dist controls
- Method Switching: Quick toggle between methods with synchronized views
Parameter Exploration Interface
- Interactive Controls: Sliders and inputs for key parameters (perplexity, n_neighbors, etc.)
- Real-time Updates: Live visualization updates as parameters change
- Parameter History: Track parameter combinations and their results
- Preset Configurations: Quick access to common parameter settings
Quality Assessment Metrics
- Preservation Metrics: Local and global structure preservation measures
- Stress Calculations: Kruskal stress and other embedding quality metrics
- Neighborhood Preservation: Measure how well local neighborhoods are maintained
- Comparative Rankings: Score and rank different parameter combinations
Task 2: Interactive Exploration Tools (30%)
Create tools for deep exploration of high-dimensional data:
Neighborhood Analysis
- K-Nearest Neighbors: Visualize original vs embedded space neighborhoods
- Distance Preservation: Compare original distances with embedded distances
- Cluster Validation: Assess how well clusters are preserved in low dimensions
- Outlier Detection: Identify points that behave differently in reduced space
Brushing and Linking
- Point Selection: Select points in 2D embedding to highlight in other views
- Feature Correlation: Show how original features correlate with embeddings
- Cluster Exploration: Interactive clustering with adjustable parameters
- Time-series Tracking: For iterative methods, show embedding evolution
Multi-View Coordination
- Original Feature Space: Parallel coordinates or other high-D visualization
- Embedding Space: 2D/3D scatter plots with customizable styling
- Feature Importance: Show which original features contribute to embedding axes
- Projection Matrix: Visualize PCA loadings or other transformation matrices
Task 3: Algorithm Understanding Tools (20%)
Implement educational visualizations to explain how methods work:
PCA Explanation Interface
- Variance Explanation: Bar chart showing variance explained per component
- Component Vectors: Visualize principal component directions in original space
- Reconstruction Error: Show data reconstruction from reduced dimensions
- Eigenface/Component Display: Visual representation of principal components
t-SNE Process Visualization
- Optimization Progress: Show cost function decrease during optimization
- Gradient Flow: Visualize how points move during t-SNE iterations
- Perplexity Effects: Interactive demonstration of perplexity parameter impact
- Early vs Late Exaggeration: Show effects of different optimization phases
UMAP Mechanics
- Graph Construction: Visualize fuzzy simplicial complex construction
- Neighbor Relationships: Show k-nearest neighbor graphs at different scales
- Optimization Dynamics: Display force-directed layout optimization process
- Parameter Sensitivity: Interactive exploration of UMAP hyperparameters
Task 4: Application Case Studies (10%)
Demonstrate the dashboard with three different datasets:
High-Dimensional Image Data
- Dataset: Fashion-MNIST or similar image dataset
- Challenge: Visualizing 784-dimensional pixel data
- Analysis: Compare how different methods preserve visual similarity
Text Embeddings
- Dataset: Word2Vec, GloVe, or document embeddings
- Challenge: Exploring semantic relationships in embedding space
- Analysis: Evaluate preservation of semantic neighborhoods
Scientific Data
- Dataset: Gene expression, chemical compounds, or similar
- Challenge: Understanding complex multidimensional relationships
- Analysis: Validate dimensionality reduction against known biological/chemical groups
Technical Requirements
Implementation Stack
- D3.js v7+ for all interactive visualizations
- HTML5/CSS3 for responsive dashboard layout
- JavaScript ES6+ with modern async/await patterns
- WebGL (optional) for high-performance scatter plots with large datasets
Dimensionality Reduction Implementation
Choose one of these approaches:
- Python Backend: Flask/FastAPI with scikit-learn, umap-learn integration
- JavaScript Libraries: ML-Matrix, t-SNE-js, or similar client-side libraries
- Pre-computed Embeddings: Focus on visualization of pre-calculated results
- Hybrid Approach: Pre-compute expensive algorithms, implement PCA client-side
Performance Requirements
- Responsive Design: Smooth interactions on datasets with 1,000-10,000 points
- Progressive Loading: Handle large datasets without blocking the interface
- Memory Management: Efficient handling of multiple embeddings simultaneously
- Real-time Updates: Parameter changes reflected within 1-2 seconds
Required Features
1. Method Comparison Interface
// Example structure for embedding comparison
const embeddingComparison = {
methods: ["pca", "tsne", "umap"],
parameters: {
pca: {n_components: 2},
tsne: {perplexity: 30, learning_rate: 200},
umap: {n_neighbors: 15, min_dist: 0.1}
},
quality_metrics: {
pca: {explained_variance: 0.68, stress: 0.15},
tsne: {kl_divergence: 2.1, trustworthiness: 0.85},
umap: {fuzzy_set_cross_entropy: 1.3, connectivity: 0.92}
}
};
2. Interactive Parameter Controls
- Slider Controls: Smooth parameter adjustment with live preview
- Input Validation: Prevent invalid parameter combinations
- Reset Functionality: Return to default parameter settings
- Batch Processing: Queue multiple parameter changes for efficient computation
3. Quality Assessment Dashboard
- Metric Visualization: Bar charts, line plots for quality metrics
- Comparative Analysis: Side-by-side metric comparison across methods
- Historical Tracking: Plot metric changes as parameters are adjusted
- Statistical Significance: Error bars and confidence intervals where applicable
Datasets Provided
Work with all three of these datasets for comprehensive evaluation:
1. MNIST Digits (Subset)
- Size: 5,000 samples × 784 features
- Ground Truth: 10 digit classes for validation
- Challenge: Preserving digit similarity in 2D
- Evaluation Focus: Class separation and within-class clustering
2. Wine Quality Dataset
- Size: 4,898 samples × 11 features
- Ground Truth: Wine quality scores
- Challenge: Understanding feature relationships
- Evaluation Focus: Feature importance and quality correlation
3. Single Cell RNA-seq (Subset)
- Size: 2,000 cells × 1,000 genes
- Ground Truth: Cell type annotations
- Challenge: Biological interpretation of cell relationships
- Evaluation Focus: Biological pathway preservation and cell type clustering
Deliverables
1. Interactive Dashboard Application
Deploy a comprehensive dimensionality reduction exploration tool with:
- Multi-method comparison with synchronized parameter controls
- Real-time quality assessment with multiple metrics
- Educational visualizations explaining algorithm behavior
- Three case study demonstrations with different data types
2. Technical Report (4-5 pages)
Submit a detailed analysis covering:
Introduction and Background (0.5 pages)
- Importance of dimensionality reduction in ML workflows
- Overview of implemented methods and their trade-offs
- Dashboard design philosophy and user experience goals
Method Implementation and Comparison (2 pages)
- Technical details of PCA, t-SNE, and UMAP implementations
- Parameter tuning strategies and default selections
- Quality metric calculations and interpretation
- Performance optimization for real-time interaction
Case Study Analysis (1.5 pages)
- Detailed analysis of each dataset using the dashboard
- Comparison of method performance across different data types
- Parameter sensitivity analysis and recommendations
- Insights discovered through interactive exploration
User Interface Design and Evaluation (1 page)
- Design rationale for interactive controls and layout
- Usability considerations and accessibility features
- Performance evaluation and scalability analysis
- Future enhancements and integration possibilities
3. Code Repository and Documentation
Create a professional GitHub repository including:
- Complete source code with modular, documented architecture
- Comprehensive README with setup, usage, and API documentation
- Live demo deployment with all three datasets available
- Data preprocessing scripts for converting datasets to required formats
- Parameter tuning guidelines for different data types
Evaluation Criteria
Technical Implementation (25%)
- Algorithm Integration: Correct implementation of all three DR methods
- Parameter Control: Smooth, responsive parameter adjustment interfaces
- Performance Optimization: Efficient handling of real-time updates
- Code Quality: Clean, modular, well-documented implementation
Visualization Design (25%)
- Visual Clarity: Effective encoding of high-dimensional relationships
- Interactive Design: Intuitive controls and responsive feedback
- Comparative Display: Clear side-by-side method comparisons
- Information Density: Optimal balance of detail and clarity
Analysis and Insights (25%)
- Method Understanding: Deep insights into DR algorithm behavior
- Parameter Sensitivity: Thorough analysis of parameter effects
- Quality Assessment: Meaningful evaluation of embedding quality
- Dataset Insights: Domain-specific discoveries through exploration
User Experience (15%)
- Interface Usability: Intuitive navigation and control layout
- Educational Value: Effective explanation of DR concepts
- Performance: Smooth interactions without lag or crashes
- Accessibility: Consideration for users with different abilities
Documentation and Communication (10%)
- Report Quality: Clear, professional technical writing
- Code Documentation: Comprehensive inline and external documentation
- Reproducibility: Complete instructions for setup and usage
- User Guidelines: Clear guidance for interpreting results
Bonus Opportunities (+10% each, max +20%)
Advanced Features
- 3D Visualization: Interactive 3D embeddings with WebGL rendering
- Animation and Transitions: Smooth transitions between parameter settings
- Custom Metrics: Implementation of additional quality measures (e.g., co-ranking matrix)
Extended Analysis
- Manifold Learning Methods: Add Isomap, LLE, or other manifold techniques
- Ensemble Embeddings: Combine multiple DR methods for robust visualization
- Time-series DR: Handle temporal high-dimensional data
User Experience Enhancements
- Collaborative Features: Share parameter settings and discoveries
- Export Capabilities: High-quality figure export and embedding data download
- Tutorial System: Interactive tutorial for DR method selection and tuning
Submission Instructions
- Deploy Dashboard: Host complete application on GitHub Pages, Netlify, or similar
- Create Repository: Name it
visml-exercise5-[username]
- Prepare Submission Package:
- Live dashboard URL with all three datasets
- GitHub repository link with complete source code
- Technical report (PDF format)
- 5-7 minute video demonstration showcasing key features
- Submit via NYU Classes: Upload all materials and provide links
Late Policy: 20% deduction per day, maximum 5 days late
Resources
Dimensionality Reduction Literature
- Van der Maaten & Hinton, “Visualizing Data using t-SNE” (2008)
- McInnes et al., “UMAP: Uniform Manifold Approximation and Projection” (2018)
- Jolliffe, “Principal Component Analysis” (2nd Edition)
- Lee & Verleysen, “Nonlinear Dimensionality Reduction” (2007)
Technical Implementation
Quality Metrics and Evaluation
- Venna & Kaski, “Neighborhood Preservation in Nonlinear Projection Methods” (2006)
- Gracia et al., “A Comparison of Extrinsic Clustering Evaluation Metrics” (2014)
- Lueks et al., “How to Evaluate Dimensionality Reduction?” (2011)
Visualization Examples
Example Code Snippets
Real-time Parameter Update
class DimensionalityReductionDashboard {
constructor(container) {
this.container = container;
this.methods = ['pca', 'tsne', 'umap'];
this.setupParameterControls();
}
async updateParameters(method, parameters) {
// Show loading indicator
this.showLoading(method);
// Compute new embedding
const embedding = await this.computeEmbedding(method, parameters);
// Update visualization
this.updateEmbeddingView(method, embedding);
// Calculate and display quality metrics
const metrics = this.calculateQualityMetrics(embedding);
this.updateMetricsDisplay(method, metrics);
}
}
Quality Metrics Visualization
function drawQualityMetrics(metrics, container) {
const svg = d3.select(container)
.append('svg')
.attr('width', 400)
.attr('height', 300);
// Create bar chart for different quality measures
const measures = ['trustworthiness', 'continuity', 'stress'];
const scale = d3.scaleLinear().domain([0, 1]).range([0, 200]);
svg.selectAll('.metric-bar')
.data(measures)
.enter()
.append('rect')
.attr('class', 'metric-bar')
.attr('x', 50)
.attr('y', (d, i) => i * 40 + 20)
.attr('width', d => scale(metrics[d]))
.attr('height', 30)
.style('fill', '#69b3a2');
}
Getting Help
- Discord Support: Use #exercise5 channel for technical questions and discussions
- Office Hours: Available for algorithm implementation and optimization help
- Mathematical Concepts: TA support for understanding DR algorithm mathematics
- Performance Issues: Help available for optimizing real-time parameter updates
- Visualization Design: Feedback on effective DR result presentation
Academic Integrity
- Individual Work: Complete this assignment independently
- Library Usage: Properly credit any DR libraries or implementations used
- Code Attribution: Cite external code snippets and visualization examples
- Original Analysis: Your insights and interface design must be original
- Dataset Usage: Only use provided datasets or clearly cite external sources
Questions? Post in Discord #exercise5 channel or contact the teaching team.