561 words
3 minutes
MIT Song Han (EfficientML Lab) - Research Survey 2024-2025
Table of Contents
1
MIT Song Han (EfficientML Lab) - Research Survey 2024-2025
π Most Influential Works (Highly Cited/Impactful)
1. AWQ: Activation-aware Weight Quantization
2. StreamingLLM: Infinite Context Length
3. EfficientViT: High-Resolution Vision Transformer
4. SmoothQuant: Accurate Quantization for LLMs
5. BEVFusion: Multi-Task Multi-Sensor Fusion
π₯ Latest Works (2024-2025)
6. DuoAttention: Efficient Long-Context LLM Inference
7. Radial Attention: O(n log n) Sparse Attention
8. LPD: Locality-aware Parallel Decoding
9. FastRL: Efficient Reasoning RL Training
10. QServe: W4A8KV4 Quantization for LLM Serving
11. VILA-U: Unified Visual Understanding and Generation
12. XAttention: Block Sparse Attention
13. Quest: Query-Aware Sparsity
14. StreamingVLM: Infinite Video Streams
15. Four Over Six (4/6): NVFP4 Quantization
π οΈ Core Open Source Tools
π Research Themes & Trends
1. Extreme Quantization π§
2. Long Context & Streaming π
3. Sparse Attention Patterns π―
4. Unified Vision Models ποΈ
5. Edge & Real-Time Deployment π±
π Quick Links
π‘ Key Insights
MIT Song Han (EfficientML Lab) - Research Survey 2024-2025
Lab: MIT EfficientML Lab | PI: Song Han
Focus: Efficient Deep Learning, TinyML, Edge AI, LLM Acceleration
Survey Date: 2026-02-15
π Most Influential Works (Highly Cited/Impactful)
1. AWQ: Activation-aware Weight Quantization
Publication: MLSys 2024 Best Paper Award π₯
- Problem: LLMs are too large for edge deployment
- Solution: Protects salient weight channels by observing activation magnitudes
- Impact: 4-bit quantization with near-FP16 accuracy; adopted by industry
- GitHub: mit-han-lab/llm-awq β 3.4k
- Paper: arXiv:2306.00978
2. StreamingLLM: Infinite Context Length
Publication: ICLR 2024
- Problem: LLMs canβt handle sequences longer than training context
- Solution: βAttention Sinkβ mechanism - initial tokens anchor attention
- Impact: Process infinite-length text with constant memory
- GitHub: mit-han-lab/streaming-llm β 7.2k
- Paper: arXiv:2309.17453
3. EfficientViT: High-Resolution Vision Transformer
Publication: CVPR 2023 / ICCV 2023
- Problem: ViTs are too slow for high-resolution images
- Solution: Hardware-efficient attention with cascaded attention mechanism
- Impact: 3-5x faster than DeiT with comparable accuracy
- GitHub: mit-han-lab/efficientvit β 3.2k
- Paper: arXiv:2205.14756
4. SmoothQuant: Accurate Quantization for LLMs
Publication: ICML 2023
- Problem: Activations in LLMs have outliers that hurt quantization
- Solution: Mathematically smooths activation outliers to weights
- Impact: Enables W8A8 quantization; integrated into Hugging Face, vLLM
- GitHub: mit-han-lab/smoothquant
- Paper: arXiv:2211.10438
5. BEVFusion: Multi-Task Multi-Sensor Fusion
Publication: ICRA 2023
- Problem: Autonomous driving needs efficient multi-sensor fusion
- Solution: Unified BEV representation for camera + LiDAR
- Impact: 3D detection and segmentation with unified framework
- GitHub: mit-han-lab/bevfusion β 3k
- Paper: arXiv:2205.13542
π₯ Latest Works (2024-2025)
6. DuoAttention: Efficient Long-Context LLM Inference
Publication: ICLR 2025
- Innovation: Separates retrieval heads and streaming heads in attention
- Benefit: Enables efficient long-context inference by pruning non-essential heads
- GitHub: mit-han-lab/duo-attention
- Paper: arXiv:2410.10819
7. Radial Attention: O(n log n) Sparse Attention
Publication: NeurIPS 2025
- Innovation: Energy decay pattern for video generation
- Benefit: Sub-quadratic attention complexity for long video
- GitHub: mit-han-lab/radial-attention
8. LPD: Locality-aware Parallel Decoding
Publication: ICLR 2026 Oral π€
- Innovation: Parallel token generation for autoregressive models
- Benefit: Faster image generation while maintaining quality
- GitHub: mit-han-lab/lpd
9. FastRL: Efficient Reasoning RL Training
Publication: ASPLOS 2026
- Innovation: Adaptive drafter for long-tail reasoning scenarios
- Benefit: Efficient RL training for reasoning tasks
- GitHub: mit-han-lab/fastrl
10. QServe: W4A8KV4 Quantization for LLM Serving
Publication: MLSys 2025
- Innovation: System-algorithm co-design for serving quantized LLMs
- Benefit: Production-ready quantized LLM serving
- GitHub: mit-han-lab/omniserve
11. VILA-U: Unified Visual Understanding and Generation
Publication: ICLR 2025
- Innovation: Single model for both image understanding and generation
- Benefit: Unified vision foundation model
- GitHub: mit-han-lab/vila-u
12. XAttention: Block Sparse Attention
Publication: ICML 2025
- Innovation: Antidiagonal scoring for sparse attention patterns
- Benefit: Efficient attention with structured sparsity
- GitHub: mit-han-lab/x-attention
13. Quest: Query-Aware Sparsity
Publication: ICML 2024
- Innovation: Dynamic sparsity based on query content
- Benefit: Efficient long-context inference
- GitHub: mit-han-lab/Quest
14. StreamingVLM: Infinite Video Streams
Publication: 2024
- Innovation: Extends StreamingLLM to video understanding
- Benefit: Real-time understanding of infinite video streams
- GitHub: mit-han-lab/streaming-vlm
15. Four Over Six (4/6): NVFP4 Quantization
Publication: 2025
- Innovation: Adaptive block scaling for NVFP4 format
- Benefit: More accurate 4-bit quantization on NVIDIA hardware
- GitHub: mit-han-lab/fouroversix
π οΈ Core Open Source Tools
| Tool | Description | Stars | Link |
|---|---|---|---|
| llm-awq | 4-bit LLM quantization | β 3.4k | GitHub |
| streaming-llm | Infinite context length | β 7.2k | GitHub |
| efficientvit | Efficient vision transformers | β 3.2k | GitHub |
| bevfusion | Multi-sensor fusion for AV | β 3k | GitHub |
| smoothquant | W8A8 LLM quantization | - | GitHub |
| omniserve | LLM serving system | - | GitHub |
| tinyengine | MCU inference engine | β 1.5k | GitHub |
| tinychat | Edge LLM chat | - | GitHub |
π Research Themes & Trends
1. Extreme Quantization π§
- AWQ (4-bit) β Four Over Six (NVFP4)
- Pushing precision limits while maintaining accuracy
- Hardware-algorithm co-design
2. Long Context & Streaming π
- StreamingLLM β StreamingVLM β DuoAttention
- From infinite text to infinite video
- Memory-efficient attention mechanisms
3. Sparse Attention Patterns π―
- Quest (query-aware) β XAttention (antidiagonal) β Radial Attention (energy decay)
- Structured sparsity for efficiency
- O(n log n) and sub-quadratic complexity
4. Unified Vision Models ποΈ
- VILA-U: Understanding + Generation in one model
- Efficient high-resolution processing
5. Edge & Real-Time Deployment π±
- TinyEngine (MCU) β TinyChat (Edge LLM)
- Real-time inference on resource-constrained devices
π Quick Links
- Lab Website: hanlab.mit.edu
- GitHub Org: github.com/mit-han-lab
- Twitter: @SongHan_MIT
- YouTube Talks: Search βSong Han MIT efficient MLβ
π‘ Key Insights
- From Research to Production: AWQ and SmoothQuant became industry standards
- Infinite Context: StreamingLLM paradigm extends to video (StreamingVLM)
- Hardware-Software Co-design: Every algorithm considers deployment target
- Open Source Culture: All major works have open-source implementations
Survey completed on 2026-02-15 by Amy
Sources: arXiv, GitHub, MIT EfficientML Lab website
MIT Song Han (EfficientML Lab) - Research Survey 2024-2025
https://amysheng-ai.github.io/AmyBlog/posts/mit-songhan-survey-2024-2025/