Chapter 4: Toward Edge Intelligence

Abstract

This chapter explores how edge intelligence (EI) integrates artificial intelligence (AI) models into edge devices, drawing on state-of-the-art techniques from both hardware and software perspectives. It begins by discussing collaborative frameworks between cloud-edge and edge-edge networks. The chapter then examines advances in hardware accelerators for edge devices, including application-specific integrated circuits, field-programmable gate arrays, and graphics processing units, alongside optimized software frameworks, runtime environments, and containers. Key technologies enabling EI, such as model compression techniques (e.g., pruning, quantization, low-rank approximation, and knowledge distillation) and hardware-software codesign, are explored to meet the unique demands of deploying AI models on resource-constrained edge devices. Furthermore, the chapter discusses optimized methods for training and inference directly at the edge, offering a cohesive overview of how EI leverages collaborative frameworks, hardware advancements, and efficient techniques to enable AI on edge devices.


📝 Practice Questions

1. How does edge computing hardware differ from traditional data center hardware?
2. Discuss the advantages and challenges of implementing machine learning at the edge.
3. What are the key considerations in choosing an edge application development framework?
4. How does integrating hardware accelerators impact the design of machine learning models for edge deployment?
5. Discuss the role of software-hardware codesign in optimizing resource-constrained edge computing environments.

📘 Course Projects

1. Develop a simple edge computing application using a containerization platform.
2. Prune a classical neural network model to reduce both model size and latency. Understand the basic concept of pruning, implement and apply a few pruning approaches, get a basic understanding of the performance improvement (such as speedup) gained from pruning, and understand the differences and tradeoffs between these pruning approaches (a minimal pruning sketch follows this list).
3. Quantize a classical neural network model to reduce both model size and latency. Understand the basic concept of quantization, implement and apply a few quantization approaches, get a basic understanding of the performance improvement (such as speedup) gained from quantization, and understand the differences and tradeoffs between these quantization approaches (a minimal quantization sketch follows this list).
4. Use knowledge distillation to compress a classical neural network model to reduce both model size and latency. Understand the basic concept of knowledge distillation and get a basic understanding of the performance improvement (such as speedup) gained from knowledge distillation (a minimal distillation sketch follows this list).
5. Use model compression techniques to optimize large language models (LLMs) on edge devices (e.g., your laptop). A good example can be found on GitHub: https://github.com/mit-han-lab/tinychat-tutorial?tab=readme-ov-file
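
Below is a minimal magnitude-pruning sketch for Course Project 2. It assumes PyTorch and its built-in `torch.nn.utils.prune` utilities; the three-layer toy classifier, the 60% sparsity target, and the choice to prune only `nn.Linear` layers are illustrative assumptions rather than part of the project specification.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small toy classifier standing in for "a classical neural network model".
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Unstructured L1 (magnitude) pruning: zero out the 60% smallest-magnitude
# weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)

# Report the achieved sparsity, then fold the pruning masks into the weight
# tensors so the model can be saved and deployed as usual.
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        sparsity = float((module.weight == 0).sum()) / module.weight.numel()
        print(f"{name}: sparsity = {sparsity:.2%}")
        prune.remove(module, "weight")
```

Comparing this unstructured approach against a structured variant (e.g., `prune.ln_structured`, which removes entire rows or channels) is one way to study the tradeoffs the project asks about: unstructured pruning usually preserves accuracy better at a given sparsity, while structured pruning maps more directly to latency reductions on commodity edge hardware.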
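
Next, a minimal post-training dynamic quantization sketch for Course Project 3, again assuming PyTorch; the toy model and the `model_size_mb` helper are illustrative.

```python
import os
import torch
import torch.nn as nn

def model_size_mb(model: nn.Module) -> float:
    """Serialize the model's state dict to disk and report its size in MB."""
    path = "tmp_model.pt"
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

# A small toy network standing in for "a classical neural network model".
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Dynamic quantization: Linear weights are stored as int8, and activations are
# quantized on the fly at inference time (CPU execution).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(f"fp32 size: {model_size_mb(model):.2f} MB")
print(f"int8 size: {model_size_mb(quantized):.2f} MB")

# Sanity check: outputs should stay close to the fp32 model's outputs.
x = torch.randn(64, 784)
print("max abs diff:", (model(x) - quantized(x)).abs().max().item())
```

Dynamic quantization only converts the weights ahead of time; static post-training quantization and quantization-aware training also quantize activations and are natural follow-up approaches for the tradeoff comparison. Latency can be compared by timing repeated forward passes of both models on the CPU (e.g., with `time.perf_counter`).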
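
Finally, a minimal knowledge-distillation sketch for Course Project 4, assuming PyTorch; the teacher and student architectures, temperature, and loss weighting are illustrative choices, the teacher would normally be pretrained, and random tensors stand in for a real dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# In practice the teacher is a larger, pretrained model; here both networks
# are toy examples so the training loop runs end to end.
teacher = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of a soft (teacher-matching) loss and a hard (label) loss."""
    # KL divergence between temperature-softened distributions, scaled by T^2.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
teacher.eval()

for step in range(100):                  # replace with a real dataset and epochs
    x = torch.randn(32, 784)             # dummy inputs
    y = torch.randint(0, 10, (32,))      # dummy labels
    with torch.no_grad():
        t_logits = teacher(x)            # frozen teacher predictions
    loss = distillation_loss(student(x), t_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The student has far fewer parameters than the teacher, so it is smaller and
# faster at inference while (trained on real data) retaining much of the
# teacher's accuracy.
```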

📚 Suggested Papers

1. Jude Haris et al. "SECDA: Efficient hardware/software co-design of FPGA-based DNN accelerators for edge inference". In: 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE. 2021, pp. 33–43 | Paper.
2. Jakub Konečný et al. "Federated learning: Strategies for improving communication efficiency". In: arXiv preprint arXiv:1610.05492 (2016) | Paper.
3. Xingzhou Zhang et al. "OpenEI: An open framework for edge intelligence". In: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). IEEE. 2019, pp. 1840–1851 | Paper.
4. Zhi Zhou et al. "Edge intelligence: Paving the last mile of artificial intelligence with edge computing". In: Proceedings of the IEEE 107.8 (2019), pp. 1738–1762 | Paper.