Advancements in AI model compression and hardware acceleration are essential for enabling efficient, high-performance deep learning systems. Torch2Chip is a customizable deep neural network compression and deployment toolkit designed to bridge the gap between AI algorithms and prototype hardware accelerators. Developed by Jian Meng, a Ph.D. candidate at Cornell Tech under the guidance of Professor Jae-sun Seo, Torch2Chip addresses the challenges of verifying customized hardware with actual AI workloads, particularly in the evolving landscape of vision and large language models (LLMs). By enhancing the observability and modularity of low-precision AI operations, Torch2Chip is revolutionizing hardware design for AI, directly contributing to the goals of CoCoSys.

1. Tell us about the findings of your recent work entitled “Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design” and how it supports the goals of CoCoSys.

Empowered by the “sacred” scaling law, AI models have expanded dramatically in size over the past decade, and a wide range of compression algorithms has flourished alongside them. Naturally, the intensive computation of AI models motivates hardware researchers to design efficient and powerful prototype accelerators that maximize the efficiency of AI workloads. As part of a research group working on both hardware-efficient AI compression algorithms and custom ASIC hardware design for such algorithms, we realized from our own experience that accurately verifying customized hardware with actual AI workloads is difficult in both the “pre-LLM” and “post-LLM” eras:

  1. Customized low-precision data types are powerful when paired with dedicated hardware design, but there is essentially zero support for them from either mainstream AI infrastructures (e.g., PyTorch, vLLM) or state-of-the-art (SoTA) compression algorithms.
  2. Commercial hardware offers only a limited degree of customization in the granularity of compression (e.g., quantization), even though fine-grained compression is actively adopted in customized AI accelerators, regardless of the hardware platform (e.g., digital vs. analog).
  3. Even in the LLM era, compressed operators and kernels often fuse the low-precision operators with high-precision scaling, which makes the detailed MAC operation unobservable to hardware engineers (see the sketch after this list).
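
To make the third point concrete, here is a minimal, illustrative sketch (the tensor names and scale values are assumptions for this example, not Torch2Chip code) contrasting a typical fused INT8 matrix multiplication, where the integer accumulator is immediately folded into a floating-point rescale, with an unfused version that keeps the raw integer accumulator a MAC-array designer would actually need to verify:

```python
import torch

def fused_int8_matmul(x_int8, w_int8, x_scale, w_scale):
    # Typical deployment kernel: the low-precision multiply and the high-precision
    # scaling are fused, so the integer accumulator never surfaces to the user.
    return (x_int8.float() @ w_int8.float().t()) * (x_scale * w_scale)

def observable_int8_matmul(x_int8, w_int8, x_scale, w_scale):
    # Hardware-oriented path: expose the integer partial sums that a MAC array
    # actually produces, and apply the high-precision scaling afterwards.
    acc_int = x_int8.to(torch.int64) @ w_int8.to(torch.int64).t()  # integer accumulation
    return acc_int.float() * (x_scale * w_scale), acc_int          # accumulator stays visible

x = torch.randint(-128, 128, (4, 64), dtype=torch.int8)
w = torch.randint(-128, 128, (16, 64), dtype=torch.int8)
y_fused = fused_int8_matmul(x, w, x_scale=0.02, w_scale=0.01)
y_obs, acc = observable_int8_matmul(x, w, x_scale=0.02, w_scale=0.01)
assert torch.allclose(y_fused, y_obs)  # same numerics, but the accumulator is now observable
```

Both paths return the same result, but only the second one exposes the integer intermediate that a custom accelerator must reproduce bit-exactly.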

The “tri-dilemma” faced by hardware designers is depicted in the image below:

Based on these findings, we presented the first version of Torch2Chip at MLSys 2024. The initial version of the codebase covers mainstream vision-task models, together with automated, fully observable, and modularized low-precision operators. Over the past year, Torch2Chip has been actively helping hardware designers in CoCoSys deploy fully observable AI workloads and design high-performance AI accelerators. See the codebase, which we keep updating, here: https://github.com/SeoLabCornell/torch2chip. Torch2Chip has now started to include LLM models together with various benchmarks, covering both summarization and reasoning tasks, which will expand the horizon of customized LLM accelerators even further.

2. How do your research findings push the boundaries of what we currently know or can do in the field?

Torch2Chip is designed to simplify the verification process for hardware designers while preserving a high degree of customization. Unlike other AI infrastructures, which fail to provide a sufficient degree of customization over data types and compression algorithms, Torch2Chip is a dedicated framework for designers of custom ASIC/FPGA hardware. Acceleration on commercial hardware platforms is NOT the top priority of Torch2Chip. Instead, Torch2Chip primarily focuses on enabling full observability of the low-precision operations (e.g., matrix multiplication) and providing a set of solid and reliable workloads. As a result, hardware engineers can fully understand the gap between the target AI workloads, the customized algorithms, and the user-designed hardware.
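
As an illustration of what full observability can mean in practice, here is a minimal sketch (the module name and fields are assumptions for this example, not the actual Torch2Chip API) of a quantized linear layer that retains its integer operands and integer accumulator so they can be exported as test vectors for an ASIC/FPGA testbench:

```python
import torch
import torch.nn as nn

class ObservableQLinear(nn.Module):
    """Symmetric per-tensor INT8 linear layer that keeps its integer intermediates."""
    def __init__(self, in_features, out_features, n_bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.n_bits = n_bits
        self.captured = {}  # integer tensors a hardware testbench can consume

    @torch.no_grad()  # post-training / export path only in this sketch
    def forward(self, x):
        qmax = 2 ** (self.n_bits - 1) - 1
        w_scale = self.weight.abs().max() / qmax
        x_scale = x.abs().max() / qmax
        w_int = torch.clamp((self.weight / w_scale).round(), -qmax - 1, qmax).to(torch.int32)
        x_int = torch.clamp((x / x_scale).round(), -qmax - 1, qmax).to(torch.int32)
        acc_int = x_int.to(torch.int64) @ w_int.to(torch.int64).t()  # raw integer MAC result
        # Keep exactly what a MAC array would see: integer inputs, weights, accumulator, scales.
        self.captured = {"x_int": x_int, "w_int": w_int, "acc_int": acc_int,
                         "x_scale": x_scale, "w_scale": w_scale}
        return acc_int.float() * (x_scale * w_scale)  # dequantized, high-precision output

layer = ObservableQLinear(64, 16)
_ = layer(torch.randn(4, 64))
torch.save(layer.captured, "qlinear_testvectors.pt")  # e.g., dump stimuli for RTL verification
```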

3. What are some real-world applications or examples of your research that people might encounter in their daily lives?

Practically speaking, Torch2Chip provides an open-source community for hardware engineers to extend their vision, knowledge, and ideas on customizing hardware-aware algorithms with solid verification. With the hierarchical customization scheme, hardware engineers can focus primarily on customizing the top-level algorithms and data types and simply “plug in and run.” As shown in the figure above, Torch2Chip provides a series of quantization examples built on the bottom-level compression module with different granularities. For the next step (Phase 2.0), Torch2Chip will be fully extended to LLMs, allowing hardware engineers to conduct fast and solid algorithm and data-type verification.
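
Returning to the granularity point above, here is a minimal sketch (the function name and defaults are assumptions for this example, not Torch2Chip's actual modules) of a bottom-level symmetric INT8 quantizer that exposes per-tensor versus per-channel granularity, the kind of choice a designer would match to their accelerator's dataflow:

```python
import torch

def quantize_symmetric(w, n_bits=8, per_channel=False):
    qmax = 2 ** (n_bits - 1) - 1
    if per_channel:
        # One scale per output channel (dim 0), as a channel-wise MAC array might require.
        scale = w.abs().amax(dim=1, keepdim=True) / qmax
    else:
        # A single scale for the whole tensor: the cheapest option to support in hardware.
        scale = w.abs().max() / qmax
    w_int = torch.clamp((w / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return w_int, scale

w = torch.randn(16, 64)
w_int_pt, s_pt = quantize_symmetric(w, per_channel=False)  # per-tensor: one scale
w_int_pc, s_pc = quantize_symmetric(w, per_channel=True)   # per-channel: one scale per row
print(s_pt.shape, s_pc.shape)  # torch.Size([]) vs. torch.Size([16, 1])
```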

4. What inspired you to pursue this research, and why do you think it is important?

The starting point for building Torch2Chip was to help my colleagues in our own research group speed up their hardware design process. On top of that, Professor Seo and I began to realize that “there is no AI infrastructure designed for hardware people!” Taping out ASIC chips is expensive, and there is very little room for mistakes. Instead of getting trapped among different algorithm repositories and baselines, having a unified algorithm community for hardware people is critical and practical. During the MLSys 2024 paper review, one reviewer mentioned: “I really understand the motivation for this work. I have fought this kind of thing for many years personally!” It is clear that many people in the community have gone through similar issues in practice. These urgent needs from the hardware community motivated us to start building Torch2Chip and to keep developing it.