May 13, 2024

CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks

Introduction.- The emergence of generative artificial intelligence (AI) has ushered in an era where computers can perform tasks that were unimaginable just a few years ago. A prime example of this advancement is found in Large Language Models (LLMs) [1], which are based on the innovative "transformer architecture" [2]. The field of LLMs experienced a significant surge with the introduction of OpenAI's ChatGPT [3], showcasing an unprecedented level of human-machine interaction. Following this, several other models, such as Meta's LLaMA [4] and Google's BERT [5], were developed. Currently, LLMs are expected to be utilized not only in linguistic applications but also across various sectors, attracting substantial investments in this transformative technology. This development represents the most profound technological revolution since the inception of the internet.

However, LLMs are not without their challenges. The most significant issue is the energy consumption required for training these AI models. As noted by the CEO of OpenAI, training ChatGPT-3 incurred an estimated 100 million dollars in electricity bills alone, and the costs for training such models are predicted to double every ten months [6]. Coupled with the exponentially growing demand for these systems, we face a daunting scenario: the development of these systems is currently unsustainable without significantly impacting the planet. The immense energy consumption of LLMs is untenable, compelling the need for greener, more efficient solutions. In this context, various compression techniques for LLMs have been suggested, with quantization [7], distillation [8], pruning [9], and low-rank approximations [10] being among the most prominent. However, these methods are quite brute-force: they largely focus on truncating the effective number of neurons, even when the original model's accuracy is known to increase with size during training. Consequently, controlling and anticipating the compression error in these schemes is challenging, and their application has met with mixed success.
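To make the low-rank approximation approach cited above concrete, the following sketch truncates a dense weight matrix via SVD. This is a generic illustration, not the paper's method: the matrix dimensions, rank, and use of NumPy are illustrative assumptions.

```python
import numpy as np

# Illustrative low-rank compression of a single weight matrix.
# A dense m-by-n matrix W is replaced by two factors of rank r,
# which store far fewer parameters when r << min(m, n).
rng = np.random.default_rng(0)
m, n, r = 256, 512, 16          # hypothetical layer size and target rank

W = rng.standard_normal((m, n))  # stand-in for a trained layer's weights
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep only the r largest singular values and their vectors.
A = U[:, :r] * s[:r]             # shape (m, r)
B = Vt[:r, :]                    # shape (r, n)
W_approx = A @ B                 # best rank-r approximation in Frobenius norm

orig_params = m * n              # 131072
compressed_params = m * r + r * n  # 12288
print(f"parameter fraction kept: {compressed_params / orig_params:.4f}")
```

The truncation error is set by the discarded singular values, which is exactly why such schemes are hard to control when the spectrum of a trained layer decays slowly.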
