Introduction.- The emergence of generative artificial intelligence (AI) has ushered in an era where computers can perform tasks that were unimaginable just a few years ago. A prime example of this advancement is found in Large Language Models (LLMs) [1], which are based on the innovative “transformer architecture” [2]. The field of LLMs experienced a significant surge with the introduction of OpenAI’s ChatGPT [3], which showcased an unprecedented level of human-machine interaction. Other prominent models include Meta’s LlaMA [4] and Google’s BERT [5]. Currently, LLMs are expected to be utilized not only in linguistic applications but also across various sectors, attracting substantial investments in this transformative technology. This development represents the most profound technological revolution since the inception of the internet.
However, LLMs are not without their challenges. The most significant issue is the energy consumption required for training these AI models. As noted by the CEO of OpenAI, training ChatGPT-3 incurred an estimated 100 million dollars in electricity bills alone, and the costs of training such models are predicted to double every ten months [6]. Coupled with the exponentially growing demand for these systems, we face a daunting scenario: the development of these systems is currently unsustainable without significantly impacting the planet. The immense energy consumption of LLMs is untenable, compelling the need for greener, more efficient solutions. In this context, various compression techniques for LLMs have been proposed, with quantization, distillation, pruning, and low-rank approximations being among the most prominent. However, these methods are rather brute-force: they largely focus on truncating the effective number of neurons, even though the original model’s accuracy is known to increase with size during training. Consequently, controlling and anticipating the compression error in these schemes is challenging, and their application has met with mixed success.
In this paper, we introduce CompactifAI [7], a novel LLM compression technique based on quantum-inspired Tensor Networks (TNs) [8, 9]. This technique involves “tensorizing” the self-attention and multi-layer perceptron layers using a specific TN, which effectively truncates the correlations present in the model. The degree of truncation can be controlled via the bond dimension of the TN, enabling a significant reduction in the size of the LLM while maintaining accuracy. In practice, the compressed model requires less energy and memory, and operations such as training, retraining, and inference become more efficient. Let us further note that we retrained the tensorized model using multi-GPU distributed training. Within this framework, we observed that the significant reduction in the number of degrees of freedom achieved by tensorization drastically reduces the GPU-CPU transfer time, consequently cutting the training time by almost half in our case. Hence, our tensorization approach is particularly well-suited for distributed training of LLMs. As we will demonstrate, a brief retraining period allows the accuracy of the compressed model to approach that of the original uncompressed version.
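As a back-of-the-envelope illustration of this reduction in degrees of freedom (with purely illustrative dimensions, not a statement about any specific layer of the benchmark below), consider a single dense weight matrix of size d_in × d_out, with d_in = d1 d2 and d_out = d1' d2', replaced by a two-site tensor-network factorization of bond dimension χ. The dense matrix stores d_in d_out parameters, whereas the two-site factorization stores only

  N_tensorized = χ (d1 d1' + d2 d2').

For d_in = d_out = 4096, d1 = d2 = d1' = d2' = 64, and χ = 100, this amounts to 100 × (64^2 + 64^2) ≈ 8.2 × 10^5 parameters instead of 4096^2 ≈ 1.7 × 10^7, i.e., roughly a twentyfold reduction for that single matrix.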
Method.- The compression method we propose is based on the efficient decomposition of the weight matrices of neural networks into Tensor Networks, such as Matrix Product Operators (MPOs) and similar structures. This concept has been successfully implemented in deep learning architectures, as previously demonstrated [10–13], but to the best of our knowledge, this work is its first application to compressing LLMs. Specifically, our approach involves first identifying the layers that are most amenable to correlation compression and then replacing them with suitable TNs (in the present case, MPOs). Without loss of generality, here we consider the LLM architecture of the LlaMA models. As illustrated in Fig. 1, we substitute the weight matrices in the Self Attention (SA) and Multi-layer Perceptron (MLP) layers of the LlaMA decoder block with MPOs characterized by a bond dimension χ. The MPO is determined by executing sequential Singular Value Decompositions (SVDs) on the respective weight matrix, retaining the largest χ singular values at each SVD. This truncation in χ effectively restricts the correlations among neurons within a given layer to the most relevant ones needed to describe the system, while discarding those that are irrelevant. This approach leads to a significant reduction in memory cost, since storing the truncated MPO incurs a polynomial cost, far lower than the exponential cost of storing the original weight matrix. Furthermore, the bond dimension χ effectively controls the level of compression: a smaller χ results in more information being discarded, leading to greater compression but at the cost of reduced accuracy.
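For concreteness, the following is a minimal numerical sketch (not our production code) of the sequential-SVD construction just described, written in NumPy. The two-site factorization, the 64 × 64 splitting of the row and column indices, and the helper name matrix_to_mpo are illustrative assumptions.

import numpy as np

def matrix_to_mpo(W, in_dims, out_dims, chi):
    """Decompose W, of shape (prod(in_dims), prod(out_dims)), into MPO cores
    of shape (bond_left, in_k, out_k, bond_right) with bond dimension <= chi."""
    n = len(in_dims)
    # Reshape into a 2n-index tensor and interleave (in_k, out_k) pairs per site.
    T = W.reshape(*in_dims, *out_dims)
    T = T.transpose([i for pair in zip(range(n), range(n, 2 * n)) for i in pair])
    cores, bond = [], 1
    for k in range(n - 1):
        M = T.reshape(bond * in_dims[k] * out_dims[k], -1)
        U, S, Vh = np.linalg.svd(M, full_matrices=False)
        keep = min(chi, len(S))                 # keep the chi largest singular values
        cores.append(U[:, :keep].reshape(bond, in_dims[k], out_dims[k], keep))
        T = S[:keep, None] * Vh[:keep]          # push the remainder to the next site
        bond = keep
    cores.append(T.reshape(bond, in_dims[-1], out_dims[-1], 1))
    return cores

# Example: a 4096 x 4096 weight matrix split over two sites of size 64 x 64 each.
W = np.random.randn(4096, 4096).astype(np.float32)
cores = matrix_to_mpo(W, in_dims=(64, 64), out_dims=(64, 64), chi=100)
print([c.shape for c in cores])   # [(1, 64, 64, 100), (100, 64, 64, 1)]

In the full model, cores of this kind replace the corresponding dense weights of the SA and MLP blocks.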
To ensure high accuracy in the compressed model, our method also includes a rapid retraining phase following the determination of the truncated MPOs. This retraining is essential because the local, layer-by-layer truncation into MPOs (akin to the so-called “simple update” in TN algorithms [14]) may not be optimal, in the sense that other layers are not explicitly taken into account when truncating the weight matrix of a specific layer. However, retraining the compressed structure is far more efficient than training the original uncompressed model, since the tensorized model incurs shorter CPU-GPU transfer times in a distributed training setup. As we demonstrate below, after just a few retraining epochs the accuracy of the compressed model closely approaches that of the original uncompressed model, at a fraction of the cost.
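To illustrate why retraining remains straightforward after compression, the sketch below (an assumed layer design in PyTorch, not our actual implementation) wraps two MPO cores, such as those produced by the sketch above, into a drop-in linear layer whose cores are ordinary trainable parameters; any standard optimizer, including in a multi-GPU distributed setup, can then fine-tune them directly.

import torch
import torch.nn as nn

class MPOLinear(nn.Module):
    """Linear layer whose weight is stored as two MPO cores
    of shapes (1, i1, o1, chi) and (chi, i2, o2, 1)."""
    def __init__(self, cores):
        super().__init__()
        self.cores = nn.ParameterList(
            [nn.Parameter(torch.as_tensor(c, dtype=torch.float32)) for c in cores]
        )

    def forward(self, x):
        a, b = self.cores                        # (1, i1, o1, chi), (chi, i2, o2, 1)
        *lead, _ = x.shape
        x = x.reshape(*lead, a.shape[1], b.shape[1])
        # Contract the input with both cores instead of one large dense weight.
        y = torch.einsum('...ij,aixc,cjyb->...xy', x, a, b)
        return y.reshape(*lead, a.shape[2] * b.shape[2])

# Example with randomly initialized cores of bond dimension 100
# (in practice, the truncated SVD cores would be used instead):
layer = MPOLinear([torch.randn(1, 64, 64, 100) * 0.02, torch.randn(100, 64, 64, 1) * 0.02])
opt = torch.optim.AdamW(layer.parameters(), lr=1e-4)
loss = layer(torch.randn(8, 4096)).pow(2).mean()   # placeholder loss for illustration
loss.backward()
opt.step()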
Benchmark.- To evaluate our method, we used it to compress the LlaMA-2 7B model. This model represents the “smallest” within the “large” category of LLMs in the open-source LlaMA series, developed by Meta. It encompasses 7 billion parameters, has been pre-trained on over 2 trillion tokens, offers a context length of 4096 tokens, and has undergone fine-tuning with more than 1 million human annotations. In float32 the model occupies 24 GB in memory, and 12 GB in float16 after mild quantization.
Our compression technique involved using MPOs with a bond dimension of χ ≈ 100 in the SA and MLP layers of the float16 version of LlaMA-2 7B. As a result, the model was significantly reduced to 2 billion parameters and a memory size of 3.7 GB, which is merely 30% of its original untensorized size in float16, and 15% of the original LlaMA-2 7B in float32 if we also take into account the mild quantization. That is to say, our tensor network compression alone reduces the number of parameters in the model and its size in memory to 30% of the original, while the mild quantization from float32 to float16 further reduces the size by an additional factor of 2.
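As a consistency check on these figures (using the quoted parameter count): 2 × 10^9 parameters × 2 bytes (float16) ≈ 4 × 10^9 bytes ≈ 3.7 GiB, and 3.7/12 ≈ 0.3 while 3.7/24 ≈ 0.15, in line with the 30% and 15% compression ratios quoted above.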
To assess the model’s performance, we focused on the task of text summarization. For this purpose, we selected two well-known open-source datasets: XSum and Gigaword. Both the original and compressed models underwent additional training for a limited number of epochs using these datasets. Notably, training the compressed model was approximately twice as fast as training the uncompressed version, see Fig. 2. Following this training, we calculated ROUGE scores, which are standard metrics for evaluating automatic text summarization and machine translation. A comparison of the ROUGE scores for both models after retraining on the two datasets is illustrated in Fig. 3. This comparison reveals that the compressed model retains about 90% of the original model’s accuracy in float16 despite being only 30% of its original size.
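For reference, ROUGE scores of this kind can be computed with standard open-source tooling. The following is a minimal sketch (not our evaluation pipeline) using the Hugging Face datasets and evaluate libraries, where generate_summary is a hypothetical placeholder for whichever model, original or compressed, is being benchmarked.

from datasets import load_dataset
import evaluate

# Small test slice of XSum for illustration; recent versions of `datasets`
# may additionally require trust_remote_code=True for this dataset.
xsum = load_dataset("xsum", split="test[:100]")
rouge = evaluate.load("rouge")   # requires the rouge_score package

def generate_summary(document):
    # Placeholder: here one would call the original or compressed LlaMA-2 model.
    return document[:64]

predictions = [generate_summary(doc) for doc in xsum["document"]]
scores = rouge.compute(predictions=predictions, references=xsum["summary"])
print(scores)   # rouge1 / rouge2 / rougeL / rougeLsum F-measures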
Let us also stress that the calculations for this benchmark were performed on a single AWS machine with 8 NVIDIA A100 Tensor Core GPUs using distributed retraining, demonstrating in turn that our method is fully GPU-compatible.
Conclusions.- In this paper, we have introduced and benchmarked CompactifAI, a compression method for Large Language Models based on quantum-inspired Tensor Networks. The method decomposes the weight matrices in the Self Attention and Multi-layer Perceptron layers of the LLM in terms of Matrix Product Operators with a given bond dimension, effectively truncating the correlations present in the system. The compression rate of the model can be controlled via the bond dimension. After retraining the compressed model, the accuracy recovers substantially. We benchmarked the technique on text summarization with the LlaMA-2 7B model, combined with quantization from float32 to float16 numbers. The compressed model retained 90% of the accuracy of the original one despite having 30% of the original size in float16. The mild quantization from float32 to float16 produced an extra factor of 2 in compression, so that the size of the final model was 15% of the original LlaMA-2 7B in float32, proving in turn that our TN method is also compatible with other compression techniques.
Our work provides a much more refined, controllable, and explainable compression technique for LLMs compared to alternative methods such as pruning, distillation, quantization, and low-rank approximations. Furthermore, our TN method is compatible with all these techniques and can be applied alongside them, as we have shown with mild quantization, further enhancing the overall LLM compression capacity. This can be explored in future work, as well as more advanced TN compression techniques for LLMs.
In our opinion, our work opens the door to the democratization of LLMs. Smaller LLMs are a necessity if their gargantuan energy consumption is to be lowered. Moreover, our approach enables the on-premises deployment of LLMs for use cases with no cloud connection, unlocking an entirely new range of applications. We believe that our method will play a fundamental role in the development of future AI technologies.
Acknowledgements: We acknowledge Donostia International Physics Center (DIPC), Ikerbasque, Basque Government, Diputación de Gipuzkoa, European Innovation Council (EIC), and the Spanish Government for constant support, as well as insightful discussions with the team from Multiverse Computing. A. T. acknowledges funding by MICIN within the NextGenerationEU (PRTR-C17.I1) program and by Generalitat de Catalunya. S. S. J. also acknowledges the Institute for Advanced Studies in Basic Sciences (IASBS).
Data availability statement: All data required for this project are available from the authors upon reasonable request.
Contact: victor.gaspar@multiversecomputing.com