June 25, 2024

Quantum-inspired tensor networks accelerate AI and cut compute costs

The transformative power of Large Language Models (LLMs) like ChatGPT and Meta’s LLaMA for applications in artificial intelligence and data analytics comes with high development and operational costs. Building these models can require as much as $5 million per training run to meet the immense compute and energy demands of the thousands of GPUs that handle massive datasets in parallel. Complexity, bias and lack of explainability also cause problems once the models are deployed. These barriers put significant limitations on what we can do with today’s AI models.

These enormous energy bills are only going up: analysts predict that training costs will double every year for the foreseeable future.

To reach the next level, AI must do more with less.

Multiverse Computing has built a tool for conducting AI “surgery” to trim excess weight while keeping the models just as accurate and powerful, proving that bigger isn’t always better.

“Our results imply that standard LLMs are, in fact, heavily overparametrized, and do not need to be large at all,” said Román Orús, Chief Scientific Officer of Multiverse Computing.

Orús is one of the authors of a new paper from Multiverse Computing that explains how tensor networks can compress AI models with minimal loss of accuracy. The paper, “CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks,” demonstrates how CompactifAI uses quantum-inspired tensor networks to shrink these models.

The paper also includes benchmarking data to illustrate CompactifAI’s performance.

As the paper explains, CompactifAI reduces the number of parameters in a model by 70% to 80%. Combined with quantization, this reduction cuts memory requirements by 93%, training time by 50% and inference time by 25%, at the cost of an accuracy drop of only 2%.

Multiverse’s compressor CompactifAI uses this process to shrink AI models:

  1. First, the software identifies the largest and densest layers of the neural network that are suitable for shrinking.
  2. Then, tensorization breaks down the large matrices within these layers into smaller, interconnected matrices.
  3. Finally, it retrains or “heals” the model to improve accuracy by adjusting the remaining parameters.
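As a rough illustration of step 2, the sketch below splits one dense weight matrix into two smaller factors with a truncated SVD. This is a deliberately simplified stand-in, not CompactifAI's actual method: a plain low-rank factorization is the simplest case of breaking a large matrix into smaller, interconnected ones. The function name `low_rank_factorize`, the matrix sizes and the chosen rank are all assumptions made for the example.

```python
import numpy as np

def low_rank_factorize(weight, rank):
    """Split an (m x n) matrix into factors of shape (m x rank) and (rank x n).

    Hypothetical helper: a truncated SVD keeps only the `rank` largest
    singular values, i.e. the strongest correlations in the matrix.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    left = u[:, :rank] * s[:rank]   # absorb singular values into the left factor
    right = vt[:rank, :]
    return left, right

rng = np.random.default_rng(0)

# Toy "layer": a 256 x 256 matrix that is nearly rank 16, plus small noise.
# Real LLM layers are far larger and not guaranteed to be this compressible.
m, n, true_rank = 256, 256, 16
weight = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
weight += 0.01 * rng.standard_normal((m, n))

left, right = low_rank_factorize(weight, rank=32)

original_params = weight.size
compressed_params = left.size + right.size
reduction = 1 - compressed_params / original_params
rel_error = np.linalg.norm(weight - left @ right) / np.linalg.norm(weight)

print(f"parameters: {original_params} -> {compressed_params} ({reduction:.0%} fewer)")
print(f"relative reconstruction error: {rel_error:.4f}")
```

In this toy setting the two factors hold a quarter of the original parameters. The "healing" in step 3 would then fine-tune the factors inside a training framework, which the sketch does not attempt.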

Orús and his team evaluated CompactifAI’s performance by compressing Meta’s LLaMA-2 7B model. After the model was compressed and healed, it was benchmarked on tasks related to language understanding, reasoning, reading comprehension, world knowledge and math (see Table I in the paper).

While reducing the parameters by 70%, the model maintained 98% of the original accuracy, surpassing the results of other compression techniques and suggesting that a substantial portion of parameters in LLMs are unnecessary.

Tensorization significantly reduces the number of parameters, while retaining the most relevant correlations between parameters. This reduction means fewer parameters are transferred during the model’s multi-GPU distributed training, drastically reducing the GPU-CPU transfer time.

As a result, the tensorization approach achieves a 50% speedup over the original and purely quantized models trained on the same amount of data.

These methods also allow refined layer sensitivity profiling, which shows that deeper layers tend to contribute less to LLM performance and are therefore more suitable for tensor network compression.

“CompactifAI’s tensor network compression provides a much more refined, controllable, and explainable compression technique of LLMs compared to alternative methods,” said Orús. “Our approach is also versatile and can be implemented alongside other compression methods like quantization, distillation and pruning.”

On its own, quantization reduces memory size but slows inference by 13%. Combining quantization with tensorization, however, reduces the memory size of LLaMA-2 7B by 93% while speeding up inference by 25%.
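To see how the two techniques compound, consider a back-of-the-envelope calculation: memory scales with the number of parameters times the bits stored per parameter, so the two savings multiply. The sketch below uses the 70% parameter reduction quoted in this article. The bit widths (16-bit weights quantized to 4-bit) are an assumption chosen purely for illustration, since the article does not state them, and the helper `memory_reduction` is hypothetical.

```python
def memory_reduction(param_fraction_kept, bits_before, bits_after):
    """Fraction of memory saved when both parameter count and bit width shrink."""
    return 1 - param_fraction_kept * (bits_after / bits_before)

# 30% of parameters kept (the article's 70% reduction);
# 16-bit -> 4-bit is an illustrative assumption, not a figure from the paper.
saving = memory_reduction(param_fraction_kept=0.30, bits_before=16, bits_after=4)
print(f"estimated memory reduction: {saving:.1%}")
```

Under these assumptions the estimate comes out at about 92.5%, in the neighborhood of the 93% reported for LLaMA-2 7B. The paper's actual quantization settings may differ.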

Because CompactifAI reduces the overall number of parameters in a model, less data has to move between the CPU and GPUs, so transfers complete much faster. This, in turn, saves energy.

“Our technique can also be deployed on premises without the need of a cloud connection, opening the door to smaller, personalized LLMs,” Orús said. “With CompactifAI, AI models use less energy and memory while training, retraining, and inference all become more efficient.”