Machine learning has witnessed a surge in interest in recent years, driven by several factors, including the availability of large datasets, advances in transfer learning, and the development of more powerful neural network architectures, all giving rise to capable models with a wide range of applications. However, the size of these models has been growing at an unprecedented rate, with some exceeding billions of parameters. This growth in size has led to increased computational costs, making it difficult to train and deploy these models at scale on current GPUs. To address this challenge, researchers have been exploring various model compression techniques to reduce model size and computational requirements.

Use Case: Model Compression

Problem Statement

Modern machine learning models can contain billions of parameters, leading to unprecedented model sizes, computational costs, and memory requirements that make it difficult to train and deploy them at scale on current GPUs.

Realization Approach

Model compression algorithms targeted at specific applications can reduce and fine-tune the size of models. When these models are integrated behind a unified backend, developers can choose between multiple models and switch between them at runtime.

Solution Space

Model compression algorithms target various constraints, such as cost, latency, and speed requirements, to refine model performance. They use a variety of techniques designed for different situations, use cases, architectures, and hardware devices.

Featured AI/ML Middleware Platform

Unify.ai offers a platform for deploying a custom LLM backend that provides more optimized responses for generative AI applications. Using a single API endpoint, developers can combine and access multiple LLMs across multiple providers and switch between them based on constraints such as cost, latency, and speed. Unify significantly reduces the effort of LLM selection, letting developers spend more time on critical application logic.

Model compression refers to a set of algorithms that aim to reduce the size and memory requirements of neural networks without significantly impacting their accuracy. This can help make models more efficient and cost-effective, allowing them to be deployed in various environments, such as edge devices and cloud services.

These algorithms are typically integrated into various libraries that provide bespoke APIs for applying compression techniques to a user’s models. However, the rapid pace of development of these algorithms has made it difficult for user-facing tools to integrate all available techniques in a timely manner. Further, because some algorithms are better suited to specific compiler toolchains, tools often have to make selective choices about which techniques to integrate, given their unique focus and design choices.

As a result, the landscape of model compression utilities has become complex, with requirements that only partially overlap across available tools, making it difficult for users to get a clear sense of the tools and techniques most relevant to their unique use cases.

Figure: Model compression techniques and tools landscape with hardware requirements.

Model Compression in Frameworks and Libraries

Several high-level frameworks provide model compression features through third-party libraries. Initially, TensorFlow had a significant advantage in the deployment race due to its well-established ecosystem of tools and its own Model Optimization Toolkit.

However, PyTorch has been rapidly closing the gap in the deployment space with the introduction of torch.compile and built-in quantization support in PyTorch 2.0, allowing for the efficient deployment of machine learning models. These advancements have positioned PyTorch as a strong contender in the deployment arena. Moreover, PyTorch boasts a vast collection of models: at the time of writing, there are an impressive 10,270 TensorFlow models on Hugging Face, but an even more staggering 148,605 PyTorch models, indicating PyTorch’s growing popularity and adoption in the machine learning community. In addition to its native capabilities, PyTorch benefits from a thriving ecosystem of third-party compression tools developed specifically for PyTorch models, providing a wide range of options for compressing and optimizing them and further enhancing their deployment efficiency.
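As a minimal sketch of what this native tooling looks like, the snippet below compiles a small PyTorch model with torch.compile; the model definition and input shapes here are illustrative assumptions rather than anything from the original post.

```python
import torch
import torch.nn as nn

# A small hypothetical model, for demonstration only.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# torch.compile (PyTorch 2.0+) captures and optimizes the model's graph,
# typically speeding up both training and inference.
compiled_model = torch.compile(model)

x = torch.randn(32, 128)
with torch.no_grad():
    out = compiled_model(x)  # runs through the compiled graph
print(out.shape)  # torch.Size([32, 10])
```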

Considering all these factors, it is evident that PyTorch’s native quantization and the accompanying third-party tools have become a focal point in the field. Therefore, these posts will primarily focus on exploring PyTorch’s quantization capabilities and some of the tools developed around it. While our coverage is not comprehensive, we strive to cover as many of the major tools per compression technique as possible. That being said, we will focus slightly more on quantization and pruning, given the wider array of tools available for these techniques compared to tensorization and knowledge distillation.

Various Techniques for Model Compression

Quantization

Quantization is a model compression technique that reduces the precision of the weights and activations of a neural network. In other words, it involves representing the weights and activations of a neural network using fewer bits than their original precision. For example, instead of using 32-bit floating-point numbers to represent weights and activations, quantization may use 8-bit integers. This transformation significantly reduces storage requirements and computational complexity. Although some precision loss is inherent, careful quantization techniques can achieve substantial model compression with only minimal accuracy degradation.

Essentially, quantization maps values from a continuous space to a discrete one: full-precision values are transformed into lower bit-width values, called quantization levels, using a quantization map function.

Using a quantization map function to map input values to a lower bit-width. Source: A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification
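As a rough sketch of such a map, the snippet below implements a simple affine (scale and zero-point) quantization of a float tensor to 8-bit integers and back; the tensor, bit-width, and function names are illustrative assumptions, not anything prescribed by the survey above.

```python
import torch

def quantize_affine(x: torch.Tensor, num_bits: int = 8):
    """Map full-precision values to integer quantization levels (affine scheme)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)        # step size between levels
    zero_point = qmin - torch.round(x.min() / scale)   # integer offset mapping 0.0
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover approximate full-precision values from the quantization levels."""
    return (q.to(torch.float32) - zero_point) * scale

x = torch.randn(4, 4)                # hypothetical fp32 weights
q, scale, zp = quantize_affine(x)
x_hat = dequantize_affine(q, scale, zp)
print((x - x_hat).abs().max())       # small round-trip quantization error
```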

Ideally, quantization should be used when the model’s size and computational requirements are a concern, and when the reduction in precision can be tolerated without significant loss in performance. This is often the case with LLMs in tasks such as text classification, sentiment analysis, and other NLP tasks, where the models are massive and resource-intensive. Quantization is also best used when deploying on resource-constrained devices such as mobile phones, IoT and edge devices. By reducing the precision of weights and activations, quantization can significantly reduce the memory footprint and computational requirements of a neural network, making it more efficient to run on these devices.
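For deployment scenarios like these, PyTorch's built-in dynamic quantization is one readily available option; the sketch below quantizes the linear layers of a hypothetical model to int8, with the model itself being an assumption for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical float model; in practice this would be your trained network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic quantization: weights are stored as int8 and activations are
# quantized on the fly at inference time. Works well for Linear/LSTM-heavy
# models such as many NLP architectures.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized_model(x).shape)  # same output shape, smaller weights
```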

Pruning

Pruning is a model compression technique that involves removing unnecessary connections or weights from a neural network. The goal of pruning is to reduce the size of the network while maintaining its accuracy. Pruning can be done in different ways, such as removing the smallest weights, or removing the weights that have the least impact on the output of the network.

From a technical perspective, pruning involves three main steps: (1) training the original neural network, (2) identifying the connections or weights to prune, and (3) fine-tuning the pruned network. The first step involves training the original neural network to a desired level of accuracy. The second step involves identifying the connections or weights to prune based on a certain criterion, such as the magnitude of the weights or their impact on the output of the network. The third step involves fine-tuning the pruned network to restore its accuracy.

A neural network structure, before and after pruning. Source: Pruning Neural Networks
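As a minimal sketch of step (2), the snippet below uses PyTorch's torch.nn.utils.prune to zero out the smallest-magnitude weights of a hypothetical linear layer; the layer and pruning ratio are illustrative assumptions, and the fine-tuning step is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical layer from an already-trained network (step 1 is assumed done).
layer = nn.Linear(256, 256)

# Step 2: zero out the 30% of weights with the smallest absolute magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(f"sparsity: {(layer.weight == 0).float().mean():.2%}")  # ~30%

# Fold the pruning mask permanently into the weight tensor; step 3 would
# then fine-tune the pruned network to restore accuracy.
prune.remove(layer, "weight")
```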

Tensorization

Tensorization is a model compression technique that involves decomposing the weight tensors of a neural network into smaller tensors with lower ranks. In machine learning, it is used to reveal underlying patterns and structures within the data while also reducing its size. Tensorization has many practical use cases in ML, such as detecting latent structure in the data (e.g., representing temporal or multi-relational data) and latent variable modelling.

The goal of tensorization is to reduce the number of parameters in the network while maintaining its accuracy. Tensorization can be done using different methods, such as singular value decomposition (SVD), tensor train decomposition, or Tucker decomposition.

Visual representation of the tensorization process applied to image data. Source: Tensor Contraction and Regression Networks

Tensorization is most useful when a model can be optimized at a mathematical level, i.e. when the model’s layers can be further broken down into lower-rank tensors to reduce the number of parameters needed for computation.
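As an illustrative sketch of the simplest such decomposition, the snippet below uses a truncated SVD to factor a hypothetical linear layer's weight matrix into two smaller matrices, trading approximation error for fewer parameters; the layer sizes and chosen rank are assumptions for demonstration, and the quality of the approximation depends on how low-rank the trained weights actually are.

```python
import torch
import torch.nn as nn

# Hypothetical dense layer: 1024 x 1024 weight matrix (~1.05M parameters).
layer = nn.Linear(1024, 1024, bias=False)
rank = 64  # assumed target rank

# Truncated SVD: W ~= (U * S) @ Vh, keeping only the top-`rank` singular values.
U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
A = U[:, :rank] * S[:rank]      # shape (1024, rank)
B = Vh[:rank, :]                # shape (rank, 1024)

# Replace the single layer with two smaller ones: 2 * 1024 * 64 ~= 131K parameters.
low_rank = nn.Sequential(nn.Linear(1024, rank, bias=False),
                         nn.Linear(rank, 1024, bias=False))
low_rank[0].weight.data = B
low_rank[1].weight.data = A

x = torch.randn(1, 1024)
print((layer(x) - low_rank(x)).abs().max())  # approximation error from truncation
```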

Knowledge Distillation

Knowledge distillation is a model compression technique that involves transferring the knowledge from a large, complex neural network (teacher network) to a smaller, simpler neural network (student network). The goal of knowledge distillation is to reduce the size of the network while maintaining its accuracy by leveraging the knowledge learned by the teacher network.

From a high-level perspective, knowledge distillation involves two main steps:

  1. Training the teacher network: train the teacher network to a desired level of accuracy
  2. Training the student network: train the student network using the outputs of the teacher network as soft targets, i.e. probability distributions over the classes instead of hard labels

An illustration of the knowledge distillation process. Source: Knowledge Distillation: A Survey

The student network is trained to mimic the behavior of the teacher network by minimizing the difference between the soft targets and its own predictions.
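A minimal sketch of this training objective is shown below, combining a temperature-softened KL-divergence term against the teacher's soft targets with the usual cross-entropy on hard labels; the temperature, weighting, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Weighted sum of soft-target (KL) loss and hard-label cross-entropy."""
    # Soft targets: teacher and student distributions softened by the temperature.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # standard scaling to keep gradient magnitudes comparable
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical batch of 8 examples over 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```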


For a detailed analysis of each of the model compression techniques, refer to the original post on Unify.ai.

About the author 

Radiostud.io Staff

Showcasing and curating a knowledge base of tech use cases from across the web.
