
Quantization, one of the most widely used techniques for making AI models more efficient, has limits, and the industry may be fast approaching them.
In the context of AI, quantization means reducing the number of bits (the smallest units a computer can process) needed to represent information. Consider this analogy: if someone asks what time it is, you’ll probably say “noon,” not “oh twelve hundred, one second and four milliseconds.” That’s quantizing. Both answers are correct, but one is more precise. How much precision you actually need depends on the context.
AI models consist of several components that can be quantized, in particular parameters, the internal variables models use to make predictions or decisions. That’s convenient, considering that models perform millions of calculations when they run. Quantized models, with fewer bits representing their parameters, are less demanding mathematically and therefore computationally. (To be clear, this is a different process from “distillation,” which is a more involved and selective pruning of parameters.)
However, quantization may have more trade-offs than previously assumed.
Models getting smaller
A study by researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon found that quantized models perform worse if the original, unquantized version of the model was trained on a lot of data over a long period. In other words, at a certain point it may actually be better to train a smaller model outright than to train a large one and quantize it down.
This could be bad news for AI companies that train very large models (known to improve answer quality) and then quantize them to make their services cheaper to run.
The effects are already being seen. A few months ago, developers and academics reported that quantizing Meta’s Llama 3 model tended to be “more detrimental” compared with other models, potentially owing to the way it was trained.
“In my opinion, the biggest cost for everyone in AI is, and will continue to be, inference, and our work shows that one important way of reducing it will not work forever,” Tanishq Kumar, a Harvard mathematics student and the first author of the paper, told TechCrunch.
Contrary to popular belief, AI model inference (running a model, as ChatGPT does when answering a question) is often more expensive in aggregate than model training. Consider, for example, that Google spent an estimated $191 million to train one of its flagship Gemini models. That is certainly a huge sum. But if the company were to use a model to generate just 50-word answers to half of all Google Search queries, it would spend roughly $6 billion a year.
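The back-of-envelope arithmetic behind an estimate like that can be sketched as follows. All of the inputs here are assumed, illustrative figures, not numbers disclosed by Google or taken from the study:

```python
# Back-of-envelope inference-cost estimate. Every input below is an
# assumed, illustrative figure -- not a disclosed number from Google.
queries_per_day = 8.5e9          # rough public estimate of daily Google searches
answered_fraction = 0.5          # "half of all Google Search queries"
tokens_per_answer = 67           # ~50 words, at roughly 0.75 words per token
usd_per_million_tokens = 60.0    # assumed serving price for a frontier model

annual_tokens = queries_per_day * answered_fraction * 365 * tokens_per_answer
annual_cost = annual_tokens / 1e6 * usd_per_million_tokens
print(f"${annual_cost / 1e9:.1f}B per year")  # lands in the ~$6B ballpark
```

The point of the sketch is not the exact figure but the shape of the math: inference cost scales with query volume and answer length every single day, while training is a one-time expense.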
Major AI labs have adopted training models on massive datasets under the assumption that “scaling up,” increasing the amount of data and compute used in training, will lead to increasingly capable AI.
For example, Meta trained Llama 3 on a set of 15 trillion tokens. (Tokens represent bits of raw data; 1 million tokens is equal to about 750,000 words.) The previous generation, Llama 2, was trained on “only” 2 trillion tokens.
There is evidence that scaling eventually leads to diminishing returns; Anthropic and Google reportedly recently trained enormous models that fell short of internal benchmark expectations. But there is little sign that the industry is ready to meaningfully move away from these entrenched scaling approaches.
Exactly how precise?
So, if labs are reluctant to train models on smaller datasets, is there a way models could be made less susceptible to degradation? Possibly. Kumar says that he and his co-authors found that training models in “low precision” can make them more robust. Bear with us for a moment as we dive in a bit.
Here, “precision” refers to the number of digits a numerical data type can represent accurately. A data type is a collection of data values, usually specified by a set of possible values and permitted operations. The FP8 data type, for example, uses only 8 bits to represent a floating-point number.
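A tiny sketch makes the idea concrete. NumPy has no FP8 type, so float32 versus float16 stands in for the comparison here; the test value is arbitrary:

```python
import numpy as np

# Fewer bits means fewer digits a numeric type can represent faithfully.
# NumPy has no FP8 type, so float32 vs. float16 stands in for the idea.
x64 = 0.123456789                # a Python float (64-bit, "double" precision)
x32 = np.float32(x64)            # 32 bits: roughly 7 significant decimal digits
x16 = np.float16(x64)            # 16 bits: roughly 3 significant decimal digits

print(abs(float(x32) - x64))     # tiny representation error
print(abs(float(x16) - x64))     # orders of magnitude larger error
```

Every halving of the bit width trades away representable digits, which is exactly the trade quantization makes deliberately.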
Most models today are trained at 16-bit, or “half,” precision and then “post-train quantized” to 8-bit precision: certain model components (e.g., their parameters) are converted to a lower-precision format at the cost of some accuracy. Think of it like doing the math to several decimal places and then rounding off to the nearest tenth, often getting you the best of both worlds.
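The mechanics of post-training quantization can be sketched in a few lines. This is a minimal, illustrative symmetric int8 scheme; production frameworks add per-channel scales, zero points, calibration data, and more:

```python
import numpy as np

# Minimal sketch of symmetric int8 post-training quantization.
# Real frameworks use per-channel scales, zero points, calibration
# data, etc.; this only illustrates the basic idea.
def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0      # map the largest weight to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1_000).astype(np.float32)  # toy "parameters"
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)               # 0.25: int8 uses a quarter of the memory
print(np.abs(w - w_hat).max() <= scale)  # rounding error is bounded by the scale
```

The memory saving is what makes quantized models cheaper to serve; the bounded but nonzero rounding error is the “some accuracy” being given up.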
Hardware vendors like Nvidia are pushing for lower precision for quantized model inference. The company’s new Blackwell chips support 4-bit precision, specifically a data type called FP4, which Nvidia has pitched as a boon for memory- and power-constrained data centers.
However, extremely low quantization precision may not be desirable. According to Kumar, unless the original model is incredibly large in terms of its parameter count, precisions lower than 7- or 8-bit may see a noticeable step down in quality.
If this all seems a bit technical, don’t worry: it is. But the point is simply that AI models are not fully understood, and known shortcuts that work for many kinds of computation don’t work here. You wouldn’t say “noon” if someone asked when you started a 100-meter dash, right? It’s not quite that obvious, of course, but the idea is the same.
“The key point of our work is that there are limitations you cannot naively get around,” Kumar concluded. “We hope our work adds nuance to the discussion that often seeks increasingly low precision defaults for training and inference.”
Kumar acknowledges that his and his colleagues’ study was relatively small in scale, and they plan to test it with more models in the future. But he believes at least one insight will hold: there is no free lunch when it comes to reducing inference costs.
“Bit precision matters, and it’s not free,” he said. “You cannot reduce it forever without models suffering. Models have finite capacity, so rather than trying to fit hundreds of billions of tokens into a small model, I think much more effort will go into meticulous data curation and filtering, so that only the highest quality data is put into smaller models. I am optimistic that new architectures that deliberately aim to make low-precision training stable will be important in the future.”
