
Pruna AI, a European startup that has been developing compression algorithms for AI models, is releasing its optimization framework as open source on Thursday.

The company has created a framework that implements multiple efficiency methods, including caching, pruning, quantization, and distillation, to optimize a given AI model.
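To make one of these methods concrete, here is a minimal sketch of symmetric 8-bit quantization, one of the simpler techniques in this family. This is an illustration of the general idea, not Pruna AI's implementation; the function names are hypothetical.

```python
def quantize_int8(weights):
    # Symmetric 8-bit quantization: map each float weight into the
    # integer range [-127, 127] with a single scale factor, shrinking
    # storage roughly 4x versus float32.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    # Recover approximate float weights. The rounding error introduced
    # here is the kind of "quality loss" a compression framework
    # would measure after optimizing a model.
    return [q * scale for q in quantized]
```

The same interface shape — compress, then reconstruct and compare against the original — is what makes it possible to evaluate how much accuracy a given compression setting costs.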

“We also standardize the process of saving and loading compressed models, applying combinations of these compression methods, and evaluating the performance of the compressed model after compression,” explained Pruna AI co-founder and CTO John Rachwan in an interview with TechCrunch.

In particular, Pruna AI’s framework can assess whether there is significant quality loss after compressing a model and determine the resulting performance gains.

“To draw an analogy, we are similar to Hugging Face, which standardized transformers and diffusers – how to call them, save them, load them, etc. We are doing the same, but for efficiency methods,” Rachwan added.

Large AI labs already use various compression methods. For example, OpenAI has relied on distillation to produce GPT-4 Turbo, a faster version of its flagship GPT-4 model. Similarly, the Flux.1-schnell image generation model is a distilled version of the Flux.1 model from Black Forest Labs.

Distillation is a technique for transferring knowledge from a large "teacher" model to a smaller "student" model: the teacher's outputs are used as training targets so that the student learns to approximate the teacher's behavior at a fraction of the size.
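The training signal described above can be sketched as a distillation loss: the KL divergence between the teacher's and student's temperature-softened output distributions. This is a generic textbook formulation, not any lab's specific recipe; the temperature value and function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a higher temperature yields a softer
    # distribution, exposing more of the teacher's relative preferences.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence from the teacher's softened distribution to the
    # student's. Minimizing this trains the student to mimic the teacher.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is what drives the student toward the teacher's behavior during training.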

“For large companies, they usually develop these capabilities in-house, and what is available in the open source world is typically based on single methods, such as one quantization method for LLMs or one caching method for diffusion models,” Rachwan said. “However, you cannot find a tool that aggregates all these methods, makes them easy to use, and allows them to be combined. This is the significant value that Pruna is bringing to the table.”

Left to right: Rayan Nait Mazi, Bertrand Charpentier, John Rachwan, Stephan Günnemann. Image Credits: Pruna AI

Although Pruna AI’s framework supports a wide range of models, including large language models, diffusion models, speech-to-text models, and computer vision models, the company is initially focusing on image and video generation models.

Some of Pruna AI’s existing customers include Scenario and PhotoRoom. In addition to the open source edition, Pruna AI offers an enterprise version with advanced optimization features, including an optimization agent.

“One of the most exciting features we will be releasing soon is a compression agent,” Rachwan said. “You can give it your model, specify the desired speed and accuracy, and the agent will automatically find the best combination and return it to you. As a developer, you won’t need to do anything.”

Pruna AI charges for its pro version by the hour, similar to renting a GPU on a cloud service like AWS. The cost savings from optimizing a model can be substantial, especially for AI infrastructure at scale. For instance, Pruna AI was able to make a Llama model eight times smaller with minimal quality loss using its compression framework.

Pruna AI aims for its customers to view its compression framework as a worthwhile investment that pays for itself. The company recently raised $6.5 million in seed funding from investors including EQT Ventures, Daphni, Motier Ventures, and Kima Ventures.
