AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Published on 2024-06-27


AI Summary

Paper title: "AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning". According to the paper, fine-tuning large language models achieves remarkable performance across a wide range of natural language processing tasks, but as model sizes keep growing, so does the memory required. To address this, the recently proposed memory-efficient zeroth-order (MeZO) methods attempt to fine-tune language models using only forward passes, avoiding the need for a backpropagation graph. However, these methods suffer from significant performance drops and a risk of divergence, which limits their wider adoption. The authors therefore propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, designed to improve both the performance and the convergence of ZO methods. To improve the dimension-dependent accuracy of ZO estimation, they introduce a fast-forward, low-parameter tensorized adapter. To address the divergence frequently observed in large-scale ZO fine-tuning tasks, they propose an adaptive query-number schedule that guarantees convergence. Detailed theoretical analysis and extensive experiments on Roberta-Large and Llama-2-7B confirm the effectiveness of the AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.

[PDF] [Site] [Kimi]

Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
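For context on what "fine-tuning using only forward passes" means in practice, below is a minimal sketch (my own illustration, not the authors' code) of the SPSA-style two-point zeroth-order estimator that MeZO-style methods build on. The `num_queries` argument is included so that an adaptive query-number schedule like AdaZeta's could be plugged in; the function name, learning rate, and perturbation scale `eps` are assumptions for illustration only.

```python
import torch


def zo_sgd_step(params, loss_fn, lr=1e-6, eps=1e-3, num_queries=1):
    """One zeroth-order SGD step built from SPSA-style two-point estimates.

    params: list of trainable torch.Tensors (e.g., adapter weights only)
    loss_fn: callable that runs a forward pass and returns the scalar loss
    num_queries: number of random perturbations averaged for this step
    """
    with torch.no_grad():  # forward passes only; no autograd graph is kept
        grads = [torch.zeros_like(p) for p in params]
        for _ in range(num_queries):
            # Sample one random perturbation direction per trainable tensor.
            zs = [torch.randn_like(p) for p in params]

            # Loss at theta + eps * z.
            for p, z in zip(params, zs):
                p.add_(eps * z)
            loss_plus = float(loss_fn())

            # Loss at theta - eps * z (shift by -2 * eps from the current point).
            for p, z in zip(params, zs):
                p.sub_(2 * eps * z)
            loss_minus = float(loss_fn())

            # Restore theta and accumulate the projected-gradient estimate.
            scale = (loss_plus - loss_minus) / (2 * eps)
            for p, z, g in zip(params, zs, grads):
                p.add_(eps * z)
                g.add_(scale * z)

        # Average over queries and take a plain SGD step.
        for p, g in zip(params, grads):
            p.sub_(lr * g / num_queries)
```

In AdaZeta, as described in the abstract, the trainable `params` would be the small tensorized adapters rather than the full model, and the number of queries per step would grow over training according to the paper's adaptive schedule instead of staying fixed.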

Last updated on 2024-08-02