Finetuned large language models (LLMs) have shown remarkable performance in financial tasks, such as sentiment analysis and information retrieval. Due to privacy concerns, finetuning and deploying Financial LLMs (FinLLMs) locally are crucial for institutions. However, finetuning FinLLMs poses challenges, including GPU memory constraints and long input sequences. In this paper, we employ quantized low-rank adaptation (QLoRA) to finetune FinLLMs, which leverages low-rank matrix decomposition and quantization techniques to significantly reduce computational requirements while maintaining high model performance. We also employ data and pipeline parallelism to enable local finetuning using cost-effective, widely accessible GPUs. Experiments on financial datasets demonstrate that our method achieves substantial improvements in accuracy, GPU memory usage, and time efficiency, underscoring the potential of low-rank methods for scalable and resource-efficient LLM finetuning.
Large language models (LLMs) have demonstrated exceptional capabilities in various applications, such as finance, healthcare, scientific discovery, etc. Finetuning these models to domain-specific datasets further enhances their performance and improves their applicability to specialized tasks. In the financial domain, finetuned LLMs demonstrate substantial potential for tasks such as sentiment analysis, named-entity recognition, and knowledge extraction from financial documents.
FinGPT applied low-rank adaptation techniques to finetune quantized LLMs in financial contexts, showing noticeable improvements over the base models while substantially reducing memory usage and speeding up training. XBRL Agent evaluated LLMs' capabilities in analyzing XBRL reports; the use of Retrieval-Augmented Generation (RAG) and tool-calling techniques on XBRL-related tasks yielded significant improvements in task accuracy.
Due to sensitive data and regulatory constraints, finetuning and inference of LLMs within local environments remain critical requirements for financial institutions. Furthermore, the ability to create personalized and customized LLMs, finetuned for specific tasks, is essential for maximizing the utility of these models in financial applications.
Challenges. Existing FinLLMs face the following challenges:
- Inefficient finetuning: Financial tasks are often complex and require precise adaptation to large domain-specific data, which can involve extensive parameter updates and prolonged training times.
- Resource-constrained local devices: Finetuning and deploying large language models on resource-constrained local devices, such as consumer GPUs, can be challenging due to their memory and computational limitations.
Building upon prior research, we demonstrate that state-of-the-art LLMs can be finetuned for diverse financial tasks locally and cost-effectively using widely accessible GPUs, achieving notable improvements over baseline models. Our main contributions can be summarized as follows:
- We employ Quantized Low-Rank Adaptation (QLoRA) to alleviate memory requirements and allow more efficient finetuning. Using the low-rank structure reduces the number of trainable parameters required for finetuning, and quantization compresses the model size, further limiting GPU memory consumption.
- We employ distributed data parallelism (DDP) and pipeline parallelism to leverage multiple GPUs effectively. DDP distributes training data across GPUs to accelerate finetuning, while pipeline parallelism partitions the model at the layer level to optimize memory usage during inference. Together, these strategies enable more efficient finetuning and inference for FinLLMs.
- We conduct extensive experiments on diverse financial datasets. Models finetuned with QLoRA exhibit up to a 48% average increase in accuracy compared to baseline models, which validates the effectiveness of low-rank adaptation and quantization in addressing the unique challenges of FinLLMs.
Low-rank adaptation (LoRA) is a parameter-efficient finetuning method that preserves the pretrained transformer model weights and introduces a smaller set of trainable weights, which are expressed using low-rank decomposition.
In LoRA, the weight update is assumed to follow a low-rank decomposition, $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$.
During the finetuning stage, the forward pass is $h = W_0 x + \Delta W x = W_0 x + BAx$, where $W_0$ denotes the frozen pretrained weights and only $A$ and $B$ are updated.
During the inference stage, the adapter can be merged into the pretrained weights as $W = W_0 + BA$, so no additional latency is introduced.
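To make the update concrete, the following is a minimal PyTorch sketch of a LoRA-wrapped linear layer; the class name and initialization details are illustrative rather than taken from the paper's codebase.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: h = W0 x + (alpha / r) * B A x, with W0 frozen."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 32, dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                      # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # A: r x k
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # B: d x r, zero init so the update starts at 0
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass during finetuning: only A and B receive gradients.
        return self.base(x) + self.dropout(x) @ self.A.T @ self.B.T * self.scaling
```

At inference time, the product $BA$ can be merged into the frozen weight, matching the merged form above, so the adapter adds no extra latency.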
Quantized LoRA (QLoRA) further reduces memory usage by using 8-bit or 4-bit quantization. During finetuning, all weights of the pretrained model are quantized to 8-bit or 4-bit, and they are dynamically dequantized back to 16-bit when computations with the input sequence are performed.
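In practice, this kind of quantized finetuning is commonly set up with the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below shows one such configuration for 4-bit NF4 quantization with a rank-4 adapter; the model repository id and target modules are assumptions, not details confirmed by the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with BF16 compute dtype (assumed QLoRA-style setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # assumed Hugging Face repo id
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Rank-4 LoRA adapter matching the hyperparameters reported in the Experiments section
lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```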
Table 1: GPU memory usage for inference (batch size 1) under different quantization levels.
Quantization | Llama 3.1-8B | Llama 3.1-70B |
---|---|---|
16-bit | 15.0 GB | 131.5 GB |
8-bit | 8.6 GB | 68.5 GB |
4-bit | 5.6 GB | 37.8 GB |
To accelerate the finetuning process and leverage the computational power of multiple GPUs, we employed Distributed Data Parallel (DDP), which distributes the training data across multiple GPUs. DDP launches one process per GPU, where each process gets its own copy of the model and optimizer. Each process then receives different inputs, and the gradients are computed and synchronized across all GPUs to accelerate training.
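A minimal sketch of this one-process-per-GPU pattern with PyTorch DDP is shown below; the helper name and launch details are illustrative, since the paper does not specify its exact training harness.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def wrap_for_ddp(model, train_dataset, batch_size=16):
    """One process per GPU, e.g. launched via `torchrun --nproc_per_node=4 train.py`."""
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.to(local_rank), device_ids=[local_rank])
    # DistributedSampler gives each rank a different shard of the training data;
    # gradients are averaged across ranks automatically during the backward pass.
    sampler = DistributedSampler(train_dataset)
    loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
    return ddp_model, loader
```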
We also opted to use Brain Floating Point (BF16) during finetuning. BF16 offers the same dynamic range as FP32 and converts easily to and from FP32. Studies have shown that BF16 can achieve results similar to FP32 while providing significant speedups and memory savings.
We used the 0/1 Adam optimizer, a modified version of Adam that linearizes each Adam step and allows 1-bit compression of communication, reducing data volume and increasing training throughput while preserving convergence speed.
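0/1 Adam is provided by DeepSpeed; since the paper does not specify its training harness, the following is only a hedged sketch of how a BF16 + 0/1 Adam configuration might be expressed as a DeepSpeed config (compression-schedule parameters are omitted, and the optimizer type name follows DeepSpeed's 0/1 Adam documentation).

```python
# Hypothetical DeepSpeed configuration combining BF16 training with 0/1 Adam;
# batch-size and learning-rate values follow the Experiments section.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "optimizer": {
        "type": "ZeroOneAdam",
        "params": {"lr": 1e-4, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.0},
    },
}

# Typical usage (assuming `model` is the QLoRA-wrapped model from the earlier sketch):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```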
Inference on larger models like Llama 3.1 70B requires substantial GPU memory usage, particularly when using higher precision like 8-bit or 16-bit. We employ pipeline parallelism, where the model is partitioned at the layer level and distributed across multiple GPUs; each GPU process computes different micro-batches with different parts of the model concurrently.
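As a simplified illustration, the sketch below partitions a quantized Llama 3.1-70B across the available GPUs at the layer level using transformers/accelerate device mapping; a full pipeline-parallel engine would additionally schedule micro-batches so the partitions work concurrently. The repo id and prompt are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"   # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" places contiguous blocks of layers on each visible GPU,
# so the 8-bit model fits across two 48 GB cards (cf. Table 1).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Inputs go to the GPU holding the first layers; hidden states flow layer by layer
# across devices during generation.
inputs = tokenizer("What is the sentiment of: 'Shares rallied after earnings.'?",
                   return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```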
Table 1 illustrates GPU memory usage achieved through quantization and pipeline parallelism during inference with a batch size of 1. For Llama 3.1-8B, memory usage decreases by 43% with 8-bit quantization and 63% with 4-bit quantization compared to the original 16-bit. Similarly, for Llama 3.1-70B, the memory requirement reduces from 131.5 GB (16-bit, requiring 3 GPUs) to 68.5 GB (8-bit, 2 GPUs) and further to 37.8 GB (4-bit, 1 GPU). These reductions demonstrate the practical benefits of quantization in enabling resource-efficient inference for large-scale models.
Hardware Configurations. The experiments were conducted on a server equipped with four 16-core AMD EPYC 7313 CPUs, 1 TB of RAM, and four NVIDIA RTX A6000 GPUs, each featuring 48 GB of dedicated GPU memory.
For classification tasks, our study focuses on three financial language processing tasks: Sentiment Analysis (SA), Named Entity Recognition (NER), and news headline classification.
- Sentiment Analysis (SA) entails analyzing financial text, such as news articles or tweets, to assign sentiment labels (e.g., positive, negative, or neutral).
- Named Entity Recognition (NER) is designed to identify and classify critical entities within financial texts, including organizations, locations, and individuals.
- News headline classification involves categorizing headlines according to predefined criteria or questions, facilitating the automated organization and analysis of financial news.
For Question-Answering (QA) tasks, we focus on eXtensible Business Reporting Language (XBRL) data extraction. XBRL is a standardized format designed for the exchange of financial information. Although XBRL documents are based on structured XML (eXtensible Markup Language), their inherent complexity presents challenges that can be addressed using the capabilities of LLMs to extract key information, thereby facilitating financial reporting and analysis. In this study, we aim to finetune LLMs to accurately extract both numerical and textual information from XBRL files.
We choose Llama 3.1-8B Instruct and Llama 3.1-70B Instruct models as base models.
Table 2: Financial datasets used for finetuning and evaluation.
Dataset Name | Type | Train/Test Examples |
---|---|---|
FPB | Classification | 1,200 / 3,600 |
FiQA SA | Classification | 961 / 150 |
TFNS | Classification | 9,540 / 2,390 |
NWGI | Classification | 16,200 / 4,050 |
Headline | Classification | 82,200 / 20,500 |
NER | Classification | 13,500 / 3,500 |
XBRL Tags | QA | 375 / 164 |
XBRL Values | QA | 846 / 154 |
The following datasets all consist of input texts and sentiment labels such as "neutral", "positive", or "negative".
- Financial phrasebank (FPB) contains sentences extracted from financial news and reports. These sentences are annotated with sentiment labels. We manually created the train/test split.
- Financial Question-Answering Sentiment Analysis (FiQA SA) is another sentiment analysis dataset, drawn from microblog headlines and financial news, with the same labels as FPB.
- Twitter financial news sentiment (TFNS) comprises annotated tweets related to financial news labeled with sentiment categories.
- News with GPT instruction (NWGI) comprises samples with seven labels ranging from “strong negative” to “strong positive”. We map the seven labels to three for simplicity and consistency with the other SA datasets.
The Headline dataset classifies headlines based on various questions into two classes: "yes" and "no".
The NER dataset annotates one entity per sentence, categorized into one of three classes: "location", "person", or "organization".
The XBRL dataset comprises questions and answers derived from XBRL filings from 2019 to 2023 for Dow Jones 30 companies. Each example includes a question, a text segment from an XBRL file containing the answer, and the ground truth generated using an XBRL file extraction library. From this dataset, we selected the following two tasks:
- XBRL tag extraction: This task involves extracting a specific XBRL tag from a large XBRL raw text segment given a natural language description of the tag.
- XBRL value extraction: This task focuses on extracting a numeric value from the raw XBRL text segment given a natural language description of the value.
To allow better instruction following for the base model, we use one-shot prompting by providing an example question and answer.
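The exact prompts are not reproduced here; the snippet below is a hypothetical illustration of how a one-shot prompt for XBRL tag extraction could be assembled, with the example pair and wording invented for illustration.

```python
# Hypothetical one-shot prompt for XBRL tag extraction; the example question,
# answer, and instructions are illustrative, not the prompts used in the paper.
EXAMPLE_QUESTION = "What is the US GAAP tag reported for total revenue in this segment?"
EXAMPLE_ANSWER = "us-gaap:Revenues"

def build_one_shot_prompt(question: str, xbrl_segment: str) -> str:
    return (
        "You are a financial analysis assistant. Answer using only the provided XBRL segment.\n"
        f"Example question: {EXAMPLE_QUESTION}\n"
        f"Example answer: {EXAMPLE_ANSWER}\n"
        f"XBRL segment: {xbrl_segment}\n"
        f"Question: {question}\n"
        "Answer:"
    )
```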
We employed distinct finetuning strategies based on the nature of the tasks:
- Classification Tasks: Single-task finetuning.
- XBRL Question Answering: Multi-task finetuning.
All finetuning experiments used the 0/1 Adam optimizer with the following hyperparameters (a configuration sketch follows these lists):
- Learning rate: 1e-4
- LoRA alpha: 32
- LoRA dropout: 0.1
- Llama 3.1 8B: LoRA rank 4 with 4-bit quantization and rank 8 with 8-bit quantization.
- Llama 3.1 70B: LoRA rank 4 with 4-bit quantization.
Batch Size and Epochs:
- Classification Tasks:
  - Llama 3.1 8B: Batch size 16, gradient accumulation step 1, 4 epochs.
  - Llama 3.1 70B: Batch size 4, gradient accumulation step 4, 4 epochs.
- XBRL Tasks:
  - Llama 3.1 8B: Batch size 2, gradient accumulation step 2, 1 epoch.
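Putting the hyperparameters above together, a hedged sketch of the training arguments for the 8B classification runs, expressed with Hugging Face TrainingArguments, might look as follows; the output path is hypothetical, and the 0/1 Adam / DDP setup would be supplied through the DeepSpeed configuration sketched earlier.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finllm-qlora-classification",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=4,
    bf16=True,                                  # BF16 finetuning as described earlier
    logging_steps=10,
    save_strategy="epoch",
    # deepspeed="ds_config.json",               # would plug in the 0/1 Adam configuration
)
```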
We use 8-bit quantized inference for all evaluations.
- Accuracy: Ratio of correct answers to total queries.
- Weighted F1 score: Weighted average of per-class F1 scores (see the sketch after this list).
- Batch size: Batch size per GPU during finetuning.
- GPU memory usage: Total GPU memory used during training.
- GPU hours: Total training time * number of GPUs.
- Adapter size: Size of the LoRA adapter file.
- Inference Speed: Seconds to process an example.
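The two quality metrics can be computed directly with scikit-learn; the helper below is the illustrative sketch referenced in the list above.

```python
from sklearn.metrics import accuracy_score, f1_score

def classification_metrics(y_true, y_pred):
    """Accuracy and weighted F1 as defined above (illustrative helper)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }

# Example with three-class sentiment labels
print(classification_metrics(
    ["positive", "neutral", "negative", "neutral"],
    ["positive", "neutral", "neutral", "neutral"],
))
```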
The finetuned Llama 3.1 8B shows noticeable accuracy improvements over its base model and even surpasses the Llama 3.1 70B base model. Even with lower-precision quantization (4-bit) and rank 4, the finetuned Llama 3.1 8B performs comparably to its 8-bit, rank-8 version while using less memory. The finetuned 70B model demonstrates practical usability with 4-bit quantization, showing the feasibility of using larger LLMs for complex financial tasks in resource-constrained environments. All finetuning can also be performed on a single 48 GB GPU, albeit with longer training times.
Table 3: Accuracy on classification and XBRL extraction tasks.
Model | FPB | FIQA | TFNS | NWGI | NER | Headline | Tags | Values |
---|---|---|---|---|---|---|---|---|
Llama-3.1-8B-Instruct (base) | 0.6873 | 0.4655 | 0.6997 | 0.4658 | 0.4889 | 0.4534 | 0.7937 | 0.5526 |
Llama-3.1-70B-Instruct (base) | 0.7450 | 0.4727 | 0.6842 | 0.7993 | 0.4628 | 0.7168 | 0.8902 | 0.8766 |
Llama-3.1-8B-Instruct-4bits-r4 | 0.8630 | 0.7309 | 0.8827 | 0.8095 | 0.9663 | 0.8803 | 0.9500 | 0.9605 |
Llama-3.1-8B-Instruct-8bits-r8 | 0.8284 | 0.8036 | 0.8405 | 0.8396 | 0.9805 | 0.8466 | 0.9437 | 0.9736 |
Llama-3.1-70B-Instruct-4bits-r4 | - | - | - | - | 0.9888 | - | - | - |
Table 4: Weighted F1 score on all classification tasks.
Model | FPB | FIQA | TFNS | NWGI | NER | Headline |
---|---|---|---|---|---|---|
Llama-3.1-8B-Instruct (Base) | 0.6768 | 0.5571 | 0.6834 | 0.4117 | 0.5686 | 0.5576 |
Llama-3.1-70B-Instruct (Base) | 0.7363 | 0.5645 | 0.6864 | 0.7993 | 0.4539 | 0.7294 |
Llama-3.1-8B-Instruct-4bits-r4 | 0.8600 | 0.7811 | 0.8824 | 0.8029 | 0.9664 | 0.8864 |
Llama-3.1-8B-Instruct-8bits-r8 | 0.8302 | 0.8177 | 0.8436 | 0.8492 | 0.9806 | 0.8520 |
Llama-3.1-70B-Instruct-4bits-r4 | - | - | - | - | 0.9887 | - |
Table 5: Finetuning and inference performance on one classification task (NER).
Model | Batch size | GPU memory (GB) | GPU hours | Adapter size (MB) | Time per example (s) |
---|---|---|---|---|---|
Llama-3.1-8B-Instruct-4bits-r4 | 16 x 4 | 83.6 | 0.77 x 4 | 4.5 | 0.1 |
Llama-3.1-8B-Instruct-8bits-r8 | 16 x 4 | 96.7 | 0.90 x 4 | 9.0 | 0.1 |
Llama-3.1-70B-Instruct-4bits-r4 | 4 x 4 | 184.3 | 3.50 x 4 | 21.3 | 0.9 |
Table 6: Finetuning and inference performance on XBRL.
Model | Batch size | GPU memory (GB) | GPU hours | Adapter size (MB) | Time per Example (s) |
---|---|---|---|---|---|
Llama-3.1-8B-Instruct-4bits-r4 | 2 x 4 | 139.2 | 0.44 x 4 | 4.5 | 1.9 |
Llama-3.1-8B-Instruct-8bits-r8 | 2 x 4 | 152.2 | 0.48 x 4 | 9.0 | 1.9 |
This study demonstrates the effectiveness of quantized LoRA (QLoRA) for finetuning large language models (LLMs) on a range of financial tasks. We finetuned Llama 3.1 8B and 70B models, achieving up to 48% average accuracy improvements over the base models. These gains are achievable with four GPUs and under 20 hours of training per task, making local finetuning and deployment feasible for financial institutions. Future work includes exploring multi-task finetuning for classification tasks and expanding XBRL-related tasks, such as formula calculation.
- Chen, T.; Hao, N.; Van Rechem, C.; Chen, J.; and Fu, T. 2024. Uncertainty quantification and interpretability for clinical trial approval prediction. Health Data Science, 4: 0126.
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; and Zettlemoyer, L. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783.
- Fu, Y.; Lu, Y.; Wang, Y.; Zhang, B.; Zhang, Z.; Yu, G.; Liu, C.; Clarke, R.; Herrington, D. M.; and Wang, Y. 2024. DDN3.0: Determining significant rewiring of biological network structure with differential dependency networks. Bioinformatics, btae376.
- Han, S.; Kang, H.; Jin, B.; Liu, X.-Y.; and Yang, S. Y. 2024. XBRL Agent: Leveraging Large Language Models for Financial Report Analysis. In Proceedings of the 5th ACM International Conference on AI in Finance, ICAIF ’24, 856–864. New York, NY, USA: Association for Computing Machinery. ISBN 9798400710810.
- Kalamkar, D. D.; Mudigere, D.; Mellempudi, N.; Das, D.; Banerjee, K.; Avancha, S.; Vooturi, D. T.; Jammalamadaka, N.; Huang, J.; Yuen, H.; Yang, J.; Park, J.; Heinecke, A.; Georganas, E.; Srinivasan, S. M.; Kundu, A.; Smelyanskiy, M.; Kaul, B.; and Dubey, P. K. 2019. A Study of BFLOAT16 for Deep Learning Training. arXiv:1905.12322.
- Li, S.; Zhao, Y.; Varma, R.; Salpekar, O.; Noordhuis, P.; Li, T.; Paszke, A.; Smith, J.; Vaughan, B.; Damania, P.; and Chintala, S. 2020. PyTorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12): 3005–3018.
- Liu, X.-Y.; Wang, G.; Yang, H.; and Zha, D. 2023a. Data-centric FinGPT: Democratizing Internet-scale data for financial large language models. In Workshop on Instruction Tuning and Instruction Following, NeurIPS.
- Liu, X.-Y.; Wang, G.; Yang, H.; and Zha, D. 2023b. Data-Centric FinGPT: Democratizing Internet-scale Data for Financial Large Language Models. In Workshop on Instruction Tuning and Instruction Following, NeurIPS.
- Liu, X.-Y.; Zhang, J.; Wang, G.; Tong, W.; and Walid, A. 2024a. Efficient Pretraining and Finetuning of Quantized LLMs with Low-Rank Structure. In 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), 300–311. Los Alamitos, CA, USA: IEEE Computer Society.
- Liu, X.-Y.; Zhu, R.; Zha, D.; Gao, J.; Zhong, S.; White, M.; and Qiu, M. 2024b. Differentially Private Low-Rank Adaptation of Large Language Model Using Federated Learning. ACM Transactions on Management Information Systems.
- Lu, Y.; Li, C.; Zhang, M.; Sa, C. D.; and He, Y. 2022a. Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. arXiv:2202.06009.
- Lu, Y.; Wu, C.-T.; Parker, S. J.; Cheng, Z.; Saylor, G.; Van Eyk, J. E.; Yu, G.; Clarke, R.; Herrington, D. M.; and Wang, Y. 2022b. COT: an efficient and accurate method for detecting marker genes among many subtypes. Bioinformatics Advances, 2(1): vbac037.
- Maia, M.; Handschuh, S.; Freitas, A.; Davis, B.; McDermott, R.; Zarrouk, M.; and Balahur, A. 2018. WWW’18 Open Challenge: Financial Opinion Mining and Question Answering. In Companion Proceedings of The Web Conference 2018, 1941–1942.
- Malo, P.; Sinha, A.; Takala, P.; Korhonen, P.; and Wallenius, J. 2013. Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts. arXiv:1307.5336.
- Rahman, M. A. 2022. Twitter financial news sentiment. http://precog.iiitd.edu.in/people/anupama.
- Saeedi, A.; Richards, J.; and Smith, B. 2007. An Introduction to XBRL. In British Accounting Association’s Annual Conference.
- Salinas Alvarado, J. C.; Verspoor, K.; and Baldwin, T. 2015. Domain Adaption of Named Entity Recognition to Support Credit Risk Assessment. In Hachey, B.; and Webster, K., eds., Proceedings of the Australasian Language Technology Association Workshop 2015, 84–90. Parramatta, Australia.
- Sinha, A.; and Khandait, T. 2020. Impact of News on the Commodity Market: Dataset and Results. arXiv:2009.04202.
- Wang, Y.; Xu, Y.; Ma, Z.; Xu, H.; Du, B.; Gao, H.; and Wu, J. 2024. TWIN-GPT: Digital Twins for Clinical Trials via Large Language Model. arXiv:2404.01273.