CompAct: Compressing Retrieved Documents Actively for
Question Answering

Chanwoong Yoon1, Taewhoo Lee1, Hyeon Hwang1, Minbyul Jeong1, Jaewoo Kang1,2
1 Korea University, 2 AIGEN Sciences
{cwyoon99, taewhoo, hyeon-hwang, minbyuljeong, kangj}@korea.ac.kr

Abstract

Retrieval-augmented generation helps language models strengthen their factual grounding by providing external contexts. However, language models often struggle when given extensive information, which diminishes their effectiveness in answering questions. Context compression tackles this issue by filtering out irrelevant information, but current methods still struggle in realistic scenarios where crucial information cannot be captured with a single-step approach. To overcome this limitation, we introduce CompAct, a novel framework that employs an active strategy to condense extensive documents without losing key information. Our experiments demonstrate that CompAct brings significant improvements in both performance and compression rate on multi-hop question-answering (QA) benchmarks. CompAct flexibly operates as a cost-efficient plug-in module with various off-the-shelf retrievers or readers, achieving exceptionally high compression rates (47x). Our code is available at https://github.com/dmis-lab/CompAct.



1 Introduction

Retrieval-augmented generation empowers language models to solidify their factual groundings, presenting relevant contexts to answer questions (Khandelwal et al., 2019; Lewis et al., 2020; Karpukhin et al., 2020a; Izacard et al., 2023). While these approaches extend the knowledge scope of language models beyond their inherent capabilities, they also introduce challenges when it comes to handling long contexts (Li et al., 2024; An et al., 2024; Qian et al., 2024). First, models often struggle to find key information from extensive contexts, which diminishes their abilities to reference documents Liu et al. (2024). Also, they often fail to integrate information across multiple documents, which is a common occurrence in real-world scenarios (Cheng et al., 2024). To this end, there is a growing need for methods that can assist models with handling long contexts.

One way to address these challenges is by compressing contexts into concise forms Li et al. (2023); Pan et al. (2024). The main goal of compression is to reduce the number of tokens in the original text without losing core information.

Figure 1: Performance of HotpotQA with different top-$k$ documents, using LLaMA3-8B (AI@Meta, 2024) as the reader. CompAct framework shows solid performance improvements that align with those of gold documents. This highlights CompAct’s ability to effectively leverage the benefits of increased top-$k$, unlike other methods that struggle with increased noisy context.

However, simply compressing contexts can be suboptimal for QA tasks Joshi et al. (2017); Kwiatkowski et al. (2019), where important details may be filtered out during the compression process (Li et al., 2023). Conversely, keeping redundant information without compression can harm performance, as it may act as a distractor that induces models to generate incorrect responses. To handle these limitations, query-focused compression has emerged as an effective approach, which focuses on preserving information relevant to the question Jiang et al. (2023c); Xu et al. (2024); Cao et al. (2024).

However, existing query-focused compressors still struggle to take advantage of information located behind lengthy contexts, leaving out potential opportunities for reader models to improve their answers. In Figure 1, the increase in retrieval recall parallel to the number of documents indicates that useful information is still present even in the lower-ranked results. This demonstrates that these documents should also be covered to fully exploit given information.

Furthermore, existing methods lack the ability to integrate information across multiple documents, which is required in real-world scenarios Gutiérrez et al. (2024). Figure 2 depicts an example: the question is "What ‘Virtual Choir’-noted conductor has created works for the Austin-based ensemble Conspirare?". To answer this, not only do we need to retrieve information implied within the question ("conductors worked for the Austin-based ensemble Conspirare"), we should also holistically connect and synthesize information across multiple documents ("‘Virtual Choir’-noted conductor"). In other words, the quality of answers hinges on the ability of models to dynamically integrate information across multiple documents, which is yet to be fully explored in the field of compression.

To this end, we propose CompAct, a novel framework that can address these challenges by using an active strategy to compress extensive documents and retain crucial information. Our framework has two key components: active compression and early termination. During compression, the model actively encapsulates input documents by jointly analyzing previously compressed contexts with newly provided segments. This ensures that only the most relevant information (here we refer to the compressed text) to the question is preserved at each step, creating a dense and compact context. At each step, the model then decides whether to terminate the compression process. This decision is made based on the relevance and completeness of the information gathered to answer the query.

Our approach offers two distinct advantages. First, it effectively captures essential context from long documents by incorporating segments along with the previously compressed context. This is crucial for complex QA tasks that require in-depth reasoning and synthesis of information. Second, it condenses large volumes of documents with a high compression rate, without missing essential information. We conduct experiments on five QA benchmark datasets and demonstrate that our framework brings significant improvement in compression rate and end performance in multi-document QA benchmarks. This represents the effectiveness of our framework, as it preserves necessary context without losing critical information.

Our contributions are as follows: (1) We propose CompAct, a novel framework that employs an active strategy for compressing extensive documents. We address the challenge of long contexts by dynamically preserving query-related contexts, focusing on the integration of information across documents. (2) Our framework outperforms existing compressors by a significant margin, achieving a 7.0 (F1) improvement on HotpotQA Yang et al. (2018) with a higher compression rate (47x). Also, it surpasses the performance of long-context large language models in multi-document QA benchmark datasets. (3) We demonstrate the compatibility of CompAct with various retrievers and readers, underscoring its effectiveness as a plug-in module between retrievers and readers.

Figure 2: Overall CompAct framework as a plug-in module between the retriever and the reader LLM. After splitting retrieved documents into segments, CompAct sequentially compresses these segments into a compacted context. By jointly analyzing the previous context with newly provided segments, we actively compress input documents while preserving essential information in the compressed context. If the segments do not offer complete information to answer the question (1st and 2nd segments), CompAct continues to the next step to acquire new information. Once all supporting clues are fully captured ($N$-th segment), the iteration ends.

2 Preliminaries

2.1 Multi-Document Question Answering

Multi-document (or multi-hop) question answering (QA) involves the task of answering questions that require gathering information from multiple documents (Yang et al., 2018; Ho et al., 2020b; Chen et al., 2020; Trivedi et al., 2022; Mavi et al., 2022). This is more complicated than single-document QA, since it requires models to locate and combine information scattered across multiple sources. Even if models can afford lengthy input contexts, they still face challenges in effectively integrating dispersed information from documents.

2.2 Context Compression

Several studies have focused on compressing the inputs of language models to reduce inference cost while preserving core information. Mu et al. (2024) introduce gisting, a method that compresses input prompts into shorter transformer activations that can be generalized to unseen prompts. ICAE Ge et al. (2024) proposes training objectives that compress contexts to be restored as closely to the original as possible. Selective-Context Li et al. (2023) and LLMLingua Jiang et al. (2023b) utilize conditional probabilities of LLMs to assess the importance of information within contexts. xRAG Cheng et al. (2024) uses modality fusion to embed document representations into language models, achieving high compression rates.

Additionally, some works have focused on compressing long context inputs. For example, AutoCompressors Chevalier et al. (2023) transforms segments of input context into soft prompts, which are then attached to the next segment as summary vectors. LongLLMLingua Jiang et al. (2023c) selects candidates from documents and then performs token-level compression to retain valuable information relevant to a question. Concurrent with our work, Chain-of-Agents Zhang et al. (2024) utilizes an iterative framework, which enables information aggregation and context reasoning over long-context tasks. In contrast, our work focuses on linking and synthesizing pivotal information across segments while compressing contexts.

2.3 Task Formulation

In retrieval-augmented generation, a model $M$ predicts an output $y$ conditioned on an input $x$ and $k$ retrieved passages $D_k = \{d_1, \dots, d_k\}_{i=1}^{k}$. For the task of question answering, the input $x$ typically consists of a question $q$ with an instruction. Thus, $M$ generates an answer $y$ based on $x$ and the retrieved documents $D_k$ as follows: $M(y \mid x, D_k)$.

To mitigate the costs of $M$ caused by processing a large number of tokens, several approaches have been recently proposed to compress the documents into a shorter context Wang et al. (2023); Xu et al. (2024). Building on these approaches, our goal is described as follows:

$$\arg\max_{\pi} P_M(y \mid C_\pi, x)$$
$$C_\pi = \pi(q, D_k) \quad \text{with} \quad l(C_\pi) \ll l(D_k)$$

where $l$ represents the number of tokens and $\pi$ is a function that compresses documents $D_k$ into a shorter context $C_\pi$ based on the question $q$. It is important to note that we do not aim to optimize the model $M$ or the retriever. Instead, our primary focus is on compressing the provided contexts into a concise format, ensuring the essential information is retained to answer the question.

3 CompAct

We introduce CompAct, a novel compression framework that actively compresses documents until it finds all necessary evidence for answering a question. To condense a large amount of information from documents, we devise an iterative architecture where the compressed contexts are updated at each iteration. In this section, we provide a comprehensive explanation of our framework and detail the data construction process for training our model.

3.1 Active Compression

We reconsider compression as sequential updates of contexts based on the previous information. Figure 2 clearly shows the concept of our framework. Given a question and documents $D_k = \{d_1, \dots, d_k\}_{i=1}^{k}$ from a retrieval system, we first group the documents as follows:

$$S_t = \{d_{(t-1)\times j+1}, d_{(t-1)\times j+2}, \dots, d_{(t-1)\times j+j}\}$$

where $S_t$ is a $t$-th segment consisting of $j$ documents, and $j$ represents the predefined number of documents to be compressed at each iteration. For example, $S_1 = \{d_1, d_2, \dots, d_5\}$ when $j = 5$. We then begin compressing each segment iteratively until it satisfies the end condition. It can be formulated as follows:

$$C_t, E_t = \pi(q, S_t, C_{t-1})$$

Here, $q$ is a given question to answer. $C_t$ and $E_t$ represent the compressed context and an evaluation at step $t$, respectively. $C_t$ is used as part of the input for the next step. During compression, the model actively integrates information related to the question by jointly analyzing the previously compressed context with a newly provided segment. This approach ensures that only the most relevant information is preserved at each step, resulting in a compact context. As the output context is designed to preserve query-related information, it serves as a comprehensive memory of all iterations up to the current step. We describe an example in Table 11.
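The grouping rule and the recurrence above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `make_segments`, `active_compress`, and `toy_step` are hypothetical names, and `toy_step` is a keyword-overlap stand-in for the fine-tuned compressor $\pi$, used only so the example runs on its own. The initial context $C_0$ is assumed to be empty.

```python
from typing import Callable, List, Tuple

# Hypothetical signature for pi: (question, segment, previous_context)
# -> (compressed_context, evaluation)
CompressStep = Callable[[str, List[str], str], Tuple[str, str]]


def make_segments(docs: List[str], j: int) -> List[List[str]]:
    """Group documents into consecutive segments S_1, S_2, ... of j documents each."""
    if j <= 0:
        raise ValueError("j must be positive")
    return [docs[i:i + j] for i in range(0, len(docs), j)]


def active_compress(
    question: str, docs: List[str], step: CompressStep, j: int = 5
) -> Tuple[str, List[Tuple[str, str]]]:
    """Apply C_t, E_t = pi(q, S_t, C_{t-1}) over every segment in order."""
    context = ""  # assumption: C_0 is the empty string
    history = []
    for segment in make_segments(docs, j):
        context, evaluation = step(question, segment, context)
        history.append((context, evaluation))
    return context, history


def toy_step(question: str, segment: List[str], previous: str) -> Tuple[str, str]:
    """Stand-in for the compressor: keep documents sharing a word with the question."""
    q_words = set(question.lower().split())
    kept = [d for d in segment if q_words & set(d.lower().split())]
    context = " ".join(filter(None, [previous] + kept))
    return context, "[INCOMPLETE]"
```

With 30 retrieved documents and $j = 5$, `make_segments` yields six segments, so the loop runs at most six times. The loop here visits every segment; stopping early based on $E_t$ is a separate concern.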

3.2 Early Termination

To ensure that the iteration does not continue unnecessarily once enough information is obtained, we introduce a specific end condition for early termination. We implement this by including an evaluation $E$ in the generation process to decide the endpoint. The evaluation $E$ consists of a rationale and a condition token ([COMPLETE] or [INCOMPLETE]). The purpose of $E$ is to assess whether an input segment $S_t$, combined with the previous context $C_{t-1}$, provides sufficient details to answer the question. If the token indicates that the provided context is sufficient, the iteration terminates; otherwise, we continue to gather missing information until all details are fully obtained.
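As a rough illustration of this end condition, the sketch below reads the condition token from the generated evaluation and stops iterating once it is [COMPLETE]. The function names and the regular expression are assumptions made for this example, as is the choice to treat an evaluation with no token as incomplete; the `step` argument stands for the compressor and is supplied by the caller.

```python
import re
from typing import Callable, List, Tuple

# Matches either condition token; the surrounding rationale text is ignored.
CONDITION = re.compile(r"\[(COMPLETE|INCOMPLETE)\]")


def is_complete(evaluation: str) -> bool:
    """True only if the last condition token in the evaluation is [COMPLETE]."""
    tokens = CONDITION.findall(evaluation)
    # Assumption: a missing token is treated as [INCOMPLETE].
    return bool(tokens) and tokens[-1] == "COMPLETE"


def compress_with_early_stop(
    question: str,
    segments: List[List[str]],
    step: Callable[[str, List[str], str], Tuple[str, str]],
) -> Tuple[str, int]:
    """Iterate over segments and stop as soon as the evaluation says [COMPLETE]."""
    context = ""
    steps_used = 0
    for segment in segments:
        context, evaluation = step(question, segment, context)
        steps_used += 1
        if is_complete(evaluation):
            break
    return context, steps_used
```

Returning the number of steps used makes it easy to inspect how often iteration ends before the last segment.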

This early termination offers three primary benefits. First, it prevents redundant contexts from entering the compressed contexts or acting as a distraction. Second, it avoids meaningless iterations, thereby drastically lowering the computational burden that may stem from iterative processing steps. Third, it dynamically adjusts to the complexity of the question and the information density of the documents. This flexibility allows our CompAct framework to be both effective and efficient across a wide range of scenarios, from simple questions to more complex, multi-hop questions requiring extensive context integration.

Figure 3: Distribution of iteration points where models determine the compressed contexts to be complete. The frequencies of completeness are accumulated over iterations. We compare the distribution between GPT-4o (Yellow) and CompAct (Green). We also measure the percentage of correctness in each iteration where each model judges the contexts to be complete.

3.3 Dataset Construction

We compress documents into a query-related context while concurrently determining the endpoint of the iterations. To cultivate this capability, we instruct a superior LLM to follow a three-step process. We provide the prompt in Table 10.

Sentence-Level Selection.

We begin by asking the LLM to analyze sentences, particularly focusing on relevant clues that may help answer the question. If certain sentences provide relevant information or implicitly clarify ambiguous points within the question, the LLM is prompted to generate these sentences from the provided documents.

Query-focused Compression.

We then generate a compressed text of the selected sentences based on the question. We explicitly restrict the LLM from making assumptions or attempting to conclude without supporting evidence, as follows: "DO NOT make assumptions or attempt to answer the question; your job is to summarize only.". This restriction is crucial because our main objective here is to condense relevant information from the provided documents, instead of directly answering the questions. Skipping the logical steps required to answer the question, as if relying on parametric knowledge, can harm the compression performance by increasing the likelihood of missing necessary information.

| Dataset | [COMPLETE] first | [COMPLETE] subsequent | [INCOMPLETE] first | [INCOMPLETE] subsequent | Total |
|---|---|---|---|---|---|
| HotpotQA | 7.2K | 7.2K | 7.2K | 7.2K | 28.8K |

Table 1: Statistics of our generated dataset. We categorize it into four cases: [COMPLETE] and [INCOMPLETE], each further split based on whether it is the first or subsequent iteration.

Determining the Early Termination.

We also prompt the LLM to evaluate its own compressed contexts based solely on the provided information, without any additional background context. We direct the LLM to generate a condition token (e.g., [COMPLETE] or [INCOMPLETE]) along with the rationale for its judgment.

Overall, we construct a synthetic dataset by instructing the LLM to follow the three-step process described above. Table 1 shows the dataset statistics. We conduct data construction from two scenarios: realistic and distractor. In realistic scenarios, provided documents are the results of a retrieval system. However, the retriever’s limited performance leads to infrequent appearances of gold documents, which can hinder the collection of cases with early termination. This results in a scarcity of cases in the dataset where the iteration is terminated early (i.e., [COMPLETE] at a subsequent iteration). To address this issue, we collect data from distractor scenarios which include predefined documents that contain all supporting facts needed to answer the question. After filtering the collected datasets from both scenarios, we build a training dataset consisting of 28k instances categorized into four distinct groups.

4 Experiment

| Methods | HotpotQA Comp. | HotpotQA EM | HotpotQA F1 | MuSiQue Comp. | MuSiQue EM | MuSiQue F1 | 2WikiMQA Comp. | 2WikiMQA EM | 2WikiMQA F1 | NQ Comp. | NQ EM | NQ F1 | TriviaQA Comp. | TriviaQA EM | TriviaQA F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Oracle | 10.8x | 39.9 | 51.2 | 10.3x | 14.2 | 23.6 | 11.0x | 37.4 | 43.2 | - | - | - | - | - | - |
| Raw Document | 1x | 29.4 | 40.3 | 1x | 6.5 | 15.6 | 1x | 25.4 | 31.2 | 1x | 39.0 | 51.3 | 1x | 68.9 | 77.1 |
| Long-Context LLM | | | | | | | | | | | | | | | |
| InternLM2-chat-7B | 1x | 8.0 | 20.3 | 1x | 1.0 | 6.8 | 1x | 9.3 | 19.5 | 1x | 7.6 | 22.6 | 1x | 12.1 | 31.5 |
| Mistral-7B-Instruct-v0.2 | 1x | 9.5 | 22.6 | 1x | 1.0 | 7.9 | 1x | 1.2 | 15.4 | 1x | 4.3 | 20.9 | 1x | 35.3 | 50.4 |
| FILM-7B | 1x | 32.4 | 43.7 | 1x | 6.9 | 15.7 | 1x | 26.4 | 31.7 | 1x | 38.2 | 50.8 | 1x | 62.7 | 71.7 |
| GPT-3.5-turbo | 1x | 32.8 | 43.8 | 1x | 7.3 | 16.1 | 1x | 28.6 | 33.9 | 1x | 40.8 | 54.6 | 1x | 69.9 | 77.4 |
| Compressor | | | | | | | | | | | | | | | |
| AutoCompressors | 35.4x | 18.4 | 28.4 | 34.7x | 3.9 | 11.9 | 36.2x | 19.0 | 24.5 | 34.4x | 17.3 | 31.8 | 34.5x | 55.3 | 64.3 |
| LongLLMLingua | 3.4x | 25.6 | 35.3 | 3.4x | 4.8 | 13.5 | 3.6x | 27.9 | 32.9 | 3.5x | 27.7 | 40.6 | 3.3x | 64.0 | 70.8 |
| RECOMP (extractive) | 34.3x | 29.7 | 39.9 | 32.7x | 6.7 | 15.7 | 35.9x | 29.9 | 34.9 | 32.7x | 34.6 | 45.1 | 39.2x | 67.6 | 74.1 |
| CompAct (Ours) | 47.6x | 35.5 | 46.9 | 37.2x | 8.7 | 18.1 | 51.2x | 31.0 | 37.1 | 48.5x | 38.4 | 50.0 | 49.4x | 65.4 | 74.9 |

Table 2: Main results. We set the reader as LLaMA3-8b (AI@Meta, 2024). We retrieve top-30 documents. We use three multi-document (HotpotQA, MuSiQue, and 2WikiMQA) and two single-document (NQ and TriviaQA) question-answering datasets. Since our training datasets consist of a subset of HotpotQA, we perform zero-shot evaluation on the rest of the datasets. Comp. refers to the compression rate which is denoted as follows: $\text{compression rate} = \frac{\text{\# of tokens in retrieved documents}}{\text{\# of tokens in compressed text}}$.
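The compression rate defined in the caption is a simple token-count ratio. The sketch below computes it under one loud assumption: tokens are counted by whitespace splitting, whereas the reported numbers presumably rely on a model tokenizer. The function name and the `tokenize` parameter are hypothetical.

```python
from typing import Callable, List


def compression_rate(
    retrieved_docs: List[str],
    compressed_text: str,
    tokenize: Callable[[str], List[str]] = str.split,  # assumption: whitespace tokens
) -> float:
    """Tokens in the retrieved documents divided by tokens in the compressed text."""
    n_retrieved = sum(len(tokenize(doc)) for doc in retrieved_docs)
    n_compressed = len(tokenize(compressed_text))
    if n_compressed == 0:
        raise ValueError("compressed text has no tokens")
    return n_retrieved / n_compressed
```

Passing a different `tokenize` callable lets the same ratio be computed with whichever tokenizer matches the reader model.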

4.1 Experimental Setup

Dataset Construction

We employ GPT-4o OpenAI (2024) API (2024-05-13) as the LLM to collect our dataset. We only use a subset of HotpotQA Yang et al. (2018) train set for data collection. To retrieve documents, we use Contriever Izacard et al. (2022), fine-tuned on MS-MARCO Bajaj et al. (2016), as our retrieval system on the 2018 Wikipedia corpus Karpukhin et al. (2020b). We set the default number of documents per segment $j$ as 5 and top-$k$ to 30, allowing for a maximum of 6 iterations per query. To prevent lengthy API responses, the maximum number of generated tokens is limited to 700.

Training & Inference

We perform supervised fine-tuning to train our model using the collected dataset. Without using specific labeling or methods for particular iterations, we focus on teaching the model to effectively update the previous context based on the question and given documents at the current step. We use instruction-tuned Mistral-7B Jiang et al. (2023a) as our backbone base model. At inference, we process the same number of segments and inputs as training. Further information is provided in Appendix A.2.

4.2 Datasets

We evaluate CompAct on both single-document and multi-document question-answering (QA) datasets. For single-document QA, we use Natural Questions (NQ) Kwiatkowski et al. (2019) and TriviaQA (TQA) Joshi et al. (2017). For multi-document QA, we evaluate on HotpotQA Yang et al. (2018), MuSiQue Trivedi et al. (2022), and 2WikiMultiHopQA Ho et al. (2020a). The evaluation is conducted on the dev set of each dataset, except for TriviaQA, which is evaluated on the test set. As mentioned, our training data comes only from HotpotQA. Therefore, we conduct zero-shot evaluation on the other datasets without accessing their training sets.

4.3 Baselines

In Table 2, we compare CompAct to several baseline methods. To ensure a fair comparison, we feed compressed contexts from each baseline to the same reader model, LLaMA3-8B (AI@Meta, 2024). We consider the following baselines. (1) Oracle. We provide the reader with documents that contain the answer to the questions. If such documents are not available, we include five documents as a default. (2) Raw Document. We simply concatenate the top-$k$ retrieved documents. (3) Long-Context LLM. As these LLMs are designed to handle large inputs, they align with our objective of managing extensive contexts, making them suitable for our baselines. We use Mistral-7B-Instruct-v0.2 (Jiang et al., 2023a), GPT-3.5-turbo-0125 (OpenAI, 2023), InternLM2-chat-7B (Cai et al., 2024), and FILM (An et al., 2024). (4) Compressor. We compare CompAct with three compression-based methods: AutoCompressors Chevalier et al. (2023), RECOMP Xu et al. (2024), and LongLLMLingua Jiang et al. (2023c). We provide details on how each baseline is used in Appendix A.3.

4.4 Results

We assess the performance of CompAct using three metrics: Compression rate (Comp.), Exact Match (EM), and F1 score. Overall, CompAct exhibits strong performance across all QA benchmark datasets, achieving the highest compression rate across all baselines. Specifically, it surpasses other compression-based methods in all three metrics, demonstrating its strong ability to compress abundant information ($\sim$3k tokens) efficiently.
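For readers unfamiliar with the two answer-quality metrics, the following sketch shows how EM and token-level F1 are commonly computed in extractive QA evaluation. It assumes the widely used SQuAD-style normalization (lowercasing, removing punctuation and English articles); this section does not state the exact normalization used in the experiments, so the details here are an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are typically the mean of these per-question values, often taking the maximum over multiple gold answers when a dataset provides them.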

CompAct falls short of the performance of GPT-3.5-turbo in single-document QA (NQ and TriviaQA), which may be due to our model being trained exclusively on a subset of HotpotQA. Even with this constraint, our framework outperforms existing compressors and achieves competitive performance with long-context LLMs. Plus, it represents entire contexts using significantly fewer tokens, highlighting its efficiency in providing compact representations. Moreover, in multi-document QA, CompAct achieves superior performance compared to the other baselines. This underscores the persistent challenge of integrating information across multiple documents and emphasizes how CompAct excels at such tasks.

5 Analysis

We investigate ways to facilitate the usage of CompAct as a plug-in module collaborating with diverse retrievers and readers (Section 5.1). We conduct an ablation study to assess the impact of components on performance (Section 5.2) and examine the cost-effectiveness of our framework using black-box proprietary models (Section 5.3). Finally, we discuss inference latency involved in our framework (Section 5.4).

5.1 Compressor as a Plug-in Module

In Figure 2, we depict the compressor as a plug-in module, highlighting that retrievers and readers can be easily replaced with other models. We investigate if CompAct can flexibly compress context provided by diverse retrievers, while preserving useful information regardless of various readers.

Generalizability across Retrievers.

Figure 4: Performance of HotpotQA using Contriever.

In Figures 4 and 6, we use Contriever (Izacard et al., 2022) and BM25 (Robertson et al., 2009), two of the most well-known retrievers, to supply the source documents. We evaluate our framework with 500 random samples from the HotpotQA (Yang et al., 2018) dev set, using different top-$k$. We compare our results with several baselines: gold documents (oracle), raw documents, and RECOMP (Xu et al., 2024).

Using the Contriever setup, where the retriever often fails to locate relevant documents at high-rank positions, increasing the top-$k$ leads to more distinct performance improvements. This shows that our framework effectively captures and utilizes valuable information from lower-ranked documents. Additionally, in the BM25 setup, CompAct shows consistent performance while retrieving up to top-40 documents. Notably, our framework achieves a similar saturated performance trend to the gold documents setup, indicating its competence in filtering noisy contexts. In both setups, CompAct achieves significantly higher performance compared to other baselines. As we intended, these observations demonstrate that CompAct shows robustness across various retriever setups.

Generalizability across Readers.

Figure 5: Performance of HotpotQA with different top-$k$ documents. We set the reader as GPT-3.5-Turbo.

We examine whether CompAct provides compressed text that generalizes across diverse readers. To this end, we assess the quality of our compressed texts with diverse reader LLMs: LLaMA2-13B (Touvron et al., 2023), LLaMA3-8B (AI@Meta, 2024), and GPT-3.5-Turbo (OpenAI, 2023). In Figure 5, we report the results of using GPT-3.5-Turbo as a reader. We provide LLaMA2-13B and LLaMA3-8B results in Table 7.

Our results show that CompAct delivers high-quality compressed texts applicable to different readers. We also demonstrate its effectiveness on the top-$k$ documents when $k$ is large. In Figure 5, there is little difference in performance up to top-20 between the raw documents setup and ours. We hypothesize this is attributed to the strong performance of the reader, GPT-3.5-Turbo, in processing contexts of moderate length. However, with the top-30 and top-40 raw documents, performance degrades as more documents are included, reflecting the difficulty of handling lengthy documents with increased noisy information. In contrast, CompAct exhibits only marginal performance degradation even with a higher number of documents.

Furthermore, CompAct achieves a high compression rate above 40x, which significantly reduces the number of input tokens, making it highly cost-effective for API operations. This efficiency, combined with its ability to maintain performance across diverse readers, underscores the superior capability of CompAct.

5.2 Ablation Studies

Component Effectiveness.

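| Reader | Components | HotpotQA Comp. | HotpotQA F1 | MuSiQue Comp. | MuSiQue F1 | 2WikiMQA Comp. | 2WikiMQA F1 |
|---|---|---|---|---|---|---|---|
| LLaMA3-8B | Rationale | 130.8x | 41.6 | 120.0x | 15.9 | 141.3x | 32.3 |
| LLaMA3-8B | CT | 47.5x | 48.3 | 36.5x | 19.1 | 52.2x | 36.2 |
| LLaMA3-8B | CT + Rationale | 33.6x | 47.3 | 27.1x | 19.0 | 36.4x | 35.6 |
| LLaMA2-13B | Rationale | 141.8x | 41.8 | 129.2x | 16.9 | 152.4x | 30.8 |
| LLaMA2-13B | CT | 48.1x | 48.5 | 37.0x | 18.6 | 52.7x | 35.6 |
| LLaMA2-13B | CT + Rationale | 34.6x | 47.3 | 28.0x | 18.6 | 37.4x | 34.2 |
| GPT-3.5-Turbo | Rationale | 135.2x | 38.0 | 123.5x | 13.8 | 146.2x | 24.0 |
| GPT-3.5-Turbo | CT | 48.1x | 49.2 | 37.0x | 20.9 | 53.0x | 34.0 |
| GPT-3.5-Turbo | CT + Rationale | 33.9x | 47.0 | 27.4x | 18.5 | 36.7x | 36.5 |

Table 3: Results of each component effectiveness. CT refers to the compressed text.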

CompAct actively compresses source documents by generating an intermediate compressed text (CT) with termination evaluation for each iteration. The evaluation consists of two components: a rationale explaining the reasons for termination and a condition token to decide the termination. To understand how each component affects end performance, we conduct an ablation study of components as shown in Table 3. When only the rationale is provided, the compression rate increases dramatically, but the end performance (EM & F1) significantly drops (Row 1). Conversely, when we only provide compressed text, we achieve the highest performance with most readers. However, when adding the rationale to the compressed text, performance declines in most cases. We hypothesize that some judgments in the rationale distract the readers from generating an answer purely from the compressed context. This could serve as a negative shortcut in the answering process, leading to decreased performance.

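| Model | Raw Cost | Raw F1 | RECOMP Cost | RECOMP F1 | Lingua* Cost | Lingua* F1 | CompAct Cost | CompAct F1 |
|---|---|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | 1.09 | 44.5 | 0.04 | 40.1 | 0.33 | 38.4 | 0.04 | 49.2 |
| GPT-4o | 10.75 | 55.8 | 0.43 | 48.1 | 3.31 | 47.6 | 0.28 | 56.0 |
| Claude-3.5 | 6.45 | 36.0 | 0.26 | 37.0 | 1.99 | 30.2 | 0.17 | 42.2 |
| Gemini-1.5-pro | 7.54 | 52.0 | 0.31 | 41.7 | 2.36 | 40.1 | 0.20 | 44.8 |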
Table 4: API cost of 500 samples from a HotpotQA dev set. Lingua* refers to LongLLMLingua. We assess the inference cost (USD) of each method when employing proprietary models as readers.

5.3 Cost Efficiency

To evaluate the cost-saving benefits, we employ four proprietary models as readers: GPT-3.5-Turbo OpenAI (2023), GPT-4o OpenAI (2024), Claude-3.5-sonnet Anthropic (2024), and Gemini-1.5-pro Google (2024). In Table 4, we show that our framework achieves the highest F1 among the compressors at the lowest cost (tied with RECOMP on GPT-3.5-Turbo). Surprisingly, CompAct remains competitive with the raw-documents setup even when the readers are stronger models known for exceptional long-context understanding. This indicates CompAct's strong capability in compressing contexts.
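As a quick worked check of the savings, the ratio between the raw-documents cost and the CompAct cost can be computed directly from the values in Table 4. This is a small illustrative sketch; the dictionary and function names are hypothetical.

```python
# (raw_cost_usd, compact_cost_usd) per reader, copied from Table 4.
TABLE4_COST = {
    "GPT-3.5-Turbo": (1.09, 0.04),
    "GPT-4o": (10.75, 0.28),
    "Claude-3.5": (6.45, 0.17),
    "Gemini-1.5-pro": (7.54, 0.20),
}


def cost_reduction(raw_cost: float, compressed_cost: float) -> float:
    """How many times cheaper the compressed setup is than raw documents."""
    if compressed_cost <= 0:
        raise ValueError("compressed_cost must be positive")
    return raw_cost / compressed_cost


reductions = {
    reader: cost_reduction(raw, compact)
    for reader, (raw, compact) in TABLE4_COST.items()
}
for reader, factor in reductions.items():
    print(f"{reader}: about {factor:.0f}x cheaper than raw documents")
```

For all four readers the reported costs correspond to a reduction of roughly 27x to 38x relative to the raw-documents setup.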

5.4 Inference Latency

While CompAct offers a significant cost-saving advantage, we also consider a potential increase in inference latency due to the active strategy of our framework. To investigate this, we measure the inference time of our framework and other baselines, as shown in Table 5.

We find that CompAct has a longer inference time than other compressors. However, this does not undermine the value of our framework. Although RECOMP Xu et al. (2024) compresses faster, it often loses a substantial amount of key information during compression and therefore falls short of the performance level of CompAct. Furthermore, our framework shows reading times almost equivalent to a closed-book setup (no documents), while fully leveraging the benefits of the retrieval-augmented approach. This indicates that despite the longer compression time, our framework excels in performance and reading efficiency. Additionally, with the help of a strong retrieval system that ranks informative contexts earlier, our framework has the potential to reduce latency through early termination.

| Methods | Compress | Read | Total / Throughput | F1 |
|---|---|---|---|---|
| No Documents | - | 1.5m | 1.5m / 5.58 | 31.7 |
| Raw Documents | - | 11.5m | 11.5m / 0.72 | 42.5 |
| RECOMP | 2.1m | 1.9m | 4.0m / 2.10 | 41.5 |
| LongLLMLingua | 32.3m | 3.3m | 35.6m / 0.23 | 35.5 |
| CompAct (5 docs) | 147.9m | 1.5m | 149.3m / 0.06 | 47.3 |
| CompAct (10 docs) | 77.2m | 1.9m | 79.1m / 0.11 | 45.4 |
Table 5: We measure GPU time taken to compress source documents and read the compressed texts. Also, we calculate the throughput (examples per second) and report the corresponding end performance (F1) to compare the benefit of CompAct with other baselines.
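The throughput column can be related to the total time with a short calculation. The sketch below assumes an evaluation set of 500 examples, which is not stated alongside Table 5 and is used here only because it gives values close to the reported ones; the function name is hypothetical.

```python
def throughput(num_examples: int, total_minutes: float) -> float:
    """Examples processed per second, given total GPU time in minutes."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return num_examples / (total_minutes * 60.0)


# ASSUMPTION: 500 evaluation examples (not stated with Table 5).
NUM_EXAMPLES = 500

# Total times in minutes, copied from Table 5.
total_minutes = {
    "RECOMP": 4.0,
    "CompAct (5 docs)": 149.3,
    "CompAct (10 docs)": 79.1,
}
estimated = {
    method: throughput(NUM_EXAMPLES, minutes)
    for method, minutes in total_minutes.items()
}
```

Because the table reports rounded times, values computed this way can differ slightly from the reported throughput.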

Speed up Inference of CompAct (increasing Segment Size).

Extending the segment size is one way to improve inference speed. Instead of retraining the model to handle more documents, we simply increase the number of documents provided per iteration. Specifically, we feed 10 documents per iteration to CompAct, which was originally trained to compress only 5 documents at a time. We observe a slight performance degradation when the segment size exceeds the one seen during training (F1 of 45.4 versus 47.3 in Table 5), but performance remains high. Adopting this approach can reduce inference time in our framework.
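The interaction between segment size and early termination can be sketched as a simple loop. This is a minimal illustration, not the released implementation: `compress_fn` stands in for the fine-tuned compressor, and `toy_compress` is a hypothetical stub used only to make the example runnable.

```python
from typing import Callable, List, Tuple

# (question, segment, previous_summary, previous_evaluation)
#   -> (summary, evaluation, complete)
CompressFn = Callable[[str, List[str], str, str], Tuple[str, str, bool]]


def compress_actively(
    question: str,
    documents: List[str],
    compress_fn: CompressFn,
    segment_size: int = 5,
) -> Tuple[str, int]:
    """Compress ranked documents segment by segment, stopping once complete.

    Returns the final compressed text and the number of iterations used.
    """
    summary, evaluation = "", ""
    iterations = 0
    for start in range(0, len(documents), segment_size):
        segment = documents[start:start + segment_size]
        summary, evaluation, complete = compress_fn(
            question, segment, summary, evaluation
        )
        iterations += 1
        if complete:  # early termination on the [COMPLETE] condition
            break
    return summary, iterations


def toy_compress(question, segment, prev_summary, prev_evaluation):
    """Hypothetical stand-in for the model: keeps documents with a keyword."""
    kept = [doc for doc in segment if "eldest" in doc]
    summary = " ".join([prev_summary] + kept).strip()
    complete = bool(kept)
    evaluation = "[COMPLETE]" if complete else "[INCOMPLETE]"
    return summary, evaluation, complete


docs = [f"filler document {i}" for i in range(30)]
docs[12] = "The eldest brother is X."
result_5 = compress_actively("Who is the eldest brother?", docs, toy_compress, 5)
result_10 = compress_actively("Who is the eldest brother?", docs, toy_compress, 10)
```

In this toy setting the relevant document sits at rank 13, so a segment size of 5 needs three compressor calls while a segment size of 10 needs two, which illustrates why larger segments shorten inference.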

6 Conclusion

We introduce CompAct, a novel framework that employs an active strategy to compress extensive retrieved documents. Our framework effectively captures pivotal information from a large number of documents by dynamically retaining essential contexts and incorporating information. We demonstrate that CompAct significantly outperforms existing compressors, showing a large performance gap with a higher compression rate in multi-document question-answering benchmarks. Furthermore, it serves as a convenient plug-in module that can seamlessly collaborate with various off-the-shelf retrievers and readers while providing cost-saving benefits.

Limitations

We acknowledge that CompAct has a longer inference time when processing retrieved documents than other compressors. Given that our framework addresses complex question types, which is a pioneering direction in the field of compression, we believe that future research can build upon CompAct to reduce this latency.

Additionally, even a strong proprietary model like GPT-4o can make mistakes when determining the completeness of given contexts. There may still be error cases in our data construction process, although we attempt to address this issue by filtering them out.

Lastly, we only use Mistral-7B-Instruct-v0.2 as our base model due to resource limitations. Verifying whether CompAct works well across a range of model sizes, both smaller (< 7B) and larger (> 7B), could lead to interesting findings.

Ethics Statement

Our training process can incur significant environmental costs due to its computationally intensive nature. To mitigate this, we fine-tune a single Mistral model to minimize computational expenses. Furthermore, a potential risk of this work is that the generated dataset may contain biases from API calls, such as stereotypes related to race and gender. To our knowledge, no significant issues have been reported when creating question-answering datasets. However, it would be beneficial to apply methods that robustly train or validate against such concerns.

Acknowledgments

This work was supported in part by the National Research Foundation of Korea [NRF2023R1A2C3004176], the Ministry of Health & Welfare, Republic of Korea [HR20C002103], the Ministry of Science and ICT (MSIT) [RS-2023-00220195], and the ICT Creative Consilience program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the MSIT [IITP-2024-2020-0-01819].

References

  • AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
  • An et al. (2024) Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, and Jian-Guang Lou. 2024. Make your llm fully utilize the context. arXiv preprint arXiv:2404.16811.
  • Anthropic (2024) Anthropic. 2024. claude-3.5-sonnet.
  • Bajaj et al. (2016) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • Cai et al. (2024) Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, and Dahua Lin. 2024. Internlm2 technical report. Preprint, arXiv:2403.17297.
  • Cao et al. (2024) Zhiwei Cao, Qian Cao, Yu Lu, Ningxin Peng, Luyang Huang, Shanbo Cheng, and Jinsong Su. 2024. Retaining key information under high compression ratios: Query-guided compressor for llms. arXiv preprint arXiv:2406.02376.
  • Chen et al. (2020) Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Yang Wang. 2020. Hybridqa: A dataset of multi-hop question answering over tabular and textual data. In Findings of the Association for Computational Linguistics: EMNLP 2020.
  • Cheng et al. (2024) Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. 2024. xrag: Extreme context compression for retrieval-augmented generation with one token. arXiv preprint arXiv:2405.13792.
  • Chevalier et al. (2023) Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  • Ge et al. (2024) Tao Ge, Hu Jing, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. 2024. In-context autoencoder for context compression in a large language model. In The Twelfth International Conference on Learning Representations.
  • Google (2024) Google. 2024. gemini-1.5-pro.
  • Gutiérrez et al. (2024) Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. Hipporag: Neurobiologically inspired long-term memory for large language models. arXiv preprint arXiv:2405.14831.
  • Ho et al. (2020a) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020a. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  • Ho et al. (2020b) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020b. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics.
  • Izacard et al. (2022) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
  • Izacard et al. (2023) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research.
  • Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023a. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Jiang et al. (2023b) Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023b. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore. Association for Computational Linguistics.
  • Jiang et al. (2023c) Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023c. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
  • Karpukhin et al. (2020a) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020a. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
  • Karpukhin et al. (2020b) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020b. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Khandelwal et al. (2019) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems.
  • Li et al. (2024) Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, and Wenhu Chen. 2024. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060.
  • Li et al. (2023) Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore. Association for Computational Linguistics.
  • Liu et al. (2024) Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics.
  • Mavi et al. (2022) Vaibhav Mavi, Anubhav Jangra, and Adam Jatowt. 2022. A survey on multi-hop question answering and generation. arXiv preprint arXiv:2204.09140.
  • Mu et al. (2024) Jesse Mu, Xiang Lisa Li, and Noah Goodman. 2024. Learning to compress prompts with gist tokens. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Curran Associates Inc.
  • OpenAI (2023) OpenAI. 2023. Chatgpt.
  • OpenAI (2024) OpenAI. 2024. Gpt-4o.
  • Pan et al. (2024) Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, et al. 2024. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. arXiv preprint arXiv:2403.12968.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
  • Qian et al. (2024) Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Yujia Zhou, Xu Chen, and Zhicheng Dou. 2024. Are long-llms a necessity for long-context tasks? arXiv preprint arXiv:2405.15318.
  • Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Trivedi et al. (2022) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics.
  • Tunstall et al. (2023) Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Shengyi Huang, Kashif Rasul, Alexander M. Rush, and Thomas Wolf. 2023. The alignment handbook. https://github.com/huggingface/alignment-handbook.
  • Wang et al. (2023) Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Xu et al. (2024) Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2024. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • Zhang et al. (2024) Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö. Arik. 2024. Chain of agents: Large language models collaborating on long-context tasks. Preprint, arXiv:2406.02818.

Appendix A Example Appendix

| Dataset | Train | Dev | Test | Avg. # of Supporting Documents | # of Pre-defined Context |
|---|---|---|---|---|---|
| NaturalQuestions (Kwiatkowski et al., 2019) | 79,168 | 8,757 | 3,610 | - | - |
| TriviaQA (Joshi et al., 2017) | 78,785 | 8,837 | 11,313 | - | - |
| HotpotQA (Yang et al., 2018) | 90,447 | 7,405 | - | 2 | 10 |
| MuSiQue (Trivedi et al., 2022) | 39,876 | 4,834 | 4,918 | 1.89 (Dev) | 20 |
| 2WikiMultiHopQA (Ho et al., 2020a) | 167,454 | 12,576 | 12,576 | 2.44 (Dev) | 10 |
Table 6: Statistics of multi-hop and single-hop question answering datasets.

A.1 Practicality of Compressing Contexts

Sequence Length Language Models (%)
128 14.7
512 62.8
≥ 1024 22.5
Table 7: Huggingface Models Statistics. 77.5% of models cannot receive at least top-5 documents as input. We select frequently-used models downloaded at least 1M in https://huggingface.co/Models.

To underline the practicality of providing context with fewer tokens, we present an additional observation that reinforces the necessity of our research. In Table 7, we investigate the maximum input length of language models with over 1 million downloads on Huggingface (https://huggingface.co/Models). We find that 77.5% of these models can only afford inputs of 512 tokens or fewer. Despite ongoing research trends on LLMs capable of handling long texts, it is evident that many users still frequently employ models with smaller input lengths. Considering the current state, CompAct offers substantial benefits to such models by allowing them to access more information, effectively acting as a bridge.

Average Length of Compressed Text per Iteration.

In Table 8, we provide the average length of compressed texts per iteration. Although the token length tends to increase slightly with each iteration, CompAct maintains a high compression rate on average, compressing 30 documents into under 200 tokens.

A.2 Training & Inference Details

We use 4 NVIDIA A100 GPUs with 80GB memory to train our CompAct framework. Our code is written in PyTorch (Paszke et al., 2019) and HuggingFace Transformers (Wolf et al., 2019). We perform supervised fine-tuning using the published alignment-handbook (Tunstall et al., 2023). We train the model with the Adam optimizer (Kingma and Ba, 2015), using a learning rate of 2e-6, a batch size of 64, and a warmup ratio of 0.1 for 7 epochs. For inference, we use batch decoding to reduce latency.
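For reference, the stated hyperparameters can be collected in a small configuration object. This is only a sketch: it assumes the batch size of 64 is the global batch size, and the class, its methods, and the example dataset size are hypothetical rather than taken from the alignment-handbook configuration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as stated in the text (names are illustrative)."""

    learning_rate: float = 2e-6
    global_batch_size: int = 64  # ASSUMPTION: 64 is the global batch size
    warmup_ratio: float = 0.1
    num_epochs: int = 7
    num_gpus: int = 4

    def total_steps(self, num_train_examples: int) -> int:
        """Optimizer steps over all epochs for a given training set size."""
        steps_per_epoch = math.ceil(num_train_examples / self.global_batch_size)
        return steps_per_epoch * self.num_epochs

    def warmup_steps(self, num_train_examples: int) -> int:
        """Number of steps spent in learning-rate warmup."""
        return round(self.warmup_ratio * self.total_steps(num_train_examples))


config = FineTuneConfig()
# Hypothetical training set size, chosen only to illustrate the arithmetic.
steps = config.total_steps(6400)
warmup = config.warmup_steps(6400)
```

With a warmup ratio of 0.1, one tenth of all optimizer steps are spent ramping up the learning rate, regardless of the training set size.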

A.3 Details of Baselines

Long-context LLMs.

(1) InternLM2-chat-7B Cai et al. (2024) has shown near-perfect performance on the Needle-in-the-Haystack task, which tests how well a model utilizes information within a long context. (2) Mistral-7B-Instruct-v0.2 Jiang et al. (2023a) has recently shown strong performance across various benchmarks and supports a 32k context window. (3) FILM-7B An et al. (2024), trained with a synthetic long-context question-answering dataset, has shown strong performance on tasks that require information awareness in the long context. (4) We also experiment with GPT-3.5-turbo, a popular proprietary LLM that supports a 16k context window.

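| Datasets | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Iteration 6 |
|---|---|---|---|---|---|---|
| HotpotQA | 78.1 | 114.1 | 128.5 | 126.5 | 135.9 | 147.5 |
| MuSiQue | 77.5 | 110.6 | 135.2 | 91.6 | 145.6 | 124.0 |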
Table 8: Average token length of compressed texts per iteration. 5 documents are compressed for each iteration, as default setup of our framework.

Compressors.

(5) AutoCompressors Chevalier et al. (2023) process segments of long context into soft prompts, which are prepended to the next segment as summary vectors. We use 50 summary tokens for every 2,048 tokens, following the setup from the original paper. (6) LongLLMLingua Jiang et al. (2023c) takes a perplexity-based approach to filter out tokens with less importance. (7) RECOMP Xu et al. (2024) suggests an extractive compressor that extracts relevant sentences using a dual encoder model, and an abstractive compressor that summarizes documents using an encoder-decoder model. We experiment with the extractive compressor setting, selecting 4 sentences from documents to ensure a fair comparison at similar text lengths.

Figure 6: Performance of HotpotQA using BM25.
Figure 7: Performance of HotpotQA with different top-k documents. We set the reader as LLaMA2-13B (Left) and LLaMA3-8B (Right).
First Iteration:
1. Generate a summary of source documents to answer the question. Ensure the summary is under 200 words and does not include any pronouns. DO NOT make assumptions or attempt to answer the question; your job is to summarize only.
2. Evaluate the summary based solely on the information of it, without any additional background context: if it lacks sufficient details to answer the question, print [INCOMPLETE]. If it provides all necessary details, print [COMPLETE]. You should provide the reason of the evaluation.
Question: [QUESTION]
Source documents: [SOURCE DOCUMENTS]
Summary:
Subsequent Iterations:
1. Generate a summary of the source documents and the previous summary to answer the question based on the evaluation of the previous summary. The evaluation indicates the missing information needed to answer the question. Ensure the summary is under 200 words and does not include any pronouns. DO NOT make assumptions or attempt to answer the question; your job is to summarize only.
2. Evaluate the summary based solely on the information of it, without any additional background context: if it lacks sufficient details to answer the question, print [INCOMPLETE]. If it provides all necessary details, print [COMPLETE]. You should provide the reason of the evaluation.
Question: [QUESTION]
Evaluation of previous summary: [EVALUATION OF PREVIOUS SUMMARY]
Previous summary: [PREVIOUS SUMMARY]
Source documents: [SOURCE DOCUMENTS]
Summary:
Table 9: Prompts used in CompAct
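The two prompt variants above differ only in the extra fields carried over from the previous iteration. A minimal sketch of assembling the input fields is given below; the function and parameter names are hypothetical, and the instruction text is passed in as a string rather than reproduced here.

```python
from typing import List, Optional


def build_input(
    question: str,
    documents: List[str],
    instructions: str,
    previous_summary: Optional[str] = None,
    previous_evaluation: Optional[str] = None,
) -> str:
    """Assemble the compressor input for the first or a subsequent iteration.

    If previous_summary is None, the first-iteration layout is used;
    otherwise the previous evaluation and summary fields are included.
    """
    parts = [instructions, f"Question: {question}"]
    if previous_summary is not None:
        parts.append(f"Evaluation of previous summary: {previous_evaluation or ''}")
        parts.append(f"Previous summary: {previous_summary}")
    parts.append("Source documents: " + "\n".join(documents))
    parts.append("Summary:")
    return "\n".join(parts)


first_input = build_input("Q?", ["doc 1", "doc 2"], "<first-iteration instructions>")
next_input = build_input(
    "Q?",
    ["doc 6"],
    "<subsequent-iteration instructions>",
    previous_summary="summary so far",
    previous_evaluation="missing detail [INCOMPLETE]",
)
```

Keeping the field order identical to the prompts in Table 9 matters because the compressor was fine-tuned on inputs in that layout.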
Source sentences: [SOURCE SENTENCES]
Reference sentences: [REFERENCE SENTENCES]
Question: [QUESTION]
Follow instructions below.
1. Choose 0 to 3 sentences that directly address the critical points needed to answer the question. Additionally, include 0 to 3 sentences that provide useful context, even if they do not directly answer the question. Ensure that you avoid selecting multiple sentences with overlapping content. (prefix: Sentences)
2. Generate a summary of reference sentences and chosen sentences (prefix: Summary). Ensure the summary is under 200 words and does not include any pronouns. DO NOT make assumptions or attempt to answer the question; your job is to summarize only.
3. Evaluate the summary based solely on the information of it, without any additional background context: if it lacks sufficient details to answer the question, print [INCOMPLETE]. If it provides all necessary details, print [COMPLETE]. You should provide the reason of evaluation (prefix: Evaluation)
Sentences:
Table 10: Prompts for dataset construction
Question: Who was the eldest brother of the Mexican drug trafficker born 12 March 1952?
Model: CompAct (Ours)
Iter 1
Document (Rank 3): Benjamín Arellano Félix (born 12 March 1952) is a Mexican drug trafficker and former leader of the Mexican criminal organization known as the Tijuana Cartel or "Arellano-Félix Organization". Benjamín Arellano Félix, who worked closely with his brothers, was one of Mexico’s most powerful drug lords and the supplier of one-third of the U.S.’s cocaine. Benjamín had six brothers: He also has four sisters. Two of them, Alicia and Enedina, are most active in the cartel’s affairs. Benjamín was first arrested on 18 June 1982, in Downey, California, for receiving 100 kilos of cocaine smuggled through the San
Compressed Text: Benjamín Arellano Félix, born on 12 March 1952, is a Mexican drug trafficker and former leader of the Tijuana Cartel. He had six brothers, including He, and four sisters. He also has four sisters.
Evaluation: The summary provides information about Benjamín Arellano Félix and his siblings but does not specify who the eldest brother is. [INCOMPLETE]
Iter 2
Document (Rank 7): Francisco Rafael Arellano Félix (24 October 1949 - 18 October 2013) was a Mexican drug lord and former leader of the Tijuana Cartel, a drug trafficking organization. He was the oldest of seven brothers and headed the criminal organization early in the 1990s alongside them. Through his brother Benjamín Francisco Rafael joined the Tijuana Cartel in 1989 following the arrest of Miguel Ángel Félix Gallardo, one of the most prominent drug czars in Mexico during the 1980s. When the Arellano Félix took control of the organization in the early 1990s, tensions with the rival Sinaloa Cartel
Compressed Text: Francisco Rafael Arellano Félix, born on 24 October 1949, was a Mexican drug lord and former leader of the Tijuana Cartel. He was the oldest of seven brothers and headed the criminal organization early in the 1990s alongside them. Benjamín Arellano Félix, born on 12 March 1952, is a Mexican drug trafficker and former leader of the Tijuana Cartel. He had six brothers, including He, and four sisters. He also has four sisters.
Evaluation: The summary provides all necessary details to answer the question. It identifies Francisco Rafael Arellano Félix as the oldest brother of Benjamín Arellano Félix, who was born on 12 March 1952. [COMPLETE]
Answer: Francisco Rafael Arellano Félix (Correct)
Model: Recomp (Xu et al., 2024)
Summary: Miguel Rodríguez Orejuela He is the younger brother of Gilberto Rodríguez Orejuela.
Roberto de Jesús Escobar Gaviria Roberto de Jesús Escobar Gaviria Roberto de Jesús Escobar Gaviria (born January 13, 1947), nicknamed El Osito (Little Bear), was the brother of the drug kingpin, Pablo Escobar, and the former accountant of the Medellín Cartel, which was responsible for up to 80 percent of the cocaine smuggled into the United States.
Answer: Fabio Ochoa Vásquez (Wrong)
Table 11: A compression example of CompAct and comparison with another compressor (Recomp)