What's the best way to speed up inference? #8878
🤖 To speed up inference, you can replace the local HuggingFaceLLM with HuggingFaceInferenceAPI, which offloads generation to Hugging Face's hosted Inference API instead of running the model on your own machine. Here's an example of how you can do it:

from llama_index import ServiceContext
from llama_index.llms import HuggingFaceInferenceAPI, ChatMessage, MessageRole
from llama_index.prompts import ChatPromptTemplate
# Define your chat templates
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Always answer the question, even if the context isn't helpful."
        ),
    ),
    ChatMessage(
        role=MessageRole.USER,
        content=(
            "Context information is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Given the context information and not prior knowledge, "
            "answer the question: {query_str}\n"
        ),
    ),
]
text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
llm = HuggingFaceInferenceAPI(
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    context_window=4096,
    num_output=256,  # caps the number of generated tokens (replaces max_new_tokens)
    # token="hf_...",  # your Hugging Face API token, if this model requires one
)
# Note: the local-loading arguments from the HuggingFaceLLM example (device_map,
# tokenizer_name, tokenizer_kwargs, model_kwargs, stopping_ids, generate_kwargs)
# are dropped here, since generation runs remotely on the Inference API.
# The ChatPromptTemplate above takes the place of system_prompt / query_wrapper_prompt.
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm)

In this example, a custom chat template is defined and passed to the query engine (see the sketch below), which ensures the correct chat template is used when making queries. Note that text_qa_template is not a ServiceContext argument; it is supplied when you build the query engine. The HuggingFaceInferenceAPI class is indeed available in the llama_index.llms module, and its test cases demonstrate how to use it for chat and text completion tasks, so you can refer to those examples when wiring it into your application.
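As a rough sketch of how the template is used at query time (the ./data directory and the query string are placeholders; VectorStoreIndex, SimpleDirectoryReader, and as_query_engine are assumed to come from the same llama_index version as above):

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Build an index over your documents using the service context defined above
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Pass the chat template to the query engine so every query uses it
query_engine = index.as_query_engine(text_qa_template=text_qa_template)
response = query_engine.query("What does the context say about X?")
print(response)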
LlamaIndex also handles the system prompt and query wrapping for the HuggingFaceInferenceAPI, and the sketch below shows one way to check that behaviour directly. I hope this helps! If you have any more questions, feel free to ask.
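If you want to sanity-check the prompt handling, here is a minimal sketch that calls the LLM directly, outside of any query engine (the message contents are just illustrations):

# Plain text completion
completion = llm.complete("Paul Graham is ")
print(completion.text)

# Chat, with an explicit system message, to verify the system prompt is applied
messages = [
    ChatMessage(role=MessageRole.SYSTEM, content="You are a concise assistant."),
    ChatMessage(role=MessageRole.USER, content="What is llama-index?"),
]
chat_response = llm.chat(messages)
print(chat_response.message.content)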
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
What's the best way to speed up local inference?
How might you, for example, replace the HF model with a transformers pipeline, to use the latest DeepSpeed optimizations?
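(For the DeepSpeed route, one possible shape might be the rough, unverified sketch below. It assumes a CUDA GPU, that DeepSpeed-Inference kernel injection supports this model, and that HuggingFaceLLM accepts a pre-loaded model and tokenizer:)

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM

model_name = "StabilityAI/stablelm-tuned-alpha-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Wrap the model with DeepSpeed-Inference optimized kernels
ds_engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Hand the optimized module to HuggingFaceLLM instead of a model name
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    model=ds_engine.module,
    tokenizer=tokenizer,
)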
Alternatively, how do you use the HuggingFaceInferenceAPI? How can you set ports? How can you make sure that the proper chat template is being passed in, and that llama-index is indeed setting the system prompt?
i.e.: how do I convert this example to an equivalent HuggingFaceInferenceAPI example?
i.e. #6608 and #6647
^ the code from above: