What's the best way to speed up inference? #8878
🤖 To speed up inference, you can replace the local HuggingFaceLLM with HuggingFaceInferenceAPI, which offloads generation to Hugging Face's hosted Inference API instead of running the model on your own machine. Here's an example of how you can do it:

from llama_index import ServiceContext
from llama_index.llms import HuggingFaceInferenceAPI, ChatMessage, MessageRole
from llama_index.prompts import ChatPromptTemplate
# Define your chat templates
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Always answer the question, even if the context isn't helpful."
        ),
    ),
    ChatMessage(
        role=MessageRole.USER,
        content=(
            "Context information is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Given the context information and not prior knowledge, "
            "answer the question: {query_str}\n"
        ),
    ),
]
text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
llm = HuggingFaceInferenceAPI(
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    context_window=4096,
    num_output=256,  # caps the number of generated tokens (replaces max_new_tokens)
    # token="hf_...",  # your Hugging Face API token, if this model requires one
)
# Note: the local-loading arguments from the HuggingFaceLLM example (device_map,
# tokenizer_name, tokenizer_kwargs, model_kwargs, stopping_ids, generate_kwargs)
# are dropped here, since generation runs remotely on the Inference API.
# The ChatPromptTemplate above takes the place of system_prompt / query_wrapper_prompt.
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm)

In this example, a custom chat template is defined and passed to the query engine (see the sketch below), which ensures the correct chat template is used when making queries. Note that text_qa_template is not a ServiceContext argument; it is supplied when you build the query engine. The HuggingFaceInferenceAPI class is indeed available in the llama_index.llms module, and its test cases demonstrate how to use it for chat and text completion tasks, so you can refer to those examples when wiring it into your application.
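As a rough sketch of how the template is used at query time (the ./data directory and the query string are placeholders; VectorStoreIndex, SimpleDirectoryReader, and as_query_engine are assumed to come from the same llama_index version as above):

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Build an index over your documents using the service context defined above
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Pass the chat template to the query engine so every query uses it
query_engine = index.as_query_engine(text_qa_template=text_qa_template)
response = query_engine.query("What does the context say about X?")
print(response)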
LlamaIndex also handles the system prompt and query wrapping for the HuggingFaceInferenceAPI, and the sketch below shows one way to check that behaviour directly. I hope this helps! If you have any more questions, feel free to ask.
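If you want to sanity-check the prompt handling, here is a minimal sketch that calls the LLM directly, outside of any query engine (the message contents are just illustrations):

# Plain text completion
completion = llm.complete("Paul Graham is ")
print(completion.text)

# Chat, with an explicit system message, to verify the system prompt is applied
messages = [
    ChatMessage(role=MessageRole.SYSTEM, content="You are a concise assistant."),
    ChatMessage(role=MessageRole.USER, content="What is llama-index?"),
]
chat_response = llm.chat(messages)
print(chat_response.message.content)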
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
What's the best way to speed up local inference?
How might you, for example, replace the HF model with a transformers pipeline, to use the latest DeepSpeed optimizations?
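(For the DeepSpeed route, one possible shape might be the rough, unverified sketch below. It assumes a CUDA GPU, that DeepSpeed-Inference kernel injection supports this model, and that HuggingFaceLLM accepts a pre-loaded model and tokenizer:)

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM

model_name = "StabilityAI/stablelm-tuned-alpha-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Wrap the model with DeepSpeed-Inference optimized kernels
ds_engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Hand the optimized module to HuggingFaceLLM instead of a model name
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    model=ds_engine.module,
    tokenizer=tokenizer,
)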
Alternatively, how do you use the HuggingFaceInferenceAPI? How can you set ports? How can you make sure that the proper chat template is being passed in, and that llama-index is indeed setting the system prompt?
i.e.: how do I convert this example to an equivalent HuggingFaceInferenceAPI example?
i.e. #6608 and #6647
^ the code from above: