
Using _to_llama_similarities() normalization technique may lead to low-quality retrieved documents / chunks #12567

Closed
1 task done
Subham0793 opened this issue Apr 4, 2024 · 4 comments
Labels
question Further information is requested

Comments

@Subham0793

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Because of the _to_llama_similarities function, the scores are normalized and the maximum score is always 1.0.

What is the reason for this normalization technique, and why is it called by default?

The problem arises when a query unrelated to a particular document is used to retrieve chunks: the results will still contain scores starting at 1.0, even when the chunks bear no relation to the query.

Code snippet:

return VectorStoreQueryResult(
    nodes=top_k_nodes,
    ids=top_k_ids,
    similarities=_to_llama_similarities(top_k_scores),
)

Line no. 623 in

https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/vector_stores/llama-index-vector-stores-elasticsearch/llama_index/vector_stores/elasticsearch/base.py

Isn't this a bug? Why was this specifically designed to keep the maximum value at 1.0? Say we are retrieving the top 5 docs for a query: the doc at k=1 need not have a score of 1.0; it could have a score of 0.85. The problem is that we end up assigning high scores to documents which are not actually that similar to the query. A small sketch of the effect is below.
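To make the effect concrete, here is a minimal sketch (not the exact library code) of the kind of max-based normalization this issue describes, assuming it follows an exp(score - max) form; the top-scoring chunk always maps to exp(0) == 1.0, no matter how weak the best match actually is:

```python
# Illustrative sketch only; the exact implementation in the Elasticsearch
# vector store may differ in detail.
import numpy as np

def to_llama_similarities_sketch(scores: list[float]) -> list[float]:
    if not scores:
        return []
    arr = np.array(scores)
    # Subtracting the max means the best-scoring chunk becomes exp(0) = 1.0.
    return np.exp(arr - np.max(arr)).tolist()

# Even for a query unrelated to the corpus, the "best" chunk still reads as 1.0:
print(to_llama_similarities_sketch([0.12, 0.10, 0.07]))
# -> [1.0, ~0.980, ~0.951]
```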

I had posted a similar question earlier but did not get a satisfactory reply.

@Subham0793 Subham0793 added the question Further information is requested label Apr 4, 2024

dosubot bot commented Apr 4, 2024

Hey @Subham0793, welcome back! 👋 It's always interesting to dive into these kinds of challenges with you. I'm rolling up my sleeves to take a closer look at the _to_llama_similarities() function and the normalization technique. Hang tight, I'll get back to you with a full analysis shortly.

@logan-markewich
Collaborator

Please make a PR if you find this to be a bug. I agree, but don't have elastic to test with


dosubot bot commented Apr 4, 2024

Hey @Subham0793,

I've taken a deep dive into your query regarding the _to_llama_similarities() function and its normalization technique. It's a great question and touches on a fundamental aspect of how similarity scores are interpreted and utilized within the context of document retrieval systems.

The normalization to a maximum score of 1.0 is a common practice in many retrieval systems, not just in LlamaIndex. The primary reason for this approach is to standardize output scores across different queries, making them easier to interpret and compare. Scaling the highest score to 1.0 keeps the scores within a predictable range, which is particularly useful when integrating with systems that expect similarity scores to fall within a specific range.

However, your concern about the potential for high scores for documents that are not significantly similar to the query is valid. The normalization process does indeed mean that the highest-scoring document will always receive a score of 1.0, regardless of its absolute similarity to the query. This can be misleading in scenarios where the most relevant document is not highly similar to the query, as it might give the impression of a high degree of relevance that isn't actually there.

The decision to implement this normalization by default likely stems from a balance between standardization and practical utility. For many applications, having a consistent scoring range is beneficial and simplifies downstream processing. However, for more nuanced applications where the absolute similarity scores are critical, this approach might not be ideal.

Given the feedback from the maintainer, Logan Markewich, suggesting to make a PR if you find this to be a bug, it seems there is openness to revisiting this design choice. It might be worth exploring alternative approaches that retain the benefits of score normalization while providing more granularity or context about the absolute similarity levels. For instance, including additional metrics or flags that indicate the confidence level of the similarity score could be one way to address this concern.
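As a purely illustrative sketch of what such an alternative could look like (these helper names are hypothetical and not part of llama_index), one option is to pass the backend's scores through unchanged, or merely clamp them into [0, 1], so an unrelated query never surfaces a chunk with an artificial 1.0 similarity:

```python
# Hypothetical alternatives, not existing llama_index APIs.
from typing import List, Optional

def passthrough_similarities(scores: Optional[List[float]]) -> List[float]:
    """Return the backend's scores as-is (ordering is preserved upstream)."""
    return list(scores) if scores else []

def clipped_similarities(scores: Optional[List[float]]) -> List[float]:
    """Clamp scores into [0, 1] without rescaling the best hit to exactly 1.0."""
    return [min(max(s, 0.0), 1.0) for s in scores] if scores else []
```

Either variant keeps the relative ordering of results intact, which is what downstream retrieval typically relies on, while leaving the absolute scores meaningful.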

In summary, the normalization technique used in _to_llama_similarities() is a design choice aimed at standardizing score outputs, but your observation highlights an important limitation in scenarios where absolute similarity levels are crucial. It might be an opportune time to contribute to the project by suggesting enhancements or modifications that can accommodate a wider range of use cases.

If you have any more questions or need further clarification, feel free to reach out!

Best,
Dosu

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@logan-markewich
Collaborator

as long as they are sorted most to least similar, I don't have a preference

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 4, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 11, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 11, 2024