
[Bug]: Scores in the retrieved nodes are in reversed order in the Weaviate integration #14728

Open
terilias opened this issue Jul 12, 2024 · 4 comments

terilias commented Jul 12, 2024

Bug Description

Hello,
I was using the retriever from a vector store index that was initialized from a Weaviate collection. I noticed that the retrieved nodes have scores in reversed order: the first (most relevant) node has a score equal to zero, and as we move to the less relevant nodes, the score increases.

We found in the code that LlamaIndex computes 1 - score, where score is the value that Weaviate returns. But Weaviate now returns a similarity score instead of a distance. I think that only in vector search (as opposed to hybrid search) can a distance be returned instead of a similarity (see here). You can use the code I provide below (from a Jupyter Notebook) to compare the scores that LlamaIndex gives with the scores that Weaviate returns.
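
For reference, the conversion inside the integration is roughly the following. This is a minimal sketch, not the actual source; the helper name and flag are illustrative only. The point is that 1 - score is correct for a distance but inverts a similarity:

# Hypothetical sketch of the score conversion (names are illustrative).
def to_similarity(weaviate_score: float, score_is_distance: bool) -> float:
    if score_is_distance:
        # distance -> similarity: the subtraction is correct here
        return 1.0 - weaviate_score
    # hybrid search already returns a similarity; subtracting here
    # turns the best match (score 1.0) into 1.0 - 1.0 = 0.0
    return weaviate_score

With the current behavior the second branch is effectively missing, which is exactly what the output below shows: the top hit comes back with a score of 0.0.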

Version

llama-index==0.10.53
llama-index-vector-stores-weaviate==1.0.0
weaviate-client==4.6.5

Steps to Reproduce

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Document
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.core.schema import TextNode

from llama_index.embeddings.text_embeddings_inference import TextEmbeddingsInference
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.node_parser import SimpleNodeParser

import weaviate
import os

from transformers import AutoTokenizer, AutoModel
import tiktoken
import requests
from IPython.display import Markdown, display


# In[ ]:
# Embeddings initialization: OpenAI
embed_model = OpenAIEmbedding(model="text-embedding-3-small", api_key=os.environ.get("OPEN_AI_API_KEY"))
tokenizer = tiktoken.encoding_for_model("text-embedding-3-small").encode


# In[ ]:
tokenizer_obj = tokenizer
# The chunk_size must be compatible with the sequence length of the embed_model_obj that is used.
chunk_size = 450
chunk_overlap = 50
# Initialize a node parser that we will use in the documents parsing.
# First initialize the TokenCountingHandler with our tokenizer and the CallbackManager with our token counter.
# And then the node parser.
token_counter_handler = TokenCountingHandler(tokenizer=tokenizer_obj)
callback_manager = CallbackManager([token_counter_handler])
node_parser = SimpleNodeParser.from_defaults(chunk_size=chunk_size,
                                             chunk_overlap=chunk_overlap,
                                             callback_manager=callback_manager)


# In[66]:
client = weaviate.connect_to_local()


# In[127]:
# The collection has already been created, so we just connect to it.
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Test"
)


# In[128]:
vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                                        embed_model=embed_model,
                                                        transformations=[node_parser],
                                                        show_progress=True)


# In[100]:
def get_wikipedia_article_text(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {"action": "query", "format": "json", "prop": "extracts", "explaintext": True, "titles": title}
    response = requests.get(url, params=params).json()
    page = next(iter(response["query"]["pages"].values()))
    return page.get("extract", "Article not found.")

python_doc_text = get_wikipedia_article_text("Python (programming language)")
lion_doc_text = get_wikipedia_article_text("Lion")
lion_paragraph = lion_doc_text[:1000]

# In[25]:
python_doc = Document(doc_id='1',
                      text=python_doc_text,
                      metadata={
                           "title_of_parental_document": "Python_(programming_language)",
                           "source": "https://en.wikipedia.org/wiki/Python_(programming_language)"
                       })


# In[101]:
lion_doc = Document(doc_id='2',
                    text=lion_paragraph,
                    metadata={
                       "title_of_parental_document": "Lion",
                       "source": "https://en.wikipedia.org/wiki/Lion"
                   })


# In[104]:
vector_store_index.insert(document=python_doc)
vector_store_index.insert(document=lion_doc)

# In[129]:
retriever = vector_store_index.as_retriever(similarity_top_k=10, 
                                            vector_store_query_mode="hybrid",
                                            alpha=0.5)
nodes = retriever.retrieve("What is lion?")

# In[131]:
# The retriever always returns a list of nodes in descending order based on the score (most relevant chunks first in the list).
# But why does the most relevant chunk here have a score of zero?
for node in nodes:
    print(node.text)
    print()
    print(node.score)
    print("__________________________________________________________________________________________________________")
    print("__________________________________________________________________________________________________________")

print([node.score for node in nodes])
# The scores are: [0.0, 0.9217832833528519, 0.9288488179445267, 0.9365298748016357, 0.937725093215704,
#                  0.9396311119198799, 0.9409564286470413, 0.9446112886071205, 0.9455222226679325, 0.9476451091468334]
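
# In[ ]:
# Hypothetical workaround (not part of the original notebook): since the
# integration applies 1 - score to what is already a similarity, applying
# the same subtraction again recovers Weaviate's hybrid scores, aligned
# with the node order (most relevant first).
corrected_scores = [1 - node.score for node in nodes]
print(corrected_scores)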



# In[108]:

# Code to query Weaviate without LlamaIndex.
query = "what is lion?"
query_vector = embed_model.get_query_embedding(query=query)

# In[121]:
# query_vector
from weaviate.classes.query import MetadataQuery


# In[123]:
# Get a handle to the collection so we can query Weaviate directly.
collection = client.collections.get("Test")
response = collection.query.hybrid(
    query=query,
    vector=query_vector,
    return_metadata=MetadataQuery(distance=True,
                                  certainty=True,
                                  score=True,
                                  explain_score=True),
    alpha=0.50,
    limit=10,
)
x=[]
for o in response.objects:
    print(o.properties)
    print()
    print(o.metadata)
    print("_______________")
    x.append(o.metadata.score)

print(x)

# Scores from Weaviate:
# [1.0, 0.08082851022481918, 0.07240726053714752, 0.07034952938556671, 0.0660715326666832,
#  0.06566160172224045, 0.06270790100097656, 0.05426621064543724, 0.05242578685283661, 0.05184878036379814]
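
# In[ ]:
# Illustrative comparison (not in the original notebook): modulo run-to-run
# drift, LlamaIndex's score for each node is roughly 1 - (Weaviate hybrid
# score), so the best match (Weaviate 1.0) surfaces in LlamaIndex as 0.0.
for node, ws in zip(nodes, x):
    print(f"llama: {node.score:.4f}  weaviate: {ws:.4f}  1 - weaviate: {1 - ws:.4f}")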

Relevant Logs/Tracebacks

No response

@terilias terilias added bug Something isn't working triage Issue needs to be triaged/prioritized labels Jul 12, 2024

dosubot bot commented Jul 12, 2024

The issue arises because LlamaIndex performs a subtraction operation (1 - score) on the scores returned by Weaviate, which now returns similarity scores instead of distances. This results in the scores being reversed.

To fix this, you can modify the code to handle similarity scores correctly. Here’s how you can adjust the scores:

  1. Modify the score calculation:
    Update the part of the code where the score is calculated to use the similarity score directly without subtracting from 1.

  2. Reverse the order of the retrieved results:
    Ensure that the results are sorted in descending order of relevance.

Here’s an example of how you can adjust the code:

# Modify the score calculation
def adjust_scores(scores):
    return scores  # Use the similarity scores directly

# Reverse the order of the retrieved results
response = retriever.retrieve("What is lion?")
response = response[::-1]  # Reverse the order of the results
print(response[0])

Additionally, if you are using the VectorContextRetriever from the LlamaIndex framework, you can modify the retrieve_from_graph method to handle the scores correctly:

class VectorContextRetriever(BasePGRetriever):
    # ... (other parts of the class)

    def retrieve_from_graph(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        vector_store_query = self._get_vector_store_query(query_bundle)

        triplets = []
        kg_ids = []
        new_scores = []
        if self._graph_store.supports_vector_queries:
            result = self._graph_store.vector_query(vector_store_query)
            if len(result) != 2:
                raise ValueError("No nodes returned by vector_query")
            kg_nodes, scores = result

            kg_ids = [node.id for node in kg_nodes]
            triplets = self._graph_store.get_rel_map(
                kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
            )
        elif self._vector_store is not None:
            query_result = self._vector_store.query(vector_store_query)
            if query_result.nodes is not None and query_result.similarities is not None:
                kg_ids = self._get_kg_ids(query_result.nodes)
                scores = query_result.similarities
                kg_nodes = self._graph_store.get(ids=kg_ids)
                triplets = self._graph_store.get_rel_map(
                    kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
                )

            elif query_result.ids is not None and query_result.similarities is not None:
                kg_ids = query_result.ids
                scores = query_result.similarities
                kg_nodes = self._graph_store.get(ids=kg_ids)
                triplets = self._graph_store.get_rel_map(
                    kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
                )

        for triplet in triplets:
            score1 = (
                scores[kg_ids.index(triplet[0].id)] if triplet[0].id in kg_ids else 0.0
            )
            score2 = (
                scores[kg_ids.index(triplet[2].id)] if triplet[2].id in kg_ids else 0.0
            )
            new_scores.append(max(score1, score2))

        assert len(triplets) == len(new_scores)

        # sort by score
        top_k = sorted(zip(triplets, new_scores), key=lambda x: x[1], reverse=True)

        return self._get_nodes_with_score([x[0] for x in top_k], [x[1] for x in top_k])[::-1]  # Reverse the order of the results

This modification will ensure that the results are returned in the correct order of relevance directly from the retriever [1][2][3].


@logan-markewich
Collaborator

@terilias Seems like you have a decent understanding of the issue. Feel free to open a PR 👍🏻

@logan-markewich logan-markewich added P2 package:llama-index-vector-stores-weaviate and removed triage Issue needs to be triaged/prioritized labels Jul 12, 2024

terilias commented Jul 15, 2024

Hi @logan-markewich, thanks, I would like to give it a try! I have found some more issues in the Weaviate integration and was thinking of posting them for reference, each in a separate thread, and trying to work on them. Is that ok?

@logan-markewich
Collaborator

@terilias go for it!
