Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ImplicitCF failing if not using stratified split #2024

Closed
daviddavo opened this issue Oct 18, 2023 · 8 comments
Closed

[BUG] ImplicitCF failing if not using stratified split #2024

daviddavo opened this issue Oct 18, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@daviddavo
Copy link
Collaborator

Description

ImplicitCF raises an IndexError if the user appears in the test dataset but not on the training dataset.

How do we replicate the issue?

Split a dataset using a method like TimeSeriesSplit or python_chrono_split. I.e: len(ImplicitCF.interact_status) < len(ImplicitCF.user_idx)

Expected behavior (i.e. solution)

Raisign a meaningful error if the dataset needs to be stratified, or assuming that if the user is not on the ImplicitCF.interact_status table, it should have the empty set of items.

Other Comments

Meanwhile, I solved it by using:

data.interact_status = data.interact_status.reindex(data.user_idx['userID_idx'])
data.interact_status['userID'] = data.interact_status.index
data.interact_status['itemID_interacted'] = data.interact_status['itemID_interacted'].fillna("").apply(set)

This will create a the remaining "empty" users

Or just deleting items in test that don't appear in train

@daviddavo daviddavo added the bug Something isn't working label Oct 18, 2023
@loomlike
Copy link
Collaborator

loomlike commented Nov 2, 2023

@cclinet @Qcactus Can you check this? If you cannot find a time to handle this, let me know.

@cclinet
Copy link
Contributor

cclinet commented Nov 3, 2023

I'm not certain under what circumstances 'user in test' should be used in 'interact_status.' Based on my understanding, 'interact_status' should only be applied to the training dataset. Could you provide me with more information on this?

@daviddavo
Copy link
Collaborator Author

Basically, to test how a system would work in real-life cold-start conditions.

@cclinet
Copy link
Contributor

cclinet commented Nov 8, 2023

I've become a bit unclear about the logic here, but I don't believe it's necessary for 'interact_status' to include users from
the test set.
@loomlike Do you have time to further investigate this issue?

@daviddavo
Copy link
Collaborator Author

daviddavo commented Jun 24, 2024

The problem is that self.n_user does include the users in test. The sampling, therefore, also includes these "test users" that might not be available in training.

  indices = range(self.n_users)
  users = random.sample(indices, batch_size)

Another problem to remember is that the definition of epoch usually depends on the number of elements in the train set (users in this case), and should not change with the number of items in train. That's why my proposed solution in 2023 was wrong. With all of this in mind, we could:

Note: With remaining users I refer to users in test that are not in training

  • Add a boolean mask of size n_users that tells whether each user is in train or not
  • Add a second n_users called n_train_users, and make all remaining users have an index value greater than n_train_users, then sample only from 0 to n_train_users

About the second option, it already concats first the train and then the test, and then drops the duplicates keeping just the first one, so every user in train will be in the first part of the user_idx.

@daviddavo
Copy link
Collaborator Author

Yeah, the second option seems to work fine and I just changed two lines of code

https://gist.github.com/daviddavo/92a79db3d94bc23e8cdb03279475a221

miguelgfierro added a commit that referenced this issue Jun 25, 2024
Fix ImplicitCF failing if not using stratified split #2024
@anargyri
Copy link
Collaborator

@daviddavo should this issue be closed now?

@daviddavo
Copy link
Collaborator Author

Yep, I thought it was automatically closed with #2117

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants