-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] ImplicitCF failing if not using stratified split #2024
Comments
I'm not certain under what circumstances 'user in test' should be used in 'interact_status.' Based on my understanding, 'interact_status' should only be applied to the training dataset. Could you provide me with more information on this? |
Basically, to test how a system would work in real-life cold-start conditions. |
I've become a bit unclear about the logic here, but I don't believe it's necessary for 'interact_status' to include users from |
The problem is that self.n_user does include the users in test. The sampling, therefore, also includes these "test users" that might not be available in training. indices = range(self.n_users)
users = random.sample(indices, batch_size) Another problem to remember is that the definition of epoch usually depends on the number of elements in the train set (users in this case), and should not change with the number of items in train. That's why my proposed solution in 2023 was wrong. With all of this in mind, we could:
About the second option, it already concats first the train and then the test, and then drops the duplicates keeping just the first one, so every user in train will be in the first part of the user_idx. |
Yeah, the second option seems to work fine and I just changed two lines of code https://gist.github.com/daviddavo/92a79db3d94bc23e8cdb03279475a221 |
Fix ImplicitCF failing if not using stratified split #2024
@daviddavo should this issue be closed now? |
Yep, I thought it was automatically closed with #2117 |
Description
ImplicitCF raises an
IndexError
if the user appears in the test dataset but not on the training dataset.How do we replicate the issue?
Split a dataset using a method like TimeSeriesSplit or python_chrono_split. I.e:
len(ImplicitCF.interact_status) < len(ImplicitCF.user_idx)
Expected behavior (i.e. solution)
Raisign a meaningful error if the dataset needs to be stratified, or assuming that if the user is not on the
ImplicitCF.interact_status
table, it should have the empty set of items.Other Comments
Meanwhile, I solved it by using:
This will create a the remaining "empty" users
Or just deleting items in test that don't appear in train
The text was updated successfully, but these errors were encountered: