
VE mobile default test buckets are unbalanced
Closed, Resolved · Public

Description

In T229426#5468481, we uncovered, "...53.4% of the users (both registered and anonymous) who were bucketed ended up in wikitext default bucket. It turns out that it would be incredibly unlikely (p << 10^-15) to get an imbalance this big if our random assignment was actually 50%–50%. So there's clearly a serious issue somewhere that we need to understand."

bucket | users
source default | 1,302,187
visual default | 1,214,917
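For reference, a rough check of that claim, assuming a fair 50–50 split and using a normal approximation to the binomial (an illustrative calculation, not part of the original analysis):

$$
n = 1{,}302{,}187 + 1{,}214{,}917 = 2{,}517{,}104, \qquad
z = \frac{1{,}302{,}187 - n/2}{\sqrt{n}/2} \approx \frac{43{,}635}{793.3} \approx 55,
$$

so the two-sided p-value, roughly $2\,\Phi(-55)$, is indeed astronomically far below $10^{-15}$.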

"Done"

This ticket is intended to represent the work of trying to understand the following:

  • What might be causing contributors not to be assigned to test buckets in a more balanced way?

Contributors' bucket assignments were being recorded after the editor finished loading. This means in situations where contributors abandon their edits before the editor loads, no information about them (including their bucket assignment) is stored. See: T232237#5501940

  • What are the implications of contributors not being assigned to test buckets in a more balanced way? (e.g. Are the results, and by extension the test, valid?)

There is a percentage of contributors whose editing behavior we do not have access to. This makes further analysis more difficult, considering we'd need to either exclude them from this analysis or make assumptions about their behavior. See: T232237#5520035

Event Timeline

For the code-side generation of this distribution, this is where it's defined: https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/blob/master/includes/MobileFrontendEditorHooks.php#L48

if ( $user->isAnon() ) {
    $cookie = $context->getRequest()->getCookie( 'MFDefaultEditorABToken' );
    if ( !$cookie ) {
        $cookie = MWCryptRand::generateHex( 32 );
        $context->getRequest()->response()->setCookie(
            'MFDefaultEditorABToken', $cookie, time() + ( 90 * 86400 )
        );
    }
    $vars['wgMFSchemaEditAttemptStepAnonymousUserId'] = $cookie;
    $anonid = base_convert( substr( $cookie, 0, 8 ), 16, 10 );
    $defaultEditor = $anonid % 2 === 0 ? 'source' : 'visual';
    $vars['wgMFSchemaEditAttemptStepBucket'] = 'default-' . $defaultEditor;
} elseif ( $user->getEditCount() <= 100 ) {
    $defaultEditor = $user->getId() % 2 === 0 ? 'source' : 'visual';
    $vars['wgMFSchemaEditAttemptStepBucket'] = 'default-' . $defaultEditor;
}

For anonymous users, a persistent ID is generated via MWCryptRand::generateHex( 32 ). This is the same method used to generate the sessionid for desktop wikieditor. It's then converted to a decimal number via that base_convert call, using the same method as EventLogging's sessionInSample.

For logged-in users, their user ID is used.

The generated ID is then used to decide which editor they're assigned: if it's even they get the source editor; if odd, the visual editor.
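To make the assignment concrete, here is a minimal standalone sketch of that logic (the token value is made up for illustration; the authoritative code is the hook excerpt above):

<?php
// Illustrative sketch of the bucketing logic quoted above. The token is a
// made-up example; in production it comes from MWCryptRand::generateHex( 32 )
// and is stored in the MFDefaultEditorABToken cookie.
$cookie = '1a2b3c4d9f0e7d6c5b4a39281716fedc';

// Take the first 8 hex characters, convert them to decimal, and use the
// parity of that number to pick the default editor.
$anonid = base_convert( substr( $cookie, 0, 8 ), 16, 10 );
$defaultEditor = $anonid % 2 === 0 ? 'source' : 'visual';

// The parity of the decimal value depends only on the last of those 8 hex
// digits ('d' here, which is odd), so this example token lands in 'visual'.
echo "anonid: $anonid, default editor: $defaultEditor\n";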

So, the question that naturally arises from this: is the imbalance present in both anonymous and logged-in users? If it's just one, there might be something in the assignment which is skewing things. If it's both, perhaps it's in the recording.

nshahquinn-wmf renamed this task from "Why are the numbers of contributors in each test bucket not more balanced?" to "VE mobile default test buckets are unbalanced". Sep 9 2019, 8:57 AM

I discussed this task with @Neil_P._Quinn_WMF today. See current proposed plan and thoughts below and let me know if you have any other suggestions.

What might be causing contributors not to be assigned to test buckets in a more balanced way?

As a first step, I'll investigate whether the imbalance is present for any particular dimension. I'll first look at anonymous and logged-in users and then look into other potential dimensions, such as browser and country. Since the sample size is large, we should expect the buckets to be balanced across most dimensions. Isolating the imbalance issue to a specific dimension (if we are able to identify one) should help inform where the distribution error might be occurring and what the next steps are.
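For illustration, here is a sketch of the kind of per-dimension check I have in mind, using a normal approximation to the binomial; the subgroup names and counts below are placeholders rather than real data:

<?php
// Per-dimension balance check: for each subgroup, compare the observed
// wikitext-default ("source") count against the 50% expectation.
// The counts below are placeholders, not actual test data.
$subgroups = [
    'anonymous'  => [ 'source' => 520000, 'visual' => 480000 ],
    'registered' => [ 'source' => 5100,   'visual' => 4900 ],
];

foreach ( $subgroups as $name => $counts ) {
    $n = $counts['source'] + $counts['visual'];
    $share = $counts['source'] / $n;
    // z-score of the observed source-default count under Binomial( n, 0.5 );
    // a large |z| for one subgroup but not others would point at that dimension.
    $z = ( $counts['source'] - $n / 2 ) / ( sqrt( $n ) / 2 );
    printf( "%s: %.1f%% source default, z = %.1f\n", $name, 100 * $share, $z );
}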

What are the implications of contributors not being assigned to test buckets in a more balanced way? (e.g. Are the results, and by extension the test, valid?)

This will depend on the issue that causes the imbalance and whether it resulted in a random or non-random assignment of users.

If the identified issue is that a random subset of users did not get placed in the correct bucket (for example, if the bucket wasn't properly stored for a random ~5% of people put into the visual default bucket), then the average results reported are still valid because the assignment is still random.

If the issue we identify is somehow causing a non-random assignment (for example, users with longer load times or connection issues are not stored properly), then the results of our test will be impacted.

Just reviewed this with Megan, and this is worth checking ahead of the analysis, but it shouldn't greatly impact timelines (e.g. it probably won't be necessary to run the whole test again).

If there's a skew in the assignment, that could be handled in the analysis itself (it may make the analysis slightly more tricky as a result, but it's still doable). If a particular subgroup ended up being heavily impacted (e.g. 100% were in only one condition), the subgroup should be excluded from the analysis and it might be worth doing a follow-up with that subgroup.

Megan is wrapping up something for Web this week, but will be able to start the most pressing Editing task starting on Friday (whether it's this one or another).

Thank you for investigating this proactively, @DLynch, and thank you, @MNeisler/@kzimmerman, for laying out the potential implications and our steps forward (summarized below).

A question RE other potential "dimensions": @MNeisler, besides browser and country, can you think of what other – if any – variables are knowable to us and could impact how a contributor is assigned to a test bucket? I ask this in the context of thinking through the other potential investigations we could do after first looking at logged in vs. IP contributors. cc @DLynch

One thought that comes to mind: Might there have been a specific time window during the test when contributors began being unevenly assigned? Thinking: maybe we released something else that unknowingly impacted how contributors get/got assigned.


Paths forward

Scenario | Action
If we find a non-random subset of contributors did not get assigned to the correct test bucket... | ...then we: 1. Re-run our initial check (T229426), excluding contributors along this "dimension" from our analysis; 2. Conduct our full analysis (T221198), also excluding contributors along this "dimension"; 3. Potentially do a follow-up test with contributors along this "dimension"
If we find a random subset of contributors did not get assigned to the correct test bucket... | ...then we: 1. Consider our initial check (T229426) to be valid and then 2. Conduct our full analysis (T221198) as planned

Documenting a question @Esanders posed during today's planning meeting:

What happens when a contributor tries to edit using a browser that VE doesn't support? How – if at all – are they assigned to a test bucket?

@ppelberg They're assigned to the buckets server-side. They then log their bucket as the assigned one, regardless of which editor loads.

I spent some time yesterday digging into the data to determine if the imbalance was present for any particular subgroup. See summary of findings so far below and full results linked here:

  • Unable to isolate the imbalance to any of the following subgroups: user group (registered or anonymous), country, wiki, or event action. Across all of these groups, there was a similar imbalance between the wikitext and visual editor buckets, with a higher share of users (around 51% to 54%) being placed in the wikitext bucket. See user bucket numbers for registered and anonymous users below:

Anonymous Users:

bucket | users
default-source | 1,891,449
default-visual | 1,769,443

Registered Users:

bucket | users
default-source | 9,137
default-visual | 8,819
  • There was also a similar imbalance across all of the major browsers; however, I did find a significant imbalance between the wikitext and visual editor user buckets for the Bingbot and BingPreview browsers (both web-crawling bot browsers). For Bingbot, about 98% of the 384 users (both registered and anonymous) were placed in the wikitext-as-default bucket. This imbalance alone wouldn't account for the total imbalance we are seeing in overall users, but it might be a clue as to where the issue is happening. Do we filter out bots from the test group?
  • Since we're seeing the same imbalance towards the wikitext buckets across these various dimensions and with init actions, I'm wondering if this might be a load-time issue.

    Can we confirm that both buckets are assigned on the server side? Are they both assigned at the same time an init event is recorded on the server? If, for some reason, the wikitext bucket is assigned on the server side and the visual editor bucket on the client side, then that delay might be why we are seeing fewer people in the visual editor buckets.

    Any other thoughts or ideas? Let me know if there are any other areas you'd like me to help investigate.

Buckets are all assigned server-side, yes -- I posted a link to where it's done above.

I looked into how we record the events again and I think I noticed something we weren't thinking of before: events are only recorded after the editor loads. The code that would record events is downloaded together with the editor code, so it can't record anything until the editor code is downloaded.

Events which are generated before the editor loads, like the 'init' event, are stored in a temporary queue, and get recorded only after the editor code is loaded.

If the user has a poor network connection and loading the editor fails (or is cancelled by the user), nothing is recorded. Because the visual editor code is much larger than the wikitext editor code (it depends on the wiki's configuration, but around 6x larger), the effect is much stronger for the visual editor, resulting in this imbalance.
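A rough way to see how differential drop-off could produce an imbalance of this size (my own back-of-the-envelope, assuming a true 50–50 assignment and that a fraction $a_s$ of wikitext sessions and $a_v$ of visual sessions are lost before logging):

$$
\text{recorded wikitext share} = \frac{1 - a_s}{(1 - a_s) + (1 - a_v)}
\;\xrightarrow{\;a_s \,\approx\, 0\;}\;
\frac{1}{2 - a_v} \approx 0.517
\;\Rightarrow\;
a_v \approx 6.7\%,
$$

i.e. losing roughly 6–7% of visual-default sessions before the logging code arrives would be enough to explain the overall split in the task description.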

I tested using Chrome dev tools set to simulate a "Slow 3G" connection (tested on https://ca.m.wikipedia.org/wiki/Zífid_de_quatre_dents_petit). For the wikitext editor, the 'init' event was recorded ~4 seconds after the user clicked the edit pencil. For the visual editor, it was recorded after ~13 seconds. If the user cancelled loading before that point, or lost their network connection, nothing would have been recorded.

[Two screenshots attached, side by side: Wikitext and Visual]

If we wanted to fix this, it's technically an easy change to load this code earlier: https://gerrit.wikimedia.org/r/537574 – but this would noticeably increase the amount of code loaded on the initial page load. I'm not sure how significant the increase is, but it's probably not a good idea to worsen the users' experience just so that we can log some data about them…


I think that's the explanation for the missing users. (Or at least, probably the closest thing to an explanation we're going to get.)

In the data that we have logged now, we should probably treat the missing visual users as if they tried to use the editor but failed to load it. (Note that there would also be missing users for the wikitext editor, but we have no way to tell how many, other than guessing it's a lot fewer than for visual.)

Great work, @DLynch, @MNeisler and @matmarex.

So it sounds like we ended up in a situation where a non-random group of contributors is being affected by this issue. In T232237#5481099, we said we could correct for an issue like this in our analysis. Granted, at that point we didn't know what the issue was.

However, after talking with Megan about this today, I got the impression we now have less confidence in statistical methods to correct for this issue in our analysis, which leads me to wonder: @MNeisler, are you able to share how our results and our confidence in them might be affected should we move forward with analyzing the data we've gathered so far?

As for the issue itself [2], @matmarex, considering you already have a patch up and you, @DLynch and @Esanders seem to have worked out a path forward with it in chat [1]: Is it safe for me to assume there isn't much – if any – engineering work left to fix the issue by recording events before the editor loads?


  1. Path forward: "There's not a great deal of harm in pushing it sooner, with async as a decent compromise."
  2. Issue: The event that records which test bucket a contributor is assigned to gets recorded after the editor loads. This means contributors are invisible to the test in instances where the editor fails to load.

As for the issue itself [2], @matmarex, considering you already have a patch up and you, @DLynch and @Esanders seem to have worked out a path forward with it in chat [1]: Is it safe for me to assume there isn't much – if any – engineering work left to fix the issue by recording events before the editor loads?

RE engineering work to be done to resolve the issue...

During today's standup @DLynch and @Esanders came to the agreement that there is a minimal amount of engineering work to be done to resolve the issue of contributors' test bucket assignments not being recorded when the editor fails to load.

@MNeisler, are you able to share how our results and our confidence in them might be affected should we move forward with analyzing the data we've gathered so far?

I discussed this with @Neil_P._Quinn_WMF today. There are a couple potential ways we could correct for the identified issue in the analysis if we decided to proceed with analyzing the data as is.

  1. We could add in all the missing visual users, treating them all as if they tried to use the editor but failed to load it. The problem with this approach is that we don't know whether they were all distinct users, nor do we have any other user details associated with these events. This also assumes that we are not missing any users for wikitext due to a load timing issue.
  2. We could subtract randomly selected init sessions from the wikitext bucket to bring the bucket numbers in line. This might artificially increase the edit completion rate a little, but it would avoid us having to make assumptions about the missing user data (see the sketch after this comment).

While neither of these options is perfect (and both would involve assumptions we'd need to caveat), I think they are feasible methods for correcting the issue and would not invalidate our overall results if we decided to proceed without the code fix option due to timing concerns.
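For illustration, a minimal sketch of option 2 above, randomly discarding wikitext-bucket init sessions until the bucket sizes match (the session lists and counts below are placeholders, not real data):

<?php
// Sketch of correction option 2: randomly drop init sessions from the
// over-represented wikitext bucket until both buckets are the same size.
// In practice these would be session IDs pulled from the logged events;
// the placeholder ranges below just stand in for them.
$wikitextSessions = range( 1, 1000 );    // placeholder: 1,000 wikitext-default sessions
$visualSessions   = range( 1001, 1950 ); // placeholder: 950 visual-default sessions

if ( count( $wikitextSessions ) > count( $visualSessions ) ) {
    shuffle( $wikitextSessions );
    // Keep a random subset of wikitext sessions equal in size to the visual bucket.
    $wikitextSessions = array_slice( $wikitextSessions, 0, count( $visualSessions ) );
}

printf( "wikitext: %d, visual: %d\n", count( $wikitextSessions ), count( $visualSessions ) );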

Change 539933 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/MobileFrontend@master] Load editor EventLogging code earlier

https://gerrit.wikimedia.org/r/539933

If we wanted to fix this, it's technically an easy change to load this code earlier: https://gerrit.wikimedia.org/r/537574 – but this would noticeably increase the amount of code loaded on the initial page load. I'm not sure how significant the increase is, but it's probably not a good idea to worsen the users' experience just so that we can log some data about them…

@Esanders suggested that we can load this after the initial page load. This should not negatively impact the initial page loading performance, but should allow us to record editor initialization in most cases. The new patch does that.
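For readers unfamiliar with how that works, here is a generic MediaWiki-style sketch of the idea; it is not the actual MobileFrontend patch, and the class and module names are hypothetical. The point is that a module added this way is fetched asynchronously by ResourceLoader after the initial render, rather than only when the editor itself loads:

<?php
// Hypothetical sketch (not the actual patch): add the editor EventLogging
// module on page display so it loads asynchronously via ResourceLoader
// without blocking the initial page render.
class ExampleEditorLoggingHooks {
    public static function onBeforePageDisplay( OutputPage $out, Skin $skin ) {
        // Module name is made up for illustration.
        $out->addModules( [ 'ext.example.editorEventLogging' ] );
    }
}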

If we wanted to fix this, it's technically an easy change to load this code earlier: https://gerrit.wikimedia.org/r/537574 – but this would noticeably increase the amount of code loaded on the initial page load. I'm not sure how significant the increase is, but it's probably not a good idea to worsen the users' experience just so that we can log some data about them…

@Esanders suggested that we can load this after the initial page load. This should not negatively impact the initial page loading performance, but should allow us to record editor initialization in most cases. The new patch does that.

Thanks for following up on this, @matmarex. Regarding the assumption that this patch "...should not negatively impact the initial page loading performance...": what gives us confidence this is the case? The negligible impact this patch would have on the payload of the page? Some testing we've done? Something else?

Well honestly it's an educated guess, based on how I believe browsers work. I think folks on the Reading team keep tabs on the page load performance (sorry, I don't know how exactly, I just remember seeing tasks with pretty graphs once or twice when there was a regression), so if it turns out this is wrong, we will be able to notice and revert or reconsider.

Updating the task description to add the bolded text to the "Done" section:

  • What might be causing contributors not to be assigned to test buckets in a more balanced way?

Contributors' bucket assignments were being recorded after the editor finished loading. This means in situations where contributors abandon their edits before the editor loads, no information about them (including their bucket assignment) is stored. See: T232237#5501940

  • What are the implications of contributors not being assigned to test buckets in a more balanced way? (e.g. Are the results, and by extension the test, valid?)

There is a percentage of contributors whose editing behavior we do not have access to. This makes further analysis more difficult, considering we'd need to either exclude them from this analysis or make assumptions about their behavior. See: T232237#5520035

Well honestly it's an educated guess, based on how I believe browsers work. I think folks on the Reading team keep tabs on the page load performance (sorry, I don't know how exactly, I just remember seeing tasks with pretty graphs once or twice when there was a regression), so if it turns out this is wrong, we will be able to notice and revert or reconsider.

Got it – thank you for explaining that. And just to be doubly sure, would I be correct to think any potential negative impact we would see from this change would show up in how long it takes an article, in read mode, to load?

@matmarex do you know who is best suited to QA this patch? QA? Engineering? Product Analytics? cc @JTannerWMF @marcella

Engineering. (Note that it's still not merged, Krinkle had some concerns on the patch, and I just updated it.)

Change 539933 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] Load editor EventLogging code earlier

https://gerrit.wikimedia.org/r/539933

Engineering. (Note that it's still not merged, Krinkle had some concerns on the patch, and I just updated it.)

Noted – thanks for confirming, @matmarex.

Considering this is now merged, I'm moving this to "Engineering QA".

Related to QA: as part of testing, are you able to see if this approach to the patch impacts page load time?

I saw Jon +2'd the patch which leads me to think he might've checked this in his review, but that's a leap, so I'm wanting to be sure.

ppelberg added a subscriber: MNeisler.

Seems to work as expected; as far as I can tell, the 'init' event gets logged immediately, before the editor loads.

Oh, and I just saw that it also looks good on the Analytics side: T235104#5573854

Related to QA: as part of testing, are you able to see if this approach to the patch impacts page load time?

I saw Jon +2'd the patch which leads me to think he might've checked this in his review, but that's a leap, so I'm wanting to be sure.

I can't really tell myself; it doesn't seem slower for me, but of course that doesn't mean much :)

@Jdlrobson I vaguely recall you had some fancy graphs / tracking for mobile site's performance. Can you help us find where that is, and confirm that https://gerrit.wikimedia.org/r/539933 did not have a negative impact?

Seems to work as expected; as far as I can tell, the 'init' event gets logged immediately, before the editor loads.

In T232237#5576479, @matmarex wrote:

Oh, and I just saw that it also looks good on the Analytics side: T235104#5573854

Thanks for checking, @matmarex. Let's await confirmation from Jon before we call this done.

Thanks!

I guess we're interested in the "Barack Obama on 3G (en.m)" graph (which has loading time data for the favorite test page, heh). You can zoom in on it by clicking the header and choosing "View" from the menu. It shows the last 7 days of data, but we're interested in the change since last week's deployment, so let's adjust to 14 days using the menu in the top-right of the page. Direct link to that view: https://grafana.wikimedia.org/d/000000205/mobile-2g?panelId=32&fullscreen&orgId=1&from=now-14d&to=now

[Screenshot of the Grafana graph attached]

It seems like a bunch of data between 2019-10-04 and 2019-10-11 is missing, which is unfortunate, because our change was deployed on 2019-10-10 (with the train). But I think we can still say that there's no visible change before and after.

This is a big help – thank you, @Jdlrobson.

I guess we're interested in the "Barack Obama on 3G (en.m)" graph (which has loading time data for the favorite test page, heh). You can zoom in on it by clicking the header and choosing "View" from the menu. It shows the last 7 days of data, but we're interested in the change since last week's deployment, so let's adjust to 14 days using the menu in the top-right of the page. Direct link to that view: https://grafana.wikimedia.org/d/000000205/mobile-2g?panelId=32&fullscreen&orgId=1&from=now-14d&to=now

[Screenshot of the Grafana graph attached]

Your explanation combined with having the deployment date right there makes thinking about this easier. Thank you for that, @matmarex.

But I think we can still say that there's no visible change before and after.

As for the potential impact of the change [1], it looks like the average render [stable] time was 4.14s in the 5 days between 29-Sep and 4-Oct (to your point, we don't have data between 2019-10-04 and 2019-10-11), and then the average render [stable] time was 4.27s in the 5 days following the change being deployed, 11-Oct – 15-Oct, which looks like a 3% increase.
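For reference, the arithmetic behind that figure:

$$
\frac{4.27\,\text{s} - 4.14\,\text{s}}{4.14\,\text{s}} \approx 0.031 \approx 3\%.
$$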

Assuming render [stable] is the correct data point to look at to assess loading time [2], then agreed, a 3% increase seems negligible.

Bartosz, if you see any holes in my thinking above, please let me know. If not, let's call this QA complete.


  1. I'm assuming other code would have been deployed on 2019-10-10 that could've impacted loading performance.
  2. render [stable]: in my time looking, I wasn't able to locate where this, or any of the other attributes at the footer of the graphs, were defined. I looked here, here, and here.

In T232237#5582683, @ppelberg wrote:
Bartosz, if you see any holes in my thinking above, please let me know. If not, let's call this QA complete.

Bartosz and I just talked about the above in chat. We're going to call QA for this complete.

matmarex claimed this task.
DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks