Nanyu Chen

San Francisco Bay Area
1K followers · 500+ connections

About

I am a data science practitioner. I have over 10 years of experience in solving complex…

Experience

  • Embedding VC

    San Francisco Bay Area

  • Earlier positions in the San Francisco Bay Area and the Greater Los Angeles Area

Education

Publications

  • Top Challenges from the first Practical Online Controlled Experiments Summit

    KDD Explorations journal

    Online controlled experiments (OCEs), also known as A/B tests, have become ubiquitous in evaluating the impact of changes made to software products and services. While the concept of online controlled experiments is simple, there are many practical challenges in running OCEs at scale. To understand the top practical challenges in running OCEs at scale, representatives with experience in large-scale experimentation from thirteen different organizations (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, Yandex, and Stanford University) were invited to the first Practical Online Controlled Experiments Summit. All thirteen organizations sent representatives. Together these organizations tested more than one hundred thousand experiment treatments last year. Thirty-four experts from these organizations participated in the summit in Sunnyvale, CA, USA on December 13-14, 2018.
    While there are papers from individual organizations on some of the challenges and pitfalls in running OCEs at scale, this is the first paper to provide the top challenges faced across the industry for running OCEs at scale and some common solutions.

  • A Method for Measuring Network Effects of One-to-One Communication Features in Online A/B Tests

    arXiv

    A/B testing is an important decision-making tool in product development because it provides an accurate estimate of the average treatment effect of a new feature, which allows developers to understand the business impact of new changes to products or algorithms. However, an important assumption of A/B testing, the Stable Unit Treatment Value Assumption (SUTVA), is not always valid, especially for products that facilitate interactions between individuals. In contexts like one-to-one messaging we should expect network interference: if an experimental manipulation is effective, members of the treatment group are likely to influence members of the control group by sending them messages, violating this assumption. In this paper, we propose a novel method that can be used to account for network effects when A/B testing changes to one-to-one interactions. Our method is an edge-based analysis that can be applied to standard Bernoulli randomized experiments to retrieve an average treatment effect that is not influenced by network interference. We develop a theoretical model, and methods for computing point estimates and variances of effects of interest via network-consistent permutation testing. We then apply our technique to real data from experiments conducted on the messaging product at LinkedIn. We find empirical support for our model, and evidence that the standard method of analysis for A/B tests underestimates the impact of new features in one-to-one messaging contexts.
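
    A minimal sketch of the kind of edge-based permutation analysis described above, in Python. The toy data, variable names, and the treatment-treatment vs. control-control contrast below are assumptions for illustration, not the exact estimator from the paper:

    # Illustrative edge-based permutation test for network interference in a
    # Bernoulli-randomized messaging experiment. Toy data; illustration only.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: node-level treatment assignment and message edges
    # (sender, receiver), with one outcome observed per edge (e.g. a reply).
    n_nodes = 1000
    treated = rng.binomial(1, 0.5, size=n_nodes)
    edges = rng.integers(0, n_nodes, size=(5000, 2))
    edge_outcome = rng.binomial(1, 0.3, size=len(edges))

    def edge_type_means(assignment):
        """Mean outcome for edges grouped by (sender treated, receiver treated)."""
        s, r = assignment[edges[:, 0]], assignment[edges[:, 1]]
        means = {}
        for st in (0, 1):
            for rt in (0, 1):
                mask = (s == st) & (r == rt)
                means[(st, rt)] = edge_outcome[mask].mean() if mask.any() else np.nan
        return means

    # Test statistic: contrast between fully-treated and fully-control edges.
    observed = edge_type_means(treated)
    stat = observed[(1, 1)] - observed[(0, 0)]

    # Null distribution: re-randomize the node-level treatment labels, which
    # relabels edges consistently with the sender/receiver structure.
    null = [edge_type_means(rng.permutation(treated)) for _ in range(2000)]
    null_stats = np.array([m[(1, 1)] - m[(0, 0)] for m in null])
    p_value = np.mean(np.abs(null_stats) >= abs(stat))
    print(f"edge contrast = {stat:.4f}, permutation p-value = {p_value:.3f}")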

  • How A/B tests could go wrong: Automatic diagnosis of invalid online experiments

    WSDM 2019

    We have seen a massive growth of online experiments at Internet companies. Although conceptually simple, A/B tests can easily go wrong in the hands of inexperienced users and on an A/B testing platform with little governance. An invalid A/B test leads to bad business decisions, and bad decisions hurt the business. Therefore, it is now more important than ever to create an intelligent A/B platform that democratizes A/B testing and allows everyone to make quality decisions through built-in detection and diagnosis of invalid tests. In this paper, we share how we mined through historical A/B tests and identified the most common causes of invalid tests, ranging from biased design and self-selection bias to attempts to generalize A/B test results beyond the experiment population and time frame. Furthermore, we also developed scalable algorithms to automatically detect invalid A/B tests and diagnose the root cause of invalidity. Surfacing invalidity not only improved decision quality, but also served as user education and reduced problematic experiment designs in the long run.
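
    One widely used automated validity check of this kind is a sample ratio mismatch test, which flags experiments whose observed traffic split deviates from the configured split. The sketch below is illustrative and not necessarily one of the specific diagnostics developed in the paper; the counts and threshold are hypothetical:

    # Illustrative sample-ratio-mismatch (SRM) check: a chi-square goodness-of-fit
    # test that flags an experiment whose observed traffic split deviates from
    # its configured split.
    from scipy.stats import chisquare

    def srm_check(observed_counts, expected_ratios, alpha=0.001):
        """Return (p_value, is_invalid) for the observed variant counts."""
        total = sum(observed_counts)
        expected = [ratio * total for ratio in expected_ratios]
        _, p_value = chisquare(observed_counts, f_exp=expected)
        return p_value, p_value < alpha

    # Example: a 50/50 experiment that actually served 50,500 vs. 49,100 users.
    p, invalid = srm_check([50500, 49100], [0.5, 0.5])
    print(f"p-value = {p:.2e}, flag as invalid: {invalid}")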

  • False Discovery Rate Controlled Heterogeneous Treatment Effect Detection for Online Controlled Experiments

    KDD 2018


    Online controlled experiments (a.k.a. A/B testing) have been used as the mantra for data-driven decision making on feature changes and product shipping at many Internet companies. However, it is still a great challenge to systematically measure how every code or feature change impacts millions of users with great heterogeneity (e.g. countries, ages, devices). The most commonly used A/B testing framework in many companies is based on the Average Treatment Effect (ATE), which cannot detect the heterogeneity of the treatment effect on users with different characteristics. In this paper, we propose statistical methods that can systematically and accurately identify the Heterogeneous Treatment Effect (HTE) of any user cohort of interest (e.g. mobile device type, country), and determine which factors (e.g. age, gender) contribute to the heterogeneity of the treatment effect in an A/B test. By applying these methods to both simulation data and real-world experimentation data, we show that they work robustly with a controlled, low False Discovery Rate (FDR) and, at the same time, provide useful insights about the heterogeneity of the identified user groups. We have deployed a toolkit based on these methods, and have used it to measure the Heterogeneous Treatment Effect of many A/B tests at Snap.
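
    The overall recipe can be sketched as follows: test the treatment effect within each user segment, then apply a multiple-testing correction such as the Benjamini-Hochberg procedure to control the FDR. The simulated data and the simple Welch t-tests below are assumptions for illustration; the paper's estimators are more involved:

    # Sketch of FDR-controlled HTE detection: per-segment treatment-effect tests
    # followed by Benjamini-Hochberg. Simulated data; illustrative only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    segments = ["iOS", "Android", "Desktop", "US", "EU", "APAC"]

    p_values = []
    for segment in segments:
        control = rng.normal(0.0, 1.0, 5000)
        lift = 0.1 if segment == "Android" else 0.0   # one truly affected segment
        treatment = rng.normal(lift, 1.0, 5000)
        _, p = stats.ttest_ind(treatment, control, equal_var=False)
        p_values.append(p)

    def benjamini_hochberg(p_values, fdr=0.05):
        """Return indices of hypotheses rejected at the given FDR level."""
        p = np.asarray(p_values)
        order = np.argsort(p)
        thresholds = fdr * np.arange(1, len(p) + 1) / len(p)
        below = p[order] <= thresholds
        if not below.any():
            return []
        k = int(np.max(np.nonzero(below)[0]))         # largest i with p_(i) <= i*q/m
        return sorted(order[: k + 1].tolist())

    for idx in benjamini_hochberg(p_values):
        print(f"{segments[idx]}: significant heterogeneous effect (p = {p_values[idx]:.1e})")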

  • Evaluating Mobile Apps with A/B and Quasi A/B Tests

    KDD 2016

    We have seen explosive growth of mobile usage, particularly on mobile apps. It is more important than ever to be able to properly evaluate mobile app releases. A/B testing is a standard framework for evaluating new ideas, and we have seen many of its applications in the online world across the industry [9,10,12]. Running A/B tests on mobile apps turns out to be quite different, largely because we cannot easily ship code to mobile apps without going through a lengthy build, review, and release process. Mobile infrastructure and user behavior differences also contribute to how A/B tests are conducted differently on mobile apps, which is discussed in detail in this paper. In addition to measuring features individually in the new app version through randomized A/B tests, we have a unique opportunity to evaluate the mobile app as a whole using the quasi-experimental framework [21]. Not all features can be A/B tested due to infrastructure changes and holistic product redesigns. We propose and establish quasi-experimental techniques for measuring the impact of a mobile app release, with results shared from a recent major app launch at LinkedIn.
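
    A common quasi-experimental comparison in this setting is a difference-in-differences between users who adopted the new app version and users who had not yet upgraded, using each group's pre-release behavior to correct for self-selection. The sketch below is a simplified illustration with simulated data, not the model-based matching developed in the paper:

    # Illustrative difference-in-differences for a quasi-experimental app-release
    # comparison. Simulated engagement data and a two-period setup; the paper's
    # matching-based approach is more involved.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 10000
    adopted = rng.binomial(1, 0.4, n)                 # 1 = upgraded to the new version

    # Adopters are more engaged at baseline (selection bias), and the release
    # adds a true +0.5 effect for adopters in the post period.
    baseline = 5.0 + 1.0 * adopted + rng.normal(0, 1, n)
    post = baseline + 0.2 + 0.5 * adopted + rng.normal(0, 1, n)

    naive = post[adopted == 1].mean() - post[adopted == 0].mean()
    did = ((post - baseline)[adopted == 1].mean()
           - (post - baseline)[adopted == 0].mean())
    print(f"naive post-only comparison:  {naive:.2f}  (inflated by selection bias)")
    print(f"difference-in-differences:   {did:.2f}  (close to the true +0.5 effect)")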

  • From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks

    KDD 2015

    A/B testing, also known as bucket testing, split testing, or controlled experimentation, is a standard way to evaluate user engagement or satisfaction with a new service, feature, or product. It is widely used by online websites, including social network sites such as Facebook, LinkedIn, and Twitter, to make data-driven decisions. At LinkedIn, we have seen tremendous growth of controlled experiments over time, with now over 400 concurrent experiments running per day. General A/B testing frameworks and methodologies, including challenges and pitfalls, have been discussed extensively in several previous KDD papers. In this paper, we describe in depth the experimentation platform we have built at LinkedIn and the challenges that arise particularly when running A/B tests at large scale in a social network setting. We start with an introduction to the experimentation platform and how it is built to handle each step of the A/B testing process at LinkedIn, from designing and deploying experiments to analyzing them. This is followed by discussions of several more sophisticated A/B testing scenarios, such as running offline experiments and addressing the network effect, where one user’s action can influence that of another. Lastly, we talk about the features and processes that are crucial for building a strong experimentation culture.
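
    One building block such platforms rely on is deterministic, hash-based assignment of members to variants, so that a member always receives the same variant for a given experiment while different experiments split traffic independently. The sketch below is a generic illustration of that idea, not a description of LinkedIn's internal implementation:

    # Generic hash-based variant assignment; illustrative only.
    import hashlib

    def assign_variant(member_id, experiment_key, allocation):
        """allocation: list of (variant_name, fraction) pairs summing to 1.0."""
        digest = hashlib.sha256(f"{experiment_key}:{member_id}".encode()).hexdigest()
        point = int(digest[:15], 16) / 16 ** 15       # deterministic, ~uniform in [0, 1)
        cumulative = 0.0
        for variant, fraction in allocation:
            cumulative += fraction
            if point < cumulative:
                return variant
        return allocation[-1][0]                      # guard against rounding error

    # The same member always lands in the same bucket for this experiment.
    print(assign_variant(12345, "feed_ranking_v2", [("control", 0.5), ("treatment", 0.5)]))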


Patents

  • POST-EXPERIMENT NETWORK EFFECT ESTIMATION BASED ON LOGGED MESSAGING EVENTS

    Filed US 60352-0367

  • MODEL-BASED MATCHING FOR REMOVING BIAS IN QUASI-EXPERIMENTAL TESTING OF MOBILE APPLICATIONS

    Filed US 15/140,239

  • MODEL VALIDATION AND BIAS REMOVAL IN QUASI-EXPERIMENTAL TESTING OF MOBILE APPLICATIONS

    Filed US 15/140,250

  • A/B TESTING ON DEMAND

    Filed US 15/140,186

  • Flexible Targeting

    Filed US 14/944,100

  • Site Wide Impact

    Filed US 62/141,126

  • Triggered Targeting

    Filed US 62/140,366

  • Most Impactful Experiments

    Filed US 62/141,193

Languages

  • English

    Full professional proficiency

  • Mandarin

    Native or bilingual proficiency

  • Cantonese

    Professional working proficiency

  • French

    Elementary proficiency

Organizations

  • American Statistical Association

