Haiyuan Cao

Redmond, Washington, United States
8K followers 500+ connections

About

Principal SDE in Azure AI. Working on infusing AI into product, including big LLM models,…

Experience

  • Microsoft

    Bellevue, WA

  • -

    Greater New York City Area

  • -

    Greater New York City Area

  • -

    Shanghai

  • -

    United States

Education

  • Columbia University in the City of New York

    -

    Activities and Societies: Columbia Data Science Society

    Focus on Machine Learning, Platforms for Big Data Analysis and Developing Data-Driven Products

  • -

    Activities and Societies: Member of American Physical Society, Referee of the journal 'Nanotechnology' (impact factor 3.821)

    Majored in computational simulation, mathematical modelling and data analysis of energy transport in nanostructures and magnetic properties of iron-based superconductors. Proposed an algorithm for calculating magnetic interactions and co-proposed a global optimization method for finding the structure of complex grain boundaries.

  • -

Licenses & Certifications

Volunteer Experience

  • Team Member of Youth Ambassador Program for Minorities (YAPM)

    Technology and Education: Connecting Cultures, Inc. (TECC) - 501c3

    - Present 14 years 2 months

    Education

    China, home to 55 minority groups, enjoys rich ethnic and cultural diversity. However, minority cultures are at risk of being marginalized by economic modernization and national education. Confronted with the outside influences of Western and Han culture, local youngsters are unaware of the role they should play in preserving their own culture.
    In this project, we went to a remote village in Honghe Hani and Yi Autonomous Prefecture in Yunnan Province. Of all the Hani villages in Honghe, in southern Yunnan Province, only a small share still speak the Hani language. While most programs address minority culture preservation through recording and documentation carried out by outside observers, we actively involved local youth by making them ambassadors of their own cultures. Our vision is to awaken a sense of responsibility among local youngsters and empower them to play a positive role in protecting their own culture and constructively impacting their surroundings.
    Project objectives:
    1. Raise awareness of the quintessential elements of Buyi culture and stress the importance of cultural preservation among local youth.
    2. Teach local youngsters to use digital cameras, audio recorders and the Internet to record Hani culture. Encourage local students to communicate with the outside world and provide a platform for global interaction.
    3. Devise an organized and effective ethnic culture course to incorporate into local schools’ daily curriculum.
    4. Refine a local culture preservation model that other minority groups can adopt.

  • Referee

    Nanoscale

    - Present 13 years

    Science and Technology

    Serve as a referee for Nanoscale, a leading peer-reviewed journal focused on nanoscience.

Publications

Courses

  • Algorithms for Data Science

    CSOR 4246

  • Bayesian Model in Machine Learning

    EECS 6720

  • Computer Systems for Data Science

    COMS 4121

  • Data Mining

    STAT 4240

  • Exploratory Data Analysis and Visualisation

    STAT 4701

  • Foundations of Graphical Models

    STAT 6701

  • Introduction to Databases

    COMS 4111

  • Natural Language Processing

    COMS 4705

  • Statistical Machine Learning

    STAT 4400

Projects

  • (Kaggle Like) Rang-Tech Data Analytics Competition

    This was a Kaggle-like competition that used transaction data to predict active customers.

    i. Understanding, cleaning and subsetting the data.
    We were not given any background on the data, so we spent some time examining each variable for patterns. We eventually found correlations between variables, which let us group them and reduce their number.
    We did not use the features directly; instead we engineered each one carefully. For features with outliers, we removed the outliers. For features with a very wide range of values, we applied a square-root transform. We also found that the NAs all occurred in the food-related records, so we trained separate models on the data with food NAs and the data without them. Feature transformation and engineering took up a large share of our effort.

    ii. Adding new features. Using only the variables in the data, we stalled at approximately 68% on the public leaderboard. One teammate found a paper describing useful features for customer classification from transaction data. According to that paper, the number of NAs and the number of zeros for each customer are important for predicting active customers, so we added these features and the model's score rose above 69%.

    iii. Ensemble methods. A single model stalled at around 69%. We then combined 13 kinds of models, covering both parametric and non-parametric machine learning methods, and used 2-layer, 5-fold stacking to ensemble the outputs of the first-layer models.
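
The 2-layer, 5-fold stacking in step iii can be sketched roughly as below. This is a minimal illustration in plain Python, not the team's code: the helper names (`k_fold_indices`, `out_of_fold_predictions`, `stack`) and the convention that a model is a `fit(X, y)` function returning a `predict(x)` callable are assumptions made for the example.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, valid
        start += size


def out_of_fold_predictions(fit, X, y, k=5):
    """Predict every row with a model that never saw that row in training.

    `fit(X_train, y_train)` is a hypothetical interface that returns a
    `predict(x)` callable.
    """
    oof = [None] * len(X)
    for train, valid in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in valid:
            oof[i] = predict(X[i])
    return oof


def stack(base_fits, meta_fit, X, y, k=5):
    """Train a 2-layer stack and return its predict(x) function."""
    # Layer 1: each base model contributes one column of out-of-fold predictions.
    columns = [out_of_fold_predictions(fit, X, y, k) for fit in base_fits]
    meta_X = [list(row) for row in zip(*columns)]
    # Layer 2: the meta-model learns from those predictions only.
    meta_predict = meta_fit(meta_X, y)
    # Refit the base models on all rows for use at prediction time.
    base_predicts = [fit(X, y) for fit in base_fits]

    def predict(x):
        return meta_predict([p(x) for p in base_predicts])

    return predict
```

Training the second layer on out-of-fold predictions keeps it from simply rewarding whichever first-layer model overfits the training rows most.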

  • Entity Resolution Matching between Foursquare and Locu’s dataset

    1. Take two datasets, from Foursquare and Locu, that describe the same entities, and identify which entity in one dataset corresponds to which entity in the other.
    2. We constructed features from the input datasets. For location (longitude and latitude), we computed haversine distances. The 'name' and 'address' fields were compared using the Jaccard similarity score, both over the whole entry and over the characters in the entry. The 'phone number' field was compared by simple matching. Missing values in 'phone number' and 'address' were flagged with dummy-variable features.
    3. In our algorithm, we combined the records in the Locu and Foursquare training datasets, featurized them, and labeled each pair according to whether it appears in the matched list. We then trained a random forest classifier on this data. The number of trees was chosen by cross-validation, and the number of features per split followed the usual "sqrt" rule. We settled on a random forest with 100 trees based on the cross-validation F1 score.
    4. We set a match threshold of 0.53, chosen by cross-validation. When the classifier matched several items in the test dataset to the same entity, we kept the match with the highest probability.
    5. Our result had 100% precision, 98.33% recall and a 99.16% F1 score.
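
The pairwise features in step 2 and the thresholded selection in step 4 might look roughly like the sketch below. The record keys (`name`, `latitude`, `longitude`, `phone`) and the function names are assumptions for illustration; the classifier that produces the match probabilities is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pair_features(rec_a, rec_b):
    """Features for one candidate pair; the record keys are hypothetical."""
    name_a, name_b = rec_a["name"].lower(), rec_b["name"].lower()
    phone_a, phone_b = rec_a.get("phone"), rec_b.get("phone")
    return {
        "distance_km": haversine_km(rec_a["latitude"], rec_a["longitude"],
                                    rec_b["latitude"], rec_b["longitude"]),
        "name_word_jaccard": jaccard(name_a.split(), name_b.split()),
        "name_char_jaccard": jaccard(name_a.replace(" ", ""),
                                     name_b.replace(" ", "")),
        "phone_match": int(bool(phone_a) and phone_a == phone_b),
        "phone_missing": int(not phone_a or not phone_b),  # dummy variable
    }


def best_match(candidates, threshold=0.53):
    """Pick the highest-probability candidate at or above the threshold.

    `candidates` is a list of (other_id, probability) pairs.
    """
    above = [c for c in candidates if c[1] >= threshold]
    return max(above, key=lambda c: c[1])[0] if above else None
```

Scoring names at both word and character level lets the classifier tolerate reordered words as well as small spelling differences.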

  • Using AWS Cloud Platform and Spark Machine Learning to Recommend Music with Last.fm’s Audioscrobbler Data Set

    1. Used the data set published by Audioscrobbler, with 24.2 million records of users’ plays of artists, to build a music recommender engine.
    2. Implemented the alternating least squares (ALS) recommender algorithm with MLlib on Spark.
    3. Preprocessed the raw data set with functional-style Python to correct misspelled or nonstandard artist IDs.
    4. Used cross-validation on Spark to select the hyperparameters of the matrix factorization model.
    5. Deployed the final model on AWS to handle the large volume of data.
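
The ID clean-up in step 3 can be illustrated with a small functional-style sketch. It assumes whitespace-separated text lines of the form "bad_id good_id" for aliases and "user artist play_count" for plays; the function names and the sample IDs in the usage note are made up for the example, and the Spark and ALS parts are not shown.

```python
from collections import Counter
from functools import reduce


def parse_alias(line):
    """Return (bad_id, good_id), or None for a malformed line."""
    parts = line.split()
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    return int(parts[0]), int(parts[1])


def build_alias_map(lines):
    """Map nonstandard artist IDs to canonical ones, skipping bad lines."""
    return dict(filter(None, map(parse_alias, lines)))


def canonical_plays(play_lines, alias):
    """Sum play counts per (user, canonical artist) pair."""
    def step(acc, line):
        user, artist, count = map(int, line.split())
        acc[(user, alias.get(artist, artist))] += count
        return acc

    return reduce(step, play_lines, Counter())
```

For example, with the alias line "1092764 1000311", the plays "1 1092764 3" and "1 1000311 2" collapse into a single entry of 5 plays for user 1 and artist 1000311.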

  • Using Hadoop Hive and MapReduce to Analyze NASA Server Logs

    1. Worked with a data set of Apache logs gathered by NASA's server from July to October 1995, about 1 GB in size, stored in HDFS.
    2. Created a Hive schema for the data set, using a regular expression to define a concrete structure covering all the required fields.
    3. Plotted the number of requests per day for each day in October.
    4. Wrote a MapReduce job to calculate total bandwidth by summing all the response bytes sent by the NASA web server.
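
The regular-expression parsing and the two aggregations (requests per day, total response bytes) can be sketched as a tiny in-memory map and reduce. This is plain Python rather than Hive or Hadoop code, and it assumes the logs follow the Apache Common Log Format; the pattern and function names are illustrative.

```python
import re
from collections import defaultdict

# Assumed layout: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def map_line(line):
    """Map step: emit (key, value) pairs for one log line."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return []  # skip lines that do not fit the assumed format
    day = m.group("time").split(":", 1)[0]  # e.g. "01/Jul/1995"
    size = 0 if m.group("bytes") == "-" else int(m.group("bytes"))
    return [("requests:" + day, 1), ("total_bytes", size)]


def reduce_pairs(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(lines):
    """Run the map and reduce steps over an iterable of log lines."""
    return reduce_pairs(pair for line in lines for pair in map_line(line))
```

A "-" in the bytes field means no body was sent, so it is counted as zero bytes rather than dropped, which keeps the request counts intact.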

  • Zynga Game Payer Prediction and User Pattern Analysis

    1. Processed real user data and metrics from the Zynga platform, with 1 million user records and 247 features.
    2. Implemented Lasso and ridge regularization with logistic regression, plus a random forest, to select the features most important for predicting whether a user would become a payer.
    3. Ensembled stochastic gradient descent classifiers (with perceptron, log and hinge loss functions), k-nearest neighbors and a decision tree, trained on the selected features, to predict payers. Precision, recall and F1 score all reached up to 95%.
    4. Used K-means++ on the important features to cluster user patterns on the Zynga platform, with the number of clusters determined by the elbow method. The clustering revealed distinct patterns among paying users, risk-preferring users and mature users.
    5. Based on these user patterns, proposed a strategy of running campaigns across different user groups to improve engagement.
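
The K-means++ clustering and elbow method in step 4 can be sketched in plain Python as follows. This is a toy version for small in-memory data, not the project code; it assumes points are numeric tuples and that `k` does not exceed the number of distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: favor points far from the centers chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, seed=0, iterations=50):
    """Lloyd's algorithm with K-means++ seeding; returns (centers, inertia)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia
```

For the elbow method, one would run `kmeans` for k = 1, 2, 3, ... and pick the k after which the inertia stops dropping sharply.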

  • Handwriting Recognition with SVM and AdaBoost Supervised Learning in R

    1. Processed JPEG data from the open USPS handwriting data set into matrices with R.
    2. Implemented a non-linear SVM and AdaBoost in R to recognize handwritten digits.
    3. Chose the kernel and margin parameters through cross-validation, improving the recognition rate to 90%.

  • Document Text Classification Using Lasso/Ridge Regression and Naïve Bayes

    1. Built an efficient Naïve Bayes classifier to attribute papers to Hamilton or Madison, with the help of R's natural language processing packages.
    2. Implemented ridge regression, Lasso and mutual information selection, respectively, to remove irrelevant features from the text documents and improve the efficiency of the Bayes classifier.
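
A multinomial Naïve Bayes classifier of the kind described in step 1 can be sketched as below. The project used R; this illustration is in Python, uses Laplace smoothing (`alpha`) as an assumption, and expects documents that are already tokenized into word lists.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit per-class log priors and smoothed log word likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / total)
                for w in vocab
            },
        }
    return model


def classify(model, tokens):
    """Return the label with the highest posterior log score."""
    def score(label):
        params = model[label]
        # Words outside the training vocabulary are ignored.
        return params["log_prior"] + sum(
            params["log_likelihood"][w]
            for w in tokens if w in params["log_likelihood"]
        )

    return max(model, key=score)
```

Feature selection as in step 2 would shrink `vocab` before training, which cuts both the size of the likelihood tables and the per-document scoring cost.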

  • Mining the NYPD Open Datasets to Predict Dangerous Areas for Car Collisions in NYC on AWS Cloud Platform

    1. Cleaned, processed and selected features to find correlations between the rate of vehicle collisions and the location, time and weather along the driving route, using R scripts against the NYC open data API.
    2. Applied normalization and PCA to the features, then ran unsupervised K-means++ clustering on AWS with Spark to produce a heat map of high-danger areas in NYC for a given time and driving route.

  • Study the Relation Between Users’ Sentiment and Location Tags in Twitter with SQL and

    1. Processed tweets from the Twitter Streaming API to extract those with location tags, using Python and SQL.
    2. Performed sentiment analysis with classifiers written in Python: a naive Bayes classifier, a maximum entropy classifier and support vector machines. The NLTK package was used to parse and analyze each tweet.
    3. Improved the accuracy of the self-written classifiers with bigrams, trigrams and word dictionaries, reaching around 80%.
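
The bigram and trigram features in step 3 can be produced with a few lines of Python. This sketch assumes simple lowercase whitespace tokenization, which is cruder than what NLTK offers; the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(text, max_n=3):
    """Bag of unigrams, bigrams and trigrams for one tweet."""
    tokens = text.lower().split()
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features
```

Bigrams such as ("not", "good") capture negation that a unigram model would read as a positive word, which is one plausible reason n-grams help sentiment accuracy.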

  • Predict the SSE Index by Bouchaud-Sornette option pricing model

    Developed C code implementing the Bouchaud-Sornette option pricing model to predict the SSE Index.

  • Computational study of the phase transition in the Hexagonal Ising Model

    Implemented the Wolff Monte Carlo method in C to study the phase transition in the hexagonal Ising model.
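
A single Wolff cluster update can be sketched as follows. The original was written in C; this illustration is in Python. It assumes "hexagonal" means the honeycomb lattice (three neighbors per site), laid out as a periodic brick wall on an even-sized square grid, with ferromagnetic coupling J = 1. All function names are illustrative.

```python
import math
import random


def honeycomb_neighbors(i, j, size):
    """Brick-wall layout: two horizontal neighbors plus one vertical one.

    `size` must be even so the periodic wrap-around stays consistent.
    """
    vertical = (i + 1) % size if (i + j) % 2 == 0 else (i - 1) % size
    return [(i, (j - 1) % size), (i, (j + 1) % size), (vertical, j)]


def wolff_step(spins, beta, rng, coupling=1.0):
    """Grow one cluster from a random seed, flip it, return its size."""
    size = len(spins)
    p_add = 1.0 - math.exp(-2.0 * beta * coupling)  # bond activation probability
    seed = (rng.randrange(size), rng.randrange(size))
    seed_spin = spins[seed[0]][seed[1]]
    cluster = {seed}
    frontier = [seed]
    while frontier:
        i, j = frontier.pop()
        for ni, nj in honeycomb_neighbors(i, j, size):
            if ((ni, nj) not in cluster
                    and spins[ni][nj] == seed_spin
                    and rng.random() < p_add):
                cluster.add((ni, nj))
                frontier.append((ni, nj))
    for i, j in cluster:
        spins[i][j] = -seed_spin
    return len(cluster)


def magnetization(spins):
    """Mean spin per site."""
    return sum(map(sum, spins)) / (len(spins) ** 2)
```

Repeating `wolff_step` at a range of `beta` values and tracking the absolute magnetization is one common way to locate the phase transition, since cluster updates are designed to reduce critical slowing down.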

  • Developing New Global Optimization Algorithm for Material Science with Hadoop

    -

    1. Proposed a new global optimization algorithm for functional material search, based on the differential evolution algorithm and written in Python.
    2. Used the new algorithm on Hadoop to find grain boundary structures with lower formation energy.
    3. Designed A/B tests to select the algorithm's components and make the optimization efficient.
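
For context, the classic differential evolution scheme (DE/rand/1/bin) that such an algorithm builds on can be sketched as below. This is the textbook baseline, not the new algorithm described above; the parameter names and defaults are assumptions, and a real run would replace the objective with a formation-energy calculation.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, weight=0.5,
                           crossover=0.9, generations=200, seed=0):
    """Minimize `objective` over box `bounds`; returns (best_point, best_score)."""
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    scores = [objective(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, none of them the current one.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < crossover:
                    gene = population[a][d] + weight * (
                        population[b][d] - population[c][d])
                    gene = min(max(gene, lo), hi)  # clip to the search box
                else:
                    gene = population[i][d]
                trial.append(gene)
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                population[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]
```

Choices such as the mutation weight, the crossover rate and the selection rule are the kind of components that the A/B tests in step 3 could compare.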

  • Developing Highly Efficient Algorithms for Scientific Computation

    -

    1. Developed an efficient algorithm in Python to accelerate large-scale parallel data analysis on Hadoop.
    2. Made the calculation 10 times faster than the previous approach without significant loss of accuracy.

  • Computational Study of the Energy Transport in Nanostructures, Fudan University

    -

    1. Developed Python code to simulate thermal transport in graphene-based materials.
    2. Used multivariate numerical methods in R to analyze the data sets obtained from the experiments.
    3. Designed a new kind of 2D thermal rectifier and published it in a peer-reviewed paper (a top-cited paper in the journal).

Honors & Awards

  • Rank 91/2070 (top 5%) in Kaggle Two Sigma Financial Model Challenge

    Kaggle

    Competed as a team member in the Kaggle Two Sigma Financial Model Challenge. Implemented time-series feature engineering and linear and tree-based regressors to build a model that scored in the top 5% on the private leaderboard (test data set).

  • Bronze Medal in HackerRank Coding Contest (top 15%)

    HackerRank

    Top 15% in the HackerRank Week of Code 23 contest, which had 10,000+ participants.

  • Rank No. 1 among 279 teams in Rang-Technology Data Analytics Competition (Kaggle-like data competition)

    Rang-Technology

    https://rang.shinyapps.io/Competition/

    Ranked No. 1 among 279 teams of Master's students from around 50 universities, including CMU, Columbia, Cornell, USC and UIUC, in the Rang-Tech Data Analytics Competition, a Kaggle-like competition that used transaction data to predict active customers.

  • 2015 Web of Science Highly Cited Paper Worldwide (First author)

    Thomson Reuters

    My first-author paper in Physical Review B on the theoretical computation of magnetic materials, "Antiferromagnetic ground state with pair-checkerboard order in FeSe", was selected as a "Highly Cited Paper" for the 2015 period.

    The paper was selected as being among the top 1% of science papers out of about 1,170,000 papers published in physics-related subjects. This is the most prestigious criterion for research impact in the scientific research field.

  • National Scholarship

    Ministry of Education of People's Republic of China

    Top honor for outstanding academic achievement by graduate students in China.

  • Fellowship for Graduate Student’s Short-term International Visiting (to Lawrence Berkeley National Lab)

    Fudan University

    Fellowship for excellent graduate students to visit top-class institutions worldwide.

  • Distinguished award for new graduate student

    Fudan University

    For excellent incoming graduate students.

Languages

  • Mandarin

    Native or bilingual proficiency

  • English

    Professional working proficiency

Organizations

  • American Physical Society

    Member

    - Present

    Student member of the American Physical Society. Gave two oral talks, at the 2013 and 2014 APS March Meetings.
