About
Principal SDE in Azure AI. Working on infusing AI into product, including big LLM models,…
Activity
-
Exciting news! 🚀 Atlassian is expanding and we're hiring! Our journey began with Jira, helping developers track bugs and manage work. With our newly…
Liked by Haiyuan Cao
-
Extreme Deep Learning scalability is driving breathtaking AI advancements, but at the cost of making I/O a major bottleneck. I am excited to announce…
Liked by Haiyuan Cao
-
Can we transfer knowledge from large model to small model via weights instead of logits? Yes! we verified this idea via a series of experiments one…
Liked by Haiyuan Cao
Experience
Education
-
Columbia University in the City of New York
-
Activities and Societies: Columbia Data Science Society
Focus on Machine Learning, platforms for Big Data analysis and developing data-driven products
-
-
Activities and Societies: Member of American Physical Society, Referee of the journal 'Nanotechnology' (impact factor 3.821)
Majored in computational simulation, mathematical modelling and data analysis of energy transport in nanostructures and magnetic properties of iron superconductors. Proposed an algorithm for calculating the magnetic interaction and co-proposed a global optimization method for searching the structures of complex grain boundaries.
-
-
Licenses & Certifications
-
Intermediate Python for Data Science
DataCamp
Credential ID 94cbf2127dfec10ccb37c06bbcde31da55a89475
Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning
Coursera
Credential ID LG9Y6YFTNUSL
Volunteer Experience
-
Team Member of Youth Ambassador Program for Minorities (YAPM)
Technology and Education: Connecting Cultures, Inc. (TECC) - 501c3
- Present 14 years 2 months
Education
China, home to 55 minority groups, enjoys rich ethnic and cultural diversity. However, minority cultures are at risk of being marginalized by economic modernization and national education. Confronted with the outside influences of Western and Han culture, local youngsters are unaware of the role they should play in preserving their own culture.
In this project, we went to a remote village of Honghe Hani and Yi Autonomous Prefecture in Yunnan Province. Out of all Hani villages in Honghe in north Yunnan Province, only a small fraction still speak the Hani language. While most programs address minority culture preservation through recording and documentation carried out by outside observers, we actively involved local youth by making them ambassadors of their own cultures. Our vision is to awaken a sense of responsibility among local youngsters and empower them to play a positive role in protecting their own culture and constructively impacting their surroundings.
Project objectives:
1. Raised awareness of quintessential elements of Buyi culture and stressed the importance of cultural preservation among local youth.
2. Taught local youngsters to use digital cameras, audio recorders and the Internet to record the Hani culture. Encouraged local students to communicate with the outside world and provided a platform for global interaction.
3. Devised an organized and effective ethnic culture course to incorporate into local schools’ daily curriculum.
4. Refined a local culture preservation model which other minority groups can adopt.
Referee
Nanoscale
- Present 13 years
Science and Technology
Serving as a referee for <Nanoscale>, a leading peer-reviewed journal focused on nanoscience.
Publications
-
Thermal conductivity of disordered two-dimensional binary alloys
Nanoscale
Using advanced statistical simulations, we have studied the effect of disorder on the thermal conductivity of two-dimensional alloys. We find that the thermal conductivity not only depends on the substitution concentration of different elements, but also strongly depends on the disorder distribution.
-
Oxygen Vacancy Induced Flat Phonon Mode at FeSe/SrTiO3 interface
Nature Scientific Reports
-
Antiferromagnetic ground state with pair-checkerboard order in FeSe
Physical Review B
-
Chinese edition of <Decoding the City>
Yeeyan Gutenberg Project (译言古登堡计划)
We translated the book <Decoding the City> by Dietmar Offenhuber, a publication of the SENSEable City Lab at MIT that contributes to the current discussion about Big Data and urbanism, into a Chinese edition.
-
Measurement of an Enhanced Superconducting Phase and a Pronounced Anisotropy of the Energy Gap of a Strained FeSe Single Layer in FeSe/Nb:SrTiO3/KTaO3 Heterostructures Using Photoemission Spectroscopy
Physical Review Letters (Top journal in physics community)
-
Interfacial effects on the spin density wave in FeSe/SrTiO3 thin films
Physical Review B
-
Chinese edition of <Bohemia in London>
Yeeyan Gutenberg Project (译言古登堡计划)
We translated the famous book <Bohemia in London> by Arthur Ransome into a Chinese edition.
-
Unexpected large thermal rectification in asymmetric grain boundary of graphene
Solid State Communications
Courses
-
Algorithms for Data Science
CSOR 4246
-
Bayesian Model in Machine Learning
EECS 6720
-
Computer Systems for Data Science
COMS 4121
-
Data Mining
STAT 4240
-
Exploratory Data Analysis and Visualisation
STAT 4701
-
Foundations of Graphical Models
STAT 6701
-
Introduction to Databases
COMS 4111
-
Natural Language Processing
COMS 4705
-
Statistical Machine Learning
STAT 4400
Projects
-
(Kaggle Like) Rang-Tech Data Analytics Competition
This is a Kaggle-like competition that used transaction data to predict active customers.
i. Understanding, cleaning and subsetting the data.
We were not given a background introduction to the data, so we spent some time examining each variable and looking for patterns. We eventually found correlations between variables, which let us group them and reduce their number.
We did not use the features directly; we engineered each feature carefully. For features with outliers, we eliminated the outliers. For features with a very wide range of values, we applied a sqrt transform. We also found that NAs occur in all the records about food, so we trained separate models on the data containing food NAs and on the data without NAs.
ii. Adding new features. Using only the variables from the data, we could not make progress beyond approximately 68% on the public leaderboard. One teammate found a paper describing some interesting features for customer classification from transaction data. Its authors reported that the number of NAs and the number of 0s for each customer are quite important for the final prediction of active customers, so we added these features and the model's result went beyond 69%.
iii. Ensemble methods. We first tried a single model but stalled at around 69%. We then combined 13 kinds of models, covering both parametric and non-parametric machine learning methods. On top of these prediction models, we used a 2-layer, 5-fold stacking method to ensemble the outputs of the first-layer models.
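The per-customer features described in steps i and ii can be sketched roughly as follows. This is only an illustration under assumptions: the field names (`food_spend`, `travel_spend`, `visits`) and the helper `engineer_features` are hypothetical and do not come from the competition data or the original code.

```python
import math

def engineer_features(record):
    """Derive features for one customer from raw fields.

    `record` maps a (hypothetical) field name to a number, or None when missing.
    """
    values = list(record.values())
    features = {
        # Counts of missing values and zeros per customer (step ii).
        "n_missing": sum(1 for v in values if v is None),
        "n_zero": sum(1 for v in values if v == 0),
    }
    for name, v in record.items():
        # Square-root transform compresses wide-ranged, non-negative values (step i).
        if v is not None and v >= 0:
            features["sqrt_" + name] = math.sqrt(v)
    return features

# Made-up example record, not real competition data.
customer = {"food_spend": None, "travel_spend": 400.0, "visits": 0}
print(engineer_features(customer))
```

Missing fields are skipped by the transform rather than imputed, which mirrors the idea of handling records with food NAs separately.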
Entity Resolution Matching between Foursquare and Locu’s dataset
1. Take two datasets from Foursquare and Locu that describe the same entities, and identify which entity in one dataset is the same as an entity in the other dataset.
2. We constructed features from the input datasets: haversine distances for the location information (longitude and latitude); Jaccard similarity scores for 'name' and 'address', computed both over the whole entry and over each character in the entry; and a simple match for 'phone number'. Missing values in 'phone number' and 'address' are also marked by dummy-variable features.
3. In our algorithm, we combined the records in the Locu train dataset and the Foursquare train dataset, featurized the pairs and tagged each pair with whether it is in the matched list. We then used the training data to train a random forest classifier. The number of trees was chosen by cross-validation, and the number of features follows the common "sqrt" rule. We finally chose the random forest classifier with 100 trees according to the cross-validation F1 score.
4. We set a matching threshold of 0.53, chosen by cross-validation. When the random forest classifier returns several matched items for an entity in the test dataset, we keep the match with the highest probability.
5. Our result has precision 100%, recall 98.33% and F1 score 99.16%.
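The pairwise features from step 2 could look something like the sketch below. The record layout (`lat`, `lon`, `name`, `phone` keys), the function names and the example venues are assumptions made for illustration; the original feature code is not shown in the source.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def pair_features(x, y):
    """Features for one candidate (Foursquare, Locu) pair of records."""
    return {
        "distance_km": haversine_km(x["lat"], x["lon"], y["lat"], y["lon"]),
        "name_word_jaccard": jaccard(x["name"].lower().split(), y["name"].lower().split()),
        "name_char_jaccard": jaccard(x["name"].lower(), y["name"].lower()),
        "phone_match": int(x["phone"] is not None and x["phone"] == y["phone"]),
        "phone_missing": int(x["phone"] is None or y["phone"] is None),
    }

# Made-up records for illustration only.
a = {"lat": 40.7580, "lon": -73.9855, "name": "Joe's Pizza", "phone": None}
b = {"lat": 40.7581, "lon": -73.9850, "name": "Joes Pizza NYC", "phone": "2125550100"}
print(pair_features(a, b))
```

A classifier such as a random forest would then be trained on these feature dictionaries, one per candidate pair.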
Using AWS Cloud Platform and Spark Machine Learning to Recommend Music with Last.fm’s Audioscrobbler Data Set
1. Used the data set published by Audioscrobbler, with 24.2 million records of users’ plays of artists, to build a music recommender engine.
2. Implemented the alternating least squares recommender algorithm through MLlib on Spark to build the music recommender.
3. Preprocessed the raw data set using Python functional programming to correct misspelled or nonstandard artist IDs.
4. Used cross-validation on Spark to select the hyperparameters for the matrix factorization model.
5. Deployed the final model on the AWS platform to handle the large amount of data.
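To make the alternating least squares idea concrete, here is a small pure-Python sketch on a toy matrix. It is not the Spark MLlib implementation used in the project: it fits explicit values with L2 regularization, and all names (`solve`, `update_factors`, `als`, `predict`) and the toy play counts are assumptions for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def update_factors(fixed, observed_by_row, rank, reg):
    """Regularized least-squares update of one side with the other side fixed."""
    updated = []
    for observed in observed_by_row:  # list of (other_index, value) pairs
        A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, value in observed:
            f = fixed[other]
            for i in range(rank):
                b[i] += value * f[i]
                for j in range(rank):
                    A[i][j] += f[i] * f[j]
        updated.append(solve(A, b))
    return updated

def als(ratings, n_users, n_items, rank=2, reg=0.1, iters=20, seed=0):
    """Factorize (user, item, value) triples into user and item factor lists."""
    rng = random.Random(seed)
    items = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    users = []
    for _ in range(iters):
        users = update_factors(items, by_user, rank, reg)  # fix items, solve users
        items = update_factors(users, by_item, rank, reg)  # fix users, solve items
    return users, items

def predict(users, items, u, i):
    return sum(a * b for a, b in zip(users[u], items[i]))

# Toy (user, artist, play count) triples, invented for this example.
plays = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
users, items = als(plays, n_users=3, n_items=3)
print(round(predict(users, items, 0, 2), 3))  # score for an unseen user/artist pair
```

The hyperparameters `rank`, `reg` and `iters` are the kind of values that step 4 selects by cross-validation.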
Using Hadoop Hive and MapReduce to Analyze NASA Server Logs
1. Worked with a data set of Apache logs gathered by NASA's server in the months of July-October 1995, around 1 GB in size, stored on HDFS.
2. Created a schema for the dataset in Hive using a regular expression that describes a concrete structure with all the required fields.
3. Plotted the number of requests made per day for every day in the month of October.
4. Wrote a MapReduce job to calculate the total bandwidth by adding up all the response bytes sent by the NASA web server.
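The parsing and aggregation steps can be illustrated in plain Python, outside Hive and Hadoop. The sketch assumes the logs follow the Apache Common Log Format; the regular expression, function names and sample lines are written for this example and are not the project's actual schema or data.

```python
import re
from collections import Counter

# Common Log Format: host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>\d{2})/(?P<month>\w{3})/(?P<year>\d{4}):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

def map_line(line):
    """Map step: turn one log line into (date, response_bytes), or None if malformed."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None
    # A "-" in the bytes field means no response body was sent.
    size = 0 if m.group("bytes") == "-" else int(m.group("bytes"))
    date = f'{m.group("year")}-{m.group("month")}-{m.group("day")}'
    return date, size

def reduce_logs(lines):
    """Reduce step: requests per day and total bytes over all parsed lines."""
    requests_per_day = Counter()
    total_bytes = 0
    for line in lines:
        mapped = map_line(line)
        if mapped is None:
            continue
        date, size = mapped
        requests_per_day[date] += 1
        total_bytes += size
    return requests_per_day, total_bytes

# Synthetic lines in the assumed format, not taken from the NASA data set.
sample = [
    'host1.example.com - - [01/Oct/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839',
    'host2.example.com - - [01/Oct/1995:00:00:09 -0400] "GET /missing.gif HTTP/1.0" 404 -',
    'host1.example.com - - [02/Oct/1995:10:15:00 -0400] "GET /images/logo.gif HTTP/1.0" 200 786',
]
per_day, bandwidth = reduce_logs(sample)
print(dict(per_day), bandwidth)
```

In a real MapReduce job the map and reduce functions would run on separate workers, with the framework grouping the mapped pairs by key.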
-
Zynga Game Payer Prediction and User Pattern Analysis
1. Processed real user data and metrics from the Zynga platform, with 1 million user records and 247 features.
2. Implemented Lasso- and ridge-regularized logistic regression and a random forest method to select the important features for predicting whether a user would be a payer.
3. Ensembled stochastic gradient descent classifiers with perceptron, log and hinge loss functions, the kNN method and the decision tree method, using the selected important features, to predict payers. The precision, recall and F1 score all reach up to 95%.
4. Used the K-means++ method with the important features to cluster user patterns on the Zynga platform. The number of clusters was determined by the elbow method. The clustering reveals the different patterns of paying users, risk-preferring users and mature users.
5. Based on the user patterns, we proposed a strategy of running campaigns across the different groups of users to improve user engagement.
-
Handwriting Recognition by SVM and AdaBoost Supervised Learning with R
1. Processed JPEG data from the USPS open handwriting datasets into matrices with R.
2. Implemented the non-linear SVM method and AdaBoost with R to recognize handwritten digits.
3. Chose the kernel and margin parameters through cross-validation to improve the recognition rate to 90%.
-
Document Text Classification Using Lasso/Ridge Regression and Naïve Bayes
1. Built an efficient Naïve Bayes classifier to attribute papers to Hamilton or Madison, with the help of a natural language processing package for R.
2. Implemented ridge regression, Lasso and mutual information selection, respectively, to remove irrelevant features from the text documents and improve the efficiency of the Bayes classifier.
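A multinomial Naïve Bayes classifier of the kind described in step 1 can be sketched in a few lines. The project used R; this Python version is only an illustration, and the class name, the Laplace smoothing parameter `alpha` and the toy training sentences are assumptions, not the original code or real excerpts from the papers.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.class_counts = Counter(labels)      # label -> number of documents
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        # Tokens never seen in training carry no information and are ignored.
        tokens = [t for t in doc.lower().split() if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for t in tokens:
                score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy sentences standing in for the two authors' papers.
docs = ["upon the whole upon", "whilst the states whilst"]
labels = ["Hamilton", "Madison"]
model = NaiveBayes().fit(docs, labels)
print(model.predict("upon reflection"))
```

Feature selection as in step 2 would shrink `vocab` to the most informative tokens before fitting.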
-
Mining the NYPD Open Datasets to Predict the Danger Area for Car Collision in NYC on AWS Cloud Platform
1. Cleaned, processed and selected features to find correlations between the rate of vehicle collisions and the location, time and weather of the driving route, using R scripts against the API of the NYC open dataset.
2. Applied normalization and PCA to the features, then ran the unsupervised K-means++ method on the AWS Cloud Platform with Spark to obtain a heat map of high-danger areas in NYC for an input time and driving route.
Study the Relation Between Users’ Sentiment and Location Tags in Twitter with SQL and
1. Processed tweets from the Twitter Streaming API to extract tweets with location tags using Python and SQL.
2. Performed sentiment analysis by writing the classifiers in Python: a naive Bayes classifier, a maximum entropy classifier and support vector machines. The NLTK package is used to parse and analyze each tweet.
3. Improved the accuracy of the self-written machine learning classifiers by using bi-grams, tri-grams and word dictionaries. The accuracy is around 80%.
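The n-gram and word-dictionary features from step 3 might be built along these lines. This is a minimal sketch: whitespace tokenization replaces the NLTK parsing used in the project, and the function names and the toy lexicon are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(text, max_n=3, lexicon=None):
    """Count uni-, bi- and tri-grams, plus an optional word-dictionary score."""
    tokens = text.lower().split()
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    if lexicon:
        # Word-dictionary feature: net polarity from a hand-built word list.
        features["lexicon_score"] = sum(lexicon.get(t, 0) for t in tokens)
    return features

toy_lexicon = {"good": 1, "great": 1, "bad": -1}  # hypothetical entries
print(featurize("not good at all", lexicon=toy_lexicon))
```

Bi-grams such as "not good" let a classifier pick up negation that single-word features miss.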
-
Predict the SSE Index by Bouchaud-Sornette option pricing model
Developed C code to implement the Bouchaud-Sornette option pricing model to predict the SSE Index
-
Computational study of the phase transition in the Hexagonal Ising Model
Implemented the Wolff Monte Carlo method in C to study the phase transition in the hexagonal Ising model.
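For readers unfamiliar with the Wolff method, one cluster update can be sketched as below. The original was written in C and its lattice layout is not given; this Python version assumes a honeycomb lattice (three neighbours per site) stored as a periodic "brick wall" grid with an even side length, and all names are hypothetical.

```python
import math
import random

def neighbors(x, y, size):
    """Three neighbours of site (x, y) on a periodic brick-wall honeycomb lattice.

    `size` must be even so that the vertical bonds stay consistent across the boundary.
    """
    vertical = (y + 1) % size if (x + y) % 2 == 0 else (y - 1) % size
    return [((x - 1) % size, y), ((x + 1) % size, y), (x, vertical)]

def wolff_step(spins, size, beta, rng, coupling=1.0):
    """Grow one Wolff cluster from a random seed site and flip it; return its size."""
    p_add = 1.0 - math.exp(-2.0 * beta * coupling)  # bond activation probability
    seed = (rng.randrange(size), rng.randrange(size))
    seed_spin = spins[seed]
    cluster = {seed}
    stack = [seed]
    while stack:
        x, y = stack.pop()
        for site in neighbors(x, y, size):
            # Aligned neighbours join the cluster with probability p_add.
            if site not in cluster and spins[site] == seed_spin and rng.random() < p_add:
                cluster.add(site)
                stack.append(site)
    for site in cluster:
        spins[site] = -seed_spin
    return len(cluster)

def magnetization(spins):
    """Absolute magnetization per site."""
    return abs(sum(spins.values())) / len(spins)

size = 8
rng = random.Random(0)
spins = {(x, y): rng.choice([-1, 1]) for x in range(size) for y in range(size)}
for _ in range(200):
    wolff_step(spins, size, beta=1.0, rng=rng)
print(magnetization(spins))
```

Scanning `beta` and recording the magnetization after many such steps is the usual way to locate the phase transition.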
-
Developing New Global Optimization Algorithm for Material Science with Hadoop
-
1. Proposed a new global optimization algorithm for functional material searching, based on the differential evolution algorithm, using Python.
2. Used the new algorithm to find grain boundary structures with lower formation energy on Hadoop.
3. Designed A/B tests to select the components of the algorithm that make the optimization efficient.
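The baseline that step 1 builds on, standard differential evolution, can be sketched as follows. This is the textbook rand/1/bin variant, not the new algorithm proposed in the project, and the function name, default parameters and test objective are assumptions for illustration.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, weight=0.5,
                           crossover=0.9, generations=200, seed=0):
    """Minimize `objective` over box `bounds` with the rand/1/bin DE scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < crossover:
                    value = pop[a][d] + weight * (pop[b][d] - pop[c][d])
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def sphere(x):
    """Simple convex test objective with its minimum at the origin."""
    return sum(v * v for v in x)

best_x, best_score = differential_evolution(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
print(best_x, best_score)
```

In a materials search the objective would be an energy evaluation of a candidate structure, which is far more expensive than this toy function.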
-
Developing Highly Efficient Algorithms in Scientific Computation
-
1. Developed an efficient algorithm in Python to accelerate parallel large-scale data analysis on Hadoop.
2. Sped up the calculation 10 times without a major loss of accuracy compared with the previous approach.
-
Computational Study of the Energy Transport in Nanostructures, Fudan University
-
1. Developed Python code to simulate thermal transport in graphene-based materials.
2. Used multivariate numerical methods with R to analyze the datasets obtained from the experiments.
3. Designed a new kind of 2D thermal rectifier and published it in a peer-reviewed paper (top cited paper in the journal).
Honors & Awards
-
Rank 91/2070 (top 5%) in Kaggle Two Sigma Financial Model Challenge
Kaggle
Member of a team in the Kaggle Two Sigma Financial Model Challenge. Implemented time-series feature engineering and linear/tree regressors to build a model that achieved a top 5% score on the private leaderboard test dataset.
-
Bronze Medal in HackerRank Coding Contest (top 15%)
HackerRank
Top 15% in the HackerRank Week of Code 23 contest with 10,000+ participants.
-
Rank No.1 among 279 teams in Rang-Technology Data Analytic Competition (Kaggle like data competition)
Rang-Technology
https://rang.shinyapps.io/Competition/
Ranked No. 1 among 279 teams of Master's students from around 50 universities, including CMU, Columbia, Cornell, USC and UIUC, in the Rang-Tech Data Analytics Competition, a Kaggle-like competition that used transaction data to predict active customers.
2015 Web of Science Highly Cited Paper Worldwide (First author)
Thomson Reuters
My first-author paper published in Physical Review B on the theoretical computation of magnetic materials, "Antiferromagnetic ground state with pair-checkerboard order in FeSe", was selected as a "Highly Cited Paper" for the 2015 period.
The paper was selected as being in the top 1% of about 1,170,000 papers published in physics-related subjects. This is one of the most prestigious measures of research impact in scientific research.
National Scholarship
Ministry of Education of People's Republic of China
Top honor for the best academic achievement among graduate students in China.
-
Fellowship for Graduate Student’s Short-term International Visiting (to Lawrence Berkeley National Lab)
Fudan University
Fellowship for excellent graduate students to visit top-class institutions worldwide.
-
Distinguished award for new graduate student
Fudan University
For excellent incoming graduate students.
Languages
-
Mandarin
Native or bilingual proficiency
-
English
Professional working proficiency
Organizations
-
American Physical Society
Member
- Present
Student member of the American Physical Society. Gave two oral talks at the 2013 and 2014 APS Annual March Meetings.
More activity by Haiyuan
-
Today, in collaboration between Microsoft Azure and GitHub we released #GitHub #Models. This will enable 100M developers in GitHub to easily…
Liked by Haiyuan Cao
-
Build AI applications right where you manage your code. With GitHub Models, now more than 100 million developers can access and experiment with top…
Liked by Haiyuan Cao
-
Introducing GitHub Models, a new way for the more than 100 million developers who call GitHub home to build with industry-leading AI models directly…
Liked by Haiyuan Cao
-
Passionate about how AI will shape the world? Join us in building AGI with/for our customers and users. Check out this exciting opportunity for a…
Liked by Haiyuan Cao
-
More than 60,000 organizations are now using our AI Platform, up nearly 60% yoy with avg spend / customer growing. As customers move from…
Liked by Haiyuan Cao
-
Efficiently inferencing and fine-tuning massive models like Llama 3.1 405B requires a synthesis of multiple memory optimizations, parallelism and…
Liked by Haiyuan Cao
-
I left Microsoft last week. Among the many things to celebrate for my ten years there, I am especially grateful for the opportunity to have known and…
Liked by Haiyuan Cao
-
Congratulations to Meta on their latest addition to the Llama family for advanced synthetic data generation and distillation. Llama 3.1 405B is on…
Liked by Haiyuan Cao
-
I've been sharing a lot about new models; today another one is live on Azure AI with Cohere Rerank! We're also adding capabilities to the current…
Liked by Haiyuan Cao
-
Meta's Llama-3.1 405B is available on Azure AI in yet another New Model Day! This is a massive model with really impressive benchmarks competing…
Liked by Haiyuan Cao
-
Now that AI can code, learning to code is more important than ever (and AI is here to help). See guidance for CS classrooms: teachai.org/cs. Let's…
Liked by Haiyuan Cao
-
🎉 Introducing GPT-4o mini, OpenAI's new flagship model now available on Azure AI. GPT-4o mini can reason across vision and text at faster speeds and…
Liked by Haiyuan Cao
-
🚀 Excited to announce our ICML'24 Tutorial: "Mixture-of-Experts in the Era of LLMs: A New Odyssey"! 🗓️ Mon 22 Jul ⏰ 1 — 3 p.m. CEST 📍 Hall A1…
Liked by Haiyuan Cao
-
Everyone loves New Model Day! Bringing GPT4o-mini to Azure AI, a lightning-fast model that crushes GPT3.5 in MMLU. The pace of innovation is…
Liked by Haiyuan Cao
-
We are excited to announce the release of OpenAI’s fastest model, GPT4o mini, also available on Azure AI today! GPT-4o mini is the lowest cost small…
Liked by Haiyuan Cao
-
I've been at Microsoft for almost 8 years now doing the same job (CTO), and spent six years prior at LinkedIn doing one job (SVP Eng/Ops). So I…
Liked by Haiyuan Cao