Mostofa Patwary

South San Francisco, California, United States
2K followers · 500+ connections

About

Research Areas:

Large Foundational Language Model Pretraining, Large Scale Deep…

Experience

  • NVIDIA, Santa Clara, California, United States
  • Santa Clara, California, United States
  • Santa Clara
  • San Francisco Bay Area
  • San Francisco Bay Area
  • Evanston, IL
  • Bergen Area, Norway
  • West Lafayette, IN
  • Utrecht Area, Netherlands
  • Dhaka, Bangladesh

Publications

  • Large Scale Multi-Actor Generative Dialog Modeling

    The 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)

    Non-goal oriented dialog agents (i.e., chatbots) aim to produce varied and engaging conversations with a user; however, they typically exhibit either inconsistent personality across conversations or the average personality of all users. This paper addresses these issues by controlling an agent’s persona upon generation via conditioning on prior conversations of a target actor. In doing so, we are able to utilize more abstract patterns within a person’s speech and better emulate them in generated responses. This work introduces the Generative Conversation Control model, an augmented and fine-tuned GPT-2 language model that conditions on past reference conversations to probabilistically model multi-turn conversations in the actor’s persona. We introduce an accompanying data collection procedure to obtain 10.3M conversations from six months’ worth of Reddit comments. We demonstrate that scaling model sizes from 117M to 8.3B parameters yields an improvement from 23.14 to 13.14 perplexity on 1.7M held-out Reddit conversations. Increasing model scale yielded similar improvements in human evaluations that measure preference of model samples to the held-out target distribution in terms of realism (31% increased to 37% preference), style matching (37% to 42%), grammar and content quality (29% to 42%), and conversation coherency (32% to 40%). We find that conditionally modeling past conversations improves perplexity by 0.47 in automatic evaluations. Through human trials we identify positive trends between conditional modeling and style matching and outline steps to further improve persona control.

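    The core mechanism here is scoring (and generating) a reply conditioned on prior conversations by the target actor. A minimal sketch of that conditioning with the off-the-shelf 117M GPT-2 from the Hugging Face transformers library, an assumed stand-in for the paper's fine-tuned, much larger models; the dialogue strings are invented:

    ```python
    # Hypothetical sketch: measure perplexity of a reply given a reference
    # conversation by the same actor, masking the reference tokens so the
    # loss (and hence perplexity) covers only the reply.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Invented conditioning context and reply (stand-ins for Reddit turns).
    reference = "A: The launch window opens at dawn.\nB: Then we fuel up tonight.\n"
    reply = "A: Agreed, I'll start the checklist now.\n"

    ref_ids = tokenizer(reference, return_tensors="pt").input_ids
    reply_ids = tokenizer(reply, return_tensors="pt").input_ids
    input_ids = torch.cat([ref_ids, reply_ids], dim=1)

    # Label value -100 is ignored by the loss, so only reply tokens count.
    labels = input_ids.clone()
    labels[:, : ref_ids.size(1)] = -100

    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL on the reply
    print("conditional perplexity:", torch.exp(loss).item())
    ```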
  • Language Modeling at Scale

    33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'2019)

    We show how Zipf's Law can be used to scale up language modeling (LM) to take advantage of more training data and more GPUs. LM plays a key role in many important natural language applications such as speech recognition and machine translation. Scaling up LM is important since it is widely accepted by the community that there is no data like more data. Eventually, we would like to train on terabytes (TBs) of text (trillions of words). Modern training methods are far from this goal, because of various bottlenecks, especially memory (within GPUs) and communication (across GPUs). This paper shows how Zipf's Law can address these bottlenecks by grouping parameters for common words and character sequences, because U ≪ N, where U is the number of unique words (types) and N is the size of the training set (tokens). For a local batch size K with G GPUs and a D-dimensional embedding matrix, we reduce the original per-GPU memory and communication asymptotic complexity from Θ(GKD) to Θ(GK + UD). Empirically, we find U ∝ (GK)^0.64 on four publicly available large datasets. When we scale up the number of GPUs to 64, a factor of 8, training time speeds up by factors of up to 6.7× (for character LMs) and 6.3× (for word LMs) with negligible loss of accuracy. Our weak scaling on 192 GPUs on the Tieba dataset shows a 35% improvement in LM prediction accuracy by training on 93 GB of data (2.5× larger than the publicly available SOTA dataset), while taking only 1.25× the training time, compared to 3 GB of the same dataset running on 6 GPUs.

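    The saving comes entirely from U ≪ N: a global batch of G·K tokens touches only U unique embedding rows. A toy sketch with synthetic Zipfian data (all sizes invented) that counts both costs:

    ```python
    # Sketch of the Zipf's-Law saving: only U unique embedding rows (plus
    # G*K token ids) need memory/communication, not G*K dense D-vectors.
    import numpy as np

    rng = np.random.default_rng(0)
    G, K, D, V = 64, 1024, 512, 100_000   # GPUs, local batch, embed dim, vocab

    # Zipfian token ids: P(rank r) proportional to 1/r.
    p = 1.0 / np.arange(1, V + 1)
    p /= p.sum()
    batch = rng.choice(V, size=G * K, p=p)  # global batch, N = G*K tokens

    U = np.unique(batch).size
    dense_cost = G * K * D                  # Θ(GKD): one D-vector per token
    sparse_cost = G * K + U * D             # Θ(GK + UD): ids + unique rows
    print(f"U = {U} of N = {G * K} tokens; saving {dense_cost / sparse_cost:.1f}x")
    ```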
  • Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'17)

    This paper presents the first 15-PetaFLOP Deep Learning system for solving scientific pattern classification problems on contemporary HPC architectures. We develop supervised convolutional architectures for discriminating signals in high-energy physics data as well as semi-supervised architectures for localizing and classifying extreme weather in climate data. Our IntelCaffe-based implementation obtains ∼2 TFLOP/s on a single Cori Phase-II Xeon Phi node. We use a hybrid strategy employing synchronous node groups, while using asynchronous communication across groups. We use this strategy to scale training of a single model to ∼9600 Xeon Phi nodes, obtaining peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s. At scale, our HEP architecture produces state-of-the-art classification accuracy on a dataset with 10M images, exceeding that achieved by selections on high-level physics-motivated features. Our semi-supervised architecture successfully extracts weather patterns in a 15TB climate dataset. Our results demonstrate that Deep Learning can be optimized and scaled effectively on many-core HPC systems.

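    The hybrid strategy, synchronous all-reduce inside a node group with only occasional reconciliation across groups, can be imitated in a toy simulation (objective, sizes, and schedule all invented; the real system exchanges across groups asynchronously rather than on a fixed period):

    ```python
    # Toy simulation: workers inside a group average gradients synchronously
    # each step; groups reconcile their parameter copies every few steps, a
    # stand-in for the paper's asynchronous cross-group communication.
    import numpy as np

    rng = np.random.default_rng(1)
    n_groups, workers_per_group, dim = 4, 8, 16
    target = rng.normal(size=dim)                   # minimum of the toy loss
    params = [rng.normal(size=dim) for _ in range(n_groups)]  # one copy/group
    lr, sync_every = 0.1, 10

    for step in range(1, 101):
        for g in range(n_groups):
            # Synchronous part: all-reduce (mean) of noisy gradients in group g.
            grads = [(params[g] - target) + 0.1 * rng.normal(size=dim)
                     for _ in range(workers_per_group)]
            params[g] -= lr * np.mean(grads, axis=0)
        if step % sync_every == 0:
            # Cross-group reconciliation (periodic here, async in the paper).
            avg = np.mean(params, axis=0)
            params = [avg.copy() for _ in range(n_groups)]

    print("final loss:", np.mean([(p - target) ** 2 for p in params]))
    ```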
  • Deep Learning Scaling is Predictable, Empirically

    This paper presents an empirical characterization of the relationships between training set size, computational scale, and model accuracy, aimed at a deeper understanding of how to advance the state of the art. We introduce a methodology for measuring generalization error and model size as the training set grows, and test it in four machine learning domains: machine translation, language modeling, image processing, and speech recognition. We believe these scaling relationships have significant implications for deep learning research, practice, and systems.

    A short summary is available at http://research.baidu.com/deep-learning-scaling-predictable-empirically/

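    The paper's headline finding is that generalization error falls as a power law in training-set size, ε(m) ≈ α·m^(−β). A sketch that recovers the exponent from synthetic learning-curve data by a log-log linear fit:

    ```python
    # Fit a power-law learning curve eps(m) = a * m**(-b) to synthetic data:
    # on log-log axes a power law is a straight line, so a degree-1 polyfit
    # of log(eps) against log(m) recovers the exponent.
    import numpy as np

    rng = np.random.default_rng(2)
    m = np.logspace(3, 8, 12)                         # training-set sizes
    eps = 5.0 * m ** -0.35 * np.exp(0.02 * rng.normal(size=m.size))  # noisy

    b, log_a = np.polyfit(np.log(m), np.log(eps), 1)  # slope, intercept
    print(f"fitted exponent: {b:.3f} (true -0.35), prefactor: {np.exp(log_a):.2f}")
    ```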
  • Galactos: Computing 3-pt anisotropic correlation for 2B Outer Rim galaxies

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'17)

    The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8PF (peak) and 5.06PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.

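    The spherical-harmonic 3PCF algorithm itself is beyond a snippet, but its first stage, gathering each galaxy's neighbors within r_max via a k-d tree, is easy to show. A toy single-node pair search on random points, with scipy's cKDTree standing in for the paper's load-balanced custom tree:

    ```python
    # Tree-based neighbor search, the entry point of correlation-function
    # estimators: find all galaxy pairs closer than r_max without testing
    # all N*(N-1)/2 pairs. Positions here are uniform random, not survey data.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(3)
    galaxies = rng.uniform(0.0, 100.0, size=(5000, 3))  # toy 3-D positions
    r_max = 5.0

    tree = cKDTree(galaxies)
    pairs = tree.query_pairs(r_max)                     # set of (i, j), i < j
    print(f"{len(pairs)} pairs within r_max = {r_max}")
    ```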
  • GraphPad: Optimized Graph Primitives for Parallel and Distributed Platforms

    30th IEEE International Parallel & Distributed Processing Symposium (IPDPS'16)

    The duality between graphs and matrices means that many common graph analyses can be expressed with primitives such as generalized sparse matrix-vector multiplication (SpMSpV) and sparse matrix-matrix multiplication (SpGEMM). Achieving high performance on these primitives is challenging due to limited arithmetic intensity, irregular memory accesses, and significant network communication requirements in the distributed setting. In this paper we implement four graph applications using GraphPad, our optimized multinode implementations of generalized linear algebra primitives such as SpMSpV and SpGEMM. GraphPad is highly flexible, accommodating multiple data layouts and partitioning strategies, and incorporates communication optimizations. Our performance at scale can exceed that of CombBLAS by up to 40×. In addition to GraphPad's performance in a distributed setting, it is also within 2× of the performance of GraphMat, a high-performance single-node graph framework, on four out of five benchmarks. We also show that our communication optimizations and flexibility are critical for good performance on both HPC clusters and commodity cloud platforms.

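    The graph/matrix duality in the first sentence is easy to make concrete: breadth-first search is repeated sparse matrix-vector multiplication of the adjacency matrix with the current frontier. A toy single-node sketch of what the distributed SpMSpV primitive generalizes (graph invented):

    ```python
    # BFS as sparse matrix-vector products: each multiplication advances the
    # frontier one hop along the graph's edges.
    import numpy as np
    import scipy.sparse as sp

    # Directed 6-node toy graph as a CSR adjacency matrix (A[i, j] = edge i->j).
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    rows, cols = zip(*edges)
    A = sp.csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(6, 6))

    level = np.full(6, -1)
    level[0] = 0                             # BFS root
    frontier = np.zeros(6)
    frontier[0] = 1.0
    d = 0
    while frontier.any():
        reached = A.T @ frontier             # SpMV: one hop from the frontier
        new = (reached > 0) & (level < 0)    # keep only unvisited vertices
        d += 1
        level[new] = d
        frontier = new.astype(float)
    print("BFS levels from vertex 0:", level)
    ```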
  • GraphMat: high performance graph analytics made productive

    International Conference on Very Large Scale Databases (VLDB'15)

  • Scalable Bayesian Optimization Using Deep Neural Networks

    International Conference on Machine Learning, ICML'15

  • Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

    ACM SIGMOD

    Other authors
    • Nadathur Satish
    • Narayanan Sundaram
    • Jiwon Seo
    • Jongsoo Park
    • Muhammad Hassan
    • Shubo Sengupta
    • Zhaoming Yin
    • Pradeep Dubey
  • PARDICLE: Parallel Approximate Density-Based Clustering

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'14)

  • Scalable Parallel OPTICS Data Clustering Using Graph Algorithmic Techniques

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'13)

  • ColPack: Software for graph coloring and related problems in scientific computing

    Journal of ACM Transactions on Mathematical Software (TOMS)

  • A New Scalable Parallel DBSCAN Algorithm Using the Disjoint Set Data Structure

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'12)

  • Multi-core spanning forest algorithms using the disjoint-set data structure

    Proceedings of 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS'12)

  • New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures

    Proceedings of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par'11), Springer LNCS 6853

  • Experiments on Union-Find Algorithms for the Disjoint-Set Data Structure

    Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10), Springer LNCS 6049, pp. 411–423
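
    Several of the entries above build on the disjoint-set (union-find) structure. A minimal sketch with the two classic optimizations such experiments compare, union by rank and path halving (a one-pass variant of path compression):

    ```python
    # Disjoint-set (union-find) with union by rank and path halving.
    class DisjointSet:
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, x):
            # Path halving: point every other visited node at its grandparent.
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return False                  # already in the same set
            if self.rank[ra] < self.rank[rb]: # union by rank: attach the
                ra, rb = rb, ra               # shallower tree under the deeper
            self.parent[rb] = ra
            if self.rank[ra] == self.rank[rb]:
                self.rank[ra] += 1
            return True

    ds = DisjointSet(6)
    for a, b in [(0, 1), (1, 2), (4, 5)]:
        ds.union(a, b)
    print(ds.find(0) == ds.find(2), ds.find(0) == ds.find(4))  # True False
    ```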

  • A Scalable Parallel Union-Find Algorithm for Distributed Memory Computers

    Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics (PPAM'09), Springer LNCS 6067, vol. 1, pp. 186–195

  • Reservation Based Adaptive Uplink Admission Control for WCDMA

    International Conference on Next-Generation Wireless Systems (ICNEWS)

    This project proposed an improved admission control algorithm for WCDMA networks.

  • BD-CATS: Big Data Clustering at Trillion Particle Scale

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'15)

  • PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures

    30th IEEE International Parallel & Distributed Processing Symposium (IPDPS'16)

Courses

  • Advanced Algorithmic Techniques
  • Algorithm Engineering
  • Computer Architecture
  • Operating Systems
  • Parallel Programming
  • Software Engineering

Honors & Awards

  • System and Software Research Division Recognition Award

    Intel Labs

    Demonstrating industry-leading graph analytics performance on Xeon and Xeon Phi without sacrificing programmability.

  • Technical Computing Group Employee Recognition Award

    Intel

    Optimization of the HPCG benchmark for TOP500 machines.
