Mostofa Patwary

South San Francisco, California, United States
2K followers · 500+ connections

About

Research Areas:

Large Foundational Language Model Pretraining, Large Scale Deep…

Experience

  • NVIDIA, Santa Clara, California, United States
  • Santa Clara, California, United States
  • Santa Clara
  • San Francisco Bay Area
  • San Francisco Bay Area
  • Evanston, IL
  • Bergen Area, Norway
  • West Lafayette, IN
  • Utrecht Area, Netherlands
  • Dhaka, Bangladesh

Publications

  • Large Scale Multi-Actor Generative Dialog Modeling

    The 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)

    Non-goal oriented dialog agents (i.e., chatbots) aim to produce varied and engaging conversations with a user; however, they typically exhibit either inconsistent personality across conversations or the average personality of all users. This paper addresses these issues by controlling an agent’s persona upon generation via conditioning on prior conversations of a target actor. In doing so, we are able to utilize more abstract patterns within a person’s speech and better emulate them in generated responses. This work introduces the Generative Conversation Control model, an augmented and fine-tuned GPT-2 language model that conditions on past reference conversations to probabilistically model multi-turn conversations in the actor’s persona. We introduce an accompanying data collection procedure to obtain 10.3M conversations from six months’ worth of Reddit comments. We demonstrate that scaling model sizes from 117M to 8.3B parameters yields an improvement from 23.14 to 13.14 perplexity on 1.7M held-out Reddit conversations. Increasing model scale yielded similar improvements in human evaluations that measure preference of model samples to the held-out target distribution in terms of realism (31% increased to 37% preference), style matching (37% to 42%), grammar and content quality (29% to 42%), and conversation coherency (32% to 40%). We find that conditionally modeling past conversations improves perplexity by 0.47 in automatic evaluations. Through human trials we identify positive trends between conditional modeling and style matching and outline steps to further improve persona control.

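    The core mechanism here is scoring (and generating) a reply conditioned on prior conversations by the target actor. A minimal sketch of that conditioning with the off-the-shelf 117M GPT-2 from the Hugging Face transformers library, an assumed stand-in for the paper's fine-tuned, much larger models; the dialogue strings are invented:

    ```python
    # Hypothetical sketch: measure perplexity of a reply given a reference
    # conversation by the same actor, masking the reference tokens so the
    # loss (and hence perplexity) covers only the reply.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Invented conditioning context and reply (stand-ins for Reddit turns).
    reference = "A: The launch window opens at dawn.\nB: Then we fuel up tonight.\n"
    reply = "A: Agreed, I'll start the checklist now.\n"

    ref_ids = tokenizer(reference, return_tensors="pt").input_ids
    reply_ids = tokenizer(reply, return_tensors="pt").input_ids
    input_ids = torch.cat([ref_ids, reply_ids], dim=1)

    # Label value -100 is ignored by the loss, so only reply tokens count.
    labels = input_ids.clone()
    labels[:, : ref_ids.size(1)] = -100

    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL on the reply
    print("conditional perplexity:", torch.exp(loss).item())
    ```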
  • Language Modeling at Scale

    33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'2019)

    We show how Zipf's Law can be used to scale up language modeling (LM) to take advantage of more training data and more GPUs. LM plays a key role in many important natural language applications such as speech recognition and machine translation. Scaling up LM is important since it is widely accepted by the community that there is no data like more data. Eventually, we would like to train on terabytes (TBs) of text (trillions of words). Modern training methods are far from this goal, because of various bottlenecks, especially memory (within GPUs) and communication (across GPUs). This paper shows how Zipf's Law can address these bottlenecks by grouping parameters for common words and character sequences, because U ≪ N, where U is the number of unique words (types) and N is the size of the training set (tokens). For a local batch size K with G GPUs and a D-dimensional embedding matrix, we reduce the original per-GPU memory and communication asymptotic complexity from Θ(GKD) to Θ(GK + UD). Empirically, we find U ∝ (GK)^0.64 on four publicly available large datasets. When we scale up the number of GPUs to 64, a factor of 8, training time speeds up by factors of up to 6.7× (for character LMs) and 6.3× (for word LMs) with negligible loss of accuracy. Our weak scaling on 192 GPUs on the Tieba dataset shows a 35% improvement in LM prediction accuracy by training on 93 GB of data (2.5× larger than the publicly available SOTA dataset), while taking only 1.25× the training time, compared to 3 GB of the same dataset running on 6 GPUs.

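    The saving comes entirely from U ≪ N: a global batch of G·K tokens touches only U unique embedding rows. A toy sketch with synthetic Zipfian data (all sizes invented) that counts both costs:

    ```python
    # Sketch of the Zipf's-Law saving: only U unique embedding rows (plus
    # G*K token ids) need memory/communication, not G*K dense D-vectors.
    import numpy as np

    rng = np.random.default_rng(0)
    G, K, D, V = 64, 1024, 512, 100_000   # GPUs, local batch, embed dim, vocab

    # Zipfian token ids: P(rank r) proportional to 1/r.
    p = 1.0 / np.arange(1, V + 1)
    p /= p.sum()
    batch = rng.choice(V, size=G * K, p=p)  # global batch, N = G*K tokens

    U = np.unique(batch).size
    dense_cost = G * K * D                  # Θ(GKD): one D-vector per token
    sparse_cost = G * K + U * D             # Θ(GK + UD): ids + unique rows
    print(f"U = {U} of N = {G * K} tokens; saving {dense_cost / sparse_cost:.1f}x")
    ```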
  • Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'17)

    This paper presents the first 15-PetaFLOP Deep Learning system for solving scientific pattern classification problems on contemporary HPC architectures. We develop supervised convolutional architectures for discriminating signals in high-energy physics data as well as semi-supervised architectures for localizing and classifying extreme weather in climate data. Our IntelCaffe-based implementation obtains ∼2 TFLOP/s on a single Cori Phase-II Xeon Phi node. We use a hybrid strategy employing synchronous node groups, while using asynchronous communication across groups. We use this strategy to scale training of a single model to ∼9600 Xeon Phi nodes, obtaining peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s. At scale, our HEP architecture produces state-of-the-art classification accuracy on a dataset with 10M images, exceeding that achieved by selections on high-level physics-motivated features. Our semi-supervised architecture successfully extracts weather patterns in a 15TB climate dataset. Our results demonstrate that Deep Learning can be optimized and scaled effectively on many-core HPC systems.

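    The hybrid strategy, synchronous all-reduce inside a node group with only occasional reconciliation across groups, can be imitated in a toy simulation (objective, sizes, and schedule all invented; the real system exchanges across groups asynchronously rather than on a fixed period):

    ```python
    # Toy simulation: workers inside a group average gradients synchronously
    # each step; groups reconcile their parameter copies every few steps, a
    # stand-in for the paper's asynchronous cross-group communication.
    import numpy as np

    rng = np.random.default_rng(1)
    n_groups, workers_per_group, dim = 4, 8, 16
    target = rng.normal(size=dim)                   # minimum of the toy loss
    params = [rng.normal(size=dim) for _ in range(n_groups)]  # one copy/group
    lr, sync_every = 0.1, 10

    for step in range(1, 101):
        for g in range(n_groups):
            # Synchronous part: all-reduce (mean) of noisy gradients in group g.
            grads = [(params[g] - target) + 0.1 * rng.normal(size=dim)
                     for _ in range(workers_per_group)]
            params[g] -= lr * np.mean(grads, axis=0)
        if step % sync_every == 0:
            # Cross-group reconciliation (periodic here, async in the paper).
            avg = np.mean(params, axis=0)
            params = [avg.copy() for _ in range(n_groups)]

    print("final loss:", np.mean([(p - target) ** 2 for p in params]))
    ```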
  • Deep Learning Scaling is Predictable, Empirically

    This paper presents an empirical characterization of the relationships between training set size, computational scale, and model accuracy, aimed at a deeper understanding of how to advance the state of the art. We introduce a methodology for measuring generalization error and model size as the training set grows, and test it in four machine learning domains: machine translation, language modeling, image processing, and speech recognition. We believe these scaling relationships have significant implications for deep learning research, practice, and systems.

    A short summary is available at http://research.baidu.com/deep-learning-scaling-predictable-empirically/

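    The paper's headline finding is that generalization error falls as a power law in training-set size, ε(m) ≈ α·m^(−β). A sketch that recovers the exponent from synthetic learning-curve data by a log-log linear fit:

    ```python
    # Fit a power-law learning curve eps(m) = a * m**(-b) to synthetic data:
    # on log-log axes a power law is a straight line, so a degree-1 polyfit
    # of log(eps) against log(m) recovers the exponent.
    import numpy as np

    rng = np.random.default_rng(2)
    m = np.logspace(3, 8, 12)                         # training-set sizes
    eps = 5.0 * m ** -0.35 * np.exp(0.02 * rng.normal(size=m.size))  # noisy

    b, log_a = np.polyfit(np.log(m), np.log(eps), 1)  # slope, intercept
    print(f"fitted exponent: {b:.3f} (true -0.35), prefactor: {np.exp(log_a):.2f}")
    ```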
  • Galactos: Computing 3-pt anisotropic correlation for 2B Outer Rim galaxies

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'17)

    The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8PF (peak) and 5.06PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.

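    The spherical-harmonic 3PCF algorithm itself is beyond a snippet, but its first stage, gathering each galaxy's neighbors within r_max via a k-d tree, is easy to show. A toy single-node pair search on random points, with scipy's cKDTree standing in for the paper's load-balanced custom tree:

    ```python
    # Tree-based neighbor search, the entry point of correlation-function
    # estimators: find all galaxy pairs closer than r_max without testing
    # all N*(N-1)/2 pairs. Positions here are uniform random, not survey data.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(3)
    galaxies = rng.uniform(0.0, 100.0, size=(5000, 3))  # toy 3-D positions
    r_max = 5.0

    tree = cKDTree(galaxies)
    pairs = tree.query_pairs(r_max)                     # set of (i, j), i < j
    print(f"{len(pairs)} pairs within r_max = {r_max}")
    ```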
  • GraphPad: Optimized Graph Primitives for Parallel and Distributed Platforms

    30th IEEE International Parallel & Distributed Processing Symposium (IPDPS'16)

    The duality between graphs and matrices means that many common graph analyses can be expressed with primitives such as generalized sparse matrix-vector multiplication (SpMSpV) and sparse matrix-matrix multiplication (SpGEMM). Achieving high performance on these primitives is challenging due to limited arithmetic intensity, irregular memory accesses, and significant network communication requirements in the distributed setting. In this paper we implement four graph applications using GraphPad, our optimized multinode implementations of generalized linear algebra primitives such as SpMSpV and SpGEMM. GraphPad is highly flexible, accommodating multiple data layouts and partitioning strategies, and incorporates communication optimizations. Our performance at scale can exceed that of CombBLAS by up to 40×. In addition to GraphPad's performance in a distributed setting, it is also within 2× of the performance of GraphMat, a high-performance single-node graph framework, on four out of five benchmarks. We also show that our communication optimizations and flexibility are critical for good performance on both HPC clusters and commodity cloud platforms.

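    The graph/matrix duality in the first sentence is easy to make concrete: breadth-first search is repeated sparse matrix-vector multiplication of the adjacency matrix with the current frontier. A toy single-node sketch of what the distributed SpMSpV primitive generalizes (graph invented):

    ```python
    # BFS as sparse matrix-vector products: each multiplication advances the
    # frontier one hop along the graph's edges.
    import numpy as np
    import scipy.sparse as sp

    # Directed 6-node toy graph as a CSR adjacency matrix (A[i, j] = edge i->j).
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    rows, cols = zip(*edges)
    A = sp.csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(6, 6))

    level = np.full(6, -1)
    level[0] = 0                             # BFS root
    frontier = np.zeros(6)
    frontier[0] = 1.0
    d = 0
    while frontier.any():
        reached = A.T @ frontier             # SpMV: one hop from the frontier
        new = (reached > 0) & (level < 0)    # keep only unvisited vertices
        d += 1
        level[new] = d
        frontier = new.astype(float)
    print("BFS levels from vertex 0:", level)
    ```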
  • GraphMat: high performance graph analytics made productive

    International Conference on Very Large Scale Databases (VLDB'15)

  • Scalable Bayesian Optimization Using Deep Neural Networks

    International Conference on Machine Learning, ICML'15

  • Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

    ACM SIGMOD

    Other authors
    • Nadathur Satish
    • Narayanan Sundaram
    • Jiwon Seo
    • Jongsoo Park
    • Muhammad Hassan
    • Shubo Sengupta
    • Zhaoming Yin
    • Pradeep Dubey
  • PARDICLE: Parallel Approximate Density-Based Clustering

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'14)

  • Scalable Parallel OPTICS Data Clustering Using Graph Algorithmic Techniques

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'13)

  • ColPack: Software for graph coloring and related problems in scientific computing

    Journal of ACM Transactions on Mathematical Software (TOMS)

  • A New Scalable Parallel DBSCAN Algorithm Using the Disjoint Set Data Structure

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'12)

  • Multi-core spanning forest algorithms using the disjoint-set data structure

    Proceedings of 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS'12)

  • New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures

    Proceedings of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par'11), Springer LNCS 6853

  • Experiments on Union-Find Algorithms for the Disjoint-Set Data Structure

    Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10), Springer LNCS 6049, pp. 411–423
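
    Several of the entries above build on the disjoint-set (union-find) structure. A minimal sketch with the two classic optimizations such experiments compare, union by rank and path halving (a one-pass variant of path compression):

    ```python
    # Disjoint-set (union-find) with union by rank and path halving.
    class DisjointSet:
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, x):
            # Path halving: point every other visited node at its grandparent.
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return False                  # already in the same set
            if self.rank[ra] < self.rank[rb]: # union by rank: attach the
                ra, rb = rb, ra               # shallower tree under the deeper
            self.parent[rb] = ra
            if self.rank[ra] == self.rank[rb]:
                self.rank[ra] += 1
            return True

    ds = DisjointSet(6)
    for a, b in [(0, 1), (1, 2), (4, 5)]:
        ds.union(a, b)
    print(ds.find(0) == ds.find(2), ds.find(0) == ds.find(4))  # True False
    ```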

  • A Scalable Parallel Union-Find Algorithm for Distributed Memory Computers

    Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics (PPAM'09), Springer LNCS 6067, vol. 1, pp. 186–195

  • Reservation Based Adaptive Uplink Admission Control for WCDMA

    International Conference on Next-Generation Wireless Systems (ICNEWS)

    This project proposed an improved admission control algorithm for WCDMA networks.

  • BD-CATS: Big Data Clustering at Trillion Particle Scale

    International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'15)

  • PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures

    30th IEEE International Parallel & Distributed Processing Symposium (IPDPS'16)

Courses

  • Advanced Algorithmic Techniques
  • Algorithm Engineering
  • Computer Architecture
  • Operating Systems
  • Parallel Programming
  • Software Engineering

Honors & Awards

  • System and Software Research Division Recognition Award

    Intel Labs

    Demonstrating industry-leading graph analytics performance on Xeon and Xeon Phi without sacrificing programmability.

  • Technical Computing Group Employee Recognition Award

    Intel

    Optimization of the HPCG benchmark for TOP500 machines.
