Stylianos I. Venieris

Senior Research Scientist / Head of Distributed AI group @ Samsung AI, Cambridge, UK | AI systems, Deep Learning, FPGAs

London, England, United Kingdom
1K followers · 500+ connections

Publications

  • FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout

    35th Conference on Neural Information Processing Systems (NeurIPS) (Spotlight)

    Federated Learning (FL) has been gaining significant traction across different ML tasks, ranging from vision to keyboard predictions. In large-scale deployments, client heterogeneity is a fact and constitutes a primary problem for fairness, training performance and accuracy. Although significant efforts have been made to tackle statistical data heterogeneity, the diversity in the processing capabilities and network bandwidth of clients, termed system heterogeneity, has remained largely unexplored. Current solutions either disregard a large portion of available devices or set a uniform limit on the model's capacity, restricted by the least capable participants. In this work, we introduce Ordered Dropout, a mechanism that achieves an ordered, nested representation of knowledge in Neural Networks and enables the extraction of lower-footprint submodels without the need for retraining. We further show that for linear maps our Ordered Dropout is equivalent to SVD. We employ this technique, along with a self-distillation methodology, in the realm of FL in a framework called FjORD. FjORD alleviates the problem of client system heterogeneity by tailoring the model width to the client's capabilities. Extensive evaluation on both CNNs and RNNs across diverse modalities shows that FjORD consistently leads to significant performance gains over state-of-the-art baselines, while maintaining its nested structure.

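    The core mechanism is easy to sketch. Below is a minimal PyTorch-style illustration of Ordered Dropout for a single linear layer, masking outputs rather than slicing weights; the class and argument names are mine, not the paper's:

    ```python
    import torch
    import torch.nn as nn

    class OrderedDropoutLinear(nn.Module):
        """Sketch of Ordered Dropout: sampling a width ratio p keeps only the
        *first* ceil(p * width) output units, so every smaller submodel is
        nested inside the larger ones (unlike standard dropout's random mask)."""

        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            self.fc = nn.Linear(in_features, out_features)

        def forward(self, x: torch.Tensor, p: float = 1.0) -> torch.Tensor:
            out = self.fc(x)
            keep = max(1, int(round(p * out.shape[-1])))  # leading p-fraction
            mask = torch.zeros_like(out)
            mask[..., :keep] = 1.0
            return out * mask

    # A low-capability client samples a small p and trains only the nested
    # sub-network; a lower-footprint submodel is later extracted without retraining.
    layer = OrderedDropoutLinear(64, 128)
    y_quarter = layer(torch.randn(8, 64), p=0.25)
    y_full = layer(torch.randn(8, 64), p=1.0)
    ```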
  • How to Reach Real-Time AI on Consumer Devices? Solutions for Programmable and Custom Architectures

    IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP)

    The unprecedented performance of deep neural networks (DNNs) has led to large strides in various Artificial Intelligence (AI) inference tasks, such as object and speech recognition. Nevertheless, deploying such AI models across commodity devices faces significant challenges: large computational cost, multiple performance objectives, hardware heterogeneity and a common need for high accuracy, together pose critical problems to the deployment of DNNs across the various embedded and mobile devices in the wild. As such, we have yet to witness the mainstream usage of state-of-the-art deep learning algorithms across consumer devices. In this paper, we provide preliminary answers to this potentially game-changing question by presenting an array of design techniques for efficient AI systems. We start by examining the major roadblocks when targeting both programmable processors and custom accelerators. Then, we present diverse methods for achieving real-time performance following a cross-stack approach. These span model-, system- and hardware-level techniques, and their combination. Our findings provide illustrative examples of AI systems that do not overburden mobile hardware, while also indicating how they can improve inference accuracy. Moreover, we showcase how custom ASIC- and FPGA-based accelerators can be an enabling factor for next-generation AI applications, such as multi-DNN systems. Collectively, these results highlight the critical need for further exploration as to how the various cross-stack solutions can be best combined in order to bring the latest advances in deep learning close to users, in a robust and efficient manner.

  • OODIn: An Optimised On-Device Inference Framework for Heterogeneous Mobile Devices

    IEEE SMARTCOMP

    Radical progress in the field of deep learning (DL) has led to unprecedented accuracy in diverse inference tasks. As such, deploying DL models across mobile platforms is vital to enable the development and broad availability of the next-generation intelligent apps. Nevertheless, the wide and optimised deployment of DL models is currently hindered by the vast system heterogeneity of mobile devices, the varying computational cost of different DL models and the variability of performance needs across DL applications. This paper proposes OODIn, a framework for the optimised deployment of DL apps across heterogeneous mobile devices. OODIn comprises a novel DL-specific software architecture together with an analytical framework for modelling DL applications that: (1) counteract the variability in device resources and DL models by means of a highly parametrised multi-layer design; and (2) perform a principled optimisation of both model- and system-level parameters through a multi-objective formulation, designed for DL inference apps, in order to adapt the deployment to the user-specified performance requirements and device capabilities. Quantitative evaluation shows that the proposed framework consistently outperforms status-quo designs across heterogeneous devices and delivers up to 4.3x and 3.5x performance gain over highly optimised platform- and model-aware designs respectively, while effectively adapting execution to dynamic changes in resource availability.

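    OODIn's multi-objective optimisation can be pictured as a search over model- and system-level knobs under a user constraint. A deliberately tiny sketch, where `profile` is a hypothetical callable returning measured latency and accuracy for a configuration (all names are illustrative, not the framework's API):

    ```python
    def select_config(configs, latency_budget_ms, profile):
        """Keep the most accurate (model, system) parameter combination whose
        profiled latency fits the user-specified budget."""
        feasible = [c for c in configs if profile(c)["latency_ms"] <= latency_budget_ms]
        return max(feasible, key=lambda c: profile(c)["accuracy"], default=None)

    # Example: a config might pair a quantisation level with a thread count,
    # with `profile` backed by on-device measurements or an analytical model.
    ```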
  • Deep Neural Network-based Enhancement for Image and Video Streaming Systems: A Survey and Future Directions

    ACM Computing Surveys (CSUR)

    Internet-enabled smartphones and ultra-wide displays are transforming a variety of visual apps spanning from on-demand movies and 360° videos to video-conferencing and live streaming. However, robustly delivering visual content under fluctuating networking conditions on devices of diverse capabilities remains an open problem. In recent years, advances in the field of deep learning on tasks such as super-resolution and image enhancement have led to unprecedented performance in generating high-quality images from low-quality ones, a process we refer to as neural enhancement. In this paper, we survey state-of-the-art content delivery systems that employ neural enhancement as a key component in achieving both fast response time and high visual quality. We first present the components and architecture of existing content delivery systems, highlighting their challenges and motivating the use of neural enhancement models as a countermeasure. We then cover the deployment challenges of these models and analyze existing systems and their design decisions in efficiently overcoming these technical challenges. Additionally, we underline the key trends and common approaches across systems that target diverse use-cases. Finally, we present promising future directions based on the latest insights from deep learning research to further boost the quality of experience of content delivery systems.

  • unzipFPGA: Enhancing FPGA-based CNN Engines with On-the-Fly Weights Generation

    29th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM)

    Single computation engines have become a popular design choice for FPGA-based convolutional neural networks (CNNs) enabling the deployment of diverse models without fabric reconfiguration. This flexibility, however, often comes with significantly reduced performance on memory-bound layers and resource underutilisation due to suboptimal mapping of certain layers on the engine's fixed configuration. In this work, we investigate the implications in terms of CNN engine design for a class of models that introduce a pre-convolution stage to decompress the weights at run time. We refer to these approaches as on-the-fly. To minimise the negative impact of limited bandwidth on memory-bound layers, we present a novel hardware component that enables the on-chip on-the-fly generation of weights. We further introduce an input selective processing element (PE) design that balances the load between PEs on suboptimally mapped layers. Finally, we present unzipFPGA, a framework to train on-the-fly models and traverse the design space to select the highest performing CNN engine configuration. Quantitative evaluation shows that unzipFPGA yields an average speedup of 2.14x and 71% over optimised status-quo and pruned CNN engines under constrained bandwidth and up to 3.69x higher performance density over the state-of-the-art FPGA-based CNN accelerators.

  • It's always personal: Using Early Exits for Efficient On-Device CNN Personalisation

    22nd International Workshop on Mobile Systems and Applications (HotMobile)

    On-device machine learning is becoming a reality thanks to the availability of powerful hardware and model compression techniques. Typically, these models are pretrained on large GPU clusters and have enough parameters to generalise across a wide variety of inputs. In this work, we observe that a much smaller, personalised model can be employed to fit a specific scenario, resulting in both higher accuracy and faster execution. Nevertheless, on-device training is extremely challenging, imposing excessive computational and memory requirements even for flagship smartphones. At the same time, on-device data availability might be limited and samples are most frequently unlabelled. To this end, we introduce PersEPhonEE, a framework that attaches early exits on the model and personalises them on-device. These allow the model to progressively bypass a larger part of the computation as more personalised data become available. Moreover, we introduce an efficient on-device algorithm that trains the early exits in a semi-supervised manner at a fraction of the whole network's personalisation time. Results show that PersEPhonEE boosts accuracy by up to 15.9% while dropping the training cost by up to 2.2x and inference latency by 2.2-3.2x on average for the same accuracy, depending on the availability of labels on-device.

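    The inference-time behaviour of an early-exit model is compact enough to sketch. This single-sample, PyTorch-style fragment is a generic illustration (PersEPhonEE's on-device, semi-supervised personalisation of the exits is not shown), and `blocks`/`exit_heads` are hypothetical module lists:

    ```python
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def early_exit_forward(blocks, exit_heads, x, threshold=0.9):
        """Run backbone blocks in order; leave at the first exit classifier
        whose softmax confidence clears `threshold` (assumes batch size 1 and
        at least one block), bypassing the remaining computation."""
        for block, head in zip(blocks, exit_heads):
            x = block(x)
            conf, pred = F.softmax(head(x), dim=-1).max(dim=-1)
            if conf.item() >= threshold:   # confident enough: exit early
                return pred.item()
        return pred.item()                 # final exit: no threshold applied
    ```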
  • Neural Enhancement in Content Delivery Systems: The State-of-the-Art and Future Directions

    1st Workshop on Distributed Machine Learning (DistributedML), CoNEXT

    Internet-enabled smartphones and ultra-wide displays are transforming a variety of visual apps spanning from on-demand movies and 360° videos to video-conferencing and live streaming. However, robustly delivering visual content under fluctuating networking conditions on devices of diverse capabilities remains an open problem. In recent years, advances in the field of deep learning on tasks such as super-resolution and image enhancement have led to unprecedented performance in generating high-quality images from low-quality ones, a process we refer to as neural enhancement. In this paper, we survey state-of-the-art content delivery systems that employ neural enhancement as a key component in achieving both fast response time and high visual quality. We first present the deployment challenges of neural enhancement models. We then cover systems targeting diverse use-cases and analyze their design decisions in overcoming technical challenges. Moreover, we present promising directions based on the latest insights from deep learning research to further boost the quality of experience of these systems.

  • HAPI: Hardware-Aware Progressive Inference

    IEEE/ACM International Conference on Computer-Aided Design (ICCAD)

    Multi-exit CNNs, or progressive inference networks, are an emerging approach for delivering inference with an adaptive accuracy-complexity trade-off. Nevertheless, existing studies on early exiting have primarily focused on the training scheme, without considering the use-case requirements or the deployment platform. This work presents HAPI, a hardware-aware methodology for generating optimised high-performance multi-exit networks. At its core lies a synchronous dataflow modelling framework that, in contrast to conventional modelling for *static* CNNs, is capable of capturing the *dynamic* conditional execution of multi-exit CNNs. By explicitly considering the target hardware, HAPI efficiently traverses the design space and yields a multi-exit network, tailored to the application's multi-objective needs. Quantitative evaluation shows that our system consistently outperforms alternative search mechanisms and state-of-the-art early-exit schemes across various latency budgets. Moreover, it pushes further the performance of highly optimised hand-crafted early-exit CNNs, delivering up to 5.11× speedup over lightweight models on imposed latency-driven SLAs for embedded devices. Overall, this work shows how multi-exit CNNs together with hardware-aware customisation can be key enablers for meeting the performance goals of AI applications across diverse platforms.

  • Caffe Barista: Brewing Caffe with FPGAs in the Training Loop

    30th International Conference on Field-Programmable Logic and Applications (FPL)

    As the complexity of deep learning (DL) models increases, their compute requirements increase accordingly. Deploying a Convolutional Neural Network (CNN) involves two phases: training and inference. With the inference task typically taking place on resource-constrained devices, a lot of research has explored the field of low-power inference on custom hardware accelerators. On the other hand, training is both more compute- and memory-intensive and is primarily performed on power-hungry GPUs in large-scale data centres. CNN training on FPGAs is a nascent field of research. This is primarily due to the lack of tools to easily prototype and deploy various hardware and/or algorithmic techniques for power-efficient CNN training. This work presents Barista, an automated toolflow that provides seamless integration of FPGAs into the training of CNNs within the popular deep learning framework Caffe. To the best of our knowledge, this is the only tool that allows for such versatile and rapid deployment of hardware and algorithms for the FPGA-based training of CNNs, providing the necessary infrastructure for further research and development.

  • Countering Acoustic Adversarial Attacks in Microphone-equipped Smart Home Devices

    Proc. of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT/UbiComp)

    Deep neural networks (DNNs) continue to demonstrate superior generalization performance in an increasing range of applications, including speech recognition and image understanding. Recent innovations in compression algorithms, design of efficient architectures and hardware accelerators have prompted a rapid growth in deploying DNNs on mobile and IoT devices to redefine user experiences. Relying on the superior inference quality of DNNs, various voice-enabled devices have started to pervade our everyday lives and are increasingly used for, e.g., opening and closing doors, starting or stopping washing machines, ordering products online, and authenticating monetary transactions. As the popularity of these voice-enabled services increases, so does their risk of being attacked. Recently, DNNs have been shown to be extremely brittle under adversarial attacks and people with malicious intentions can potentially exploit this vulnerability to compromise DNN-based voice-enabled systems. Although some existing work already highlights the vulnerability of audio models, very little is known of the behaviour of compressed on-device audio models under adversarial attacks. This paper bridges this gap by investigating thoroughly the vulnerabilities of compressed audio DNNs and makes a stride towards making compressed models robust. In particular, we propose a stochastic compression technique that generates compressed models with greater robustness to adversarial attacks. We present an extensive set of evaluations on adversarial vulnerability and robustness of DNNs in two diverse audio recognition tasks, while considering two popular attack algorithms: FGSM and PGD. We found that error rates of conventionally trained audio DNNs under attack can be as high as 100%. Under both white- and black-box attacks, our proposed approach is found to decrease the error rate of DNNs under attack by a large margin.

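    Of the two attack algorithms evaluated, FGSM is simple enough to sketch in a few lines of PyTorch. This is the generic textbook formulation, not the paper's evaluation code, and `eps` is an arbitrary illustrative budget:

    ```python
    import torch

    def fgsm(model, loss_fn, x, y, eps=0.008):
        """Fast Gradient Sign Method: take one step of size eps along the
        sign of the loss gradient with respect to the input."""
        x_adv = x.clone().detach().requires_grad_(True)
        loss_fn(model(x_adv), y).backward()
        return (x_adv + eps * x_adv.grad.sign()).detach()
    ```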
  • SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud

    26th Annual International Conference on Mobile Computing and Networking (MobiCom)

    Despite the soaring use of convolutional neural networks (CNNs) in mobile applications, uniformly sustaining high-performance inference on mobile has been elusive due to the excessive computational demands of modern CNNs and the increasing diversity of deployed devices. A popular alternative comprises offloading CNN processing to powerful cloud-based servers. Nevertheless, by relying on the cloud to produce outputs, emerging mission-critical and high-mobility applications, such as drone obstacle avoidance or interactive applications, can suffer from the dynamic connectivity conditions and the uncertain availability of the cloud. In this paper, we propose SPINN, a distributed inference system that employs synergistic device-cloud computation together with a progressive inference method to deliver fast and robust CNN inference across diverse settings. The proposed system introduces a novel scheduler that co-optimises the early-exit policy and the CNN splitting at run time, in order to adapt to dynamic conditions and meet user-defined service-level requirements. Quantitative evaluation illustrates that SPINN outperforms its state-of-the-art collaborative inference counterparts by up to 2× in achieved throughput under varying network conditions, reduces the server cost by up to 6.8× and improves accuracy by 20.7% under latency constraints, while providing robust operation under uncertain connectivity conditions and significant energy savings compared to cloud-centric execution.

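    The device-cloud synergy can be sketched as follows. In the actual system the scheduler co-optimises the split point and exit threshold at run time; here they are fixed arguments, and `cloud_stub` is a hypothetical placeholder for shipping the intermediate features to the server:

    ```python
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def split_inference(device_stages, cloud_stub, exit_head, x, split, thr):
        """Run the first `split` stages on-device; answer locally if the early
        exit is confident enough, otherwise offload the features (single-sample
        sketch)."""
        for stage in device_stages[:split]:
            x = stage(x)
        conf, pred = F.softmax(exit_head(x), dim=-1).max(dim=-1)
        if conf.item() >= thr:
            return pred.item()        # served entirely on-device
        return cloud_stub(x)          # e.g. an RPC carrying the features
    ```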
  • Journey Towards Tiny Perceptual Super-Resolution

    16th European Conference on Computer Vision (ECCV)

    Recent works in single-image perceptual super-resolution (SR) have demonstrated unprecedented performance in generating realistic textures by means of deep convolutional networks. However, these convolutional models are excessively large and expensive, hindering their effective deployment to end devices. In this work, we propose a neural architecture search (NAS) approach that integrates NAS and generative adversarial networks (GANs) with recent advances in perceptual SR and pushes the efficiency of small perceptual SR models to facilitate on-device execution. Specifically, we search over the architectures of both the generator and the discriminator sequentially, highlighting the unique challenges and key observations of searching for an SR-optimized discriminator and comparing them with existing discriminator architectures in the literature. Our tiny perceptual SR (TPSR) models outperform SRGAN and EnhanceNet on both full-reference perceptual metric (LPIPS) and distortion metric (PSNR) while being up to 26.4× more memory efficient and 33.6× more compute efficient respectively.

  • Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs

    37th International Conference on Machine Learning (ICML)

    Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks, limiting the productivity and experimentation of deep learning practitioners. As networks grow in size and complexity, training time can be reduced through low-precision data representations and computations. However, in doing so the final accuracy suffers due to the problem of vanishing gradients. Existing state-of-the-art methods combat this issue by means of a mixed-precision approach utilising two different precision levels, FP32 (32-bit floating-point) and FP16/FP8 (16-/8-bit floating-point), leveraging the hardware support of recent GPU architectures for FP16 operations to obtain performance gains. This work pushes the boundary of quantised training by employing a multilevel optimisation approach that utilises multiple precisions including low-precision fixed-point representations. The novel training strategy, MuPPET, combines the use of multiple number representation regimes together with a precision-switching mechanism that decides at run time the transition point between precision regimes. Overall, the proposed strategy tailors the training process to the hardware-level capabilities of the target hardware architecture and yields improvements in training time and energy efficiency compared to state-of-the-art approaches. Applying MuPPET on the training of AlexNet, ResNet18 and GoogLeNet on ImageNet (ILSVRC12) and targeting an NVIDIA Turing GPU, MuPPET achieves the same accuracy as standard full-precision training with training-time speedup of up to 1.84× and an average speedup of 1.58× across the networks.

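    The precision-switching idea can be caricatured as follows; `train_epoch` is a hypothetical user-supplied callable, and a simple plateau heuristic stands in for the paper's actual run-time switching criterion:

    ```python
    def precision_switching_training(train_epoch, precisions=(8, 12, 16, 32),
                                     patience=2, epochs=90):
        """Train in low-precision fixed-point first and move to the next
        regime whenever the progress metric (e.g. validation loss) returned
        by `train_epoch` stops improving; finish in FP32."""
        level, stall, best = 0, 0, float("inf")
        for _ in range(epochs):
            metric = train_epoch(bits=precisions[level])
            if metric < best - 1e-3:
                best, stall = metric, 0
            else:
                stall += 1
            if stall >= patience and level < len(precisions) - 1:
                level, stall, best = level + 1, 0, float("inf")  # switch precision
    ```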
  • A Throughput-Latency Co-Optimised Cascade of Convolutional Neural Network Classifiers

    International Conference on Design, Automation and Test in Europe (DATE)

    Convolutional Neural Networks constitute a prominent AI model for classification tasks, serving a broad span of diverse application domains. To enable their efficient deployment in real-world tasks, the inherent redundancy of CNNs is frequently exploited to eliminate unnecessary computational costs. Driven by the fact that not all inputs require the same amount of computation to drive a confident prediction, multi-precision cascade classifiers have been recently introduced. FPGAs comprise a promising platform for the deployment of such input-dependent computation models, due to their enhanced customisation capabilities. Current literature, however, is limited to throughput-optimised cascade implementations, employing large batching at the expense of a substantial latency aggravation prohibiting their deployment in real-time scenarios. In this work, we introduce a novel methodology for throughput-latency co-optimised cascaded CNN classification, deployed on a custom FPGA architecture tailored to the target application and deployment platform, with respect to a set of user-specified requirements on accuracy and performance. Our experiments indicate that the proposed approach achieves throughput gains comparable to related state-of-the-art works, under substantially reduced overhead in latency, enabling its deployment in latency-sensitive applications.

  • Power-Aware FPGA Mapping of Convolutional Neural Networks

    International Conference on Field-Programmable Technology (ICFPT)

    With an unprecedented accuracy in numerous AI tasks, convolutional neural networks (CNNs) are rapidly deployed on power-limited mobile and embedded applications. Existing mapping approaches focus on achieving high performance without explicit consideration of power consumption, leading to suboptimal solutions when power is considered in a subsequent stage. In this context, there is an emerging need for power-aware methodologies for the design of custom CNN engines. In this work, a methodology is presented for modelling the power consumption of FPGA-based CNN accelerators using a high-level description of modules, together with a power-centric search strategy for exploring power-performance trade-offs within the CNN-to-FPGA design space. By integrating into an existing CNN-to-FPGA toolflow, the proposed power estimation method can yield a prediction accuracy of 93.4% for total system power consumption. Furthermore, it is demonstrated that the associated power-oriented exploration approach can generate CNN accelerators with a 20.1% power reduction over a purely throughput-driven design for AlexNet, maintaining the design's throughput.

  • MobiSR: Efficient On-Device Super-resolution through Heterogeneous Mobile Processors

    25th Annual International Conference on Mobile Computing and Networking (MobiCom)

    In recent years, convolutional networks have demonstrated unprecedented performance in the image restoration task of super-resolution (SR). SR entails the upscaling of a single low-resolution image in order to meet application-specific image quality demands and plays a key role in mobile devices. To comply with privacy regulations and reduce the overhead of cloud computing, executing SR models locally on-device constitutes a key alternative approach. Nevertheless, the excessive compute and memory requirements of SR workloads pose a challenge in mapping SR networks on resource-constrained mobile platforms. This work presents MobiSR, a novel framework for performing efficient super-resolution on-device. Given a target mobile platform, the proposed framework considers popular model compression techniques and traverses the design space to reach the highest performing trade-off between image quality and processing speed. At run time, a novel scheduler dispatches incoming image patches to the appropriate model-engine pair based on the patch's estimated upscaling difficulty in order to meet the required image quality with minimum processing latency. Quantitative evaluation shows that the proposed framework yields on-device SR designs that achieve an average speedup of 2.13x over highly-optimized parallel difficulty-unaware mappings and 4.79x over highly-optimized single compute engine implementations.

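    The scheduler's dispatch step can be sketched as below; total variation is a crude stand-in for the paper's upscaling-difficulty estimator, the engine names are illustrative, and patches are assumed to be 2D grayscale arrays:

    ```python
    import numpy as np

    def dispatch_patch(patch, fast_engine, accurate_engine, tv_threshold=0.05):
        """Send easy patches to the compact model-engine pair and hard ones
        to the accurate pair, trading quality for latency per patch."""
        tv = np.abs(np.diff(patch, axis=0)).mean() + np.abs(np.diff(patch, axis=1)).mean()
        engine = accurate_engine if tv > tv_threshold else fast_engine
        return engine(patch)
    ```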
  • Towards Efficient On-Board Deployment of Deep Neural Networks on Intelligent Autonomous Systems

    IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

    With their unprecedented performance in major AI tasks, deep neural networks (DNNs) have emerged as a primary building block in modern autonomous systems. Intelligent systems such as drones, mobile robots and driverless cars largely base their perception, planning and application-specific tasks on DNN models. Nevertheless, due to the nature of these applications, such systems require on-board local processing in order to retain their autonomy and meet latency and throughput constraints. In this respect, the large computational and memory demands of DNN workloads pose a significant barrier to their deployment on the resource- and power-constrained compute platforms that are available on-board. This paper presents an overview of recent methods and hardware architectures that address the system-level challenges of modern DNN-enabled autonomous systems at both the algorithmic and hardware design level. Spanning from latency-driven approximate computing techniques to high-throughput mixed-precision cascaded classifiers, the presented set of works paves the way for the onboard deployment of sophisticated DNN models on robots and autonomous systems.

  • EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices

    3rd International Workshop on Embedded and Mobile Deep Learning (EMDL), MobiSys

    In recent years, advances in deep learning have resulted in unprecedented leaps in diverse tasks spanning from speech and object recognition to context awareness and health monitoring. As a result, an increasing number of AI-enabled applications are being developed targeting ubiquitous and mobile devices. While deep neural networks (DNNs) are getting bigger and more complex, they also impose a heavy computational and energy burden on the host devices, which has led to the integration of various specialized processors in commodity devices. Given the broad range of competing DNN architectures and the heterogeneity of the target hardware, there is an emerging need to understand the compatibility between DNN-platform pairs and the expected performance benefits on each platform. This work attempts to demystify this landscape by systematically evaluating a collection of state-of-the-art DNNs on a wide variety of commodity devices. In this respect, we identify potential bottlenecks in each architecture and provide important guidelines that can assist the community in the co-design of more efficient DNNs and accelerators.

  • fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs

    IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

    Since the renaissance of neural networks, convolutional neural networks (ConvNets) have demonstrated a state-of-the-art performance in several emerging artificial intelligence tasks. The deployment of ConvNets in real-life applications requires power-efficient designs that meet the application-level performance needs. In this context, field-programmable gate arrays (FPGAs) can provide a potential platform that can be tailored to application-specific requirements. However, with the complexity of ConvNet models increasing rapidly, the ConvNet-to-FPGA design space becomes prohibitively large. This paper presents fpgaConvNet, an end-to-end framework for the optimized mapping of ConvNets on FPGAs. The proposed framework comprises an automated design methodology based on the synchronous dataflow (SDF) paradigm and defines a set of SDF transformations in order to efficiently navigate the architectural design space. By proposing a systematic multiobjective optimization formulation, the presented framework is able to generate hardware designs that are co-optimized for the ConvNet workload, the target device, and the application's performance metric of interest. Quantitative evaluation shows that the proposed methodology yields hardware designs that improve the performance by up to 6.65x over highly optimized graphics processing unit designs for the same power constraints and achieve up to 2.94x higher performance density compared with the state-of-the-art FPGA-based ConvNet architectures.

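    The SDF view of a streaming accelerator reduces, at its crudest, to "steady-state throughput is set by the slowest stage". The toy model below is only meant to convey that intuition; the real fpgaConvNet formulation also captures rates, resource constraints and reconfiguration, and all numbers here are made up:

    ```python
    def streaming_throughput(stages, clock_mhz=200.0):
        """Each stage is (operations per input, parallel units); the stage
        needing the most cycles per input bounds the pipeline's throughput."""
        worst_cycles = max(ops / par for ops, par in stages)
        return clock_mhz * 1e6 / worst_cycles   # inputs per second

    # Raising the slowest stage's parallelism (at a resource cost) is exactly
    # the kind of trade-off a design-space search navigates.
    print(streaming_throughput([(1.2e6, 64), (3.4e6, 128), (0.8e6, 32)]))
    ```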
  • Deploying Deep Neural Networks in the Embedded Space

    2nd International Workshop on Embedded and Mobile Deep Learning (EMDL)

    Recently, Deep Neural Networks (DNNs) have emerged as the dominant model across various AI applications. In the era of IoT and mobile systems, the efficient deployment of DNNs on embedded platforms is vital to enable the development of intelligent applications. This paper summarises our recent work on the optimised mapping of DNNs on embedded settings. By covering such diverse topics as DNN-to-accelerator toolflows, high-throughput cascaded classifiers and domain-specific model design, the presented set of works aims to enable the deployment of sophisticated deep learning models on cutting-edge mobile and embedded systems.

  • CascadeCNN: Pushing the Performance Limits of Quantisation in Convolutional Neural Networks

    28th International Conference on Field Programmable Logic & Applications (FPL)

    This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, aiming to perform high-throughput inference. A two-stage architecture tailored for any given CNN-FPGA pair is generated, consisting of a low- and high-precision unit in a cascade. A confidence evaluation unit is employed to identify misclassified cases from the excessively low-precision unit and forward them to the high-precision unit for re-processing. Experiments demonstrate that the proposed toolflow can achieve a performance boost of up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy, without the need to retrain the model or access the training data.

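    The cascade's control flow fits in a few lines of NumPy-style pseudocode; `low_prec` and `high_prec` are hypothetical callables returning softmax outputs, standing in for the generated low- and high-precision FPGA units:

    ```python
    import numpy as np

    def cascade_infer(low_prec, high_prec, batch, conf_thr=0.8):
        """Classify everything with the cheap low-precision unit, flag
        low-confidence (likely misclassified) samples via the confidence
        evaluation step, and re-process only those at high precision."""
        probs = low_prec(batch)               # (N, C) softmax outputs
        preds = probs.argmax(axis=1)
        redo = probs.max(axis=1) < conf_thr   # confidence evaluation unit
        if redo.any():
            preds[redo] = high_prec(batch[redo]).argmax(axis=1)
        return preds
    ```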
  • f-CNNx: A Toolflow for Mapping Multiple Convolutional Neural Networks on FPGAs

    28th International Conference on Field Programmable Logic & Applications (FPL)

    The predictive power of Convolutional Neural Networks (CNNs) has been an integral factor for emerging latency-sensitive applications, such as autonomous drones and vehicles. Such systems employ multiple CNNs, each one trained for a particular task. The efficient mapping of multiple CNNs on a single FPGA device is a challenging task as the allocation of compute resources and external memory bandwidth needs to be optimised at design time. This paper proposes f-CNNx, an automated toolflow for the optimised mapping of multiple CNNs on FPGAs, comprising a novel multi-CNN hardware architecture together with an automated design space exploration method that considers the user-specified performance requirements for each model to allocate compute resources and generate a synthesisable accelerator. Moreover, f-CNNx employs a novel scheduling algorithm that alleviates the limitations of the memory bandwidth contention between CNNs and sustains the high utilisation of the architecture. Experimental evaluation shows that f-CNNx's designs outperform contention-unaware FPGA mappings by up to 50% and deliver up to 6.8x higher performance-per-Watt over highly optimised GPU designs for multi-CNN systems.

  • Approximate FPGA-based LSTMs under Computation Time Constraints

    14th International Symposium on Applied Reconfigurable Computing (ARC)

    Recurrent Neural Networks, with the prominence of Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. Nevertheless, the highest performing LSTM models are becoming increasingly demanding in terms of computational and memory load. At the same time, emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed system required up to 6.5x less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25x higher accuracy under the same computation time constraints.

  • DroNet: Efficient Convolutional Neural Network Detector for Real-Time UAV Applications

    International Conference on Design, Automation and Test in Europe (DATE)

    Unmanned Aerial Vehicles (drones) are emerging as a promising technology for both environmental and infrastructure monitoring, with broad use in a plethora of applications. Many such applications require the use of computer vision algorithms in order to analyse the information captured from an on-board camera. Such applications include detecting vehicles for emergency response and traffic monitoring. This paper, therefore, explores the trade-offs involved in the development of a single-shot object detector based on deep convolutional neural networks (CNNs) that can enable UAVs to perform vehicle detection in a resource-constrained environment such as a UAV. The paper presents a holistic approach for designing such systems; the data collection and training stages, the CNN architecture, and the optimizations necessary to efficiently map such a CNN on a lightweight embedded processing platform suitable for deployment on UAVs. Through the analysis we propose a CNN architecture that is capable of detecting vehicles from aerial UAV images and can operate at 5-18 frames per second for a variety of platforms with an overall accuracy of ~95%. Overall, the proposed architecture is suitable for UAV applications, utilizing low-power embedded processors that can be deployed on commercial UAVs.

  • CascadeCNN: Pushing the performance limits of quantisation

    SysML

    This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, to perform high-throughput inference by exploiting the computation time-accuracy trade-off. Without the need for retraining, a two-stage architecture tailored for any given FPGA device is generated, consisting of a low- and a high-precision unit. A confidence evaluation unit is employed between them to identify misclassified cases at run time and forward them to the high-precision unit or terminate computation. Experiments demonstrate that CascadeCNN achieves a performance boost of up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy.

  • Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    ACM Computing Surveys (CSUR)

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, an evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.

  • fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs

    NIPS 2017 Workshop on Machine Learning on the Phone and other Consumer Devices (MLPCD)

    In recent years, Convolutional Neural Networks (ConvNets) have become an enabling technology for a wide range of novel embedded Artificial Intelligence systems. Across the range of applications, the performance needs vary significantly, from high-throughput video surveillance to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on the different performance needs. However, the complexity of ConvNet models keeps increasing, making their mapping to an FPGA device a challenging task. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of SDF transformations in order to efficiently explore the architectural design space. By selectively optimising for throughput, latency or multiobjective criteria, the presented tool is able to efficiently explore the design space and generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall, our framework yields designs that improve the performance by up to 6.65x over highly optimised embedded GPU designs for the same power constraints in embedded environments.

  • Latency-Driven Design for FPGA-based Convolutional Neural Networks

    27th International Conference on Field-programmable Logic and Applications (FPL)

    In recent years, Convolutional Neural Networks (ConvNets) have become the quintessential component of several state-of-the-art Artificial Intelligence tasks. Across the spectrum of applications, the performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on different performance requirements. However, with the increasing complexity of ConvNet models, the architectural design space becomes overwhelmingly large, calling for principled design flows that address the application-level needs. This paper presents a latency-driven design methodology for mapping ConvNets on FPGAs. The proposed design flow employs novel transformations over a Synchronous Dataflow-based modelling framework together with a latency-centric optimisation procedure in order to efficiently explore the design space targeting low-latency designs. Quantitative evaluation shows large improvements in latency when latency-driven optimisation is in place, yielding designs that improve the latency of AlexNet by 73.54x and VGG16 by 5.61x over throughput-optimised designs.

  • fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs

    24th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM)

    Convolutional Neural Networks (ConvNets) are a powerful Deep Learning model, providing state-of-the-art accuracy to many emerging classification problems. However, ConvNet classification is a computationally heavy task, suffering from rapid complexity scaling. This paper presents fpgaConvNet, a novel domain-specific modelling framework together with an automated design methodology for the mapping of ConvNets onto reconfigurable FPGA-based platforms. By interpreting ConvNet classification as a streaming application, the proposed framework employs the Synchronous Dataflow (SDF) model of computation as its basis and proposes a set of transformations on the SDF graph that explore the performance-resource design space, while taking into account platform-specific resource constraints. A comparison with existing ConvNet FPGA works shows that the proposed fully-automated methodology yields hardware designs that improve the performance density by up to 1.62x and reach up to 90.75% of the raw performance of architectures that are hand-tuned for particular ConvNets.

  • Towards Heterogeneous Solvers for Large-Scale Linear Systems

    25th International Conference on Field-programmable Logic and Applications (FPL)

    Applying Linear Regression to systems with a massive amount of observations, a scenario which is becoming increasingly common in the era of Big Data, poses major algorithmic and computational challenges. This paper proposes a novel high-performance FPGA-based architecture for large-scale Linear Regression problems, as well as a heterogeneous system comprising the custom FPGA architecture, an enhanced GPU module and a multi-core CPU for addressing the aforementioned problem. The system adaptively assigns Linear Regression workloads to the three computing devices to minimise runtime. The device with the highest performance is chosen based on an analytical framework, as well as the workload's size and structure. A quantitative comparison with existing FPGA, GPU and multi-core CPU designs yields speed-ups of up to 18.07x, 32.67x and 25.84x respectively.

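    The dispatch logic amounts to evaluating each device's analytical runtime model on the workload's size and taking the argmin. A toy sketch with entirely made-up cost models:

    ```python
    def pick_device(n_obs, n_feats, runtime_models):
        """`runtime_models` maps a device name to a predicted-runtime callable
        (e.g. fitted offline); return the device expected to finish first."""
        return min(runtime_models, key=lambda dev: runtime_models[dev](n_obs, n_feats))

    # Hypothetical cost models, seconds as a function of workload size:
    models = {
        "fpga": lambda n, d: 5e-9 * n * d,
        "gpu":  lambda n, d: 2e-3 + 1e-9 * n * d,
        "cpu":  lambda n, d: 8e-9 * n * d,
    }
    print(pick_device(10_000_000, 64, models))
    ```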

Honors & Awards

  • EPSRC Doctoral Training Studentship

    Engineering and Physical Sciences Research Council (EPSRC)

    Awarded a Doctoral Training Studentship by the EPSRC for my PhD studies, 2014-2018.

  • Governors' MEng Prize

    Department of Electrical and Electronic Engineering, Imperial College London

    Awarded to the top graduating student in the MEng course in Electrical and Electronic Engineering, 2014.

  • Faculty of Engineering Bursary

    Faculty of Engineering, Imperial College London

    Awarded by the Faculty of Engineering to conduct a summer research placement under the Undergraduate Research Opportunity (UROP) scheme of Imperial College London, 2013.

  • Imperial College Engineering Dean’s List

    Faculty of Engineering, Imperial College London

    Placed on the Imperial College Engineering Dean’s List as a record of academic achievement in 2012 (top 10% of the class), 2013 (2nd in class) and 2014 (1st in class).
