Ali-Reza Adl-Tabatabai

Atherton, California, United States

2K followers 500+ connections

Gitar

Carnegie Mellon University

About

Specialties: Programming languages, language runtimes and virtual machines, static and…

Activity

We’re excited to announce that we’ve raised a $14M Series A round. This funding marks a step forward in our mission to collapse the sprawl of the…

Liked by Ali-Reza Adl-Tabatabai
Proud to welcome Alexander Les to Gitar! I'm excited to see our engineering team grow...

Posted by Ali-Reza Adl-Tabatabai
📍 San Francisco, CA Guess each 🐶's name in the comments below #NationalTakeYourDogToWorkDay

Liked by Ali-Reza Adl-Tabatabai

Experience

Gitar

San Mateo, California, United States

Mountain View

Menlo Park

Education

Carnegie Mellon University

1992 - 1996

Thesis: "Source-Level Debugging of Globally Optimized Code"

Developed new algorithms for source-level debugging of optimized code. Led the development of cmcc, a retargetable optimizing C compiler (written in C++) with code generators for MIPS, SPARC, and iWarp (a VLIW processor co-developed by Intel and CMU). Developed cmcc's intermediate representation and object-oriented frameworks for optimization and retargetable code generation. Implemented all global optimizations, including partial redundancy elimination, (partial) dead code elimination, strength reduction, induction variable optimization, and other classical optimizations.

Worked on Omniware, a safe and language-independent virtual machine for mobile code. Implemented its MIPS code generator.

Managed a team of graduate students in the implementation of software for Navigator, a wearable computer with a head-mounted display, speech input, and GPS, dedicated to campus navigation.
1990 - 1992
1984 - 1990

Awarded the School of Engineering and Applied Science’s "Most Outstanding Bachelor of Science Award" for 1990.

Awarded an NSF undergraduate research grant for research on arithmetic algorithms for ASICs.

Publications

Keeping Master Green at Scale

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019 Mar 2019

Giant monolithic source-code repositories are one of the fundamental pillars of the back end infrastructure in large and fast-paced software companies. The sheer volume of everyday code changes demands a reliable and efficient change management system with three uncompromisable key requirements: always green master, high throughput, and low commit turnaround time. Green refers to a master branch that always successfully compiles and passes all build steps, the opposite being red. A broken master (red) leads to delayed feature rollouts because a faulty code commit needs to be detected and rolled back. Additionally, a red master has a cascading effect that hampers developer productivity: developers might face local test/build failures, or might end up working on a codebase that will eventually be rolled back.

This paper presents the design and implementation of SubmitQueue. It guarantees an always green master branch at scale: all build steps (e.g., compilation, unit tests, UI tests) successfully execute for every commit point. SubmitQueue has been in production for over a year, and can scale to thousands of daily commits to giant monolithic repositories.

Efficient Mapping of Irregular C++ Applications to Integrated GPUs

CGO 2014: IEEE/ACM International Symposium on Code Generation and Optimization February 17, 2014
CoreRacer: A Practical Memory Race Recorder for Multicore x86 TSO Processors

44th International Symposium on Microarchitecture (MICRO-44) Dec 2011
A study of transactional memory vs. locks in practice

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures Jun 2011

Transactional Memory (TM) promises to simplify parallel programming by replacing locks with atomic transactions. Despite much recent progress in TM research, there is very little experience using TM to develop realistic parallel programs from scratch. In this paper, we present the results of a detailed case study comparing teams of programmers developing a parallel program from scratch using transactional memory and locks. We analyze and quantify in a realistic environment the development time, programming progress, code metrics, programming patterns, and ease of code understanding for six teams who each wrote a parallel desktop search engine over a fifteen week period. Three randomly chosen teams used Intel's Software Transactional Memory compiler and Pthreads, while the other teams used just Pthreads. Our analysis is exploratory: Given the same requirements, how far did each team get? The TM teams were among the first to have a prototype parallel search engine. Compared to the locks teams, the TM teams spent less than half the time debugging segmentation faults, but had more problems tuning performance and implementing queries. Code inspections with industry experts revealed that TM code was easier to understand than locks code, because the locks teams used many locks (up to thousands) to improve performance. Learning from each team's individual success and failure story, this paper provides valuable lessons for improving TM.

Generic workers: towards unified distributed and parallel JavaScript programming model

PSI EtA '10: Programming Support Innovations for Emerging Distributed Applications Oct 2010

In this paper we introduce generic workers, a programming model for JavaScript unifying parallel and distributed computing paradigms, that allows the same application to run well on a variety of clients while utilizing the available resources in the best possible way. We describe the design and implementation of an infrastructure supporting our programming model and evaluate performance of selected applications running on devices with differing computational capabilities.

Architecting a chunk-based memory race recorder in Modern CMPs

2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) Dec 2009

Prior work on HW support for memory race recording piggybacks time stamps on coherence messages and logs the outcome of memory races using point-to-point or chunk-based approaches. These memory race recorder (MRR) techniques are effective, but they require modifications to the cache coherence protocol that can hurt performance. In addition, prior work has mostly focused on directory coherence and considered only CMP systems with single-level cache hierarchies. Most modern CMP systems shipped today, however, implement snoop coherence and feature multilevel cache hierarchies. To be practical, an MRR must target CMPs with multilevel caches, mitigate the coherence overhead due to piggybacking, and emphasize replay speed to broaden the applicability of deterministic replay. This paper contributes three new solutions for making chunk-based MRR practical for modern CMPs. We show that MRR interactions with a cache hierarchy can degrade performance and present a novel mechanism that mitigates this degradation. We propose new mechanisms for snoop-based caches that eliminate coherence traffic overhead due to piggybacking. We finally propose new techniques for improving replay speed and introduce a novel framework for evaluating the replay speed potential of MRR designs.

Architecting a Chunk-based Memory Race Recorder in Modern CMPs

42nd International Symposium on Microarchitecture (MICRO-42) Dec 2009
Optimizing transactions for captured memory

SPAA '09: Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures Aug 2009

In this paper, we identify transaction-local memory as a major source of overhead from compiler instrumentation in software transactional memory (STM). Transaction-local memory is memory allocated inside a transaction, which cannot escape (i.e., is captured by) the allocating transaction. Accesses to such memory do not require calls to STM memory access functions (also called STM barriers). A compiler unaware of that, however, may translate simple memory load/store operations accessing such memory into more expensive STM barriers. This presents opportunities to improve STM performance. Our measurements with the STAMP benchmark suite (version 0.9.9) revealed that as many as 60% of the STM barriers generated by our baseline compiler can be accesses to captured memory, which include 90% of the write barriers and 45% of the read barriers. We propose runtime and compiler optimizations to elide STM barriers to captured memory. Similar techniques can also be used to elide barriers for accesses to thread-local and read-only data. We implemented those optimizations in the Intel C++ STM compiler. Our experiments with the STAMP benchmark suite on an Intel Dunnington system (with 24 cores in a 4-node SMP system) showed that up to 18% performance improvement could be achieved at 16 threads.

Towards transactional memory semantics for C++

SPAA '09: Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures Aug 2009

Transactional memory (TM) eliminates many problems associated with lock-based synchronization. Over recent years, much progress has been made in software and hardware implementation techniques for TM. However, before transactional memory can be integrated into mainstream programming languages, we must precisely define its meaning in the context of these languages. In particular, TM semantics should address the advanced features present in the existing software TM implementations, such as interactions between transactions and locks, explicit user-level abort and support for legacy code.

In this paper, we address these topics from both theoretical and practical points of view. We give precise formulations of several popular TM semantics for the domain of sequentially consistent executions and show that some of these semantics are equivalent for C++ programs that do not contain other forms of synchronization. We show that lock-based semantics, such as Single Global Lock Atomicity (SLA) or Disjoint Lock Atomicity (DLA), do not actually guarantee atomicity for race-free programs and propose a new semantics, Race-Free Atomicity (RFA) that gives such a guarantee. We compare these semantics from the programmer and implementation points of view and explain why supporting non-atomic transactions is useful. Finally, we propose a new set of language constructs that let programmers explicitly specify whether transactions should be atomic and describe how these constructs interact with user-level abort and legacy code.

NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems

ECOOP 2009 – Object-Oriented Programming Jul 2009

Transactional memory (TM) promises to simplify construction of parallel applications by allowing programmers to reason about interactions between concurrently executing code fragments in terms of high-level properties they should possess. However, all currently existing TM systems deliver on this promise only partially by disallowing parallel execution of computations performed inside transactions. This paper fills in that gap by introducing NePaLTM (Nested PAralleLism for Transactional Memory), the first TM system supporting nested parallelism inside transactions. We describe a programming model where TM constructs (atomic blocks) are integrated with OpenMP constructs enabling nested parallelism. We also discuss the design and implementation of a working prototype where atomic blocks can be used for concurrency control at an arbitrary level of nested parallelism. Finally, we present a performance evaluation of our system by comparing transactions-based concurrency control mechanism for nested parallel computations with a mechanism already provided by OpenMP based on mutual exclusion.

An analytic model of optimistic Software Transactional Memory

2009 IEEE International Symposium on Performance Analysis of Systems and Software Apr 2009

An analytic model is proposed to assess the performance of optimistic software transactional memory (STM) systems with in-place memory updates for write operations. Based on an absorbing discrete-time Markov chain, closed-form analytic expressions are developed, which are quickly solved iteratively to determine key parameters of the STM system. The model covers complex implementation details such as read/write locking, data consistency checks and conflict management. It provides fundamental insight into the system behavior, when we vary input parameters like number and size of concurrent transactions or the number of the data objects. Numerical results are validated by comparison with a discrete-event simulation.

Exceptions and Transactions in C++

Proceedings of the First USENIX Workshop on Hot Topics in Parallelism Mar 2009

There has been significant discussion—and significant disagreement—on the issue of how exceptions should interact with atomic blocks implemented using transactions. We present a proposal that offers a significant contribution towards resolving this issue, at least for C++, and we raise remaining areas of disagreement for discussion at the workshop and in the community in general.

Design and implementation of transactional constructs for C/C++

OOPSLA 2008 Oct 2008
Dynamic optimization for efficient strong atomicity

Proceedings of the 23rd ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications Oct 2008

Transactional memory (TM) is a promising concurrency control alternative to locks. Recent work has highlighted important memory model issues regarding TM semantics and exposed problems in existing TM implementations. For safe, managed languages such as Java, there is a growing consensus towards strong atomicity semantics as a sound, scalable solution. Strong atomicity has presented a challenge to implement efficiently because it requires instrumentation of non-transactional memory accesses, incurring significant overhead even when a program makes minimal or no use of transactions. To minimize overhead, existing solutions require either a sophisticated type system, specialized hardware, or static whole-program analysis. These techniques do not translate easily into a production setting on existing hardware. In this paper, we present novel dynamic optimizations that significantly reduce strong atomicity overheads and make strong atomicity practical for dynamic language environments. We introduce analyses that optimistically track which non-transactional memory accesses can avoid strong atomicity instrumentation, and we describe a lightweight speculation and recovery mechanism that applies these analyses to generate speculatively-optimized but safe code for strong atomicity in a dynamically-loaded environment. We show how to implement these mechanisms efficiently by leveraging existing dynamic optimization infrastructure in a Java system. Measurements on a set of transactional and non-transactional Java workloads demonstrate that our techniques substantially reduce the overhead of strong atomicity from a factor of 5x down to 10% or less over an efficient weak atomicity baseline.

A Uniform Transactional Execution Environment for Java

ECOOP 2008 – Object-Oriented Programming Jul 2008

Transactional memory (TM) has recently emerged as an effective tool for extracting fine-grain parallelism from declarative critical sections. In order to make STM systems practical, significant effort has been made to integrate transactions into existing programming languages. Unfortunately, existing approaches fail to provide a simple implementation that permits lock-based and transaction-based abstractions to coexist seamlessly. Because of the fundamental semantic differences between locks and transactions, legacy applications or libraries written using locks can not be transparently used within atomic regions. To address these shortcomings, we implement a uniform transactional execution environment for Java programs in which transactions can be integrated with more traditional concurrency control constructs. Programmers can run arbitrary programs that utilize traditional mutual-exclusion-based programming techniques, execute new programs written with explicit transactional constructs, and freely combine abstractions that use both coding styles.

Irrevocable transactions and their applications

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures Jun 2008

Transactional memory (TM) provides a safer, more modular, and more scalable alternative to traditional lock-based synchronization. Implementing high performance TM systems has recently been an active area of research. However, current TM systems provide limited, if any, support for transactions executing irrevocable actions, such as I/O and system calls, whose side effects cannot in general be rolled back. This severely limits the ability of these systems to run commercial workloads.

This paper describes the design of a transactional memory system that allows irrevocable actions to be executed inside of transactions. While one transaction is executing an irrevocable action, other transactions can still execute and commit concurrently. We use a novel mechanism called single-owner read locks to implement irrevocable actions inside transactions that maximizes concurrency and avoids overhead when the mechanism is not used. We also show how irrevocable transactions can be leveraged for contention management to handle actions whose effects may be expensive to roll back. Finally, we present a thorough performance evaluation of the irrevocability mechanism for the different usage models.

Kicking the tires of software transactional memory: why the going gets tough

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures Jun 2008

Transactional Memory (TM) promises to simplify concurrent programming, which has been notoriously difficult but crucial in realizing the performance benefit of multi-core processors. Software Transactional Memory (STM), in particular, represents a body of important TM technologies since it provides a mechanism to run transactional programs when hardware TM support is not available, or when hardware TM resources are exhausted. Nonetheless, most previous research on STMs was constrained to executing trivial, small-scale workloads. The assumption was that the same techniques applied to small-scale workloads could readily be applied to real-life, large-scale workloads. However, by executing several nontrivial workloads such as particle dynamics simulation and game physics engine on a state of the art STM, we noticed that this assumption does not hold. Specifically, we identified four major performance bottlenecks that were unique to the case of executing large-scale workloads on an STM: false conflicts, over-instrumentation, privatization-safety cost, and poor amortization. We believe that these bottlenecks would be common for any STM targeting real-world applications. In this paper, we describe those identified bottlenecks in detail, and we propose novel solutions to alleviate the issues. We also thoroughly validate these approaches with experimental results on real machines.

Practical weak-atomicity semantics for Java STM

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures Jun 2008

As memory transactions have been proposed as a language-level replacement for locks, there is growing need for well-defined semantics. In contrast to database transactions, transactional memory (TM) semantics are complicated by the fact that programs may access the same memory locations both inside and outside transactions. Strongly atomic semantics, where non-transactional accesses are treated as implicit single-operation transactions, remain difficult to provide without specialized hardware support or significant performance overhead. As an alternative, many in the community have informally proposed that a single global lock semantics [18,10], where transaction semantics are mapped to those of regions protected by a single global lock, provide an intuitive and efficiently implementable model for programmers.

In this paper, we explore the implementation and performance implications of single global lock semantics in a weakly atomic STM from the perspective of Java, and we discuss why even recent STM implementations fall short of these semantics. We describe a new weakly atomic Java STM implementation that provides single global lock semantics while permitting concurrent execution, but we show that this comes at a significant performance cost. We also propose and implement various alternative semantics that loosen single lock requirements while still providing strong guarantees. We compare our new implementations to previous ones, including a strongly atomic STM.

Single global lock semantics in a weakly atomic STM

Workshop on transactional computing (Transact 2008) May 2008

As memory transactions have been proposed as a language-level replacement for locks, there is growing need for well-defined semantics. In contrast to database transactions, transactional memory (TM) semantics are complicated by the fact that programs may access the same memory locations both inside and outside transactions. Strongly atomic semantics, where non-transactional accesses are treated as implicit single-operation transactions, remain difficult to provide without specialized hardware support and/or significant performance overhead. As an alternative, many in the community have informally proposed that a single global lock semantics [16, 9], where transaction semantics are mapped to those of regions protected by a single global lock, provide an intuitive and efficiently implementable model for programmers.

In this paper, we explore the implementation and performance implications of single global lock semantics in a weakly atomic STM from the perspective of Java, and we discuss why even recent STM implementations fall short of these semantics. We describe a new weakly atomic Java STM implementation that provides single global lock semantics while permitting concurrent execution, but we show that this comes at a significant performance cost. We also propose and implement various alternative semantics that loosen single lock requirements while still providing strong guarantees. We compare our new implementations to previous ones, including a strongly atomic STM.

Fault-safe code motion for type-safe languages

CGO '08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization Apr 2008

Compilers for Java and other type-safe languages have historically worked to overcome overheads and constraints imposed by runtime safety checks and precise exception semantics. We instead exploit these safety properties to perform code motion optimizations that are even more aggressive than those possible in unsafe languages such as C++.

We present a novel framework for speculative motion of dangerous (potentially faulting) instructions in safe, object-oriented languages such as Java and C#. Unlike earlier work, our approach requires no hardware or operating system support. We leverage the properties already provided by a safe language to define fault safety, a more precise notion of safety that guarantees that a dangerous operation (e.g., a memory load) will not fault at a given program point.

We illustrate how typical code motion optimizations are easily adapted to exploit our safety framework. First, we modify the standard SSAPRE partial redundancy elimination (PRE) algorithm to use fault safety, rather than the traditional down safety property. Our modified algorithm better exploits profile information by inserting dangerous instructions on new paths when it is profitable and provably safe. Second, we extend an instruction trace scheduler to use fault safety to safely schedule load instructions across branches to better tolerate memory latency and to more compactly target instruction slots.

We implemented these optimizations in StarJIT, a dynamic compiler, and show performance benefits of up to 10% on a set of standard Java benchmarks.

Concurrent GC Leveraging Transactional Memory

PPoPP '08 Feb 2008

We predict that the ever-growing number of cores on our desktops will require a re-examination of concurrent programming. Two technologies are likely to become mainstream in response: Transactional memory provides a superior programming model to traditional lock-based concurrency, while Concurrent GC can take advantage of multiple cores to eliminate perceptible pauses in desktop applications such as games or Internet telephony. This paper proposes a combination of the two technologies, producing a synergy that improves scalability while eliminating the annoyance of user-perceivable pauses.

Specifically, we show how concurrent GC can share some of the mechanisms required for transactional memory. Thus as transactional memory becomes more efficient, so too will concurrent GC. We demonstrate how, using a state-of-the-art software transactional memory system, we can build a state-of-the-art concurrent collector. Our goal was to reduce 90% of pause times to under one millisecond. Of the remainder, we aim for 90% to be under 10ms, and 90% of those left to be under 100ms. Our performance results show that we were able to achieve these targets, with pause times between one and two orders of magnitude lower than mainstream technologies.

Compression in Cache Design

ICS '07: Proceedings of the 21st annual international conference on Supercomputing Jun 2007

Increasing cache capacity via compression enables designers to improve performance of existing designs for small incremental cost, further leveraging the large die area invested in last level caches. This paper explores the compressed cache design space with focus on implementation feasibility.

Our compression schemes use companion line pairs -- cache lines whose addresses differ by a single bit -- as candidates for compression. We propose two novel compressed cache organizations: the companion bit remapped cache and the pseudo-associative cache. Our cache organizations use a fixed-width physical cache line implementation while providing a variable-length logical cache line organization, without changing the number of sets or ways and with minimal increase in state per tag. We evaluate banked and pairwise schemes as two alternatives for storing compressed companion pairs within a physical cache line. We evaluate companion line prefetching (CLP), a simple yet effective prefetching mechanism that works in conjunction with our compression scheme. CLP is nearly pollution free since it only prefetches lines that are compression candidates.

Using a detailed cycle accurate IA-32 simulator, we measure the performance of several third-level compressed cache designs simulating a representative collection of workloads. Our experiments show that our cache compression designs improve IPC for all cache-sensitive workloads, even those with modest data compressibility. The pairwise pseudo-associative compressed cache organization with companion line prefetching is the best configuration, providing a mean IPC improvement of 19% for cache-sensitive workloads, and a best-case IPC improvement of 84%. Finally, our cache designs exhibit negligible overall IPC degradation for cache-insensitive workloads.

Enforcing isolation and ordering in STM

PLDI '07: Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation Jun 2007

Transactional memory provides a new concurrency control mechanism that avoids many of the pitfalls of lock-based synchronization. High-performance software transactional memory (STM) implementations thus far provide weak atomicity: Accessing shared data both inside and outside a transaction can result in unexpected, implementation-dependent behavior. To guarantee isolation and consistent ordering in such a system, programmers are expected to enclose all shared-memory accesses inside transactions.

A system that provides strong atomicity guarantees isolation even in the presence of threads that access shared data outside transactions. A strongly-atomic system also orders transactions with conflicting non-transactional memory operations in a consistent manner.

In this paper, we discuss some surprising pitfalls of weak atomicity, and we present an STM system that avoids these problems via strong atomicity. We demonstrate how to implement non-transactional data accesses via efficient read and write barriers, and we present compiler optimizations that further reduce the overheads of these barriers. We introduce a dynamic escape analysis that differentiates private and public data at runtime to make barriers cheaper and a static not-accessed-in-transaction analysis that removes many barriers completely. Our results on a set of Java programs show that strong atomicity can be implemented efficiently in a high-performance STM system.

Code Generation and Optimization for Transactional Memory Constructs in an Unmanaged Language

International Symposium on Code Generation and Optimization (CGO'07) Mar 2007

Transactional memory offers significant advantages for concurrency control compared to locks. This paper presents the design and implementation of transactional memory constructs in an unmanaged language. Unmanaged languages pose a unique set of challenges to transactional memory constructs - for example, lack of type and memory safety, use of function pointers, aliasing of local variables, and others. This paper describes novel compiler and runtime mechanisms that address these challenges and optimize the performance of transactions in an unmanaged environment. We have implemented these mechanisms in a production-quality C compiler and a high-performance software transactional memory runtime. We measure the effectiveness of these optimizations and compare the performance of lock-based versus transaction-based programming on a set of concurrent data structures and the SPLASH-2 benchmark suite. On a 16-processor SMP system, the transaction-based version of the SPLASH-2 benchmarks scales much better than the coarse-grain locking version and performs comparably to the fine-grain locking version. Compiler optimizations significantly reduce the overheads of transactional memory so that, on a single thread, the transaction-based version incurs only about 6.4% overhead compared to the lock-based version for the SPLASH-2 benchmark suite. Thus, our system is the first to demonstrate that transactions integrate well with an unmanaged language, and can perform as well as fine-grain locking while providing the programming ease of coarse-grain locking even in an unmanaged environment.

Enabling scalability and performance in a large scale CMP environment

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 Mar 2007

Hardware trends suggest that large-scale CMP architectures, with tens to hundreds of processing cores on a single piece of silicon, are imminent within the next decade. While existing CMP machines have traditionally been handled in the same way as SMPs, this magnitude of parallelism introduces several fundamental challenges at the architectural level and this, in turn, translates to novel challenges in the design of the software stack for these platforms. This paper presents the "Many Core Run Time" (McRT), a software prototype of an integrated language runtime that was designed to explore configurations of the software stack for enabling performance and scalability on large scale CMP platforms. This paper presents the architecture of McRT and discusses our experiences with the system, including experimental evaluation that led to several interesting, non-intuitive findings, providing key insights about the structure of the system stack at this scale. A key contribution of this paper is to demonstrate how McRT enables near linear improvements in performance and scalability for desktop workloads such as the popular XviD encoder and a set of RMS (recognition, mining, and synthesis) applications. Another key contribution of this work is its use of McRT to explore non-traditional system configurations such as a light-weight executive in which McRT runs on "bare metal" and replaces the traditional OS. Such configurations are becoming an increasingly attractive alternative to leverage heterogeneous computing units as seen in today's CPU-GPU configurations.

Open nesting in software transactional memory

PPoPP '07: Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming Mar 2007

Transactional memory (TM) promises to simplify concurrent programming while providing scalability competitive to fine-grained locking. Language-based constructs allow programmers to denote atomic regions declaratively and to rely on the underlying system to provide transactional guarantees along with concurrency. In contrast with fine-grained locking, TM allows programmers to write simpler programs that are composable and deadlock-free.

TM implementations operate by tracking loads and stores to memory and by detecting concurrent conflicting accesses by different transactions. By automating this process, they greatly reduce the programmer's burden, but they also are forced to be conservative. In certain cases, conflicting memory accesses may not actually violate the higher-level semantics of a program, and a programmer may wish to allow seemingly conflicting transactions to execute concurrently.

Open nested transactions enable expert programmers to differentiate between physical conflicts, at the level of memory, and logical conflicts that actually violate application semantics. A TM system with open nesting can permit physical conflicts that are not logical conflicts, and thus increase concurrency among application threads.

Here we present an implementation of open nested transactions in a Java-based software transactional memory (STM) system. We describe new language constructs to support open nesting in Java, and we discuss new abstract locking mechanisms that a programmer can use to prevent logical conflicts. We demonstrate how these constructs can be mapped efficiently to existing STM data structures. Finally, we evaluate our system on a set of Java applications and data structures, demonstrating how open nesting can enhance application scalability.

Architectural Support for Software Transactional Memory

2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06) Dec 2006

Transactional memory provides a concurrency control mechanism that avoids many of the pitfalls of lock-based synchronization. Researchers have proposed several different implementations of transactional memory, broadly classified into software transactional memory (STM) and hardware transactional memory (HTM). Both approaches have their pros and cons: STMs provide rich and flexible transactional semantics on stock processors but incur significant overheads. HTMs, on the other hand, provide high performance but implement restricted semantics or add significant hardware complexity. This paper is the first to propose architectural support for accelerating transactions executed entirely in software. We propose instruction set architecture (ISA) extensions and novel hardware mechanisms that improve STM performance. We adapt a high-performance STM algorithm supporting rich transactional semantics to our ISA extensions (called hardware accelerated software transactional memory or HASTM). HASTM accelerates fully virtualized nested transactions, supports language integration, and provides both object-based and cache-line based conflict detection. We have implemented HASTM in an accurate multi-core IA32 simulator. Our simulation results show that (1) HASTM single-thread performance is comparable to a conventional HTM implementation; (2) HASTM scaling is comparable to an STM implementation; and (3) HASTM is resilient to spurious aborts and can scale better than HTM in a multi-core setting. Thus, HASTM provides the flexibility and rich semantics of STM, while giving the performance of HTM.

Unlocking Concurrency

ACM Queue Dec 2006

Multicore architectures are an inflection point in mainstream software development because they force developers to write parallel programs. In a previous article in Queue, Herb Sutter and James Larus pointed out, “The concurrency revolution is primarily a software revolution. The difficult problem is not building multicore hardware, but programming it in a way that lets mainstream applications benefit from the continued exponential growth in CPU performance.” [1] In this new multicore world, developers must write explicitly parallel applications that can take advantage of the increasing number of cores that each successive multicore generation will provide.

Compiler and runtime support for efficient software transactional memory

PLDI '06: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation Jun 2006

Programmers have traditionally used locks to synchronize concurrent access to shared data. Lock-based synchronization, however, has well-known pitfalls: using locks for fine-grain synchronization and composing code that already uses locks are both difficult and prone to deadlock. Transactional memory provides an alternate concurrency control mechanism that avoids these pitfalls and significantly eases concurrent programming. Transactional memory language constructs have recently been proposed as extensions to existing languages or included in new concurrent language specifications, opening the door for new compiler optimizations that target the overheads of transactional memory.

This paper presents compiler and runtime optimizations for transactional memory language constructs. We present a high-performance software transactional memory system (STM) integrated into a managed runtime environment. Our system efficiently implements nested transactions that support both composition of transactions and partial roll back. Our JIT compiler is the first to optimize the overheads of STM, and we show novel techniques for enabling JIT optimizations on STM operations. We measure the performance of our optimizations on a 16-way SMP running multi-threaded transactional workloads. Our results show that these techniques enable transactional memory's performance to compete with that of well-tuned synchronization.

McRT-Malloc: a scalable transactional memory allocator

ISMM '06: Proceedings of the 5th international symposium on Memory management Jun 2006

Emerging multi-core processors promise to provide an exponentially increasing number of hardware threads with every generation. Applications will need to be highly concurrent to fully use the power of these processors. To enable maximum concurrency, libraries (such as malloc-free packages) would therefore need to use non-blocking algorithms. But lock-free algorithms are notoriously difficult to reason about and inappropriate for average programmers. Transactional memory promises to significantly ease concurrent programming for the average programmer. This paper describes a highly efficient non-blocking malloc/free algorithm that supports memory allocation and deallocation inside transactional code blocks. Thus this paper describes a memory allocator that is suitable for emerging multi-core applications, while supporting modern concurrency constructs.

This paper makes several novel contributions. It is the first to integrate a software transactional memory system with a malloc/free based memory allocator. We present the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup. Unlike previous lock-free malloc packages, our algorithm avoids atomic operations on typical code paths, making our algorithm substantially more efficient.

McRT-STM: A High-Performance Software Transactional Memory System for a Multi-core Runtime

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming Mar 2006

Applications need to become more concurrent to take advantage of the increased computational power provided by chip level multiprocessing. Programmers have traditionally managed this concurrency using locks (mutex based synchronization). Unfortunately, lock based synchronization often leads to deadlocks, makes fine-grained synchronization difficult, hinders composition of atomic primitives, and provides no support for error recovery. Transactions avoid many of these problems, and therefore promise to ease concurrent programming.

We describe a software transactional memory (STM) system that is part of McRT, an experimental Multi-Core RunTime. The McRT-STM implementation uses a number of novel algorithms, and supports advanced features such as nested transactions with partial aborts, conditional signaling within a transaction, and object based conflict detection for C/C++ applications. The McRT-STM exports interfaces that can be used from C/C++ programs directly or as a target for compilers translating higher level linguistic constructs.

We present a detailed performance analysis of various STM design tradeoffs such as pessimistic versus optimistic concurrency, undo logging versus write buffering, and cache line based versus object based conflict detection. We also show an MCAS implementation that works on arbitrary values, coexists with the STM, and can be used as a more efficient form of transactional memory. To provide a baseline we compare the performance of the STM with that of fine-grained and coarse-grained locking using a number of concurrent data structures on a 16-processor SMP system. We also show our STM performance on a non-synthetic workload -- the Linux sendmail application.

A verifiable SSA program representation for aggressive compiler optimization

POPL '06: Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages Jan 2006

We present a verifiable low-level program representation to embed, propagate, and preserve safety information in high-performance compilers for safe languages such as Java and C#. Our representation precisely encodes safety information via static single-assignment (SSA) [11, 3] proof variables that are first-class constructs in the program.

We argue that our representation allows a compiler to both (1) express aggressively optimized machine-independent code and (2) leverage existing compiler infrastructure to preserve safety information during optimization. We demonstrate that this approach supports standard compiler optimizations, requires minimal changes to the implementation of those optimizations, and does not artificially impede those optimizations to preserve safety. We also describe a simple type system that formalizes type safety in an SSA-style control-flow graph program representation. Through the types of proof variables, our system enables compositional verification of memory safety in optimized code. Finally, we discuss experiences integrating this representation into the machine-independent global optimizer of STARJIT, a high-performance just-in-time compiler that performs aggressive control-flow, data-flow, and algebraic optimizations and is competitive with top production systems.

Prefetch injection based on hardware monitoring and object metadata

PLDI '04 Jun 2004

Cache miss stalls hurt performance because of the large gap between memory and processor speeds - for example, the popular server benchmark SPEC JBB2000 spends 45% of its cycles stalled waiting for memory requests on the Itanium® 2 processor. Traversing linked data structures causes a large portion of these stalls. Prefetching for linked data structures remains a major challenge because serial data dependencies between elements in a linked data structure preclude the timely materialization of prefetch addresses. This paper presents Mississippi Delta (MS Delta), a novel technique for prefetching linked data structures that closely integrates the hardware performance monitor (HPM), the garbage collector's global view of heap and object layout, the type-level metadata inherent in type-safe programs, and JIT compiler analysis. The garbage collector uses the HPM's data cache miss information to identify cache miss intensive traversal paths through linked data structures, and then discovers regular distances (deltas) between these linked objects. JIT compiler analysis injects prefetch instructions using deltas to materialize prefetch addresses.

We have implemented MS Delta in a fully dynamic profile-guided optimization system: the StarJIT dynamic compiler [1] and the ORP Java virtual machine [9]. We demonstrate a 28-29% reduction in stall cycles attributable to the high-latency cache misses targeted by MS Delta and a speedup of 11-14% on the cache miss intensive SPEC JBB2000 benchmark.

Improving 64-Bit Java IPF Performance by Compressing Heap References

CGO 2004: International Symposium on Code Generation and Optimization Mar 2004
The StarJIT Compiler: A Dynamic Compiler for Managed Runtime Environments

Intel Technology Journal, Vol. 7, Issue 1, 2003 Feb 2003

Dynamic compilers (or Just-in-Time [JIT] compilers) are a key component of managed runtime environments. This paper describes the design and implementation of the StarJIT compiler, a dynamic compiler for Java Virtual Machines and Common Language Runtime platforms. The goal of the StarJIT compiler is to build an infrastructure to research the influence of managed runtime environments on Intel architectures. The StarJIT compiler can compile both Java and Common Language Infrastructure (CLI) bytecodes, and it uses a single intermediate representation and global optimization framework for both Java and CLI. The StarJIT compiler is designed to generate optimized code for the major Intel architectures and currently targets two Intel architectures: IA-32 and the Itanium Processor Family.

Just-in-time Java compilation for the Itanium processor

Proceedings. International Conference on Parallel Architectures and Compilation Techniques Sep 2002

This paper describes a just-in-time (JIT) Java compiler for the Intel® Itanium® processor. The Itanium processor is an example of an Explicitly Parallel Instruction Computing (EPIC) architecture and thus relies on aggressive and expensive compiler optimizations for performance. Static compilers for Itanium use aggressive global scheduling algorithms to extract instruction-level parallelism. In a JIT compiler, however, the additional overhead of such expensive optimizations may offset any gains from the improved code. In this paper, we describe lightweight code generation techniques for generating efficient Itanium code. Our compiler relies on two basic methods to generate efficient code. First, the compiler uses inexpensive scheduling heuristics to model the Itanium microarchitecture. Second, the compiler uses the semantics of the Java virtual machine to extract instruction-level parallelism.

Fusion-based register allocation

ACM Transactions on Programming Languages and Systems May 2000

The register allocation phase of a compiler maps live ranges of a program to registers. If there are more candidates than there are physical registers, the register allocator must spill a live range (the home location is in memory) or split a live range (the live range occupies multiple locations). One of the challenges for a register allocator is to deal with spilling and splitting together. Fusion-based register allocation uses the structure of the program to make splitting and spilling decisions, with the goal of moving overhead operations to infrequently executed parts of a program. The basic idea of fusion-based register allocation is to build up the interference graph. Starting with some base region (e.g., a basic block, a loop), the register allocator adds basic blocks to the region and incrementally builds the interference graph. When there are more live ranges than registers, the register allocator selects live ranges to split; these live ranges are split along the edge that was most recently added to the region. This article describes fusion-based register allocation in detail and compares it with other approaches to register allocation. For programs from the SPEC92 suite, fusion-based register allocation can improve the execution time (of optimized programs, for the MIPS architecture) by up to 8.4% over Chaitin-style register allocation.

Fast, effective code generation in a just-in-time Java compiler

PLDI '98: Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation May 1998

A "Just-In-Time" (JIT) Java compiler produces native code from Java byte code instructions during program execution. As such, compilation speed is more important in a Java JIT compiler than in a traditional compiler, requiring optimization algorithms to be lightweight and effective. We present the structure of a Java JIT compiler for the Intel Architecture, describe the lightweight implementation of JIT compiler optimizations (e.g., common subexpression elimination, register allocation, and elimination of array bounds checking), and evaluate the performance benefits and tradeoffs of the optimizations. This JIT compiler has been shipped with version 2.5 of Intel's VTune for Java product.

Code reuse in an optimizing compiler

OOPSLA '96: Proceedings of the 11th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications Oct 1996

This paper describes how the cmcc compiler reuses code, both internally (reuse between different modules) and externally (reuse between versions for different target machines). The key to reuse is the set of application frameworks developed for global data-flow analysis, code generation, instruction scheduling, and register allocation. The code produced by cmcc is as good as the code produced by the native compilers for the MIPS and SPARC, although significantly fewer resources have been spent on cmcc (overall, about 6 man years by 2.5 persons). cmcc is implemented in C++, which allowed for a compact expression of the frameworks as class hierarchies. The results support the claim that suitable frameworks facilitate reuse and thereby significantly improve developer effectiveness.

Efficient and language-independent mobile programs

PLDI '96: Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation May 1996

This paper evaluates the design and implementation of Omniware: a safe, efficient, and language-independent system for executing mobile program modules. Previous approaches to implementing mobile code rely on either language semantics or abstract machine interpretation to enforce safety. In the former case, the mobile code system sacrifices universality to gain safety by dictating a particular source language or type system. In the latter case, the mobile code system sacrifices performance to gain safety through abstract machine interpretation. Omniware uses software fault isolation, a technology developed to provide safe extension code for databases and operating systems, to achieve a unique combination of language-independence and excellent performance. Software fault isolation uses only the semantics of the underlying processor to determine whether a mobile code module can corrupt its execution environment. This separation of programming language implementation from program module safety enables our mobile code system to use a radically simplified virtual machine as its basis for portability. We measured the performance of Omniware using a suite of four SPEC92 programs on the Pentium, PowerPC, MIPS, and SPARC processor architectures. Including the overhead for enforcing safety on all four processors, OmniVM executed the benchmark programs within 21% of the speed of the optimized, unsafe code produced by the vendor-supplied compiler.

Source-level debugging of scalar optimized code

PLDI '96: Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation May 1996

Although compiler optimizations play a crucial role in the performance of modern computer systems, debugger technology has lagged behind in its support of optimization. Yet debugging the unoptimized translation is often impossible or futile, so handling of code optimizations in the debugger is necessary. But compiler optimizations make it difficult to provide source-level debugger functionality: global optimizations can cause the runtime value of a variable to be inconsistent with the source-level value expected at a breakpoint; such variables are called endangered variables. A debugger must detect and warn the user of endangered variables; otherwise the user may draw incorrect conclusions about the program. This paper presents a new algorithm for detecting variables that are endangered due to global scalar optimization. Our approach provides more precise classifications of variables and is still simpler than past approaches. We have implemented and evaluated our techniques in the context of the cmcc optimizing C compiler. We describe the compiler extensions necessary to perform the required bookkeeping of compiler optimization. We present measurements of the effect of optimizations on a debugger's ability to present the expected values of variables to the user.

Detection and recovery of endangered variables caused by instruction scheduling

PLDI '93: SIGPLAN Symp. on Programming Language Design and Implementation Jun 1993

Instruction scheduling re-orders and interleaves instruction sequences from different source statements. This impacts the task of a symbolic debugger, which attempts to present the user a picture of program execution that matches the source program. At a breakpoint B, if the value in the run-time location of a variable V may not correspond to the value the user expects V to have, then this variable is endangered at B. This paper describes an approach to detecting and recovering endangered variables caused by instruction scheduling. We measure the effects of instruction scheduling on a symbolic debugger's ability to recover source values at a breakpoint. This paper reports measurements for three C programs from the SPEC suite and a collection of programs from the Numerical Recipes, which have been compiled with a variant of a commercial C compiler.

Evicted variables and the interaction of global register allocation and symbolic debugging

POPL '93: Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages Mar 1993

A symbolic debugger allows a user to display the values of program variables at a breakpoint. However, problems arise if the program is translated by an optimizing compiler. This paper addresses the effects of global register allocation and assignment: a register assigned to a variable V may not be holding V's value at a breakpoint since the register can also be assigned to other variables. We define the problem of determining whether a variable is in its assigned register as the residence problem. Prior work on debugging of optimized code has focused on the currency problem: detecting whether a variable's run-time value is the expected value. Determining residence is a more serious problem than currency detection. We present a data flow algorithm that accurately computes a variable's residency by determining when a variable becomes evicted from its register. We measure the effectiveness of different approaches to determine variable residence for three C programs from the SPEC suite.
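
To make the residence problem concrete, here is a minimal sketch of a forward data-flow pass, not the algorithm from the paper. It assumes a hypothetical control-flow graph in which each block is a list of (register, variable) assignments, and it treats a variable as evicted at a join point whenever the predecessors disagree about which variable a register holds. The names `residence_at_entry`, `blocks`, and `preds` are illustrative only.

```python
def residence_at_entry(blocks, preds, entry):
    """Return, per block, a map register -> variable resident at block entry.

    blocks: block name -> ordered list of (register, variable) assignments.
    preds:  block name -> list of predecessor block names.
    A register missing from a block's map means its variable may be evicted.
    """
    out = {}            # block -> register map after the block executes
    inn = {entry: {}}   # block -> register map on entry to the block
    changed = True
    while changed:
        changed = False
        for name in blocks:
            if name == entry:
                state = {}
            else:
                known = [out[p] for p in preds.get(name, []) if p in out]
                if not known:
                    continue  # no predecessor analysed yet
                # Meet: keep a register only if every predecessor agrees.
                state = {r: v for r, v in known[0].items()
                         if all(o.get(r) == v for o in known[1:])}
            new_out = dict(state)
            for register, variable in blocks[name]:
                new_out[register] = variable  # assignment evicts the old occupant
            if inn.get(name) != state or out.get(name) != new_out:
                inn[name], out[name] = state, new_out
                changed = True
    return inn


# Hypothetical diamond-shaped CFG: r1 holds V, but the left branch reuses r1 for W.
blocks = {"entry": [("r1", "V")], "left": [("r1", "W")], "right": [], "join": []}
preds = {"left": ["entry"], "right": ["entry"], "join": ["left", "right"]}
resident = residence_at_entry(blocks, preds, "entry")
```

In this toy graph a debugger stopped in `right` could read V from r1, while at `join` it could not, because one incoming path has reassigned the register.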

Patents

Optimizing quiescence in a software transactional memory (STM) system

Issued February 19, 2019 US US10210018

A method and apparatus for optimizing quiescence in a transactional memory system is herein described. Non-ordering transactions, such as read-only transactions, transactions that do not access non-transactional data, and write-buffering hardware transactions, are identified. Quiescence in weak atomicity software transactional memory (STM) systems is optimized through selective application of quiescence. As a result, transactions may be decoupled from dependency on quiescing/waiting on previous non-ordering transactions to increase parallelization and reduce inefficiency based on serialization of transactions.

Private memory regions and coherency optimization by controlling snoop traffic volume in multi-level cache hierarchy

Issued September 19, 2017 US US9767027

A system for optimizing cache coherence message traffic volume is disclosed. The system includes a plurality of caches in a multi-level memory hierarchy and a plurality of agents. Each agent is associated with a cache. The system includes one or more monitoring engines. Each agent in the plurality of agents is associated with a monitoring engine. The agents can execute a processor level software instruction causing a memory region to be private to the agent. Each of the agents is configured to execute a memory access for data on an associated cache and to send a request for data up the hierarchy on a cache miss. The monitoring engine is configured to intercept a request for data from an agent and to prevent snooping for the cache line in peer caches when the cache line is associated with a memory region represented as private to the agent.

Tracing mechanism for recording shared memory interleavings on multi-core processors

Issued January 31, 2017 US US9558118

A memory race recorder (MRR) is provided. The MRR includes a multi-core processor having a relaxed memory consistency model, an extension to the multi-core processor, the extension to store chunks, each chunk having a chunk size (CS) and an instruction count (IC), and a plurality of cores to execute instructions. The plurality of cores executes load/store instructions to/from a store buffer (STB) and a simulated memory to store the value when the value is not in the STB. The oldest value in the STB is transferred to the simulated memory when the IC is equal to zero and the CS is greater than zero. The MRR logs a trace entry comprising the CS, the IC, and a global timestamp, the global timestamp providing a total order across all logged chunks.

Enlarging control regions to optimize script code compilation

Issued January 24, 2017 US US9552195

Disclosed here are methods, systems, paradigms and structures for incrementally compiling scripts at runtime to generate executable code. The incremental compilation generates executable code corresponding to basic blocks of a script in various phases and at various scopes. In a first phase, an executable code for a basic block of the script is generated for a set of types of variables of the basic block. The generated executable block is stored and executed for subsequent requests. In a second phase, a set of executable blocks whose profiling information, such as frequency of (a) execution, (b) transition between two executable blocks, or (c) execution of a particular path, satisfies an optimization criterion is identified. The identified set of executable blocks are combined to generate an executable control region, which is more optimal than the executable blocks generated in the first phase. The executable control region is executed for subsequent requests.

Optimizing intermediate representation of script code by eliminating redundant reference count operations

Issued July 5, 2016 US US9383979

Disclosed here are methods, systems, paradigms and structures for optimizing generation of intermediate representation (IR) for a script code by eliminating redundant object reference count operations from the IR. An IR of the script includes (a) a set of first code that increments a reference count of an object when a programming construct refers to the object, and (b) an associated set of second code which decrements the reference count of the object when a reference to the object is removed. The IR is analyzed to identify a subset of the set of second code which, upon execution, does not decrement the reference count of the object to a zero value. The subset of second code and the first code corresponding to the subset are removed from the IR to generate an optimized IR. The optimized IR is further converted to an executable code.
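
The idea can be illustrated with a small sketch over a toy IR. This is an assumption-laden simplification rather than the patented method: the IR is a straight-line list of hypothetical `("incref", obj)`, `("decref", obj)`, and `("use", obj)` tuples, each decrement is paired with the most recent unmatched increment of the same object, and `known_counts` is an assumed lower bound on each object's reference count on entry.

```python
def eliminate_redundant_refcounts(ir, known_counts):
    """Drop incref/decref pairs whose decref cannot bring the count to zero.

    ir:           straight-line list of (op, obj) tuples (toy IR, hypothetical).
    known_counts: obj -> reference count guaranteed on entry (default 0).
    """
    counts = dict(known_counts)
    open_increfs = {}   # obj -> stack of indices of unmatched increfs
    dead = set()        # indices of instructions proven redundant
    for index, (op, obj) in enumerate(ir):
        if op == "incref":
            counts[obj] = counts.get(obj, 0) + 1
            open_increfs.setdefault(obj, []).append(index)
        elif op == "decref":
            counts[obj] = counts.get(obj, 0) - 1
            stack = open_increfs.get(obj)
            if stack:
                incref_index = stack.pop()
                # Another reference keeps the object alive, so this pair
                # never frees it and can be removed together.
                if counts[obj] >= 1:
                    dead.update((incref_index, index))
    return [instr for index, instr in enumerate(ir) if index not in dead]


ir = [
    ("incref", "x"), ("use", "x"), ("decref", "x"),        # caller already holds x
    ("incref", "tmp"), ("use", "tmp"), ("decref", "tmp"),  # tmp is freed here
]
optimized = eliminate_redundant_refcounts(ir, {"x": 1})
```

The pair on `x` disappears because the caller's reference keeps the count above zero, while the pair on `tmp` stays because its decrement is the one that frees the object.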

Hybrid linear validation algorithm for software transactional memory (STM) systems

Issued May 10, 2016 US US9336066

A method and apparatus for hybrid validation for a Software Transactional Memory (STM) is herein described. During execution of a transaction, when acquiring ownership of meta-data associated with a data element, the meta-data is updated with an ownership reference to a transaction to enable efficient subsequent ownership tests. However, during validation, for some conditions, meta-data is updated from the ownership reference to a write entry reference to enable efficient validation.

Optimizing intermediate representation of script code for atomic execution

Issued April 19, 2016 US US9317265

Disclosed here are methods, systems, paradigms and structures for optimizing intermediate representation (IR) of a script code for atomic execution. Atomic execution of the script is achieved by generating portions of the IR as an atomic transaction. In an atomic transaction, a series of operations either all execute, or none executes. The IR includes checkpoints that evaluate to one of two possible values. The checkpoint evaluates to a first value when there is no error during execution, and evaluates to a second value when an error occurs. The IR is optimized for atomic execution by regenerating a portion of the IR including the checkpoint and code associated with the checkpoint as a transaction. When an error occurs during the execution of the transaction, the transaction is aborted and a state of execution of the script code is reverted to a state prior to the beginning of the transaction.

Optimizing intermediate representation of script code for fast path execution

Issued March 29, 2016 US US9298433

Disclosed here are methods, systems, paradigms and structures for optimizing intermediate representation (IR) of a script code for fast path execution. A fast path is typically a path that handles most commonly occurring tasks more efficiently than less commonly occurring ones which are handled by slow paths. The less commonly occurring tasks may include uncommon cases, error handling, and other anomalies. The IR includes checkpoints which evaluate to two possible values resulting in either a fast path or slow path execution. The IR is optimized for fast path execution by regenerating a checkpoint as a labeled checkpoint. The code in the portion of the IR following the checkpoint is optimized assuming the checkpoint evaluates to a value resulting in fast path. The code for handling situations where the checkpoint evaluates to a value resulting in slow path is transferred to a portion of the IR identified by the label.

Using buffered stores or monitoring to filter redundant transactional accesses and mechanisms for mapping data to buffered metadata

Issued March 8, 2016 US US9280397

A method and apparatus for accelerating a Software Transactional Memory (STM) system is herein described. A data object and metadata for the data object may each be associated with a filter, such as a hardware monitor or ephemerally held filter information. The filter is in a first, default state when no access, such as a read, from the data object has occurred during a pendency of a transaction. Upon encountering a first access to the metadata, such as a first read, access barrier operations, such as logging of the metadata; setting a read monitor; or updating ephemeral filter information with an ephemeral/buffered store operation, are performed. Upon a subsequent/redundant access to the metadata, such as a second read, access barrier operations are elided to accelerate the subsequent access based on the filter being set to the second state to indicate a previous access occurred. Additionally, mapping of data objects to ephemeral information may be provided by software, such as through a pointer to the ephemeral information associated with the data object; an offset from a base address of the data object to the ephemeral information associated with the data object; an index into a segment containing the ephemeral information associated with the data object; mapping the data object to the ephemeral information utilizing address arithmetic; and a hash that maps the data object to ephemeral information.
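
The filtering idea can be sketched in software alone. The following is a toy model under stated assumptions, not the patented mechanism: a Python set stands in for the hardware monitor or ephemeral filter state, and the classes `Transaction` and `Cell` and their fields are hypothetical.

```python
class Cell:
    """A shared data object with a version field standing in for STM metadata."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Transaction:
    """Toy transaction that runs the read barrier only on the first access."""

    def __init__(self):
        self.read_log = []     # metadata recorded by the full read barrier
        self.filter = set()    # objects whose filter is in the "accessed" state
        self.barriers_run = 0  # number of full barriers actually executed

    def read(self, obj):
        key = id(obj)
        if key not in self.filter:
            # First access: perform the barrier work, then set the filter.
            self.read_log.append((key, obj.version))
            self.filter.add(key)
            self.barriers_run += 1
        # Redundant accesses fall through and skip the barrier.
        return obj.value


tx = Transaction()
cell = Cell(41)
values = [tx.read(cell) for _ in range(3)]
```

Three reads of the same object log its metadata once; the second and third reads only pay for the filter lookup.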

Optimization for safe elimination of weak atomicity overhead

Issued January 3, 2016 US US9274855

A method and apparatus for optimizing weak atomicity overhead is herein described. A state table is maintained either during static or dynamic compilation of code to track data non-transactionally accessed. Within execution of a transaction, such as at transactional memory accesses or within a commit function, it is determined if data associated with memory access within the transaction is to be conflictingly accessed outside the transaction from the state table. If the data is not accessed outside the transaction, then the transaction potentially commits without weak atomicity safety mechanisms, such as privatization. Furthermore, even if data is accessed outside the transaction, optimized safety mechanisms may be performed to ensure isolation between the potentially conflicting accesses, while eliding the mechanisms for data not accessed outside the transaction.

Systems and methods for incremental compilation at runtime using relaxed guards

Issued November 24, 2015 US US9195441

Techniques provided herein facilitate just-in-time compilation of source code, such as a script, during execution. According to some embodiments, a tracelet is limited to a single basic block of code. The data types of variable values provided by one or more variables used in the single basic block of code are known by generalized categories, rather than only being known by specific data types. Accordingly, guard code associated with each tracelet, which ensures that variable values received by the tracelet through the variables are of the data types expected by the tracelet's associated code body, can use generalized data types. The tracelet can contain a code body that can handle input values that meet those generalized data types. A generalized data type can be defined according to one or more common characteristics shared by two or more specific data types.
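
A relaxed guard can be sketched as a predicate over a type category instead of an exact type. The example below is illustrative only: the `CATEGORIES` table, the `Tracelet` class, and the category names are assumptions and do not come from the patent or from any real JIT.

```python
from numbers import Number

# Hypothetical generalized categories, each defined by a shared characteristic.
CATEGORIES = {
    "number": lambda v: isinstance(v, Number) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "any": lambda v: True,
}


class Tracelet:
    """A compiled basic block plus guards expressed as type categories."""

    def __init__(self, guards, body):
        self.guards = guards  # variable name -> category name
        self.body = body      # callable standing in for generated code

    def guard_passes(self, env):
        return all(CATEGORIES[cat](env[name]) for name, cat in self.guards.items())

    def run(self, env):
        if not self.guard_passes(env):
            # A real JIT would fall back or compile another tracelet here.
            raise TypeError("guard failed")
        return self.body(env)


add = Tracelet({"a": "number", "b": "number"}, lambda env: env["a"] + env["b"])
int_result = add.run({"a": 2, "b": 3})
float_result = add.run({"a": 2, "b": 0.5})
```

Because the guard accepts the whole "number" category, one tracelet serves both the integer and the floating-point call, where exact-type guards would have required two.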

Methods and systems to identify and reproduce concurrency violations in multi-threaded programs using expressions

Issued September 15, 2015 US US9135139

Methods and systems to identify and reproduce concurrency bugs in multi-threaded programs are disclosed. An example method disclosed herein includes defining a data type. The data type includes a first predicate associated with a first thread of a multi-threaded program that is associated with a first condition, a second predicate that is associated with a second thread of the multi-threaded program, the second predicate being associated with a second condition, and an expression that defines a relationship between the first predicate and the second predicate. The relationship, when satisfied, causes the concurrency bug to be detected. A concurrency bug detector conforming to the data type is used to detect the concurrency bug in the multi-threaded program.

Instrumentation of hardware assisted transactional memory system

Issued July 28, 2015 US US9092253B2

Monitoring performance of one or more architecturally significant processor caches coupled to a processor. The methods include executing an application on one or more processors coupled to one or more architecturally significant processor caches, where the application utilizes the architecturally significant portions of the architecturally significant processor caches. The methods further include at least one of generating metrics related to performance of the architecturally significant processor caches; implementing one or more debug exceptions related to performance of the architecturally significant processor caches; or implementing one or more transactional breakpoints related to performance of the architecturally significant processor caches as a result of utilizing the architecturally significant portions of the architecturally significant processor caches.

Incremental compilation of a script code in a distributed environment

Issued March 17, 2015 US US8984492

Disclosed here are methods, systems, paradigms and structures for incrementally compiling scripts at runtime to generate executable code. In a first phase, an executable block for a basic block of the script is generated for a set of types of variables of the basic block. In a second phase, a set of executable blocks whose profiling information, such as frequency of (a) execution, (b) transition between executable blocks, or (c) execution of a path, satisfies an optimization criterion is identified, and an executable control region is generated. In a third phase, profiling information from a number of systems in a distributed environment is aggregated, and an executable control region corresponding to the aggregated profile is generated. The executable code generated in each of the phases is more optimal than the code generated in a previous phase, and is used for execution until replaced by the code of a subsequent phase.

Systems and methods for data-parallel processing

Issued February 10, 2015 US US8954986

Methods, systems, and mediums are described for scheduling data-parallel tasks onto multiple thread execution units of a processing system. Embodiments of a lock-free queue structure and methods of operation are described to implement a method for scheduling fine-grained data-parallel tasks for execution in a computing system. The work of one of a plurality of worker threads is wait-free with respect to the other worker threads. Each node of the queue holds a reference to a task that may be concurrently performed by multiple thread execution units, but each on a different subset of data. Various embodiments relate to software-based scheduling of data-parallel tasks on a multi-threaded computing platform that does not perform such scheduling in hardware. Other embodiments are also described and claimed.

Methods and systems for mapping a function pointer to the device code

Issued February 3, 2015 US US8949777B2

Methods for mapping a function pointer to the device code are presented. In one embodiment, a method includes identifying a function which is executable by processing devices. The method includes generating codes including a first code corresponding to a first processing device and a second code corresponding to a second processing device. The second processing device is architecturally different from the first processing device. The method further includes storing the second code in a byte string such that the second code is retrievable if the function will be executed by the second processing device.

Private memory regions and coherence optimizations

Issued August 19, 2014 US US8812796

Private or shared read-only memory regions. One embodiment may be practiced in a computing environment including a plurality of agents. A method includes acts for declaring one or more memory regions private to a particular agent or shared read-only amongst agents by having software utilize processor-level instructions to specify to hardware the private or shared read-only memory address regions. The method includes an agent executing a processor-level instruction to specify one or more memory regions as private to the agent or shared read-only amongst a plurality of agents. As a result of the agent executing that instruction, a hardware component monitors the one or more memory regions for conflicting accesses or prevents conflicting accesses on the one or more memory regions.

Method and system for safe enqueuing of events

Issued August 19, 2014 US US8813083

A method and system to facilitate a user level application executing in a first processing unit to enqueue work or task(s) safely for a second processing unit without performing any ring transition. For example, in one embodiment of the invention, the first processing unit executes one or more user level applications, where each user level application has a task to be offloaded to a second processing unit. The first processing unit signals the second processing unit to handle the task from each user level application without performing any ring transition in one embodiment of the invention.

Software filtering in a transactional memory system

Issued May 6, 2014 US US8719514

A method and apparatus for utilizing hardware mechanisms of a transactional memory system is herein described. Various embodiments relate to software-based filtering of operations from read and write barriers and read isolation barriers during transactional execution. Other embodiments relate to software-implemented read barrier processing to accelerate strong atomicity. Other embodiments are also described and claimed.

Handling precompiled binaries in a hardware accelerated software transactional memory system

Issued May 6, 2014 US US8719807

A method and apparatus for enabling a Software Transactional Memory (STM) with precompiled binaries is herein described. Upon encountering an access operation in a transaction, an annotation field associated with a memory location referenced by the access is checked. In response to the memory location representing a previous similar access within the transaction, the access is performed without access barriers. However, if the annotation field is in a default state representing no previous access during a pendency of the transaction, then a mode of the processor is determined. If the processor is in implicit mode, an access handler/barrier is asynchronously executed. Conversely, in an explicit mode, a flag is set instead of asynchronously executing the handler. In addition, during compilation, convert-explicit and convert-implicit instructions are inserted to intelligently convert modes for precompiled and newly compiled binaries. Furthermore, new versions of newly compiled functions may be inserted to provide strong atomicity between previously and newly compiled functions.

Mechanisms for strong atomicity in a transactional memory system

Issued April 22, 2014 US US8706982

A method and apparatus for providing efficient strong atomicity is herein described. Optimized strong operations may be inserted at non-transactional read accesses to provide efficient strong atomicity. A global transaction value is copied at the beginning of a non-transactional function to a local transaction value (LTV), essentially creating a local timestamp of the global transaction value. At a non-transactional memory access within the function, a counter value or version value is compared to the LTV to see if a transaction has started updating memory locations, or specifically the memory location accessed. If memory locations have not been updated by a transaction, execution is accelerated by avoiding a full set of slowpath strong atomic operations to ensure validity of data accessed. In contrast, the slowpath operations may be executed to resolve contention between a transactional and a non-transactional access contending for the same memory location.
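
The timestamp comparison in this abstract can be pictured with a small single-threaded Python sketch. It is only an illustration under stated assumptions: the class `VersionedMemory`, the function `strong_read`, and the per-location version map are hypothetical names invented here, and the slow path is reduced to a counter rather than real strong-atomicity operations.

```python
class VersionedMemory:
    """Toy memory where each location records when a transaction last wrote it."""

    def __init__(self):
        self.global_txn_value = 0  # advanced whenever a transaction updates memory
        self.values = {}
        self.versions = {}         # location -> global value at its last transactional write
        self.slowpath_count = 0

    def transactional_write(self, loc, value):
        self.global_txn_value += 1
        self.values[loc] = value
        self.versions[loc] = self.global_txn_value

    def slowpath_read(self, loc):
        # Stand-in for the full set of slow-path strong-atomicity operations.
        self.slowpath_count += 1
        return self.values.get(loc)


def strong_read(mem, loc, ltv):
    """Non-transactional read: fast path if loc was not updated after the local timestamp."""
    if mem.versions.get(loc, 0) <= ltv:
        return mem.values.get(loc)   # fast path, no slow-path operations
    return mem.slowpath_read(loc)    # a transaction wrote loc after the timestamp


mem = VersionedMemory()
mem.transactional_write("x", 1)
ltv = mem.global_txn_value           # copied once at function entry
first = strong_read(mem, "x", ltv)   # unchanged since the timestamp: fast path
mem.transactional_write("x", 2)      # a transaction updates x afterwards
second = strong_read(mem, "x", ltv)  # newer version than the timestamp: slow path
```

Keeping the timestamp local to the function means the common case costs one comparison per read, which is the acceleration the abstract refers to.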

Unbounded transactional memory systems

Issued March 25, 2014 US US8683143

Methods and apparatus to provide unbounded transactional memory systems are described. In one embodiment, an operation corresponding to a software transactional memory (STM) access may be executed if a preceding hardware transactional memory (HTM) access operation fails.
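
As a rough illustration of the fallback described above, the following Python sketch tries a hardware path first and runs the software path only if the hardware attempt fails. The names `HardwareAbort`, `run_access`, and `overflowing_htm_read` are hypothetical, and the exception merely stands in for a hardware transaction failure; this is a sketch of the general idea, not the patented mechanism.

```python
class HardwareAbort(Exception):
    """Hypothetical signal that the simulated hardware (HTM) attempt failed."""


def run_access(htm_op, stm_op):
    """Try the hardware path; execute the software (STM) path only on failure."""
    try:
        return ("htm", htm_op())
    except HardwareAbort:
        return ("stm", stm_op())


def overflowing_htm_read():
    # Assumed failure cause for illustration only, e.g. exceeding hardware capacity.
    raise HardwareAbort("capacity exceeded")


hw_result = run_access(lambda: 42, lambda: 42)            # hardware path succeeds
sw_result = run_access(overflowing_htm_read, lambda: 42)  # falls back to software
```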

Accelerating software lookups by using buffered or ephemeral stores

Issued February 18, 2014 US US8656113

A method and apparatus for accelerating lookups in an address based table is herein described. When an address and value pair is added to an address based table, the value is privately stored in the address to allow for quick and efficient local access to the value. In response to the private store, a cache line holding the value is transitioned to a private state, to ensure the value is not made globally visible. Upon eviction of the privately held cache line, the information is not written-back to ensure locality of the value. In one embodiment, the address based table includes a transactional write buffer to hold addresses, which correspond to tentatively updated values during a transaction. Accesses to the tentative values during the transaction may be accelerated through use of annotation bits and private stores as discussed herein. Upon commit of the transaction, the values are copied to the location to make the updates globally visible.

Mechanism for irrevocable transactions

Issued January 7, 2014 US US8627048

A method and apparatus for designating and handling irrevocable transactions is herein described. In response to detecting an irrevocable event, such as an I/O operation, a user-defined irrevocable designation, or a dynamic failure profile, a transaction is designated as irrevocable. In response to designating a transaction as irrevocable, Single Owner Read Locks (SORLs) are acquired for previous and subsequent reads in the irrevocably designated transaction to ensure the transaction is able to complete without modification to locations read from, while permitting remote resources to load from those locations to continue execution.

Dynamic optimization for removal of strong atomicity barriers

Issued December 17, 2013 US US8612950B2

A method and apparatus for dynamic optimization of strong atomicity barriers is herein described. During runtime compilation, code including non-transactional memory accesses that are to conflict with transactional memory accesses is patched to insert transactional barriers at the conflicting non-transactional memory accesses to ensure isolation and strong atomicity. However, barriers are omitted or removed from non-transactional memory accesses that do not conflict with transactional memory accesses to reduce barrier execution overhead.

Method and apparatus to facilitate shared pointers in a heterogeneous platform

Issued October 22, 2013 US US8566537

A method and apparatus to facilitate shared pointers in a heterogeneous platform. In one embodiment of the invention, the heterogeneous or non-homogeneous platform includes, but is not limited to, a central processing core or unit, a graphics processing core or unit, a digital signal processor, an interface module, and any other form of processing cores. The heterogeneous platform has logic to facilitate sharing of pointers to a location of a memory shared by the CPU and the GPU. By sharing pointers in the heterogeneous platform, the data or information sharing between different cores in the heterogeneous platform can be simplified.

Technique for using memory attributes

Issued October 15, 2013 US US8560781

A technique for using memory attributes to relay information to a program or other agent. More particularly, embodiments of the invention relate to using memory attribute bits to check various memory properties in an efficient manner.

Unified optimistic and pessimistic concurrency control for a software transactional memory (STM) system

Issued October 8, 2013 US US8555016

A method and apparatus for unified concurrency control in a Software Transactional Memory (STM) is herein described. A transaction record associated with a memory address referenced by a transactional memory access operation includes optimistic and pessimistic concurrency control fields. Access barriers and other transactional operations/functions are utilized to maintain both fields of the transaction record, appropriately. Consequently, concurrent execution of optimistic and pessimistic transactions is enabled.

Accelerating unbounded memory transactions using nested cache resident transactions

Issued September 17, 2013 US US8539465

Using cache resident transaction hardware to accelerate a software transactional memory system. The method includes identifying a plurality of atomic operations intended to be performed by a software transactional memory system as transactional operations as part of a software transaction. The method further includes selecting at least a portion of the plurality of atomic operations. The method further includes attempting to perform the portion of the plurality of atomic operations as hardware transactions using cache resident transaction hardware.

Hardware acceleration for a software transactional memory system

Issued August 27, 2013 US US8521965

A method and apparatus for accelerating transactional execution. Barriers associated with shared memory lines referenced by memory accesses within a transaction are only invoked/executed the first time the shared memory lines are accessed within a transaction. Hardware support, such as a transaction field/transaction bits, is provided to determine if an access is the first access to a shared memory line during a pendency of a transaction. Additionally, in an aggressive operational mode, version numbers representing versions of elements stored in shared memory lines are not stored and validated upon commitment, to save on validation costs. Moreover, even in a cautious mode, which stores version numbers to enable validation, validation costs may not be incurred if eviction of accessed shared memory lines does not occur during execution of the transaction.
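
The first-access filtering described here can be sketched in a few lines of Python. This is a software-only illustration under stated assumptions: the class `Transaction`, the 64-byte line size, and the `accessed_lines` set (which stands in for the per-line hardware transaction bits) are all hypothetical, and the barrier body is reduced to a counter.

```python
class Transaction:
    """Toy transaction that runs a barrier only on the first access to each memory line."""

    def __init__(self, memory, line_size=64):
        self.memory = memory          # address -> value
        self.line_size = line_size    # assumed line size, for illustration
        self.accessed_lines = set()   # stand-in for hardware transaction bits
        self.barrier_calls = 0

    def _barrier(self, line):
        # Placeholder for the real barrier work (logging, lock or version checks).
        self.barrier_calls += 1

    def read(self, addr):
        line = addr // self.line_size
        if line not in self.accessed_lines:   # first access to this line in the transaction
            self._barrier(line)
            self.accessed_lines.add(line)
        return self.memory.get(addr, 0)       # later accesses skip the barrier


txn = Transaction({0: 10, 8: 20, 64: 30})
a = txn.read(0)    # line 0, first access: barrier runs
b = txn.read(8)    # line 0 again: barrier skipped
c = txn.read(64)   # line 1, first access: barrier runs
```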

Handling operating system (OS) transitions in an unbounded transactional memory (UTM) mode

Issued August 27, 2013 US US8521995

A method includes receiving control in a kernel mode via a ring transition from a user thread during execution of an unbounded transactional memory (UTM) transaction, updating a state of a transaction status register (TSR) associated with the user thread and storing the TSR with a context of the user thread, and later restoring the context during a transition from the kernel mode to the user thread. In this way, the UTM transaction may continue on resumption of the user thread.

Performing escape actions in transactions

Issued July 16, 2013 US US8489864

Performing non-transactional escape actions within a hardware based transactional memory system. A method includes at a hardware thread on a processor beginning a hardware based transaction for the thread. Without committing or aborting the transaction, the method further includes suspending the hardware based transaction and performing one or more operations for the thread, non-transactionally and not affected by: transaction monitoring and buffering for the transaction, an abort for the transaction, or a commit for the transaction. After performing one or more operations for the thread, non-transactionally, the method further includes resuming the transaction and performing additional operations transactionally. After performing the additional operations, the method further includes either committing or aborting the transaction.

Debugging mechanisms in a cache-based memory isolation system

Issued June 25, 2013 US 8473921

In a computing environment, a method of debugging a software application, wherein the software application is configured to use one or more processor caches coupled to a processor in an architecturally significant fashion, the method comprising: beginning execution of the software application at a physical processor; running a debugger while executing the software application at the physical processor; detecting that a portion of the software application causes at least one of reads or writes to be made to a processor cache in an architecturally significant fashion; and based on detecting that the portion of the software application causes at least one of reads or writes to be made to the processor cache in an architecturally significant fashion, preserving any reads or writes made to the cache in an architecturally significant fashion, while performing debugging operations with the debugger that would ordinarily disturb the reads or writes made to the processor cache in an architecturally significant fashion, including: taking a snapshot of physical processor state of the physical processor and pausing execution of the software application at the physical processor; executing the portion of the software application using a software simulator that simulates the physical processor using the snapshot of physical processor state and that simulates the processor cache, while also performing the debugging operations with the debugger; and subsequent to executing the portion of the software application using the software simulator: applying simulated processor state to the physical processor; and resuming execution of the software application at the physical processor.

Debugging mechanisms in a cache-based memory isolation system

Issued June 25, 2013 US US8473921

Debugging software in systems with architecturally significant processor caches. A method may be practiced in a computing environment. The method includes acts for debugging a software application, wherein the software application is configured to use one or more architecturally significant processor caches coupled to a processor. The method includes beginning execution of the software application. A debugger is run while executing the software application. The software application causes at least one of reads or writes to be made to the cache in an architecturally significant fashion. The reads or writes made to the cache in an architecturally significant fashion are preserved while performing debugging operations that would ordinarily disturb the reads or writes made to the cache in an architecturally significant fashion.

Concurrent thread execution using user-level asynchronous signaling

Issued June 18, 2013 US US8468526B2

Various usage models are provided to utilize a Monitor and Call (“mcall”) instruction that incorporates user-level asynchronous signaling. The various usage models utilize the mcall instruction in a multithreading system in order to enhance concurrent thread execution. Other embodiments are also described and claimed.

Increasing functionality of a reader-writer lock

Issued March 26, 2013 US US8407386

In one embodiment, the present invention includes a method for accessing a shared memory associated with a reader-writer lock according to a first concurrency mode, dynamically changing from the first concurrency mode to a second concurrency mode, and accessing the shared memory according to the second concurrency mode. In this way, concurrency modes can be adaptively changed based on system conditions. Other embodiments are described and claimed.

Efficient garbage collection and exception handling in a hardware accelerated transactional memory system

Issued March 19, 2013 US US8402218B2

Handling garbage collection and exceptions in hardware assisted transactions. Embodiments are practiced in a computing environment including a hardware assisted transaction system. Embodiments include acts for writing to a card table outside of a transaction; handling garbage collection compaction occurring when a hardware transaction is active by using a common global variable and instructing one or more agents to write to the common global variable any time an operation is performed which may change an object's virtual address; acts for managing a thread-local allocation context; and acts for handling exceptions while in a hardware assisted transaction. A method includes beginning a hardware assisted transaction, raising an exception while in the hardware assisted transaction, including creating an exception object, determining that the transaction should be rolled back, and as a result of determining that the transaction should be rolled back, marshaling the exception object out of the hardware assisted transaction.

Metaphysically addressed cache metadata

Issued February 5, 2013 US US8370577

Storing metadata that is disjoint from corresponding data by storing the metadata to the same address as the corresponding data but in a different address space. A metadata store instruction includes a storage address for the metadata. The storage address is the same address as that for data corresponding to the metadata, but the storage address, when used for the metadata, is implemented in a metadata address space, while the storage address, when used for the corresponding data, is implemented in a different data address space. As a result of executing the metadata store instruction, the metadata is stored at the storage address. A metadata load instruction includes the storage address for the metadata. As a result of executing the metadata load instruction, the metadata stored at the address is received. Some embodiments may further implement a metadata clear instruction which clears any entries in the metadata address space.

Efficient non-transactional write barriers for strong atomicity

Issued January 29, 2013 US US8364911B2

A method and apparatus for providing optimized strong atomicity operations for non-transactional writes is herein described. Locks are acquired upon initial non-transactional writes to memory locations. The locks are maintained until an event is detected resulting in the release of the locks. As a result, in the intermediary period between acquiring and releasing the locks, any subsequent writes to memory locations that are locked are accelerated through non-execution of lock acquire operations.
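
A minimal sketch of the lock-caching idea, under stated assumptions: the class `NonTxWriter`, the dictionary-based lock table, and the `release_all` trigger are hypothetical and only illustrate how a held lock lets later writes skip the acquire step.

```python
class NonTxWriter:
    """Toy non-transactional writer that keeps locks between writes."""

    def __init__(self, memory, locks):
        self.memory = memory      # addr -> value
        self.locks = locks        # addr -> owner, or None when free
        self.held = set()         # addresses whose lock this writer holds
        self.acquire_count = 0    # how many real lock acquisitions happened

    def write(self, addr, value):
        if addr not in self.held:
            # First write to this location: acquire the lock.
            if self.locks.get(addr) is not None:
                raise RuntimeError("location is owned by another writer")
            self.locks[addr] = self
            self.held.add(addr)
            self.acquire_count += 1
        # Later writes reach this line without any lock acquire operation.
        self.memory[addr] = value

    def release_all(self):
        # Called when some release event is detected (the event itself is
        # not modeled here).
        for addr in self.held:
            self.locks[addr] = None
        self.held.clear()
```

Three writes to the same address cost one acquisition in this model; only after `release_all` does the next write pay for the lock again.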

Performing mode switching in an unbounded transactional memory (UTM) system

Issued January 29, 2013 US US8365016

In one embodiment, the present invention includes a method for selecting a first transaction execution mode to begin a first transaction in an unbounded transactional memory (UTM) system having a plurality of transaction execution modes. These transaction execution modes include hardware modes to execute within a cache memory of a processor, a hardware assisted mode to execute using transactional hardware of the processor and a software buffer, and a software transactional memory (STM) mode to execute without the transactional hardware. The first transaction execution mode can be selected to be the highest-performing of the hardware modes if no pending transaction is executing in the STM mode; otherwise a lower-performing mode can be selected. Other embodiments are described and claimed.

Minimizing code duplication in an unbounded transactional memory system by using mode agnostic transactional read and write barriers

Issued January 15, 2013 US US8356166

Minimizing code duplication in an unbounded transactional memory system. A computing apparatus including one or more processors in which it is possible to use a set of common mode-agnostic TM barrier sequences that runs on legacy ISA and extended ISA processors, and that employs hardware filter indicators (when available) to filter redundant applications of TM barriers, and that enables a compiled binary representation of the subject code to run correctly in any of the currently implemented set of transactional memory execution modes, including running the code outside of a transaction, and that enables the same compiled binary to continue to work with future TM implementations which may introduce as yet unknown future TM execution modes.

Method and apparatus to improve execution of a stored program

Issued January 1, 2013 US US8346760

In one embodiment, the invention provides a method comprising determining metadata encoded in instructions of a stored program; and executing the stored program based on the metadata.

Mechanisms to accelerate transactions using buffered stores

Issued November 20, 2012 US US8316194

In one embodiment, the present invention includes a method for executing a transactional memory (TM) transaction in a first thread, buffering a block of data in a first buffer of a cache memory of a processor, and acquiring a write monitor on the block to obtain ownership of the block at an encounter time in which data at a location of the block in the first buffer is updated. Other embodiments are described and claimed.

Thread synchronization methods and apparatus for managed run-time environments

Issued October 30, 2012 US US8302099

An example method disclosed herein comprises initiating a first optimistically balanced synchronization to acquire a lock of an object, the first optimistically balanced synchronization comprising a first optimistically balanced acquisition and a first optimistically balanced release to be performed on the lock by a same thread and at a same nesting level, releasing the lock after execution of program code covered by the lock if a stored state of the first optimistically balanced release indicates that the first optimistically balanced release is still valid, the stored state of the first optimistically balanced release being initialized prior to execution of the program code to indicate that the first optimistically balanced release is valid, and throwing an exception after execution of the program code covered by the lock if the stored state of the first optimistically balanced release indicates that the first optimistically balanced release is no longer valid.

Operating system virtual memory management for hardware transactional memory

Issued August 21, 2012 US 8,250,331
Also with Koichi Yamada, Landy Wang, David Callahan, Vadim Bassim.

Device, system, and method of executing a call to a routine within a transaction

Issued August 14, 2012 US US8245244

Device, system, and method of executing a call to a routine within a transaction. In some embodiments an apparatus may include a memory having stored thereon compiled code corresponding to a transaction, wherein the transaction includes at least one call to a first routine of a pair of first and second mutually inverse routines, and wherein the compiled code includes a call to a first wrapped routine replacing the call to the first routine; and a runtime library including wrapper code, wherein the wrapper code, when executed in response to the call to the first wrapped routine, results in executing the call to the first routine within the transaction and undoing the call to the first routine responsive to abort of the transaction. Other embodiments are described and claimed.
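
The wrapper mechanism can be sketched as follows. This is an illustration only: `Transaction`, `call_wrapped`, `toy_alloc`, and `toy_free` are hypothetical names, with the toy allocate/free pair standing in for any pair of mutually inverse routines.

```python
class Transaction:
    """Toy transaction that can undo calls to routines with known inverses."""

    def __init__(self):
        self.undo_log = []  # (inverse routine, result of the forward call)

    def call_wrapped(self, routine, inverse, *args):
        # Wrapper code: run the real routine, then remember how to undo it.
        result = routine(*args)
        self.undo_log.append((inverse, result))
        return result

    def commit(self):
        # The forward calls become permanent; nothing is left to undo.
        self.undo_log.clear()

    def abort(self):
        # Undo the forward calls in reverse order by calling the inverses.
        while self.undo_log:
            inverse, result = self.undo_log.pop()
            inverse(result)


# Hypothetical pair of mutually inverse routines.
live_handles = set()
_counter = [0]

def toy_alloc():
    _counter[0] += 1
    live_handles.add(_counter[0])
    return _counter[0]

def toy_free(handle):
    live_handles.discard(handle)
```

Calling `tx.call_wrapped(toy_alloc, toy_free)` inside a transaction and then `tx.abort()` leaves `live_handles` as it was before the call, while `tx.commit()` keeps the allocation.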

Hardware accelerated transactional memory system with open nested transactions

Issued July 24, 2012 US 8,229,907
Also with Mike Magruder.

Hardware accelerated transactional memory system with open nested transactions

Issued July 24, 2012 US US8229907

Hardware assisted transactional memory system with open nested transactions. Embodiments include a system whereby hardware acceleration of transactions can be accomplished by implementing open nested transaction in hardware which respect software locks such that a top level transaction can be implemented in software, and thus not be limited by hardware constraints typical when using hardware transactional memory systems.

Hardware acceleration of a write-buffering software transactional memory

Issued June 12, 2012 US US8200909

A method and apparatus for accelerating a software transactional memory (STM) system is described herein. Annotation fields are associated with lines of a transactional memory. An annotation field associated with a line of the transactional memory is initialized to a first value upon starting a transaction. In response to encountering a read operation in the transaction, the annotation field is checked. If the annotation field includes the first value, the read is serviced from the line of the transactional memory without having to search an additional write space. A second and a third value in the annotation field potentially indicate whether a read operation missed the transactional memory or a tentative value is stored in a write space. Additionally, an additional bit in the annotation field may be utilized to indicate whether previous read operations have been logged, allowing subsequent redundant read logging to be reduced.

Hybrid transactions for low-overhead speculative parallelization

Issued June 5, 2012 US US8195898

A method and apparatus for a hybrid transactional memory system is herein described. A first transaction is executed utilizing a first style of a transactional memory system and a second transaction is executed in parallel utilizing a second style of a transactional memory system. For example, a main thread is executed utilizing an update-in place Software Transactional Memory (STM) system while a parallel thread, such as a helper thread, is executed utilizing a write buffering STM. As a result, a main thread may directly update memory locations, while a helper thread's transactional writes are buffered to ensure they do not invalidate transactional reads of the main thread. Therefore, parallel execution of threads is achieved, while ensuring at least one thread, such as a main thread, does not degrade below an amount of execution cycles it would take to execute the main thread serially.

System and method for allocating and deallocating memory within transactional code

Issued May 29, 2012 US US8190845

Methods and systems are provided for managing memory allocations and deallocations while in transactional code, including nested transactional code. The methods and systems manage transactional memory operations by using identifiers, such as sequence numbers, to handle memory management in transactions. The methods and systems also maintain lists of deferred actions to be performed at transaction abort and commit times. A number of memory management routines associated with one or more transactions examine the transaction sequence number of the current transaction, manipulate commit and/or undo logs, and set/use the transaction sequence number of an associated object, but are not so limited. The methods and systems provide for memory allocation and deallocations within transactional code while preserving transactional semantics. Other embodiments are described and claimed.

Wait loss synchronization

Issued April 17, 2012 US US8161247

Synchronizing threads on loss of memory access monitoring. Using a processor level instruction included as part of an instruction set architecture for a processor, a read, or write monitor to detect writes, or reads or writes respectively from other agents on a first set of one or more memory locations and a read, or write monitor on a second set of one or more different memory locations are set. A processor level instruction is executed, which causes the processor to suspend executing instructions and optionally to enter a low power mode pending loss of a read or write monitor for the first or second set of one or more memory locations. A conflicting access is detected on the first or second set of one or more memory locations or a timeout is detected. As a result, the method includes resuming execution of instructions.

Using ephemeral stores for fine-grained conflict detection in a hardware accelerated STM

Issued March 20, 2012 US US8140773

A method and apparatus for fine-grained filtering in a hardware accelerated software transactional memory system is herein described. A data object, which may have any arbitrary size, is associated with a filter word. The filter word is in a first default state when no access, such as a read, from the data object has occurred during a pendency of a transaction. Upon encountering a first access, such as a first read, from the data object, access barrier operations including an ephemeral/private store operation to set the filter word to a second state are performed. Upon a subsequent/redundant access, such as a second read, the access barrier operations are elided to accelerate the subsequent access, based on the filter word being set to the second state to indicate a previous access occurred.
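
A single-threaded sketch of the filtering step, with hypothetical names (`TxObject`, `ReadTx`). A plain attribute stands in for the filter word; the ephemeral/private store that keeps the word private to one thread in hardware is not modeled.

```python
class TxObject:
    def __init__(self, value):
        self.value = value
        self.filter_word = 0  # 0 = default state: not yet read in this transaction


class ReadTx:
    """Toy transaction whose read barrier runs once per object."""

    def __init__(self):
        self.read_set = []
        self.barriers_run = 0

    def read(self, obj):
        if obj.filter_word == 0:
            # First access: run the full barrier (here, just log the read)
            # and flip the filter word to the second state.
            self.read_set.append(obj)
            self.barriers_run += 1
            obj.filter_word = 1
        # Redundant accesses fall straight through to the load.
        return obj.value

    def end(self):
        # Reset filter words so the next transaction starts from the default state.
        for obj in self.read_set:
            obj.filter_word = 0
        self.read_set.clear()
```

Reading the same object three times in one `ReadTx` runs the barrier once and logs one read-set entry.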

Thread synchronization via selective modification of stored states of pending optimistically balanced lock releases having previous lock owner and validity flag

Issued March 13, 2012 US US8136112

Thread synchronization methods and apparatus for managed run-time environments are disclosed. An example method to maintain state information for optimistically balanced synchronization of a lock of an object in a managed runtime environment disclosed herein comprises storing state information comprising a state of each pending optimistically balanced release operation corresponding to each pending optimistically balanced synchronization to be performed on the lock of the object, each pending optimistically balanced synchronization comprising respective paired acquisition and release operations between which an unknown number of unpaired locking operations are to occur, and modifying a first stored state of a first pending optimistically balanced release operation when a subsequent unpaired locking operation is performed on the lock, but not modifying any stored state of any pending optimistically balanced release, including the first stored state of a first pending optimistically balanced release operation, when a subsequent optimistically balanced synchronization is performed on the lock.

Mechanism for software transactional memory commit/abort in unmanaged runtime environment

Issued March 6, 2012 US US8132158

A method and apparatus for ensuring integrity of transaction exit functions is herein described. Dead local data in a transaction is prevented from overwriting local variables associated with a transaction exit function. In a write-buffering Software Transactional Memory (STM) system, a commit function is associated with a private stack to store local variables to ensure write-back of local dead data in a write-buffer does not corrupt the commit function. Similarly, in a roll-back STM, an abort function is associated with a private stack to store local variables to ensure the roll-back of a program stack with local dead data from a write log does not corrupt the abort function. Alternatively, one stack may be used for the transaction including a first function and an exit function. Here, local dead variables are detected and prevented from overwriting local variables of the exit function.

Array comparison and swap operations

Issued January 31, 2012 US US8108627

A transactional memory system, method and apparatus are disclosed. An embodiment of the method includes attempting to acquire a write lock provided by an implementation of a software transactional memory (STM) system for each of a set of memory locations of the STM; if a write lock is acquired for each of the set of memory locations, comparing the value in each of the set of memory locations to a corresponding expected value; and if the comparing yields the same, predetermined result for each of the set of memory locations, storing in each memory location a corresponding new value. Other embodiments are also described and claimed.
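
The three steps (lock every location, compare every location, store every new value) can be sketched in Python. The function name `array_cas` and the use of `threading.Lock` objects as write locks are assumptions made for illustration; an actual STM would supply its own write locks.

```python
import threading


def array_cas(cells, locks, indices, expected, new):
    """Multi-location compare-and-swap over a list; returns True on success."""
    acquired = []
    try:
        # Step 1: try to take the write lock for every location.
        # Indices are assumed to be distinct.
        for i in indices:
            if not locks[i].acquire(blocking=False):
                return False  # some lock is busy: give up without waiting
            acquired.append(i)
        # Step 2: compare each location with its expected value.
        if any(cells[i] != e for i, e in zip(indices, expected)):
            return False
        # Step 3: every comparison matched, so store all the new values.
        for i, v in zip(indices, new):
            cells[i] = v
        return True
    finally:
        # Release whatever was acquired, on success and on failure.
        for i in acquired:
            locks[i].release()
```

With `cells = [1, 2, 3]`, a call that expects `1` and `3` at indices 0 and 2 succeeds and updates both cells; a call with any stale expected value changes nothing.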

Performing mode switching in an unbounded transactional memory (UTM) system

Issued January 10, 2012 US 8,095,824
Also with Mike Magruder, Matt Tolton, Vadim Bassim.

Hardware acceleration of strongly atomic software transactional memory

Issued November 22, 2011 US US8065490

In accordance with some embodiments, software transactional memory may be used for both managed and unmanaged environments. If a cache line is resident in a cache and this is not the first time that the cache line has been read since the last write, then the data may be read directly from the cache line, improving performance. Otherwise, a normal read may be utilized to read the information. Similarly, write performance can be accelerated in some instances to improve performance.

Efficient and consistent software transactional memory

Issued November 15, 2011 US US8060482

A method and apparatus for efficient and consistent validation/conflict detection in a Software Transactional Memory (STM) system is herein described. A version check barrier is inserted after a load to compare versions of loaded values before and after the load. In addition, a global timestamp (GTS) is utilized to track a latest committed transaction. Each transaction is associated with a local timestamp (LTS) initialized to the GTS value at the start of a transaction. As a transaction commits it updates the GTS to a new value and sets versions of modified locations to the new value. Pending transactions compare versions determined in read barriers to their LTS. If the version is greater than their LTS indicating another transaction has committed after the pending transaction started and initialized the LTS, then the pending transaction validates its read set to maintain efficient and consistent transactional execution.
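
A single-threaded sketch of the timestamp scheme, using hypothetical names (`GlobalClock`, `Cell`, `Tx`, `ConflictError`). Buffering writes until commit and advancing the LTS after a successful validation are assumptions of this sketch, not details stated in the abstract.

```python
class ConflictError(Exception):
    pass


class GlobalClock:
    def __init__(self):
        self.gts = 0  # global timestamp of the latest committed transaction


class Cell:
    def __init__(self, value):
        self.value = value
        self.version = 0


class Tx:
    def __init__(self, clock):
        self.clock = clock
        self.lts = clock.gts  # local timestamp, taken at transaction start
        self.read_set = {}    # cell -> version observed
        self.writes = {}      # cell -> buffered new value (assumption)

    def write(self, cell, value):
        self.writes[cell] = value

    def read(self, cell):
        if cell in self.writes:
            return self.writes[cell]
        before = cell.version
        value = cell.value
        # Version check barrier: the version must not change across the load.
        # (It cannot change in this single-threaded toy.)
        if cell.version != before:
            raise ConflictError("location changed during load")
        if before > self.lts:
            # Someone committed after this transaction started: validate.
            now = self.clock.gts
            self.validate()
            self.lts = now
        self.read_set[cell] = before
        return value

    def validate(self):
        for cell, seen in self.read_set.items():
            if cell.version != seen:
                raise ConflictError("read set is stale")

    def commit(self):
        self.validate()
        self.clock.gts += 1
        for cell, value in self.writes.items():
            cell.value = value
            cell.version = self.clock.gts
```

A transaction that only ever sees versions no newer than its LTS never walks its read set, which is where the efficiency in the abstract comes from.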

Method and apparatus for performing dynamic optimization for software transactional memory

Issued March 22, 2011 US US7913236

A method for managing a transaction includes determining that an optimistically immutable field in the transaction is written to, and invalidating a method in the transaction in response to determining that the method reads the optimistically immutable field. Other embodiments are disclosed and claimed.

Protecting shared variables in a software transactional memory system

Issued January 11, 2011 US US7870545

For a variable accessed at least once in a software-based transactional memory system (STM) defined (STM-defined) critical region of a program, modifying an access to the variable that occurs outside any STM-defined critical region by starting a hardware based transactional memory transaction; within the hardware based transactional memory transaction, checking if the variable is currently owned by an STM transaction; if the variable is not currently owned by an STM transaction, performing the access and then committing the hardware based transactional memory transaction; and if the variable is currently owned by an STM transaction, performing a responsive action.

Coordinating access to memory locations for hardware transactional memory transactions and software transactional memory transactions

Issued October 5, 2010 US US7809903

Provided is a method, system, and program for coordinating access to memory locations for hardware transactional memory transactions and software transactional memory transactions. A hardware transaction executing in hardware transactional memory initiates a request to access a memory location. A fault is returned to the hardware transaction request in response to an operation by one software transaction executing in a software transactional memory.

Methods and apparatus to tune intermediate representations in a managed runtime environment

Issued September 7, 2010 US US7793275

Methods and apparatus are disclosed to tune intermediate representations in a managed runtime environment. An example method disclosed herein receives a bytecode at a virtual machine during runtime, determines a method of the received bytecode, identifies an optimized intermediate representation associated with the method, and imports the optimized intermediate representation from the memory into the virtual machine. Other embodiments are described and claimed.

Software assisted nested hardware transactions

Issued June 1, 2010 US US7730286

A method and apparatus for efficiently executing nested transactions is herein described. Hardware support for execution of transactions is provided. Additionally, through the use of logging previous values immediately before a current nested transaction in a local memory and storage of a stack of handlers associated with a hierarchy of transactions, nested transactions are potentially efficiently executed. Upon a failure, abort, or invalidating event/access within a nested transaction, the state of variables or memory locations written to during execution of the nested transaction are rolled-back to immediately before the nested transaction, instead of all the way back to an original state of the variables or memory locations before an enclosing transaction. As a result, nested transactions may be re-executed within enclosing transactions, without flattening the enclosing and nested transactions to re-execute everything.
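
The partial rollback can be sketched with one undo log per nesting level. `NestedTx` and its method names are hypothetical, and a dictionary stands in for memory; the hardware support and handler stack mentioned in the abstract are not modeled.

```python
_MISSING = object()  # marks "address had no value before the write"


class NestedTx:
    """Toy nested transactions with one undo log per nesting level."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_stack = []

    def begin(self):
        self.undo_stack.append({})

    def write(self, addr, value):
        log = self.undo_stack[-1]
        if addr not in log:
            # Log the value as it was just before the current (innermost)
            # transaction first touched this address.
            log[addr] = self.memory.get(addr, _MISSING)
        self.memory[addr] = value

    def abort_innermost(self):
        # Roll back only to the start of the innermost transaction.
        log = self.undo_stack.pop()
        for addr, old in log.items():
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old

    def commit_innermost(self):
        # Hand the old values to the parent so it can still undo them later.
        log = self.undo_stack.pop()
        if self.undo_stack:
            parent = self.undo_stack[-1]
            for addr, old in log.items():
                parent.setdefault(addr, old)
```

Aborting an inner transaction restores the values the enclosing transaction had written, so only the inner part needs to be re-executed.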

Methods and apparatus to dynamically insert prefetch instructions based on garbage collector analysis and layout of objects

Issued August 18, 2009 US US7577947

Methods and apparatus to dynamically insert prefetch instructions are disclosed. In an example method, one or more samples associated with cache misses are identified from a performance monitoring unit in a processor system. Based on sample information associated with the one or more samples, delinquent information is generated. To dynamically insert one or more prefetch instructions, a prefetch point is identified based on the delinquent information.

Thread synchronization with lock inflation methods and apparatus for managed run-time environments

Issued July 28, 2009 US US7567963

Thread synchronization with lock inflation methods and apparatus for managed run-time environments are disclosed. An example method disclosed herein comprises determining a locking operation to perform on a lock corresponding to the object, performing an optimistically balanced synchronization of the lock if the locking operation is not unbalanced, and modifying a lock shape of the lock if the locking operation is unbalanced.

Transactional memory with automatic object versioning

Issued June 2, 2009 US US7542977

Embodiments of a system and method for transactional memory (TM) with automatic object versioning are described. Embodiments described herein include a TM system and method that facilitates the execution of object-oriented application programs in a transactional environment, including automatically versioning objects to enhance efficiency. Embodiments of the TM automatically designate versions of objects using pointers, accurately identifying usable and unusable versions. Object versioning as described herein allows the garbage collector to easily and efficiently determine which objects may be moved, freeing memory space and reducing the number of objects traversed by a transaction before finding a useable version of an object. Other embodiments are described and claimed.

Processor and memory controller capable of use in computing system that employs compressed cache lines' worth of information

Issued March 31, 2009 US US7512750

A memory controller is described that comprises a compression map cache. The compression map cache is to store information that identifies a cache line's worth of information that has been compressed with another cache line's worth of information. A processor and a memory controller integrated on the same semiconductor die are also described. The memory controller comprises a compression map cache. The compression map cache is to store information that identifies a cache line's worth of information that has been compressed with another cache line's worth of information.

See patent
Program object read barrier

Issued March 31, 2009 US US7512930

A method and apparatus for a read barrier mechanism are described. According to an embodiment, a method comprises receiving an access request for a program object; performing a combined check for a null reference or for a read barrier for the program object; and if the combined check is affirmative, performing a recovery operation.

See patent
Object based conflict detection in a software transactional memory

Issued March 10, 2009 US US7502897

Object-based conflict detection is described in the context of software transactional memory. In one example, a block of instructions is received for execution as an object in a software transactional memory transaction. The base of the object is computed, and a lock is found for the object using that base.

See patent
Dynamic performance monitoring-based approach to memory management

Issued February 10, 2009 US US7490117

Techniques are described for optimizing memory management in a processor system. The techniques may be implemented on processors that include on-chip performance monitoring and on systems where an external performance monitor is coupled to a processor. Processors that include a Performance Monitoring Unit (PMU) are examples. The PMU may store data on read and write cache misses, as well as data on translation lookaside buffer (TLB) misses. The data from the PMU is used to determine if any memory regions within a memory heap are delinquent memory regions, i.e., regions exhibiting high numbers of memory problems or stalls. If delinquent memory regions are found, the memory manager, such as a garbage collection routine, can efficiently optimize memory performance as well as the mutators performance by improving the layout of objects in the heap. In this way, memory management routines may be focused based on dynamic and real-time memory performance data.

See patent
Method for register allocation during instruction scheduling

Issued February 3, 2009 US US7487336

The present disclosure relates to the allocation of registers and the scheduling of instructions, and, more specifically, to the classifying of operands and allocation of registers to local operands.

See patent
Memory reclamation with optimistic concurrency

Issued January 13, 2009 US US7478210

Memory reclamation with optimistic concurrency is described. In one example, an allocated memory object is tentatively freed in a software transactional memory, the object having pointers into it from at least one transaction. The system detects when all transactions that were outstanding at the time the object was tentatively freed have ended, and the object is actually freed based on that detection.
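
The deferred-free idea in this abstract can be illustrated with a small single-threaded sketch. It is not the patented implementation: the class `DeferredReclaimer` and its method names are hypothetical, and a real software transactional memory would need concurrent, lock-free bookkeeping. The sketch only shows the rule that an object is actually freed once every transaction outstanding at the tentative free has ended.

```python
class DeferredReclaimer:
    """Hypothetical sketch: free an object only after all transactions
    that were outstanding at its tentative free have ended."""

    def __init__(self):
        self.active = set()   # transactions currently running
        self.pending = []     # (object, transactions still to wait for)
        self.freed = []       # objects that have actually been freed

    def begin(self, txn):
        self.active.add(txn)

    def tentative_free(self, obj):
        # Snapshot the transactions outstanding right now; later ones cannot see obj.
        self.pending.append((obj, set(self.active)))
        self._sweep()

    def end(self, txn):
        self.active.discard(txn)
        for _, waiting in self.pending:
            waiting.discard(txn)
        self._sweep()

    def _sweep(self):
        # Actually free every object whose wait set has drained.
        still_pending = []
        for obj, waiting in self.pending:
            if waiting:
                still_pending.append((obj, waiting))
            else:
                self.freed.append(obj)
        self.pending = still_pending
```

A transaction that begins after the tentative free does not delay reclamation, because it is not part of the snapshot.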

See patent
Methods and apparatus to dynamically insert prefetch instructions based on compiler and garbage collector analysis

Issued June 17, 2008 US US7389385

Methods and apparatus to insert prefetch instructions based on garbage collector analysis and compiler analysis are disclosed. In an example method, one or more batches of samples associated with cache misses from a performance monitoring unit in a processor system are received. One or more samples from the one or more batches of samples based on delinquent information are selected. A performance impact indicator associated with the one or more samples is generated. Based on the performance indicator, at least one of a garbage collector analysis and a compiler analysis is initiated to identify one or more delinquent paths. Based on the at least one of the garbage collector analysis and the compiler analysis, one or more prefetch points to insert prefetch instructions are identified.

See patent
Program phase detection for dynamic optimization

Issued June 17, 2008 US US7389502

A method, apparatus and system including selecting a phase threshold value, receiving a plurality of sequenced buffers, determining a distance between centers of at least two consecutive histogram bins, comparing the distance with the selected threshold value, and determining major execution phases of an executable process based on the comparison.

See patent
Methods and apparatus for optimizing the operating speed and size of a computer program

Issued April 29, 2008 US US7367022

Apparatus and methods for optimizing an operating speed and size of a computer program are disclosed. In an example, an apparatus includes an execution module to run a computer program, an exception detector to detect throws to an exception handler and to detect locations from which the throws occur, a memory to store data developed by the exception detector and a code adjuster to at least one of inline and fold the exception handler with respect to at least one of the detected locations.

See patent
Lock-free bounded FIFO queue mechanism

Issued April 29, 2008 US US7366831

A system includes a processor, a size-bounded first-in first-out (FIFO) memory connected to the processor, and a display connected to the processor. A managing process runs on the processor to manage the FIFO memory structure. The FIFO memory includes a counter portion and a value portion for each of a tail portion and a head portion, and the managing process is non-blocking. The counter portion is used as a timestamp to maintain FIFO order.

See patent
Multi-processor computing system that employs compressed cache lines' worth of information and processor capable of use in said system

Issued August 14, 2007 US US7257693

Cache coherency rules for a multi-processor computing system that is capable of working with compressed cache lines' worth of information are described. A multi-processor computing system that is capable of working with compressed cache lines' worth of information is also described. The multi-processor computing system includes a plurality of hubs for communicating with various computing system components and for compressing/decompressing cache lines' worth of information. A processor that is capable of labeling cache lines' worth of information in accordance with the cache coherency rules is described. A processor that includes a hub as described above is also described.

See patent
Compressing data in a cache memory

Issued July 10, 2007 US US7243191

In one embodiment, the present invention includes a cache memory having a plurality of cache lines to store data, in which at least some of the cache lines are adapted to store data in a compressed state. The cache memory also may include a first tag corresponding to each of the cache lines to indicate whether data in the corresponding cache line is compressible.

See patent
Mechanism to store reordered data with compression

Issued January 9, 2007 US US7162583

According to one embodiment a computer system is disclosed. The computer system includes a central processing unit (CPU), a cache memory coupled to the CPU and a cache controller, coupled to the cache memory. The cache memory includes a plurality of compressible cache lines to store additional data. The cache controller reorders a cache line after each access to the cache line prior to the compression of the cache line into a compressed cache line.

See patent
Mechanism to include hints within compressed data

Issued January 9, 2007 US US7162584

According to one embodiment a computer system is disclosed. The computer system includes a central processing unit (CPU), a cache memory coupled to the CPU and a cache controller coupled to the cache memory. The cache memory includes a plurality of compressible cache lines to store additional data. The cache controller includes compression logic to compress one or more of the plurality of cache lines into compressed cache lines, and hint logic to store hint information in unused space within the compressed cache lines.

See patent
Mechanism to compress data in a cache

Issued November 28, 2006 US US7143238

A computer system includes a central processing unit (CPU) and a cache memory coupled to the CPU. The cache memory includes a plurality of compressible cache lines to store additional data.

See patent
Method for implementing dynamic type checking

Issued July 18, 2006 US US7080354

Methods and apparatuses for dynamic type checking are described. For one embodiment runtime code generation is used to effect dynamic type checking by generating code specialized to different object types. For one embodiment a virtual dynamic type check (DTC) function is generated for each object at run time. The virtual DTC function contains a sequence of instructions to type check every element (type) within an object's type hierarchy. The virtual DTC function is tailored for a particular type and thus conducts dynamic type checking more efficiently for objects of the particular type. For one embodiment the DTC function can complete type checking of interface type hierarchies. For one embodiment a compiler may determine whether a type is a class type or interface type and may generate a virtual DTC function only for interface types.

See patent
Method for fast exception handling

Issued August 9, 2005 US US6928582

A system and method for exception handling includes executing a first instruction. The first instruction then returns an exception. A program counter is used to determine the location of a second instruction. The second instruction includes a pointer to at least one exception handler.

See patent
Hierarchical software path profiling

Issued January 25, 2005 US US6848100

A hierarchical software profiling mechanism that gathers hierarchical path profile information has been described. Software to be profiled is instrumented with instructions that save an outer path sum when an inner region is entered, and restore the outer path sum when the inner region is exited. When the inner region is being executed, an inner path sum is generated and a profile indicator representing the inner path traversed is updated prior to the outer path sum being restored. The software to be profiled is instrumented using information from augmented control flow graphs that represent the software.

See patent
Method of run-time tracking of object references in Java programs

Issued November 13, 2001 US US6317869

Many programming languages utilize reference pointers in computer code. Furthermore, some of these programming languages perform memory management in the form of garbage collection. One such language is Java. During the execution of a garbage collection routine, the computer may need to locate all the variables containing reference values. The present invention introduces a method for run-time tracking of object references in computer code and determining which variables contain references to objects at garbage collection sites. The method of the present invention first creates a bit vector in memory. The bit vector is then initialized. Second, each variable declared in the computer program that may be used to store a reference value is assigned a unique bit within this bit vector. Each bit is maintained to indicate whether the variable it is assigned to is currently storing a reference value. Specifically, when a variable is assigned a reference value, the corresponding bit in the bit vector is set. When a variable is assigned a non-reference value, the corresponding bit in the bit vector is cleared.
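
The bit-vector bookkeeping described in this abstract can be sketched in a few lines. This is an illustration only, not the patented method: the class `RefBitVector` and its methods are hypothetical names, and a real JIT would emit the set and clear operations inline in generated code rather than call into a helper object.

```python
class RefBitVector:
    """Hypothetical sketch: one bit per variable that may hold a reference."""

    def __init__(self, variables):
        # Assign each candidate variable a unique bit position.
        self.bit_of = {name: i for i, name in enumerate(variables)}
        self.bits = 0  # initialized: no variable holds a reference yet

    def assign(self, name, is_reference):
        mask = 1 << self.bit_of[name]
        if is_reference:
            self.bits |= mask    # variable now stores a reference: set its bit
        else:
            self.bits &= ~mask   # variable now stores a non-reference: clear its bit

    def live_references(self):
        # What a garbage collector would consult at a collection site.
        return [n for n, i in self.bit_of.items() if (self.bits >> i) & 1]
```

At a garbage collection site, `live_references()` lists exactly the variables whose most recent assignment was a reference value.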

See patent
Method for fast translation of java byte codes into efficient native processor code

Issued September 18, 2001 US US6292935

To efficiently generate native processor code from operand stack based code, a mimic stack is introduced. The mimic stack is a compile time data structure that stores the location of operands pushed onto the operand stack. When an operation is detected that operates on operand stack values, the locations from the mimic stack are popped off and used to generate efficient code that directly accesses the operands.
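
A minimal sketch of the mimic-stack idea follows. It assumes a made-up stack bytecode (`load`, `const`, `add`, `mul`, `store`) and emits textual three-address code instead of native instructions; the function `translate` and the temporary names `t0`, `t1` are hypothetical and not taken from the patent.

```python
def translate(bytecode):
    """Translate stack code to three-address code using a compile-time mimic stack."""
    mimic = []   # holds operand locations (variable names, constants, temps), not values
    out = []     # emitted three-address instructions
    temps = 0
    for op, *args in bytecode:
        if op == "load":
            mimic.append(args[0])        # emit nothing; remember where the operand lives
        elif op == "const":
            mimic.append(str(args[0]))   # a constant is its own location
        elif op in ("add", "mul"):
            rhs = mimic.pop()
            lhs = mimic.pop()
            dest = f"t{temps}"
            temps += 1
            symbol = "+" if op == "add" else "*"
            out.append(f"{dest} = {lhs} {symbol} {rhs}")  # access operands directly
            mimic.append(dest)
        elif op == "store":
            out.append(f"{args[0]} = {mimic.pop()}")
        else:
            raise ValueError(f"unknown opcode: {op}")
    return out


program = [("load", "a"), ("load", "b"), ("add",),
           ("const", 2), ("mul",), ("store", "c")]
```

For `program`, the pushes of `a`, `b`, and `2` generate no code at all; only the two arithmetic operations and the store emit instructions.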

See patent
Method for performing dynamic optimization of computer code

Issued January 2, 2001 US US6170083

Early Java Virtual Machine implementations executed Java programs very slowly since the Java byte codes were interpreted. Later, Java compilers were introduced to improve performance. To further improve performance, the present invention introduces a method of dynamically optimizing computer code. The method of the present invention first compiles Java byte code into an object code. While compiling, the method introduces instrumentation code into the object code that performs path profiling. Specifically, the path profiling instrumentation code determines which execution paths are executed most often by counting the number of times each possible execution path is executed. When a particular execution path exceeds a threshold value, then that execution path is deemed a “hot” execution path. The hot execution path is then dynamically optimized. The optimized hot path is then executed instead of the original compiled object code for improved performance.
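
The counting-and-threshold step in this abstract can be sketched as follows. The class `PathProfiler`, the string path identifiers, and the threshold value are assumptions made for illustration; the patent's instrumentation lives inside compiled object code, which this sketch does not model.

```python
from collections import Counter


class PathProfiler:
    """Hypothetical sketch: count path executions and flag hot paths."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.hot = set()

    def record(self, path_id):
        """Count one execution; return True the first time the path becomes hot."""
        self.counts[path_id] += 1
        if self.counts[path_id] > self.threshold and path_id not in self.hot:
            self.hot.add(path_id)   # the point where re-optimization would be triggered
            return True
        return False
```

A runtime built along these lines would hand a newly hot path to the optimizer once and then run the optimized version in place of the original compiled code.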

See patent
Method for eliminating common subexpressions from java byte codes

Issued December 5, 2000 US US6158048

Compilers are tools that generate efficient mappings from programs to machines. A Java "Just-In-Time" compiler runs as part of an application, and as such, it must be fast and efficient in its use of memory. To achieve good performance and further optimize code generation, the present invention introduces a method for eliminating common subexpressions from Java bytecodes. The method of the present invention first loads a code stream containing sequences of computer code into computer memory. The expression value for a first expression of a first code sequence is computed and stored in a memory location. A tag is assigned to the memory location holding this expression value for tracking which expression sequences' values are held in memory locations. As code compilation continues, the code selector looks ahead in the code stream to see if any upcoming expression sequences already have expression values stored in a memory location. The code selector compares the expression of a second code sequence with the code sequences annotated by the tags of expression values currently stored in memory. If the second code sequence matches a sequence already associated with a memory location, then the value of the matched sequence is pushed from the memory location onto a stack, and the computation of the expression of the second code sequence is skipped. If the second expression does not match any of the expressions represented by the tags, the expression value of the second expression is calculated and stored in a memory location. This memory location is then annotated with its own expression tag for future comparisons with upcoming expressions in the code stream.

See patent
Method of run-time tracking of object references in Java programs

Issued July 25, 2000 US US6093216

Many programming languages utilize reference pointers in computer code. Furthermore, some of these programming languages perform memory management in the form of garbage collection. One such language is Java. During the execution of a garbage collection routine, the computer may need to locate all the variables containing reference values. The present invention introduces a method for run-time tracking of object references in computer code and determining which variables contain references to objects at garbage collection sites. The method of the present invention first creates a bit vector in memory. The bit vector is then initialized. Second, each variable declared in the computer program that may be used to store a reference value is assigned a unique bit within this bit vector. Each bit is maintained to indicate whether the variable it is assigned to is currently storing a reference value. Specifically, when a variable is assigned a reference value, the corresponding bit in the bit vector is set. When a variable is assigned a non-reference value, the corresponding bit in the bit vector is cleared.

See patent
Device System and Method for Executing a Call to a Routine within a Transaction

US 20100058362

More activity by Ali-Reza

I was saddened to learn of Arvind's passing. I first met Arvind at ISCA 1992 when he came to see my first conference talk. I was getting mentally…

Liked by Ali-Reza Adl-Tabatabai
I am thrilled to join DigitalOcean (NYSE: DOCN) today as the Chief Product and Technology Officer. DigitalOcean has a strong developer following and…

Liked by Ali-Reza Adl-Tabatabai
Starting the day with a wave of energy and inspiration… Hard and smart work with the A team. Abundance mindset and collaboration. Focused on…

Liked by Ali-Reza Adl-Tabatabai
I'm at OpenSauce today witnessing builders and makers showcase their incredible creations! From open-source quantum engineering to solar-powered…

Shared by Ali-Reza Adl-Tabatabai
Over the past month we added basic support for Dart and Flutter into Gitar after talking to customers using it for cross-platform mobile development.…

Posted by Ali-Reza Adl-Tabatabai
Final session, Mobile Developer Productivity Panel at Droidcon SF.

Shared by Ali-Reza Adl-Tabatabai
The Uber Android Platform team is at Droidcon SF. Stop by our booth to say hello.

Liked by Ali-Reza Adl-Tabatabai
Excited to showcase Gitar at Droidcon SF! Drop by our booth on the first floor and pick up some Gitar schwag.

Shared by Ali-Reza Adl-Tabatabai
I'm excited to announce our presence is growing in Amsterdam. I'm thrilled to welcome Ali Afroozeh to Gitar! I amsterdam!

Shared by Ali-Reza Adl-Tabatabai
Sneak Peek - 60 second clip where 100s of Automated 💻 Test Cases across various products and services ➡️ hit our Pixel Farm 📲in Uber's Uber…

Liked by Ali-Reza Adl-Tabatabai
How we use Machine Learning to do end-to-end mobile testing at Uber. Large credit goes to Juan Lopez Marcano https://lnkd.in/e79PV2Qp

Liked by Ali-Reza Adl-Tabatabai
We just wrapped up an incredible 2 days at LaunchDarkly Galaxy where we unveiled our feature flag clean up product. It was exciting to see the…

Shared by Ali-Reza Adl-Tabatabai
Ty Smith 👔 led Java to Kotlin migration effort at Uber which has enabled developers to use Compose, Coroutines and more. Thanks for presenting Uber…

Liked by Ali-Reza Adl-Tabatabai

Publications

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019 Mar 2019

Efficient Mapping of Irregular C++ Applications to Integrated GPUs

CGO 2014: IEEE/ACM International Symposium on Code Generation and Optimization February 17, 2014

CoreRacer: A Practical Memory Race Recorder for Multicore x86 TSO Processors

44th International Symposium on Microarchitecture (MICRO-44) Dec 2011

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures Jun 2011

PSI EtA '10: Programming Support Innovations for Emerging Distributed Applications Oct 2010

2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) Dec 2009

Architecting a Chunk-based Memory Race Recorder in Modern CMPs

42nd International Symposium on Microarchitecture (MICRO-42) Dec 2009

SPAA '09: Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures Aug 2009

SPAA '09: Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures Aug 2009

ECOOP 2009 – Object-Oriented Programming Jul 2009

2009 IEEE International Symposium on Performance Analysis of Systems and Software Apr 2009

Proceedings of the First USENIX Workshop on Hot Topics in Parallelism Mar 2009

Design and implementation of transactional constructs for C/C++

OOPSLA 2008 Oct 2008

Proceedings of the 23rd ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications Oct 2008

ECOOP 2008 – Object-Oriented Programming Jul 2008

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures Jun 2008

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures Jun 2008

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures Jun 2008

Workshop on transactional computing (Transact 2008) May 2008

CGO '08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization Apr 2008

PPoPP '08 Feb 2008

ICS '07: Proceedings of the 21st annual international conference on Supercomputing Jun 2007

PLDI '07: Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation Jun 2007

International Symposium on Code Generation and Optimization (CGO'07) Mar 2007

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 Mar 2007

PPoPP '07: Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming Mar 2007

2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06) Dec 2006

ACM Queue Dec 2006

PLDI '06: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation Jun 2006

ISMM '06: Proceedings of the 5th international symposium on Memory management Jun 2006

PPoPP'06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming Mar 2006

POPL '06: Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages Jan 2006

PLDI '04 Jun 2004

Improving 64-Bit Java IPF Performance by Compressing Heap References

CGO 2004: International Symposium on Code generation and Optimization Mar 2004

Intel Technology Journal, Vol. 7, Issue 1, 2003 Feb 2003

Proceedings. International Conference on Parallel Architectures and Compilation Techniques Sep 2002

ACM Transactions on Programming Languages and Systems May 2000

PLDI '98: Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation May 1998

OOPSLA '96: Proceedings of the 11th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications Oct 1996

PLDI '96: Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation May 1996

PLDI '96: Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation May 1996

PLDI '93: SIGPLAN Symp. on Programming Language Design and Implementation Jun 1993

POPL '93: Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages Mar 1993

Patents

Issued February 19, 2019 US US10210018

Issued September 19, 2017 US US9767027

Issued January 31, 2017 US US9558118
