DOI: 10.1145/3293883
PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
ACM 2019 Proceeding
Publisher:
  • Association for Computing Machinery, New York, NY, United States
Conference:
PPoPP '19: 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Washington, District of Columbia, February 16 - 20, 2019
ISBN:
978-1-4503-6225-2
Published:
16 February 2019
Abstract

This year's symposium continues and reinforces the PPoPP tradition of publishing leading work on all aspects of parallel programming, including foundational and theoretical aspects, techniques, languages, compilers, runtime systems, tools, and practical experiences. Given the pervasiveness of parallel architectures in the general consumer market, PPoPP, with its interest in new parallel workloads, techniques and productivity tools for parallel programming, is becoming more relevant than ever to the computer science community.

Beyond human-level accuracy: computational challenges in deep learning

Deep learning (DL) research yields accuracy and product improvements from both model architecture changes and scale: larger data sets and models, and more computation. For hardware design, it is difficult to predict DL model changes. However, recent ...

S-EnKF: co-designing for scalable ensemble Kalman filter

Ensemble Kalman filter (EnKF) is one of the most important methods for data assimilation; it is widely used to reconstruct observed historical data and thereby provide initial conditions for numerical atmospheric and oceanic models. With the ...
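For background (this is standard EnKF material, not taken from the paper), the analysis step updates each forecast ensemble member with a Kalman gain computed from the ensemble-estimated forecast covariance:

```latex
% Standard EnKF analysis step (reference only, not the paper's formulation).
% x_i^f : forecast ensemble member,  y_i : (perturbed) observation,
% H     : observation operator,      R   : observation-error covariance,
% P^f   : forecast covariance estimated from the N ensemble members.
\[
  K = P^f H^\top \left( H P^f H^\top + R \right)^{-1}, \qquad
  x_i^a = x_i^f + K \left( y_i - H x_i^f \right), \quad i = 1, \dots, N,
\]
\[
  P^f \approx \frac{1}{N-1} \sum_{i=1}^{N}
      \left( x_i^f - \bar{x}^f \right) \left( x_i^f - \bar{x}^f \right)^{\top}.
\]
```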

Throughput-oriented GPU memory allocation

Throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. This level of concurrency pushes typical synchronization primitives (e.g., mutexes) over their scalability ...
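The scalability problem alluded to here shows up when every thread hits one lock or atomic counter. A common mitigation, shown below purely as a sketch (this is not the paper's allocator; the pool size and names are made up, and alignment and freeing are ignored), is to aggregate allocation requests per warp so that only one atomic touches the shared offset:

```cuda
// Hypothetical warp-aggregated bump allocator: one atomicAdd per warp
// instead of one per thread reduces contention on the shared counter.
#include <cstdint>

__device__ unsigned long long g_offset = 0;   // next free byte in the pool
__device__ char g_pool[1 << 26];              // 64 MiB backing store (arbitrary)

__device__ void* warp_bump_alloc(unsigned bytes) {
    unsigned mask   = __activemask();          // lanes allocating together
    unsigned leader = __ffs(mask) - 1;         // lowest active lane
    unsigned lane   = threadIdx.x & 31;

    // Exclusive prefix sum of request sizes within the warp.
    unsigned my_off = 0, total = 0;
    for (int i = 0; i < 32; ++i) {
        unsigned sz = __shfl_sync(mask, bytes, i);
        if (!((mask >> i) & 1)) sz = 0;        // ignore inactive lanes
        if ((unsigned)i < lane) my_off += sz;
        total += sz;
    }

    unsigned long long base = 0;
    if (lane == leader)
        base = atomicAdd(&g_offset, (unsigned long long)total);  // one atomic per warp
    base = __shfl_sync(mask, base, leader);    // broadcast the base offset
    return g_pool + base + my_off;
}
```

A production GPU allocator also has to deal with alignment, freeing, and fragmentation, which is where the real design space opens up.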

SEP-graph: finding shortest execution paths for graph processing under a hybrid framework on GPU

In general, the performance of parallel graph processing is determined by three pairs of critical parameters, namely synchronous or asynchronous execution mode (Sync or Async), Push or Pull communication mechanism (Push or Pull), and Data-driven or ...
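To make the Push/Pull distinction concrete, here is a minimal sequential sketch (illustrative only; the CSR layout and names are assumptions, not the paper's framework) of one level-synchronous BFS round in each direction:

```cuda
// Illustrative only: one BFS round in "push" vs "pull" style over a CSR graph.
#include <vector>
#include <cstdint>
#include <limits>

constexpr uint32_t INF = std::numeric_limits<uint32_t>::max();

struct CSR {                                      // assumed layout
    std::vector<uint32_t> row_ptr, col_idx;       // out-edges
    std::vector<uint32_t> in_row_ptr, in_col_idx; // in-edges (reverse graph)
};

// Push: every frontier vertex writes to its unvisited out-neighbours.
void bfs_push(const CSR& g, const std::vector<uint32_t>& frontier,
              std::vector<uint32_t>& level, uint32_t cur,
              std::vector<uint32_t>& next) {
    for (uint32_t u : frontier)
        for (uint32_t e = g.row_ptr[u]; e < g.row_ptr[u + 1]; ++e) {
            uint32_t v = g.col_idx[e];
            if (level[v] == INF) { level[v] = cur + 1; next.push_back(v); }
        }
}

// Pull: every unvisited vertex scans its in-neighbours for a frontier vertex.
void bfs_pull(const CSR& g, std::vector<uint32_t>& level, uint32_t cur,
              std::vector<uint32_t>& next) {
    for (uint32_t v = 0; v < level.size(); ++v) {
        if (level[v] != INF) continue;
        for (uint32_t e = g.in_row_ptr[v]; e < g.in_row_ptr[v + 1]; ++e)
            if (level[g.in_col_idx[e]] == cur) {
                level[v] = cur + 1; next.push_back(v); break;
            }
    }
}
```

In a parallel setting, push writes to other vertices' state (and so typically needs atomics), while pull only writes to the vertex's own state; which direction wins depends on frontier density, which is why hybrid frameworks switch between them.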

Incremental flattening for nested data parallelism

Compilation techniques for nested-parallel applications that can adapt to hardware and dataset characteristics are vital for unlocking the power of modern hardware. This paper proposes such a technique, which builds on flattening and is applied in the ...
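As a rough, hand-written illustration of what flattening means (this is not the paper's compiler output), a nested reduction over a ragged array can be re-expressed over a flat, segmented representation that GPUs can schedule uniformly:

```cuda
// Illustration of the flattening idea: nested parallelism over a ragged array
// vs. a flat (values + offsets) segmented representation.
#include <vector>
#include <cstddef>

// Nested form: irregular inner loops, one per row.
std::vector<float> row_sums_nested(const std::vector<std::vector<float>>& rows) {
    std::vector<float> sums;
    for (const auto& row : rows) {          // outer parallelism
        float s = 0.0f;
        for (float x : row) s += x;         // irregular inner parallelism
        sums.push_back(s);
    }
    return sums;
}

// Flattened form: all values in one array, row boundaries given by offsets.
// A real flattening transform would turn the inner loop into a single
// segmented-reduction primitive rather than a nested loop.
std::vector<float> row_sums_flat(const std::vector<float>& values,
                                 const std::vector<size_t>& offsets) {
    std::vector<float> sums(offsets.size() - 1, 0.0f);
    for (size_t r = 0; r + 1 < offsets.size(); ++r)
        for (size_t i = offsets[r]; i < offsets[r + 1]; ++i)
            sums[r] += values[i];
    return sums;
}
```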

Adaptive sparse matrix-matrix multiplication on the GPU

In the ongoing efforts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM) has received considerably less attention than sparse matrix-vector multiplication (SpMV). While both are equally important, ...
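One reason SpGEMM is harder to parallelize than SpMV is that the sparsity pattern of the output is only discovered while accumulating. A minimal serial Gustavson-style sketch (illustrative only; CSR layout assumed, not the paper's adaptive scheme) shows the irregular accumulation involved:

```cuda
// Illustrative serial Gustavson SpGEMM (C = A * B) over CSR matrices.
#include <vector>
#include <unordered_map>
#include <cstdint>

struct CSRMat {
    uint32_t n;                          // square for simplicity
    std::vector<uint32_t> row_ptr, col;
    std::vector<double> val;
};

CSRMat spgemm(const CSRMat& A, const CSRMat& B) {
    CSRMat C{A.n, {0}, {}, {}};
    for (uint32_t i = 0; i < A.n; ++i) {
        std::unordered_map<uint32_t, double> acc;   // sparse accumulator for row i
        for (uint32_t ka = A.row_ptr[i]; ka < A.row_ptr[i + 1]; ++ka) {
            uint32_t k = A.col[ka];
            double  a = A.val[ka];
            for (uint32_t kb = B.row_ptr[k]; kb < B.row_ptr[k + 1]; ++kb)
                acc[B.col[kb]] += a * B.val[kb];    // output nonzeros appear here
        }
        for (const auto& [j, v] : acc) { C.col.push_back(j); C.val.push_back(v); }
        C.row_ptr.push_back(static_cast<uint32_t>(C.col.size()));
    }
    return C;
}
```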

Modular transactions: bounding mixed races in space and time

We define local transactional race freedom (LTRF), which provides a programmer model for software transactional memory. LTRF programs satisfy the SC-LTRF property, thus allowing the programmer to focus on sequential executions in which transactions ...

Leveraging hardware TM in Haskell

Transactional memory (TM) is heavily used for synchronization in the Haskell programming language, but its performance has historically been poor. We set out to improve this performance using hardware TM (HTM) on Intel processors. This task is ...
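For readers unfamiliar with HTM on Intel processors, the usual pattern is to attempt a hardware transaction and fall back to a lock when it aborts. The C++ sketch below shows that generic pattern with TSX/RTM intrinsics; it is background only, not the paper's Haskell runtime changes:

```cuda
// Generic HTM-with-lock-fallback pattern using Intel RTM intrinsics
// (compile with -mrtm). Background illustration only.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_locked{false};

template <typename F>
void atomically(F&& critical_section) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_locked.load(std::memory_order_relaxed))
                _xabort(0xff);                 // a lock holder is active: abort
            critical_section();                // runs transactionally
            _xend();                           // commit
            return;
        }
        // status encodes the abort cause (conflict, capacity, explicit, ...)
    }
    // Fallback: take a simple spinlock. Concurrent transactions have the flag
    // in their read set and abort, preserving mutual exclusion.
    while (fallback_locked.exchange(true, std::memory_order_acquire))
        ;
    critical_section();
    fallback_locked.store(false, std::memory_order_release);
}
```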

Stretching the capacity of hardware transactional memory in IBM POWER architectures

The hardware transactional memory (HTM) implementations in commercially available processors are significantly hindered by their tight capacity constraints. In practice, this renders current HTMs unsuitable to many real-world workloads of in-memory ...

Processing transactions in a predefined order

In this paper we provide a high performance solution to the problem of committing transactions while enforcing a pre-defined order. We provide the design and implementation of three algorithms, which deploy a specialized cooperative transaction ...
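The core constraint can be pictured (purely as a sketch, not the authors' algorithms, which use cooperative scheduling rather than this naive spin) with a commit phase that waits until a shared counter reaches the transaction's predefined order number:

```cuda
// Sketch of committing in a predefined order: transaction t publishes its
// effects only when the global counter reaches its assigned order number.
#include <atomic>

std::atomic<long> next_to_commit{0};

void commit_in_order(long my_order, void (*publish)()) {
    while (next_to_commit.load(std::memory_order_acquire) != my_order)
        ;                                   // wait for all earlier transactions
    publish();                              // expose this transaction's writes
    next_to_commit.store(my_order + 1, std::memory_order_release);
}
```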

Harmonia: a high throughput B+tree for GPUs

The B+tree is one of the most important data structures and is widely used across many fields. As concurrent queries and data scale grow in storage systems, designing an efficient B+tree structure has become critical. Due to abundant computation ...
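As background (a toy sketch, not Harmonia's actual layout), a B+tree point query is a short walk from the root through node-sized sorted key arrays; GPU designs mainly reorganize this walk to be memory- and warp-friendly:

```cuda
// Toy B+tree point lookup over a flat array-of-nodes layout (illustrative only).
#include <cstdint>
#include <vector>
#include <optional>

constexpr int FANOUT = 16;

struct Node {
    int      nkeys;
    bool     leaf;
    uint64_t keys[FANOUT - 1];
    uint32_t child[FANOUT];      // child node indices (internal nodes)
    uint64_t value[FANOUT - 1];  // payloads (leaf nodes)
};

std::optional<uint64_t> lookup(const std::vector<Node>& nodes, uint64_t key) {
    uint32_t idx = 0;                                    // root at index 0
    for (;;) {
        const Node& n = nodes[idx];
        int i = 0;
        while (i < n.nkeys && key >= n.keys[i]) ++i;     // branch position
        if (n.leaf) {
            if (i > 0 && n.keys[i - 1] == key) return n.value[i - 1];
            return std::nullopt;
        }
        idx = n.child[i];                                // descend
    }
}
```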

Engineering a high-performance GPU B-Tree

We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-tree outperforms the state of the art, a GPU log-structured merge tree (LSM) and a GPU sorted ...

QTLS: high-performance TLS asynchronous offload framework with Intel® QuickAssist technology

Hardware accelerators are a promising solution to optimize the Total Cost of Ownership (TCO) of cloud datacenters. This paper targets the costly Transport Layer Security (TLS) processing and investigates TLS acceleration for the widely-deployed event-driven ...

Data-flow/dependence profiling for structured transformations

Profiling feedback is an important technique for performance debugging; developers use it to pinpoint performance bottlenecks and to find optimization opportunities. Assessing the validity and potential benefit of a ...
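The kind of information a dependence profiler reports can be pictured with a pair of loops (a generic example, not from the paper): one carries a dependence across iterations and cannot be parallelized as written, the other can:

```cuda
// Generic example of what dependence profiling distinguishes.
void prefix_sum(float* a, const float* b, int n) {
    for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];     // loop-carried flow dependence: iteration i reads a[i-1]
}

void elementwise(float* a, const float* b, const float* c, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];         // no cross-iteration dependence: safe to parallelize
}
```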

Lightweight hardware transactional memory profiling

Programs that use hardware transactional memory (HTM) demand sophisticated performance analysis tools when they suffer from performance losses. We have developed TxSampler---a lightweight profiler for programs that use HTM. TxSampler measures ...

A pattern based algorithmic autotuner for graph processing on GPUs

This paper proposes Gswitch, a pattern-based algorithmic auto-tuning system that dynamically switches between optimization variants with negligible overhead. Its novelty lies in a small set of algorithmic patterns that allow for the configurable ...

Provably and practically efficient granularity control

Over the past decade, many programming languages and systems for parallel computing have been developed, e.g., Fork/Join and Habanero Java, Parallel Haskell, Parallel ML, and X10. Although these systems raise the level of abstraction for writing ...
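The granularity problem shows up in any recursive parallel function: spawning a task for a tiny subproblem costs more than it saves, so implementations insert a cutoff below which they run sequentially. The sketch below shows the usual manual version with an arbitrary threshold (the paper's contribution is choosing such grains provably and automatically, which this does not do):

```cuda
// Generic manual granularity control: run sequentially below a hand-picked grain.
#include <future>
#include <cstdint>

constexpr int64_t GRAIN = 2048;   // arbitrary threshold; tuning it is the hard part

int64_t sum(const int64_t* a, int64_t n) {
    if (n <= GRAIN) {                                          // sequential base case
        int64_t s = 0;
        for (int64_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    int64_t mid = n / 2;
    auto left = std::async(std::launch::async, sum, a, mid);   // spawn one half
    int64_t right = sum(a + mid, n - mid);                     // recurse on the other
    return left.get() + right;
}
```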

A coordinated tiling and batching framework for efficient GEMM on GPUs

General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the ...
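For reference, the basic tiling idea is the textbook shared-memory GEMM kernel below; it is far simpler than the paper's coordinated tiling-and-batching framework, but it shows why tiles are staged in shared memory (each element is loaded from DRAM once per tile pass):

```cuda
// Textbook shared-memory tiled GEMM (C = A * B, square N x N, row-major).
// Illustrative baseline; assumes N is a multiple of TILE.
// Launch with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE).
#define TILE 32

__global__ void gemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```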

Semantics-aware scheduling policies for synchronization determinism

A common task for all deterministic multithreading (DMT) systems is to enforce synchronization determinism. However, synchronization determinism has not been the focus of existing DMT research. Instead, most DMT systems focused on how to order data ...

Proactive work stealing for futures

The use of futures provides a flexible way to express parallelism and can generate arbitrary dependences among parallel subcomputations. The additional flexibility that futures provide comes with a cost, however. When scheduled using classic work ...

A round-efficient distributed betweenness centrality algorithm

We present Min-Rounds BC (MRBC), a distributed-memory algorithm in the CONGEST model that computes the betweenness centrality (BC) of every vertex in a directed unweighted n-node graph in O(n) rounds. Min-Rounds BC also computes all-pairs-shortest-paths ...

Corrected trees for reliable group communication

Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth is a significant burden for fast group communication primitives and also makes those systems more ...

Adaptive sparse tiling for sparse matrix multiplication

Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern ...

Encapsulated open nesting for STM: fine-grained higher-level conflict detection

Open nesting allows replacing the automatic detection of conflicting memory accesses used in transactional memory (TM) with programmer-specified higher-level conflict detection. Higher-level conflict detection allows removing conflicts of commuting ...

A specialized B-tree for concurrent datalog evaluation

Modern Datalog engines are employed in industrial applications such as graph databases, networks, and static program analysis. To cope with vast amounts of data, Datalog engines must employ parallel execution strategies, for which specialized concurrent ...

Efficient race detection with futures

This paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior works on determinacy race detection have mostly focused on either task parallel programs ...

Verifying C11 programs operationally

This paper develops an operational semantics for a release-acquire fragment of the C11 memory model with relaxed accesses. We show that the semantics is both sound and complete with respect to the axiomatic model of Batty et al. The semantics relies on ...

Checking linearizability using hitting families

Linearizability is a key correctness property for concurrent data types. Linearizability requires that the behavior of concurrently invoked operations of the data type be equivalent to the behavior in an execution where each operation takes effect at an ...

Transitive joins: a sound and efficient online deadlock-avoidance policy

We introduce a new online deadlock-avoidance policy, Transitive Joins (TJ), that targets programs with dynamic task parallelism and arbitrary join operations. In this model, a computation task can asynchronously spawn new tasks and selectively join (...

poster
VEBO: a vertex- and edge-balanced ordering heuristic to load balance parallel graph processing

This work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distribution, and ensures an equal degree ...


Acceptance Rates

PPoPP '19 Paper Acceptance Rate 29 of 152 submissions, 19%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%
Year        Submitted   Accepted   Rate
PPoPP '21         150         31    21%
PPoPP '20         121         28    23%
PPoPP '19         152         29    19%
PPoPP '17         132         29    22%
PPoPP '14         184         28    15%
PPoPP '07          65         22    34%
PPoPP '03          45         20    44%
PPoPP '99          79         17    22%
PPOPP '97          86         26    30%
Overall         1,014        230    23%