Recent & Highlighted News


On June 6, 2019, C3SR members were invited to present the center’s work on MLModelScope at the MLPerf community’s meeting in Santa Clara, CA. At the meeting, Dr. Jinjun Xiong, on behalf of the team, presented “Benchmarking and Understanding ML Inference,” a talk drawing on research results from MLModelScope. Of the MLModelScope team members, Ms. Cheng Li also attended the meeting in person, while Mr. Abdul Dakkak joined online.


For the second year in a row, our C3SR teams were victorious in a competition challenging experts worldwide to design low-power embedded systems for Internet-of-Things (IoT) applications. At the 2019 Design Automation Conference (DAC) System Design Contest, our teams won a double championship, taking first place in both the GPU and FPGA tracks of the contest. The contest’s objective is to create algorithms that accurately detect and locate objects in images taken by aerial drones.


C3SR faculty member Prof. Lav Varshney was invited to give a talk at the opening ceremony of the 2019 National Science Olympiad competition, which drew more than 2,000 top middle and high school students from around the country.

Here is the news story.

Prof. Varshney’s talk was well received by the thousands of enthusiastic students.


On April 17, 2019, C3SR members Ms. Cheng Li, Mr. Abdul Dakkak, Prof. Jinjun Xiong, and Prof. Wen-mei Hwu were invited to give a tutorial on the center’s work on MLModelScope at the 2019 ASPLOS conference in Providence, RI.

The tutorial was well received by the attendees. Details about this tutorial can be found here.


A joint research paper by the C3SR team and IBM Research, titled “Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects,” won the Best Research Paper Award at the 10th ACM/SPEC International Conference on Performance Engineering (ICPE).

The winning team members were Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I-Hsin Chung, Jinjun Xiong, and Wen-mei Hwu.

Congratulations!


Publications


The rapidly growing demands for powerful AI algorithms in many application domains have motivated massive investment in both high-quality deep neural network (DNN) models and high-efficiency implementations. In this position paper, we argue that a simultaneous DNN/implementation co-design methodology, named Neural Architecture and Implementation Search (NAIS), deserves more research attention to boost the development productivity and efficiency of both DNN models and implementation optimization. We propose a stylized design methodology that can drastically cut down the search cost while preserving the quality of the end solution. As an illustration, we discuss this DNN/implementation methodology in the context of both FPGAs and GPUs. We take autonomous driving as a key use case as it is one of the most demanding areas for high quality AI algorithms and accelerators. We discuss how such a co-design methodology can impact the autonomous driving industry significantly. We identify several research opportunities in this exciting domain.
International Conference on Computer-Aided Design, 2019
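
To make the proposed co-design loop concrete, here is a minimal Python sketch of a joint search in which candidates are kept only if they satisfy a hardware latency budget. The search space, accuracy proxy, and latency model below are invented placeholders for illustration, not the NAIS formulation itself.

```python
import random

# Hypothetical candidate space: depth, width, and quantization bit-width
# jointly affect both model accuracy and hardware latency.
SEARCH_SPACE = {
    "depth": [8, 14, 20],
    "width": [32, 64, 128],
    "bits": [4, 8, 16],
}

def estimate_accuracy(cfg):
    """Placeholder accuracy proxy: deeper/wider/higher-precision scores higher."""
    return 0.5 + 0.01 * cfg["depth"] + 0.001 * cfg["width"] + 0.005 * cfg["bits"]

def estimate_latency_ms(cfg):
    """Placeholder analytical latency model for a hypothetical accelerator."""
    return cfg["depth"] * cfg["width"] * cfg["bits"] / 4096.0

def co_design_search(n_samples=100, latency_budget_ms=10.0):
    """Sample configurations; keep the most accurate one that meets the budget."""
    best = None
    for _ in range(n_samples):
        cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        if estimate_latency_ms(cfg) > latency_budget_ms:
            continue  # prune candidates the hardware cannot serve in time
        acc = estimate_accuracy(cfg)
        if best is None or acc > best[0]:
            best = (acc, cfg)
    return best

print(co_design_search())
```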

Fine-grained action detection is an important task with numerous applications in robotics and human-computer interaction. Existing methods typically utilize a two-stage approach: extraction of local spatio-temporal features followed by temporal modeling to capture long-term dependencies. While most recent papers have focused on the latter (long-temporal modeling), here we focus on producing features capable of modeling fine-grained motion more efficiently. We propose a novel locally-consistent deformable convolution, which utilizes the change in receptive fields and enforces a local coherency constraint to capture motion information effectively. Our model jointly learns spatio-temporal features (instead of using independent spatial and temporal streams), and the temporal component is learned from the feature space rather than from pixel space, e.g., optical flow. The produced features can be flexibly used in conjunction with other long-temporal modeling networks, e.g., ST-CNN, Dilated-TCN, and ED-TCN. Overall, our proposed approach robustly outperforms the original long-temporal models on two fine-grained action datasets: 50 Salads and GTEA, achieving F1 scores of 80.22% and 75.39%, respectively.
International Conference on Computer Vision, 2019
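
As a rough illustration of the idea (not the paper’s exact formulation), the PyTorch sketch below predicts per-pixel sampling offsets for a 3x3 deformable convolution and adds a penalty that encourages neighboring offsets to agree, a stand-in for the local coherency constraint. It relies on torchvision’s deform_conv2d; the offset predictor and penalty form are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class LCDeformConvSketch(nn.Module):
    """Sketch: a 3x3 deformable convolution whose sampling offsets are
    predicted from the input, plus a smoothness penalty that encourages
    neighboring offsets to agree (an illustrative stand-in for the
    paper's local coherency constraint)."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # 2 offsets (dy, dx) per kernel tap, predicted at every pixel.
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

    def forward(self, x):
        offset = self.offset_pred(x)  # shape (N, 2*k*k, H, W)
        out = deform_conv2d(x, offset, self.weight, padding=1)
        # Penalize disagreement between horizontally and vertically
        # adjacent offset predictions (local coherency).
        coherency = ((offset[..., :, 1:] - offset[..., :, :-1]) ** 2).mean() \
                  + ((offset[..., 1:, :] - offset[..., :-1, :]) ** 2).mean()
        return out, coherency

feats = torch.randn(2, 16, 32, 32)
out, reg = LCDeformConvSketch(16, 32)(feats)
print(out.shape, reg.item())  # torch.Size([2, 32, 32, 32])
```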

Multi-scale context modules and single-stage encoder-decoder structures are commonly employed for semantic segmentation. A multi-scale context module aggregates feature responses from a large spatial extent, while a single-stage encoder-decoder structure encodes high-level semantic information in the encoder path and recovers boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance to their single-stage counterparts. However, few attempts have been made to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight local features through guidance from pixel-wise semantic predictions. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computation. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, on which our SPGNet attains 81.1% on the test set using only ‘fine’ annotations.
International Conference on Computer Vision, 2019
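
The re-weighting idea can be sketched in a few lines of PyTorch. Here, pixel-wise semantic predictions are converted into per-channel, per-pixel feature weights via a 1x1 convolution; this gating scheme is an illustrative assumption, not necessarily the exact SPG module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPGSketch(nn.Module):
    """Illustrative sketch of semantic prediction guidance: weights derived
    from a stage's semantic logits re-scale its features before they are
    passed to the next encoder-decoder stage."""

    def __init__(self, feat_ch, num_classes):
        super().__init__()
        self.gate = nn.Conv2d(num_classes, feat_ch, kernel_size=1)

    def forward(self, feats, logits):
        probs = F.softmax(logits, dim=1)           # pixel-wise class distribution
        weights = torch.sigmoid(self.gate(probs))  # per-channel, per-pixel weights
        return feats * weights                     # re-weighted features

feats = torch.randn(1, 64, 128, 256)
logits = torch.randn(1, 19, 128, 256)              # 19 Cityscapes classes
print(SPGSketch(64, 19)(feats, logits).shape)      # torch.Size([1, 64, 128, 256])
```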

Recent advancements in deep learning techniques facilitate intelligent query support in diverse applications, such as content-based image retrieval and audio texturing. Unlike conventional key-based queries, these intelligent queries lack efficient indexing and require complex compute operations for feature matching. To achieve high-performance intelligent querying against massive datasets, modern computing systems employ GPUs in conjunction with solid-state drives (SSDs) for fast data access and parallel data processing. However, our characterization of various intelligent-query workloads developed with deep neural networks (DNNs) shows that storage I/O bandwidth is still the major bottleneck, contributing 56%–90% of the query execution time. To this end, we present DeepStore, an in-storage accelerator architecture for intelligent queries. It consists of (1) energy-efficient in-storage accelerators designed specifically to support DNN-based intelligent queries under the resource constraints of modern SSD controllers; (2) a similarity-based in-storage query cache that exploits the temporal locality of user queries for further performance improvement; and (3) a lightweight in-storage runtime system that works as the query engine and provides a simple software abstraction to support different types of intelligent queries. DeepStore exploits SSD parallelism with design-space exploration to achieve maximal energy efficiency for in-storage accelerators. We validate the DeepStore design with an SSD simulator and evaluate it with a variety of vision, text, and audio intelligent queries. Compared with the state-of-the-art GPU+SSD approach, DeepStore improves query performance by up to 17.7× and energy efficiency by up to 78.6×.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’19), 2019
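
The similarity-based query cache can be pictured with a short Python sketch: if a new query embedding is close enough (by cosine similarity) to one served recently, the cached result is returned and the expensive in-storage DNN scan is skipped. The threshold and FIFO eviction below are illustrative assumptions, not DeepStore’s actual policy.

```python
import numpy as np

class SimilarityQueryCache:
    """Sketch of a similarity-based query cache: near-duplicate query
    embeddings reuse an earlier query's result instead of triggering a
    full in-storage scan."""

    def __init__(self, threshold=0.95, capacity=1024):
        self.threshold = threshold
        self.capacity = capacity
        self.keys, self.values = [], []

    def lookup(self, query_vec):
        q = query_vec / np.linalg.norm(query_vec)
        for k, v in zip(self.keys, self.values):
            if float(np.dot(q, k)) >= self.threshold:
                return v          # cache hit: reuse the cached result
        return None               # cache miss: run the full query

    def insert(self, query_vec, result):
        if len(self.keys) >= self.capacity:
            self.keys.pop(0)      # simple FIFO eviction
            self.values.pop(0)
        self.keys.append(query_vec / np.linalg.norm(query_vec))
        self.values.append(result)

cache = SimilarityQueryCache()
q1 = np.random.rand(128)
cache.insert(q1, "top-10 results for q1")
print(cache.lookup(q1 + 0.001))   # a near-duplicate query hits the cache
```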

Deep neural networks (DNNs) have been widely adopted in many domains, including computer vision, natural language processing, and medical care. Recent research reveals that sparsity in DNN parameters can be exploited to reduce inference computational complexity and improve network quality. However, sparsity also introduces irregularity and extra complexity into data processing, which makes accelerator design challenging. This work presents the design and implementation of a highly flexible sparse DNN inference accelerator on an FPGA. Our proposed inference engine can be easily configured for both mobile-computing and high-performance-computing scenarios. Evaluation shows that our proposed inference engine effectively accelerates sparse DNNs and outperforms a CPU solution by up to 4.7x in terms of energy efficiency.
IEEE High Performance Extreme Computing Conference, 2019
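
The core computation such an engine accelerates is sparse matrix arithmetic over compressed weights. The sketch below shows a software analogue: a fully-connected layer whose weight matrix is stored in compressed sparse row (CSR) form, so that only the nonzero weights are fetched and multiplied. The accelerator’s actual data layout and scheduling are, of course, more involved.

```python
import numpy as np

def csr_layer(values, col_idx, row_ptr, x, bias):
    """One fully-connected layer with a sparse (CSR) weight matrix
    followed by ReLU; only nonzero weights are stored and multiplied."""
    out = bias.copy()
    for row in range(len(row_ptr) - 1):
        for j in range(row_ptr[row], row_ptr[row + 1]):
            out[row] += values[j] * x[col_idx[j]]
    return np.maximum(out, 0.0)  # ReLU

# A 3x4 weight matrix with 4 nonzeros, stored in CSR form.
values  = np.array([2.0, -1.0, 0.5, 3.0])
col_idx = np.array([0, 2, 1, 3])
row_ptr = np.array([0, 2, 3, 4])   # rows 0,1,2 own entries [0,2), [2,3), [3,4)
x = np.array([1.0, 2.0, 3.0, 4.0])
bias = np.zeros(3)
print(csr_layer(values, col_idx, row_ptr, x, bias))  # [ 0.  1. 12.]
```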

In automatic speech recognition (ASR), wideband (WB) and narrowband (NB) speech signals with different sampling rates typically use separate acoustic models. Mixed-bandwidth (MB) acoustic modeling therefore has important practical value for ASR system deployment. In this paper, we extensively investigate large-scale MB deep neural network acoustic modeling for ASR using 1,150 hours of WB data and 2,300 hours of NB data. We study various MB strategies, including downsampling, upsampling, and bandwidth extension, and evaluate their performance on 8 diverse WB and NB test sets from various application domains. To deal with the large amounts of training data, distributed training is carried out on multiple GPUs using synchronous data parallelism.
Interspeech (Annual Conference of the International Speech Communication Association), 2019
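
As a small illustration of the upsampling strategy, the snippet below resamples narrowband 8 kHz audio to 16 kHz with a polyphase filter so that a single acoustic model can consume both bandwidths. This is a generic example rather than the paper’s exact pipeline; note that upsampling alone adds no energy above 4 kHz, which is what bandwidth-extension methods try to reconstruct.

```python
import numpy as np
from scipy.signal import resample_poly

# One second of (placeholder) 8 kHz narrowband audio.
nb_audio = np.random.randn(8000)

# Polyphase resampling by a factor of 2: 8 kHz -> 16 kHz.
wb_audio = resample_poly(nb_audio, up=2, down=1)

print(len(nb_audio), len(wb_audio))  # 8000 16000
```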

Unlike traditional PCIe-based FPGA accelerators, heterogeneous SoC-FPGA devices provide tighter integration between software running on CPUs and hardware accelerators. Modern heterogeneous SoC-FPGA platforms support multiple I/O cache coherence options between CPUs and FPGAs, but these options can have inadvertent effects on the achieved bandwidths depending on the application and its data access patterns. To provide the most efficient communication between CPUs and accelerators, it is essential to understand the data transaction behaviors and select the right I/O cache coherence method. In this paper, we use the Xilinx Zynq UltraScale+ as the SoC platform to show how a given I/O cache coherence method can perform better or worse in different situations, ultimately affecting overall accelerator performance as well. Based on our analysis, we further explore possible software and hardware modifications to improve I/O performance with different I/O cache coherence options. With our proposed modifications, the overall performance of the SoC design can be improved by 20% on average.
International Conference on Field Programmable Logic and Applications, 2019

This work presents an update to the triangle-counting portion of the subgraph isomorphism static graph challenge, motivated by a desire to understand the impact of CUDA unified memory on the triangle-counting problem. First, CUDA unified memory is used to overlap reading large graph data from disk with building graph data structures in GPU memory. Second, we use CUDA unified memory hints to solve the multi-GPU performance scaling challenges present in our last submission. Finally, we improve the single-GPU kernel performance from our past submission by introducing a dynamic, work-stealing GPU kernel with persistent threads, which makes performance adaptive for large graphs without requiring a graph analysis phase.
IEEE High Performance Extreme Computing Conference, 2019
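
For context, the sketch below shows the per-edge adjacency-intersection formulation of triangle counting that GPU kernels parallelize; this sequential Python version is purely illustrative and says nothing about the unified-memory mechanics the paper studies.

```python
def count_triangles(adj):
    """Per-edge triangle counting by sorted-adjacency intersection -- the
    same work an optimized GPU kernel distributes across threads. `adj`
    maps each vertex to a sorted list of larger-ID neighbors, so each
    undirected edge and triangle is counted exactly once."""
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            # Two-pointer intersection of adj[u] and adj[v].
            i = j = 0
            a, b = adj[u], adj.get(v, [])
            while i < len(a) and j < len(b):
                if a[i] == b[j]:
                    total += 1; i += 1; j += 1
                elif a[i] < b[j]:
                    i += 1
                else:
                    j += 1
    return total

# A 4-clique: every vertex pair is connected -> C(4,3) = 4 triangles.
adj = {0: [1, 2, 3], 1: [2, 3], 2: [3], 3: []}
print(count_triangles(adj))  # 4
```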

In this paper, we present an update to our previous submission on k-truss decomposition from Graph Challenge 2018. For the single-GPU k-truss implementation, we propose multiple algorithmic optimizations that significantly improve performance, by up to 35.2x (6.9x on average) compared to our previous GPU implementation. In addition, we present a scalable multi-GPU implementation in which each GPU handles a different ‘k’ value. Compared to our prior multi-GPU implementation, the proposed approach is faster by up to 151.3x (78.8x on average). When only the edges in the maximal k-truss are sought, incrementing the ‘k’ value in each iteration is inefficient, particularly for graphs with a large maximum k-truss. Thus, we propose a binary search over the ‘k’ value to find the maximal k-truss. The binary-search approach on a single GPU is up to 101.5x (24.3x on average) faster than our 2018 k-truss submission. Lastly, we show that the proposed binary search finds the maximum k-truss of the “Twitter” graph dataset, which has 2.8 billion bidirectional edges, in just 16 minutes on a single V100 GPU.
IEEE High Performance Extreme Computing Conference, 2019
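
The binary-search strategy is easy to sketch: given a routine that decides whether a non-empty k-truss exists (here, naive edge peeling by triangle support), search for the largest feasible k in O(log k) checks instead of sweeping k upward one value at a time. The peeling check below is an illustrative reference implementation, not the GPU algorithm.

```python
import itertools

def has_k_truss(edges, k):
    """Check whether a graph contains a non-empty k-truss by repeatedly
    peeling edges supported by fewer than k-2 triangles."""
    edges = set(frozenset(e) for e in edges)
    changed = True
    while changed and edges:
        changed = False
        for e in list(edges):
            u, v = tuple(e)
            support = sum(1 for w in {x for f in edges for x in f}
                          if frozenset((u, w)) in edges
                          and frozenset((v, w)) in edges)
            if support < k - 2:
                edges.discard(e)
                changed = True
    return bool(edges)

def max_k_truss(edges, hi=64):
    """Binary search for the largest k with a non-empty k-truss,
    mirroring the paper's replacement of the k-by-k sweep."""
    lo = 2
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_k_truss(edges, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# A 4-clique has a 4-truss: every edge lies in 2 triangles.
clique4 = list(itertools.combinations(range(4), 2))
print(max_k_truss(clique4))  # 4
```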

The slowdown of Moore’s Law has resulted in poor scaling of performance and energy. This slowdown in scaling has been accompanied by the explosive growth of cognitive computing applications, creating a demand for high-performance and energy-efficient solutions. Amidst this climate, FPGA-based accelerators are emerging as a potential platform for deploying accelerators for cognitive computing workloads. However, the slowdown in scaling also limits the scaling of memory and I/O bandwidths. Additionally, a growing fraction of energy is spent on data transfer between off-chip memory and the compute units. Thus, now more than ever, there is a need to leverage near-memory and in-storage computing to maximize the bandwidth available to accelerators and further improve energy efficiency. In this paper, we make the case for leveraging FPGAs in near-memory and in-storage settings, and present the opportunities and challenges in such scenarios. We introduce a conceptual FPGA-based near-data processing architecture, and discuss innovations in architecture, systems, and compilers for accelerating cognitive computing workloads.
IEEE Computer Society Annual Symposium on VLSI, 2019

Invited Talks

Keynote Speech: Cognitive Computing on Heterogeneous Hardware Systems for the AI Revolution
Sun, Oct 20, 2019
Invited Distinguished Speaker: Design, Compilation, and Acceleration for Deep Neural Networks in IoT Applications
Wed, Apr 17, 2019
Cloud Tools and Libraries for Exploiting Heterogeneous Cognitive Computing Systems
Wed, Sep 27, 2017
Architecture and Software for Emerging Low-Power Systems
Mon, Jul 24, 2017
Crowdsensing, Crowdsourcing, and Creativity
Thu, Jun 1, 2017
Cognitive Computing on Heterogeneous Hardware Systems for the AI Revolution
Sat, Jul 8, 2017
Innovative Applications and Technology Pivots – A Perfect Storm in Computing
Sat, Feb 11, 2017
Awards

IEEE HPEC Graph Challenge Honorable Mention
Thu, Sep 26, 2019
IEEE HPEC Graph Challenge Honorable Mention
Thu, Sep 26, 2019
IEEE HPEC Graph Challenge Student Innovation
Thu, Sep 26, 2019
Outstanding Poster (SPGNet: Semantic Prediction Guidance for Scene Parsing)
Mon, Sep 16, 2019
Best Poster Award
Sun, Jul 14, 2019
Third Place Winner, FPGA category
Mon, Jun 24, 2019
First Place Winner, both FPGA and GPU categories
Sun, Jun 2, 2019
GLSVLSI Service Recognition Award
Thu, May 9, 2019
Best Research Paper Award, “Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects”
Thu, Apr 11, 2019
IEEE Fellow
Tue, Nov 20, 2018