Recent & Highlighted News

More Posts

C3SR faculty member Prof. Bo Li has been named to MIT Technology Review’s list of 35 Innovators Under 35 for her pioneering research in adversarial attacks to detect flaws in AI systems and make them more robust.



A paper from the C3SR team titled “XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPU” won the Best Paper Award at the 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS).

The winning team members were Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, and Wen-mei Hwu.



C3SR faculty members Professors Deming Chen, Wen-mei Hwu, and Jinjun Xiong have recently received a competitive Google Faculty Award to Support Machine Learning Courses, Diversity, and Inclusion from the Google TensorFlow team. This faculty award recognizes the team for their creation of the ECE498-ICC course, titled “Internet-of-Things and Cognitive Computing,” which was offered in the Spring semesters of 2019 and 2020. Since its inception in 2019, the course has attracted not only computer engineering students, but also students from electrical engineering, computer science, agricultural & biological engineering, civil engineering, and biology.


Our beloved C3SR faculty member, Professor Thomas Shi-Tao Huang, passed away peacefully at his daughter’s home in Indiana during the evening of April 25, 2020. Three months ago, Tom’s wife, Margaret, left the world in peace in the company of her family; Tom was extremely sad. Two months later, Tom moved from UIUC, where he had been teaching and living for forty years, to his daughter’s home in Indiana. In the last few weeks, Tom enjoyed the company and care of his family, and he continued to care about the development of his students from a distance.


Prof. Andrew Miller from C3SR recently won an NSF CAREER award for his work titled “Composable Programming Abstractions for Secure Distributed Computing and Blockchain Applications.”

Congratulations, Andrew!



More Publications

2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID), 2020

The rapidly growing demands for powerful AI algorithms in many application domains have motivated massive investment in both high-quality deep neural network (DNN) models and high-efficiency implementations. In this position paper, we argue that a simultaneous DNN/implementation co-design methodology, named Neural Architecture and Implementation Search (NAIS), deserves more research attention to boost the development productivity and efficiency of both DNN models and implementation optimization. We propose a stylized design methodology that can drastically cut down the search cost while preserving the quality of the end solution. As an illustration, we discuss this DNN/implementation methodology in the context of both FPGAs and GPUs. We take autonomous driving as a key use case as it is one of the most demanding areas for high quality AI algorithms and accelerators. We discuss how such a co-design methodology can impact the autonomous driving industry significantly. We identify several research opportunities in this exciting domain.
International Conference on Computer-Aided Design, 2019

Fine-grained action detection is an important task with numerous applications in robotics and human-computer interaction. Existing methods typically utilize a two-stage approach including extraction of local spatio-temporal features followed by temporal modeling to capture long-term dependencies. While most recent papers have focused on the latter (long-temporal modeling), here, we focus on producing features capable of modeling fine-grained motion more efficiently. We propose a novel locally-consistent deformable convolution, which utilizes the change in receptive fields and enforces a local coherency constraint to capture motion information effectively. Our model jointly learns spatio-temporal features (instead of using independent spatial and temporal streams). The temporal component is learned from the feature space instead of pixel space, e.g. optical flow. The produced features can be flexibly used in conjunction with other long-temporal modeling networks, e.g. ST-CNN, DilatedTCN, and ED-TCN. Overall, our proposed approach robustly outperforms the original long-temporal models on two fine-grained action datasets: 50 Salads and GTEA, achieving F1 scores of 80.22% and 75.39% respectively.
International Conference on Computer Vision, 2019

Multi-scale context module and single-stage encoder-decoder structure are commonly employed for semantic segmentation. The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance to their single-stage counterparts. However, few efforts have been made to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight the local features through the guidance from pixel-wise semantic prediction. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computations. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, in which our SPGNet attains 81.1% on the test set using only ‘fine’ annotations.
International Conference on Computer Vision, 2019

Recent advancements in deep learning techniques facilitate intelligent query support in diverse applications, such as content-based image retrieval and audio texturing. Unlike conventional key-based queries, these intelligent queries lack efficient indexing and require complex compute operations for feature matching. To achieve high-performance intelligent querying against massive datasets, modern computing systems employ GPUs in conjunction with solid-state drives (SSDs) for fast data access and parallel data processing. However, our characterization with various intelligent-query workloads developed with deep neural networks (DNNs) shows that the storage I/O bandwidth is still the major bottleneck that contributes 56%–90% of the query execution time. To this end, we present DeepStore, an in-storage accelerator architecture for intelligent queries. It consists of (1) energy-efficient in-storage accelerators designed specifically for supporting DNN-based intelligent queries, under the resource constraints in modern SSD controllers; (2) a similarity-based in-storage query cache to exploit the temporal locality of user queries for further performance improvement; and (3) a lightweight in-storage runtime system working as the query engine, which provides a simple software abstraction to support different types of intelligent queries. DeepStore exploits SSD parallelisms with design space exploration for achieving the maximal energy efficiency for in-storage accelerators. We validate DeepStore design with an SSD simulator, and evaluate it with a variety of vision, text, and audio based intelligent queries. Compared with the state-of-the-art GPU+SSD approach, DeepStore improves the query performance by up to 17.7×, and energy-efficiency by up to 78.6×.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’19), 2019

Deep neural networks (DNNs) have been widely adopted in many domains, including computer vision, natural language processing, and medical care. Recent research reveals that sparsity in DNN parameters can be exploited to reduce inference computational complexity and improve network quality. However, sparsity also introduces irregularity and extra complexity in data processing, which make the accelerator design challenging. This work presents the design and implementation of a highly flexible sparse DNN inference accelerator on FPGA. Our proposed inference engine can be easily configured to be used in both mobile computing and high-performance computing scenarios. Evaluation shows that our proposed inference engine effectively accelerates sparse DNNs and outperforms a CPU solution by up to 4.7x in terms of energy efficiency.
IEEE High Performance Extreme Computing Conference, 2019

In automatic speech recognition (ASR), wideband (WB) and narrowband (NB) speech signals with different sampling rates typically use separate acoustic models. Therefore, mixed-bandwidth (MB) acoustic modeling has important practical value for ASR system deployment. In this paper, we extensively investigate large-scale MB deep neural network acoustic modeling for ASR using 1,150 hours of WB data and 2,300 hours of NB data. We study various MB strategies including downsampling, upsampling, and bandwidth extension for MB acoustic modeling and evaluate their performance on 8 diverse WB and NB test sets from various application domains. To deal with the large amounts of training data, distributed training is carried out on multiple GPUs using synchronous data parallelism.
International Speech Communication Association, 2019

Unlike traditional PCIe-based FPGA accelerators, heterogeneous SoC-FPGA devices provide tighter integration between software running on CPUs and hardware accelerators. Modern heterogeneous SoC-FPGA platforms support multiple I/O cache coherence options between CPUs and FPGAs, but these options can have inadvertent effects on the achieved bandwidth depending on applications and data access patterns. To provide the most efficient communication between CPUs and accelerators, understanding the data transaction behaviors and selecting the right I/O cache coherence method is essential. In this paper, we use Xilinx Zynq UltraScale+ as the SoC platform to show how certain I/O cache coherence methods can perform better or worse in different situations, ultimately affecting the overall accelerator performance as well. Based on our analysis, we further explore possible software and hardware modifications to improve the I/O performance with different I/O cache coherence options. With our proposed modifications, the overall performance of the SoC design can be improved by 20% on average.
International Conference on Field Programmable Logic and Applications 2019, 2019

This work presents an update to the triangle-counting portion of the subgraph isomorphism static graph challenge. This work is motivated by a desire to understand the impact of CUDA unified memory on the triangle-counting problem. First, CUDA unified memory is used to overlap reading large graph data from disk with graph data structures in GPU memory. Second, we use CUDA unified memory hints to solve multi-GPU performance scaling challenges present in our last submission. Finally, we improve the single-GPU kernel performance from our past submission by introducing a work-stealing dynamic algorithm GPU kernel with persistent threads, which makes performance adaptive for large graphs without requiring a graph analysis phase.
2019 IEEE High Performance Extreme Computing Conference, 2019

In this paper, we present an update to our previous submission on k-truss decomposition from Graph Challenge 2018. For the single-GPU k-truss implementation, we propose multiple algorithmic optimizations that significantly improve performance by up to 35.2x (6.9x on average) compared to our previous GPU implementation. In addition, we present a scalable multi-GPU implementation in which each GPU handles a different ‘k’ value. Compared to our prior multi-GPU implementation, the proposed approach is faster by up to 151.3x (78.8x on average). When only the edges of the maximal k-truss are sought, incrementing the ‘k’ value in each iteration is inefficient, particularly for graphs with a large maximum k-truss. Thus, we propose binary search for the ‘k’ value to find the maximal k-truss. The binary search approach on a single GPU is up to 101.5x (24.3x on average) faster than our 2018 k-truss submission. Lastly, we show that the proposed binary search finds the maximum k-truss for the “Twitter” graph dataset, which has 2.8 billion bidirectional edges, in just 16 minutes on a single V100 GPU.
2019 IEEE High Performance Extreme Computing Conference, 2019
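
The binary search over ‘k’ described in the k-truss abstract above can be illustrated with a small sketch. This is a minimal, CPU-only Python illustration under stated assumptions, not the authors' GPU implementation: the function names `k_truss_edges` and `max_truss` are hypothetical, the graph is assumed to be undirected and given as pairs of integer vertex ids, and the peeling loop is a naive one chosen for readability. The search relies on the fact that the (k+1)-truss is always contained in the k-truss, so "the k-truss is non-empty" is monotonic in k.

```python
def k_truss_edges(edges, k):
    """Return the edges of the k-truss: the largest subgraph in which
    every edge takes part in at least k - 2 triangles."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Naive peeling: repeatedly drop edges with too few triangles
    # until a full pass removes nothing.
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # Common neighbours of u and v = triangles through (u, v).
                if u < v and len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True

    return {(u, v) for u in adj for v in adj[u] if u < v}


def max_truss(edges):
    """Binary-search the largest k whose k-truss is non-empty.

    Returns (k, edge_set); (0, set()) if the graph has no edges.
    """
    base = k_truss_edges(edges, 2)  # the 2-truss is every non-loop edge
    if not base:
        return 0, set()

    # An edge lies in at most n - 2 triangles, so k cannot exceed n.
    n = len({x for e in base for x in e})
    lo, hi = 2, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if k_truss_edges(edges, mid):
            lo = mid        # mid-truss exists, try larger k
        else:
            hi = mid - 1    # mid-truss is empty, try smaller k
    return lo, k_truss_edges(edges, lo)


if __name__ == "__main__":
    # A 4-clique on vertices 0..3 plus one pendant edge (3, 4).
    graph = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
    print(max_truss(graph))
```

For the example graph, the search settles on k = 4 and keeps only the six clique edges, since the pendant edge belongs to no triangle. Compared with incrementing k one step at a time, the number of truss computations grows with the logarithm of the vertex count rather than with the maximum k itself.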

Invited Talks

Keynote Speech: Cognitive Computing on Heterogeneous Hardware Systems for the AI Revolution
Sun, Oct 20, 2019
Invited Distinguished Speaker: Design, Compilation, and Acceleration for Deep Neural Networks in IoT Applications
Wed, Apr 17, 2019
Cloud Tools and Libraries for Exploiting Heterogeneous Cognitive Computing Systems
Wed, Sep 27, 2017
Architecture and Software for Emerging Low-Power Systems
Mon, Jul 24, 2017
Crowdsensing, Crowdsourcing, and Creativity
Thu, Jun 1, 2017
Cognitive Computing on Heterogeneous Hardware Systems for the AI Revolution
Sat, Jul 8, 2017
Innovative Applications and Technology Pivots – A Perfect Storm in Computing
Sat, Feb 11, 2017

Awards

Best Paper Award at the 33rd International Conference on VLSI Design (VLSID)
Sat, Jan 4, 2020
IEEE HPEC Graph Challenge Honorable Mention
Thu, Sep 26, 2019
IEEE HPEC Graph Challenge Honorable Mention
Thu, Sep 26, 2019
IEEE HPEC Graph Challenge Student Innovation
Thu, Sep 26, 2019
Outstanding Poster (SPGNet: Semantic Prediction Guidance for Scene Parsing)
Mon, Sep 16, 2019
Best Poster Award
Sun, Jul 14, 2019
Third Place Winner, FPGA category
Mon, Jun 24, 2019
First Place Winner, both FPGA and GPU categories
Sun, Jun 2, 2019
GLSVLSI Service Recognition Award
Thu, May 9, 2019
Best Research Paper Award, "Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects"
Thu, Apr 11, 2019