D5.3: Optimized multi-threaded mapping to reconfigurable hardware

The ThreadPoolComposer (TPC) tool has been refined to support the automated insertion of custom-generated (non-Vivado HLS) IP cores into the thread pool of hardware processing elements. This allows the use not only of highly optimized cores written in traditional hardware design languages, such as Verilog or VHDL, but also of cores created by other high-level synthesis systems (e.g., Bluespec). Optimized multi-threaded execution of applications is enabled by TPC automatically exploring the implementation space (the trade-off of area vs. clock frequency vs. thread-level parallelism of hardware processing elements), aiming for maximum throughput of the generated hardware. Finally, TPC itself has been made more extensible by providing a flexible plug-in interface for inserting custom hardware-generation steps into the automated design flow.
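The trade-off explored here can be illustrated with a small sketch. It is not TPC's actual exploration algorithm: the names `DesignPoint`, `estimated_clock_mhz` and `explore`, the linear clock-degradation model, and all parameter values are assumptions made purely for illustration. The sketch enumerates how many processing elements fit on the device, estimates the achievable clock for each configuration, and keeps the configuration with the highest estimated throughput.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DesignPoint:
    """One candidate configuration (hypothetical model, not TPC's data structure)."""
    num_pes: int        # thread-level parallelism: number of processing elements
    clock_mhz: float    # estimated achievable clock frequency
    utilization: float  # fraction of device area occupied

    def throughput(self, cycles_per_job: int) -> float:
        """Jobs per second, assuming all PEs work independently."""
        return self.num_pes * self.clock_mhz * 1e6 / cycles_per_job


def estimated_clock_mhz(base_mhz: float, utilization: float, penalty: float) -> float:
    # Assumed linear model: the fuller the device, the lower the achievable clock.
    return base_mhz * (1.0 - penalty * utilization)


def explore(pe_area: float, base_mhz: float, cycles_per_job: int,
            penalty: float = 0.5, max_utilization: float = 0.9) -> Optional[DesignPoint]:
    """Exhaustively try every PE count that fits and return the best design point."""
    best = None
    n = 1
    while n * pe_area <= max_utilization:
        util = n * pe_area
        point = DesignPoint(n, estimated_clock_mhz(base_mhz, util, penalty), util)
        if best is None or point.throughput(cycles_per_job) > best.throughput(cycles_per_job):
            best = point
        n += 1
    return best  # None if not even one PE fits
```

In this toy model, adding processing elements increases parallelism but lowers the clock, so the best throughput is not necessarily reached at the largest PE count. A real flow would replace the analytic clock model with results from synthesis and place-and-route runs.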

The on-chip architecture of TPC-generated circuits uses AXI interconnects to link the different hardware elements (e.g., processing blocks and off-chip interfaces). For D5.3, the configuration of these interconnects has been optimized. This includes reducing clock periods by using inferred register slices when composing interconnect trees, as well as selectively using packet-mode first-in-first-out (FIFO) buffers in the top layer of the tree. The optional on-chip memory hierarchy, which was started in D5.2 with the design of configurable L1 caches (local to each processing element), has been extended in D5.3 to also include an L2 cache (shared between processing elements).
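To see why interconnects end up composed as trees, consider the following minimal sketch. It assumes that a single interconnect block can accept only a limited number of upstream ports (the default of 16 used here is an assumption, as is the function name `interconnect_tree`). It is not taken from TPC. It computes how many interconnect blocks are needed at each level to funnel a given number of processing elements into a single root.

```python
from typing import List


def interconnect_tree(num_masters: int, max_ports: int = 16) -> List[int]:
    """Return the number of interconnect blocks per tree level, leaf level first.

    num_masters: number of ports to be merged (e.g., processing elements).
    max_ports:   assumed maximum number of upstream ports per interconnect block.
    """
    if num_masters < 1:
        raise ValueError("need at least one master port")
    if max_ports < 2:
        raise ValueError("an interconnect must merge at least two ports")

    levels = []
    remaining = num_masters
    while True:
        # Integer ceiling division: blocks needed to cover the remaining ports.
        count = (remaining + max_ports - 1) // max_ports
        levels.append(count)
        if count == 1:
            break  # a single root block has been reached
        remaining = count
    return levels
```

For example, `interconnect_tree(40)` yields `[3, 1]`: three leaf-level blocks feeding one root. Every additional level lengthens the path a transaction has to traverse, which is where register slices between levels can help keep the clock period short.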

A special focus of D5.3 was the use of optimized hardware blocks for compute kernels that exceed the performance of blocks created by high-level synthesis from C/C++ code. This has been achieved for the Stereovision and eRDM (railway monitoring) use cases. The resulting blocks illustrate the use of different design methods within the common REPARA TPC framework. The Stereovision core (using a high-quality semi-global method) was implemented using a different high-level synthesis system (Bluespec, with a design language inspired by Haskell). The eRDM kernel was implemented both in C tuned for high-level synthesis (using, e.g., special directives to enable pipelining) and in a traditional HDL. For both use cases, performance was significantly improved over the original solutions running on a conventional processor.

