
Exploring AI / Machine Learning Implementations with Stratus HLS

A lot of AI design is done in software and, while much of it will remain there, a growing number of designs are finding their way into hardware. There are multiple reasons for this, including the important goals of achieving lower power or higher performance for critical parts of the AI process. Imagine, for example, that you need a dramatically higher rate of object recognition in an automated-driving application.

Implementing an AI application in hardware presents some key challenges for the designer.

  • Need to explore multiple algorithms and architectures, typically using a framework such as TensorFlow or Caffe
  • Need to quantify the power, performance, area, and accuracy trade-offs of various architectures
  • Need a rapid path from the models to production silicon

In this article, I'll describe a flow that starts in the TensorFlow environment, moves to abstract C++ targeted at Stratus HLS, and then proceeds into a concrete hardware implementation flow.

We have a completed implementation of the commonly used MNIST digits example, which recognizes images of hand-written digits. The approach we took was to first model our recognition algorithm in the TensorFlow framework. This was easy and productive, and it allowed us to train the network and extract the "weights" we would use during inference. The architecture of the network we used is shown below.
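
As a rough illustration of the hand-off (our own sketch, not the project's actual files), the weights extracted from TensorFlow can be written out as a plain C++ header that the inference model compiles in. The file name, layer name, and dimensions below are invented for this example.

    // mnist_weights.h -- hypothetical header emitted by the training script.
    // The layer name and dimensions are illustrative only.
    #ifndef MNIST_WEIGHTS_H
    #define MNIST_WEIGHTS_H

    // Weights are exported as plain float; the hardware model quantizes
    // them to the chosen fixed-point type when it is built.
    static const float fc1_weights[784][128] = {
        /* values written out by the TensorFlow training run */
    };
    static const float fc1_biases[128] = {
        /* values written out by the TensorFlow training run */
    };

    #endif // MNIST_WEIGHTS_H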

The next step was to implement the key TensorFlow functions in abstract C++ and pull all the pieces together into a macro-level SystemC architecture. As shown below, the C++ code was organized similarly to the TensorFlow code, but with parametrized data types and latency constraints to allow architectural exploration.
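
To give a flavor of that code, below is a minimal sketch of a fully-connected layer templated on its internal data type, so one source can serve every configuration we explore. This is our own illustration with assumed widths and bounds, not the project's actual source, and the per-configuration Stratus latency directives are omitted.

    #define SC_INCLUDE_FX   // enable the SystemC fixed-point types
    #include <systemc.h>

    // Hypothetical fully-connected layer with ReLU, templated on the data
    // type so the same code supports every bit-width in the exploration.
    template <typename T, int IN, int OUT>
    void fc_layer(const T (&in)[IN], const T (&w)[IN][OUT],
                  const T (&bias)[OUT], T (&out)[OUT])
    {
        for (int o = 0; o < OUT; ++o) {
            T acc = bias[o];
            for (int i = 0; i < IN; ++i)
                acc += in[i] * w[i][o];          // MAC loop mapped to hardware by HLS
            out[o] = (acc > T(0)) ? acc : T(0);  // ReLU activation
        }
    }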

Then we defined a range of exploration parameters. The training model used full IEEE floating-point data types, but we decided that exploring smaller data types could be an important axis. So, in our hardware-targeted model, we used fixed-point data types and varied the width of the internal data types from 16 bits down to 12 bits, stepping by 1 bit. For latency, we chose three settings representing FAST, MEDIUM, and SLOW designs. We expected the slower settings to decrease both area and power at the cost of throughput (measured in images per second), and reducing the bit-width of the data types to decrease area and power at the cost of accuracy (measured as the percentage of correct predictions).
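
One plausible way to wire up that sweep (the macro names and the 4-bit integer part are our assumptions) is to select the data type and latency target per Stratus configuration at compile time:

    #define SC_INCLUDE_FX
    #include <systemc.h>

    // Hypothetical per-configuration knobs: each Stratus HLS configuration
    // sets DATA_WIDTH (16 down to 12) and LATENCY (FAST/MEDIUM/SLOW).
    #ifndef DATA_WIDTH
    #define DATA_WIDTH 16
    #endif

    typedef sc_dt::sc_fixed<DATA_WIDTH, 4> data_t;  // DATA_WIDTH total bits, 4 integer bits

    enum latency_target { FAST, MEDIUM, SLOW };
    #ifndef LATENCY
    #define LATENCY MEDIUM   // drives the latency constraints in the model
    #endif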

The design flow for this project looked like this:

We used the following tool flow to assess power, performance, and area (PPA) and accuracy.

  • Stratus HLS
    • We synthesized this design through Stratus HLS using a 500 MHz clock and a 7 nm technology library.
    • We set up multiple configurations and ran Stratus HLS to generate Verilog RTL for each configuration.
  • Xcelium simulations
    • Each Verilog RTL model was simulated with Xcelium to measure throughput and to capture simulation vectors (a simple accuracy-scoring sketch follows this list).
  • Joules RTL Power Solution
    • Each Verilog RTL and its associated simulation vectors (from the previous step) were run through Joules to capture power metrics.
  • Genus Synthesis Solution
    • Each Verilog RTL was run through the logic synthesis step using Genus to produce accurate area numbers.
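
As a rough sketch of how accuracy can be scored during simulation (our own example, not the project's testbench; the demo values are made up), the error-rate metric is simply the fraction of predicted digits that disagree with the MNIST labels:

    #include <cstdio>

    // Hypothetical scoring helper: given the digits predicted by the RTL
    // and the MNIST reference labels, report the error rate that feeds
    // the accuracy column of the results table.
    double error_rate(const int predicted[], const int labels[], int n)
    {
        int errors = 0;
        for (int i = 0; i < n; ++i)
            if (predicted[i] != labels[i])
                ++errors;
        return 100.0 * errors / n;   // e.g. 3.3 means 96.7% correct
    }

    int main()
    {
        const int predicted[] = {7, 2, 1, 0, 4};   // made-up demo values
        const int labels[]    = {7, 2, 1, 0, 9};
        std::printf("error rate: %.1f%%\n", error_rate(predicted, labels, 5));
        return 0;
    }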

The detailed results from our experiments are shown in the table below:

In the table above, the green columns are the aspects of the design that we varied, and the blue columns contain the data that we measured. As you can see, the results were largely in line with our expectations. An error rate of 3.3% means the algorithm is correct 96.7% of the time. One interesting tidbit in this data is that there is very little increase in error rate when the data bit-width is reduced from 16 bits to 14, and even 13 bits stays close: moving from 16 bits to 13 bits yields a 27% reduction in power and a 25% reduction in area for only a 0.7-point loss of accuracy. Moving down to 12 bits, however, causes a distinct loss of accuracy.

These exploration experiments produced a very broad range of possible implementation points, as shown in the table below:

As you can see, we have data points demonstrating a wide variety of measurements that impact the design. We could choose different implementation points depending on the requirements imposed by our application. If we were looking to implement a chip that would live on the edge of the network, extreme low-power might be a requirement. If our goal was to process images coming into a cloud server or an automobile, we might ignore power and instead go for the highest frame-rate possible.

This small example demonstrates why so much AI/machine learning hardware is being built with high-level synthesis. Stratus HLS allows the designer to concretely evaluate PPA and accuracy trade-offs of multiple architectures and implementations from a single high-level model, selecting the best trade-off for the specific end application.

For more information about implementing AI and machine learning hardware with Stratus HLS and the full Cadence flow, click here.

For more information about the Stratus HLS solution, click here.

