# **Just-in-Time Compilation for Verilog**

A New Technique for Improving the FPGA Programming Experience

Eric Schkufza VMware Research Group eschkufza@vmware.com Michael Wei VMware Research Group mwei@vmware.com Chris Rossbach UT Austin rossbach@cs.utexas.edu

## **Abstract**

FPGAs offer compelling acceleration opportunities for modern applications. However compilation for FPGAs is painfully slow, potentially requiring hours or longer. We approach this problem with a solution from the software domain: the use of a JIT. Code is executed immediately in a software simulator, and compilation is performed in the background. When finished, the code is moved into hardware, and from the user's perspective it simply gets faster. We have embodied these ideas in Cascade: the first JIT compiler for Verilog. Cascade reduces the time between initiating compilation and running code to less than a second, and enables generic printf debugging from hardware. Cascade preserves program performance to within 3× in a debugging environment, and has no effect on a finalized design. Crucially, these properties hold even for programs that perform side effects on connected IO devices. A user study demonstrates the value to experts and non-experts alike: Cascade encourages more frequent compilation, and reduces the time to produce working hardware designs.

## 1. Introduction

Reprogrammable hardware (FPGAs) offer compelling acceleration opportunities for high-performance applications across a wide variety of domains [15, 39, 49, 36, 46, 77, 70, 22, 55, 38, 73, 14, 35]. FPGAs can exceed the performance of general-purpose CPUs by several orders of magnitude [59, 16] and offer dramatically lower cost and time to market than ASICs. In coming years, FPGA density and clock rates are projected to grow steadily while manufacturing costs decline. As a result, hardware vendors have announced plans for server-class processors with on-die FPGA fabric [8], and cloud providers have begun to roll out support for virtual machines with FPGA accelerators and application development frameworks [24].

While the benefits are substantial, so too are the costs. Programming an FPGA is a difficult task. Writing code in a hardware description language (HDL) requires a substantially different mental model than it does for a Von Neumann architecture. The rapid growth of Domain Specific Languages with HDL backends [19, 45, 11] has begun to address this issue. But regardless of frontend language, HDL must ultimately be compiled to an executable bitstream format which can be consumed by an FPGA. The key open problem we address in this paper is that this is an extremely slow process. Trivial programs can take several minutes to compile, and complex designs can take hours or longer.

We believe that compiler overhead is a serious obstacle to unlocking the potential of FPGAs as a commodity technology. First, long compile times greatly weaken the effectiveness of the compile-test-debug cycle. Second, large-scale deployments of FPGAs are likely to consist of FPGAs of varying sizes, architectures, and even vendors. Requiring developers to maintain, build, optimize, and test on every possible configuration makes it impossible to distribute FPGA logic at the same scale as software executables. For software developers used to the ability to rapidly prototype changes to their code, this is a serious barrier to entry. It diminishes interest in experimenting with FPGAs, and keeps the total number of active hardware developers much lower than it should be.

The obvious solution is to improve the compiler. However there are two reasons why this is not possible. First, the source is unavailable. FPGA hardware and toolchains are produced almost exclusively by two major manufacturers, Intel and Xilinx, and neither has a commercial incentive to open their platforms to developers. Open source initiatives have begun to change this [1, 3]. However most are in their infancy and support a single target at best. The second reason is that compilation for FPGAs is theoretically hard. Transforming HDL into a bitstream is a two-step process. The first involves translation to a register-transfer level (RTL) style intermediate representation (IR) and the second involves lowering (generating a mapping from) that IR onto FPGA fabric. Crucially, this amounts to constraint satisfaction, a known NP-hard problem for which no fast general-purpose solution method exists. While constraint solvers have improved dramatically in the past decade and continue to do so, it is unlikely that a polynomial-time HDL compiler is on its way.

Instead, the current practice is to rely on hardware simulation. Running HDL in a simulator does not require a lengthy compilation. However it does have serious drawbacks. First, most FPGA programs involve IO peripherals which must be replaced by software proxies. Building and guaranteeing the correctness of those proxies (assuming the simulation environment supports them — most don't) distracts from the goal of producing a working hardware design. Second, because compilation is NP-hard, functional correctness does not guarantee that a program can be successfully lowered onto an FPGA. Instead, programmers must experiment with many functionally correct designs, attempting a lengthy compilation for each, before arriving at one which works on their architecture. Fi-

nally, software debugging techniques such as printf statements cannot be used once a program has left simulation. When bugs inevitably appear in production code running in hardware, they can be very difficult to track down.

In this paper, we take a new approach. Rather than attempt to reduce the latency of the compiler, we propose a strategy for hiding it behind a simulator in a just-in-time (JIT) environment. The key idea is to use a sequence of transformations guided entirely by the syntax of Verilog to translate a program into many small pieces. Importantly, almost no user annotation is required. The pieces are organized into an IR which expresses a distributed system and supports communication between hardware and software. Pieces which interact directly with IO peripherals are automatically replaced by pre-compiled standard components and the remaining pieces begin execution in a software simulator while a potentially lengthy compilation is initiated for each in the background. As these compilations finish, the pieces transition from software to hardware. From the user's point of view, the code runs immediately and simply gets faster over time.

We have implemented these ideas in an open-source system called Cascade<sup>1</sup>, the first JIT compiler for Verilog. Cascade reduces the time between initiating compilation and running code to less than a second, preserves program performance to within  $3 \times$  in a debugging environment, and has no effect on a finalized design. In addition to tightening the compiletest-debug cycle, Cascade also improves portability and expressiveness. Automatically mapping IO peripherals onto pre-compiled standard components reduces the burden of porting a program from one architecture to another, and allows programmers to test their code in the same environment as they intend to release it. No IO proxies or simulators are necessary. Furthermore, the use of a runtime which supports communication between software and hardware allows Cascade to support printf-style debugging primitives even after a program has been migrated to hardware. The effect is substantial. We demonstrate through a user study that Cascade encourages more frequent compilation and reduces the time required for developers to produce working hardware designs.

To summarize, our key contribution is a compilation framework which supports a novel programming experience with strong implications for the way that hardware development is taught and carried out in practice. Our prototype implementation Cascade transforms HDL development into something which closely resembles writing JavaScript or Python. In short, we take the first steps towards bridging the gap between programming software and programming hardware.

## 2. Background

We begin with a high level overview of HDLs and the tool flows typical of the hardware programming experience. We frame our discussion in terms of a running example.

```
1: module Rol(
 2:
      input wire [7:0] x,
 3:
      output wire [7:0] y
 4:);
 5:
      assign y = (x == 8'h80) ? 1 : (x<<1);
 6: endmodule
 1: module Main(
                         clk,
 2:
      input wire
      input wire [3:0] pad, // dn/up = 1/0
 3:
      output wire [7:0] led // on/off = 1/0
 4:
 5:);
 6:
      reg cnt [7:0] = 1;
 7:
      Rol r(.x(cnt));
 8:
      always @(posedge clk)
 9:
        if (pad == 0)
10:
          cnt <= r.y;
11:
          $display(cnt); // unsynthesizable!
12:
13:
          $finish;
                          // unsynthesizable!
14:
      assign led = cnt;
15: endmodule
```

Figure 1: A Verilog implementation of the running example.

### 2.1. Running Example

Consider an FPGA with a connected set of IO peripherals: four buttons and eight LEDs. The task is to animate the LEDs, and respond when a user presses one of the buttons. The LEDs should illuminate one at a time, in sequence: first, second, third, etc., and then the first again, after the eighth. If the user presses any of the buttons, the animation should pause. This task is a deliberate simplification of a representative application. Even still, it requires synchronous and asynchronous event handling, and computation that combines user inputs with internal state. More importantly, the task is complicated by the diversity of platforms on which it might be deployed. Debugging code in a simulator with proxies to represent the IO peripherals is no substitute for running it and verifying that the buttons and LEDs indeed work as intended in the target environment.

### 2.2. Verilog

A Verilog [4] implementation of the running example is shown in Figure 1. Verilog is one of two standard HDLs which are used to program FPGAs. The alternative, VHDL [5], is essentially isomorphic. The code is organized hierarchically in units called *modules* (Rol, Main), whose interfaces are defined in terms of input/output ports (x, y, clk, pad, led). The inputs to the *root* (top-most) module (Main) correspond to IO peripherals. Modules can consist of nested modules, arbitrary width wires and registers, and logic gates.

A module with a single assignment (Rol) produces the desired animation: when x changes, y is assigned the next value in the sequence, a one bit rotation to the left. The state

<sup>&</sup>lt;sup>1</sup>URL withheld for double-blind submission.

```
1: procedure EVAL(e)
2:
       if e is an update then
           perform sequential update
3:
4:
       else
5:
           evaluate combinational logic
6:
       enqueue new events
7:
8: end procedure
   procedure ReferenceScheduler
       while \top do
2:
           if \exists activated events then
3:
               EVAL(any activated event)
4:
5:
           else if \exists update events then
               activate all update events
6:
           else
7:
               advance time t; schedule recurring events
8:
9:
       end while
10:
11: end procedure
```

Figure 2: The Verilog reference scheduler, shown simplified.

of the program is held in a register, cnt (Main: 6), which is connected to an instance of Rol (Main: 7) and used to drive the LEDs (Main: 14). The value of cnt is only updated to the output of r (Main: 10) when the clock transitions from 0 to 1 (Main: 8) and none of the buttons are pressed (Main: 9).

## 2.3. Synthesizable Core

The language constructs discussed so far are part of the *synthe-sizable core* of Verilog. They describe computation which can be lowered on to the physical circuitry of an FPGA. Outside of that core are *system tasks* such as print statements (Main:12) and shutdown directives (Main:13). In Figure 1, they have been used to print the state of the program and terminate execution whenever the user presses a button, perhaps as part of a debugging session. While invaluable to a developer, there is no general purpose way for a compiler to preserve system tasks in a release environment. It is not uncommon to deploy an FPGA in a setting where there is no terminal, or there is no kernel to signal.

### 2.4. Design Flow

The design flow for the code in Figure 1 would typically begin with the use of a software simulator [64, 61]. For programs whose only IO interaction is with a clock, this is an effective way to catch bugs early. Simulation begins in seconds and unsynthesizable Verilog is useful for diagnosing logic errors. However as with most programs, the running example involves IO peripherals. As a result, simulation would only be possible if the user was willing to implement software proxies, a task which is time consuming and error prone.

The next step would be the use of a *synthesis* tool [75, 33, 76] to transform the program into an RTL-like IR consisting of

```
1: module Rol ... endmodule
 // next eval'ed declaration here ...
 1: module Main();
      Clock clk();
                      // implicitly
 3:
      Pad#(4) pad(); // provided by
 4:
      Led#(8) led(); // environment
 5:
 6:
      reg cnt [7:0] = 1;
 7:
      Rol r(.x(cnt));
 8:
      always @ (posedge clk.val)
        if (pad.val == 0)
 9:
10:
          cnt <= r.y;
11:
14:
      // next eval'd statement here ...
15: endmodule
CASCADE >>> assign led.val = □
```

Figure 3: The Cascade REPL-based user interface, shown with a partial implementation of the running example.

wires, logic gates, registers, and state machines. Synthesis can take from minutes to hours depending on the aggressiveness of the optimizations (eg. state machine minimization) which are applied. Unsynthesizable code is deleted and further debugging still requires the use of proxies. While the resulting code would be closer to what would be lowered onto hardware, it would now need to be run in a waveform viewer [28].

The last step would be the use of a *place and route* tool to lower the RTL onto the FPGA fabric, establish connections between top-level input/outputs and peripheral IO devices, and guarantee that the critical path through the resulting circuit does not violate the timing requirements of the device's clock. As with synthesis, this process can take an hour or longer. Once finished, a *programming* tool would be used to reconfigure the FPGA. This process is straightforward and requires less than a millisecond to complete.

Finally, the program could be tested in hardware. As buggy behavior was detected and repaired, the design flow would be restarted from the beginning. To summarize, compilation is slow, code is not portable, and the developer is hampered by the fact that foundational (eg. printf-style) debugging facilities are confined to a different environment than the one in which code is deployed.

## 2.5. Simulation Reference Model

The reason the code in Figure 1 can be run in so many environments (simulator, waveform viewer, or FPGA) is because the semantics of Verilog are defined abstractly in terms of the reference scheduling algorithm shown (simplified) in Figure 2. The scheduler uses an unordered queue to determine the interleaving of two types of events: *evaluation* (combinational logic) and *update* (sequential logic). Evaluations correspond to changes to stateless components such as logic gates, wires,

```
1: module Main(
 2:
      input wire
                          clk,
      input wire [3:0] pad_val,
 3:
 4:
      output wire [7:0] led_val,
 5:
      output wire [7:0] r_x,
 6:
      input wire [7:0] r_y
 7:);
      reg cnt [7:0] = 1;
 8:
 9:
      assign r_x = cnt;
10:
      always @ (posedge clk_val)
11:
        if (pad_val == 0)
12:
          cnt <= r_y;
13:
        else
14:
          $display(cnt);
15:
          $finish;
16:
      assign led_val = cnt;
17: endmodule
```

Figure 4: An example of Cascade's distributed system IR. Modules are transformed into stand-alone Verilog subprograms.

or system tasks (a change in cnt triggers an evaluation of r.x, or the rising edge of clk may trigger a print statement) and updates correspond to changes to stateful components such as registers (assigning the value of r.y to cnt). Events are performed in any order but only once activated (placed on the queue). Evaluations are always active, whereas updates are activated when there are no other active events. When the queue is emptied the system is said to be in an observable state. The logical time is advanced, and some events such as the global clock tick are placed back on the queue.

Intuitively, it may be useful for a developer to think of a Verilog program in terms of its hardware realization. Evaluations appear to take place continuously, and updates appear to take place simultaneously whenever their trigger (eg. posedge clk) is satisfied. However, any system that produces the same sequence of observable states, whether it be a software simulator, an executable bitstream, or the system described in this paper which transitions freely between the two, is a well-formed model for Verilog.

### 3. Cascade

We now describe Cascade, the first JIT compiler for Verilog. Cascade is based on the following design goals which are derived from the shortcomings of the hardware development process. We defer discussion of anti-goals to Section 7.

Interactivity Code with IO side effects should run immediately, and a user should be able to modify a running program.Portability Code written on one platform should run on another with little or no modification.

**Expressiveness** Unsynthesizable Verilog should remain active after a program has moved to hardware.

**Performance** Users may trade native performance for expressiveness, but not be forced to sacrifice it for interactivity.



Figure 5: The Cascade runtime architecture.

### 3.1. User Interface

Cascade's user interface, a *Read-Eval-Print-Loop* (REPL) similar to that of a Python interpreter [71], is shown in Figure 3 with a nearly identical copy of the code from Figure 1. Verilog is lexed, parsed, and type-checked one line at a time, and errors are reported to the user. Code which passes these checks is integrated into the user's program: module declarations are placed in the outer-most scope, and statements are inserted at the end of a root module which is implicitly instantiated when Cascade begins execution. Code begins execution as soon as it is placed within an instantiated module and IO side effects are visible immediately. Cascade can also be run in batch mode with input provided through a file. The process is the same.

## 3.2. Standard Library

The only difference between the code Figures 1 and 3 is Cascade's treatment of IO peripherals, which are represented as pre-defined types: Clock, Pad, and Led. These modules are implicitly declared and instantiated when Cascade begins execution, along with whatever other types (eg. GPIO, Reset) are supported by the user's hardware environment. Several other types supported by all environments (Memory, FIFO, etc, not shown) may be instantiated at the user's discretion. The Verilog parameterization syntax (# (n)) is similar to C++ templates, and used to indicate object width (ie. four buttons, eight LEDs). This design supports **portability** by casting IO configuration as a target-specific implementation detail which can be managed by the compiler. Additionally, it allows Cascade to treat IO peripherals identically to user logic.

### 3.3. Intermediate Representation

Cascade uses the syntactic structure of Verilog to manage programs at the module granularity. This is done with an IR that expresses a distributed system composed of Verilog *subprograms* with a constrained protocol. Each subprogram represents a single module whose execution and communication are mediated by messages sent over a data/control plane.

When the user eval's code which instantiates a new module or places a statement at the end of the root module, Cascade

```
1: procedure EVALALL(E,t)
       while events e of type t in E's queue do
           EVAL(e)
3:
4:
       end while
5: end procedure
   procedure CASCADESCHEDULER
       while \top do
2:
           if \exists engine E with evaluation events then
3:
               EVALALL(E, evaluation)
4:
           else if \exists engine E with update events then
5.
               for all E with update events do
6:
                   EVALALL(E, update)
7:
               end for
8:
9.
           else
               service interrupts; end step for all engines
10:
               advance time t
11:
12:
           end if
       end while
13:
14:
       end for all engines
   end procedure
15:
```

Figure 6: The Cascade scheduler.

uses a static analysis to identify the set of variables accessed by modules other than the one in which they were defined (in Figure 3, clk.val, led.val, pad.val, r.x, and r.y). Verilog does not allow naming through pointers, so this process is tractable, sound, and complete. Cascade then modifies the subprogram source for the modules those variables appear in. Figure 4 shows the transformation for Main. First, the variables are promoted to input/outputs and renamed (r.x becomes  $r_x$ ). This provides the invariant that no module names a variable outside of its syntactic scope. Next, nested instantiations are replaced by assignments (Main:9). The result is that while Verilog's logical structure is hierarchical (main contains an instance of Rol), Cascade's IR is *flat* (main and that instance are peers).

The runtime state of a subprogram (recall that instantiated code begin execution immediately) is represented by a data structure known as an *engine*. Subprograms start as quickly compiled, low-performance, software simulated engines. Over time they are replaced by slowly compiled, high-performance FPGA resident hardware engines. If a subprogram is modified, its engine is transitioned back to software, and the process is repeated. Specifics, and considerations for standard library engines, are discussed in Section 4. Being agnostic to whether engines are located in hardware or software, and being able to transition freely between the two is the mechanism by which Cascade supports **interactivity**.

### 3.4. Runtime

The Cascade runtime is shown in Figure 5 during an execution of the running example. Boxes C through P represent engines

```
1: struct Engine {
      virtual State* get_state() = 0;
 3:
      virtual void set_state(State* s) = 0;
 4:
 5:
      virtual void read(Event* e) = 0;
 6:
      virtual void write(Event* e) = 0;
 7:
 8:
      virtual bool there_are_updates() = 0;
 9:
      virtual void update() = 0;
10:
      virtual bool there_are_evals() = 0;
11:
      virtual void evaluate() = 0;
12:
      virtual void end_step();
13:
      virtual void end();
14:
15:
      virtual void display(String* s) = 0;
16:
      virtual void finish() = 0;
17:
18:
      virtual void forward(Core* c);
19:
      virtual void open_loop(int steps);
20: };
```

Figure 7: The Cascade target-specific engine ABI.

for the five modules clk through pad. Some are in software, others in hardware, but to the user, it *appears* as though they are all in hardware. The user interacts with Cascade through a controller and observes program outputs through a view, which collectively form the REPL. The user's input, system task side effects, and runtime events are stored on an ordered interrupt queue, and a scheduler is used to orchestrate program execution by sending messages across the control/data plane.

The Cascade scheduler is shown in Figure 6. While formally equivalent to the reference, it has several structural differences. First, the scheduler batches events at the module granularity. If an engine has at least one active evaluation, the scheduler requests that it perform them all. If at least one engine has at least one active update event, it requests that all such engines perform them all. Second, the propagation of events generated by these computations takes place only when a batch has completed rather than as they become available. Finally, because eval'ing new code can affect program semantics, it is crucial that it happen when it cannot result in undefined behavior. This is guaranteed to be true in between time steps, when the event queue is empty, and the system is in an observable state. Cascade uses this window to update its IR by creating new engines in response to module instantiations, and rebuilding engines based on new read/write patterns between modules. The replacement of software engines with hardware engines as they become available, as well as interrupt handling (ie. passing display events to the view, or terminating in response to finish), and rescheduling recurring events like the global clock tick, take place during this window as well.

## 3.5. Target-Specific Engine ABI

Cascade is able to remain agnostic about where engines are located by imposing a constrained protocol on its IR. This



Figure 8: Cascade measures performance in terms of an aperiodic virtual clock defined over multiple physical clock domains.

protocol is captured by the *Application Binary Interface* (ABI) shown in Figure 7. Creating new implementations of this class is the mechanism by which developers can extend Cascade's support for new backend targets (we discuss two such implementations in Section 5). Importantly, *this is not* a user-exposed interface. The implementation details of target-specific engines are deliberately hidden from Verilog programmers inside Cascade's runtime.

Engines must support get and set methods so the runtime can manage their internal state (e.g. when Main's engine transitions from software to hardware, ont must be preserved rather than reset it to 1, as this would disturb the LED animation). Again, the absence of pointers implies the algorithm for identifying this state is tractable, sound, and complete. The there\_are\_updates, update, there\_are\_evals, and evaluate methods are invoked by the Cascade scheduler (lines 3-7), and the optional end\_step and end methods are invoked when the interrupt queue is empty (line 10), and on shutdown (line 14) respectively (e.g. this is how the standard clock re-queues its tick event as in Section 2.5). Engines must also support read and write methods, which are used to broadcast and discover changes to subprogram input/outputs which result from evaluations and updates. Finally, display and finish methods are used to notify the runtime of system task evaluation. Requiring these methods of all engines, enables expressiveness by providing support for unsynthesizable Verilog even from hardware.

## 4. Performance

Cascade's performance is a function of its runtime overhead and time spent performing engine computation and communication. A depiction is shown in Figure 8, which sets aside the running example and shows the runtime (top), a single software engine (middle), and a single hardware engine (bottom). Time proceeds left to right and computation moves between engines as the runtime invokes their ABIs through the data/control plane. In this section, we describe the optimizations Cascade uses to achieve **performance**, that is, to minimize communication and runtime overhead, and maximize the amount of computation in fast FPGA fabric.

#### 4.1. Goals

Hardware and software engines occupy different clock domains: software operates in GHz, and FPGAs in MHz. Further, the number of cycles a software engine takes to process an ABI request may be very different than for a hardware engine (e.g. thousands of CPU instructions versus a single FPGA clock tick). We define Cascade's performance in terms of its *virtual clock*, the *average* rate at which it can dispatch iterations of its scheduling loop (variable amounts of user interaction and ABI requests per iteration imply aperiodicity). Because the standard library's clock is just another engine, every two iterations of the scheduler correspond to a single virtual tick (up on the first, down on the second). Cascade's goal is to produce a virtual clock rate that matches the physical clock rate of the user's FPGA. Figure 9 shows the process for doing so.

### 4.2. User Logic

Returning to the running example, user logic (modules Main and r) begin execution in separate software engines (Figure 9.1). Because Cascade's IR is flat, all communication passes through the data/control plane, even though r's input/outputs are referenced exclusively by Main. The first optimization that Cascade performs is to inline user logic into a single subprogram. Verilog does not allow dynamic allocation of modules, so the process is tractable, sound, and complete. A new engine is allocated for the inlined subprogram (Figure 9.2), it inherits state and control from the old engines, and the number of read and write requests sent across the data/control plane, along with the number of evaluate and update requests required for the event queue to fixed point, are significantly reduced. At the same time, Cascade creates a new hardware engine which begins the process of compiling the inlined subprogram in the background. When compilation is complete, the hardware engine inherits state and control from the inlined software engine (Figure 9.3). From this point on, nearly all ABI requests are processed at hardware speed.

## 4.3. Standard Library Components

Standard library components with IO side effects must be placed in hardware as soon as they are instantiated, as em-



Figure 9: Cascade's optimization flow. Engines transition from software to hardware, and reduced interaction with the runtime.

ulating their behavior in software doesn't make sense (Figure 9.1). This means the time to compile them to hardware can't be hidden by simulation. To address this, Cascade maintains a small catalog of pre-compiled engines for the modules in its standard library. While this allows a program in any compilation state (Figure 9.1–3) to generate IO side effects immediately, interacting with those pre-compiled engines still requires data/control plane communication. Worse, once user logic has migrated to hardware (Figure 9.3), this overhead can represent a majority of Cascade's total runtime.

For these components, inlining is insufficient for eliminating the overhead. There is no source to inline; they are precompiled code which respond to requests as though they were user logic. The solution is to observe that if they were inlined into the user logic engine (Figure 9.3), it would become the single entry and exit point for all runtime/hardware communication. As a result, engines may support ABI forwarding (Figure 7). If so, the runtime can cease direct interaction with standard components and trust the user logic engine to respond to requests on behalf of itself and any standard components it contains (eg. by recursively invoking evaluate requests on those engines, or responding true to there\_are\_updates requests if it or those engines have updates). With this (Figure 9.4), the only obstacle to pure hardware performance becomes the interaction with the runtime's virtual clock.

### 4.4. Open-Loop Scheduling

Processing ABI requests in hardware can be done in a single FPGA clock tick (Section 5). Nonetheless, sending *even one* message between hardware and software per scheduler iteration can be prohibitive. The bandwidth to sustain a virtual clock rate in a typical FPGA range (10–100 MHz) would be an order of magnitude greater (0.1–1 GB/s), a value unlikely to be achieved outside of a very high-performance setting. The key to overcoming this limit is to relax the requirement of direct communication on every scheduler iteration.

Observe that any program in the state shown in Figure 9.4 will exhibit the same schedule. Every iteration, the clock reports there\_are\_updates, an update causes a tick, and the user logic alternates between invocations of evaluate and update until there\_are\_updates and there\_are\_evals return false. Thereafter, the process

repeats. To take advantage of this, hardware engines may support the open\_loop request (Figure 7) which tells an engine to simulate as many iterations as possible of the schedule described above. Control remains in the engine either until an upper limit of iterations is reached, or the evaluation of a system task requires runtime intervention (Figure 9.5).

Because placing control in an engine stalls the runtime, adaptive profiling is used to choose an iteration limit which allows the engine to relinquish control on a regular basis (typically a small number of seconds). Cascade does its best to transition to open loop quickly and stay there for as long as possible. However, whenever a user interaction causes an update to program logic, engines must be moved back into software and the process started anew.

### 4.5. Native Mode

Open-loop scheduling can achieve virtual clock rates within a small constant of native performance (Section 6). However, applications which do not use unsynthesizable Verilog and are no longer undergoing active modification are eligible for one final optimization. Placing Cascade in native mode causes it to compile the user's program exactly as written with an off-the-shelf toolchain. This sacrifices **interactivity**, but achieves full native performance, and is appropriate for applications which are no longer being actively debugged.

## 5. Target-Specific Implementation Details

We conclude our discussion of Cascade with implementation notes for simulator-based software engines and FPGA-resident hardware engines. This material in particularly low-level and provided for the sake of completeness. Readers who wish to return later may safely skip ahead to the evaluation (Section 6).

### 5.1. Software Engines

Software engines use a cycle-accurate event-driven simulation strategy similar to iVerilog [61]. The Verilog source for a subprogram is held in an in-memory AST data structure along with values for its stateful elements. Cascade computes data dependencies at compile-time and uses a lazy evaluation strategy for AST nodes to reduce the overhead of recomputing outputs in response to changes to subprogram inputs. Software

```
wire _updates = _umask ^ _numask;
 1: module Main(
                                                    28:
                                                    29:
 2:
                                                          wire _latch = <LATCH> |
      input wire
                          CLK.
      input wire
 3:
                          RW,
                                                    30:
                                                             (_updates & _oloop);
 4:
      input wire [31:0] ADDR,
                                                    31:
                                                          wire _tasks = _tmask ^
                                                                                  _ntmask;
 5:
      input wire [31:0] IN,
                                                    32:
                                                          wire _clear = <CLEAR>;
 6:
      output wire [31:0] OUT,
                                                    33:
                                                          wire _otick = _oloop & !_tasks;
      output wire
 7:
                                                    34:
                          WATT
 8:);
                                                    35:
                                                          always @ (posedge CLK)
                                                            _umask <= _latch ? _numask : _umask;
 9:
      reg [31:0] _vars[3:0];
                                                    36:
                                                    37:
10:
      reg [31:0] _nvars[3:0];
                                                            _tmask <= _clear ? _ntmask : _tmask;
11:
      reg _umask = 0, _numask = 0;
                                                    38:
                                                            _oloop <= <OLOOP> ? IN :
      reg [ 1:0] _tmask = 0, _ntmask = 0;
                                                              _otick ? (_oloop-1) :
12:
                                                    39:
13:
      reg [31:0] _oloop = 0, _itrs = 0;
                                                    40:
                                                              _tasks ? 0 : _oloop;
14:
                                                            _itrs <= <OLOOP> ? 0 :
                                                    41:
15:
                                                              _otick ? (_itrs+1) : _itrs;
      wire clk_val = _vars[0];
                                                    42:
                                                            _{vars[0]} \leftarrow _{otick} ? (_{vars[0]+1}) :
16:
      wire [3:0] pad_val = _vars[1];
                                                    43:
17:
      wire [7:0] led_val;
                                                    44:
                                                              <SET 0> ? IN : _vars[0];
18:
      wire [7:0] cnt = _vars[2];
                                                    45:
                                                            _vars[1] <= <SET 1> ? IN : _vars[1];
                                                            _vars[2] <= <SET 2> ? IN :
19:
                                                    46:
20:
                                                    47:
      always @(posedge clk_val)
                                                              _latch ? _nvars[2] : _vars[2];
21:
        if (pad_val == 0)
                                                    48:
22:
          _nvars[2] <= pad_val << 1;
                                                    49:
                                                          assign WAIT = _oloop;
23:
          _numask <= ~_umask;
                                                    50:
                                                          always @(*)
24:
                                                    51:
                                                             case (ADDR)
25:
          _nvars[3] <= cnt;
                                                    52:
                                                              0: OUT = clk_val;
26:
          _ntmask <= ~_tmask;
                                                    53:
                                                              // cases omitted ...
27:
      assign led_val = cnt;
                                                    54: endmodule
```

Figure 10: Verilog source code generated by the hardware engine associated with the inlined user logic from the running example.

engines inhabit the same process as the runtime; communication and interrupt scheduling take place through the heap.

## **5.2.** Hardware Engines

Hardware engines translate the Verilog source for a subprogram into code which can be compiled by a blackbox toolchain such as Quartus [33] or Vivado [76]. The code uses an AXI-style memory-mapped IO protocol to interact with a software stub which inhabits the same process as the runtime and mediates communication and interrupt scheduling. We describe these transformation by example. The effect on the code in Figure 4, after inlining r, is shown in Figure 10.

The port declaration on lines 1–8 is typical of AXI and replaces the original. CLK is the native FPGA clock, RW indicates a write or read request at address ADDR, IN and OUT are the buses for those requests, and WAIT is asserted when the FPGA requires more than one cycle to return a result. A kernel module and top-level connections (not shown) map the code's address space into software memory and guarantee that C-style dereferences in the stub produce the appropriate interactions with the resulting hardware. The shorthand <LATCH>, <OLOOP>, etc., represents checks for write requests to distinguished addresses, which serve as an RPC mechanism.

Auxiliary variables are introduced on lines 9–13. \_vars is a storage array with one element for each of Main's inputs (clk\_val and pad\_val), stateful elements (cnt), and

instance of a variable in a display statement (cnt again), \_umask and \_tmask are used for tracking updates and system tasks, \_oloop and \_itrs are used while running in open-loop mode, and the shadow variables \_nvars, \_numask, etc., are used to store values for their counterparts on the (n)ext update request. Mappings between these variables and the names in the original code appear on lines 15–18. The text of the original program appears on lines 20–27, only slightly modified. Update targets are replaced by their shadow variable counterparts (\_nvars[2]), and the corresponding bit in the update mask is toggled. System tasks are treated similarly. Values which appear in display statements are saved (\_nvars[3]), and the bit in the task mask that corresponds to each system task is toggled.

The remaining code supports the Engine ABI. read, write, get\_state, and set\_state are defined in terms of memory dereferences. Lines 49–53 provide access to any of the subprogram's outputs, stateful elements, or variables which appear in a display statement, and lines 43–47 provide write access to its inputs and stateful elements. there\_are\_updates is defined in terms of a read of the \_updates variable (line 28), which becomes true when one or more shadow variables are changed, and update is defined in terms of a write to the \_latch variable, which synchronizes those variables with their counterparts and clears the update mask (lines 36 and 46). The definition of evaluate



Figure 11: Proof of work performance benchmark.

involves reading the subprogram's output variables and checking the \_tasks variable (line 31) which becomes true when one or more tasks are triggered. If any of those tasks are display statements, the values of their arguments at the time of triggering are read out (lines 49-53), formatted in the software stub, and forwarded to the runtime. Thereafter, writing the \_clear variable (line 32) resets the task mask. Writing the \_oloop variable (line 38) places the code into the control loop described in Section 4.4. Control alternates between toggling the clock variable (line 38) and triggering updates (line 29) until either the target number of iterations is achieved or a task is triggered (lines 33 and 38).

Hardware engines may also establish ABI forwarding (not shown) for the standard components they contain. For combinational elements such as the pads and LEDs in the running example, this involves promoting the two subprogram variables <code>led\_val</code> and <code>pad\_val</code> to subprogram input/outputs, and connecting them to the corresponding IO peripherals.

## 6. Evaluation

We evaluated Cascade using a combination of real-world application and user study. All of the experiments described in this section were performed using the initial open-source release of Cascade (Version 1.0). The release consists of 25,000 lines of C++ code, along with several thousand lines of target-specific Verilog. As an experimental platform, we used an Intel Cyclone V SoC device [7] which consists of an 800 MHz dual core ARM processor, a reprogrammable fabric of 110K logic elements with a 50 MHz clock, and 1 GB of shared DDR3 memory. Cascade's runtime and software engines were configured to run on the ARM cores, and its hardware engines on the FPGA. Compilation for the code generated by Cascade's hardware engines was performed using Intel's Quartus Lite compiler (Version 17.0). In order to isolate Cascade's performance properties, the compiler was run on a separate networked server consisting of a four core 2.5 GHz Core i7 with 8 GB of DDR3 memory. In all cases, Cascade's software behavior was compute bound, exhibiting 100% CPU utilization with a memory footprint of less than 10 MB.



Figure 12: Streaming regular expression IO/s benchmark.

### 6.1. Proof of Work

We used Cascade to run a standard Verilog implementation of the SHA-256 proof of work consensus algorithm used in bitcoin mining [2]. The algorithm combines a block of data with a nonce, applies several rounds of SHA-256 hashing, and repeats until it finds a nonce which produces a hash less than a target value. The algorithm is typical of applications which can benefit from FPGA acceleration: it is embarrassingly parallel, deeply pipelineable, and its design may change suddenly, say, as the proof of work protocol evolves over time.

Figure 11 compares Cascade against Intel's Quartus compiler, and the open source iVerilog simulator [61]. Clock rate over time is shown on a log scale. iVerilog began execution in under one second, but its performance was limited to a virtual clock rate of 650 Hz. Quartus was able to lower the design onto the FPGA and achieve the full 50 MHz native performance, but only after ten minutes of compilation. Cascade was able to achieve the best of both worlds. Cascade began execution in under one second, and achieved a 2.4× faster virtual clock rate through simulation, while performing hardware compilation in the background. When compilation was finished and control was transitioned to open-loop hardware execution, Cascade was able to achieve a virtual clock rate within  $2.9 \times$  of the native clock while still providing support for unsynthesizable Verilog. The spatial overhead of the bitstream generated by Cascade's hardware engine was small but noticeable: 2.9× that of a direct compilation using Quartus, mainly due to support for get\_state and set\_state ABI requests. In native mode, Cascade's performance and spatial requirements were identical to Quartus's.

## 6.2. Regular Expression Streaming

We used Cascade to run a streaming regular expression matching benchmark generated by a tool similar to the Snort packet sniffer [68] or an SQL query accelerator [34]. In contrast to the previous example, this benchmark also involved an IO peripheral: a FIFO queue used to deliver bytes from the host device to the matching logic. While a real-world application would batch its computation to mask communication overhead,



Figure 13: Timing results from user study (data points for the Quartus IDE shown in blue, for Cascade shown in green).

we modified the benchmark to process one byte at a time. This allowed us to measure Cascade's ability to match the memory latency to an IO peripheral provided by the Quartus compiler.

Figure 12 compares Cascade against Intel's Quartus compiler. IO operations per second (tokens consumed) are plotted against time on a log scale. No comparison is given to iVerilog as it does not provide support for interactions with IO peripherals. The implementations are identical, with one exception: the Quartus implementation used the FIFO IP provided by the Quartus IDE, while the Cascade implementation used the FIFO data structure provided by Cascade's standard library. In both cases, host to FPGA transport took place over a memory mapped IO bus [6]. The details were distinct, but not meaningfully different with respect to performance.

Cascade began execution in under one second and achieved an IO latency of 32 KIO/s through simulation. In the same amount of time required for the Quartus implementation to finish compiling (9.5 minutes), Cascade was able to transition to open-loop hardware execution and achieve an IO latency of 492 KIO/s, nearly identical to the 560 KIO/s of the Quartus implementation. In this case, the spatial overhead of the bit-stream generated by Cascade's hardware engines was slightly larger (6.5×), though commensurate with that of similar research architectures [40]. As before, Cascade's performance and spatial requirements were identical to Quartus when run in native mode.

## 6.3. User Study

We used Cascade to perform a small user study (n=20) to test whether JIT compilation can improve the hardware development process. Subjects were drawn from a pool of Computer Science PhD students and full time staff at a large software company, and familiarity with hardware programming was mixed, ranging from none to strong. Subjects were given a 30 minute primer on Verilog and taken to a workstation consisting of an FPGA with IO peripherals, and a hardware development environment. The control group's environment was the Quartus IDE, and the experiment group's was Cascade. In both cases the peripherals were the same: four buttons and strip

of 72 individually addressable multi-colored LEDs. Also in both cases, the environments were pre-loaded with a small (50 line) program intended to produce a behavior described in the primer (e.g. pressing one button would cause the LEDs to turn red, pressing another would cause the LEDs to blink in sequence). The subjects were then told that the program contained one or more bugs which would prevent the FPGA from behaving as described. Their task was to fix the bugs and demonstrate a working program as quickly as possible.

Figure 13 summarizes the results of the study. For each participant, we recorded total number of compilations performed, time spent compiling, and time spent testing and debugging in between compilations. The left scatter plot compares number of compilations against time required to complete the task. On average, participants who used Cascade performed 43% more compilations, and completed the task 21% faster than those who used Quartus. Free responses indicated that Cascade users were less concerned with 'wasting time', and more likely to consider using an FPGA in the future. The right scatter plot compares time spent compiling against time spent in between compilations. Participants who used Cascade spent 67× less time compiling, but spent only slightly less time testing and debugging. This agreed with free responses which suggested that while Cascade encouraged faster compilation, it did not encourage sloppy thought.

### 7. Limitations

Before closing, we briefly consider Cascade's technical limitations and anti-goals that it does not seek to achieve.

## 7.1. Timing Critical Applications

Cascade presents the illusion that modifications to a running program produce immediately visible hardware side-effects. This is possible because Cascade abstracts away the details of how hardware-located standard library components interact with software-located user logic. For peripherals such as LEDs, the effects of the timing mismatch between the domains are negligible. For others such as FIFOs, back pressure (e.g. a full signal) is sufficient for guaranteeing that the data rate of the peripheral does not outstrip the compute throughput of Cascade's software engines. However for applications that use high-performance peripherals (eg. a giga-bit ethernet switch) it is unclear how to preserve higher-order program semantics such as QoS guarantees for compilation states in which user logic has not yet been migrated to hardware.

## 7.2. Non-Monotonic Language Features

The soundness of executing code immediately after it is eval'ed depends on the invariant that subsequent eval's do not affect the semantics of that code. This is the reason why Cascade's REPL gives users the ability to add code to a running program, but neither the ability to edit nor delete it. Supporting either feature would violate this property. The Verilog

specification describes support for several language features that would violate this property for insertions as well. Specifically, it is syntactically legal to re-parameterize modules after they have been instantiated (this is akin to changing a C++ template parameter after an object is created). While Cascade does not support these features, they are deprecated, and will not appear in subsequent revisions of the specification.

### 8. Related Work

FPGAs are a mature technology with a long history as target of research. We offer a necessarily brief survey of that work.

### 8.1. Programming and Compilation

FPGAs are programmed at many levels of abstraction. They are often the target of domain specific languages [19, 45, 63, 11, 53, 67]. In other cases, developers may use frameworks such as OpenCL [41] and Lime [10] or commercial highlevel synthesis tools such as Xilinx AutoESL [20], which transform C-style code into synthesizable RTL, or HDLs such as Verilog [4], VHDL [5], or BlueSpec [54]. For applications with strict runtime requirements, experts may target these lowest level languages directly. Compilation at this level is a serious bottleneck and the primary focus of our work.

Many systems in the software domain seek to reduce the overhead of existing compilers. ccache [72], distcc [58], and icecream [27] are gcc frontends that minimize redundant recompilation of sub-components and execute non-interfering tasks simultaneously. Microsoft Cloudbuild [25], Google Bazel [9], and Vesta [32] are distributed caching build systems. These systems do not translate to the hardware domain, where whole-program compilation is the norm. Cascade is an instance of a JIT system. It makes multiple invocations of the compiler in the context of a runtime environment. JIT techniques are used in the compilers for many popular software languages including Java, JavaScript, Python, Matlab, and R.

### 8.2. Simulation

Hardware simulators can be classified into two partially overlapping categories: event-driven and cycle-accurate. High-fidelity simulators such as those provided by Quartus [33] and Vivado [76] operate at speeds on the order of 10–100 Hz, but are accurate with respect to asynchronous logic and multiple clock domains. Interpreted simulators such as iVerilog [61] do not offer all of these features, but are somewhat faster, approaching 1 KHz. Compiled simulators such as Verilator [64] can operate in the 10 KHz range. Cascade uses JIT compilation techniques to interpolate between these performance domains and native rates of 10 to 100 MHz.

## 8.3. Operating Systems and Virtualization

Coordinating software simulation and native execution in a runtime environment requires design choices which resemble operating system and virtualization primitives. Many of these have been explored in the context of FPGAs: spatial multiplexing [26, 69, 74, 18], task preemption [48], relocation [37], context switching [47, 60] and interleaved hardwaresoftware execution [13, 69, 74, 29]. Several projects have extended these concepts to full-fledged operating systems for FPGA. These include ReconOS [50], Borph [66, 65], and MURAC [30]. Others have extended these concepts to FPGA hypervisors. These include CODEZERO [56], Zippy [57], TARTAN [52], and SCORE [23]. Chen et al. explore virtualization challenges that arise in a setting where FPGAs are a shared resource [17]. The work integrates Xilinx FPGAs in OpenStack [62] and Linux-KVM [44], and supports isolation across processes in different virtual machines. Amazon's EC2 F1 FPGA instances [24] are connected to each other through a dedicated isolated network such that sharing between instances, users, and accounts is not permitted. Microsoft Azure Olympus [51] servers are expected to follow a similar model.

### **8.4. Communication Models**

Many connection strategies exist for exposing FPGAs as hardware accelerators. In coprocessor-coupled platform such as ZYNQ [21] and Arria [31] an FPGA is connected to a dedicated CPU which is tasked with mediating interaction with the host system. In host-coupled platforms, there is no coprocessor. Instead, FPGA fabric must be set aside for the implementation of a mediation layer such as a PCIe bridge or an Avalon Memory Bus [6]. Cascade's hardware engines are an instance of the latter strategy.

## 8.5. Abstraction and Compatibility

Cascade's runtime environment is an instance of an overlay system; it maps a coarse-grain model onto a finer-grained target. Examples of overlays include ZUMA [12], VirtualRC [43], and RCMW [42] which provide bitstream independence across different hardware and toolchains, and VForce [53] which enables the same application to be run on different reconfigurable supercomputers.

## 9. Conclusion and Future Work

Compilation is a painfully slow part of the hardware design process and a major obstacle to the widespread adoption of FPGA technology. In this paper we presented Cascade, the first JIT compiler for Verilog. Cascade allows users to test and deploy code in the same environment, ignore the distinction between synthesizable and unsynthesizable Verilog, and to enjoy cross-platform portability, while requiring only minimal changes to their code. Cascade tightens the compile-test-debug cycle and allows users to modify programs as they are run. Side-effects on IO peripherals become visible immediately, debugging incurs no more than a  $3 \times$  performance penalty, and full native performance is supported for finalized designs.

Future work will explore the development of dynamic optimization techniques which can produce performance and layout improvements by specializing a program to the input values it encounters at runtime. Future work will also consider the use of Cascade as a platform for FPGA virtualization. Specifically, multi-runtime aware backends could be used to temporarily and spatially multiplex FPGA fabric, and Cascade's ability to move programs back and forth between hardware and software could be used to bootstrap virtual machine migration for systems that use hardware accelerators.

### References

- [1] Debian Details of Package fpgatools. https: //packages.debian.org/stretch/fpgatools. (Accessed July 2018).
- [2] FPGAMiner. https://github.com/fpgaminer/Open-Source-FPGA-Bitcoin-Miner. (Accessed July 2018).
- [3] SymbiFlow. https://symbiflow.github.io/. (Accessed July 2018).
- [4] Ieee standard for verilog hardware description language. *IEEE Std* 1364-2005 (Revision of IEEE Std 1364-2001), pages 1–560, 2006.
- [5] Ieee standard vhdl language reference manual. *IEEE Std 1076-2008* (*Revision of IEEE Std 1076-2002*), pages c1–626, Jan 2009.
- [6] Avalon interface specifications, 2017.
- [7] Device handbook altera cyclone v, 2017.
- [8] Intel unveils new Xeon chip with integrated FPGA, touts 20x performance boost ExtremeTech, 2017.
- [9] K Aehlig et al. Bazel: Correct, reproducible, fast builds for everyone, 2016.
- [10] Joshua Auerbach, David F. Bacon, Perry Cheng, and Rodric Rabbah. Lime: A java-compatible and synthesizable language for heterogeneous architectures. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 89–108, New York, NY, USA, 2010. ACM.
- [11] Jonathan Bachrach, Huy Vo, Brian C. Richards, Yunsup Lee, Andrew Waterman, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. Chisel: constructing hardware in a scala embedded language. In The 49th Annual Design Automation Conference 2012, DAC '12, San Francisco, CA, USA, June 3-7, 2012, pages 1216–1225, 2012.
- [12] Alexander Brant and Guy GF Lemieux. Zuma: An open fpga overlay architecture. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pages 93–96. IEEE, 2012.
- [13] Gordon J. Brebner. A virtual hardware operating system for the xilinx xc6200. In Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers, FPL '96, pages 327–336, London, UK, UK, 1996. Springer-Verlag.
- [14] Stuart Byma, Naif Tarafdar, Talia Xu, Hadi Bannazadeh, Alberto Leon-Garcia, and Paul Chow. Expanding openflow capabilities with virtualized reconfigurable hardware. In *Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, FPGA '15, pages 94–97, New York, NY, USA, 2015. ACM.
- [15] Jared Casper and Kunle Olukotun. Hardware acceleration of database operations. In *Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays*, FPGA '14, pages 151–160, New York, NY, USA, 2014. ACM.
- [16] Adrian Caulfield, Eric Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, October 2016.
- [17] Fei Chen, Yi Shan, Yu Zhang, Yu Wang, Hubertus Franke, Xiaotao Chang, and Kun Wang. Enabling fpgas in the cloud. In *Proceedings of the 11th ACM Conference on Computing Frontiers*, CF '14, pages 3:1–3:10, New York, NY, USA, 2014. ACM.
- [18] Liang Chen, Thomas Marconi, and Tulika Mitra. Online scheduling for multi-core shared reconfigurable fabric. In *Proceedings of the Conference on Design, Automation and Test in Europe*, DATE '12, pages 582–585, San Jose, CA, USA, 2012. EDA Consortium.
- [19] Eric S. Chung, John D. Davis, and Jaewon Lee. Linqits: Big data on little clients. In 40th International Symposium on Computer Architecture. ACM, June 2013.

- [20] Philippe Coussy and Adam Morawiec. High-level synthesis: from algorithm to digital circuit. Springer Science & Business Media, 2008.
- [21] Louise H Crockett, Ross A Elliot, Martin A Enderwitz, and Robert W Stewart. The Zynq Book: Embedded Processing with the Arm Cortex-A9 on the Xilinx Zynq-7000 All Programmable Soc. Strathclyde Academic Media. 2014.
- [22] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. Fpgp: Graph processing framework on fpga a case study of breadth-first search. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '16, pages 105–110, New York, NY, USA, 2016. ACM.
- [23] André DeHon, Yury Markovsky, Eylon Caspi, Michael Chu, Randy Huang, Stylianos Perissakis, Laura Pozzi, Joseph Yeh, and John Wawrzynek. Stream computations organized for reconfigurable execution. *Microprocessors and Microsystems*, 30(6):334–354, 2006.
- [24] Amazon EC2. Amazon ec2 f1 instances, 2017.
- [25] Hamed Esfahani, Jonas Fietz, Qi Ke, Alexei Kolomiets, Erica Lan, Erik Mavrinac, Wolfram Schulte, Newton Sanches, and Srikanth Kandula. Cloudbuild: Microsoft's distributed and caching build service. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016 Companion Volume, pages 11–20, 2016.
- [26] W. Fu and K. Compton. Scheduling intervals for reconfigurable computing. In Field-Programmable Custom Computing Machines, 2008. FCCM '08. 16th International Symposium on, pages 87–96, April 2008.
- [27] K Funk et al. icecream, 2016.
- [28] GNU. Gtkwave. http://gtkwave.sourceforge.net. (Accessed July 2018).
- [29] Ivan Gonzalez, Sergio Lopez-Buedo, Gustavo Sutter, Diego Sanchez-Roman, Francisco J. Gomez-Arribas, and Javier Aracil. Virtualization of reconfigurable coprocessors in hprc systems with multicore architecture. J. Syst. Archit., 58(6-7):247–256, June 2012.
- [30] B. K. Hamilton, M. Inggs, and H. K. H. So. Scheduling mixedarchitecture processes in tightly coupled fpga-cpu reconfigurable computers. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, pages 240–240, May 2014.
- [31] Arria V Device Handbook. Volume 1: Device overview and datasheet. 2012.
- [32] Allan Heydon, Timothy Mann, Roy Levin, and Yuan Yu. Software Configuration Management Using Vesta. Monographs in Computer Science. Springer, 2006.
- [33] Intel. Intel quartus prime software, 2018.
- [34] Zsolt István, David Sidler, and Gustavo Alonso. Caribou: Intelligent distributed storage. PVLDB, 10(11):1202–1213, 2017.
- [35] Zsolt István, David Sidler, Gustavo Alonso, and Marko Vukolic. Consensus in a box: Inexpensive coordination in hardware. In Proceedings of the 13th Usenix Conference on Networked Systems Design and Implementation, NSDI'16, pages 425–438, Berkeley, CA, USA, 2016. USENIX Association
- [36] Alexander Kaganov, Asif Lakhany, and Paul Chow. Fpga acceleration of multifactor cdo pricing. ACM Trans. Reconfigurable Technol. Syst., 4(2):20:1–20:17, May 2011.
- [37] H. Kalte and M. Porrmann. Context saving and restoring for multitasking in reconfigurable systems. In Field Programmable Logic and Applications, 2005. International Conference on, pages 223–228, Aug 2005.
- [38] Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon Kuhnle, Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus Stengel. Cheapbft: Resource-efficient byzantine fault tolerance. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 295–308, New York, NY, USA, 2012. ACM.
- [39] Kaan Kara and Gustavo Alonso. Fast and robust hashing for database operators. In 26th International Conference on Field Programmable Logic and Applications, FPL 2016, Lausanne, Switzerland, August 29 -September 2, 2016, pages 1–4, 2016.
- [40] Ahmed Khawaja, Joshua Landgraf, Rohith Prakash, Michael Wei, Eric Schkufza, and Christopher Rossbach. Sharing, protection and compatibility for reconfigurable fabric with amorphos. *To Appear in OSDI*, 2018.
- [41] Khronos Group. The OpenCL Specification, Version 1.0, 2009.
- [42] Robert Kirchgessner, Alan D. George, and Greg Stitt. Low-overhead fpga middleware for application portability and productivity. ACM Trans. Reconfigurable Technol. Syst., 8(4):21:1–21:22, September 2015.

- [43] Robert Kirchgessner, Greg Stitt, Alan George, and Herman Lam. Virtualre: A virtual fpga platform for applications and tools portability. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '12, pages 205–208, New York, NY, USA, 2012. ACM.
- [44] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. kvm: the linux virtual machine monitor. In *Proceedings of the Linux symposium*, volume 1, pages 225–230, 2007.
- [45] Ilia A. Lebedev, Christopher W. Fletcher, Shaoyi Cheng, James Martin, Austin Doupnik, Daniel Burke, Mingjie Lin, and John Wawrzynek. Exploring many-core design templates for fpgas and asics. *Int. J. Reconfig. Comp.*, 2012:439141:1–439141:15, 2012.
- [46] Christian Leber, Benjamin Geib, and Heiner Litz. High frequency trading acceleration using fpgas. In Proceedings of the 2011 21st International Conference on Field Programmable Logic and Applications, FPL '11, pages 317–322, Washington, DC, USA, 2011. IEEE Computer Society.
- [47] Trong-Yen Lee, Che-Cheng Hu, Li-Wen Lai, and Chia-Chun Tsai. Hardware context-switch methodology for dynamically partially reconfigurable systems. J. Inf. Sci. Eng., 26:1289–1305, 2010.
- [48] L. Levinson, R. Manner, M. Sessler, and H. Simmler. Preemptive multitasking on fpgas. In Field-Programmable Custom Computing Machines, 2000 IEEE Symposium on, pages 301–302, 2000.
- [49] Sheng Li, Hyeontaek Lim, Victor W. Lee, Jung Ho Ahn, Anuj Kalia, Michael Kaminsky, David G. Andersen, O. Seongil, Sukhan Lee, and Pradeep Dubey. Architecting to achieve a billion requests per second throughput on a single key-value store server platform. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture, ISCA '15, pages 476–488, New York, NY, USA, 2015. ACM.
- [50] Enno Lübbers and Marco Platzner. Reconos: Multithreaded programming for reconfigurable computers. ACM Trans. Embed. Comput. Syst., 9(1):8:1–8:33, October 2009.
- [51] Microsoft. Microsoft azure goes back to rack servers with project olympus, 2017.
- [52] Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein, and Mihai Budiu. Tartan: Evaluating spatial computation for whole program execution. SIGOPS Oper. Syst. Rev., 40(5):163–174, October 2006.
- [53] Nicholas Moore, Albert Conti, Miriam Leeser, Benjamin Cordes, and Laurie Smith King. An extensible framework for application portability between reconfigurable supercomputing architectures, 2007.
- [54] Rishiyur Nikhil. Bluespec system verilog: efficient, correct rtl from high level specifications. In Formal Methods and Models for Co-Design, 2004. MEMOCODE'04. Proceedings. Second ACM and IEEE International Conference on, pages 69–70. IEEE, 2004.
- [55] Tayo Oguntebi and Kunle Olukotun. Graphops: A dataflow library for graph analytics acceleration. In *Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, FPGA '16, pages 111–117, New York, NY, USA, 2016. ACM.
- [56] K. Dang Pham, A. K. Jain, J. Cui, S. A. Fahmy, and D. L. Maskell. Microkernel hypervisor for a hybrid arm-fpga platform. In Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on, pages 219–226, June 2013.
- [57] Christian Plessl and Marco Platzner. Zippy-a coarse-grained reconfigurable array with support for hardware virtualization. In Application-Specific Systems, Architecture Processors, 2005. ASAP 2005. 16th IEEE International Conference on, pages 213–218. IEEE, 2005.
- [58] M Pool et al. distcc: A free distributed c/c++ compiler system, 2016.
- [59] Andrew Putnam, Adrian Caulfield, Eric Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, Jim Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. A reconfigurable fabric for accelerating large-scale datacenter services. In 41st Annual International Symposium on Computer Architecture (ISCA), June 2014.
- [60] Kyle Rupnow, Wenyin Fu, and Katherine Compton. Block, drop or roll(back): Alternative preemption methods for RH multi-tasking. In FCCM 2009, 17th IEEE Symposium on Field Programmable Custom Computing Machines, Napa, California, USA, 5-7 April 2009, Proceedings, pages 63–70, 2009.
- [61] J. Russell and R. Cohn. Icarus Verilog. Book on Demand, 2012.
- [62] Omar Sefraoui, Mohammed Aissaoui, and Mohsine Eleuldj. Openstack: toward an open-source solution for cloud computing. *Interna*tional Journal of Computer Applications, 55(3), 2012.

- [63] Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ning-Yi Xu, and Huazhong Yang. Fpmr: Mapreduce framework on fpga. In Peter Y. K. Cheung and John Wawrzynek, editors, FPGA, pages 93–102. ACM, 2010.
- [64] W Snyder, D Galbi, and P Wasson. Verilator, 2018.
- [65] Hayden Kwok-Hay So and Robert Brodersen. A unified hardware/software runtime environment for fpga-based reconfigurable computers using borph. ACM Trans. Embed. Comput. Syst., 7(2):14:1–14:28, January 2008.
- [66] Hayden Kwok-Hay So and Robert W. Brodersen. BORPH: An Operating System for FPGA-Based Reconfigurable Computers. PhD thesis, EECS Department, University of California, Berkeley, Jul 2007.
- [67] Hayden Kwok-Hay So and John Wawrzynek. Olaf'16: Second international workshop on overlay architectures for fpgas. In *Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, FPGA '16, pages 1–1, New York, NY, USA, 2016. ACM.
- [68] Haoyu Song, Todd S. Sproull, Michael Attig, and John W. Lockwood. Snort offloader: A reconfigurable hardware NIDS filter. In Proceedings of the 2005 International Conference on Field Programmable Logic and Applications (FPL), Tampere, Finland, August 24-26, 2005, pages 493–498, 2005.
- [69] C. Steiger, H. Walder, and M. Platzner. Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks. IEEE Transactions on Computers, 53(11):1393–1407, Nov 2004.
- [70] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '16, pages 16–25, New York, NY, USA, 2016. ACM.
- [71] Lambert M. Surhone, Mariam T. Tennoe, and Susan F. Henssonow. Node. Js. Betascript Publishing, Mauritius, 2010.
- [72] A Tridgell, J Rosdahl, et al. ccache: A fast c/c++ compiler cache, 2016.
- [73] A. Tsutsui, T. Miyazaki, K. Yamada, and N. Ohta. Special purpose fpga for high-speed digital telecommunication systems. In *Proceedings* of the 1995 International Conference on Computer Design: VLSI in Computers and Processors, ICCD '95, pages 486–491, Washington, DC, USA, 1995. IEEE Computer Society.
- [74] G. Wassi, Mohamed El Amine Benkhelifa, G. Lawday, F. Verdier, and S. Garcia. Multi-shape tasks scheduling for online multitasking on fpgas. In Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), 2014 9th International Symposium on, pages 1–7, May 2014.
- [75] Clifford Wolf. Yosys open synthesis suite. http:// www.clifford.at/yosys/. (Accessed July 2018).
- [76] Xilinx. Vivado design suite, 2018.
- [77] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '15, pages 161–170, New York, NY, USA, 2015. ACM.