Antonino Tumeo at OSDA 2023 -- SODA Synthesizer: An Open-Source, End-to-End Hardware Compiler

This talk presents the SODA (Software Defined Accelerators) framework, an open-source modular, multi-level, no-human-in-the-loop, hardware compiler that enables end-to-end generation of specialized accelerators from high-level data science frameworks. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming environments and allows performing system level design, and Bambu, a state-of-the-art high-level synthesis (HLS) engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of novel "converged" applications. We then discuss some of the research opportunities that such an open-source framework allows.

[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Many people, including Professor Fabrizio Ferrandi, for the High-Level Synthesis tool is in the room. We just gave the tutorial this morning, so hopefully you attended also that. And you know what I'm going to talk about, which is like motivating the work first. So obviously, data science algorithms approach frameworks; They keep evolving. That's an artificial intelligence world now, right? And we know that domain-specific accelerators are kind of the only way to keep increasing performance and energy constraints. And like all the accelerators that exist today, really typical are their process, right? Find a pattern that you can accelerate, iterate, new application, find another pattern. There is a productivity gap here. And instead, like in the world that we are in now, the actual algorithm designer wants to have an opportunity to take and design its own custom accelerator. So the reality is that we need the tools to quickly transition from the model, machine learning model, or data analytics program, down to the chiplet implementation.

Our solution is the SODA Synthesizer. It's a modular, multi-level, interoperable, extensible, open-source hardware compiler from high-level programming framework to silicon. So compiler-based, because it has a front-end that is based on MLIR, it's called SODA-OPT. And it's compiler-based because the back-end is based on a state-of-the-art high-level synthesis tool, PandA-Bambu. We also support coarse-grain, reconfigurable array generator, still open-source, OpenCGRA, and I'm going a little more into the detail of Bambu in this presentation. In all cases, we generate synthesizable Verilog, and targets can be both FPGAs or ASICs. The beautiful thing is, since this is all compiler-based, obviously, design space exploration is changing compiler optimization passes and parameters, right? And you can build your own design space exploration language in this way. And there are a few references, if you want more details, obviously. In 10 minutes, it's difficult to go through all of it.

The front-end is SODA-OPT. It's based on MLIR. SODA-OPT stands for Search, Outline, Dispatch, Accelerate, and Optimize. It's based on MLIR, as I was saying, because it's also used now in a lot of high-level machine learning tools, like TensorFlow/Runtime, ONNX-MLIR, Torch-MLIR. We support those lower-to-linear algebra in MLIR, and then start with our optimization. SODA-OPT does the partitioning and the optimization, so the snippets that you partition and you want to accelerate. And also generates, that's the other beautiful thing of MLIR, is that you generate MLIR or runtime calls, and you can generate also all the glue logic to control the accelerators from the host. It's open source. That's the link. You can find also the tutorial from this morning, I guess.

The back-end, in particular, the HLS back-end is PandA-Bambu. It's potentially now, at least in terms of development, the only open-source high-level synthesis tool remaining that is complete, in a sense. Key features that we added during the years, are parallel accelerator support, modular high-level synthesis, and support to target ASICs with open-source tools. And obviously, we want to verify whatever the high-level synthesis tool does, so there is a significant part that is devoted to generate automated testing and verification. It's modular. We also support commercial tools, but today we are talking about open-source, right? Then, with MLIR, obviously, you can feed that to Vivado HLS, and we have numbers to prove it.

But why an HLS back-end? And why going through progressive lowering? Well, maybe with HLS, you are not going to get the fastest solution possible, especially if you want to do true lowering. But if you have a good HLS engine, you can still deal with a general solution and generate an accelerator. And you can provide opportunities also for finding specialized patterns and create custom accelerators. We use that, for example, to do the support for multi-tenant accelerators. And with respect to the other solutions that use HLS, we keep going through progressive lowering, which is like how a compiler should do, right? It's more elegant, and you don't need to raise and lose information from the fact that you are writing back something in higher-level. And again, as I said, new optimizations are compiler passes. You can devise a design space optimization problem, right, as a compiler design space optimization.

A couple of words on the ASIC target, in particular. We also tested with commercial tools, but we regularly, and that's also the focus of our tutorial, use the OpenROAD suite, both with OpenPDK 45 nanometers and the ASAP 7 nanometer cells. So that you can really evaluate your algorithms from the high-level implementation down to the results provided by OpenROAD. And Bambu has a feature, through a tool, to characterize the resource library, depending on the target technology. And it's used for both FPGA target and like OpenPDK and ASAP are provided.

This is just a list of the optimization that SODA-OPT supports. I'm not going too much into the details through them. The key information is that we do optimization for both memory and computational intensity before once they are, the code snippet is separated. The code snippet that you want to accelerate is separated from the rest of the application. And like the memory optimization are obviously very relevant because you can localize things and then work together with other synthesis tools to add buffers, for example, multiple memory ports.

To demonstrate the flow, there are a few numbers with PolyBench with ASAP 7 nanometers, but probably the nicest thing is this picture [Slide 10], right? We partitioned the LeNet and then this is generated with a NanGateFreePDK 45 nanometers. You can see the version that are not accelerated and the version that are accelerated with optimization from SODA-OPT. Obviously it's visually nice, the optimized solutions are bigger, but they are also faster. And I think I have the number, whoops, yeah, but like in general, right, they are faster.

I am quickly going through, in the last couple of minutes, a couple of research opportunities. Obviously, this is an open source design automation workshop, the open source ecosystem, right? I hope that you quickly saw how SODA-OPT right now demonstrated that open source tools can seamlessly integrate, right? Obviously, I worked on Bambu, but we developed the SODA-OPT on top of it after. And we use OpenROAD regularly. So there is a great opportunity to do this, right? And you can also integrate with commercial tools. Actually, with Professor Andrew Kahng, we had a special session in ICCAD talking about that. And that's another opportunity that with open source tool we have that before was not available. There is significant opportunities to support intellectual property and IP blocks. There are opportunities in supporting prototyping platform and FPGA generators. I think also Professor Gaillardon had a talk today on OpenFPGA, right? So that's another opportunity. You can configure even the FPGA that you're embedded FPGA that you're going to generate.

Yeah, one example of a platform, for example, is the embedded scalable platforms from Columbia University when we are working with Bambu as the open source tool. And like other things, this is a compiler. So profile-driven synthesis, you can, especially on the memory part that I was talking about, you can take, optimize, and instrument on a host and then regenerate the architecture that is optimized for it.

And I have one minute, but I need to flash this. If you didn't understand already, it's all open source. It's all available. There is the whole tutorial. You just take the picture [Slide 15], go visit, try the tool. But yeah, that's the SODA Synthesizer. We implement an end-to-end silicon compiler, compiler base for the generation of domain-specific accelerators. So hopefully it's a first step, right, in creating this ecosystem of open source tool that can go from high-level specification down directly to the other. And I'm happy to answer any questions. Thank you.

Q&A

I did not really understand the standard cells picture. Can you elaborate a bit more?

Oh, yeah. [Slide 10] So this is just an example on how we do the partitioning, right? We were tasked, in this case, we were tasked to kind of generate chiplets out of this network, right? So what we did was, using MLIR, we can do partitioning of the specification at the different granularities. We decided, I mean, it's just a simple thing, right? We decided to partition operator by operator. And then we went through the optimizations of our MLIR tool to optimize the different accelerators for each of the operators. But this is more, again, it's visually nice. The complete study also looks at how you kind of actually do the operator fusion, right? Because sometimes this is not convenient. But this was kind of a nice example to show end-to-end synthesis. Suppose that then you want to attach this with chiplets, right, with a chiplet interface. That's a simple pipeline that implements the model.
Thank you for your interesting work. It's not fully clear to me yet to what extent you get most out of your generated custom hardware accelerators, let's say, versus fully programmable flexible accelerators, which would usually require some form of compiler generation as well, to which you'd be committed at the end?

So yes, I don't have the right picture here. But the main focus of this is fully custom accelerators, right? Use the MLIR tool, this one [Slide 5]. This is a little better [Slide 4]. Use the MLIR tool to partition the specification at different granularities, right? If you look at our tutorial, we show we can do operator by operator, depending on the dialect of MLIR that you choose, or insert a specific part of our SODA dialect to the [?] to do the partitioning. And this can be obviously automated. Then, though, MLIR has a wonderful thing, that one of the lowering targets, by default, is a runtime, right? You can define your own runtime. So that generates the glue logic for, instead, the microcontroller.
The accelerator is a fixed function, but you can use the compiler to affect the...

It's a fixed function. Obviously, with MLIR synthesis, right, you can even write your kind of changing adaptable accelerator in C, and then get it converted. It's efficient. Not always, but...
Hi. Thanks a lot for the talk. I would be interested in how you represent parameters in the finalized design, or network parameters.

So parameters can be either constant, or can be loaded from the memory. One of the things that you are studying with, actually, accelerator, where you can change the modality, right, is if they need to be input stationary, or you want something that is also stationary, and you need to stream in the weights. They are stored in memory in our model, right, and then brought in into a local VRAM before computation.
Thank you for your patience. My question is regarding the result that you presented, that you have 4x area increase on 15x speedup. Is it because your focus of optimization is on speedup?

Yes.
So is it possible to do some multi-construct optimization?

Yes. So I don't have this idea. But one of the things that you can do is, obviously, explore, set which parameter you want to meet, right, and then perform the SODA-OPT optimization passes, trying to meet those constraints. It's not completely finalized yet, but we are adding a design space exploration engine in Python, where you should be able to implement your own heuristic, right, to do the exploration, changing the parameters.

Share this page!

SODA Synthesizer: An Open-Source, End-to-End Hardware Compiler

Q&A