Workshop Program

As the premier European academic forum for open-source design automation, OSDA aims to connect the leading proponents of the field. We invited the authors of major tools and flows to talk about their recent activities promoting open-source hardware and open-source design automation.

We have a very dense program: every speaker gives a 10-minute talk, followed by a 5-minute Q&A session.


Christian Krieg

Post-Doctoral Researcher and Teacher at TU Wien
Welcome Session

The workshop chair opens the workshop, welcomes speakers and the audience, provides some information on the workshop, and pitches the posters to be presented in the poster session.

Video will be released July 27, 2023

Larry Doolittle

Senior Scientist/Engineer at Lawrence Berkeley National Labs

vhd2vl is a simple and open-source stand-alone program that converts synthesizable VHDL to Verilog. While it has plenty of limitations, it has proved useful to many developers since its start in 2004. This talk will cover its strengths, weaknesses, and alternatives.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Thanks so much for inviting me. It's a pleasure to rub shoulders with so many smart people. And while this community knows me for vhd2vl, at least some of you might, my primary focus is really using FPGAs and all these open-source toolchains to do DSP for particle accelerators. So this is a hobby project of mine, but I'm glad some people find it useful.

So I want to tell you what vhd2vl is, for those of you who don't already know, where it came from, how it fits into the general toolkit, and what it tells us about code. I think it's kind of interesting, actually. In summary, it converts VHDL to Verilog, at least some dialects of VHDL. It's up on GitHub. It's licensed under GPLv2. And I want to point out that this question, "Can I convert VHDL code to Verilog?", keeps coming up. Maybe not enough to call it frequent, but if you go back in the records and scan forums, and Google is actually great at this, you can find questions going back a long time. It keeps coming up. And every time it pops up, the immediate question is, "Why do you want to do that?" Well, people have their reasons. And then the second immediate answer is "No." And then eventually somebody says, "Try some proprietary program, maybe it'll work." And then, as a variation on that, "Try synthesizing with Vivado or Quartus or something; you can usually get that to re-emit the consequences of its synthesis into some portable form like Verilog." But always this comes with a caveat that any results you get will be totally, well, maybe not totally, but probably useless for maintenance.

At least, I had trouble last night finding any tables that show the corresponding syntax, right? They're both hardware description languages, at least when you go to the synthesis subset. So this is how you say "and" in Verilog, here's how you say "and" in VHDL; here's how you create a loop in Verilog, here's how you do it in VHDL, right? It's the same. You're doing the same stuff, just with different syntactic sugar. So you'd say, well, maybe I have a short program and I'm not an expert in both VHDL and Verilog, but I should be an expert in one or the other. If I have it in one form, I should be able to follow the rules and convert it to the other thing. Well, that's what computers are for.
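The near one-to-one correspondence the speaker describes can be sketched with a toy substitution table. To be clear, this is not how vhd2vl works: vhd2vl uses a real Flex/Bison parser, and the construct mappings below are only a hand-picked illustration of the idea that the two languages share semantics under different syntactic sugar.

```python
import re

# Toy illustration only: a few VHDL constructs and their Verilog
# spellings. A real translator parses the whole design; naive textual
# substitution like this breaks quickly (e.g. on a comment that happens
# to contain the word "and").
VHDL_TO_VERILOG = [
    (r"--", "//"),        # comment marker
    (r"\band\b", "&"),    # bitwise and
    (r"\bor\b", "|"),     # bitwise or
    (r"\bnot\b", "~"),    # bitwise not
]

def toy_translate(vhdl_line: str) -> str:
    """Translate one very simple VHDL line to Verilog."""
    out = vhdl_line
    for pattern, replacement in VHDL_TO_VERILOG:
        out = re.sub(pattern, replacement, out)
    return out

print(toy_translate("q <= a and b;  -- simple gate"))
# -> q <= a & b;  // simple gate
```

The point is exactly the one made in the talk: because the mapping is mostly mechanical, a program can do it, but doing it robustly requires a real parser rather than pattern matching.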

So in 2003, Vincenzo posted his answer to a programming forum and said, "I wrote a translator that supports a limited but useful subset of synthesizable VHDL. Blah, blah, blah. Since we have no commercial interest in such software, I decided to release it under GPL". So he did it. So that's vhd2vl, at least the first version. He's actually based in Australia, and he didn't do much maintenance on it after the first couple of years. I found it useful; I was using it and I saw problems with it. It is open source, so I started contributing back, and that has made me the maintainer since 2005. I've gotten some help from Rodrigo in Argentina, starting about six years ago. That's been very helpful.

So now that we have vhd2vl, there is a real answer, but forum posters still don't know about it, so they still post their obsolete answers. If you find VHDL written by pragmatic hardware engineers, that usually works just fine with this code. If you get some software designer who's enamored with the high-level features of VHDL, maybe that won't work so well. And this really comes out of the Unix paradigm; it's just a Unix program. So I assume there are some Windows users out there: if you can run vhd2vl under WSL, the Windows Subsystem for Linux, send me an email and I can say it's now a Windows program too. There are other recipes and programs out there. I found one written in Java, but it's apparently only free to download if you give your email and some other information; it's definitely not open source. People use GHDL for this purpose. Icarus Verilog made an attempt to do this, so they could parse VHDL, but that's now abandoned and, I'm pretty sure, marked as deprecated.

And there was that comment before that any result would be totally unreadable. I hope the font size lets you look at this. The left side is one particular example from the test set in our code base: a bog-standard description of a counter, or frequency divider, in VHDL. I've forgotten its provenance, but it works. If you run vhd2vl on it, you get the Verilog on the right. It took some really, really small edits to squash it onto the page here; I took out a couple of newlines. But I hope you find that the right-hand side is perfectly legible, and in fact even most of the comments are preserved. In some contexts, that's actually a real killer feature. The machine-generated output is accurate: if vhd2vl fails, it fails at the parsing step, not by generating wrong code. And the result is actually legible, so you could actually import it into a pure Verilog code base.

And I think this calls into focus some comments about what a program really is. Once you write down a program in your text editor, it has a life of its own. This is why we license code: so it can go on and be reused. But this forces you to manipulate your program with other programs. You write your Verilog; you translate it, you compile it, you synthesize it, you lint it. You write your VHDL; you compile it, you translate it, you lint it. You store it in your Git repository. Having links between all of these parts of the chain is crucially important. The tools to do this, to manipulate programs, range from outlandishly complex, like GCC, to really simple: if I have sed, I can start making bulk changes to the program to meet my changing needs. And vhd2vl is just one more program in that suite of tools that manipulate other programs. And I want to point out that tools like this are useless unless people can picture what they do and apply them productively. And it's easy, at least for some audiences, like the people in this room, to say: if you have VHDL and you want Verilog, grab vhd2vl and it should do a pretty good job for you. Try it, right? Don't assume it will work until you verify it empirically.

But it's a tool, kind of out of the Unix paradigm from the 1970s. I hear a lot about AI at this workshop, at this conference. It's not obvious to me that AI means anything in this regime. We value tools that are reproducible and regression-testable, and I don't see any sense that AI is going to be in that category. Maybe it can be a mechanism to help find bugs. We've done fuzzing for a long time to look for bugs in software; maybe AI will help us find our bugs. But then we need to turn it into an actual test. Wait, hmm, I jumped across a couple of [slides] here.

So I want to compare vhd2vl with a couple of other programs, and this is why I say that I'm lucky to rub shoulders with such brilliant people in this room. It's a tiny fraction of the size of some of these other programs. And we have, I counted, six contributors, two of them current. This is a far cry from what these other tools are. Internally, vhd2vl uses Flex and Bison, but a third of it is actually just standard C99. So it's not a big program, but it's a useful program.

What is a program? I said that here. But as kind of an introduction to that: data is code; buffer-exploit writers know that. Code is speech; there's a US Supreme Court decision that validates that. And code is data. Being able to use code as data, this is the innovation from Grace Hopper, building on John von Neumann: that programming computers is a real thing. And that's what takes software and makes it an investment, instead of a one-time-use punch card or equations drawn on a piece of paper.

So all I want to really say is that vhd2vl can play a role in putting code to work in new ways. Of course, I'm eager to get successes and bug reports. I want to end with one little comment here. I have a few seconds. I think it's really intriguing that I get to give this talk in Antwerp, where when I walk down the street, businesses don't care what language they give you their information in. It could be French or German or Flemish or English. They don't care. And people can cope. Historically, computers haven't really coped well with multi-language input. So vhd2vl is maybe a very small piece, but I think it's a constructive piece in this world where you can't assume that everything is uniform. So, dank u wel.


  1. What's the boundary between what will work and what won't work, and how does the user know when it works or not?

    There is, well, I mean, you can read the README, and there are about ten points to watch for. And then in the end, you just try it and see if it crashes or gives you a useful result.

  2. Is there some quantitative issue?

    Straight register transfer level is good. Generate loops, or whatever VHDL calls them. I'm a Verilog guy, not a VHDL guy. So, yeah. Complicated things. What do you call them? I forget the...

  3. The second part of my question, how does the user know when it works or not?

    Oh, if it errors out, then it didn't work. If it produces output, it's supposed to be correct.

  4. So, this is going to be a naive one, but is there more of a need for VHDL to Verilog versus Verilog to VHDL, and any comments along those lines? It seems very one-directional.

    Yeah, this is one-directional. I commented that I think Icarus Verilog can do Verilog to VHDL. It won't preserve comments, but I think it's pretty readable output.

  5. I should ask a slightly annoying question. What happens to the assert statement at the end of your code example? The big code example you have in the VHDL, the assert, and the second to last line, I don't see it in there.

    Yep. No, it doesn't understand asserts. I mean, an assert is not synthesizable. So, we're doing the synthesizable subset. So, for non-synthesizable things, it would produce... The best scenario is it ignores it, right?

  6. Is there a parsing case for the code? Is there any case for the boundaries of the codes? And asserts and lines, you can process?

    It can process asserts, but it ignores them. Again, for details like this, you should really look at the README. There's a line-by-line list, right? These are the things to watch out for. It's not that long; it's not a textbook. It's like two pages, including a checklist of things it can and cannot do. Little things like: VHDL is case-insensitive and Verilog is case-sensitive, so this program is case-retentive. And that could trip people up.

  7. This seems like a big job to achieve this translation. Can you give me some use cases? Like, why were you...

    It's 3,000 lines of code. It's not a big deal.

  8. Why did you have to drive to want to translate VHDL into Verilog?

    Oh. I was working on a collaborative project where I wanted to write Verilog and one of my co-workers wanted to write VHDL. So, we found a way to just make it work.

  9. You're a Verilog guy. You try to turn everything into Verilog...

    Well, and at least at the time of that writing, GHDL was very immature and Icarus Verilog worked really well.


Antonino Tumeo

Chief Scientist at Pacific Northwest National Laboratory (PNNL)
SODA Synthesizer: An Open-Source, End-to-End Hardware Compiler

This talk presents the SODA (Software Defined Accelerators) framework, an open-source, modular, multi-level, no-human-in-the-loop hardware compiler that enables end-to-end generation of specialized accelerators from high-level data science frameworks. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming environments and allows performing system-level design, and Bambu, a state-of-the-art high-level synthesis (HLS) engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of novel "converged" applications. We then discuss some of the research opportunities that such an open-source framework allows.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Many people are in the room, including Professor Fabrizio Ferrandi, who is responsible for the high-level synthesis tool. We just gave the tutorial this morning, so hopefully you attended that too, and you know what I'm going to talk about. Let me motivate the work first. Data science algorithms, approaches, and frameworks keep evolving; it's an artificial intelligence world now, right? And we know that domain-specific accelerators are kind of the only way to keep increasing performance under energy constraints. And all the accelerators that exist today really follow a typical process, right? Find a pattern that you can accelerate; iterate; new application, find another pattern. There is a productivity gap here. Instead, in the world we are in now, the actual algorithm designer wants the opportunity to design their own custom accelerator. So the reality is that we need tools to quickly transition from the model, the machine learning model or data analytics program, down to the chiplet implementation.

Our solution is the SODA Synthesizer. It's a modular, multi-level, interoperable, extensible, open-source hardware compiler from high-level programming frameworks to silicon. It's compiler-based because it has a front-end based on MLIR, called SODA-OPT, and because the back-end is based on a state-of-the-art high-level synthesis tool, PandA-Bambu. We also support a coarse-grained reconfigurable array generator, also open-source, OpenCGRA, and I'm going a little more into the detail of Bambu in this presentation. In all cases, we generate synthesizable Verilog, and the targets can be either FPGAs or ASICs. The beautiful thing is, since this is all compiler-based, design space exploration means changing compiler optimization passes and parameters, right? And you can build your own design space exploration language in this way. And there are a few references if you want more details; in 10 minutes, it's difficult to go through all of it.

The front-end is SODA-OPT; it stands for Search, Outline, Dispatch, Accelerate, and Optimize. It's based on MLIR because, as I was saying, MLIR is now also used in a lot of high-level machine learning tools, like TensorFlow/Runtime, ONNX-MLIR, and Torch-MLIR. Those lower to linear algebra in MLIR, and then we start with our optimizations. SODA-OPT does the partitioning and the optimization of the snippets that you want to accelerate. And, that's the other beautiful thing of MLIR, it also generates the runtime calls, and you can generate all the glue logic to control the accelerators from the host. It's open source; that's the link. You can also find the tutorial from this morning, I guess.

The back-end, in particular the HLS back-end, is PandA-Bambu. It's now potentially, at least in terms of development, the only remaining open-source high-level synthesis tool that is complete, in a sense. Key features that we added over the years are parallel accelerator support, modular high-level synthesis, and support for targeting ASICs with open-source tools. And obviously we want to verify whatever the high-level synthesis tool does, so there is a significant part devoted to generating automated testing and verification. It's modular; we also support commercial tools, but today we are talking about open source, right? With MLIR, obviously, you can also feed that to Vivado HLS, and we have numbers to prove it.

But why an HLS back-end? And why go through progressive lowering? Well, maybe with HLS you are not going to get the fastest solution possible, especially if you want to do true lowering. But if you have a good HLS engine, you can still deal with a general solution and generate an accelerator, and you can provide opportunities for finding specialized patterns and creating custom accelerators. We use that, for example, to support multi-tenant accelerators. And with respect to the other solutions that use HLS, we keep going through progressive lowering, which is how a compiler should work, right? It's more elegant, and you don't need to raise, and lose information, by writing something back at a higher level. And again, as I said, new optimizations are compiler passes. You can devise the design space exploration problem as a compiler optimization problem, right?

A couple of words on the ASIC target in particular. We have also tested with commercial tools, but we regularly use the OpenROAD suite, and that's also the focus of our tutorial, both with the OpenPDK 45-nanometer and the ASAP 7-nanometer cells. So you can really evaluate your algorithms from the high-level implementation down to the results provided by OpenROAD. And Bambu has a feature, through a tool, to characterize the resource library depending on the target technology. It's used for FPGA targets as well, and characterizations for OpenPDK and ASAP are provided.

This is just a list of the optimizations that SODA-OPT supports; I'm not going too much into the details. The key information is that we do optimizations for both memory and computational intensity once the code snippet that you want to accelerate has been separated from the rest of the application. The memory optimizations are obviously very relevant, because you can localize things and then work together with the other synthesis tools to add, for example, buffers and multiple memory ports.

To demonstrate the flow, there are a few numbers with PolyBench at ASAP 7 nanometers, but probably the nicest thing is this picture [Slide 10], right? We partitioned LeNet, and this is generated with NanGate FreePDK45. You can see the versions that are not accelerated and the versions that are accelerated with optimizations from SODA-OPT. Obviously it's visually nice; the optimized solutions are bigger, but they are also faster. And I think I have the numbers, whoops, yeah, but in general, right, they are faster.

In the last couple of minutes, I am quickly going through a couple of research opportunities. Obviously, this is an open-source design automation workshop, so: the open-source ecosystem. I hope you quickly saw how SODA demonstrated that open-source tools can seamlessly integrate, right? I worked on Bambu, but we developed SODA-OPT on top of it afterwards, and we use OpenROAD regularly. So there is a great opportunity to do this. And you can also integrate with commercial tools; actually, with Professor Andrew Kahng, we had a special session at ICCAD talking about that. That's another opportunity that we have with open-source tools that was not available before. There are significant opportunities to support intellectual property and IP blocks. There are opportunities in supporting prototyping platforms and FPGA generators; I think Professor Gaillardon had a talk today on OpenFPGA, right? So that's another opportunity: you can even configure the embedded FPGA that you're going to generate.

Yeah, one example of a platform is the Embedded Scalable Platforms from Columbia University, where we are working with Bambu as the open-source tool. And, other things: this is a compiler, so, profile-driven synthesis. Especially on the memory part that I was talking about, you can optimize and instrument on a host, and then regenerate the architecture optimized for it.

And I have one minute, but I need to flash this. If you didn't understand already: it's all open source, it's all available. There is the whole tutorial. Just take the picture [Slide 15], go visit, try the tool. But yeah, that's the SODA Synthesizer. We implement an end-to-end, compiler-based silicon compiler for the generation of domain-specific accelerators. Hopefully it's a first step, right, in creating this ecosystem of open-source tools that can go from high-level specification directly down to hardware. And I'm happy to answer any questions. Thank you.


  1. I did not really understand the standard cells picture. Can you elaborate a bit more?

    Oh, yeah. [Slide 10] So this is just an example on how we do the partitioning, right? We were tasked, in this case, we were tasked to kind of generate chiplets out of this network, right? So what we did was, using MLIR, we can do partitioning of the specification at the different granularities. We decided, I mean, it's just a simple thing, right? We decided to partition operator by operator. And then we went through the optimizations of our MLIR tool to optimize the different accelerators for each of the operators. But this is more, again, it's visually nice. The complete study also looks at how you kind of actually do the operator fusion, right? Because sometimes this is not convenient. But this was kind of a nice example to show end-to-end synthesis. Suppose that then you want to attach this with chiplets, right, with a chiplet interface. That's a simple pipeline that implements the model.

  2. Thank you for your interesting work. It's not fully clear to me yet to what extent you get most out of your generated custom hardware accelerators, let's say, versus fully programmable flexible accelerators, which would usually require some form of compiler generation as well, to which you'd be committed at the end?

    So yes, I don't have the right picture here. But the main focus of this is fully custom accelerators, right? Use the MLIR tool, this one [Slide 5]. This is a little better [Slide 4]. Use the MLIR tool to partition the specification at different granularities, right? If you look at our tutorial, we show we can do operator by operator, depending on the dialect of MLIR that you choose, or insert a specific part of our SODA dialect to the [?] to do the partitioning. And this can be obviously automated. Then, though, MLIR has a wonderful thing, that one of the lowering targets, by default, is a runtime, right? You can define your own runtime. So that generates the glue logic for, instead, the microcontroller.

  3. The accelerator is a fixed function, but you can use the compiler to affect the...

    It's a fixed function. Obviously, with MLIR synthesis, right, you can even write your kind of changing adaptable accelerator in C, and then get it converted. It's efficient. Not always, but...

  4. Hi. Thanks a lot for the talk. I would be interested in how you represent parameters in the finalized design, or network parameters.

    So parameters can either be constant or can be loaded from memory. One of the things that we are studying with this accelerator, where you can change the modality, right, is whether they need to be input-stationary, or you want something that is output-stationary and you need to stream in the weights. They are stored in memory in our model, right, and then brought into a local RAM before computation.

  5. Thank you for your patience. My question is regarding the result that you presented, that you have 4x area increase on 15x speedup. Is it because your focus of optimization is on speedup?


  6. So is it possible to do some multi-construct optimization?

    Yes. I don't have this slide here. But one of the things that you can do is, obviously, explore: set which parameters you want to meet, right, and then perform the SODA-OPT optimization passes trying to meet those constraints. It's not completely finalized yet, but we are adding a design space exploration engine in Python, where you should be able to implement your own heuristic, right, to do the exploration, changing the parameters.
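The heuristic-driven exploration described in this answer could look something like the sketch below. Everything here is invented for illustration: the knob names, the cost model, and the function names are not the actual SODA-OPT DSE API, which the speaker says is not finalized yet. A real flow would evaluate each configuration by running soda-opt, Bambu, and OpenROAD rather than a toy formula.

```python
import itertools

def evaluate(unroll_factor, buffer_depth):
    """Stand-in cost model returning (latency, area) for one
    configuration. Purely illustrative numbers; a real DSE engine
    would invoke the synthesis flow and measure."""
    latency = 1000 // (unroll_factor * buffer_depth)
    area = 50 * unroll_factor + 20 * buffer_depth
    return latency, area

def explore(max_area):
    """Exhaustive search: pick the lowest-latency configuration that
    meets the area constraint. A user heuristic would replace this."""
    best = None
    for unroll, depth in itertools.product([1, 2, 4, 8], [1, 2, 4]):
        latency, area = evaluate(unroll, depth)
        if area <= max_area and (best is None or latency < best[0]):
            best = (latency, (unroll, depth))
    return best

print(explore(max_area=300))
# -> (62, (4, 4)): unroll 4, depth 4 is fastest under the area budget
```

The structure mirrors the point made in the talk: because optimizations are compiler passes with parameters, exploration reduces to searching over those parameters under user-set constraints.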


Matthew Guthaus

Professor at University of California Santa Cruz
SRAM Design with OpenRAM in SkyWater 130nm

In this talk, Prof. Guthaus presents the current status of the OpenRAM project, including SkyWater 130 tape-out results. In addition, Prof. Guthaus will discuss the future roadmap of OpenRAM project features and support for newer technologies.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Thank you everybody, so thank you for the nice introduction, hopefully I won't be the first one to disappoint by going over ten minutes, but we'll see, I have a lot of slides, so...

Now, I'm not going to talk a lot about what OpenRAM is and how it works; I've given a number of other talks online that you can look at. This one is going to focus a lot more on some actual first-silicon test results and measurements that we've gotten. But you know the TL;DR: OpenRAM is a memory compiler in Python, and it has reference flows and so on. I think the newest things in the last six to nine months are that you can now 'pip install' OpenRAM. It doesn't 'pip install' all the SkyWater stuff, because that's quite a big set of cells and libraries, but we're working on something for that. And we've also moved from a Docker-type setup to more of a Conda-type tool setup. So we're making improvements to it from a software perspective over time, which is interesting.
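For orientation, OpenRAM is driven by a small Python configuration module. The sketch below follows the parameter style in the project's published examples (word size, depth, ports, technology, output naming), but treat the exact names and values as assumptions and check the OpenRAM documentation for the current, full set.

```python
# Sketch of an OpenRAM-style configuration module (parameter names
# follow the project's documented examples; verify against the current
# OpenRAM docs before use).
word_size = 32        # bits per word
num_words = 256       # depth of the array
num_rw_ports = 1      # single-port SRAM, like the second test chip
tech_name = "sky130"  # SkyWater 130nm PDK
output_path = "macro"
output_name = f"sram_{word_size}x{num_words}"

print(output_name)
# -> sram_32x256
```

The compiler reads a module like this and generates the layout, netlist, timing model, and verification collateral for that one macro.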

Now, the interesting thing is we've actually gotten back some of our first results. We've made two test chips. The first one was OR1, which I did with efabless and Google, way back before the Google SkyWater open MPWs; this was a dedicated test chip to get that going. This was actually all open-source OpenRAM, except that DRC/LVS still used the proprietary PDK from SkyWater.

Then we did another test chip, and we've actually done two more since then. The second test chip used the Caravel project with efabless, and we put 10 SRAMs on it, including five dual-port memories and five single-port memories, with a bunch of different configurations, to hopefully be able to test and characterize the memories in real silicon. Now, some of the challenges. You may think a memory compiler is easy, it's just an embedded for loop: you make the array, you're done. It's not quite that simple; there's a lot of control logic and annoying things you have to deal with.

One of the annoying things is how to deal with the bit cells when they're foundry-specific bit cells. The bit cells from SkyWater were our first experience with this, because we were originally an open-source tool with a free PDK and scalable CMOS technology, which didn't have any of the lithography information of those processes. So SkyWater started to expose us to some of that stuff. We had some reference arrays of known bit cells from SkyWater, and we basically reverse-engineered an old memory to make the OpenRAM for SkyWater. As you can see here, we have an example of the dual-port bit cell, which we extracted from an array; this included a strap cell as well as a well tap cell all in one.

Then in the second iteration, the second tape-out, we added the single-port memories, which had a much denser and more complicated bit-cell layout, including custom corner cells and a lot more integrated optical proximity correction for the lithography. So we had to do a more customized placement of the array, which required writing some new code for OpenRAM, using our plug-in interface for custom modules. And you can see here the single-port bit cell, along with a separate strap cell and the corner cells.

So you can see the size difference of the bit cells. This is a little bit unfair, because the dual-port cell on the left includes the tap cell and the one on the right does not, but the difference in size is quite dramatic. You can also see the customization of the layout: the single-port cell uses some non-rectilinear geometries and so on. So it's a much denser cell.

Now, one of the challenges we had was how to verify these. For our first tape-out we used the commercial tools, which were able to handle a lot of these proprietary rules; the open-source tools were less flexible about the non-user design rules. Because the SRAM cells have all of these OPC layers that help with the lithography, they violate a lot of the user design rules. So we basically went with an approach of replacing the bit cell, and any offending cell, with an abstract-view cell that passes DRC but doesn't necessarily have all of the features of the cell. You can see here an example of the 6T single-port SRAM cell, which we used to do the connectivity analysis, to make sure that the bit lines and at least some of the high-level stuff pass DRC while ignoring the other contents of the bit cell. So that's how we addressed the main arrays.

We also had some other custom stuff needed for the control logic in our memory. We use a replica-based control scheme, with a kind of fake column in which all the bit cells are pre-programmed to a logic zero, and we use that to generate the timing for our array. In order to do this, we had to generate a replica bit cell (in red) that's programmed to zero, as well as a dummy bit cell that has the bit lines disconnected.

And so we made those cells by very slightly perturbing the layout. We're hopefully going to get some x-ray analysis of these bit cells to see how good our guesses were and how the lithography plays out. The benefit of an open-source community is that someone's going to do that for us.

We did the same with the dummy bit cell as well. Now, decoders... I'm going to skip ahead to some of the actual results. So we taped out. Oop, jumped ahead too quickly. There's our actual die photomicrograph of the first one. We don't have one of the second chip yet, but we have the actual silicon.

And who thought you'd see a shmoo plot in an open-source talk? We've actually got silicon measurements of that first SRAM, and it's functional. I think the main challenge was a lot of the routing at the top level: we didn't buffer a lot of the signals to do timing optimization to connect to the SRAM. So we're actually limited in performance by the interconnect connecting to the SRAM, rather than by the SRAM itself. And you can see we tested over a set of corners of temperature, voltage, and so on, and it was working up to around 40 megahertz, which is not bad for a first go.

Then we also did voltage measurements. And finally, we did a voltage retention analysis, and we see that it retains its contents down to about 440 millivolts; then we raise the voltage back up and are able to read the contents back. So we actually have some characterization results, which are encouraging. The second test chip we have on my desk; it's configuring the IOs, but we don't have a lot of life out of it yet. Hopefully I can talk to Tim more and we can come up with plans to get it a little more analyzed. It's one of the reasons I'm here.

And so, future work: we've also just released OpenROM, a NAND ROM generator. It's not integrated with OpenLane yet. We're porting to GlobalFoundries' 180. We also got some ReRAM test structures on the last MPW, and we're working on ReRAM arrays in OpenRAM as well. So, a lot of different information. I don't think I went too far over my 10 minutes, and I do want to leave time for some questions. So...


  1. You said you need a limited [?] of software, and this [?] a lot of times.

    Yeah, so the Python itself implements a lot of stuff, and it's a lot of stuff that we don't have to redo. It uses a lot of open-source tools in the backend for simulation, DRC, LVS. We try to use kind of a wrapper idea where we hide the interface. So we can use, for example, simulation with HSPICE, Ngspice, Xyce; any simulator that's kind of standard, we have an interface to it. Inside OpenRAM itself, we actually have a lot of data structures for layout, for hierarchy of logic, you know, transistors, devices. We have a data structure and an API to basically interface with all of that. And so it's meant to be a flexible interface with which you can basically generate any sort of custom layout, whether that's regular or not.

  2. Yeah, so in my experience, any structural code, once it reaches a sizeable scale — which is not, you know, all that large — gets unmanageable.

    Yeah, it does become unmanageable to a certain extent, but we also started to automate certain things. We have a channel router; we actually have a maze router that's not very good, but it's a maze router for connecting some things. So there are some things that are a little more automated, which makes it a little more manageable. And we always sacrifice area for portability; portability is key for us. The layout's not very dense in a lot of cases.

  3. Yes, exactly. It's a question of how much more area-efficient your generator is at, say, one kilobyte. Some of the digital design people say: just use RTL synthesis, and if you have enough chip area, it will at least work in one flow and not cost much area. How does your generator compare with a standard full flow for the same sizes of SRAM?

    So if you were to use a 1k flip-flop- or latch-based RAM, we're probably like 4x smaller. It's considerable. Once you're above a couple hundred bits, we're a savings. Compared to a commercial memory compiler, I would say we're 30% worse, that ballpark. There's a lot of improvement needed there. But again, our goal has always been portability and productivity. We have density and layout on the horizon to go back for, but that's still kind of a secondary goal. I'm always looking for help, though; if people want to help with that, that'd be good.

  4. Yeah, you go ahead. Thank you. It's a follow-up question on Python. People would usually reach for C++ for this type of work; have you ever run into an "aha" moment where maybe you should not have chosen Python, because of performance problems or otherwise?

    So I would say the only reason I've said "aha, we shouldn't have chosen Python" is because it's horrible at object-oriented design. And we started the project in Python 2, way back, quite a long time ago, and Python has evolved over time as well. We didn't necessarily pick up on a lot of the design practices early on, because they came after we started the project — like naming schemes to help with object orientation and stuff like that. You know, the PEPs, the Python Enhancement Proposals, the design suggestions. So that's the only reason I would reconsider Python: how it can abstract and so on. But that's not a fundamental limitation, and we've been revising it over time, so I think it keeps getting better and better.

  5. Is ECC on your roadmap?

    So — that's a good question. For ECC, we support extra rows and columns already. And I had a student do a master's project where we do a soft Verilog wrapper to do the self-test and repair. So you have extra glue logic that gets synthesized, and we have redundant rows and columns. Yes.

  6. Are there any roadblocks, or the opposite — avenues where you see potential for big wins?

    Yeah, my thought is that changing how we think about memories in design is a big thing. Right now, the common thing is for designers to just instantiate a memory and be like, "I need this much memory." That's a bad approach; it should be more of a synthesis-type approach. I think interfacing with the high-level tools like OpenROAD and Yosys is where there's a lot of potential. And it's not really possible with a lot of the commercial or proprietary compilers, because you don't have as much flexibility, so.

  7. So how about, in the hardware, non-6T cells — like 8T — especially for processing-in-memory?

    Yeah, so we intentionally wrote it so that the type of bit cell doesn't really matter. It does rely a little bit on differential signaling, so if you went to a single-ended cell, you'd have to change your sense-amp scheme. There is probably some stuff you'd have to fix, but our intent was that you would be able to change your cell, and we've written it to be very flexible in that. It's also very flexible in, for example, your decoder: you can override our default decoder and make your own. It's intended to be very, you know, modifiable in that way.


Tsung-Wei Huang

Assistant Professor at University of Utah
Taskflow: A General-purpose Task-parallel Programming System

Today's EDA algorithms demand large parallel and heterogeneous computing resources for performance. However, writing parallel EDA algorithms is extremely challenging due to highly complex and irregular patterns. This talk will present a novel programming system to help tackle the parallelization challenges of building high-performance EDA algorithms.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Good afternoon, everybody. My name is Tsung-Wei Huang, and you can just call me TW. Today I'm going to talk about Taskflow. It's a general-purpose task-parallel programming system to help you more easily parallelize EDA applications.

And many of you probably already know that parallel heterogeneous computing is very, very critical for application performance. For example, if you look at machine learning today, with heterogeneous parallelism using just one GPU we are able to achieve over 100x speedup over a CPU alone. So that's the power of parallel heterogeneous computing.

But writing a parallel program itself is not a very easy job, because you have to deal with a lot of technical parallelization details. For example, you have to worry about the parallelism abstraction, whether over software or hardware. You have to worry about concurrency control, tasks, data races, dependency constraints, scheduling efficiency, load balancing, and even performance portability, and so on and so forth. And there's always this trade-off between what you really want and the cost of that design. For example, everybody wants a simple, maintainable, extensible, and portable implementation, but each of those choices may steal your application performance a little bit. And I do believe nobody really wants to manage all these technical details themselves.

So it turns out we want a programming solution that can sit on top to handle all these technical details and challenges for us. But why do existing parallel programming systems not work, especially for EDA problems? Well, if you look at many of these EDA applications, they are very, very complicated and very complex. Existing systems are typically very good at loop parallelism, regular parallelism, but they are not very strong when things become very irregular, like synthesis, optimization, simulation, and so on and so forth. The other issue is that most of the existing systems count on a directed acyclic graph model, which does not allow you to model control flow inside a task graph — iterative cycles and so on and so forth. So from the evolution of parallel programming standards, we can envision that task parallelism is going to be the best model for describing heterogeneous or parallel computing workloads, because it captures your intention in decomposing a parallel heterogeneous algorithm into a top-down task graph that can eventually be scaled to different accelerators.

So our solution here is called Taskflow, which was supported by an NSF project, to overcome many of the parallel programming challenges that cannot be efficiently handled by existing systems. And the very first challenge we want to overcome is transparency. So let's take a look at a "Hello world!" example in our system. Suppose you want to do four things: A, B, C, D. A has to run before B and C; D has to run after B and C. Each task represents a function or a callable. In Taskflow, this is all you need — only 15 lines of C++ code to get a parallel execution of this task graph. First you create a Taskflow object, and then you create an executor to perform the scheduling. On the Taskflow, you use the emplace() method to insert several tasks in terms of C++ lambda objects, like A, B, C, D here, and use the precede() method to declare the dependencies between A, B, C, and D. Finally, you submit the Taskflow to the executor, and it performs all the scheduling for you. At this moment, I believe most of you can fully understand what this code is doing. That's the power of transparency and expressiveness.

Another major innovation of Taskflow, compared with existing systems, is our Control Taskflow Graph programming model, or CTFG. CTFG goes beyond the limitation of a traditional DAG-based model, which does not allow you to express control flow in a single graph entity. For a complicated workload like this example, where you have a cycle describing iterative control flow, conditional tasking, or even a loop, it becomes almost impossible for existing DAG-based systems to express an efficient overlap between control flow and your tasks.

So this is our heterogeneous tasking interface. We have been using CUDA and SYCL, so you can write a single-source C++ program that will be compiled to run on multiple accelerators, like FPGAs, CPUs, GPUs, and so on and so forth. The programming model is pretty much similar to the CPU-based model you are familiar with. In this example, we have four data-transfer tasks and one kernel task that performs the power iteration, and you can describe everything in a fashion very similar to our CPU-based model.

Using Taskflow is very, very easy. Pretty much all you need to do is include Taskflow in your project and tell your compiler where to find the header files. Because Taskflow is header-only and completely written in standard C++ — there is no non-standard C++ feature — all you need is to download the headers, include them in your project, and tell the compiler where to find them. That's it.

Everything by default can be visualized. If you want to run your Taskflow program with a visualization result, you just need to set an environment variable telling the runtime where to dump your execution timeline as a JSON file; then you can copy and paste the JSON into a browser-based interface to visualize the execution result of your program. And everything is built in by default.

We have successfully applied our system to many EDA applications. One of the most important is our OpenTimer project. Timing analysis, of course, is a very important step in the overall design flow, because it helps you verify the expected timing behavior. If you look at what existing works do, they typically levelize your graph, perform level-by-level propagation, and use a pipeline to parallelize the propagation. But with Taskflow, we are able to model the entire timing propagation in one big Taskflow graph, so the computation can flow very naturally and more asynchronously across your circuit network, including many in-graph control-flow tasks, so we can prune unnecessary propagation on the fly.

This is a sample result showing that with Taskflow we are able to achieve over 600x speedup — part of that, of course, comes from using the GPU — over the baseline with one CPU; and with 40 CPUs, our solution can also be 44x faster.

Everything is composable and unified in our system. Task dependency and control flow are all associated with each other, so you can represent a very generic control Taskflow graph to achieve end-to-end parallelism. For example, there is a post on Reddit sharing how Taskflow helped a company migrate their existing multimedia engine to a parallel target in just a few hours, and performance improved by about 8%.

So right now it is open source, and we do have quite a lot of people using it. Some of them are companies; for example, I recently gave a talk at Xilinx, now part of AMD — their Vivado synthesis and place-and-route engines are already using Taskflow.

Okay, I believe I'm going to stop here. If you want to understand more details about our system, feel free to check out our website.


  1. So in your examples, you specified the precedence manually — "A before B and C", "D after B and C", or something like that. But are you also able to figure out the data dependencies?

    So the question is about data dependency. In our system, we do not handle data dependency; it is completely up to the application developer. This is probably one of the biggest lessons we learned when we started this project, because the way you want to manage and optimize data in memory is typically application-dependent. So we do not provide yet another abstraction over data; we just focus on how to describe your workload in terms of a task graph, and then we schedule that for you. And of course, you can always describe your data dependencies in terms of task dependencies.

  2. Are there opportunities in any algorithms you looked at for overlapping data, streaming data between tasks?

    That's a very good question. Yes, we do have a very specialized scheduler trying to maximize the overlap between your tasks and data movement as much as possible. But that will be a totally different topic related to scheduler.

  3. My question would be, how does this language compare to other domain-specific languages for data or control data flow, such as Hadoop or Actor languages — because it looks very much like Actors: you connect things in your main program and then you have this behavior. Do you have to write any analyses for dynamic or static data flow? Because they're domain-specific, there are analyses possible for such restricted data flow — for instance, what computation has to be done in memory. Has any work been done in this area?

    Actually, my short answer to this is no. Like I said, we primarily focus on how we can describe a Taskflow graph, and we do the scheduling for you. And the reason why we do not want to come up with another domain-specific language is the ecosystem. A lot of the time, when you come up with a domain-specific language, the biggest challenge is that you have to convince the community to either rewrite their applications in your language, or they will start to question the sustainability of that particular language. So that is the biggest challenge there.

  4. So your task flow could be any C++ object?

    So TaskFlow is completely written in standard C++. Like I said, there is no non-standard C++ feature. So it's fairly easy to use and integrate into your project.


Tim Edwards

VP Analog at Efabless, Inc.
Principles of Paranoid Design

This talk explores how hardware projects designed using an open-source PDK rely too much on precise data which may not be available, and how problems can be avoided by certain design methodologies such as two-phase clocking, negative-edge clocking, margining, and Monte Carlo simulation. While open PDK data can be made more reliable by cross-validation with multiple tools and, ultimately, measurement, good design practices can achieve working silicon without absolute certainty.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Okay, thank you. If you don't know who I am, then you may not be in the VLSI domain. You can Google who I am.

So I'd say half of VLSI talks include a graph of Moore's Law. This is my slide of Moore's Law. The reason I wanted to show it is because Moore's Law represents the fact that design in VLSI has been performance-driven, just completely, totally performance-driven.

And of course, being performance-driven comes at a cost — a literal cost. The cost of making a mask set for any process at around 65 nanometers and below exceeds the median cost of a house in the US; I like to use that as a reference. And of course, people who are doing prototyping designs don't have to pay for the mask cost, but the mask cost has to be paid by somebody. So even if you distribute that cost over a number of projects in an MPW, it's still expensive, and it's still out of the price range for a hobbyist, for instance.

If you want to do something useful — like, over here, this is our Caravel chip from efabless; its core processor is a very small CPU — that's about the minimum size of something useful you can do. And it's still two square millimeters in a 130-nanometer process, so that's still thousands of dollars. The point is that the cost is not just the cost of the silicon, it's also the cost of failure if your thing doesn't work.

And the thing is that standard design processes essentially assume that you have a PDK with essentially perfect data. That's because foundry data for established nodes comes from PDKs that are generally reliable — the foundries have refined them over numerous iterations of silicon, and that's what they give to you when you buy a proprietary PDK.

So the traditional design methodologies all assume this idea that your data is perfect. And so all you do is test your design over process corners, over PVT, and if that works, you're good. And if it doesn't work, you might try doing Monte Carlo simulation, which is a little easier to meet, and if that works, you're good. And that sets your probability bounds that you're going to get first-time working silicon.

Now, the problem is when you throw an open-source PDK into the mix: the trust in that PDK goes way down. The reason is that the open PDK has been pulled from many, many sources — I've been involved in the open-sourcing of the SkyWater process, and the problem with that was we got all these data formats. Some of them are proprietary, so we can't use them. So we pull data from wherever we can find it in the files that the foundry sent to us, and in some cases we find that those files don't even have the right data, because they weren't part of the commercial flow — they were just some other file format, nobody paid attention to what was in that file, and so you find it's completely broken. So until we have many more measurements on open-source silicon, if you're starting any new project, you will need to rely on assorted tools and methods that are in the public domain. Now, there are two approaches to this: you can try to make your data better, or you can try to design against the data — which is why I call this "principles of paranoid design".

So for the first part — how do you make your data better? One of the things that commercial tools have had, that has not been in the open-source world until recently, is the ability to use a field equation solver to figure out what all the parasitics are for your process, because that is one of the main areas where the data is often not very reliable. So I have found an open-source field equation solver called FasterCap. It has 3D and 2D solvers, and I wrote a little project that I put up on GitHub — I'll show the URL for that in the next slide — but essentially it's an API and a bunch of routines written in Python that will take a file describing a process like SkyWater 130, for instance, or GF180, and then build out a number of different metal structures in FasterCap's input format, then make hundreds or thousands of those, and run them through FasterCap.

And from that, I've been able to get all these curve traces for how the parasitic capacitances behave in this process, and use that to go improve all the models that I have in my layout program, Magic, which does extraction — and I'm getting much better results than I had before. And it's also extensible: any time we need to onboard another foundry process, it's just a matter of writing an input file, and then we can redo these runs and get all the coefficients that plug back into Magic, and you get your parasitic extraction.

So right now I have what I consider to be a pretty good full-RC extraction in Magic. This is an example of me running full-RC extraction and then simulating an OpenRAM circuit that Matt sent me. You can see the difference in this section here, where a node is being driven back to the mid-range voltage: in an ideal simulation it goes directly down to the mid-range voltage, while in the full-RC-extracted one it has a slope to it — and somewhere else in the simulation you can see where a bit fails because of that.

At the same time, we've got the simulators themselves, Ngspice and Xyce. Ngspice has recently introduced OSDI and OpenVAF support, which you can run as plug-ins. That makes it a lot easier for us to use some of the newer models. We have, for instance, a few Verilog-A models in the SkyWater process that we weren't able to include with the original release of the open PDK that we can do now, and that's also true of the ReRAM models in Sky130. And Xyce has been doing a lot of development in the last few years, and is getting much more compatible with other versions of SPICE.

So now the other approach is, rather than trying to make your data more reliable, design for robustness rather than for performance, performance, performance. The principle here is that if you are designing to an open PDK, and the open PDK has only recently been introduced, then you should not be trying to design for performance; you should be designing something interesting, with a novel architecture, and you should be trying to make the thing work. So design for robustness, and be paranoid about your design.

Most of the design methodologies for making things robust are not new — some of these things are pretty old, and some of them have been entirely forgotten because of this push for performance. One thing you can do, for instance, if you want a robust digital circuit, is two-phase clocking: you take a flop and divide it in half, and instead of clocking one half on the rising clock edge and the other half on the falling clock edge, you clock them on two phases of a non-overlapping clock. If your circuit has setup problems, you can just slow down the clock; if your circuit has hold problems, you can increase the spacing between the clock phases; and one way or the other, you will get it to work. It won't be the best performance, but it will work. And I think most synthesis and place-and-route tools should be able to work with this style — it's just a matter of routing two different clock networks. You're not going to do it optimally, and that's going to have a performance impact, but they should be able to do it.

Anybody who's been through the first couple of Open MPWs knows we had problems, and that problem was due specifically to me not paying attention to what I just said — to not trust the data. If you have something like a serial chain, up top, the standard way to make sure that your clock doesn't arrive before your data is to delay the clock by inserting delays into it. The tools will do that automatically; in our case this was in a hierarchy, so I was doing it manually. I trusted the data, I put in some extra delays, it wasn't enough, and we got hold violations in the scan chain. Now, there are several things I could have done that would have made it more robust. One of them is just to run your clock backwards through the scan chain. I had not wanted to do that, because it would add several wires up the side, and I was trying to avoid taking away area that I would otherwise be giving to the users for the user project area — this is in the Caravel harness chip. But eventually we realized: as on the bottom, you can just clock things on the negative clock edge, and it will always be correct. And we have some users who have done that. Our paranoid users are our most successful users, going back to what I said on the first slide.

You can also do this for an entire subsystem, for instance the Wishbone interface. This was suggested by Tobias Strauch: you design the user area so that it picks up the Wishbone clock and clocks data on the negative edge of that clock. As long as you've designed your Wishbone interface on the user side for that, it will work. And we had another paranoid user who decided that he didn't know the relationship between clock and data between the microcontroller and the user project. So he figured he would just put in a delay chain, and then select from the delay chain whatever delay of the clock you want. He found one that worked, and he was the first person to bring up a full user project that was a complete microcontroller in the user project area.

So that's all I had for my story, and thank you for listening, and we now have a little time, I guess, for questions and answers.


  1. So, just some feedback for the community: a lot of the links that go to these PDKs are broken, so if you go there, there is nothing, basically. For example, one takes you to the website, but the content is not there, or it's a different version, and it's a bit complex and confusing to find things. I don't know if you can relay this back to the people in the community.

    Well, yeah — like all open-source stuff, it depends on feedback to fix, and I'm not sure which links you're specifically referring to, or where they're coming from.

  2. [...] From efabless to [?], I don't remember exactly which of the links are broken...

    Yeah, I do know that there are issues with some of the links there, and again, feedback is what we need. We do have a Slack channel where we're fairly responsive to those things.

  3. How do you see the problems with data quality changing as we go down, for example, to smaller nodes, if we're lucky enough to have open PDKs on those nodes? Do you think it's going to be mostly more of the same problems, or do you expect bigger problems with things like more complex transistor models and stuff?

    It depends. It will get worse to a point. I understand that once you get to FinFETs, you end up having... Andrew is suddenly shaking his head, so I'm probably about to say something that is... No, you're good. Okay. Yeah, once you get to FinFETs, you have a lot of constraints — and there are so many constraints that it actually makes design problems easier in some cases. It's possible that once you get to FinFETs, the problem just becomes a little easier. But certainly down through 65 and 45 nanometers, all the way down to 28 — 28 is probably the worst. And we're getting there. And I don't expect to see open-source PDKs down in that range anytime soon.

    [Comment from the audience:] The caveat — or the comment — is that it was a big struggle with the problems that were there before. But if a commercial, production-proven PDK is open-sourced, it will almost from the get-go be much more complete, robust, and polished. So that easier situation does happen.

    Yeah, there are certain levels of trust that depend on the level of trust in the foundry to begin with, and in the process being done. SkyWater in particular is sort of research-y in the way they do things, and that makes it a little less trustable than some of the other foundries. But then they're also easier to work with, and they have been the only one so far, followed then by GF, to go open-source.


Rishiyur Nikhil

Co-founder and CTO at Bluespec Inc.
BSV and BH, High-Level Hardware Design Languages (HLHDLs)

BSV and BH, the Bluespec HLHDLs (High-Level Languages for Hardware Design), emerged from ideas in formal specification (Term Rewriting Systems), functional programming (Haskell), and automatic synthesis of RTL from specifications. BSV has been used in some major commercial ASIC designs and is used widely in FPGA projects. The BSV/BH compiler (written in Haskell) was open-sourced in 2020, and today's projects are centered around RISC-V design and verification, and on accelerators.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Yes, I'm going to tell you a little bit about the Bluespec high-level design languages. I would guess that most people might have heard the name Bluespec, but perhaps don't know very much about it at all, other than having heard the name. There's also a small terminological confusion in that there's a Bluespec company and there are the Bluespec languages. They used to be the same at one point, but the Bluespec language has been completely open-sourced for about three years now, and the Bluespec company is not directly involved in the Bluespec language anymore, other than allowing people like me to work on it and providing resources for that. So I'd like to give you a sense of what the language is about. It's yet another high-level design language — many of you will be saying, so what's different? There are many dimensions of difference among HDLs, but I think the fairly unique one, in the sense that I've not seen any other HDL take this approach, is the semantic model of condition-action rules, which is inspired by languages in formal verification.

So I'm going to mostly spend my time on that. There's also a different dimension — how it borrows lots of ideas from Haskell for static elaboration, et cetera — but we don't have enough time to go into both, so I'm just going to focus on this one, which is fairly unique. And I'll do it by means of a small example. What the Bluespec language is, is inspired, right from the beginning, by languages for formal specification of highly concurrent systems.

So I'll start by giving a small example. Just imagine a very small toy coherent cache system. In this example, L2 is your L2 cache, we have two L1 caches, A and B, and we have FIFOs that carry read/write requests from, let's say, two CPUs into the L1 caches, and they may be arriving concurrently from those CPUs. Alright? So let's imagine this is the initial state of the system, where L1_A has the data X in exclusive mode, maybe we just did a write to the data. It's invalid in L1_B, and L2 just has a notation that the line is available in exclusive mode in A, and perhaps has an old value Y in it.

And now imagine a read request comes in to B. We can imagine transitioning to a state like this, where the states have all become shared, the current value X is available everywhere, and L2 has a notation that the line is available at both A and B. Once we are in that state, seeing a read on B, we can immediately satisfy the read and send back an X.

So what I've shown you in a picture can be expressed in text as a condition-action rule. This rule says: if the request at the head of B's queue is a read, B's state is invalid, and L2 is exclusive, then here are some updates to the state that transition the L1s and L2 and respond to B with the value X.

So essentially the entire behavior of the cache can be specified this way. I'm not claiming this is an implementation, but you can specify the cache behavior using rules like this. For example, I'd have a similar rule if the request was a write, and similarly a pair of rules for reads and writes on A, right? And the nice thing about rules is that they are concurrent, in the sense that if rules don't interfere in any way, they can run concurrently; there's no ordering between rules. You can non-deterministically choose what order you execute rules in, if you want to pick an order. The only thing that is important is that rules must be treated as atomic transactions. And why atomic transactions? Because we want to think in terms of invariants for correctness on the overall cache system. An invariant may be a correctness condition like: if one of the L1s is in exclusive mode, then the other one must be invalid, for example. Right? With rules, you can reason about whether, if the invariant was true before the rule, it is still true after the rule, and atomicity gives you exactly the tool to do that kind of reasoning.
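As a rough illustration (this is plain Rust, not Bluespec; all type and function names are invented for the sketch), a condition-action rule like the one just described can be modelled as a guard over the state plus an atomic update that either fires completely or not at all:

```rust
// Toy model of the talk's coherence rule: "if the request on B is a read,
// B is invalid, and A holds the line exclusive, then downgrade everyone to
// shared and respond to B with the current value X."

#[derive(Clone, Copy, PartialEq, Debug)]
enum Mode { Invalid, Shared, Exclusive }

#[derive(Clone, Copy, PartialEq, Debug)]
struct State {
    l1_a: (Mode, u32),   // mode and data in L1_A
    l1_b: (Mode, u32),   // mode and data in L1_B
    l2_dir: (Mode, u32), // L2 directory mode and (possibly stale) data
    req_b_is_read: bool, // head of B's request FIFO is a read
    resp_b: Option<u32>, // response back to CPU B
}

// A rule: guard (the `if` condition) plus atomic action (the new State).
fn rule_b_read(s: &State) -> Option<State> {
    let (mode_a, x) = s.l1_a;
    if s.req_b_is_read && s.l1_b.0 == Mode::Invalid && mode_a == Mode::Exclusive {
        Some(State {
            l1_a: (Mode::Shared, x),
            l1_b: (Mode::Shared, x),
            l2_dir: (Mode::Shared, x),
            req_b_is_read: false,
            resp_b: Some(x), // respond with the current value X
        })
    } else {
        None // guard false: the rule does not fire
    }
}
```

Because the entire new state is constructed before the old one is replaced, the rule is atomic by construction, which is what makes invariant-style reasoning ("if it held before the rule, does it hold after?") tractable.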

So here's a second version of the same problem, except I'm going to write it with much finer-granularity rules, right? We start with the same state, and now my rules are going to be local. In other words, every rule is only going to look at one of the elements of the cache system, and it can send and receive messages. So when we see this initial state, we send a shared request to L2 and go into a pending mode on B. When L2 sees the shared request, it sends a downgrade request up to L1_A, saying please get back into shared mode, and it also goes into pending mode. When L1_A sees that, it goes into shared state and returns a shared response saying, here's the new value X. When L2 gets that, it goes into shared mode and sends an acknowledgment back to B, saying okay, you're shared. And finally, B itself can go into shared mode, and now we're just like in the second state that you saw on the previous slide, and we can respond with an X.

So this is much closer to the implementation because it's local. I'm not looking at state all over the system as if I could observe it instantaneously; I'm accommodating the idea that there's a cost to be paid for communication. Now, this whole thing can be written in exactly the same syntax, the same notation, except that every rule, if you look at it in detail (I've just given three or four of these rules), is set up as a local examination of your current local state and incoming messages, and the transition is the local change in state plus outgoing messages.
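The finer-grained style can be sketched the same way (again Rust with invented names, not BSV): each rule now reads only one component's state plus its incoming message queue, and its action is a local state change plus outgoing messages. Only the first two rules of the protocol are shown:

```rust
use std::collections::VecDeque;

// Messages between cache components (only the ones these two rules use).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Msg { ShareReq, Downgrade }

#[derive(Clone, Copy, PartialEq, Debug)]
enum Mode { Invalid, Shared, Exclusive, Pending }

// Rule at L1_B: on a read while Invalid, ask L2 for shared access
// and go into pending mode. Looks only at B's state and B's queue.
fn rule_b_start(b_mode: &mut Mode, to_l2: &mut VecDeque<Msg>) -> bool {
    if *b_mode == Mode::Invalid {
        *b_mode = Mode::Pending;
        to_l2.push_back(Msg::ShareReq);
        true
    } else {
        false
    }
}

// Rule at L2: on a ShareReq while the line is held exclusive elsewhere,
// send a downgrade request up to L1_A and go into pending mode.
fn rule_l2_forward(
    l2_mode: &mut Mode,
    from_b: &mut VecDeque<Msg>,
    to_a: &mut VecDeque<Msg>,
) -> bool {
    if from_b.front() == Some(&Msg::ShareReq) && *l2_mode == Mode::Exclusive {
        from_b.pop_front();
        *l2_mode = Mode::Pending;
        to_a.push_back(Msg::Downgrade);
        true
    } else {
        false
    }
}
```

The remaining rules (L1_A downgrading and responding, L2 acknowledging, B completing) follow the same local pattern; the point is that no rule inspects state outside its own component.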

So the idea of condition-action rules is not new in general. It goes back to perhaps the earliest days of LISP programming in the 50s and 60s. I've mentioned some formal specification systems, TLA+, UNITY, term-rewriting systems, Event-B, et cetera, that all use, modulo different syntaxes and details, roughly the same idea. You have condition-action rules. Rules are not ordered; they can go in any order, so they give you concurrency. They're non-deterministic; you can choose them in any order you like. The only thing is, they all assume rules are atomic. And further, the other nice thing about having a single language is that you can refine: in my first and second slides, I showed you solutions to the same cache-coherence problem with rules at different granularity, where in some sense you're lowering it closer to the implementation.

So given that background, let me now tell you a little bit about BSV and BH, these languages. Essentially, the syntax and language I showed you in the first two slides is actual BSV code. I can push a button, run it through the BSV compiler, and get synthesizable RTL out of it. That RTL you then take through standard RTL flows, FPGA or ASIC, and essentially all behavior is written in this way, as in these formal specification languages. There is no behavioral description other than this rule notation. We have two such languages for historical reasons.

The original compiler 20 years ago, when we first started, was written by Haskell enthusiasts and had a very Haskell-like syntax. That didn't go over very well in the hardware engineering community, so we redesigned the syntax to be more SystemVerilog-ish; that's called BSV. But essentially it's the same language, your choice of front-end, and you can mix and match them.

So this is not a new language; as I said, it's newly open-sourced. It's been about 20 years in the making, so this is not a research language, not a toy language we're experimenting with. We're using it, and it has been used a lot in production in many, many situations. And being of that provenance, of that much time, it's a so-called batteries-included language. In other words, there are lots and lots of libraries available, so that you don't design every little piece of IP from scratch, whether it's caches, MMUs, RISC-V CPUs, debug controllers, interrupt controllers, or AXI buses. All of these are available as open-source libraries, so that you can focus more on the architecture of whatever you're designing, rather than having to build everything up from scratch.

So, like I said, it was a proprietary tool from the company for 15, 16 years, during which time it was used for ASIC design at ST and TI, for some rather large subsystems. It was used for modeling at IBM and other places, because the fact that you have the same notation, no matter what level of abstraction you're describing at, means you can actually compile and synthesize even the higher-level description. Even the thing I showed you on the first slide, which was unrealistic from an implementation point of view if you're looking for performance, is still compilable to RTL, and you can run it on FPGAs as is; even the specification can be run on FPGAs. So IBM, recognizing that, was using this as a modeling tool to explore microarchitectures for some of their POWER CPUs back in the 2000s.

So nowadays, I would say it's mostly used by people for FPGA programming, a lot of it in the RISC-V space. It was open-sourced, like I said, about three years ago, and I hope that gives you a quick sense of what's in Bluespec. This is a unique feature. You'll see other languages, for example Chisel, that have exploited Scala for high-level expression of RTL, because it generates RTL from that. But this is the only language I'm aware of that uses this idea of condition-action rules as the behavioral spec, and like I said, the whole inspiration has been formal verification, even though we in the company haven't used it for formal verification much ourselves.

First of all, I'd like to say that even if you're not using formal verification as part of your automation, the fact that it has a formal semantics and a formal way of doing refinement is useful, even if you're doing it manually. Most of what we have done has, in fact, been done manually. As for automation, there's a bunch of work at MIT where they have done pretty advanced RISC-V superscalar, out-of-order, speculative processors and proved them correct using Bluespec, and we're in the process of doing some DARPA-related projects also along these lines.

So I'll stop at that point with this main point. Just as a teaser, it's on the slide, so you can see it later on; it's too much to look at at this point. The other side of Bluespec is that it uses essentially all the power of Haskell, including polymorphic types, polymorphic parameters, higher-order functions, et cetera, to make it a very powerful language for circuit description. If you think of generating your circuit as two phases, the static elaboration part is in some sense the circuit description part, by which you describe the structure of your circuit, what the pieces are, what connects to what, and how they connect correctly from a types point of view. And then there's the behavioral part. The condition-action rules are the behavioral side of Bluespec, and there's also a whole chapter on the Haskell-based static elaboration, which you can talk to me about separately outside or later on.
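To give a flavor of what static elaboration means (without claiming this is how the BSV compiler works), here is a Rust sketch in which an ordinary recursive, generic function "elaborates" the structure of a balanced reduction tree; in BSV/BH the analogous description would be written with Haskell-style higher-order functions. The function name and the depth bookkeeping are invented for the example:

```rust
// Static elaboration in miniature: the host language's functions and
// generics decide the circuit's *structure*. Here, reduce_tree builds a
// balanced binary tree of `op` nodes over the inputs and also reports the
// tree's depth (a proxy for the critical path through the elaborated tree).
fn reduce_tree<T: Copy>(xs: &[T], op: &dyn Fn(T, T) -> T) -> (T, usize) {
    match xs.len() {
        0 => panic!("empty input"),
        1 => (xs[0], 0), // a single wire: no logic, depth 0
        n => {
            // Recursively elaborate the two halves, then one combining node.
            let (l, dl) = reduce_tree(&xs[..n / 2], op);
            let (r, dr) = reduce_tree(&xs[n / 2..], op);
            (op(l, r), 1 + dl.max(dr))
        }
    }
}
```

The same source text elaborates an 8-input tree of depth 3 or a 1024-input tree of depth 10; that parametric, checked-by-the-type-system structure generation is what the Haskell side of Bluespec provides.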


  1. Obviously, it's a very good decision, but I'm always interested in hearing the reasoning behind these things. I'm just curious why your company decided to make the Bluespec compiler open-source.

    Yeah. So we essentially gave up on trying to commercialize this. We tried for more than a decade. And I think what I find is that if you look at similar experience with software languages, open-source languages succeed. Open-source languages themselves might succeed if you have enough oomph in them; take Python, for example. Or even if it's not an open-source language, it might succeed if you have a big name and big resources behind it. I point at Java, for example: why did people adopt Java? Well, Sun Microsystems was behind it. And why do people adopt C# and F#? Microsoft is behind them. We can't compete in that space; we're not a big company like that. So unless we got acquired by one of the biggies and one of the biggies promoted it, it wasn't going to take off. And it didn't, so we gave up eventually. We found over time also that a lot of our activities in the company drifted more into services and IP and things like that, for which we use the Bluespec language for everything we do. But as an EDA tool, we just gave up on that.

  2. So this is a very different execution model from what we're used to, right? I can very much see how it's useful for describing caches and CPUs and stuff. But are there things that you tried to design with it where the execution model was a limitation, I guess?

    I don't think so. But this raises an interesting, related question that other people ask; it was not your question. It was: how does this compare with HLS, for example? One separation I make is that whether you take Verilog, VHDL, SystemVerilog, Chisel, or Bluespec, you, the designer, are completely, ultimately choosing the microarchitecture of what you do. There is no guesswork; the tool is not making any innovation, except local combinational optimizations and the like. Generally, the overall structure of what you're doing is designer-specified. Whereas in HLS, the tool has an architecture, and it has knobs by which you can adjust details of that architecture. So that's one major difference. We sit very much in the Verilog, SystemVerilog, Chisel camp, as opposed to HLS.

    Then coming back to your question: all of these languages, Verilog, SystemVerilog, VHDL, and Chisel, use the classical clocked synchronous circuit model as, effectively, their behavioral model, right? And here we are lifting it up beyond that to the rule-based model. I didn't say anything about clocks in any of my description here. So there is a question of how you synthesize from there, because ultimately we produce standard clocked synthesizable Verilog.

    And ultimately, it comes down to scalability in your reasoning about correctness, right? I think for small circuits, it doesn't make much of a difference whether you use conventional Verilog, SystemVerilog, or Bluespec. But the thing about this atomicity is that it's a global property. In other words, in a Bluespec program you can have a module with rules; a rule might invoke a method that crosses the boundary of the module into another module (you can think of the method as a piece of the rule extending into it), and that may transitively go beyond into multiple modules.

    And it's in that kind of a situation where atomicity becomes a nightmare if you have to think about every level of detail of arbitration, of every possible interference, especially when there's a lot of conditional interference. That is, on some conditions it interferes, and on other conditions it doesn't, etc. That is where the value of having this atomic transaction model really helps in quickly composing scalable systems that have a lot of this kind of thing.

    It's definitely a big advantage on circuits that have a lot of control flow in them, because that is where the conditional interference, etc., becomes most prominent. If you have very much of a data path, you know, the kind of thing that HLS does well on loops and arrays and all that, I don't think Bluespec will give you much of an advantage on that.

  3. At one point they just went away, but there were once fashionable discussions of self-timed logic and systolic operation. I would guess these rules are very good at that.

    So, good question. The question was about self-timed logic and asynchronous logic and things like that. Caltech, of course, was very prominent in that. That's a very interesting question, because rules, as I gave them to you, didn't say anything about clocks, and they seem like a natural fit for asynchronous logic, self-timed logic, etc. It's a topic somebody should explore. We've always had this idea that it could be targeted towards asynchronous logic. There was a company called Achronix that was doing asynchronous-logic-based FPGAs that we had some conversations with, etc. We just didn't have the bandwidth to explore that per se. But absolutely, the same rule-based source language could, instead of our synthesis method that goes into clocked Verilog, have gone into asynchronous logic as well.


Frans Skarman

PhD Student at Linköping University
Spade: An Expression-Based HDL With Pipelines

Frans will present Spade, a new open-source standalone hardware description language. He will show how Spade's abstractions and tooling, which are inspired by software languages, improve the productivity of an HDL without sacrificing low-level control.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Yes, thank you very much for the introduction. So I'm Frans, I'm going to talk about Spade, and of course we need to motivate things. My motivation is very much in line with the answer to my earlier question: it's abstraction, and to me this is sort of saying what things we have to think about and what things we are in control over. As an example, you have Verilog and VHDL. They are going to be low-level in almost any way you look at them, but you can do low-level Verilog and sort of instantiate individual AND gates in a netlist, or do high-level Verilog with more behavioral description, reasoning about individual operations on individual bundles of bits. On the other end of the spectrum we have high-level synthesis. Here we've given up a whole lot of control, but we also have way fewer things to think about. Now, Spade is absolutely not a high-level synthesis tool. Just like Bluespec, it falls on the lower end of the spectrum, and the goal with everything I'm trying to do here is to retain the control that you have with Verilog and VHDL, but push the amount of high-level reasoning you can do while still retaining that control. I also want to steal a bunch of stuff from software languages, because I'm a software developer and I miss a lot of things when I come to Verilog and VHDL and other HDLs.

So I want to start off with an example of what the language looks like. This is sort of the "Hello world!": a counter that counts from zero up to some maximum value. The inputs are separated from the outputs, so here we take a clock, a reset, and a max value, and this thing produces a current counter value. The whole language is more linear than Verilog and VHDL, for better or for worse, but I think it's easier to reason about things in a linear way and then explicitly do non-linear flow of data. To define a counter we need a register. We call that register 'val' and we clock it by this 'clk' signal. The register statement is the only sequential statement in the language; everything else is combinational. We specify a reset, so if the 'rst' signal is true, then we set it to zero. And then to describe the behavior of the circuit, we give the new value of the register as a function of the old value. So if the current value is the 'max' value, the new value will be zero; otherwise it will be the old value incremented by one. Some key takeaways here. First of all, it's an expression-based language, so instead of saying "if 'val' is 'max', set 'val' to zero", we say "'val' is the result of this 'if' expression", and then we have to specify a value in each branch. That prevents a few bugs, and once you get used to it, I think it's a more natural mapping to hardware than the imperative style that most HDLs have. It's a statically typed language, so we specify all the types. It also tries to make sure that you don't accidentally throw away information that you might need. So if you do "'val' plus one", that could overflow, so the language will not allow you to throw away that overflow implicitly. You have to call this 'trunc' function to say "No, I actually do want to throw away a bit here", since we have a feedback in the circuit.
It has type inference, so you don't have to specify any types inside the body; the compiler figures them out for you, as long as things are consistent. If the compiler finds some inconsistencies, it will of course alert you. So you get the benefits of static types without the annoying typing. Unlike Bluespec, this is a cycle-to-cycle description of your hardware: you give the new value of all the registers as a function of the old value of all the registers, but you have a lot more structured tools for doing this than you do in Verilog and VHDL.
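For readers without the slide, the counter's cycle-to-cycle semantics can be sketched as a pure next-state function. This is Rust standing in for Spade, not Spade syntax; `wrapping_add` merely plays the role of the explicit 'trunc' the talk describes:

```rust
// Next-state function of the counter: the new register value is a pure
// function of the old value, the reset, and the max input.
fn counter_next(val: u8, rst: bool, max: u8) -> u8 {
    if rst {
        0
    } else if val == max {
        0
    } else {
        // val + 1 could overflow; wrapping_add makes discarding the carry
        // bit explicit, as Spade's 'trunc' does, instead of implicit.
        val.wrapping_add(1)
    }
}
```

Every branch produces a value, mirroring the expression-based style: 'val' *is* the result of the if-expression, rather than being assigned in some branches and not others.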

One of those tools that you have access to is a pipelining feature. So on the left here you have the code, on the right we have the resulting hardware. The thing I want you to pay attention to is these three lines. First of all, the head of the pipeline specifies its latency. Just by reading the head, you can now see that the delay between input and output in this module is going to be two. Then you have the two 'reg' statements, which specify that all the variables above a 'reg' statement should now be registered, and when you refer to those variables below the 'reg' statement, you refer to the registered version. This decouples the description of where you put pipeline registers from the computation that you're performing inside the circuit. This is useful when you're describing the circuit originally, but it's way more useful when you have to refactor things. So let's say we realize that this 'g' block at the top is too slow. It's breaking our f_max, so we need to optimize it, and we do that by pipelining it. Normally you would have to do a bunch of thinking now, like "What do I have to change because of this?". In Spade, because the compiler knows about pipelines, it will first tell you that, hey, you need to instantiate this 'g' block as a pipeline, and you specify the depth there; if there were more complex behavior here, the user would need to go back and either confirm the behavior is unchanged or change the circuit to match the new description. In this case, the compiler will actually figure out the problem for us. We have this line going backwards through the pipeline; the behavior of our circuit changed, so the compiler will tell us that this 'x' value is not available where you're trying to use it. You need to use it in a later stage.
Of course, the solution here is to delay the computation of this whole pipeline by one cycle, so we insert these two new registers into the pipeline, and because we decoupled the description of pipelining from the description of computation, the only change we have to make is to put 'reg' there and update the depth of the outer pipeline, because that changed the latency of the pipeline.
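The decoupling being described can be mimicked in a tiny Rust model (illustrative only; the struct, `f`, and `g` are invented stand-ins for the slide's blocks): the combinational functions are written once, and the 'reg' boundaries are just stage registers that shift on every clock:

```rust
// Two-stage pipeline model: s1 and s2 are the registers inserted by the
// two 'reg' statements; f and g are the combinational logic between them.
struct Pipe {
    s1: Option<u32>, // register after stage 1
    s2: Option<u32>, // register after stage 2 (the output register)
}

fn f(x: u32) -> u32 { x + 1 } // stage-1 combinational logic
fn g(x: u32) -> u32 { x * 2 } // stage-2 combinational logic

impl Pipe {
    fn new() -> Self {
        Pipe { s1: None, s2: None }
    }
    // One clock edge: every value moves down one 'reg' boundary.
    // The return value is what appears at the output this cycle.
    fn clock(&mut self, input: u32) -> Option<u32> {
        let out = self.s2;
        self.s2 = self.s1.map(g);
        self.s1 = Some(f(input));
        out
    }
}
```

Adding a third stage would mean adding one more register field and one more shift line, without touching `f` or `g` at all, which is the refactoring property the 'reg' statement gives you.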

This pipelining feature has a few more things that I don't have time to go into. You can do feedback and bypasses, so you can say, "I want to refer to the value of this register two cycles ago", or, "to the value as it appears two stages below me", for example. This is very useful if you're doing something like a processor, where you have a register file that feeds back on itself. Also, as of a few weeks ago, there's some built-in support for dynamic behavior, so you can say that all of the pipeline registers in this stage should stall if a condition holds, and then it will stall all the pipeline stages above this. This allows you to do correct flushing, and there's almost enough support in the language now for doing backpressure negotiation between pipelines, so you can solve that as well in a structured manner. That's the pipelining construct.

One of the things I wanted to steal from software languages was the type system. One of the major things I miss in a lot of languages is the 'enum', the more powerful 'enum' in the Rust sense rather than the 'enum' of C, where it's just one of a set of values. The type system supports generics, so we define a type here called 'Option', and it's going to be generic over any type 'T'. We could put integers in there, we could put 'structs' in there, other 'enums', whatever we feel like. This 'Option' type will take on one of two values: either it's 'Some', in which case a value is present, or it's 'None', in which case no value is present. The best way to view this, to me, is as a valid bit that is bundled with the data it validates, so the representation of this type will be a tag along with a value. If the tag is zero, the option type is 'None', and the value bits are undefined, and we're not allowed to access those bits unless we first check that the tag is '1', in which case the compiler will give us access to the bits. So this prevents reading data that wasn't really valid.

This is very useful for a lot of other things as well. You could model commands on a bus. For example, if you have a memory, you could have no operation on the memory, you could have a read operation or you could have a write operation, and you can bundle the data that it needs.

You can model an instruction set, so you have a 'Set' instruction with a destination register and an immediate value, an 'Add' instruction with a destination register and two input registers, or a 'Jump' instruction with a target. This will be encoded in a similar way, and it's very nice to match on these; then you only get access to the fields of your instruction that you actually have a use for.
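Since Spade's 'enum' is modelled on Rust's, the instruction-set example can be sketched directly in Rust. The variant fields and the 'describe' function are invented for illustration, not taken from any real ISA:

```rust
// An instruction set as a tagged enum: each variant carries exactly the
// fields that instruction has, and nothing else.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Insn {
    Set { rd: u8, imm: i32 },        // destination register and immediate
    Add { rd: u8, rs1: u8, rs2: u8 }, // destination and two source registers
    Jump { target: u32 },             // jump target
}

// Matching is the only way to reach a variant's fields, so you can never
// read an 'rs2' out of a Jump: the safety property described above.
fn describe(i: Insn) -> String {
    match i {
        Insn::Set { rd, imm } => format!("x{rd} <- {imm}"),
        Insn::Add { rd, rs1, rs2 } => format!("x{rd} <- x{rs1} + x{rs2}"),
        Insn::Jump { target } => format!("pc <- {target}"),
    }
}
```

In hardware terms, the enum lowers to a tag plus a field payload, just like the 'Option' representation described above, but the source code never manipulates those raw bits directly.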

Finally, tooling. I think tooling is a very important part of any language. I showed briefly the compiler error message; that's something I'm very passionate about. The compiler should give you useful error messages that describe what you need to change to make things work, so any unclear compiler error message counts as a bug.

We have test benches in cocotb, and cocotb is a Python testing framework for Verilog. The thing that Spade adds to it is the ability to write Spade code inside cocotb, so you can write a string with a Spade expression; it goes out to the compiler, which compiles it, and that way you don't have to care how the compiler decided to encode your 'enum' or your 'struct'.

And there's a build tool. It can manage dependencies: here I say I want a RISC-V implementation from a path and an RGB LED driver from some Git repo. It can call build tools for you, Yosys and nextpnr, and it's scriptable via plugins; this specifies that I want to bring in a plugin that generates a program memory from an assembly file.

The last slide: Spade is an open-source project, of course, this being OSDA. It's implemented in Rust, but it's not using any of the Rust compiler and it's not embedded inside Rust; it is a standalone language. It's just that I take a lot of inspiration from Rust, and I implemented the compiler in it. It's targeting Verilog, which is not great, but it's easy to do and all the tools support it, so that's why I decided to do it. I would like to explore something else, like CIRCT, Calyx, or RTLIL, because that would give me a whole lot more nice features.

That's all I have to say. If you want to learn more, there's a website, or you can follow me on Mastodon, where I ramble about this language.


  1. I'm trying to grasp how this fits into your first diagram of the varying levels of abstraction. How would you compare its capabilities with something like Chisel, for one, I guess?

    For Chisel, I actually have a slide, because I get asked this question a lot. So, of course, Spade is a new language; Chisel and all the other languages are going to be way more mature. I would say all of these languages kind of push the abstraction level, but Chisel in particular, well, Chisel and all the other hardware construction languages, as I think they call them, do so through meta-programming. But when you can't use meta-programming, when you're describing the individual operations that you want to perform, you're still working with bundles of bits and individual operations on those bundles of bits. So there's no pipelining feature, because it's just bundles of bits. You don't have any nice types at the hardware level; you don't have this pattern matching that you can do on the 'enums' (I didn't show that off due to time); and they're also imperative, so you have this "if this happens, set this", which I think is sort of the wrong way to view hardware. And they are also embedded languages, so to me they feel kind of clunky. You have to do "when/elsewhen/otherwise" instead of "if/else", which is fine, but then the autoformatter kind of messes with it when you do that. There's some other stuff as well; you can read the points on the slide. But I hope that answers your question. You asked about LiteX as well?

  2. Which is in the same space.

    Yeah. [...] Yeah.

  3. Could you give us an example of what you are designing with this?

    Sure, yeah. It's kind of early stage still. I have a working RISC-V processor, a five-stage pipeline thing that only supports the base instruction set for now. I've built the controller for a research project I'm working on, which is sort of doing dynamic programming, so it's feeding a bunch of stuff into a long pipeline and writing the results back to memories. I'm playing around with talking to SDRAM now; for that, I realized that I don't really want to write an SDRAM controller right now, so it's more like "Can I integrate LiteDRAM with Spade in a nice way?". And some random games: a few friends of mine built a game during a game jam.

  4. I've used Migen, the older version of Amaranth, a lot, and I'm trying to figure out how much time I'd need to invest here. I also love Rust, so that's a nice thing. The only question is how far you can push this idea of leveraging Rust-like strict typing, which brings a lot of relief over time, but at the same time, how much frustration has it been, trying to put things in that way? Has it been a positive journey? How do you feel about it?

    I've had lots of fun with this project. It started two years ago as a hobby project, then it turned into a work project, and I still find it super fun to work on. It's a fun challenge, and it's nice to have things to borrow from. On your typing point, one thing that isn't a thing in software but is a thing in hardware is modeling ports. I didn't get into that here, but I have a system similar to the lifetimes in Rust, but for modeling resources, so that you can only use a memory port, for example, once. If you try to give a memory port to two different independent circuits, the compiler will say "No! It's being used already."

  5. That seems to be going a bit heavily to the Rust side. I do enjoy the fact that when I use Migen, it's a Python-based thing and you have this introspection of the objects; there's actually a lot of power to be had there. This kind of strict typing is not something you get with Python, but at the same time there's a trade-off between [?] and ease of expression.

    Yeah, I'm very much on the typing side of things.

  6. I could follow up on that also. As I mentioned in my talk, we use essentially all of Haskell's types; it's not a subset of Haskell's types, because we just don't believe in subsetting. Apart from just generally matching types and making sure the design is right, there are other aspects you get out of that that are really beneficial. For example, in Haskell you have a concept in the type system called type classes, which is a very disciplined structure for overloading. With that you can take, for example, an 'enum' or a 'struct' like the ones from a little earlier. There are two separate questions: one is logically, "What is this 'enum' or 'struct'?", and the other is physically, "How do you represent it in bits?". One way to do it is a particular tagging scheme this way, or another tagging scheme that way. Type classes completely solve that problem: you can separate the concept of what the value is logically from what its representation is. That's one example of type classes. Another place where you use type classes is if you think of a large system that has control and status registers floating all over the place in different modules. Ultimately, that's a global space of control and status registers, and plumbing it through module interfaces is often extremely messy. Again, type classes completely solve that problem for you, because you can hide the plumbing using monadic types. We're just believers in very strong polymorphic typing in hardware design languages.
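The representation point can be sketched with Rust traits, which play a role loosely analogous to type classes here. BSV's actual mechanism is its 'Bits' type class; the trait, the 'MaybeVal' type, and the tagging scheme below are all invented for the illustration:

```rust
// The logical type: an option-like value, with no commitment to bits.
#[derive(Clone, Copy, PartialEq, Debug)]
enum MaybeVal {
    None,
    Some(u8),
}

// The "type class": how a logical value maps to and from a bit pattern.
// Swapping in a different impl changes the encoding without touching any
// logic that uses MaybeVal.
trait Bits {
    fn pack(self) -> u16;
    fn unpack(bits: u16) -> Self;
}

// One possible tagging scheme: bit 8 is the valid tag, low 8 bits the payload.
impl Bits for MaybeVal {
    fn pack(self) -> u16 {
        match self {
            MaybeVal::None => 0,
            MaybeVal::Some(v) => 0x100 | v as u16,
        }
    }
    fn unpack(bits: u16) -> Self {
        if bits & 0x100 != 0 {
            MaybeVal::Some(bits as u8) // keep only the low 8 payload bits
        } else {
            MaybeVal::None
        }
    }
}
```

The separation is exactly the one described: code matching on `MaybeVal` never mentions bit positions, and only the `Bits` impl knows the tagging scheme.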

    [Comment from the audience:] I agree.

  7. I have just one question about how you chose to do pipelining, where the user types 'pipeline' and the latency: why not let the tool do that for you?

    The answer is that I want control of the retiming, I guess. The reason I started this project was that I was doing a bunch of pipelining stuff, and if you're doing it in Verilog, you do _s1, _s2, and then you have to make sure you refer to things correctly. I still wanted that level of description; I just didn't want to do the work manually, because it seemed so easy to automate. If you want more, like specifying a latency and having the tool do the retiming for you, you should look at PipelineC, which is another HDL that does this iterative retiming automatically.


Poster Session (Coffee Break)

  • Davide Cieri, Nicolò Vladi Biesuz, Rimsky Alejandro Rojas Caballero, Francesco Gonnella, Nico Giangiacomi, Guillermo Loustau De Linares and Andrew Peck: Hog 2023.1: a collaborative management tool to handle Git-based HDL repository
  • Lucas Klemmer and Daniel Grosse: Programming Language Assisted Waveform Analysis: A Case Study on the Instruction Performance of SERV
  • Vamsi Vytla and Larry Doolittle: Newad: A register map automation tool for Verilog
  • Stefan Riesenberger and Christian Krieg: Towards Power Characterization of FPGA Architectures To Enable Open-Source Power Estimation Using Micro-Benchmarks

Andrew Kahng

Professor at University of California San Diego
OpenROAD: Foundations and Realization of Open, Accessible Design

OpenROAD is an open-source RTL-to-GDS tool that generates manufacturable layout from a given hardware description – in 24 hours, at advanced foundry nodes. OpenROAD lowers the cost, expertise and schedule barriers to hardware design, thus providing a platform for research, education and system innovation. This talk will present the current status of the OpenROAD project and the roadmap for OpenROAD as it seeks to enable VLSI/EDA education, early design space exploration for system designers, research on machine learning in EDA, and more.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Thank you for the introduction and the invitation to OSDA 2023. As you can see from the logos, it's definitely not me. It's a whole bunch of very dedicated and talented people who have worked for five years on OpenROAD. So I'm excited to see so many colleagues who are interested and who share the vision of open-source EDA. And this is a talk about OSDA perspectives from the OpenROAD project.

So OpenROAD aims for no human in the loop and tapeout-clean GDS, so RTL to GDS in FinFET nodes in 24 hours. The project has over 600 tapeouts in foundry nodes from 130 nanometers to 12 nanometers. There's a growing community of contributors and supporters, and OpenROAD supports education and outreach on many levels, whether through IEEE societies or the Google Skywater shuttles or contests and STEM workshops, et cetera. It's also the basis for research in both EDA and hardware design, and it addresses the needs of small R&D teams that otherwise face too many barriers to getting ideas into silicon. Several flows have also been built upon OpenROAD. And the project owes a lot to its unique partnership between academic researchers and the team of EDA veterans at Precision Innovations.

So where is OpenROAD going? Well, we work on pretty much every green box you see here, which stretches our team's resources. But I want to mention two overarching directions. The first is: what can we enable with unlimited copies running at the same time? We've been working on cloud-optimized physical design, what we call COPILOT, and this takes us strongly into the realm of machine learning to predict doomed subtasks and so on. There's also low-hanging fruit, like black-box hyperparameter optimization, or autotuning, which finds flow settings superior to what our internal design experts can find; they never imagined such things were even possible. And in this vein, we've spun up as many as 30,000 cloud instances at once in a project with Intel.

The second overarching direction is to enable faster and more accurate exploration of architecture and floor plan options. Shortening the time to useful PPA feedback is very valuable these days. And so there's a new macroplacer called Hierarchical RTLMP, which understands RTL hierarchy and data flow. It delivers very human expert-like results. This is an AI accelerator in Global Foundries 12LP with 760 macros, and we think users will start to autotune this new macro placer, use it for early design space exploration, build hybrid flows with commercial tools, or other possibilities. Early design exploration also depends on partitioning that understands timing and modern constraints. So TritonPart is a very strong partitioner that is also new in OpenROAD.

But I did not come from San Diego to give a 10-minute talk about OpenROAD. It's much more important to think about, I feel, what can be the lasting outcomes from this workshop, including directions for open-source design automation as a community. So what is this cartoon? It shows the hockey stick of area and power on the y-axis versus clock period, which is increasing on the x-axis. So open-source in red is worse than closed-source in blue, which is worse than some unknown black optimal hockey stick. And the red hockey stick is basically all of us. The huge challenge, then, for all of us is shifting the red hockey stick from today to tomorrow, and the question is "HOW?". So I feel one answer to "HOW?" is what we just saw, develop better engines for key optimizations, jump in where traditional EDA vendors are only wanting to tiptoe, like cloud or data for machine learning in an open way, hooks for machine learning, and so on. But that's not it. The real challenge and the most important direction, in my opinion, is efficiency as a community. So open-source EDA has a huge number of needs, too many, and not enough people. So I want to give a few thoughts about this.

So thought number one is that bars matter a lot. Is it good enough? How relevant is it in terms of functionality, the data produced, the quality of results, and so on? Whatever the answer, is it measurably and continuously improving? Is it actively supported? And these are some aspects of good enough. And can I rely on it? Because there's student code versus professional code. There's training, documentation, user community, availability with what terms and conditions. And all of these bars questions, relevance and quality in particular, are always coming at academic and open-source EDA. So one question to think about is do we need more wood behind fewer arrows, as they say.

Thought number two is if it's infrastructure, it's a commodity. Infrastructure is not differentiating. It just needs to exist. So should highway signs be green or blue? Should stop lights be vertical or horizontal? Pick one and move on. So I'm convinced that data model, database, STA, readers and writers, PDK support, loggers, extension language support, GUI, should be like plumbing and utilities. If they work, you don't think about them, which is also a bar. So in my personal universe, METRICS2.1 or OpenDB or OpenROAD or whatever, we need elements of some road bed upon which we build a road ahead for OSDA and the open hardware ecosystem. For instance, for research on machine learning in EDA, which has to include open data and explainable models and optimization benchmarking, we don't need five road beds, but we need at least one. And actually standards, interoperability, efficiency, frameworks, they're everywhere in the history of EDA and IC design. Without them, there's just fragmentation of resources and towers of Babel, which are just bad. So it really is a matter of putting more wood behind fewer arrows.

The third thought is we shouldn't forget that good proxies are essential to the development and adoption of open-source EDA. What do I mean by this? Open-source EDA is often rejected because there is poor validation or confirmation of relevance and value. And the root cause of this, if you really think, is that something somewhere was not shareable. Not shareable implies the need for high quality, open proxies. Proxy PDKs and enablements. Proxy EDA tools that we can benchmark and record data from. Proxy test cases that are relevant, that drive design automation, and its progress into the future. So if proxies are not good enough today, it actually blocks the entire community and this community needs to invest efforts accordingly. It may be a journey, but there is a saying for that ["A journey of a thousand miles begins with a single step"].

So in conclusion, this really was an OpenROAD talk. Efficiency has always been the biggest challenge. How can we move faster and achieve more as a community? So the three thoughts were first, bars of critical mass, critical quality matter. Second, infrastructure is a commodity. It's not differentiating. So pick something and move on. And it's actually very harmful to try to build five different road beds when at the end of the day, your end goal is to have a road. Third, not shareable always ends up being a blocker. So we need to continually improve proxy PDKs, tools, and enablements. What will be the lasting outcomes of this workshop? Hopefully new friendships and synergies formed today will take us more quickly to some better tomorrow. I look forward to the discussions, you know, after we're done and during the breaks. Here is kind of the usual links slide just for the video. The third item in particular is a recent talk at an NSF workshop and thank you very much for being here today. I look forward to questions.


  1. I have my thoughts; I'm a little bit skeptical. Because I think, firstly, there is in open-source some kind of need for inefficiency: you obviously have some people who want to do it differently from what is the common practice. And the second thing is that these discussions come up constantly when we need to standardize things. But the last thing you mentioned you want to do is to be involved in standardization. So those are the two thoughts I have in my mind.

    Sure. Excellent points. We are all, I think, self-selected to be very pioneering, free-spirited, and lone wolves, if you will. And there are a lot of single-passionate-developer open-source projects out there. It just so happens that to impact in a sustainable, stable way, a larger community of education, workforce development, training, bringing in a next generation of EDA researchers and developers, I personally believe that critical mass and critical quality matter. And that's been the feedback we've had through five years of OpenROAD. Through all the Birds-of-a-Feather workshops at DAC, what did the community want? They wanted a full flow. Then they wanted high quality software engineering. Then they wanted support. Before they would even begin to kick the tires. And so we kind of lived through those first few years of the project and learned a lot about what people actually want to see. And we're so far from being where we need to be. That's why I perhaps think a lot about critical mass, critical quality, and how can we actually, as a community, support seven place and route initiatives or something like that, when we don't have enough resources even for one, it seems. Hope that makes sense.

  2. ... to be clear, I don't disagree with that problem statement. It's that I see a lot of problems to get a solution to that problem statement.

    There's a lot of fantastic work, and I feel like if we share with each other the know-how that, "Oh, there is an open-source FinFET capable detailed router out there", or, "There's another one at Chinese University of Hong Kong", or, "This is the one static timer that understands generated clocks and timing exceptions", you know, in a sort of industrial sense. And we share this and build upon each other's works, then we will move faster. It's a matter of attitude, culture, personalities, of course.

  3. How about excitement?

    Excitement. I think people get excited. I mean, I feel like the OpenLane folks must be very excited. The SiliconCompiler folks must be excited. You know, there's high school students, there's classes. UC Santa Cruz Extension teaches a course based on OpenROAD in a couple of classes, and they, I think, were having a workshop two days ago on OpenROAD for ASIC design at UCSC Extension in the Valley. There is some element of virality and excitement, but if you're a glass-half-full person, it's very motivating. If you're a glass-half-empty person, it is also motivating in the sense of, "Oh, there's so much left to do and so little time", because, you know, the world moves fast. Moore's law has always been 1% every week. That's been tough to keep up with. And that's why I bemoan redundancy and waste or inefficiency in some of the ways I mentioned.

  4. Thank you for the presentation. May I ask, I mean, can we... Actually, I'm most afraid of hearing this. I mean, it's not just, you know, me. Can we have a chance to be inspired by the Linux development in its early stage, like a kernel mailing list? Can we have a structure of maintainers rather than a harsh leadership?

    I think we've always had that in mind. Even the second program manager of the DARPA program that spawned so many projects in hardware and software, you know, had a vision of Blue Hat, not Red Hat. And we've always talked about the Linux of EDA, that sort of thing. Indeed, the funding agencies do support sustainability initiatives. How to take this into a freemium model or software-as-a-service model. Introductions to venture capitalists. All of those possibilities probably are viable today. On the other hand, people who have worked in open-source EDA for these past years, as I said, are very self-selected. I mean, they escaped from venture-funded EDA startups and big EDA companies. So, you know, how to maintain the attractiveness of open-source development, the rewards that are very personal, you know, seeing young people attracted to the field, without sort of saddling them with yet another startup life that they don't want to return to. That's been kind of a challenge. I think most people in the room who are in small companies are very self-selected to have the lifestyle of, you know, doing good, often near the end of their careers, after decades of doing the grind. And I think personally I see this challenge in finding sustainable futures. We talk a lot to philanthropists and I understand others do as well. But, you know, it's a few million dollars for each independent effort every year to even have six highly-skilled developers plus some grad students and, you know, contests or whatever. There's not a lot of multiples of millions of dollars per year lying around for us to harness. And I wonder how we will manage as a community to do the best we can with what we can harness. That's perhaps the challenge.


Jean-Paul Chaput

Engineer at Sorbonne Université
Coriolis -- A FOSS RTL to GDSII Toolchain

The talk will present the RTL-to-GDSII toolchain Coriolis, its current features, and future plans. A special emphasis will be put on the challenges of making such a toolchain.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Well, good afternoon everyone. I'm Jean-Paul Chaput, and today I will present Coriolis, which is an RTL-to-GDSII toolchain developed at Sorbonne University.

So here is a very simplified view of the design flow. On the left you have the hardware description languages, whatever they are; then we go through logical synthesis, then physical synthesis, and then we get the layout. To be completely accurate, Coriolis is the blue part; this is the part that we specifically developed at Sorbonne University, but we have also developed a framework to manage all the tools of the chain. The red ones are third-party, developed outside of our lab, but we use them. When we developed Coriolis, we wanted to do something different from what was done at that time, which is between 15 and 20 years ago. We wanted to make integrated tools. That was the key point of making Coriolis. We did not want to only share the underlying database and have tools run one after the other, for example running first the placement, then the global routing, and finally the detailed routing. What we wanted is that all the tools reside in memory at the same time; the tools are run sequentially, but parts of them can run out of order. For example, typically we use that to tightly integrate the global routing and the detailed routing, and some parts of the detailed routing are run before the global routing. The fact that everything resides in memory allows full communication between the tools. We don't need files, we don't need extra things. We can really bind the tools together very tightly, and this is a key point of Coriolis. So you have the database on the left, and the end result is that, finally, we wrap everything in Python. The whole set of tools is completely scriptable in Python. I mean by that that, in the end, there is no single binary with Coriolis; we have just a set of libraries, which are then bound by Python. We write the computationally intensive parts of the tools in C++, put them in libraries, and then we assemble the libraries the way we want.
And, for experimental needs, we can rearrange the tools; we can do a lot of experiments with Python, which is much faster than doing it directly with C++ code. We do that for fast prototyping. So the end result is a mix of highly efficient C++ tools and Python glue, and maybe even Python algorithms sometimes. Part of it is written in Python, part of it is written in C++, in an almost seamless way; we can really efficiently communicate between Python and C++. This has allowed us to completely integrate analog design. There is one feature which is not represented here, but we will see an example at the end of the presentation: we can completely mix analog and digital design. For the designers in the room, there is no longer an analog-on-top or digital-on-top approach; it's seamless. This is still a demonstrator, but we have all the capabilities to do it, and we will achieve it sometime soon. So, what are the current capabilities of Coriolis? Making ASICs is very difficult; I think everyone who has tried to make a GDSII that passes all the verifications is aware of that situation. So we started by targeting the mature nodes, that is, 130 nanometers and above, typically Skywater. And we target mature nodes, and we will slowly go down to more advanced nodes as we add feature after feature, mainly timing closure. So what is our future strategy; how do we manage to do that? As I said at the start of the presentation, we have completely integrated tools, but they still run basically sequentially. Even if some parts of them can be run out of order, we are basically sequential. What we want to do to reach advanced nodes is to manage timing closure, and for that we want to go to another level of integration between the tools. That is, instead of having the tools run sequentially, we want them to run step by step, through a progressive refinement process.
The idea is basically to make one step of placement, then perform an analysis of the timing, extract some constraints, some information, that will guide the next step of the placement. And that involves global routing and placement, and we will go down progressively until all the objectives are met. And this is our next big step. So this is the challenge.
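The progressive-refinement idea described above can be sketched abstractly as a loop (the function names here are hypothetical, not the actual Coriolis interfaces): one placement step, a timing analysis of the partial result, and constraints derived from that analysis to guide the next step, repeated until the objectives are met.

```python
def refine(design, place_step, analyze_timing, derive_constraints, max_iters=20):
    """Run placement step by step, feeding timing analysis back as
    constraints, instead of running the tools strictly one after another."""
    for _ in range(max_iters):
        place_step(design)                  # one incremental placement step
        slack = analyze_timing(design)      # analyze the partial layout
        if slack >= 0:                      # all timing objectives met
            break
        derive_constraints(design, slack)   # guide the next step
    return design

# Toy demonstration: slack improves by one unit per placement step, and
# the loop stops as soon as it becomes non-negative.
design = {"steps": 0}
fed_back = []
refine(
    design,
    place_step=lambda d: d.__setitem__("steps", d["steps"] + 1),
    analyze_timing=lambda d: d["steps"] - 3,
    derive_constraints=lambda d, s: fed_back.append(s),
)
```

The point of the structure is that analysis results flow back into placement at every step, which is only practical because all the tools share one in-memory database.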

So here is the first example of a design that we did. This is the first stage of an OpAmp (I don't know the exact name). The point is that on the lower part, you see the analog design. It's not very compact, because that was not the point here; this is a test example, and it is mixed with a decoder on top. The part on top, we see clearly, has a different layout: it's the digital part, made of standard cells, and it performs a decoding task, which controls the little devices exactly below, the set of horizontal lines you see. And it's completely integrated: the global routing and the detailed routing are done with the same structure for both parts, analog and digital. And the detailed and global router can manage specific constraints of the analog part. Next, what we did very recently is this small test chip with the PragmatIC technology. It's only a very small one, 760 standard cells. It's a small thermometer, and it was made by PragmatIC, with their flexible technology. It's a four-metal-layer technology, of which two are available for routing. So it's a bit tough for the router, because it has only two metal layers for routing, and it was still able to complete with over-the-cell routing. And the chip was made, and it did work the first time, with a yield of around 70%. So it was very interesting.

The next one is the biggest yet that we did with Coriolis. It's 1.3 million transistors. It's an implementation of the LibreSoC chip, an OpenPOWER architecture, so it's quite different from RISC-V. And we were only partially able to test it, due to some difficulties. We were only able to check the PLL, but that says a lot for us, because it means that the I/O pads work. It means that the standard cells work, especially the D flip-flop, because the PLL contains D flip-flops. And the PLL did work, and generated a clock at the expected speed. But due to some problems out of our control, we weren't able to fully test the chip.

And finally, we also made a little RISC-V through ChipFlow, which was sent to the MPW4 program from Skywater. For that one, we don't know if it works or not, because we are still waiting for the chip, so we haven't been able to test it yet. The point of Coriolis, to step back a little, is that it is so thoroughly integrated with Python that you can describe your whole design with just one Python script. Even the Makefile-like dependencies, and the fact that you want to run Yosys or some other tool, fit inside one Python script, and not a very long one. In fact, in those kinds of scripts, the longest part is the description of where the I/O pads are. If you have 200 I/O pads, then you need 200 lines, one per I/O pad. The rest is just calling the tools. It's fully customizable; you can do whatever you like. One other point is that with the Coriolis project, not only do we want to provide tools, but we also want to provide blocks, and especially portable blocks. It has always been a big problem that when you change technological node, most of the time you have to do a lot of work re-doing or re-validating your standard cells, and it is even worse for analog blocks. So what we are also developing is portable analog blocks, and this will be another outcome of the project. So I think I am done, maybe a little too short, so now I'm waiting for questions.
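To make the shape of such a one-script flow concrete (this is a hypothetical sketch, not the actual Coriolis API), the pad list dominates the script and the rest is a handful of tool calls:

```python
# Hypothetical one-script flow driver; the real Coriolis API differs.
# The I/O pad table is the longest part, exactly as described above.
IO_PADS = {              # pad name -> (chip side, position on that side)
    "clk":   ("north", 0),
    "reset": ("north", 1),
    "d_out": ("south", 0),
}

def build_flow(top: str, pads: dict) -> list[str]:
    """Return the ordered steps of the whole flow as plain commands."""
    steps = [f"synthesize {top}"]       # e.g. invoke Yosys here
    steps += [f"place_pad {name} {side} {pos}"
              for name, (side, pos) in pads.items()]
    steps += [f"place {top}", f"route {top}", f"write_gds {top}"]
    return steps
```

With 200 pads the dictionary grows to 200 entries, one per pad, while the synthesis, place, route, and GDS-writing calls stay the same few lines.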


  1. There are other tools like OpenROAD; why did you go for Coriolis, then?

    First, in fact, in terms of chronology, we were the first; the Coriolis project started around the year 2000. But as we are a small team, we have to progress slowly and methodically through the tools. So this is why... And we have... One of the problems is about the database. We developed a very specific database, exactly tailored to suit our needs. After a long time developing on top of it, it is difficult to switch to another one. And I would say it's not very beneficial: if you change your database, basically you redo exactly what you have done before, but on another database. So unless there is a very big incentive to do it, I mean, unless you gain something at the end, just switching the database has, at least for now, no interest. But in fact, as I said, we started more than 15 or 20 years ago, depending on the starting point. We are thinking about changing our database; it may occur in the future, but it will be a slow move, and one which is very well planned.

    [Comment from the audience:] One comment is that we all want to use OpenAccess. But OpenAccess is not open. So we use the Athena Design Systems database; it was donated to the project, and that was where we started. So it's important to have open access, as in saying, "Come to my house, do not worry about the key". And I think the world wants something like that.

  2. Can you share your timeline for the whole mixed-signal flow?

    I don't have a timeline yet. I know exactly what we have to do. We are in the process of hiring people, and it will depend on if we succeed or not. I cannot give you a definite timetable, because I don't know yet. It's too difficult to see now.

  3. ...I'd like to see it by now.

    Yes, I would say it's almost working. So it depends on the incentive. We can re-focus our priority if there is a demand. Up until now, it was not our top priority, because we have other requests. But that can change.

  4. I have a question, a more technical question. You said that when you have certain routines, and you are calling people at a time, is that a possibility?

    Sorry, I don't hear you well.

  5. There is a set of lines, and then you are calling the things that you need, etc. And you said that mostly you store it in memory. So how do you, let's say, cope with this kind of limitation? Because, you know, computers so far have limited memory. So at certain moments, when the complexity of the chip goes up, how do you handle that?

    I think for now, we rely on the memory. I mean, we can only manage chips that fit into the memory of the system. But until now, we have not reached that limit. I mean, we did not make a very huge chip. The biggest one we made is the TSMC one, and I think it fitted in less than 10 gigabytes. So, we are quite compact in memory.


Myrtle Shah

PhD student at Heidelberg University
nextpnr & FABulous: customisable custom hardware

Myrtle will introduce some of the recent developments in nextpnr; including easier ways of prototyping new architectures as well as some core algorithm improvements. They will also introduce FABulous, a highly flexible open source eFPGA fabric generator, and its close integration with nextpnr.


[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

So hey, I'm Myrtle. As said, I'm at Heidelberg University. I've been with a couple of different places in the past, but throughout that I've been the lead developer and maintainer of nextpnr, an open-source place-and-route tool that's targeted at real-world FPGA fabrics, so both commercial FPGAs but more recently taped-out academic FPGAs as well.

So nextpnr has been in development since May 2018, so going on five years. In that time, a bit like a cat with nine lives, nextpnr has had about four or five different employers that have been paying me to work on it in slightly different capacities, but I've stuck with it, and I probably have more loyalty to nextpnr than anything else at this point, so it's kind of my pet project as much as anything else. It's open-source and it's targeted at multiple architectures, so we make sure none of the core code is specific to any particular FPGA. This isn't some throwaway research FPGA placer that was intended to work on one model of UltraScale with a restricted set of primitives. We're really looking to be able to support any functionality of any FPGA and provide a tool that real users can use for real designs. More recently, as well as the various commercial FPGAs that we support, we have also been working on support for academic FPGAs, in particular FABulous eFPGAs, which will be the second half of my talk today.

So, one of the things I've worked on more recently in nextpnr: if you've ever looked inside the nextpnr internals, the way the code is implemented, it has a fairly complicated API that you have to implement in order to add new FPGA families. And if you're implementing a big Xilinx FPGA, that's great, because it gives you a lot of control over things like data structures that you really need when you're scaling to a big FPGA. It enables you to implement the really complicated constraints you might get in an FPGA architecture, and deal with all the things like IOs, SERDES, PLLs, custom clock routing, and complicated validity rules inside slices. But when you're dealing with something like an eFPGA and you just want to throw something together quickly, implementing that big API generally involved just copying and pasting a bunch of code, and that really wasn't ideal. So Viaduct gives you a way of building up the representation of the FPGA in memory. It's only going to scale up to about 25,000 LUTs, but if you're prototyping an academic eFPGA and want to bring up place and route quickly, it's perfectly good for that. The core nextpnr deduplicated custom API approach can go well beyond a million LUTs in terms of database scalability, but we don't need that here. And the nice thing about Viaduct in terms of prototyping FPGAs is that you can still bring in a lot of the custom constraints and things that nextpnr gives you, which are really important for easily targeting real-world FPGAs, but you don't have to use them from the get-go. You can really start from less than a thousand lines of code to add a new FPGA into nextpnr. So I'm not actually going to talk that much about the core of nextpnr today, mainly because it hasn't actually changed that much since some of the previous talks that I've done, possibly looking slightly different, but yeah.
But even back in those days, the core place and route algorithms haven't changed that much. There are some future plans, but yeah, they're still down the line.

So yeah, on to FABulous. FABulous is an eFPGA fabric generator from Manchester and Heidelberg Universities. It's for building custom FPGAs for your application, and it's a very customizable generator, in a similar spirit to nextpnr being a very customizable place and route tool. FABulous is incredibly customizable in terms of the type of fabrics you can build with it. It's not just throwing together your standard CLBs; it also has a lot of flexibility for including things like custom blocks, so things like DSPs, block RAM, register files, adding interfaces to hard CPUs, and you can also use its routing graph framework to build things like CGRAs, things with coarse-grain reconfigurability, and hard wiring between IP cores. Almost anything that you could put in an FPGA, the FABulous framework is flexible enough to model as well. And we've been trying out FABulous with some test tape-outs on some of the open-source Google shuttle runs, and the new possibilities these shuttle runs have opened up seem to have been quite a core theme throughout today's workshop. Probably a big change since we last had these workshops, back in the before-COVID times, is the possibilities that these open-source shuttle runs have brought: instead of just theoretically talking about ASICs, we're coming back with real chips and we're able to test real chips. And in our case, we have real FPGA silicon from MPW-2 that's come back and is working, and we can build bitstreams and run it.

So yeah, I think this is actually mostly stuff that I've talked about previously, but yeah: FABulous is flexible, and it even supports both Verilog and VHDL, so you don't have to have a holy war, you can just pick whichever one you prefer. As well as that, it can generate the data that nextpnr needs in order to place and route for that fabric. And of course we have Yosys support for doing synthesis. It uses a latch-based configuration architecture. This is always a trade-off. The simplest approach to FPGA configuration is just having a shift register, but that's about twice as big, because it needs D flip-flops instead of latches. And it's also less robust, because as you're shifting things in through the fabric, you go through a whole series of different configurations every time you shift, and there's actually a risk of creating configurations you don't want, like ring oscillators. In an ideal world you might use something like an SRAM cell for your FPGA configuration, but the problem is, once you start doing that, it's an incredibly process-specific thing. You can't then just easily change your primitive and rebuild your fabric; you've got to design a whole new custom primitive rather than just inserting a foundry cell. So that's why we settled on latch-based configuration. And our configuration interface, which lets us reprogram individual lines of latches, also gives us partial reconfiguration support without actually having to do any extra work. That just comes for free with our fabric design.
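A tiny model (illustrative only, not the FABulous implementation) shows why plain shift-register configuration is risky: while a bitstream is shifted in, the fabric passes through every intermediate configuration, any of which could briefly form something unwanted like a ring oscillator; latch-based configuration with per-line writes avoids marching through those states.

```python
def shift_in(chain_len: int, bitstream: list[int]) -> list[tuple[int, ...]]:
    """Shift a bitstream into a configuration shift register, recording
    every intermediate state of the chain (newest bit enters at index 0)."""
    chain = [0] * chain_len
    states = []
    for bit in bitstream:
        chain = [bit] + chain[:-1]   # one clock of the config shift register
        states.append(tuple(chain))
    return states

# Shifting in four bits exposes four transient configurations before the
# intended one is finally in place.
states = shift_in(4, [1, 0, 1, 1])
```

Each recorded state is a configuration the fabric actually assumes for one clock cycle, which is exactly the hazard the talk describes.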

So these are actually the two fabrics we taped out on MPW-2. The one on your right is the one that I've mostly been working with. That's a pure FPGA. It's a slightly bigger FPGA, with DSPs, block RAMs, register files, and LUT4s. The one on your left is the one that has two hard IBEX RISC-V cores added to it. That one is still a bit of a work in progress to bring up because of its higher complexity. But yeah, those are our two fabrics.

And to give you an idea of the kinds of designs we can build on this FPGA, this should actually be animated, but it's a little demo showing a whole bunch of different primitives. We have the block RAM, which is actually OpenRAM-based, containing the texture data. We have DSPs, which are being used as multipliers for the perspective transform. And then we have a bunch of general logic, which is doing things like some multi-cycle dividers for the transform. So yeah, this is just a little animated scrolling-road output on VGA. I think this is about 500 LUTs or so of logic, plus the DSPs and all the block RAMs.

So in the scheme of the Open MPW runs so far, we're not quite dealing with perfection yet. This isn't actually quite the same as a lot of the other hold-time problems, because this wasn't a hold-time problem caused by mis-characterizing anything. It's a hold-time problem because we simplified the clock architecture from a clock tree to a clock ladder, and that essentially means that some patterns of data routing will leave the data delay mismatched with the clock delay, and you get hold-time problems. The idea always was that we should be able to fix this in nextpnr. But yeah, there's a bit more work to actually get the fix-it-in-software done. Not least, we actually need to extract timing analysis data from the fabric and build a timing model for nextpnr. Once we have that, I expect we'll have a pretty robust fabric working, and something where we can potentially make some cute business cards or something and show off our custom-made FPGA fabric on an open process, which will be very nice.

So talking a bit more about our future plans: one of the weak spots of nextpnr has always been timing analysis. As well as things like the hold-time fix-up that we'll need for those FABulous FPGAs, we also need to be able to do things like cross-clock-domain constraints. You can constrain individual clock frequencies in nextpnr, but you can't do things like constrain the minimum or maximum delay between clocks, multi-cycle paths, that kind of thing. That's probably the biggest priority in terms of increasing the usability of nextpnr. nextpnr has a GUI. It's a bit of a basic GUI, but it's there. And one of the things that would be nice is to support FABulous fabrics in the GUI. That's obviously a bit more complicated than what we've done in the GUI before, where we've had a fixed FPGA like an iCE40, because people can make all manner of FABulous fabrics, and we have to work out things like the layout of wires and blocks in the GUI automatically. And then there's the usual stuff, which I think has cropped up in my plans in every nextpnr presentation I've done, of how we're going to improve the place and route in the future. My current project, something more of a personal project, is an electrostatic placer for nextpnr. This is a very common algorithm in ASIC placement, and it's becoming more accepted in FPGA placement as well. It uses essentially the principles of electrostatics to optimize the placement. You imagine that your cells are charged particles, and you have some forces pulling them together, because you want to minimize wire length, but you also have some forces pushing them apart, because you don't want cells overlapping, since that's not a legal placement. And then you can do a whole bunch of maths, and luckily that maths is fairly well researched. A big part of it, interestingly, boils down to some Fast Fourier Transforms, which are, once again, a very well researched thing.
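
As a rough illustration of the electrostatic idea, here is a toy one-dimensional sketch under heavy simplifying assumptions (this is not nextpnr's placer; real implementations solve the repulsion field with FFT-based Poisson solvers rather than the naive pairwise sum used here). Cells sharing a net attract each other, while all cells repel each other like equal charges:

```python
def placement_step(xs, nets, step=0.1, repulsion=0.05):
    """One gradient-style step combining net attraction and cell repulsion."""
    forces = [0.0] * len(xs)
    # Attraction: pull each cell toward the centroid of each net it is on
    # (the wirelength-minimizing term).
    for net in nets:
        centroid = sum(xs[i] for i in net) / len(net)
        for i in net:
            forces[i] += centroid - xs[i]
    # Repulsion: cells push apart like equal charges
    # (the overlap-removal term, here computed pairwise for clarity).
    for i in range(len(xs)):
        for j in range(len(xs)):
            if i != j:
                d = xs[i] - xs[j]
                direction = 1.0 if d >= 0 else -1.0
                forces[i] += repulsion * direction / (abs(d) + 0.1)
    return [x + step * f for x, f in zip(xs, forces)]


# Cells 0 and 2 share a net; cell 1 is unconnected filler sitting nearby.
xs = [0.0, 0.2, 5.0]
nets = [[0, 2]]
for _ in range(50):
    xs = placement_step(xs, nets)
# Cells 0 and 2 are drawn together; repulsion keeps a small legal gap.
```

The equilibrium spacing falls out of the force balance: the attractive spring term shrinks the net while the repulsive term prevents the cells from collapsing onto the same site.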
This can also be a pretty easy placer to accelerate, because there's lots of existing work on, for example, GPU acceleration of these kinds of algorithms. And a couple of other people are working at the moment on some Rust bindings for nextpnr's API. That again might make it easier to research things like parallel algorithms, where Rust potentially has nicer paradigms available than C++. For example, one of the projects that inspired that was a partition-based router: for most of the nets, which don't need to cross a large amount of the design, you can fire them off to different threads and route them entirely in parallel, if it's known that they don't overlap. So that's kind of an idea of where the nextpnr roadmap would like to lead.
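
The partition-based routing idea can be sketched like this (hypothetical partition names and a stand-in "router"; also note that in CPython, threads would not give a real speedup for pure-Python work, so this only illustrates the decomposition, not the performance):

```python
from concurrent.futures import ThreadPoolExecutor

def route_net(net):
    # Stand-in for a real maze router: just return the two endpoints
    # as a trivial "path".
    (x0, y0), (x1, y1) = net
    return [(x0, y0), (x1, y1)]

def route_partition(nets):
    # Nets in one partition only touch that partition's routing
    # resources, so partitions share no state and can run in parallel.
    return [route_net(n) for n in nets]

# Nets pre-sorted into spatially disjoint regions of the fabric
# (partition names are made up for the example).
partitions = {
    "top_left": [((0, 0), (1, 1)), ((2, 2), (3, 3))],
    "bottom_right": [((8, 8), (9, 9))],
}

with ThreadPoolExecutor() as pool:
    routed = dict(zip(partitions, pool.map(route_partition, partitions.values())))
```

The hard part in a real router is the claim in the comment: proving ahead of time that the nets in different partitions cannot contend for the same wires, and falling back to a serial pass for the long nets that cross partition boundaries.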

And yeah, that's my email address if you have any questions, and also a credit for the cat girl picture that's adorning the side of the slides.


  1. Yeah, you said the whole demo that you gave for the road was about 500 LUTs, so what's the utilization of that? How many LUTs total?

    I think there's about 850 LUTs or so total. So yeah, just a bit over half utilization. I've tested some high-utilization things just as quick tests. So for example, making a chain of inverters that goes through every LUT, just to get a rough idea that every LUT is working. But yeah...

  2. And is that half utilization about as high as you can push it?

    We can definitely push a bit higher than that. What utilization you can reach depends a lot on how dense the routing graph is. iCE40 FPGAs have a very, very dense routing graph; we can push an iCE40 FPGA well beyond 95% utilization. ECP5 is a bit less so, so probably 85% is the highest you want to go. For FABulous we haven't really looked into it in great detail, but I'd guess again about 85% or so is the highest utilization you really want to be using before you start hitting, at a minimum, some timing problems.

  3. Thank you for the work on this. You mentioned in the second part of the slides that you aren't requiring any [?], but that seems like way higher up the stack than just nextpnr. Is that also a full flow for the eFPGA?

    So, of course it supports Verilog or VHDL for the designs themselves, because it uses Yosys. This is in terms of the Verilog that FABulous generates for the FPGA fabric itself, which is essentially a netlist of latches and MUXes, plus whatever other primitives you have.

  4. Your layout shots of the fabric look very regular. Are you playing tricks like we heard earlier, where you place and route one tile and repeat it?

    Yep. So, for each tile type: I think in that fabric, aside from the tiles around the edges, we have basically three basic tile types, the LUTs, the register files, and the DSPs. Plus we have some edge tiles, and the block RAMs are separate blocks. But yeah, each one of those is basically placed and routed as a fixed macro and then just stamped out a hundred times across the chip or whatever.

  5. And that's the same question I asked before. Do you see applications of this where the FPGA is used as glue around more efficient, customized, I don't know, DSP or...

    Yeah, definitely. That's probably going to be one of the topics of my PhD thesis: things like designing the ideal FPGA for an application-specific use case. So maybe there are specific kinds of DSP blocks, for example, that would suit a certain application best, or a particular split between FPGA and hard macro. Or, for an SoC for example, you could have today's crypto algorithms as hard blocks, but then have some FPGA fabric in order to implement whatever crypto algorithms might come out in the future that you might want to accelerate as well, if it's a long-lifetime IoT project or something.

  6. Hi, [audience member] from New York University. What differentiates FABulous from OpenFPGA?

    So, yeah, to be honest with you, I think one of the big things is that we have this high level of customizability, but also simplicity. You're not forced into any particular way of doing things. We're very focused on exploration and on having a code base that's easy to hack around with. And we have an incredibly simple format for specifying things like how your routing graph looks, so you can really easily play around with different routing graphs, that kind of thing. And we're also, I think, probably a bit ahead in terms of things like playing with the shuttle runs and stuff. So yeah, that's kind of where we are.


Tristan Gingold

HDL Developer at CERN
GHDL in the FOSS EDA ecosystem

GHDL is an open-source VHDL simulator and synthesis tool. This talk will present the latest added features and some ideas for future development (in particular, mixed simulation).

Video will be released July 17, 2023

Jim Lewis

OSVVM Architect at SynthWorks
Quick Overview of OSVVM, VHDL’s #1 Verification Methodology

Open Source VHDL Verification Methodology (OSVVM) provides VHDL with buzzword verification capabilities including Transaction Level Modeling, Constrained Random, Functional Coverage, Scoreboards, FIFOs, Memory Models, Error and Message handling, and Test Reporting that are simple to use and feel like built-in language features. OSVVM has grown rapidly during the COVID years, giving us better capability, better test reporting (HTML and JUnit), and scripting that is simple to use (and works with most VHDL simulators). This presentation shows how these advances fit into the overall OSVVM Methodology.


Welcome to the OSVVM 15-Minute Waltz. Let's go ahead and get started.

What is OSVVM? Well, first up, it's a verification framework. It's a verification utility library that implements the things that make VHDL a full verification language. It's a verification component library with a growing set of verification components. It's a script library that allows us to write simulator-independent scripts. It's a co-simulation library that lets us run software on top of our hardware in the simulator. It generates test reports: HTML for humans and JUnit XML for continuous integration tools. OSVVM is free, open source, available on GitHub, and it's developed by the same VHDL experts who have helped with the VHDL standards.

As a framework, it looks very similar to the SystemVerilog framework. We have verification components that implement interface signaling, and we have a test sequencer that has sequences of transactions that implement our test case. Then each test case is a separate architecture of test control. Our framework is simply structural code, and it's simple, just like RTL code. So we have an instance of the DUT, we have instances of our verification components, and we have an instance of our test sequencer.

Elements of our framework: We have transaction interfaces, such as 'ManagerRec'. We have the transaction API, such as 'Write()' and 'Send()'. We have verification components, and we have the test sequencer.

We have our model-independent transactions. This comes from the observation that many interfaces do similar things. So OSVVM has codified the transaction interface and transaction API for stream interfaces, such as AxiStream and UART; in there you'll find our transaction interface implemented as a record, and our transaction API implemented as procedure calls. These procedures here are a subset of what's in the OSVVM library. And then we have another set for address bus interfaces, also called memory-mapped interfaces, such as Axi4 or Avalon; again we have our record and our transactions, and again this is a very small subset of what's in the library itself. The benefit of this: we simplify verification component development, and we simplify the reuse of common test cases and sequences.

Our verification components have a DUT interface, such as the record shown here for the 'AxiBus', though it could also be individual signals. And then it has a transaction record that uses one of the types from the previous slide; here we're using the 'AddressBusRecType'. Inside a verification component, if it's a very simple one, we're calling 'WaitForTransaction', which waits until a transaction has been called; then the record has something in it, so we decode the record and do the operations. Now, we don't often have sub-programs in here; more often I write the code inline. Benefits: verification developers just focus on the model functionality and don't have to do the other stuff, which is provided by the OSVVM model-independent transactions.

Our test sequencer has transactions on its interface, and maybe also Reset. Inside is where our test case is, and each test case is in a single file. We have a Control Process that initiates and finalizes transactions, and then we have one process per interface, so we're concurrent just like the design is concurrent. Our tests are simply calls to the transactions, and it's easy to mix in directed tests with constrained random tests, scoreboards, and functional coverage. We also have synchronization utilities that help us synchronize these independent processes, such as here at the beginning, or here as the test is done. Because as we run the test we're checking for errors and recording them in a data structure, at the end of the test the control process calls this 'EndOfTestReports()' procedure, which reports all of the errors for the test and creates YAML files that the scripts convert into the HTML reports.

Writing a directed test is easy. We simply call the transactions, such as 'Send()' on the TX side or 'Get()' here on the RX side. We can do some checking with 'AffirmIfEqual()' from the AlertLog package, or we can do checking by instead calling 'CheckTransaction()' from the transaction interface. As for the test output of 'AffirmIfEqual()': if it passes, it produces a 'Log PASSED'; if it fails, it produces an 'Alert ERROR'. The benefit here is that we've greatly simplified writing self-checking tests and we've improved readability.

Our constrained random tests are simply a call to something from our randomization library, such as 'DistInt()' here. This one, 70% of the time, is going to generate a zero for us, and in that case we're going to generate no errors and pick a data value between 0 and 255. The other possibilities are one, for a parity error, or two, for a stop error. Note that we set up the operation to be a stop error, but we also randomize a different set of values; this is the nature of constrained random. And then we do our transactions. Here we're setting up our transaction and calling it at the end, but we could also be calling it within those 'case' branches and could be doing more than one transaction per branch. So our constrained random approach in OSVVM is randomization plus code plus transaction calls.
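
In Python terms, the pattern might look like the sketch below (illustrative names only; the real OSVVM code is VHDL using 'DistInt()' and transaction procedure calls). A weighted pick selects the error-injection mode, each branch randomizes its own operands, and a single transaction call happens at the end:

```python
import random

def random_uart_stimulus(rng):
    # Weighted pick: 70% clean, 20% parity error, 10% stop-bit error,
    # in the spirit of DistInt((0, 7), (1, 2), (2, 1)).
    mode = rng.choices([0, 1, 2], weights=[7, 2, 1])[0]
    if mode == 0:
        op, data = "no_error", rng.randrange(256)
    elif mode == 1:
        op, data = "parity_error", rng.randrange(256)
    else:
        # Each branch can constrain its operands differently;
        # here the stop-error branch uses a narrower data range.
        op, data = "stop_error", rng.randrange(16)
    return op, data  # one transaction call per stimulus

rng = random.Random(1)
stimuli = [random_uart_stimulus(rng) for _ in range(1000)]
```

This is the "randomization plus code plus transaction calls" shape: the randomizer picks a branch, ordinary code constrains the operands, and the transaction is dispatched once per iteration.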

Now, we could do checking the same way we did previously and repeat the sequence on the 'Receive' side, but we really don't want to do that, because it's tedious and error-prone. Instead, we can use a scoreboard. A scoreboard is a data structure used for checking data when it's minimally transformed, such as sending something across a UART and receiving it somewhere else, like we're doing here. Our scoreboard has a FIFO and a checker inside; it uses package generics so that we can support different types, and it handles small data transformations, out-of-order execution, and dropped values.

Using a scoreboard is pretty easy: we set up an object of the 'ID' type and then we construct the data structure. We're building this in the package, and it's actually a singleton data structure that we have sitting out there. Then we call 'Push()' with the handle for the scoreboard, 'SB' here, and then we do a 'Send()' transaction. On the 'Receive' side we do a 'Get()' transaction to receive the value, and then we just pass the values that we receive up to the scoreboard for checking. So we have a big benefit here in that the 'Checking' side is relatively generic and stays the same even if the 'Stimulus Generation' side changes; if we switch from a directed to a randomized test, it's still the same thing on the 'Checker' side.
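
A minimal sketch of the scoreboard concept (a toy Python analogue, not OSVVM's ScoreboardPkg, which additionally handles type generics, data transformations, out-of-order execution, and dropped values): the stimulus side pushes each expected value as it sends it, and the checking side compares as values arrive.

```python
from collections import deque

class Scoreboard:
    def __init__(self):
        self.expected = deque()  # FIFO of expected values
        self.errors = 0
        self.checked = 0

    def push(self, value):
        # Stimulus side: record what we are about to send.
        self.expected.append(value)

    def check(self, received):
        # Checking side: compare against the oldest expected value.
        self.checked += 1
        if not self.expected or self.expected.popleft() != received:
            self.errors += 1

sb = Scoreboard()
for tx in [0x11, 0x22, 0x33]:
    sb.push(tx)      # push, then Send() the value on the TX interface
for rx in [0x11, 0x22, 0x33]:
    sb.check(rx)     # Get() the value on the RX interface, then check
```

The point made in the talk shows up directly here: the two `check` calls never change, whether the values pushed were directed or randomized.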

The next thing we need is functional coverage. Functional coverage is basically code that tracks items in your test plan, such as requirements, features, and boundary conditions. Why do we do this? Well, with randomization, how do you know what the test did? So we're looking for 100% functional coverage and 100% code coverage, and that's what indicates the test is done. Now, why not just use code coverage? Code coverage tracks code execution, but it misses anything that's not directly in the code, such as binning values from a register, or things that are independent that we need to correlate.

Okay, so here we're building our coverage model, again using an 'ID' type, because again the coverage models are in a singleton data structure. We then construct the data structure, define the coverage model by defining the bins of values that we want to see, and then call 'ICover()' to collect the coverage. It's simple. In fact, functional coverage with OSVVM is as simple and concise as language syntax.
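
The bin-then-'ICover()' flow could be sketched like this (an illustrative Python toy, not OSVVM's CoveragePkg): define value bins, sample observed values into them, and report what fraction of the test-plan items were hit.

```python
class CoverageModel:
    def __init__(self, bins):
        # bins: list of (lo, hi) inclusive value ranges to observe.
        self.bins = bins
        self.counts = [0] * len(bins)

    def icover(self, value):
        # Collect coverage: bump the count of every bin the value hits.
        for i, (lo, hi) in enumerate(self.bins):
            if lo <= value <= hi:
                self.counts[i] += 1

    def coverage(self):
        # Percentage of bins seen at least once.
        hit = sum(1 for c in self.counts if c > 0)
        return 100.0 * hit / len(self.bins)

# Boundary-condition style bins for an 8-bit value:
# zero, low range, high range, and all-ones.
cov = CoverageModel([(0, 0), (1, 127), (128, 254), (255, 255)])
for v in [0, 5, 200, 255]:
    cov.icover(v)
```

This also shows why code coverage alone isn't enough: all four samples would execute the same code path, yet they land in four different bins that the test plan cares about.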

Now we also go further. We can do introspection of the coverage model and create what is considered to be runtime coverage-driven randomization. We call this intelligent coverage randomization; what we're doing internally is randomizing across the coverage holes. So we start out the same way: we create our coverage object, we call our constructor, and then we build out our bins. But now we add coverage goals to our bins. These become randomization weights, and then we call 'GetRandPoint()' with the coverage model to randomize a value within that coverage model. We then decode that value in much the same fashion as we did with the constrained random approach, dispatch our transaction, and record the fact that we did that transaction with 'ICover()'.
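
The "randomize across the coverage holes" idea can be sketched as follows (illustrative Python, not OSVVM's 'GetRandPoint()'): only bins still below their goal are eligible, weighted by how much coverage they still need, so the test converges on 100% coverage without redundant stimulus.

```python
import random

def get_rand_bin(counts, goals, rng):
    # Weight each bin by its remaining coverage; fully covered bins
    # get weight zero and are never picked again.
    remaining = [max(g - c, 0) for c, g in zip(counts, goals)]
    if sum(remaining) == 0:
        return None  # coverage complete
    return rng.choices(range(len(goals)), weights=remaining)[0]

counts = [0, 0, 0]
goals = [2, 5, 1]  # coverage goals double as randomization weights
rng = random.Random(0)
while (b := get_rand_bin(counts, goals, rng)) is not None:
    # Decode bin b, dispatch the transaction, then record it,
    # mirroring the decode / transaction / ICover() loop above.
    counts[b] += 1
```

Because the weights shrink as bins fill, the loop stops after exactly sum(goals) iterations with every bin at its goal, which is the efficiency argument for coverage-driven randomization.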

When we finish our test, we're going to generate our reports. The first reports you're going to see are these PASS/FAIL results for a given test, but 'EndOfTestReports()' is also essential for the other reports generated by OSVVM.

The next thing we have is our scripting. Our scripting started out as a list of files and then evolved into these TCL procedures that we call to set up our tests. So we have our library, we activate our library, then we analyze to compile things, and then we simulate. And note that when we activated the library, that same library is used for the rest of the commands that follow it; the library is set and remembered. We work on basically all of the popular VHDL simulators, with the exception of Xilinx's XSim; we're waiting on them to get good VHDL-2008 support.

We call our scripts using 'build' and 'include' rather than TCL's 'source' and the EDA vendors' 'do', because this is what allows us, when we specify a path, to make it relative to the script directory rather than relative to the directory the simulator is running in. That's important because you want to be able to relocate things on a project-by-project basis. So we use 'build' to start things off, to call things from the command line, or to call the top-level scripts from a continuous integration run. 'build' plus 'EndOfTestReports()' is what generates our reports. And then 'include' is for calling a script from another script.

Our reports: we're just going to show you one of them. Our build summary report starts out with status for the entire build. Did it pass? Did it fail? We give you links to the log file and an HTML version of the log file. If you ran with coverage, we have a link to the merged code coverage. We have a test suite summary; we break our test cases out into test suites that focus on testing one thing, like the 'AlertLog' package or the 'AxiStream' verification component. So that's a summary for the suites. And then we have the test case summaries, which give us details of how each test case within a given suite ran.

So all you need for your VHDL verification is OSVVM. We have a powerful, concise capability that rivals other verification languages. We have unmatched reuse through the entire verification process. We have unmatched reporting capability, with HTML for humans and JUnit XML for continuous integration tools. We have tests that are readable and reviewable by all: verification engineers, but also the hardware designers, and also the software and system engineers. If you can read the transactions, you can read the tests. OSVVM is set up to be adopted incrementally, and you can find us on GitHub. Thank you for attending my presentation.


Claire Xenia Wolf

CTO at YosysHQ

In her talk, Claire will discuss recent developments in open-source verification tools. Claire will briefly present equivalence checking with Yosys (EQY) and mutation cover with Yosys (MCY), and will highlight potential future directions.


Workshop closing