[For readability, moderator comments have been removed, as well as minor questions for better understanding.]
So hey, I'm Myrtle. As said, I'm at Heidelberg University. I've been with a couple of different places in the past, but throughout that I've been the lead developer and maintainer of nextpnr, an open-source place-and-route tool that's targeted at real-world FPGA fabrics, so both commercial FPGAs but more recently taped-out academic FPGAs as well.
So nextpnr, it's been in development since May 2018, so going on five years. In that time, a bit like a cat has nine lives, nextpnr has had about four or five different employers that's been paying me to work on it in slightly different capacities, but I've stuck with it and I have probably more loyalty to nextpnr than anything else at this point, so it's kind of my pet project as much as anything else. It's open-source, it's targeted at multiple architectures, so we're not making sure none of the core code is specific to any particular FPGA. This isn't some throwaway research FPGA placer that was intended to work on one model of UltraScale with a restricted set of primitive. We're really looking to be able to support any functionality of any FPGA and provide a tool that real users can use for real designs. More recently, as well as these various commercial FPGAs that we support, we have also been working on support for academic FPGAs, in particular FABulous eFPGAs, which will be the second half of my talk today.
So one of the things I've worked on more recently in nextpnr, if you've ever looked inside the nextpnr details, the way the code is implemented, it has a fairly complicated API that you have to implement in order to add new FPGA families to it. And if you're implementing a big Xilinx FPGA, that's great because it gives you a lot of control over things like data structures that you really need when you're scaling for a big FPGA. It enables you to implement the really complicated constraints you might get in an FPGA architecture, deal with all the things like IOs, SERDEs, PLL, custom clock routing, complicated validity rules inside slices, all those kind of things. But when you're dealing with something like an eFPGA and you just want to throw something together quickly, implementing that big API generally involved just copying and pasting a bunch of code, and that really wasn't ideal. So Viaduct gives you a way of basically building up the representation of the FPGA in memory. It's only going to scale up to about 25,000 LUTs, but if you're just working with prototyping an academic eFPGA, want to bring up place and route quickly, it's perfectly good for that. And yeah, the core nextpnr de-duplicated custom API approach can go up well beyond a million LUTs in terms of database scalability, but yeah, we don't need that here. And the nice thing about Viaduct in terms of prototyping FPGAs is you can still bring in a lot of the custom constraints and things that nextpnr give you that are really important in terms of easily targeting real-world FPGAs, but you don't have to use them from the get-go. You can really start from less than a thousand lines of code to add a new FPGA into nextpnr. So I'm not actually going to talk that much about the core of nextpnr today, mainly because it hasn't actually changed that much since some of the previous talks that I've done, possibly looking slightly different, but yeah. But even back in those days, the core place and route algorithms haven't changed that much. There are some future plans, but yeah, they're still down the line.
So yeah, on to FABulous. FABulous is an eFPGA fabric generator from Manchester and Heidelberg Universities. So it's for building custom FPGAs for your application, and it's a very customizable generator in a similar way to the spirit of nextpnr being a very customizable place and route tool. FABulous is incredibly customizable in terms of the type of fabrics you can build with it. So it's not just throwing together your standard CLBs, it also has a lot of flexibility for including things like custom blocks, so things like DSPs, block RAM, register files, adding interfaces to hard CPUs, and you can also use its routing graph framework to build things like CGRAs as well, and things with coarse-grain reconfigurability, hard wiring between IP cores, almost anything that you could put in FPGA, the FABulous framework is flexible enough to model as well. And we've been trying out FABulous with some test tape-outs on some of the open-source Google shuttle runs, which seem to have been quite a core theme throughout today's workshop, has been that the new possibilities these shuttle runs have opened up. And actually probably a big change since we last sort of had these workshops kind of in the before-COVID times is the possibilities that these open-source shuttle runs have come back. And so instead of just theoretically talking about ASICs, you know, we're coming back with real chips and we're able to test real chips. And in our case, we have real FPGA silicon from MPW-2 that's come back and it's working, and we can build bitstreams and run it.
So yeah, I think this is actually mostly stuff that I've talked about previously, but yeah, FABulous, it's flexible, it even supports both Verilog or VHDL, so you don't have to have a holy war, you can just pick whichever one you prefer. As well as that, it can generate the data that an nextpnr needs in order to place and route for that fabric. And of course we have Yosys support for doing synthesis. It uses a latch-based configuration architecture. This is kind of always a trade-off. So the simplest approach of FPGA configuration is just having a shift register, but that's about twice as big because it needs D flip-flops instead of latches. And it's also less robust because as you're shifting things in through the fabric, you go through a whole different series of configurations every time you shift it. And there's actually a risk you can do things like build ring oscillators and things with those configurations you don't want. In an ideal world you might use something like an SRAM cell for your FPGA configuration, but the problem is once you start doing that, that's an incredibly process-specific thing. You can't then just easily change your primitive and rebuild your fabric. You've got to design a whole new custom primitive rather than just inserting a foundry cell. So that's why we settled on latch-based configuration. And through our configuration interface that lets us reprogram individual lines of latches, that also gives us partial reconfiguration support without actually having to do any extra work. That just comes for free with our fabric design.
So these are actually the two fabrics we taped out on MPW-2. The one on your right is the one that I've mostly been working with. That's a pure FPGA. It's a slightly bigger FPGA, but it's got DSPs, block RAMs, register files, LUT4s. The one on your left is the one that has two RISC-V cores added to it, two IBEX hard RISC-V cores. That one is still a bit of a work in progress to bring up because of its higher complexity. But yeah, those are our two cores.
And so to give you an idea of the kind of designs that we can build on this FPGA, this should actually be animated, but it's a little demo showing a whole bunch of different primitives, actually. So we have the block RAM, which is actually OpenRAM-based. That's containing the texture data. We have DSPs, which are being used in multipliers for the perspective transform. And then we have a bunch of general logic, which is just doing things like some multi-cycle dividers for the transform. And so yeah, this is actually just like a little animated scrolling road output on VGA. I think this is about 500 LUTs or so of logic, plus the DSPs and all the block RAMs.
So kind of in the scheme of the Open MPW run so far, we're not quite dealing with perfection yet. This isn't actually quite the same as a lot of the hold-time problems, because this wasn't exactly a hold-time problem because we mis-characterized stuff. It's a hold-time problem because we simplified the clock architecture from a clock tree to a clock ladder. And that essentially means that some patterns of data routing will mean the data delay doesn't match the clock delay, and you get hold-time problems. And the idea always was that we should be able to fix this in nextpnr. But yeah, there's a bit more work to actually get the fix-it-in-software done. Not least, we actually need to extract a timing analysis out of the fabric and have a timing model for nextpnr. And yeah, once we have that, I expect we'll have a pretty robust fabric working and something potentially where we can make some cute business cards or something and show our custom-made FPGA fabric on an open process, which will be very nice.
So talking a bit more about what our future plans are, there's one of the weak spots of nextpnr has always been timing analysis. So as well as things like the hold time fix-up that we'll need for those FABulous FPGAs, we also need to be able to do things like cross-clock domain constraints. So you can constrain individual clock frequencies in nextpnr, but you can't do things like constrain the minimum/maximum delay between clocks, multi-cycles, that kind of thing. So that's probably the biggest priority in terms of increasing the usability of nextpnr. nextpnr has a GUI. It's a bit of a basic GUI, but it's there. And one of the things that would be nice is actually to support FABulous fabrics in the GUI. That's obviously a bit more complicated than what we've done in the GUI before, where we've had a fixed FPGA like an iCE40, because people can make all manner of FABulous fabrics, and we have to work out things like the layout of wires and blocks in the GUI automatically. And then there's the usual stuff, which I think has cropped up in my plans on every nextpnr presentation I've done, of how we're going to improve the place and route in the future. My current project, something more of a personal project, is an electrostatic placer for nextpnr. So that uses some... It's a very common algorithm in ASIC placement. It's becoming more accepted in FPGA placement as well. And it uses essentially the principles of Electrostatics to optimize the placement. So you imagine that your cells are essentially charged particles, and you have some forces pulling them together because you want to minimize wire length, but you also have some forces pushing them apart because you don't want cells overlapping, because that's not a legal placement. And then you can do a whole bunch of maths, and luckily maths is fairly well researched. Actually, a big part of it interestingly boils down to some Fast Fourier Transforms, which, once again, very well researched thing. So this can also be a pretty easy-to-accelerate placer as well, because there's lots of existing work on doing, for example, GPU acceleration of these kind of algorithms. And a couple of other people are working at the moment on some Rust bindings for nextpnr's API. And this again might make researching things like parallel algorithms, where Rust has potentially nicer paradigms for that than C++ available. So for example, one of the projects that was actually inspiring that was doing a partition-based router that would actually, for most of the nets that don't need to cross a large amount of the design, firing them off to different threads and just routing them entirely in parallel, if it's known that they don't overlap. So that's kind of an idea of where the nextpnr roadmap would like to lead.
And yeah, that's my email address if you have any questions, and also a credit for the cat girl picture that's adorning the side of the slides.
Yeah, you said the whole demo that you gave for the road was about 500 LUTs, so what's the utilization of that? How many LUTs total?
I think there's about 850 LUTs or so total. So yeah, just a bit over half utilization. I've tested some high-utilization things just as quick tests. So for example, making a chain of inverters that goes through every LUT, just to get a rough idea that every LUT is working. But yeah...
And this half were just being able to utilization bargains being on?
We can definitely push a bit higher than that. So it does depend a lot on, what depends a lot on the utilization is essentially how dense the routing graph is. So iCE40 FPGAs have a very, very dense routing graph. We can push an iCE40 FPGA well beyond 95% utilization. ECP5 is a bit less so, so probably 85% is the highest you want to go. FABulous, we haven't really looked into it in great detail, but I'd guess again about 85% or so is the highest utilization that you probably really want to be using before you're going to start hitting at a minimum some timing problems.
Thank you for the work on this. So you mentioned in the second part of the slides that you aren't requiring any [?], but that seems like way higher up than just an nextpnr. Is that also like a full flow for eFPGA?
So this is in terms of not the, of course it, yeah, it supports Verilog or VHDL for the designs because it uses Yosys. This is in terms of the Verilog that FABulous generates for the FPGA fabric. A netlist basically of, basically essentially it's a netlist of latches and MUXs essentially, plus whatever other primitives you have.
Your photos of the tunnel road look very regular. Are you playing tricks like you heard earlier, where you place them around one tile?
Yep. So for each tile type, so we have, I think in that fabric, basically, aside from like the tiles around the edges, we have like basically three basic tile types, the LUTs, the register files and DSPs plus, yeah, we have some edge tiles and the block RAMs are separate blocks. But yeah, each one of those is basically placed and routed as a fixed macro and then just stamped out like a hundred times across the chip or whatever.
And that's the same question I asked before. Do you see applications of this where the FPGA is used as glue around more efficient, customized, I don't know, DSP or...
Yeah, definitely. So that's probably going to be one of the topics of my PhD thesis, is things like designing the ideal FPGA for an application-specific thing. So maybe there are specific kinds of DSP blocks, for example, that would suit a certain application best, or the split between FPGA and hard macro. Or maybe you can, for example, for an SoC, for example, you would have today's crypto algorithms as hard blocks, but then have some FPGA's fabric in order to implement what crypto algorithms might come out in the future that you might want to accelerate as well, if it's a long lifetime IoT project or something.
Hi, [audience member] from New York University. What differentiates algorithms from OpenFPGA?
So, yeah, to be honest with you, I think one of the big things is that we have this kind of high level of customizability, but also simplicity. So we have incredibly non... You're not kind of forced into any particular way. We're kind of very focused on exploration and having a code base that's easy to hack around with. And yeah, incredibly simple format for specifying things like how your routing graph looks, so you can really easily play around with different routing graphs, that kind of thing. And we also have, I think, probably a bit ahead in terms of things like playing with the shuttle runs and stuff. So yeah, that's kind of, I think, where we are.