
Principles of Paranoid Design

This talk explores how hardware projects designed with an open-source PDK rely too heavily on precise data that may not be available, and how those problems can be avoided by design methodologies such as two-phase clocking, negative-edge clocking, margining, and Monte Carlo simulation. While open PDK data can be made more reliable by cross-validation with multiple tools and, ultimately, by measurement, good design practices can achieve working silicon without absolute certainty.

[For readability, moderator comments have been removed, as well as minor questions for better understanding.]

Okay, thank you. If you don't know who I am, then you may not be in the VLSI domain. You can Google who I am.

So I'd say half of VLSI talks include a graph of Moore's Law. This is my slide of Moore's Law. The reason I wanted to show it is because Moore's Law represents the fact that design in VLSI has been performance-driven, just completely, totally performance-driven.

And of course, being performance-driven comes at a cost, a literal cost. The cost of making a mask set exceeded the median cost of a house in the US somewhere around the 65 nanometer node; I like to use that as a reference. Of course, people doing prototype designs don't pay the mask cost directly, but the mask cost has to be paid by somebody. So even if you distribute that cost over a number of projects in an MPW, it's still expensive, and it's still out of the price range for a hobbyist, for instance.

If you want to do something useful, take our Caravel chip from efabless, shown here, as an example. Its core processor is a very small CPU, about the minimum size of something useful you can do, and it's still two square millimeters in a 130 nanometer process, which still costs thousands of dollars. The point is that the cost is not just the cost of the silicon; it's also the cost of failure if your chip doesn't work.

And the thing is that standard design processes essentially assume that you have a PDK with perfect data. That's because foundry data for established nodes comes from PDKs that are generally reliable: the foundries have refined them over numerous iterations of silicon, and that's what they give you when you buy a proprietary PDK.

So the traditional design methodologies all assume that your data is perfect. All you do is test your design over process corners, over PVT, and if that passes, you're good. If it doesn't, you might try Monte Carlo simulation, which gives bounds that are a little easier to meet, and if that passes, you're good. That sets your probability bounds for getting first-time working silicon.
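To make that "probability bounds" idea concrete, here is a minimal Monte Carlo sketch in Python. The delay model, the varied parameters, and the sigma values are all invented for illustration; a real flow would sample mismatch parameters in the SPICE models and re-simulate the circuit for each sample.

```python
# Toy Monte Carlo sketch: sample process variation, evaluate a stand-in timing
# model, and estimate how often the design meets its clock period. The delay
# model and sigma values are illustrative placeholders, not real PDK data.
import random

def critical_path_delay(vth_shift, cap_scale):
    """Hypothetical critical-path delay (ns) as a function of two varied parameters."""
    nominal = 8.0                                 # nominal path delay, ns
    return nominal * (1.0 + 0.4 * vth_shift) * cap_scale

def monte_carlo_yield(clock_period_ns, n_samples=10000, seed=1):
    random.seed(seed)
    passes = 0
    for _ in range(n_samples):
        vth_shift = random.gauss(0.0, 0.05)       # assumed threshold-voltage variation
        cap_scale = random.gauss(1.0, 0.10)       # assumed parasitic-capacitance spread
        if critical_path_delay(vth_shift, cap_scale) <= clock_period_ns:
            passes += 1
    return passes / n_samples

if __name__ == "__main__":
    for period in (8.5, 9.0, 10.0):
        print(f"period {period} ns -> estimated timing yield {monte_carlo_yield(period):.3f}")
```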

Now the problem is you throw an open-source PDK into the mix, and trust in that PDK goes way down. The reason is that the open PDK has been pulled from many, many sources. I've been involved in the open-sourcing of the SkyWater process, and the problem there was that we got all these data formats. Some of them are proprietary, so we can't use them, so we pull data from wherever we can find it in the files the foundry sent us. In some cases those files don't even have the right data, because they weren't part of the commercial flow; they were just some other file format, nobody paid attention to what was in them, and you find they're completely broken. So until we have many more measurements on open-source silicon, if you're starting any new project, you will need to rely on assorted tools and methods that are in the public domain. There are two approaches to this: you can try to make your data better, or you can design against the data, which is why I call this Principles of Paranoid Design.

So for the first part of that, how do you make your data better? One of the things that commercial tools have had, and that has not been in the open-source world until recently, is the ability to use a field equation solver to figure out what all the parasitics are for your process, because that is one of the main areas where the data is often not very reliable. I found an open-source field equation solver called FasterCap, from https://fastfieldsolvers.com. It has 3D and 2D solvers, and I wrote a little project that I put up on GitHub -- I'll show the URL on the next slide -- which is essentially an API and a set of routines written in Python that take a file describing a process like SkyWater 130 or GF180, build out a number of different metal structures in the FasterCap input format, generate hundreds or thousands of those, and run them through FasterCap.

And from that, I've been able to get curve traces of how the parasitic capacitances behave in this process, and use them to improve all the models in my layout program, Magic, which does the extraction, and I'm getting much better results than before. It's also extensible: any time we need to onboard another foundry process, it's just a matter of writing an input file, and then we can rerun everything and get the coefficients that plug back into Magic for parasitic extraction.
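Here is a hedged Python sketch of that sweep-and-fit flow: generate a family of test structures by sweeping one geometric parameter, solve each one, and fit area and fringe coefficients from the results. The `run_fastercap` function is a hypothetical stand-in for writing a FasterCap input deck and parsing the solved capacitance back out; only the fitting step is shown concretely, and all coefficient values are made up.

```python
# Sketch of the sweep-and-fit flow: model the total capacitance of a wide wire
# over a plane as C(w) = c_area * w + 2 * c_fringe per unit length, then recover
# the two coefficients with a linear fit over a width sweep.
import numpy as np

def run_fastercap(width_um):
    """Placeholder: build one wire-over-plane structure at this width, run the
    field solver on it, and return capacitance per unit length (fF/um).
    Here the solver is faked with known coefficients plus a little noise."""
    c_area, c_fringe = 0.038, 0.040            # fF/um^2 and fF/um, illustrative only
    return c_area * width_um + 2 * c_fringe + np.random.normal(0, 0.001)

widths = np.linspace(0.3, 5.0, 40)             # sweep wire width in microns
caps = np.array([run_fastercap(w) for w in widths])

# Linear fit: the slope is the area (parallel-plate) term, the intercept is
# twice the fringe term.
slope, intercept = np.polyfit(widths, caps, 1)
print(f"area capacitance   ~ {slope:.4f} fF/um^2")
print(f"fringe capacitance ~ {intercept / 2:.4f} fF/um")
# Coefficients of this kind are what get plugged back into the area and
# perimeter capacitance entries of the Magic extraction tech file.
```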

So right now I have what I consider to be a pretty good full RC extraction in Magic. This is an example of running full RC extraction and then simulating an OpenRAM circuit that Matt sent me. You can see the difference in this section, where a node is being driven back to mid-range voltage: in an ideal simulation it goes directly down to the mid-range voltage, while in the full RC-extracted netlist it has a slope to it, and elsewhere in the simulation you can see where a bit fails because of that.

At the same time, we've got the simulators themselves, Ngspice and Xyce. Ngspice has recently introduced the OSDI interface, so models compiled with OpenVAF can be loaded as plug-ins. That makes it a lot easier for us to use some of the newer models. For instance, there are a few Verilog-A models in the SkyWater process that we weren't able to include with the original release of the open PDK but can support now, and the same is true of the ReRAM models in Sky130. Xyce has also seen a lot of development in the last few years and is becoming much more compatible with other versions of SPICE.
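As a rough illustration of that plug-in flow, here is a hedged Python sketch that compiles a Verilog-A model with OpenVAF so the resulting OSDI object can be loaded by ngspice. The file name is invented, and the ngspice loading mechanism mentioned in the comments is an assumption to be checked against your ngspice version's documentation.

```python
# Hedged sketch: compile a Verilog-A model to an OSDI object with OpenVAF so
# ngspice can load it as a plug-in. Assumes `openvaf` is on PATH; the model
# filename "reram_cell.va" is made up for illustration.
import subprocess
from pathlib import Path

va_model = Path("reram_cell.va")                          # hypothetical Verilog-A source
subprocess.run(["openvaf", str(va_model)], check=True)    # produces reram_cell.osdi
osdi = va_model.with_suffix(".osdi")
print(f"compiled {osdi}")

# The compiled object can then be loaded by ngspice before the netlist is parsed
# (recent ngspice manuals describe a `pre_osdi` command for this) and the model
# instantiated like any other device. Check your ngspice version for the exact
# loading mechanism.
```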

So now the other approach is, rather than trying to make your data more reliable, to design for robustness rather than for performance, performance, performance. The principle is that if you are designing to an open PDK that has only recently been introduced, you should not be designing for performance; you should be designing for something interesting, for a novel architecture, and you should be trying to make the thing work. So design for robustness, and be paranoid about your design.

Most of the design methodologies for making things robust are not new; some of them are pretty old, and some have been almost entirely forgotten because of this push for performance. One thing you can do, for instance, if you want a robust digital circuit, is two-phase clocking: you take a flop and divide it in half, and instead of clocking one half on the rising clock edge and the other half on the falling edge, you clock them on two phases of a non-overlapping clock. Then if your circuit has setup problems, you can just slow down the clock, and if it has hold problems, you can increase the spacing between the phases, and one way or the other you will get it to work. It won't be the best performance, but it will work. I think most synthesis and place-and-route tools should be able to work with this style; it's just a matter of routing two clock networks. They won't do it optimally, and that will have a performance impact, but they should be able to do it.
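Below is a small Python sketch of the two knobs this gives you, with arbitrary numbers: the clock period, which you increase to fix setup problems, and the non-overlap gap between the two phases, which you increase to fix hold problems.

```python
# Generate two non-overlapping clock phases as crude text waveforms, one
# character per time step. All numbers are arbitrary illustration values.
def two_phase(period, nonoverlap, steps_per_period=20, cycles=2):
    """Return phi1/phi2 waveforms as strings of 0/1 characters."""
    phi1, phi2 = [], []
    half = period / 2.0
    dt = period / steps_per_period
    for i in range(steps_per_period * cycles):
        t = (i * dt) % period
        phi1.append("1" if t < half - nonoverlap else "0")
        phi2.append("1" if half <= t < period - nonoverlap else "0")
    return "".join(phi1), "".join(phi2)

p1, p2 = two_phase(period=10.0, nonoverlap=1.0)
print("phi1:", p1)
print("phi2:", p2)
# Master latches are transparent on phi1, slave latches on phi2. Because the
# two phases are never high at the same time, data can never race through both
# halves of a "flop" in one cycle, whatever the actual gate delays turn out to be.
```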

For anybody who's been through the first couple of Open MPWs: we had problems, and that problem was due specifically to me not paying attention to what I just said, which is to not trust the data. If you have something like the serial scan chain at the top here, the standard way to make sure your clock doesn't arrive before your data is to delay the clock by inserting delay buffers into it. The tools will do that automatically; in our case this was across a hierarchy, and I was doing it manually. I trusted the data, I put in some extra delays, it wasn't enough, and we got hold violations in the scan chain. There are several things I could have done to make it more robust. One is simply to run the clock backwards through the scan chain. I hadn't wanted to do that because it would have added several wires up the side, and I was trying to avoid taking area away from the user project area -- this is in the Caravel harness chip. But eventually we realized, as at the bottom here, that you can just clock things on the negative clock edge, and it will always be correct. We have some users who have done that. Our paranoid users are our most successful users, going back to what I said on the first slide.
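Here is a back-of-the-envelope Python sketch of why negative-edge capture fixes the race; all of the delay numbers are invented, and the point is only the arithmetic: capturing on the opposite edge adds half a clock period of hold margin.

```python
# Back-of-the-envelope hold check for one scan-chain stage. Numbers are invented.
clock_period   = 10.0   # ns
clock_skew     = 1.5    # ns late arrival of the capture flop's clock (under-estimated on silicon)
clk_to_q       = 0.4    # ns launch flop clock-to-Q delay
min_data_delay = 0.1    # ns of wire/buffer between the flops
hold_time      = 0.2    # ns required at the capture flop

# Same-edge capture: the new data must not arrive before the (late) capture edge plus hold.
hold_slack_posedge = (clk_to_q + min_data_delay) - (clock_skew + hold_time)

# Opposite-edge capture: data launched on the rising edge is sampled half a period
# later, so half the period is added to the hold margin (and taken from setup margin).
hold_slack_negedge = hold_slack_posedge + clock_period / 2.0

print(f"posedge capture hold slack: {hold_slack_posedge:+.2f} ns")   # negative -> violation
print(f"negedge capture hold slack: {hold_slack_negedge:+.2f} ns")   # comfortably positive
```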

You can also do this for an entire subsystem, for instance the Wishbone interface. This was suggested by Tobias Strauch: you design the user area so that it picks up the Wishbone clock and clocks data on the negative edge of that clock, and as long as you've designed your Wishbone interface on the user side for that, it will work. We had another paranoid user who decided that he didn't know the relationship between clock and data between the microcontroller and the user project, so he figured he would just put in a delay chain with a tap selector, so you can select whatever clock delay you want. He found one that worked, and he was the first person to bring up a full user project that was a complete microcontroller in the user project area.
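A toy Python sketch of that delay-chain trick, with invented numbers: sweep the selectable taps, find the ones that sample inside the data-valid window, and pick a tap near the middle of the passing range so that silicon variation doesn't push the sample point out of the window.

```python
# Pick a clock-delay tap that samples in the middle of the data-valid window.
# All timing numbers are invented for illustration.
tap_step_ns = 0.5                       # assumed delay per buffer in the chain
taps        = range(8)                  # 8 selectable taps
data_valid  = (1.0, 3.0)                # window (ns after the reference edge) when data is stable

def works(tap):
    sample_time = tap * tap_step_ns
    return data_valid[0] <= sample_time <= data_valid[1]

good_taps = [t for t in taps if works(t)]
# Choose the tap closest to the centre of the passing range, so small shifts in
# the real silicon don't push the sample point outside the window.
best = min(good_taps, key=lambda t: abs(t - (good_taps[0] + good_taps[-1]) / 2))
print("passing taps:", good_taps, "-> selected tap", best)
```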

So that's all I had for my story, and thank you for listening, and we now have a little time, I guess, for questions and answers.

Q&A

  1. Just some feedback for the community: a lot of the links that go to these PDKs are broken, so if you go there, there is basically nothing. For example, a link takes you to the website, but the content isn't there, or it's a different version, and it's complex and confusing to find things. I don't know if you can pass this back to the people in the community.

    Well, yeah, like all open-source stuff, it depends on feedback to fix it, and I'm not sure which links you're specifically referring to, or where they're coming from.

  2. [...] From efabless to [?], I don't remember exactly which of the links are broken...

    Yeah, I do know that there are issues with some of the links there, and again, feedback is what we need. We do have a Slack channel where we're fairly responsive to those things.

  3. How do you see the problems with data quality changing as we go down, for example, to smaller nodes, if we're lucky enough to have open PDKs on those nodes? Do you think it's going to be mostly more of the same problems, or do you expect bigger problems with things like more complex transistor models and stuff?

    It depends. It will get worse to a point. I understand that once you get to FinFETs, you end up having... Andrew is suddenly shaking his head, so I'm probably about to say something that is... No, you're good. Okay. Yeah, once you get to FinFETs, you have a lot of constraints, and there are so many constraints that it actually makes design problems easier in some cases. It's possible that once you get to FinFETs, the problem just becomes a little easier, but yeah, certainly down to 65, 45 nanometers, all the way down to 28. 28 is probably the worst, so... And we're getting there. And I don't expect to get open-source PDKs down in that range anytime soon.

    [Comment from the audience:] The caveat, or the comment, is that it was a big struggle with the problems that were there before [?]. But if a commercial PDK is open-sourced, it will almost from the get-go be much more complete, robust, and polished. So that easier situation does happen.

    Yeah, there are certain levels of trust that depend on the level of trust in the foundry to begin with, and on the process being done. SkyWater in particular is sort of researchy in the way they do things, and that makes it a little less trustable than some of the other foundries. But they're also easier to work with, and so far they're the only one to have gone open-source, followed by GF.