[For readability, moderator comments have been removed, as well as minor questions for better understanding.]
Good afternoon, everybody. My name is Tsung-Wei Huang, and you can just call me TW. Today I'm going to talk about Taskflow, a general-purpose task-parallel programming system that helps you more easily parallelize EDA applications.
As many of you already know, parallel and heterogeneous computing is very critical to application performance. For example, if you look at machine learning today, heterogeneous parallelism with even a single GPU can achieve over 100x speedup over a CPU-only baseline. That's the power of parallel heterogeneous computing.
But writing a parallel program is not an easy job, because you have to deal with a lot of technical parallelization details. For example, you have to worry about the parallelism abstraction, over both software and hardware; concurrency control; tasks, data races, and dependency constraints; scheduling efficiency; load balancing; and even performance portability, and so on and so forth. And there is always a trade-off between what you really want and the cost of that design. Everybody wants a simple, maintainable, extensible, and portable implementation, but each of those design goals may cost a bit of application performance. And I do believe nobody really wants to manage all these technical details themselves.
So it turns out we want a programming solution that sits on top to handle all these technical details and challenges for us. But why do existing parallel programming systems fall short, especially for EDA problems? Well, many EDA applications are very, very complex. Existing systems are typically very good at loop parallelism and regular parallelism, but they are not very strong when things become irregular, as in synthesis, optimization, simulation, and so on and so forth. The other issue is that most existing systems rely on a directed acyclic graph (DAG) model, which does not allow you to model control flow inside a task graph, such as an iterative cycle. Looking at the evolution of parallel programming standards, we envision that task parallelism is going to be the best model for describing parallel and heterogeneous workloads, because it captures your intention in decomposing a parallel heterogeneous algorithm into a top-down task graph that can eventually scale to different accelerators.
So our solution here is called Taskflow, which was supported by our NSF project to overcome many parallel programming challenges that cannot be efficiently handled by existing systems. And the very first challenge we want to overcome is transparency. So let's take a look at a "Hello world!" example in our system. Suppose you want to do four things, A, B, C, D. A has to run before B and C; D has to run after B and C. Each task represents a function to execute. In Taskflow, only about 15 lines of C++ code get you a parallel execution of this task graph. First you create a Taskflow object, and then you create an executor to perform the scheduling. On the Taskflow, you use emplace to insert tasks as C++ lambda function objects, like A, B, C, D here, and use the precede() method to specify the dependencies among A, B, C, and D. Finally, you submit the Taskflow to the executor, and it performs all the scheduling for you. At this moment, I believe most of you can fully understand what this code is doing. That's the power of transparency and expressiveness.
Another major innovation of Taskflow, compared with existing systems, is our Controlled Taskflow Graph programming model, or CTFG. CTFG goes beyond the limitation of a traditional DAG-based model, which does not allow you to express control flow in a single graph entity. For a complicated workload like this example, where a cycle describes iterative control flow, conditional tasking, or even a loop, it is almost impossible for existing DAG-based systems to express an efficient overlap between control flow and your tasks.
So this is our heterogeneous tasking interface. We have been using CUDA and SYCL, so you can write a single-source C++ program that compiles to run on multiple accelerators like FPGAs, CPUs, GPUs, and so on. And the programming model is very similar to the CPU-based model you are already familiar with. In this example, we have four data transfer tasks and one kernel task that performs a power iteration, and you can describe everything in a fashion very similar to our CPU-based model.
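A rough sketch of what such a GPU task looks like with Taskflow's cudaFlow interface (this is an illustrative fragment, not the talk's actual power-iteration code; `my_kernel`, `hx`, `dx`, `N`, and the launch configuration are placeholders):

```cuda
// inside a host program that already holds a tf::Taskflow `taskflow`,
// a host vector `hx`, a device pointer `dx`, and a kernel `my_kernel`
auto gpu_task = taskflow.emplace([&](tf::cudaFlow& cf) {
  // data transfer tasks and a kernel task, described like CPU tasks
  auto h2d    = cf.copy(dx, hx.data(), N);                  // host -> device
  auto kernel = cf.kernel(grid, block, 0, my_kernel, dx, N); // launch kernel
  auto d2h    = cf.copy(hx.data(), dx, N);                  // device -> host

  h2d.precede(kernel);   // same precede() idiom as the CPU-based model
  kernel.precede(d2h);
});
```

The point is that dependencies among transfers and kernels are expressed with the same `precede()` idiom as CPU tasks, so the heterogeneous graph composes with the rest of the taskflow.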
Using Taskflow is very easy. All you need to do is include Taskflow in your project and tell your compiler where to find the header files. Taskflow is header-only and completely written in standard C++; there is no non-standard C++ feature. Download the headers, include them in your project, tell the compiler where to find them, and that's it.
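Concretely, a typical setup is just a clone and a compile flag (paths and file names here are illustrative; Taskflow needs a C++17 compiler and a threading library):

```shell
# grab the header-only library
git clone https://github.com/taskflow/taskflow.git

# point the compiler at the headers; no linking against Taskflow itself
g++ -std=c++17 -I taskflow -O2 my_app.cpp -pthread -o my_app
```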
Everything can be visualized by default. To run your Taskflow program with visualization, you just set an environment variable that tells the runtime to dump the execution timeline into a JSON file; then you can load that JSON into a browser-based interface to visualize the execution of your program. All of this is built in by default.
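In practice this is the `TF_ENABLE_PROFILER` environment variable (the program name below is illustrative):

```shell
# dump the execution timeline of this run into a JSON file
TF_ENABLE_PROFILER=timeline.json ./my_app

# then load timeline.json in the browser-based Taskflow profiler (tfprof)
# at https://taskflow.github.io/tfprof/ to inspect the per-worker timeline
```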
We have successfully applied our system to many EDA applications. One of the most important is our OpenTimer project. Timing analysis is a very important step in the overall design flow because it verifies the expected timing behavior. Existing work typically levelizes the graph, performs level-by-level propagation, and uses pipelining to parallelize that propagation. With Taskflow, we are able to model the entire timing propagation as one big Taskflow graph, so the computation flows naturally and more asynchronously across the circuit network, including many in-graph control-flow tasks that prune unnecessary propagation on the fly.
This sample result shows that with Taskflow we achieve over 600x speedup over the baseline with one CPU, part of which comes from using the GPU; even against the baseline with 40 CPUs, our solution is still 44x faster.
Everything is composable and unified in our system. Tasks, dependencies, and control flow are all associated with each other, so you can represent a very generic control Taskflow graph to achieve end-to-end parallelism. For example, there is a post on Reddit sharing how Taskflow helped a company migrate their existing multimedia engine to a parallel target in just a few hours, with a performance gain of about 8%.
So right now it is open source, and we have quite a lot of users, including companies. For example, I recently gave a talk at Xilinx (now part of AMD); their Vivado synthesis and place-and-route engine is already using Taskflow.
Okay, I believe I'm going to stop here. If you want to understand more details about our system, feel free to check out our website.
So in your examples, you specified the precedence manually, so that "A after C, B" and "B after C", or something like that. But are you also able to figure out the data dependencies?
So the question is about data dependency. Our system does not handle data dependency; it is completely up to the application developer. This is probably one of the biggest lessons we learned when we started this project, because how you want to optimize and manage data in memory is typically application-dependent. So we do not provide yet another abstraction over data; we just focus on describing your workload as a task graph and scheduling it for you. And of course, you can always express a data dependency as a task dependency.
Are there opportunities in any algorithms you looked at for overlapping data, streaming data between tasks?
That's a very good question. Yes, we do have a very specialized scheduler trying to maximize the overlap between your tasks and data movement as much as possible. But that will be a totally different topic related to scheduler.
My question would be, how does this language compare to other domain-specific languages for data flow or control/data flow, such as Hadoop and Actor languages? Because it looks very much like Actors: you connect things in your main program and then you run this behavior. Do you have to write any analyses for dynamic or static data flow? Because those languages are domain-specific, what is the analysis for such restricted data flow, for example what computation has to be done in memory? Has any work been done in this area?
Actually, my short answer to this is no. Like I said, we primarily focus on describing the task graph and doing the scheduling for you. And the reason we do not want to come up with another domain-specific language is the ecosystem. When you create a domain-specific language, the biggest challenge is that you have to convince the community to rewrite their applications in your language, or they will start to question the sustainability of that particular language. That is the biggest challenge there.
So your task flow could be any C++ object?
So Taskflow is completely written in standard C++. Like I said, there is no non-standard C++ feature, so it is fairly easy to use and integrate into your project.