Fast 2D Rendering on GPU (raphlinus.github.io)
286 points by raphlinus on June 13, 2020 | 123 comments


Fast 2D rendering on GPU has represented the last few months of concentrated work for me, and I'm happy to present the results now. It's required going pretty deep into various aspects of GPU compute, so feel free to ask me about that, 2D vector graphics, or anything related.


Hi! This looks interesting, although I confess I'm very new to this area. I'm writing a programming language specifically for UI designers, and currently I'm building a naive implementation in HTML5 canvas. I know very little about low-level rendering - just really know how to use drawing APIs like canvas and Quartz 2D.

With that being said, I'm looking to delve deeper into this subject. Do you have any recommendations on where to start? Whenever I look beyond simple drawing APIs, the focus seems to be entirely on 3D rendering (which is interesting but not my main focus right now).


It's a good question, as honestly the knowledge for 2D graphics is pretty arcane, as opposed to 3D being so widely taught. I actually started a GitHub repo for a book on 2D graphics but have no idea whether I'll actually finish it.

In the meantime, antigrain.com is one good (if old) source. The original PostScript "red book" was extremely influential in its time (it's where I learned a lot of this stuff) but is quite dated now. Best of luck, and I'm also happy to field requests for more specific areas. For example, for color theory (an important aspect of 2D graphics!), handprint.com is quite a remarkable resource.


> knowledge for 2D graphics is pretty arcane

such an unfortunate state of affairs!

i am currently learning how to render graphics using the GPU on my mac using apple metal. what i am getting is that the GPU has been optimized for 3D rendering?! GPUs make no provision or easy way for rendering 2D graphics?

it makes no sense to me... that's where you start...


I remember asking a similar question on HN a while back. The response was that 2D graphics, UIs in particular, are mostly computed on the CPU. I have no idea why this is the case, though.


See this blog post; it explains it pretty well imho: https://blog.mecheye.net/2019/05/why-is-2d-graphics-is-harde...

And: historically they've been computed mostly on CPU, but I think it's time for that to change.


During the late 1990s and early 2000s it was a lot more common for GPUs to provide 2D acceleration, and GUIs were drawn using those primitives. I remember the switch to CPU rendering happening, and the subsequent removal of 2D acceleration from GPUs, but I don't remember why.

At any rate, the 2D graphics we expect now are a lot more complex than the unantialiased lines, blits, and fills of old.


> And: historically they've been computed mostly on CPU, but I think it's time for that to change.

It would be great to wait a bit for OS & GPU power management to evolve before biting the bullet on that. My laptop goes from 6 to 2.something hours of battery as soon as I have a GL context open somewhere, likely because it powers on its discrete GPU automatically in that case.


This is changing. I've been doing power measurements as well (just didn't make the cut of this blog post), and the 1060 is surprisingly power-efficient in its low frequency modes. It's also generally the case that the GPU is always active in its role running the compositor.


> and the 1060 is surprisingly power-efficient in its low frequency modes.

Maybe? The computer on which this happens is a 1070. But please be aware that 10-series cards are in the hands of a very small percentage of people. The average laptop of non-tech people around me is easily 8 years old, often on its 2nd or 3rd battery... and these people won't be able to complain easily to anyone when their new battery's life is suddenly halved because of $SOFTWARE.


With most dual-GPU machines you do get a choice whether to power on the discrete GPU or not. It's even supported on GNOME/Wayland as of late.


Both macOS and Windows have ways for applications to specify whether they prefer a discrete GPU or integrated graphics.


Don't fool yourself, they won't do anything until most of the browser/apps do it. Then they will fix it to sell "longer" battery life.


I've done a fair bit of 2d graphics work (written a rasterizer, etc). Honestly it's because it's

  1) tricky to shoehorn 2D graphics onto the APIs that GPUs provide and 
  2) really not needed.  I can easily render eg: a world map with hundreds of thousands of lines at hundreds of frames/second with one core.


Please don’t use preformatted text to write lists. It’s a pain. Just leave a blank line between the items so that each is a paragraph.


If you need portable results pixel to pixel on various platforms then CPU based rendering is more straightforward than using the GPU.

Various libraries, such as freetype for font rasterization, only work on the CPU.

Plenty of research and implementation work is left to be done in order to use the GPU more widely.


This is an interesting and subtle point about doing "software on GPU compute." You are in complete control over what gets computed, and are not at the mercy of the hardware's fixed function pipeline for stuff like rasterization rules and sampling patterns for antialiasing. So I think portable results pixel to pixel are in fact viable.

Of course CPU rendering is always more straightforward than GPU, the higher performance comes at a significant cost in complexity.


now this may be a dumb question, but why would you start there?

as far as i understand, 2d and 3d have literally zero to do with each other in how they are rendered. one is a bunch of triangles. the other is lines, curves, thickness, gradients, and fonts (which are essentially little programs)


> the other is lines, curves, thickness, gradients, and fonts

You can reduce all these to drawing triangles.


> You can reduce all these to drawing triangles.

You can, people have tried this, and it sucks. The main problem is that the conversion of Bézier paths to triangles is a hard problem with lots of conditional branching. Even when you do it, there is the other problem of rendering triangles with really good antialiasing, MSAA forces a compromise between performance and quality. By contrast, piet-gpu does an exact-area calculation for antialiasing.

So it's not a question of whether you can do it, but whether it works well, and approaches like piet-gpu absolutely stomp triangles.
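
To give a concrete feel for the exact-area idea (an illustrative sketch in Rust, not code from piet-gpu): for an edge that crosses a pixel you can integrate the covered area analytically instead of counting MSAA sample hits, so coverage is a continuous value rather than a multiple of 1/4 or 1/16.

    /// Area of the unit pixel [0,1]x[0,1] lying below the line y(x) = y0 + slope * x.
    /// This is the analytic ("exact-area") coverage contributed by one edge that
    /// spans the pixel horizontally; a full rasterizer sums signed contributions
    /// like this per edge and per pixel.
    fn edge_coverage(y0: f64, slope: f64) -> f64 {
        // Integrate clamp(y0 + slope * x, 0.0, 1.0) over x in [0, 1], piecewise.
        if slope == 0.0 {
            return y0.clamp(0.0, 1.0);
        }
        // Where the line crosses y = 0 and y = 1, clipped to the pixel.
        let xa = ((0.0 - y0) / slope).clamp(0.0, 1.0);
        let xb = ((1.0 - y0) / slope).clamp(0.0, 1.0);
        let (x_lo, x_hi) = if xa < xb { (xa, xb) } else { (xb, xa) };
        let y = |x: f64| y0 + slope * x;
        // Left and right pieces are clamped to a constant 0 or 1; the middle
        // piece stays inside [0, 1], so the trapezoid rule is exact there.
        let left = y(0.0).clamp(0.0, 1.0) * x_lo;
        let mid = 0.5 * (y(x_lo) + y(x_hi)) * (x_hi - x_lo);
        let right = y(1.0).clamp(0.0, 1.0) * (1.0 - x_hi);
        left + mid + right
    }

    fn main() {
        // A 45-degree edge through the pixel center covers exactly half of it.
        println!("{}", edge_coverage(0.0, 1.0)); // 0.5
        // 4x MSAA would quantize coverage to multiples of 0.25; this is exact.
        println!("{}", edge_coverage(0.3, 0.2)); // 0.4
    }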


> there is the other problem of rendering triangles with really good antialiasing

Easier than you think. Here are a couple of lines of pixel shader that do that, with really good antialiasing and without MSAA:

https://github.com/Const-me/Vrmac/blob/master/Vrmac/Draw/Sha...


Doing that reduction is surprisingly difficult; it's usually a serial algorithm that runs on the CPU, and with a naive approach the resulting triangle set is not efficient for the GPU. But that's what toolkits like cairo, Direct2D, and nanovg do.

Raph is describing an architecture where path evaluation happens on the GPU, without being baked to triangles.


yes, you can draw 2d in 3d space. this thread however is about 3d being built on top of 2d. not 2d being built on top of 3d.


that's not a dumb question! it's my own fault. i only have knowledge of 2D graphics.


Haiku OS AppServer (the screen rendering component, similar to Unix X11) is a full GUI system implemented with AntiGrain Geometry as the renderer.


I think that blend2d (https://blend2d.com/) is a worthy successor to AGG, and it's under active development.


Thank you for the link. It was a great read. But I just want to point out that Blend2D is a software renderer.


So is/was AGG


Thanks!


I recommend the blog of the OP, so many gems


What's been your overall experience with using Vulkan shaders for compute? Are there basic primitives that are missing from shading languages and/or have you found any impedance mismatches between writing shaders vs. how you might describe the same algorithms in other languages?


That's a big topic. I've been able to work around the missing primitives (for example, I autogenerate code for Rust-style structs and enums), but have had much bigger struggles around two issues: tools, which are still quite primitive, and understanding performance, which is extremely difficult. These two problems intersect because I can imagine a lot better tools for digging into performance issues. One that I would have paid good money for is an instruction-level simulator that would highlight the source code to tell me where the stalls, bank conflicts, divergence problems, etc. are in the source code. Such a thing is possible (there are academic papers like [1]), but not as far as I know usable in daily development.

The "impedance mismatch" is that you (generally) have to write in a style to extract lots of parallelism. This tends to be very different than the way you'd write scalar CPU code, but not completely alien to me as it has a lot of similarity with the way you'd write SIMD. I've pretty much gotten the hang of it now. I'm thinking of a blog post of redoing path_coarse.comp from its current basically scalar style to a more parallel version, as that would I think illuminate the issues.

[1] http://comparch.gatech.edu/hparch/papers/gera_ispass18.pdf
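
As a toy illustration of the restructuring described above (mine, not piet-gpu code): the classic case is turning a loop with a serial dependency, like computing output offsets for variable-sized items, into a two-level prefix sum whose shape maps directly onto GPU workgroups.

    /// Scalar style: output offsets for variable-sized items, with a serial
    /// dependency carried through the whole loop.
    fn offsets_scalar(counts: &[u32]) -> Vec<u32> {
        let mut out = Vec::with_capacity(counts.len());
        let mut acc = 0u32;
        for &c in counts {
            out.push(acc);
            acc += c;
        }
        out
    }

    /// Parallel-friendly style: split into chunks ("workgroups"), reduce each
    /// chunk independently, scan the much smaller array of chunk totals, then
    /// do independent local scans. Apart from the tiny middle scan, every step
    /// runs in parallel across chunks.
    fn offsets_two_level(counts: &[u32], chunk: usize) -> Vec<u32> {
        // Step 1: per-chunk totals (a parallel map/reduce on a GPU).
        let totals: Vec<u32> = counts.chunks(chunk).map(|c| c.iter().sum()).collect();
        // Step 2: exclusive scan over the chunk totals.
        let mut base = 0u32;
        let bases: Vec<u32> = totals
            .iter()
            .map(|&t| { let b = base; base += t; b })
            .collect();
        // Step 3: local exclusive scan within each chunk, offset by its base.
        counts
            .chunks(chunk)
            .zip(bases)
            .flat_map(|(c, b)| {
                let mut acc = b;
                c.iter().map(|&x| { let o = acc; acc += x; o }).collect::<Vec<_>>()
            })
            .collect()
    }

    fn main() {
        let counts = [1u32, 2, 3, 4, 5];
        assert_eq!(offsets_scalar(&counts), offsets_two_level(&counts, 2));
        println!("{:?}", offsets_two_level(&counts, 2)); // [0, 1, 3, 6, 10]
    }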


I pondered on the same subject recently as I was implementing the same algorithm (i.e. Mandelbrot set) on the CPU (scalar vs SIMD) and GPU compute using fixed-point and floating-point for comparisons (if interested: https://tayfunkayhan.wordpress.com/2020/06/03/mandelbrot-in-...).

It bothers me how little progress has been made on the "shading" languages front compared to overall many-core computation models and capabilities over the years. And that is despite the fact that shaders are very often where the most time is spent in modern workloads.

Compute with Vulkan is another story. It offers some nice abstractions, but it shows that it's mostly intended for async-compute/work-offloading for rendering, IMO. Too much friction.


- Shouldn't 2D rendering be a solved problem given that it's basically a subset of 3D rendering?

- Don't libraries like Skia, Qt, Cairo use GPU rendering? I've always assumed so. I mean, this is 2020, GPUs have been around for decades.


> - Shouldn't 2D rendering be a solved problem given that it's basically a subset of 3D rendering?

The problem is that primitives artists use are different. 3D rendering tends to all consist of polygon meshes, which are relatively easy to render. 2D rendering (basically) consists of Bezier paths, which are harder. The equivalent in 3D, which is adaptive subdivision, is not really a solved problem in real-time either.

Additionally, 2D rendering quality tends to be more important than 3D rendering quality. Whereas you can get away with 4xMSAA or hacks like FXAA in 3D, true 16xAA (without hacks) is the absolute minimum for 2D rendering quality nowadays, and even it isn't considered great for some tasks like font rendering (Pathfinder and piet-gpu both use analytic AA which is effectively 256xAA).

> - Don't libraries like Skia, Qt, Cairo use GPU rendering? I've always assumed so. I mean, this is 2020, GPUs have been around for decades.

There's a difference between renderers with GPU support and renderers that are oriented around using the GPU efficiently. In many cases this results in an order-of-magnitude speedup. On the GPU, state changes are expensive, and many such renderers that have GPU support don't really go out of their way to avoid them. There are also occlusion culling optimizations that most renderers don't do, but piet-gpu and Pathfinder do.


Others have spoken to this, but as a general introduction I highly recommend Jasper's post "Why are 2D vector graphics so much harder than 3D?" In short, no, it's not a solved problem.

Also, I see a lot of variations of this question, but I should state this more clearly. There's been accelerated graphics in one form or another for a long time, but what I'm doing is a completely different type of thing. In my world, on the CPU you just encode the scene into a binary representation that's optimized for GPU but in many ways is like flatbuffers, and then the GPU runs a highly parallel program to render the whole thing. In previous approaches, the CPU is deeply involved in taking the scene apart and putting it back together in a form that's well suited to relatively dumb pixel pipes. Now that GPUs are really fast, that approach runs into limitations.

It also depends what you're trying to do. I'm focusing here on dynamic paths (and thus font rendering), while most of the libraries optimized for UI put text into texture atlases and then use the GPU to composite quads to the final surface, something they can do well.

https://blog.mecheye.net/2019/05/why-is-2d-graphics-is-harde...
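
To make the "encode the scene into a binary representation" part more concrete, here is a hypothetical sketch (my own illustration, not piet-gpu's actual format) of the kind of flat, fixed-stride encoding a CPU might build and upload for a compute shader to consume:

    // Hypothetical flat scene encoding: fixed-size tagged records in one buffer,
    // so the GPU can find element i at offset i * size_of::<Element>() with no
    // pointer chasing. A real encoding is richer; this only shows the shape.

    #[repr(u32)]
    enum ElementTag {
        FillColor = 0,
        LineSegment = 1,
    }

    /// One slot in the scene buffer: a tag plus enough payload words for the
    /// largest variant.
    #[repr(C)]
    struct Element {
        tag: u32,
        payload: [f32; 4],
    }

    #[derive(Default)]
    struct Encoder {
        buf: Vec<Element>,
    }

    impl Encoder {
        fn fill_color(&mut self, rgba: u32) {
            // Pack the color bits into a float slot; the shader reinterprets them.
            self.buf.push(Element {
                tag: ElementTag::FillColor as u32,
                payload: [f32::from_bits(rgba), 0.0, 0.0, 0.0],
            });
        }
        fn line(&mut self, p0: [f32; 2], p1: [f32; 2]) {
            self.buf.push(Element {
                tag: ElementTag::LineSegment as u32,
                payload: [p0[0], p0[1], p1[0], p1[1]],
            });
        }
        /// Raw bytes that would be uploaded to a GPU storage buffer.
        fn bytes(&self) -> &[u8] {
            unsafe {
                std::slice::from_raw_parts(
                    self.buf.as_ptr() as *const u8,
                    self.buf.len() * std::mem::size_of::<Element>(),
                )
            }
        }
    }

    fn main() {
        let mut enc = Encoder::default();
        enc.line([0.0, 0.0], [100.0, 50.0]);
        enc.fill_color(0xff00_00ff);
        println!("scene is {} bytes", enc.bytes().len());
    }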


Can you expound on the principle of tiling mentioned in your algorithm a bit more? The conventional mechanism is to use de Casteljau to divide a Bezier curve into triangles and then rasterize those triangles on the GPU. If the curve needs to be scaled, the triangulation/tessellation is done again. How is the algorithm presented in the link different? Somehow the concept of tiling seems to imply that rasterization of the curve is done on the CPU itself. What am I missing?


I recommend reading the blog post series, I'm not sure I can usefully summarize the concepts in a comment reply. But very briefly, there's a flattening step (evaluated on GPU, based on de Casteljau) that converts the Bezier into a polyline (not triangles), then a tiling step that records for each tile a "command list" that contains the complete description of how to render the pixels in a tile, finally "fine rasterization" so that each workgroup reads that command list and renders 256 pixels in parallel from it. From your question, it sounds like your mental model is pretty different from how this pipeline works.
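
For readers trying to picture that flattening step, here is a minimal CPU-side sketch (recursive for clarity; the actual GPU kernel is written in a non-recursive, data-parallel style) of de Casteljau subdivision turning a quadratic Bezier into a polyline:

    type Point = (f64, f64);

    fn lerp(a: Point, b: Point, t: f64) -> Point {
        (a.0 + (b.0 - a.0) * t, a.1 + (b.1 - a.1) * t)
    }

    /// Distance from the control point to the chord p0-p2, used as a flatness test.
    fn flatness(p0: Point, p1: Point, p2: Point) -> f64 {
        let (dx, dy) = (p2.0 - p0.0, p2.1 - p0.1);
        let num = (dx * (p1.1 - p0.1) - dy * (p1.0 - p0.0)).abs();
        num / dx.hypot(dy).max(1e-12)
    }

    /// Append the polyline for the quadratic (p0, p1, p2) to `out`, excluding p0.
    fn flatten_quad(p0: Point, p1: Point, p2: Point, tol: f64, out: &mut Vec<Point>) {
        if flatness(p0, p1, p2) <= tol {
            out.push(p2);
        } else {
            // de Casteljau: split at t = 0.5 into two smaller quadratics.
            let (q0, q1) = (lerp(p0, p1, 0.5), lerp(p1, p2, 0.5));
            let mid = lerp(q0, q1, 0.5);
            flatten_quad(p0, q0, mid, tol, out);
            flatten_quad(mid, q1, p2, tol, out);
        }
    }

    fn main() {
        let mut poly = vec![(0.0, 0.0)];
        flatten_quad((0.0, 0.0), (50.0, 100.0), (100.0, 0.0), 0.25, &mut poly);
        println!("flattened into {} line segments", poly.len() - 1);
    }

The tiling stage then bins these segments per tile, and the fine rasterizer walks each tile's command list with one workgroup (256 pixels), as described above.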


Yes, I am trying to align my mental model with yours. I read your blog posts "A sort-middle architecture for 2D graphics" and "2D Graphics on Modern GPU", but I'm still unable to grasp the fundamental guiding principles. It's not clear what commands constitute each tile, and whatever they are, what the fundamental reason is that performance ends up better than joining the polylines of a curve to get a triangle list and having those triangles rasterized by the GPU. Is there any blog/article on the fundamental principles that you would recommend?


No, it is not a subset, unless you are talking about drawing textures.

Those libraries do GPU rendering using textures, like distance fields (Qt), or simply rasterize fonts in 2D and draw them as textures using the GPU, like Apple or cairo usually do.

Distance fields are blurry when you have small fonts on display.

Things like calculating the exact area under the 2D curve are something you could easily do on the CPU, but extremely difficult to do on the GPU. You need decoupled data in order to parallelize it.


> Don't libraries like Skia, Qt, Cairo use GPU rendering? I've always assumed so. I mean, this is 2020, GPUs have been around for decades.

As a Qt user / developer - you can use the GPU for your app, e.g. with Qt Quick or QGraphicsView, but there are sometimes good reasons to stick to CPU & software rendering, e.g. it is somewhat common to want to have intertwined "native" OS widgets (which are all CPU-rendered raster things) and custom GPU-drawn scene - this is a case where things fall a bit apart.

Another thing is that pretty, freetype-like font rendering is super expensive when you have a lot of text to show, and can't really be done (at least I have definitely not seen infinality-level beauty from the state of the art) on the GPU yet. Next to that, filling rects (read: 95% of a UI) with SSE/AVX/AVX2 as Qt does is stupidly fast.


> pretty, freetype-like font rendering is super expensive when you have a lot of text to show, and can't really be done (at least I have definitely not seen infinality-level beauty from the state of the art) on the GPU yet

This is exactly what Pathfinder does. pcwalton is on this thread and is the main author of that.

https://github.com/servo/pathfinder#features says:

> Advanced font rendering. Pathfinder can render fonts with slight hinting and can perform subpixel antialiasing on LCD screens. It can do stem darkening/font dilation like macOS and FreeType in order to make text easier to read at small sizes. The library also has support for gamma correction.


> This is exactly what Pathfinder does.

from the screenshots I saw so far, pretty much not.

> It can do stem darkening/font dilation like macOS and FreeType in order to make text easier to read at small sizes.

there are tons of different ways to do that. Even freetype has a few different algorithms to do it, some not even merged if I'm not mistaken, which give wildly different results


To give you an idea that pcwalton knows what he’s been doing and has indeed been seeking to match platform rendering exactly, here are a couple of tweets about the macOS font dilation: https://twitter.com/pcwalton/status/918593367914803201, https://twitter.com/pcwalton/status/918991457532354560.

I rather like the demonstration of rendering including subpixel rendering at https://twitter.com/pcwalton/status/971475785616797698, as well.


> To give you an idea that pcwalton knows what he’s been doing and has indeed been seeking to match platform rendering exactly,

I don't intend at all to cast doubt on pcwalton's abilities - the work is brilliant without any hesitation.

But I wonder how that is possible, given that "platform rendering" has pretty much changed every other macOS version and every Windows version ("ClearType" from WinXP is definitely not "ClearType" from Win10); and let's not talk about the customization abilities of freetype, which make rendering on any two Linux boxes entirely distinct as well.


Things like multi-resolution font-rendering are actually more complicated than one might imagine.

The easy way is to just tessellate the font into polygons -- but this tessellation often depends on the zoom level. The same thing can be said about implicit curves etc.

Most libraries do use GPUs for basic draw operations (e.g. rendering a gradient), but to build something like Photoshop -- you need much more complexity.


How well does this approach work for something like 2D data visualization where most of the visual elements are the same -- i.e. can be instanced in OpenGL/etc?

Thanks for publishing this, it's awesome work! I'm looking forward to progression to wgpu hinted in the Github README.


I absolutely have data visualization in mind for this, as I think it can benefit greatly from the scale. But the pipeline I've built is very agile; it will easily handle a diverse mix of items. It's not like OpenGL etc., where there's a certain amount of overhead per draw call and so significant gains to be had from instancing and batching.

It is likely that CPU-side encoding can be made more efficient, though, by just filling in quantities to a template, rather than encoding from scratch.


any relation to the work currently being done here by pcwalton?

https://github.com/servo/pathfinder/pull/350

hopefully you guys aren't doing identical work in parallel (no pun intended) :)


Yes, there's been a lot of influence in both directions. The approaches have a lot in common, but also have significant differences. I plan to write up a description of the Pathfinder compute work soonish.


I would love that!


As Patrick indicated, there's a lot of cross-fertilization of ideas, and one of the best outcomes of this work would be for high performance compute rendering to ship in Pathfinder.


It looks like the main fruit of this is piet-gpu[1]. How close would you say it is to stable? And how complete a realisation of the concepts put forth in the OP? I'm currently working on a UI library and evaluating alternatives for hardware acceleration. The whole library wants to be in C, so I would want to rewrite it by release time to use this, but I don't want to waste time doing that now if it's still really volatile.

1. https://github.com/linebender/piet-gpu


Not very, and not very. This is research, it depends on compute capabilities that have really only gone mainstream in GPU lately, and the current codebase is about trying out the ideas. Best of luck in your project!


Wow, those numbers look very impressive on an absolute basis! Do you have an idea of how piet-gpu compares to Skia? I'm not a graphics guy; my impression was that Skia is currently the dominant vector graphics renderer.


I would like to get an idea of the capabilities of these things.

Suppose I was to make an ebook reader designed to be modelled more closely after paper, on a device like the Surface Book’s 13″ 3000×2000 display, showing two pages side-by-side. Each page might contain something like 2,000–2,500 letters. I want to be able to flip through pages like I might with a paper book, so that I might be roughly completely rendering several pages at once, and perhaps parts of several more pages; ideally it might render the page like a real 3D page, but if that’s too troublesome I’d settle for an affine transformation while flipping. Assume that the layout of the pages, with all the shaping, is all done ahead of time and is in memory.

I’ve never seen anyone attempt anything like this before. In the old way of doing things, I think anyone attempting this would render each page to a bitmap and use that as a GPU texture, and I think that could provide acceptable performance (unless you were flipping rapidly through hundreds of pages, because that’d take a lot of GPU memory to keep), but I imagine that the quality of the rendering mid-turn would be fairly atrocious—it could be the sort of thing where the appearance of the text subtly changes half a second after you finish turning the page, as it switches from one renderer to a slightly different one.

Would the performance of this new approach be sufficient to render what I describe at 60fps, while rendering each frame perfectly?

I was talking about the Surface Book; its Intel Core i7-6600U has Intel HD Graphics 520 as its GPU; probably not too far off your 630’s results. (And I’m interested in what integrated graphics can do, more than a dedicated GPU.)

My guess based upon your paper-1 results is that this is probably just barely possible with integrated graphics, so long as you employ a few tricks to reduce the amount of work required.


Short answer, yes, I believe it could work, though not with much margin to spare. It might require some dedicated optimization work to fit the frame budget, especially when you know things about the scene (it's mostly small glyphs and with basically no overdraw) as opposed to being completely general purpose.

Doing a warp transformation in the element processing kernel would probably work just fine, and give you realistic movement and razor-sharp rendering.

I'd love to see such a thing and would very much like to encourage people to build it based on my results :)
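
For the curious, the kind of warp being talked about can be surprisingly little math. A hypothetical sketch (my illustration, not from piet-gpu) of a cylindrical page-curl mapping that could be applied per path point each frame, keeping glyph outlines exact mid-turn:

    /// Map a flat page coordinate to its position mid page-turn: points left of
    /// `curl_x` stay flat, points to the right wrap around a vertical cylinder
    /// of radius `r` (orthographic projection back to 2D drops the depth term).
    fn page_curl(p: (f32, f32), curl_x: f32, r: f32) -> (f32, f32) {
        let d = p.0 - curl_x;
        if d <= 0.0 {
            p
        } else {
            // Arc length d wraps around the cylinder; past half a turn the
            // point folds back toward the curl axis, like a real page.
            let angle = d / r;
            (curl_x + r * angle.sin(), p.1)
        }
    }

    fn main() {
        // Re-evaluating the warped outlines every frame keeps text razor sharp
        // mid-turn, instead of stretching a pre-rendered page texture.
        for x in [50.0_f32, 150.0, 250.0] {
            println!("{:?} -> {:?}", (x, 100.0), page_curl((x, 100.0), 100.0, 80.0));
        }
    }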


Another need for that kind of effect is for real-time events.

I work on software that is used to create visual effects in real-time. It's used in live events such as concerts, clubs, corporate presentations. There you want to be able to render high-quality text in high resolution, even higher than 4K, and still allow a performer to add effects on text such as zooming, scrolling, doing 3D transition effects...

Today the best way to do this is via a font atlas, but it does not look nice when zooming. Also, if you need non-European characters, preparing the font atlas can be heavy.

So I am looking for a library that would allow this kind of rendering manipulation for text rendering. Do you have any suggestions?


In the free software world, Pathfinder is probably your best bet, and I'm hopeful that it will take on speedups from compute (directly inspired by my research). The library that's probably the closest fit for what you describe is Slug.


You say free software and of course, it's nice to have source code and flexibility.

I have nothing against licensing and it looks like things are moving there. Do you have a recommendation for Metal / DirectX support for that kind of rendering facilities? What matters more for me is performance.


Slug has DirectX support (it runs on all major APIs). Pathfinder does not yet, but likely will in time. As always, evaluate the offerings and choose what meets your needs best. Performance is probably comparable between the two, but that depends hugely on workloads and how it's integrated into the rest of your system.


Thank you for sharing your experience I will investigate.


For the OP's effect, remember you're probably going to want motion blur in each frame, which means you can render at a much lower resolution in the direction of motion (as long as you can do multipass rendering)


I don't think you do want motion blur, as you don't know the direction in which the user's eyes are tracking. When games and films do motion blur, they know where they expect the viewer's attention to be, so blurring things they won't be tracking improves the look (especially at the low frame-rate of films). But if you happen to track the moving page, and you are likely to, and it has blur added, then it will look worse.

Look at phone interfaces; even if you scroll fast, they don't add motion blur, and on high framerate and/or low-persistence displays, you can read things while it scrolls.


I have done some testing around this (motion blur while scrolling) and can confirm that motion blur is not practical for scrolling. It “looks nice” and is a nice effect, but makes the text completely unreadable while scrolling.

If you have access to a Mac, the easiest way to check for yourself is to add a CIMotionBlur filter to a table view's layer's filter property (you can change the strength of the blur based on the speed of scrolling if you would like).


Eek, hadn’t thought about how motion blur might be desirable. That’s going to get messy.


I think you could easily get this done in a traditional 3D pipeline if you use a texture-based font, aniso/mip filtering, and GPU instancing. You wouldn't be issuing 2,500 draw calls; you can batch much of it together.

The reason you don't see this is because no one runs a 3d engine in their text readers. Fonts are not usually shared as sprite sheets. You also lose subpixel font rendering.


I would like to point out that caching one bitmap per page will be the preferred solution for a different but important reason: You don't want to have your laptop GPU running at 100% while viewing static text. Fidelity is nice, but I don't think you'll convince many people with an ebook reader that drains your mobile device's battery like that.


I don’t believe this is a problem. It will only need to render it when things change—so maybe the GPU will be sitting at 100% while you’re flipping through pages, but other than that typically very brief time, it should be at 0%.


Noob question, but hasn't some GPU acceleration for 2D been used since forever? I keep being perplexed as to why browsers still have problems with it (at least on my older MacBook with a shitty embedded Intel GPU). I thought some stuff, like scrolling, was offloaded at least in the early/mid-2000s, making for a distinctly meh experience when proper drivers weren't installed.


GPUs derive most of their parallelism from emitting rows of related "quads" (the smallest unit of the screen that supports a difference operator). 2D graphics are "chatty", in window sizes that are more like 5–10 units in diameter. It's hell on HW perf. To make things worse, 2D applications usually want submillisecond latency. A GPU driver/HW stack will struggle to get latency below 1–2 ms. When there's lots of multipass rendering (which is a thing 2D also wants a lot of), latency can go to 10+ ms.


Well said, thanks. It's also a goal of this work to do compositing inside the "fine rasterizer" compute kernel, to avoid multipass as much as possible. Of course, in the general case for stuff like blur that's hard, but that's one reason why I've been working on special cases like blurred rounded rectangles, which I can render pretty much at lightspeed. (It's not integrated into the current code but would be quite easy)


What does latency have to do with 2D? Are you suggesting UI toolkits are writing GPU commands synchronously? You should fill display list, fire it and forget about GPU until the next update.


Pen drawing for UIs. The ideal case is to update a small portion of the screen at about 240hz, to provide a good simulation of pen/pencil feedback. Really, your latency envelope should be on the order of the propagation of sound through the barrel of the marking device, but screens don’t update that fast.


You are probably thinking of https://www.youtube.com/watch?v=vOvQCPLkPt4

Bottleneck is in the input layer. While using a desktop computer look at the mouse cursor, now move the mouse - hardware accelerated 2D graphics, imperceptible latency.

Low latency thru orthogonal multiplexing https://www.youtube.com/watch?v=t1VcC9_yhc0


Surely >95% of screens out there right now are running at 60Hz, so 240Hz "pen drawing" is pretty niche and not a priority?


And >95% of screens don’t support pen drawing.

I expect a screen that supports or is designed for pen drawing to be somewhat more likely to be above 60Hz. All of these things are niche things that not many care about, but in any case, it’d be nice to be able to do better. And like with Formula 1 race cars, benefits from high-end techniques tend to trickle down to other more mainstream targets in time.


It's even more offloaded in browsers than I expected.

I recently had to do a proof of concept displaying a huge gantt chart (I really mean huge!).

Normally, 2D drawing has some disadvantages with textures and scaling/rotating.

This proof of concept included testing out various methods on the <canvas>, including both using WebGL and standard 2D calls.

To my surprise, I was able to get the standard 2D calls way faster, even with texturing and rotations (which I did not expect). All browsers (Chrome, Firefox, Edge, IE) do GPU optimizations with standard 2D canvas drawings.

Only when I was putting a shitload of different things on there (the gantt chart looked more like a barcode at this point) did the WebGL implementation start to outperform the 2D calls.

Note that I didn't go into programming shaders. In the end, it wasn't necessary, since the normal canvas calls already provided plenty of performance.

Also note that I'm not just some random dude with no experience. I'm creating a game dev tool, https://rpgplayground.com, which runs fully in the browser using Haxe and the Kha graphics library, and I've made plenty of games during my career.


Well, some rendering acceleration was with us from the very beginning in the form of DMA (https://en.wikipedia.org/wiki/Direct_memory_access) - transfer of blocks from RAM to VRAM.

And that's pretty much what we still have now, even on modern systems.

For desktop UI purposes (but not for games), GPU acceleration started to become critical only relatively recently - on high-DPI screens. The number of pixels jumped 9 times between 96 ppi and 320 ppi (Retina) screens. CPUs haven't changed that much in that time frame, so you may experience slow rendering on an otherwise perfect screen.


At the very beginning, x86 PC computers didn't have DMA capable of transferring RAM to RAM, not to mention DMA was slower than the CPU: http://www.os2museum.com/wp/the-danger-of-datasheets/ The first PC 2D acceleration was very much on the graphics chip, in the form of either IBM 8514 or TIGA compatibility ($1K cards when introduced). The Amiga also doesn't fit the profile, seeing that most models shipped with unified shared memory (chip RAM).


> Amiga also doesnt fit the profile, seeing most models shipped with unified shared memory (chip ram).

What profile? They did RAM to RAM DMA transfer exactly for video acceleration with the Agnus chip, which would access memory through DMA channels to perform 2D blitting.


c-smile specifically said 'RAM to VRAM'. Agnus/Alice didn't have DMA access to _non-video_ RAM, so the Amiga is automatically out.

What I think he was thinking of was blitters in general, not particular implementations using a DMA controller. The first blitter-accelerated 2D graphics I read about were done on the Xerox Alto using microcode.


I'm arguing from the point of view that 'VRAM to VRAM' (or maybe "shared I/O RAM" in Amiga's case) is a subset of 'RAM to VRAM'. You seem to be arguing from the point of view that just "RAM" in this context means non-video RAM. That's a fair assumption, and I understood it from your last post, but I don't think the difference is significant to the general point that we've had various forms of DMA-based 2D video acceleration for a long time.


As others have said, the 2D pipeline right now is mostly focused on drawing textures (dumb blocks of pixel data). The work in the OP is about drawing SVG graphics (complex sets of shapes, lines, curves).


Yes, both the X Server and Windows have had 2D hardware acceleration for years now: the X Server through a multitude of interfaces [1] and Windows through GDI and now Direct2D [2][3]

[1] https://en.wikipedia.org/wiki/X.Org_Server#2D_graphics_drive...

[2] https://en.wikipedia.org/wiki/Direct2D

[3] https://docs.microsoft.com/en-us/windows/win32/direct2d/comp...


Yes, it's common to have this kind of acceleration in a video controller. There are VESA extensions for blitting to screen memory for example. Even something as basic as a CGA text screen—where rendering of bitmap fonts is offloaded to the video card and the CPU only needs to operate on indices into a table of predefined characters—could be considered a basic form of 2D acceleration.

The article and most discussion here however concerns rasterization, specifically path rendering, which is useful when rendering things like fonts and image formats like SVG. You use primitives such as lines and curves to form outlines of objects and then fill them. It's a problem that does not very easily translate into rendering textured triangles (which is what the GPU is best at) or copying/moving screen regions, so there's some work in doing it in a performant way still, and a lot of it typically happens on the CPU.


There was some form of 2D GPU acceleration used in Windows XP that is no longer used in modern GPUs. I remember a PCI Trident video card that provided such acceleration for my Windows 98-era computer.

There's also a 3D fixed pipeline that was all that the first 3D GPUs could do and it is also removed from modern GPUs.

A game I still play to this day is called rFactor, and it runs much faster in DX9 mode than in DX8 or DX7 mode in my hardware. At the very least, it behaves that way in this week's test. This means my hardware doesn't have that old fixed pipeline.

Some people with old hardware can only run it with decent frame rate in DX7.

So the answer to your question is: yes, 2D has been accelerated since forever, but now we have general purpose 3D programmable video cards, and they are so good that now we only have general purpose 3D programmable video cards and the other forms of acceleration are obsolete.

Except for things like video codecs and DRM.


Try Windows 3.1 instead of Windows XP: 2D acceleration cards started in the early '90s; see https://en.m.wikipedia.org/wiki/S3_Graphics or the Trident pages, for example.


I take your word for it.

It's just that my Win3.1 computer was too slow for me to remember the acceleration.


Even DirectX had a predecessor back in Windows 3.1 with the Win32s extensions. It was called WinG and offered hardware-accelerated blitting.

https://en.wikipedia.org/wiki/WinG


I remember playing something that ran with the Win32s extensions and was similar to Wolfenstein 3D, but it was 3D, and I don't think it had anything to do with 2D acceleration.

Also remember the glorious Fury3 game.

https://www.youtube.com/watch?v=6_8tSpXLuK8


How does this (and Pathfinder for that matter) compare to NanoVG? I've recently been experimenting with swapping out Cairo for NanoVG and it seems much faster. The lack of dashed lines may kill my experiment though unless I can think of a decent workaround.


I have to say I'm consistently impressed by NanoVG. Not because it does anything fancy, but because it doesn't. It's not the fastest or the most featureful vector renderer out there, but it resides in a very nice sweet spot that balances performance, features, and simplicity.

In general I've found NanoVG to have comparable performance to the master branch of Pathfinder. NanoVG currently performs better at text (PF's implementation is designed to get international text right first and optimization hasn't really been done yet), while PF generally does better at complex vector workloads like the tiger.


I've just released a fork of nanovg that does GPU rendering a bit like Pathfinder, so it can support arbitrary paths - nanovg's antialiasing has some issues with thin filled paths. It also adds support for dashed lines.

https://github.com/styluslabs/nanovgXC

If you give it a try, let me know how it works for you.


That is very exciting to hear. I'll give it a shot. Thanks.


I expect it to be faster, but doing performance evaluation is really hard work. Performance depends on so many variables, including the workloads you're trying to render and a ton of factors about the GPU and system. I know people are really curious about competitive performance against other renderers, and it's something I should probably do, but honestly I'd love it if somebody else took up that mantle.


I kind of meant more in terms of how they work and what they can do but I suppose I could do the work in answering that question myself by figuring out how nanovg works then comparing it to what you've written.

I've been finding Patrick's pathfinder work fascinating so it's nice to have yet another project to follow.


Got it. So far my research prototype really just does filled paths and limited strokes, but the architecture should extend to a richer set of graphics primitives. Pathfinder also started out that way but recent work is building out a good chunk of the SVG imaging model. I definitely recommend checking it out, there's a good chance it'll do what you need.


Something I'm not sure I completely understand after reading the article: in what way is this better/more desirable than the classical approach of GPU rendering (with vertices, triangles, and shaders)?


The basic problem is that, without compute, you either have to encode the vector scene into GPU primitives (triangles) as quickly as possible, and all the ways I know of to do that either involve (a) an expensive CPU process or (b) a cheap CPU process but way too much overdraw on GPU side. Compute gives you the best of both worlds, allowing you to upload outlines directly to GPU and do the processing necessary to lower them to primitives on the GPU itself.

Pathfinder in there is actually using regular old GPU rasterization with triangles and so forth, and I'm fairly confident it's about as fast as you can go at D3D10 level (i.e. no compute shaders) without sacrificing quality. Note that the numbers can vary wildly depending on hardware. On my MacBook Pro, with a powerful CPU and limiting myself to the Intel integrated GPU only, Pathfinder is actually about equal to the GPU compute approach on a lot of scenes like the tiger, though it uses a lot of CPU.


> all the ways I know of to do that either involve (a) an expensive CPU process or (b) a cheap CPU process but way too much overdraw on GPU side

That's fixable, take a look: https://github.com/Const-me/Vrmac#vector-graphics-engine

My CPU process is moderately expensive, and I use quite a few tricks to reduce overdraw, e.g. both draw calls (the complete vector image, however complex, takes 2 draw calls to render) use hardware depth buffer and early Z rejection.


Is it true for all 2D rendering?

From my intuition, this seems pretty specialized for vector-like rendering, with a lot of small bezier shapes.


Yeah, when I say 2D I mean vector art. There are a lot of things under the heading of 2D rendering, such as blitting raster sprites, that are much closer to being solved problems. (Though you might be surprised--power concerns, coupled with greatly increased pixel density, have brought renewed attention to performance of blitting lately...)


Question to the author: is the GPU rasterizer the only implementation? What about anything like WARP (https://en.wikipedia.org/wiki/Windows_Advanced_Rasterization...) - fallback rendering when a GPU is not available?


It's something I'd like to explore at some point, also Swiftshader, which might be easier to explore, as it's already Vulkan. I expect performance to be pretty good, but there are already really advanced CPU renderers such as Blend2D. Doing serious performance evaluation is hard work, so I actually hope it's something others take up, as I have pretty limited time for it myself.


I understand.

The problem is that any practical 2D rendering solution has to support both GPU and CPU rendering, unfortunately.

It would be interesting to see any GPU equivalent of something like AGG by Max Shemanarev, RIP.


Practical 2D renderers implement an abstraction layer that lets them easily redirect their output to a number of low level libraries, which can be either CPU or GPU-based (or a mix). I worked on a few such 2D stacks, including OpenGL, DirectX, WebGL, and AGG based, and it took no more than a few days to add new 2D backends to an existing pipeline. Most 2D rendering is based on concepts from PostScript so it's usually easy to do such ports -- except for AGG, that one was a bit like a library from outer space. Maxim himself worked on a hardware accelerated 2D renderer for Scaleform, and it looked nothing like AGG, mostly because AGG is practically impossible to move to a GPU implementation.
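
A minimal sketch of that abstraction-layer shape (hypothetical trait and names, not any particular library's API): application code draws against one interface, and a CPU rasterizer or a GPU command encoder lives behind it.

    trait Canvas2D {
        fn move_to(&mut self, x: f64, y: f64);
        fn line_to(&mut self, x: f64, y: f64);
        fn curve_to(&mut self, c1: (f64, f64), c2: (f64, f64), end: (f64, f64));
        fn fill(&mut self, rgba: u32);
    }

    /// Stand-in backend that just records the calls; a software rasterizer or
    /// a GPU command encoder would implement the same trait.
    struct RecordingCanvas {
        log: Vec<String>,
    }

    impl Canvas2D for RecordingCanvas {
        fn move_to(&mut self, x: f64, y: f64) {
            self.log.push(format!("move_to {x} {y}"));
        }
        fn line_to(&mut self, x: f64, y: f64) {
            self.log.push(format!("line_to {x} {y}"));
        }
        fn curve_to(&mut self, c1: (f64, f64), c2: (f64, f64), end: (f64, f64)) {
            self.log.push(format!("curve_to {c1:?} {c2:?} {end:?}"));
        }
        fn fill(&mut self, rgba: u32) {
            self.log.push(format!("fill {rgba:#010x}"));
        }
    }

    /// Application code draws against the trait and never names a backend.
    fn draw_badge(c: &mut dyn Canvas2D) {
        c.move_to(10.0, 10.0);
        c.line_to(90.0, 10.0);
        c.curve_to((100.0, 10.0), (100.0, 20.0), (100.0, 30.0));
        c.line_to(10.0, 30.0);
        c.fill(0x3366_ccff);
    }

    fn main() {
        let mut canvas = RecordingCanvas { log: Vec::new() };
        draw_badge(&mut canvas);
        println!("{}", canvas.log.join("\n"));
    }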


I know, AGG is just a set of primitives and not an abstraction like class Graphics {...}. But such an abstraction can be assembled from them. I did it once for early versions of Sciter.

Ideally, GPU and CPU rendering backends should match pixel-perfectly, which makes "adding a new 2D backend" tricky at best.


iTerm's usage of Metal on macOS is a good example of the benefits of implementing 2D rendering on the GPU[1].

[1]https://gitlab.com/gnachman/iterm2/-/wikis/Metal-Renderer


That links to iTerm2, which is what I'm guessing you meant to say? Kitty is also a GPU-accelerated terminal emulator and one that I enjoy using: https://sw.kovidgoyal.net/kitty/ Not sure if it uses Metal on Mac, though.


Naive question - but does this change to make use of DX11/shaders in newer GPUs make it more likely to have a good cross-platform UI for app development?

I’ve looked before and although OpenGL is good for windowing, widgets etc, it’s not great for sub pixel rendered/anti aliased text.


Yes, the motivation and long term goal for this work is to provide a performant foundation for cross-platform UI. There's a lot to be done though!


I wonder how it compares performance-wise against my solution of the same problem: https://github.com/Const-me/Vrmac#vector-graphics-engine

I don’t use compute shaders, tessellating input splines into polylines, and building triangular meshes from these.


This is very impressive. Have you benchmarked it against Core Graphics on a Mac? I believe they've done a similar thing - performing the render on Metal.


CoreGraphics is a pure CPU rasterizer, similar to GDI+.

Sciter (https://sciter.com) on MacOS uses Skia/OpenGL by default with fallback to CoreGraphics.

It is possible to configure Sciter to use CoreGraphics on MacOS to compare these two on the same UI by using

    SciterSetOption(NULL, SCITER_SET_GFX_LAYER, GFX_LAYER_CG);
I think it is safe to say that Skia/OpenGL is 5-10 times more performant than CG on typical UI tasks.


Interesting, thanks - I remember hearing at one of the WWDCs that Core Graphics got a 10x speed improvement using Metal. I just read the fine print - it seems only draw calls have a 10x speed improvement, I assume because they render through a layer of some sort. The CA* libraries may use Metal - animation and layers - which I guess is where the 10x draw-call improvement comes in, maybe.

Some discussion here https://arstechnica.com/civis/viewtopic.php?t=1285571 , though a lot of guessing.


Metal is definitely used extensively in CoreAnimation, and Apple UI tends to rely on that - relatively slow (and memory hungry) rendering of layer content, which is then composited very smoothly and nicely in CA.

They might use it for other stuff like glyph compositing (I think this is one reason they got rid of RGB subpixeling, to make it more amenable to GPU), but last I profiled it, it was still doing a lot of the pixels on CPU, as others have stated.


I believe that most HW optimizations in macOS are made at the DWM level - window composition and animation. A window surface is just a bitmap in RAM that needs to be populated by CoreGraphics on the CPU.

In any case, Acrylic/Vibrancy effects or "blur-behind background" (like on the screenshots here: https://sciter.com/sciter-4-2-support-of-acrylic-theming/) are achievable only on the GPU and at the DWM level.


And before that, I'm pretty sure Core Graphics used OpenGL in some situations. I remember seeing CG::OGL in stack traces for WebKit.


Core Graphics is generally slower than Cairo/Pixman.


Is GPU here synonymous with Nvidia?


No, it's designed to be portable to all GPU hardware that can support compute, which these days is a pretty good chunk of the fleet. I tested it on Linux on Intel HD 4000, and the master branch seems to run just fine there, though previous versions had problems.

A lot of the academic literature (Massively Parallel Vector Graphics, the Li et al scanline work) is dependent on CUDA, but that's just because tools for doing compute on general purpose graphics APIs are so primitive. I have a talk and a bunch of blog posts on exactly this topic, as I had to explore deeply to figure it out. See https://news.ycombinator.com/item?id=22880502 and https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html for more breadcrumbs on that.


It uses Metal, and Apple uses AMD, so no.



