Cooperative Threading (byuu.net)
141 points by tjalfi on Nov 13, 2019 | 64 comments


There is a section about pipeline stalls caused by switching stacks, but it is misleading. Switching stacks per se is not a huge penalty: the CPU [1] might need to insert synchronization uops to reconcile the stack engine's view of the stack pointer with the actual stack pointer, but the cost is minimal and often effectively free.

The big penalty is due to (failed) branch prediction: a coroutine switch is essentially an indirect branch, so the predictor might need help (for example, using a different branch instruction address for each coroutine type). Also, the call instruction used to enter the coroutine switch function is not paired with a ret, which messes with the specialized call/return predictor. Changing the final jmp instruction in the switch function to push addr; ret actually makes things worse; the best solution is to inline the switch function into the caller via inline assembly (this also helps with the previous issue and, if the compiler provides the functionality, allows saving only the registers that are actually in use).

edit: last time I did a synthetic benchmark of one of my own coroutine implementations, the switch performance was only constrained by the number of taken branches the CPU could issue (one every other cycle): https://github.com/gpderetta/delimited/blob/master/benchmark...

[1] I'm talking about the typical x86 cpu.


You could well be right. In my own tests, the ESP swap is what tanked performance. A more limited stackless test did not have nearly the same overhead. And indeed, the largest speedup came from setting ESP as soon as possible, getting the return address into EAX, and finally jumping to it. CALL+RET with a late ESP set behaved poorly, leading me to believe it was due to the CPU stalling on the ESP change.

I did also try inlining this, but it was much less portable and actually ran slower in my emulator. I won't claim that will be the case for everyone, of course. YMMV.


Wow. I've used my own stack-switch routine for several years now and didn't realize it could be improved until I read this thread. Thank you very much.

This is my old naive version:

https://github.com/ademakov/MainMemory/blob/master/src/base/...

This is what I have now:

https://github.com/ademakov/MainMemory/blob/cstack-switch-re...


You are missing x/y/zmm regs in the clobber list.


Yes, that's true. However, in my app this has never caused any problems. Apparently even if the compiler auto-vectorizes anything, it does so only with scratch regs or in leaf procedures. I don't have any fp or simd code of my own.


Interesting that you saw that. I guess it could be real, but sounds weird too.


> Basically it requires dropping down to the assembler-level for each supported architecture to implement the context switching, as modifying the CPU context registers and stack frame directly is not permitted by most sane programming languages.

I thought that was (essentially) the definition of longjmp? Thinking further, it seems like the initial setup of additional stacks would require at least taking the covers off setjmp and interacting with its implementation details directly.

https://en.wikipedia.org/wiki/Setjmp.h


About 20 years ago I and two others designed an embedded kernel that switched tasks with setjmp and longjmp. It was exactly as you suspected: the only implementation-specific part was creating a jmp_buf on task creation (which doesn't even require assembly intrinsics). We were only targeting one architecture/ABI at the time (ARM), but it was nice to know that porting to a new architecture wouldn't require a bunch of assembly rewriting (the only assembly was the interrupt trampoline; even our crt0 was C, apart from at least one intrinsic asm instruction to set the stack pointer).
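
For readers who haven't seen the pattern, here is a minimal, hedged sketch of the switching half (not the code from that kernel; task_t, task_yield and sched_run are invented names). Once each task has its own stack and a primed jmp_buf, yielding is just a setjmp/longjmp pair:

    #include <setjmp.h>

    /* Hypothetical task control block: one jmp_buf per task, one for the scheduler. */
    typedef struct { jmp_buf ctx; } task_t;
    static task_t *current;
    static jmp_buf sched_ctx;

    /* Cooperative yield: save this task's context, jump back to the scheduler. */
    void task_yield(void) {
        if (setjmp(current->ctx) == 0)
            longjmp(sched_ctx, 1);
        /* a later longjmp into current->ctx resumes execution right here */
    }

    /* Run one task until it yields; returns when task_yield jumps back. */
    void sched_run(task_t *t) {
        current = t;
        if (setjmp(sched_ctx) == 0)
            longjmp(t->ctx, 1);
    }

The implementation-specific part is priming a fresh task's jmp_buf so the first longjmp into it lands at the task's entry point on its own stack, either by patching the buffer or with the sigaltstack trick mentioned elsewhere in the thread. Strictly speaking, longjmp-ing between stacks is outside what the C standard guarantees, which is the usual caveat with this approach.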


There's an implementation using setjmp/longjmp in his library here [1]. It uses sigaltstack to assign a newly allocated stack to the coroutine. Marc Lehmann's libcoro [2] does the same.

[1] https://github.com/byuu/higan/blob/master/libco/sjlj.c

[2] http://software.schmorp.de/pkg/libcoro.html
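
For anyone who hasn't seen it, the trick boils down to bootstrapping a jmp_buf on a freshly allocated stack by taking a signal there. A rough sketch (hedged: the names are invented, not libco's or libcoro's actual API, and error handling plus restoring the previous signal state are omitted):

    #include <setjmp.h>
    #include <signal.h>
    #include <stddef.h>
    #include <string.h>

    static jmp_buf coro_ctx, main_ctx;
    static void (*coro_entry)(void);

    /* Runs on the alternate stack; capture a jmp_buf rooted in that stack. */
    static void trampoline(int sig) {
        (void)sig;
        if (setjmp(coro_ctx) == 0)
            return;              /* just booking the stack: return from the handler */
        coro_entry();            /* a later longjmp lands here, on the new stack */
    }

    void coro_create(void *stack, size_t size, void (*entry)(void)) {
        stack_t ss = { .ss_sp = stack, .ss_size = size, .ss_flags = 0 };
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = trampoline;
        sa.sa_flags = SA_ONSTACK;        /* deliver the signal on the new stack */
        sigemptyset(&sa.sa_mask);
        sigaltstack(&ss, NULL);
        sigaction(SIGUSR1, &sa, NULL);
        coro_entry = entry;
        raise(SIGUSR1);                  /* trampoline captures coro_ctx and returns */
        /* real code would restore the old sigaltstack and sigaction here */
    }

    /* First switch into the coroutine; it runs entry() on its own stack. */
    void coro_enter(void) {
        if (setjmp(main_ctx) == 0)
            longjmp(coro_ctx, 1);
    }

Jumping back into a signal handler frame that has already returned is the famously non-portable part; it works in practice on the platforms these libraries target, which is one reason both of them also offer assembler backends.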


As a historical note, the sigaltstack trick was popularized, if not invented, by the GNU Pth library.


POSIX has (had?) getcontext() and setcontext():

https://pubs.opengroup.org/onlinepubs/009695399/functions/ge...

Marked obsolescent in that version (the 2004 edition), purportedly replaced by POSIX threads, except those aren't coroutines... but likely still present in whatever libc you use. (There's also pthread_setconcurrency(), but it is (a) optional and (b) marked obsolescent as well ... https://pubs.opengroup.org/onlinepubs/9699919799/functions/p... .)
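
Obsolescent or not, the API is small. A minimal, hedged example of one task yielding back and forth with the main context (glibc-flavoured; some platforms want _XOPEN_SOURCE defined for ucontext.h, and error checking is omitted):

    #include <ucontext.h>
    #include <stdio.h>
    #include <stdlib.h>

    static ucontext_t main_ctx, task_ctx;

    static void task(void) {
        puts("task: first slice");
        swapcontext(&task_ctx, &main_ctx);   /* yield back to main */
        puts("task: second slice");          /* returning resumes uc_link */
    }

    int main(void) {
        char *stack = malloc(64 * 1024);
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = stack;
        task_ctx.uc_stack.ss_size = 64 * 1024;
        task_ctx.uc_link = &main_ctx;        /* where to go when task() returns */
        makecontext(&task_ctx, task, 0);

        swapcontext(&main_ctx, &task_ctx);   /* run until the task yields */
        puts("main: task yielded");
        swapcontext(&main_ctx, &task_ctx);   /* resume; task returns via uc_link */
        puts("main: task finished");
        free(stack);
    }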


Unfortunately set/swapcontext and friends require a system call to change the signal mask. So in practice they are a few orders of magnitude slower than setjmp or similar.


On some systems setjmp/longjmp also change the signal mask, and are just as slow. For those there is _setjmp/_longjmp.

The Linux man page explains excellently:

"POSIX does not specify whether setjmp() will save the signal mask (to be later restored during longjmp()). In System V it will not. In 4.3BSD it will, and there is a function _setjmp() that will not. The behavior under Linux depends on the glibc version and the setting of feature test macros. On Linux with glibc versions before 2.19, setjmp() follows the System V behavior by default, but the BSD behavior is provided if the _BSD_SOURCE feature test macro is explicitly defined and none of _POSIX_SOURCE, _POSIX_C_SOURCE, _XOPEN_SOURCE, _GNU_SOURCE, or _SVID_SOURCE is defined."


You could easily imagine that most or all signal masks in some green-threaded program are the same, and that a cheap userspace implementation of set/swapcontext could call into the kernel only when the signal mask actually changes. I.e., it doesn't have to be a kernel call most of the time.


To change the stack pointer, you either need sigaltstack (not always available) or you have to hack jmpbuf (not remotely portable), but indeed we have done both in my libco library. Assembler is quite a bit faster when you can manage it.


Yes, I've done this successfully with setjmp/longjmp in the past. To set up the stack and make the first jump, you need to either fiddle with the jump buffer, or use something provided by the OS, like Windows fibers or Linux ucontext. Those can be used to fully implement the scheduler, but longjmp is faster and limits the OS-specific part of the scheduler.


> as modifying the CPU context registers and stack frame directly is not permitted by most sane programming languages

Embedded C programmer here. I find this statement highly offensive :D

I wonder if this guy knows how planes and cars and traffic lights and medical equipment all work? Lots of things you rely on every day were made in languages that allow “insane” direct CPU register access.


There is a big difference between modifying the stack control registers directly and modifying general purpose registers directly.


Not really.

What do you think you do when you write in assembly and want to enter or exit a function? You push and pop on the stack control registers.... gasp?

It’s not common, but I’ve had to manually adjust the PC register before. The author and most software devs are so far removed from the metal that the downvotes on my comments are hilarious to me.


>What do you think you do when you write in assembly

You said you were an "Embedded C programmer". Why are you now switching topics and talking about assembly? First of all, there is no single definition of assembly: it is inherently specific to the architecture, so someone knowing x86 assembly doesn't mean that person also knows how to use RISC-V assembly. Since you have moved the goalposts out of the playing field by switching languages, one could now conceive of an architecture that simply has no registers at all, because everything is stored in RAM. Such an architecture would allow a C compiler to still produce valid code, but assembly code would not be able to access any registers whatsoever. Therefore it makes equally little sense for the C programming language to have the ability to access registers.

> The author and most software devs are so far removed from metal, the downvotes on my comments are hilarious to me.

The reason you receive downvotes is that even people who are not "far removed from metal" disagree with your comments, because you are making fun of them.


> The author and most software devs are so far removed from metal

The author writes emulators of multiple game systems in C++. They probably wake up screaming at night thinking about the metal.


Ha, good point. Doesn't stop them from using the term, I've noticed.


> as modifying the CPU context registers and stack frame directly is not permitted by most sane programming languages

My point was, if C++ had a global variable named "stack_pointer" you could assign to, that would indeed be rather insane.


I said directly. You seem to be implying that it is common for important embedded work to involve the programmer manually messing with the stack registers directly in assembly.


Um... I don't know what to tell you - but I do that every day. Guess what - people who make libraries for other embedded devices sometimes still write in assembly. Insane, I know.


You seem to be purposely ignoring what I am saying so that you can contradict something I didn't say. I doubt people are writing the software for medical devices by directly writing to the stack registers (which wouldn't mean the typical push and pop instructions that adjust the stack pointer automatically).


Your first comment didn't mention assembly. You only talked about C. It's insane that you switch the topic of the conversation and then pretend to be the only smart one.


The efficiency of cooperative tasking was already seen in the 1970s, when machines were notoriously slow. In the example below it was roughly 8x better.

https://www.forth.com/resources/forth-programming-language/

"a PDP-11 or Nova could be expected to support up to eight users, although the performance in a system with eight active users was poor.

On this hardware, Moore’s Forth systems offered an integrated development toolkit including interactive access to an assembler, editor and the high-level Forth language, combined with a multitasking, multiuser operating environment supporting 64 users without visible degradation..."


I wonder whether writing the whole system in Verilog and synthesizing a cycle-accurate simulator with Verilator would yield better results...


I work for a silicon IP company and the verilog simulation of our product is about a factor of ten slower than our "cycle accurate" software model.

The FPGA version is about 40 times or so faster than the model and the actual silicon is another 10 times or so faster than that.


How cheap are FPGAs that could simulate an SNES or similar?

Any chance we standardize and commoditize FPGAs so that one can be just another device on a laptop or phone that anyone could use? Like Bluetooth or a GPU or GPS?

This has to happen at some point but how far off from that are we now?

I assume there are different grades of FPGAs and that likely complicates things, but it's like how desktop GPUs are way, way better than phone GPUs... yet both run OpenGL/Vulkan and essentially the same code.


I know of FPGA-based implementations of whole Amigas [1] and I do not think they are particularly expensive. It is very likely that SNES reimplementations also exist already.

Edit: this [2] is a full system for 220 euros, and in addition to the Amiga it emulates a bunch of 8-bit consoles.

[1]: https://en.wikipedia.org/wiki/Minimig

[2]: https://amigastore.eu/en/358-mist-midi-fpga-computer-with-mi...


Those are very good questions. I have a feeling that Intel is looking into this now that they have purchased Altera.


Like ginko says, it probably wouldn't. Verilator still requires you to model at a very low level (synthesizable RTL), much lower than this article is describing. A model that detailed is going to be slower than what you can do without that restriction.

Now, that's not to say that it wouldn't be good enough, i.e., fast enough that your games would still be playable. It's an interesting idea.


Verilog lets you model things natively in parallel, but few people have user-programmable FPGAs lying around sadly. Very cool tech, though!


Just to be clear, artemonster isn't talking about using an FPGA. Verilator is software that compiles Verilog to C++, so you'd still end up with a software emulator.

The upside to this is that Verilog makes it very easy to model parallel hardware. The downside is that you have to model the hardware at the clock-cycle, "flip-flops and Boolean logic" level of detail.


According to the article, being preempted by the kernel is too expensive because of the userland<>kernel context switch. I wonder if it's possible to implement the same preemptive mechanism purely in userland, for example with a timer alarm and a signal handler using swapcontext or similar, to achieve better performance?


Like cfallin said: in order to get an alarm signal on a modern system, that means getting a kernel-mediated interrupt. It could be possible to make that happen with an x86 ring 3 interrupt, the architecture can do it, but I doubt that's supported on any OS with releases in this century, if it was ever used. (Maybe some RTOS, or in a DOS extender or something.)

If you want to fake preemption in userland, you have to add a 'preemption' check every so often. Erlang uses the number of function calls made (reductions), which works because it doesn't have a loop concept, just recursion, which is a function call. As long as you have a check happening frequently enough, you can count the number of calls, or do something cool with rdtsc to track time, or whatever (being careful not to use a syscall that may do the very context switch you were trying to avoid).

Cooperative multitasking is essentially having to put the preemption checks in explicitly. For an emulator, it seems that it's not too hard to piggyback that on cycle tracking. Although it probably depends on the target system's design; some machines have lots of chips in tight lockstep, and some are more asynchronous with smaller overlap.
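
A hedged sketch of what such a check can look like (names invented; Erlang's real reduction accounting lives inside the BEAM dispatch loop):

    /* Decrement a per-task budget at every function call or loop back-edge;
     * when it runs out, switch to the scheduler. No kernel involvement at all. */
    enum { REDUCTION_BUDGET = 4000 };       /* roughly the order BEAM uses per slice */

    static int reductions = REDUCTION_BUDGET;

    extern void yield_to_scheduler(void);   /* the cooperative switch itself */

    static inline void consume_reduction(void) {
        if (--reductions <= 0) {
            reductions = REDUCTION_BUDGET;
            yield_to_scheduler();
        }
    }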


What about allocating one kernel thread to act as a watchdog that detects whether a user-space thread has not performed the preemption check in time, and only sends an interrupt if that's the case?

The preemption check would simply write some data to memory (the actual data wouldn't matter), and the watchdog would use the MMU modification flag to check if the memory page was updated. That way you'd avoid synchronisation problems.


That is essentially what is happening anyhow. Your runtime (e.g. Go) usually has multiple worker threads executing fibres / coroutines (one OS thread for each CPU core, usually). These worker threads have a userspace scheduler which performs userspace (cooperative) context switches. In addition, every so often the OS also interrupts the worker threads and does an OS-level context switch (so that other processes can run).

The problem is two-fold. (1) In cooperative (userspace) multithreading, context switches can only happen at a safepoint, where the runtime knows what's going on and what the current state is; that's necessary for any kind of userspace service provided by the runtime, such as GC, synchronisation, JIT, etc. I think the JVM people (or maybe Go? edit: found the Go issue [0]) are trying to generalize this so that every point of the program is a safepoint; not sure how it's going. (2) Userspace has very limited capacity to even observe OS-level context switches, let alone modify them... E.g. ideally you'd be able to take a look at where (at which instruction) the context switch happened, what the current state is (e.g. values of registers), and maybe modify it (e.g. switch to another coroutine). AFAIK there's no portable way to do this, and some OSs/platforms don't support any way to do this. Basically, pretty much the only thing that can happen after a thread is interrupted by the OS is that the thread resumes exactly where it was.

[0] https://github.com/golang/go/issues/24543


You can implement something similar without needing any special kernel features (a minimal sketch of the timer/flag part follows the list):

- Userspace cooperative scheduling for all the tasks, in multiple threads if multi-core.

- A periodic timer signal sent to all the userspace threads. (Or one thread, but you might have to give it high scheduling priority; not all kernels offer this)

- The key to keeping kernel-userspace transition overhead down is that the periodic timer doesn't run too often.

- The timer signal handler checks the per-thread "context-switched since last check" flag. If set, clear it. If not set, either pre-empt the active co-operative task in that thread (if that's possible; see another comment about safepoints), set a "pre-empt soon" flag for the co-operative task to detect (e.g. if it's looping but checks the flag in the loop), move the inactive tasks in that thread to another thread (work-stealing), or move them to a new thread (using pre-allocated idle threads, because you can't make a new thread in a signal handler).

- If the timing of the timer signals is inconvenient, for example if you want to give the active task about 15ms more execution time, the signal handler may start a second timer instead of taking immediate action.

- Checking a per-thread flag from a signal handler may prove entertaining to do in a standard async-signal-safe way. (This matters more on older platforms where threads are themselves implemented in all sorts of ways.)

- Waking an idle "monitor" thread with a timer would be a cleaner way to do some of this, but there's no guarantee about when the monitor thread will run, and it may be a while if the CPUs are already busy. RT priority helps, if available.
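
A minimal sketch of the timer/flag part mentioned above (hedged: names are invented, error handling is omitted, and a process-wide setitimer is used for brevity; real per-thread delivery needs platform-specific timers such as timer_create with SIGEV_THREAD_ID on Linux):

    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>

    static _Thread_local volatile sig_atomic_t switched_since_tick;
    static _Thread_local volatile sig_atomic_t preempt_soon;

    /* Called by the userspace scheduler on every cooperative switch. */
    void note_cooperative_switch(void) { switched_since_tick = 1; }

    /* Periodic signal handler: if this thread hasn't switched since the last
     * tick, ask the running task to yield at its next check. */
    static void tick(int sig) {
        (void)sig;
        if (switched_since_tick)
            switched_since_tick = 0;
        else
            preempt_soon = 1;
    }

    /* Tasks poll this at loop back-edges, allocation sites, etc. */
    int should_yield(void) {
        if (!preempt_soon) return 0;
        preempt_soon = 0;
        return 1;
    }

    void start_ticker(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = tick;
        sa.sa_flags = SA_RESTART;                  /* don't break interrupted syscalls */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGALRM, &sa, NULL);
        struct itimerval it = { { 0, 10000 }, { 0, 10000 } };   /* ~10 ms period */
        setitimer(ITIMER_REAL, &it, NULL);
    }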


Would this monitor thread run on the same core? If so, a timer interrupt or other kernel preemption event needs to fire to get it scheduled. At that point you might as well handle the preemption directly.

On the other hand, dedicating a core as a preemption supervisor seems wasteful. I guess you could run the supervisor on a hyperthread to reduce the wastage.


I was thinking it would be running on its own core. Cores tend to be quite cheap these days.


That would still involve entering and leaving kernel mode, because timer alarms ultimately are triggered by a hardware interrupt (from e.g. the APIC timer or, in really old x86 machines, the 8254 programmable interval timer, aka IRQ 0). So IRQ -> switch to kernel timer IRQ handler (prologue saves all usermode registers) -> examine internal data structures, find pending alarm signal -> modify process state, pushing signal frame and setting RIP to signal handler -> return to userspace. That's probably slower than just Timer IRQ -> switch to kernel -> invoke scheduler -> decides to preempt, changes to new thread -> restore context, return to user space (but I'd be curious if someone actually measured).

Fundamentally, preemption has to originate from a hardware IRQ (or IPI from another core), so really the only way would be to kernel-bypass by setting an IRQ handler in userspace (ring 3). That's technically possible on x86 (IDT entry can have a ring-3 code segment) but I don't think the kernel has a mechanism for that...


How do user-mode network drivers work? Do they rely on polling, or do they still have the kernel handle interrupts?


I wonder whether virtualization hardware can be used for this.


Yes, that is true. You would need to think about how, or if, you handle the other properties process separation brings, like proper isolation, but for sheer context-switch performance that works. Erlang does that, for instance:

https://stackoverflow.com/questions/2708033/technically-why-...


If you're implementing a VM, it's straightforward to do this by counting the number of virtual instructions executed for a given "process" and switching to the next one after so many executions. This is, in fact, how Erlang's VM works, and how it's able to achieve preemptive multitasking entirely in userspace (and therefore avoiding the userspace<>kernelspace context switches).
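
In sketch form (hypothetical opcodes and struct, not the BEAM's actual loop), the budget is simply the bound of the dispatch loop, so "preemption" costs one compare per virtual instruction:

    #include <stddef.h>
    #include <stdint.h>

    struct vmproc { const uint8_t *code; size_t pc; int done; };

    enum { OP_NOP, OP_HALT, SLICE = 4000 };

    /* Run one green process for at most SLICE virtual instructions, then
     * return to the scheduler so the next process gets a turn. */
    void run_slice(struct vmproc *p) {
        for (int budget = SLICE; budget > 0 && !p->done; budget--) {
            switch (p->code[p->pc++]) {
            case OP_NOP:  break;            /* ...real opcodes go here... */
            case OP_HALT: p->done = 1; break;
            }
        }
    }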


Haskell has something similar: by default it uses SIGVTALRM for preemption but you can configure the runtime to instead hook the allocator, and stop the thread once it has allocated enough. (Threads have to be stopped during allocation for GC anyways so the functionality is already there.) The only difficulty is threads that do not allocate at all, but that doesn't happen in real-world Haskell code outside special micro benchmarks.


What’s with the mprotect here: https://github.com/byuu/bsnes/blob/master/libco/x86.c ?


Good question. https://github.com/byuu/bsnes/blob/bd8e94a7c7cbfdf7ac0b7f24c... suggests there are reasons why the section(text) approach might not work, and https://github.com/byuu/bsnes/blob/6b7e6e01bb025bf5cbeb92e65... explicitly says it doesn't work with clang, though I don't see why there's not some way of solving this without mprotect that works everywhere.


but why is it not simply using a separate asm file instead of embedding the preassembled binary as a constant?


Because for many years libco has supported all C89 compilers. I didn't want a dependency on GNU as on Windows, nor did I want to write an MSVC variant. There is no technical reason it cannot be inline asm now, especially now that we have Clang with a compatible asm syntax.


I do wonder why byuu wrote his own library rather than using one of the half-dozen existing ones. Presumably there is something about this use-case that makes a dedicated library useful.


There's probably hundreds of these libraries. Almost everyone chooses to just write their own.

I wanted one because in 2007, no project existed that was laser-focused on maximizing performance (I switch threads tens of millions of times a second in my emulators), was lightweight enough (I implement my own schedulers for my emulators), and was portable enough (I am not even aware of a CPU architecture libco won't run on currently.)

The library is extremely small, and being in control of it allows me to adapt and support new targets directly as needed.


The cool thing about libco is that it's also used in LXD. Quite a novel use for this library!


Do we really need all this complexity? How about, if we have, let's say, N chips to emulate, doing a simple loop and calling 'emulate' for each of the chips?

    for(EmulatedChip &chip : chips) {
        chip.emulate(time);
    }
Why would the above not be suitable for emulation?


That model is unsuitable for a system where chips are interconnected. Your model breaks down once chip x needs to produce something on a bus for chip y on cycle z. Of course, then you could emulate with time = 1 cycle, but at that point you are stuck implementing a state machine either way.

If you just have a bunch of chips operating separately without the need for synchronous communication your model works, but that's unrealistic.


I think this approach is the "state machine" approach given in the article. The added complexity in moving to a coroutines/co-operative threading model is balanced by the drastic code reduction, pointed out in the article by "Yes, it's really that dramatic of a difference".


Yes, you inevitably end up implementing a state machine like that.


I consider it something of a trap in fact.

You start writing a new emulator for the first time, dispatching entire instructions in one go, and so you don't even really need a state machine at all.

Then you start trying to get the timing better with opcode-cycle granularity, and in comes a separate state machine for every instruction.

Then you realize you need clock-cycle granularity to fix certain edge-case games (eg emulating the effects that occur during bus accesses), and suddenly the prospect of a state machine for every cycle of every instruction becomes overwhelming.

You would then be realistically stuck with the choice to either stop improving your accuracy, or rewriting things cooperatively.

Since I'm focusing byuu.net articles toward aspiring emulator developers, I thought it would be an important topic to cover.
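
To make the contrast concrete for those aspiring developers, here is a hedged toy sketch (bus_read, use_value and wait_cycles are invented stand-ins, not code from any real emulator): the same "read, wait a cycle, then consume" behaviour written as a per-cycle state machine versus as straight-line code in a cooperative thread.

    #include <stdint.h>

    uint8_t bus_read(uint16_t addr);   /* invented helpers */
    void use_value(uint8_t v);
    void wait_cycles(int n);           /* cooperative: switches to the other chips' threads */

    /* State-machine style: every wait point becomes an explicit state. */
    struct cpu { int state; uint8_t latch; };

    void cpu_step_cycle(struct cpu *c) {
        switch (c->state) {
        case 0: c->latch = bus_read(0x4016); c->state = 1; break;  /* issue the read */
        case 1: /* dummy cycle */            c->state = 2; break;  /* wait           */
        case 2: use_value(c->latch);         c->state = 0; break;  /* consume        */
        }
    }

    /* Cooperative-thread style: the wait is just a yield, no state to thread through. */
    void cpu_thread(void) {
        for (;;) {
            uint8_t v = bus_read(0x4016);
            wait_cycles(1);             /* the dummy cycle */
            use_value(v);
        }
    }

Multiply the state-machine version by every cycle of every instruction of every chip, and the code-reduction point above becomes clear.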


> Unlike coroutines, each cooperative thread has its own stack frame.

This seems to be incorrect. See:

https://stackoverflow.com/questions/28977302/how-do-stackles...

> In contrast to a stackless coroutine a stackful coroutine can be suspended from within a nested stackframe.

It seems the author is trying to reinvent stackful coroutines, and calls them "cooperative threads".


No one really agrees on the exact terminology. There are stackful and stackless coroutines, continuations, green threads, fibers, cooperative threads, etc. Generally when people talk about coroutines without a qualification they are meaning the stackless variety. I personally consider a coroutine with a stack frame to be a cooperative thread, but understandably not everyone will agree on naming. The nomenclature hopefully should not affect the broader message of the article though.

In any case, I am not so much trying as I already have. C++ does not have stackful coroutines. I implemented them via my libco library and have been using them for over a decade now.


You could argue that coroutines as originally specified implied stacklessness: a coroutine was just a generalization of a subroutine that could return (to its immediate caller) multiple times, so at most one stack frame needs to be preserved.

A better name for stackful coroutines is one-shot continuations.



