
According to the article, being preempted by the kernel is too expensive because of the userland<>kernel context switch. I wonder if it's possible to implement the same preemptive mechanism purely in userland, for example with a timer alarm and a signal handler using swapcontext or similar, to achieve better performance?
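For concreteness, a minimal sketch of that idea on POSIX (assuming Linux/glibc; note that calling swapcontext() from a signal handler is not async-signal-safe, which is exactly where real implementations get hairy):

    /* Minimal sketch: "preempting" a task from a SIGVTALRM handler.
       NOT async-signal-safe; a real scheduler needs far more care. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;

    static void on_tick(int sig) {
        (void)sig;
        swapcontext(&task_ctx, &main_ctx);  /* yank control back */
    }

    static void busy_task(void) {
        for (volatile unsigned long i = 0;; i++) { }  /* never yields */
    }

    int main(void) {
        static char stack[64 * 1024];
        signal(SIGVTALRM, on_tick);

        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = stack;
        task_ctx.uc_stack.ss_size = sizeof stack;
        task_ctx.uc_link = &main_ctx;
        makecontext(&task_ctx, busy_task, 0);

        struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
        setitimer(ITIMER_VIRTUAL, &it, NULL);  /* tick every 10ms of CPU */

        swapcontext(&main_ctx, &task_ctx);     /* run task until the tick */
        memset(&it, 0, sizeof it);
        setitimer(ITIMER_VIRTUAL, &it, NULL);  /* stop the timer */
        puts("task was preempted in userland");
        return 0;
    }

Note that each signal delivery is still a kernel round trip, so this only changes where the scheduling decision is made, not the cost of the interrupt itself.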


Like cfallin said: in order to get an alarm signal on a modern system, you need a kernel-mediated interrupt. It could be possible to make that happen with an x86 ring-3 interrupt (the architecture can do it), but I doubt that's supported on any OS with releases in this century, if it ever was. (Maybe some RTOS, or a DOS extender, or something.)

If you want to fake preemption in userland, you have to add a 'preemption' check every so often. Erlang uses the number of function calls made (reductions), which works because it doesn't have a loop concept, just recursion, which is a function call. As long as you have a check happening frequently enough, you can count the number of calls, or do something clever with rdtsc to track time (being careful not to use a syscall that might trigger the very context switch you were trying to avoid).
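For illustration, such a planted check might look like this in C. This is a hedged sketch: the budget numbers and yield_to_scheduler() are made up, and __rdtsc() is the GCC/Clang x86 intrinsic:

    #include <x86intrin.h>

    extern void yield_to_scheduler(void);        /* hypothetical hook */

    static _Thread_local int reductions = 2000;  /* Erlang-style budget */
    static _Thread_local unsigned long long slice_end;  /* set by scheduler */

    /* Planted at function entries / loop back-edges by the compiler. */
    static inline void preempt_check(void) {
        if (--reductions <= 0 || __rdtsc() >= slice_end) {
            reductions = 2000;
            yield_to_scheduler();
        }
    }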

Cooperative multitasking essentially means putting the preemption checks in explicitly. For an emulator, it seems not too hard to piggyback that on cycle tracking, although it probably depends on the target system's design; some machines have lots of chips in tight lockstep, and some are more asynchronous with smaller overlap.


What about allocating one kernel thread to act as a watchdog that detects if a user-space thread has not performed the preemption check in time, and only sends an interrupt in that case?

The preemption check would simply write some data to memory (the actual data wouldn't matter), and the watchdog would use the MMU's page-modification (dirty) flag to check whether the memory page was updated. That way you'd avoid synchronisation problems.
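A sketch of the watchdog shape, substituting an atomic per-thread counter for the MMU dirty-bit trick (reading page dirty flags from userspace is OS-specific, e.g. Linux's soft-dirty bits via /proc/PID/pagemap); the names here are hypothetical:

    #include <pthread.h>
    #include <signal.h>
    #include <stdatomic.h>
    #include <unistd.h>

    #define NTHREADS 4
    static _Atomic unsigned long progress[NTHREADS];  /* bumped by each check */
    static pthread_t workers[NTHREADS];               /* filled in at spawn time */

    /* Called by worker i at each preemption check. */
    static inline void preempt_check(int i) { progress[i]++; }

    static void *watchdog(void *arg) {
        (void)arg;
        unsigned long last[NTHREADS] = { 0 };
        for (;;) {
            usleep(10 * 1000);                        /* 10ms period */
            for (int i = 0; i < NTHREADS; i++) {
                unsigned long now = progress[i];
                if (now == last[i])                   /* no check since last tick: */
                    pthread_kill(workers[i], SIGURG); /* interrupt it; workers install
                                                         a SIGURG handler that preempts
                                                         (Go's async preemption also
                                                         uses SIGURG) */
                last[i] = now;
            }
        }
        return NULL;
    }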


That is essentially what happens anyway. Your runtime (e.g. Go) usually has multiple worker threads executing fibres/coroutines (one OS thread per CPU core, usually). These worker threads have a userspace scheduler which performs userspace (cooperative) context switches. In addition, every so often the OS also interrupts the worker threads and does an OS-level context switch (so that other processes can run).

The problem is twofold. (1) In cooperative (userspace) multithreading, context switches can only happen at a safepoint, where the runtime knows what's going on and what the current state is; that's necessary for any kind of userspace service provided by the runtime, such as GC, synchronisation, JIT, etc. I think the JVM people (or maybe Go? edit: found the Go issue [0]) are trying to generalise this so that every point of the program is a safepoint; not sure how it's going. (2) Userspace has very limited capacity to even observe OS-level context switches, let alone modify them. Ideally you'd be able to see where (at which instruction) the context switch happened, inspect the current state (e.g. register values), and maybe modify it (e.g. switch to another coroutine). AFAIK there's no portable way to do this, and some OSs/platforms don't support any way to do it at all. Pretty much the only thing that can happen after a thread is interrupted by the OS is that it later resumes exactly where it was.

[0] https://github.com/golang/go/issues/24543
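To make (1) concrete: the classic way JITs implement safepoint checks is a read of a dedicated "polling page" that the runtime unmaps when it wants every thread stopped. A hedged sketch of the idea (HotSpot does something like this; details vary by VM):

    /* The runtime mprotect()s poll_page to PROT_NONE to "arm" it;
       the next poll in every thread then faults at a known point,
       and the SIGSEGV handler parks the thread for GC/deopt/etc. */
    static volatile char *poll_page;   /* mapped by the runtime at startup */

    static inline void safepoint_poll(void) {
        (void)*poll_page;              /* a single cheap load when unarmed */
    }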


You can implement something similar without needing any special kernel features (see the sketch after this list):

- Userspace cooperative scheduling for all the tasks, across multiple threads if multi-core.

- A periodic timer signal sent to all the userspace threads. (Or one thread, but you might have to give it high scheduling priority; not all kernels offer this)

- The key to keeping kernel-userspace transition overhead down is that the periodic timer doesn't fire too often.

- The timer signal handler checks the per-thread "context-switched since last check" flag. If set, clear it. If not set, either pre-empt the active co-operative task in that thread (if that's possible; see the other comment about safepoints), set a "pre-empt soon" flag for the co-operative task to detect (e.g. if it's looping but polls the flag in the loop), move the inactive tasks in that thread to another thread (work-stealing), or move them to a new thread (using pre-allocated idle threads, because you can't create a new thread in a signal handler).

- If the timing of the timer signals is inconvenient, for example if you want to give the active task about 15ms more execution time, the signal handler may start a second timer instead of acting immediately.

- Checking a per-thread flag from a signal handler may prove entertaining to do in a standard async-signal-safe way. (This matters more on older platforms where threads are themselves implemented in all sorts of ways.)

- Waking an idle "monitor" thread with a timer would be a cleaner way to do some of this, but there's no guarantee about when the monitor thread will run, and it may be a while if the CPUs are already busy. RT priority helps, if available.
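Here's the flag-checking core of that scheme sketched in C (names hypothetical; sig_atomic_t keeps the handler's reads and writes async-signal-safe, assuming thread-local access itself is signal-safe on your platform, which holds on modern Linux/glibc but historically not everywhere):

    #include <signal.h>

    static __thread volatile sig_atomic_t switched_since_tick;
    static __thread volatile sig_atomic_t preempt_soon;

    /* The userspace scheduler calls this at every cooperative switch. */
    static void note_context_switch(void) { switched_since_tick = 1; }

    static void on_timer_signal(int sig) {
        (void)sig;
        if (switched_since_tick) {
            switched_since_tick = 0;   /* task is behaving; nothing to do */
        } else {
            preempt_soon = 1;          /* ask the hogging task to yield */
            /* ...or hand this thread's queued tasks to a pre-allocated
               idle thread, since we can't create threads in a handler */
        }
    }

    /* Long-running tasks poll this inside their loops. */
    static inline int should_yield(void) { return preempt_soon; }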


Would this monitor thread run on the same core? If so, a timer interrupt or other kernel preemption event needs to fire to get it scheduled, and at that point you might as well handle the preemption directly.

On the other hand, dedicating a core as a preemption supervisor seems wasteful. I guess you could run the supervisor on a hyperthread to reduce the waste.


I was thinking it would be running on its own core. Cores tend to be quite cheap these days.


That would still involve entering and leaving kernel mode, because timer alarms are ultimately triggered by a hardware interrupt (from e.g. the APIC timer or, on really old x86 machines, the 8254 programmable interval timer, aka IRQ 0). So: IRQ -> switch to kernel timer IRQ handler (prologue saves all usermode registers) -> examine internal data structures, find pending alarm signal -> modify process state, pushing a signal frame and setting RIP to the signal handler -> return to userspace. That's probably slower than just: timer IRQ -> switch to kernel -> invoke scheduler -> decide to preempt, switch to new thread -> restore context, return to userspace (but I'd be curious whether someone has actually measured it).

Fundamentally, preemption has to originate from a hardware IRQ (or an IPI from another core), so really the only way would be to bypass the kernel by installing an IRQ handler in userspace (ring 3). That's technically possible on x86 (an IDT entry can point at a ring-3 code segment), but I don't think the kernel has a mechanism for that...


How do user-mode network drivers work? Do they rely on polling, or do they still have the kernel handle interrupts?


I wonder whether virtualization hardware can be used for this.


Yes, that is true. You would need to think about how, or whether, you handle the other properties that process separation brings, like proper isolation, but for sheer context-switch performance that works. Erlang does that, for instance:

https://stackoverflow.com/questions/2708033/technically-why-...


If you're implementing a VM, it's straightforward to do this by counting the number of virtual instructions executed for a given "process" and switching to the next one after so many executions. This is, in fact, how Erlang's VM works, and how it achieves preemptive multitasking entirely in userspace (thereby avoiding the userspace<>kernelspace context switches).
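A hedged sketch of what that looks like in a bytecode dispatch loop; the opcodes, fetch() and schedule_next() are made up for illustration, not BEAM's actual code:

    struct vm_proc;                          /* one green "process" */
    extern int  fetch(struct vm_proc *p);    /* next opcode (hypothetical) */
    extern void schedule_next(struct vm_proc *p);

    enum { OP_ADD, OP_JMP, OP_HALT };

    void run(struct vm_proc *p) {
        int reductions = 2000;               /* BEAM uses a budget like this */
        for (;;) {
            if (--reductions <= 0) {         /* slice used up: requeue this
                                                process, pick the next one */
                schedule_next(p);
                return;
            }
            switch (fetch(p)) {
            case OP_ADD: /* ...execute... */ break;
            case OP_JMP: /* ...execute... */ break;
            case OP_HALT: return;
            }
        }
    }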


Haskell has something similar: by default it uses SIGVTALRM for preemption, but you can configure the runtime to hook the allocator instead and stop a thread once it has allocated enough. (Threads have to be stoppable at allocation points for GC anyway, so the machinery is already there.) The only difficulty is threads that don't allocate at all, but that doesn't happen in real-world Haskell code outside special microbenchmarks.
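The same trick transplants to any runtime with a central allocation path. A sketch in C rather than GHC's actual RTS code (bump_alloc and scheduler_yield are hypothetical):

    /* Hedged sketch of allocation-triggered preemption (not GHC's code).
       Each allocation draws down a budget; hitting zero doubles as a
       safepoint, which is why non-allocating loops escape it. */
    #include <stddef.h>

    extern void *bump_alloc(size_t n);      /* hypothetical bump allocator */
    extern void scheduler_yield(void);      /* hypothetical yield hook */

    static _Thread_local long alloc_budget = 64 * 1024;

    void *vm_alloc(size_t n) {
        alloc_budget -= (long)n;
        if (alloc_budget <= 0) {            /* budget exhausted: safepoint */
            alloc_budget = 64 * 1024;
            scheduler_yield();
        }
        return bump_alloc(n);
    }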



