Linux 3.9 introduced a new way of writing socket servers

rdtsc · on Aug 24, 2013

> Now the question is why to bother with multiprocess socket servers at all - aren't threads and events better? There's at least one good niche for them - dynamic languages like Python or Ruby, which need multiple OS processes to achieve real concurrency [my emphasis]

That is not true. It is an often repeated misconception. It makes it sound like Python creators were just incompetent and just stuck threads in there even though they are completely useless. In fact Python's threads work well for IO concurrency. I used them and saw great speedup when accepting and handling simultaneous socket connections. Yes you won't get CPU concurrency, but if your server is not CPU bound you might not notice much of a difference.

IO concurrency is real concurrency. In 8 years using Python for fun and professionally I probably wrote more IO concurrent code than CPU concurrent code. Even then for CPU concurrent code I would have had to drop into C using an extension (and there you can release the GIL anyway).

Now, the obvious follow up is that in case of IO concurrency you are often better of using gevent or eventlet. You get lighter weight threads (memory wise) and less chances of synchronizations bugs (since greenlet based green threads will switch only on IO concurrency points, socket reads, sleep and explicit waits on green semaphores and locks).

bad_user · on Aug 25, 2013

> you won't get CPU concurrency, but if your server is not CPU bound you might not notice much of a difference

That's not true. Most applications, with the possible exceptions of proxies, are also CPU bound.

Take for example a web service that receives JSON documents. The act of parsing JSON documents is CPU bound. The act of creating a response is CPU bound. In between you can also have IO bound operations, like fetching data from a MySQL database or a Memcached instance, however in the process of creating the final response you also need to transform the data received and that's also CPU bound.

As a real world example, I worked on a web-service written in Scala and running on the JVM. Initially it was running on only 8 Heroku dynos and these instances were receiving over 30,000 requests per second of real traffic. These Heroku instances are of course under-powered, because on my modest laptop the same web server is able to handle more than 10,000 requests per second.

And yes, asynchronous I/O lets you easily have 100,000 connections per server. But if you need throughput, then the CPU starts being a bottleneck.

Of course, my problem with Python and why I migrated away from it is that in truth Python sucks for asynchronous I/O too. But that's another story.

illumen · on Aug 25, 2013

Wrong. Those examples are IO bound - memory. You'll find the CPU waiting on memory in these cases.

JSON is parsed in C with CPython, or in assembly with PyPy.

As a real world example, I've done realtime image processing of a gigabyte per second worth of data on a single machine with asynchronous python. It was IO bound, we had more CPU to spare. Hell, we even had some GPUs sitting there not doing anything because they weren't needed.

If you're doing real performance computing, then taking advantage of GPUs/DSP or other hardware is where it is at anyway. Python is quite good at a glue language for interfacing to these things.

johnsoft · on Aug 25, 2013

Memory is I/O from the hardware standpoint, but I don't think that makes sense from a Python coder's standpoint. Python doesn't release the GIL while waiting for a memory fetch, so multiple threads can't concurrently wait on memory the CPU asks for. You're still waiting until the GIL is released before any thread which needs it can execute. If you're calling out to GPUs or DSPs, it's likely the interop code releases the GIL, improving the concurrency situation. But I think by "real concurrency", the author was referring to pure Python, CPU-bound (or memory-bound) code.

ithkuil · on Aug 25, 2013

Unfortunately if the CPU is waiting for memory, then thread is stuck, even if the hardware CPU had resources to execute some other instructions it cannot execute them from the same stream of instructions. Instead it could execute another stream of instructions, i.e. another thread. This is how hyperthreading works and still requires you to write a multithreaded application with the same problems and requirements.

sluukkonen · on Aug 24, 2013

What he means is true parallelism (although alternate ruby and python implementations have it).

rdtsc · on Aug 24, 2013

Downloading 2 web pages at the same time without one blocking another from completing is true parallelism. The request is sent for one, while it is in progress (maybe server is slow), another one can go out and come back with data. This can happen for hundreds or thousands of them. These are executed in parallel. So we got concurrent units of work executed at the same time, I fail to see how that is not parallel.

Now this is IO concurrency but it is real concurrency. Adding CPU concurrency would be very nice. It might speed things up a bit, or it might not. It really depends.

As an example consider haproxy. The little proxy that could. It handles large amounts of concurrent connection in parallel and it is single threaded in its default configuration. I've heard of 100k connections. It deals with IO concurrency. Chances are, making it multi-threaded might not dramatically improve its performance (it might even slow it down).

meowface · on Aug 24, 2013

Concurrent generally means "occurring or able to occurr at the exact same time." So you do have parallel and asynchronous IO, but not concurrent. There's lots of debate over the terms, but I think that's the agreed "exact definition."

Of course, does this really make a difference for network IO? Almost always the answer is no. The difference will be on the order of microseconds, maybe milliseconds.

rdtsc · on Aug 24, 2013

My definition is this:

* Concurrency is a property of relationships between tasks in the problem (or the algorithm). Is fetching one page for example independent of another one. If could be so it is concurrent, but it also might not be true, if it is a child page. You have to fetch one page, look at links and then fetch those pages. To the tasks have hard coded sequence so those are not concurrent.

* Parallelism is how that algorithm or problem is solved or executed. It could be that you can execute all concurrent units at the same time so you achieve parallelism, which is great. Or it could be that due to a particular architecture or other reasons you execute it serially. Maybe you just have a while loop and fetch one page, wait fetch another one. The problem is concurrent but it is not run in parallel.

Notice my definition doesn't include CPU or IO in there. In real world there is both. CPU concurrency interleaved with IO concurrency. That you can then end up running none, one or both in parallel when you execute.

skyraider · on Aug 24, 2013

To simplify:

Concurrent tasks complete in the same, overlapping time period.

Parallel tasks run at literally the same time.

scott_s · on Aug 24, 2013

Yes. And this implies that concurrent is not necessarily parallel, but parallel is always concurrent. A way to keep the notions straight is that if you're implementing a kernel that will only ever run on one core, you still have to worry about concurrency: all processes, including the kernel, are time-sharing the single core.

barrkel · on Aug 25, 2013

parallel is always concurrent

I don't really agree. The heart of the problem of concurrency is non-determinism, but it's perfectly possible to have deterministic parallel algorithms. Normally the key is not letting the parallel operations interact with one another.

So to me, a useful (for discussion) definition of concurrency involves multiple logical tasks, overlapping in time, and interacting with one another, in a non-deterministic way.

Whereas parallelism is concerned with taking advantage of physical hardware that can do more than one thing simultaneously.

And concurrency does not imply parallelism, nor does parallelism imply concurrency, under my understanding. In particular, data parallelism like SIMD or CUDA is not concurrent.

rdtsc · on Aug 25, 2013

> So to me, a useful (for discussion) definition of concurrency involves multiple logical tasks, overlapping in time, and interacting with one another, in a non-deterministic way.

Hmm, I would think it would be the opposite, they're concurrent precisely because they don't have to interact. They can run independently. 2 requests from a server are concurrent because they don't have to know about each other and don't have to interact with each. This is a property of the problem domain (idealized web requests) this doesn't tell us anything about how they'll run (in parallel or not).

> Whereas parallelism is concerned with taking advantage of physical hardware that can do more than one thing simultaneously.

I agree with that.

> And concurrency does not imply parallelism, nor does parallelism imply concurrency, under my understanding. In particular, data parallelism like SIMD or CUDA is not concurrent.

Don't quite agree with that and don't see why SIMD algorithms have to be a special case. Maybe you compute a dot product between 2 vectors. If you write the algorithm down you have a bunch of multiplications and a sum. You notice that it has a lot of concurrency (the algorithm). If you don't have SIMD you could spawn a thread to multiply out each pair and then to sum. That would be silly. But you'd run in parallel. You could just do it sequentially with a for loop. But if you have SIMD, it know how to run those concurrent algorithmic steps in parallel.

RivieraKid · on Aug 24, 2013

According to wikipedia:

In computer science, concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other.

But it doesn't really matter, what matters is whether people understand each other, not whether they're using the "correct" words.

rbanffy · on Aug 25, 2013

It worries me people when people think they understand each other while using the wrong words.

RivieraKid · on Aug 25, 2013

If they really do understand what each other is trying to say, it doesn't matter much what words they're using.

What annoys me is when half of the comments is about form, not about substance. It's understandable, people (including me) love to correct mistakes of other people, but it still annoys me.

rbanffy · on Aug 25, 2013

When people start misusing a word, it loses its original meaning. There is an avoidable period of confusion between the moment people start using word A to mean a subset of A and when everyone agrees A refers to a subset of what previously was A.

I like to avoid confusion whenever possible.

plainOldText · on Aug 24, 2013

I think a lot of people confuse parallelism with concurrency. The easiest analogy I can think of is this:

1. Concurrent means having two cups of water, one in each hand, and drinking(think CPU computation) a little bit from one, then switch to the other. While you drink from a cup someone is filling up the other (think socket IO)

2. Parallel means having two cups, one in each hand,and lifting them up and drinking from them at the same exact time.

gruseom · on Aug 25, 2013

The confusion is worsened by the fact that "concurrent" literally means "at the same time", so calling interleaved timeslices "concurrent" is an oxymoron.

The fact that it's so hard to remember which is supposed to be "concurrency" and which "parallelism" is an indicator of how weakly these words are bound to those meanings.

plainOldText · on Aug 25, 2013

I think you've very much nailed it.

I like to imagine that concurrent processes, concur (agree) on how they should share the time slices of the CPU, and parallel processes don't ever give a damn about each other, cause each has its own core, and like parallel lines, they never meet.

Now, for clarity purposes I would add that in fact concurrent processes don't "choose" per se when to run, that's the schedulers job.

I think the whole concurrent/parallel confusion is worsened even more by that fact that on a multi-core system concurrent processes can in fact be executed in parallel.

ithkuil · on Aug 25, 2013

well, "concurrent" literally means "that run together", without specifing how the race takes place, perhaps the runners have to interleave their steps :-)

gruseom · on Aug 25, 2013

Oh! Of course you're right. This is brilliant, and clearly the right way to look at it. I hope more people notice what you've said here.

What ithkuil pointed out is that the root "cur" in "current" comes from the Latin for "to run". Thus "concurrent" does not literally mean "at the same time"—what I said was wrong. It literally means "running together". There's not any piece of that word that technically refers to time, so it's not an oxymoron to use it to describe interleaved timeslicing.

Perhaps it would be clearer if we spoke of "concurrent vs. simultaneous" processes rather than "concurrent vs. parallel". I'm not sure; I still don't think everyone is talking about the same things.

ithkuil · on Aug 26, 2013

seriously, the whole issue of whether threaded code is really running in parallel or not (i.e. whether adding more cpus will make the code run faster) is misleading.

Context switches without the compiler being explicitly aware of when this happens can yield similar issues whether the context switch is done in software or if memory accesses are interleaved because the code is genuinely running on multiple execution units.

The problem stems from the fact that both the compiler and the processor might perform memory access in a different order than what you'd expect. I'd suggest an interesting read about it at http://ridiculousfish.com/blog/posts/barrier.html

Asynchronous programming allows to process effectively one event at a time, where things happen exactly as defined by a simple programming model, and the compiler can know what it can safely be done to produce the requested side effects.

If the grain of the events is fine enough you can reach the same effect as being concurrent, from the point of view of task being performed, while actually there is nothing really concurrent from the point of view of the actual code that is running.

Thus, it's not about the definition of concurrency per se, but about what is being concurrent in the system.

rdtsc · on Aug 25, 2013

Yes. Very unfortunate. If they were replaced by random new labels "frob" or "zbring" it would probably be easier to convey the ideas behind then. Existing colloquial meanings and interpretations just cause confusion.

Maybe that is why they stick with Latin when practicing law. Each term then is in a separate language and less likely to cause confusion or collisions with the English language.

nknighthb · on Aug 24, 2013

http://www.merriam-webster.com/dictionary/concurrent

http://en.wiktionary.org/wiki/concurrent

http://en.wikipedia.org/wiki/Concurrency_(computer_science)

To state the obvious, you're attempting to make distinctions that either don't exist, or do not have a consensus. You need to find new words. Your definition of concurrent is just... wrong.

anaphor · on Aug 24, 2013

You're the one using the wrong words: actual experts on concurrency and parallelism disagree with you:

http://existentialtype.wordpress.com/2011/03/17/parallelism-... http://blog.golang.org/concurrency-is-not-parallelism http://ghcmutterings.wordpress.com/2009/10/06/parallelism-co...

nknighthb · on Aug 24, 2013

Oh? Then why don't they say anything in their posts that disagrees with my view on the subject? (Hint: You're replying to the first comment I made in this thread, so the idea that I used the words "concurrent" or "parallel" in any particular way, much less wrongly, is objectively incorrect.)

plainOldText · on Aug 25, 2013

Maybe my post wasn't clear enough, but I wasn't trying to illustrate the word "concurrent" in its literal sense, but rather from a programming perspective. (also read gruseom's comment above)

Concurrent and parallel, in their literal senses can be synonyms – and sure enough, if you read the first source you've just linked, at 2a you'll see concurrent defined as "in parallel". In the case of programming, a concurrent program is not parallel, though it could be if you have multiple cores.

nknighthb · on Aug 25, 2013

Your post described only a possible manifestation of concurrency, and attempted to define that as what concurrency is in some sort of opposition to "parallel".

I suspect we're actually in agreement that ideally "concurrent" would be a description of capability and "parallel" a manifestation of that capability, but the fact is you're never going to get everyone to agree on (or remember) that[1]. So, again, new words are needed.

[1] Edit: I just found the later post of yours where you said this:

"I like to imagine that concurrent processes, concur (agree) on how they should share the time slices of the CPU, and parallel processes don't ever give a damn about each other, cause each has its own core, and like parallel lines, they never meet."

So, yet another novel definition of "concurrent". And yet you think other people are wrong. Heh.

rdtsc · on Aug 25, 2013

I like this quick comic from Joe Armstrong (Erlang's "father").

http://joearms.github.io/2013/04/05/concurrent-and-parallel-...

gruseom · on Aug 25, 2013

It's great, but all these definitions (this one, the ones at https://news.ycombinator.com/item?id=6270128 and so on) have slightly different meanings. It must be maddening to anyone trying to get it for the first time.

For example, are parallel processes always also concurrent? It's hard to imagine a more elementary question, yet the different definitions don't all answer it the same way. That alone casts some doubt on how well-defined these terms are to begin with.

rdtsc · on Aug 25, 2013

What helped me understand was this point

concurrency is a property of the algorithm, parallelism is a property of the execution environment

To expand on the above. This means that a particular problem can be talked about in terms of smaller sub problems. Example, you are serving a site. Sub problems are handling each client request. Another example of problem "crack password via brute-force method", sub problem is "try one particular password". Here is where discussion comes about whether there are concurrent sub-problems or not. We are not sure about how they'll run yet.

>For example, are parallel processes always concurrent?

Not sure what you mean by that. Are these processes solving one particular problem. Concurrency and parallelism make sense for a particular problem or algorithm. How are these processes related? Do they just happen to run on the same machine but otherwise are solving separate problems. Then maybe it doesn't even make sense to talk about either concurrency or parallelism.

Now you can turn this on its head an look at it from the point of view of a kernel designer. His very simplified algorithm is "fairly schedule processes and IO" for all the users. So his problem now deals with any two processes but these are now all part of a problem.

I guess I am trying to say that some questions just don't make sense to ask.

scott_s · on Aug 24, 2013

What that analogy misses is that drinking two cups at the same time is both parallel and concurrent.

vidarh · on Aug 24, 2013

It is very common to describe pre-emptive multitasking on a single core as concurrent processing. Your definition is a general dictionary definition - it does not necessarily fit well with usage in technology.

meowface · on Aug 24, 2013

You're right, both "parallel" and "concurrent" are ambiguous if using the dictionary definitions. Computer science has added somewhat new definitions to both of those terms but they're not really universally known or understood.

It'd be nice if new terms were used entirely, really.

revelation · on Aug 24, 2013

That definition makes very little sense for computing. The basis of multithreading is an illusion of concurrency.

barrkel · on Aug 25, 2013

No, multithreading is an OS abstraction providing the illusion of parallelism - which might actually be parallel, if more than one hardware executor is available - and is an implementation technique for concurrency, but not the only one.

eropple · on Aug 24, 2013

Unless you have multiple cores.

iooi · on Aug 24, 2013

> So we got concurrent units of work executed at the same time, I fail to see how that is not parallel.

I thought that by definition this is impossible under the GIL. Not completely sure, but would love to know. I have written thousands of lines using gevent and eventlet but have only achieved peaks of 10 Mb/s (on servers that have at least 100), and I'm sure that truly concurrent languages could fully take advantage of that throughput -- currently in the process of migrating from Python.

rdtsc · on Aug 24, 2013

It is hard to tell but if you mostly fetch data without processing it, copying it, compute it, then gevent can certainly handle concurrent socket operations.

Python's GIL won't let you execute Python code in parallel like say you start multiplying numbers in one thread and another. You won't multiply twice as many numbers because of the GIL. But for IO concurrency you should achieve parallelism (unless you have a string CPU consuming part in there as well).

meowface · on Aug 24, 2013

If you have anything that's even ever so briefly blocking gevent, then you will indeed be bottlenecked in that way.

I'd be curious to see if your migration does allow better bandwidth peaks, though.

ddorian43 · on Aug 24, 2013

what language are you migrating to ?

RivieraKid · on Aug 24, 2013

By real concurrency he clearly meant CPU concurrency.

rdtsc · on Aug 25, 2013

I got that, but pretended to not understand in order to illustrate a point.

That is a common misconception. And it seems to me nowadays most concurrency people deal with (at least when it comes to server and web back-end world) is heavier IO bound. Yet everyone automatically default to their CS 102 -- algorithms class when they think about solving graph problems in parallel or multiplying matrices. So concurrency automatically is implied to be CPU concurrency.

nshepperd · on Aug 25, 2013

I came in here expecting to see a pedantic debate about the use of the word "concurrency", and I was not disappointed. Though of course you have a point.

anaphor · on Aug 24, 2013

Thanks, people need to learn the difference between concurrency and parallelism ffs, I'm tired of explaining the difference all the time.

RivieraKid · on Aug 24, 2013

It doesn't really matter how you call it. There are more important things people need learn about.

anaphor · on Aug 24, 2013

Actually it is important to learn about, because it's possible to do parallelism without bringing in the issues of concurrency (non-determinism, synchronization) and thus write code with less potential for serious bugs.

RivieraKid · on Aug 24, 2013

Learning the difference between c. and p. won't help me write less buggy software. And I think there's no clear consensus what exactly is the difference.

Peaker · on Aug 24, 2013

It is true, however, that knowing about parallelism without (visible) concurrency is useful for writing less buggy software.

smegel · on Aug 24, 2013

> In fact Python's threads work well for IO concurrency.

So long as you don't have any CPU bound threads competing for the GIL ;)

rbanffy · on Aug 25, 2013

That's why profiling your code is so important.

If you have a CPU bound thread, it may be worth to pay the performance penalty of separating some of the program flow in different processes.

cperciva · on Aug 24, 2013

For what it's worth, BSD has had SO_REUSEPORT since BSD 4.4-Lite.

nemetroid · on Aug 24, 2013

For anyone curious: released in 1994.

sounds · on Aug 24, 2013

More information from 2010 about the way to do that in Linux: http://stackoverflow.com/questions/3261965/so-reuseport-on-l...

This is an example of a major downfall with free software: a developer decides he needs a feature so he implements it without taking any effort to see what has been done before – and more importantly, why.

It leads to the project sprouting thousands of new features while nothing achieves the polish and completeness of the original idea because the developer moved on to something newer and shinier.

I can't find the original blog post where I read the idea, but I did find one on Coding Horror: http://www.codinghorror.com/blog/2008/01/the-magpie-develope...

The Linux kernel solves this by having Linus, who has the long term perspective and the commitment to keep the project moving forward. I'm not claiming he's perfect, just that having him is the correct solution to the problem. Obviously here is someone who thinks the 3.9 kernel has a new feature he needs all the while ignoring past socket work.

simonw · on Aug 25, 2013

"This is an example of a major downfall with free software: a developer decides he needs a feature so he implements it without taking any effort to see what has been done before – and more importantly, why."

Reinventing the wheel is certainly a common flaw of developers, but I don't see what it has to do with free software. Are you suggesting that it's less present in non-open-source software development?

sounds · on Aug 25, 2013

Please be more specific than "non-open-source software development" – what exactly are you suggesting I'm suggesting?

Perhaps I could clarify by defining a cathedral/bazaar axis, and an open/closed axis.

A centralized effort can achieve the original idea more quickly than ad-hoc distributed effort.

Free software has this common downfall: a developer wants to reinvent the wheel, and nobody takes the time to educate him why that's a bad idea. It's easier to just accept his patch and forget about it.

What happens to closed software is completely invisible to the community, so I consider it irrelevant but perhaps worse than what happens to free software.

simonw · on Aug 25, 2013

It looks like I misunderstood you. When you said "This is an example of a major downfall with free software" I assumed you meant that it was a major downfall that was unique to free software, which implied that you thought it didn't affect other kinds of software development. Your reply here implies that you don't believe this, so I retract my previous comment.

sounds · on Aug 27, 2013

I could have stated that better. Thanks!

masklinn · on Aug 24, 2013

And OSX has inherited it from BSD.

joosters · on Aug 24, 2013

You never needed to prefork. One process can open a listening socket and share it with an unrelated process via file-descriptor passing.

rdtsc · on Aug 24, 2013

That is a cool trick! Does that work via pipes and maybe also unix sockets? I suspect latency might still be slightly better with a pool of pre-forked processes/threads.

bdarnell · on Aug 24, 2013

It uses unix sockets. Latency is the same as the standard pre-forking model; the only difference is that file-descriptor passing lets you manage the worker processes independently instead of requiring them to have a common parent process (this is important when rolling out new code to a service with a lot of active connections, since it's disruptive to restart all the workers at once).

Here's a demo in Python: https://gist.github.com/bdarnell/1073945

rdtsc · on Aug 24, 2013

That is pretty neat, thanks for sharing. I added it to my notes and code snippet for future.

joosters · on Aug 24, 2013

Yes, you can do it with unix sockets. Not sure about the latency, but you can pass the listen sockets in advance, rather than having one process accept()ing incoming connections and then passing those to other processes to handle. So unless you're binding to new ports all the time, it's all just a little extra startup work and won't impact the performance of the server.

bratsche · on Aug 24, 2013

Yes, you can do it with unix sockets.

wmf · on Aug 24, 2013

That was covered in the LWN article: "The first of the traditional approaches is to have a single listener thread that accepts all incoming connections and then passes these off to other threads for processing. The problem with this approach is that the listening thread can become a bottleneck in extreme cases." But since most people are dealing with non-extreme cases, they should be aware of fd passing.

joosters · on Aug 24, 2013

That sounds like the 'standard' model of several threads calling accept(), after all if you are multi-threaded you don't need to pass a file descriptor around, each thread will already have it.

With FD passing, you can have multiple processes, related or unrelated, pulling incoming connections from the same socket. You use the FD passing to share the listening socket.

jkn · on Aug 24, 2013

Am I right that this makes it trivial to deploy a new version of my server with zero downtime? I can just start the new server to handle new connections and tell the old one to stop accepting connections and quit when existing requests are completed, no need for another layer routing?

DonPellegrino · on Aug 25, 2013

That's exactly how I do it for my Node programs. Any service that I want to have 100% is using the cluster module, resulting in multiple processes listening to the same port. When I want to update, I replace the files and kill the processes one by one.

caf · on Aug 26, 2013

You could already do this by having a way for the new version to connect an AF_UNIX socket to the old version and request that the listening file descriptor be passed from old to new.

pfraze · on Aug 24, 2013

That seems correct to me.

haberman · on Aug 24, 2013

Interesting. It seems like one potential hazard is that bonafide port conflicts are not detected. If SO_REUSEPORT is preferred for performance reasons, and most/all servers are using it, then starting up a server that uses the same port as an existing service becomes a silent error.

It could even work as expected for a while (since the kernel gets to arbitrarily decide what port to deliver incoming requests to) only to intermittently fail later.

geocar · on Aug 24, 2013

I can't imagine people will start using SO_REUSEPORT by default, since the "performance reasons" are a happy accident of having a hint (that the process wants wakeups distributed across all CPUs). I'd rather get that hint in another way- perhaps by sharing an epollfd with multiple processes.

I would however like SO_REUSEPORT to run experiments: Right now we use iptables/tc to direct some traffic at "new versions" of some of our systems so we can run tests with live data, but connection tracking for localhost is lame. I'd much rather use SO_REUSEPORT.

ikeepforgetting · on Aug 24, 2013

Listening to the same port requires processes with the same uid.

rbanffy · on Aug 25, 2013

> then starting up a server that uses the same port as an existing service becomes a silent error.

Only if it has the same uid as the other one. It'd also be trivial to check whether the other processes listening to your port are "friendly" (as in "you don't want both Apache and Nginx listening on port 80").

jefftk · on Aug 25, 2013

Having Apache and Nginx both listening on port 80 might be an interesting way to A/B test them.

MertsA · on Aug 24, 2013

The second process would have to be run under the same user for that to happen though so a real production system would probably never be impacted by this but what would be nice is a flag to restrict the port to children of a particular PID or just lock it to one particular PPID.

wicknicks · on Aug 24, 2013

I imagine production servers would run monitoring processes to remove such "listening bugs".

audidude · on Aug 24, 2013

This could be useful for periodic tracing/profiling as well. Simply have a second instance with all debugging symbols and tracing enabled, but only accept() a client every X seconds.

hosay123 · on Aug 24, 2013

Sadly it doesn't work like that.. if 2 processes have the same port number bound, then approximately 50% of clients will hash onto the second receive queue. If the debug process only accepts a few connections every so often, then nearly 50% of traffic will essentially be dropped on the floor

It's also not possible to occasionally listen and unlisten.. that causes the hash modulus to change, sending traffic to the wrong sockets and (most likely) resetting all existing connections

pritambaral · on Aug 25, 2013

The hash modulus reset issue is being worked on. Source: the original lwn posting.

lttlrck · on Aug 24, 2013

That's a really neat idea. Thanks, it could be useful.

rgarcia · on Aug 24, 2013

This seems very relevant to people using Node, considering it has basically standardized around the "pre-fork" [0] model as a way to use more than one core. It'll be interesting to see where this goes.

[0] http://nodejs.org/api/cluster.html

robbles · on Aug 24, 2013

One detail that doesn't seem to be mentioned here or in the linked article is how the multiplexing of sockets is actually handled at the kernel level.

Does the kernel use some sort of round-robin approach to assigning client sockets to processes waiting on accept()? This is one area where I'd imagine a dedicated master process would be beneficial, as it could implement "smarter" load balancing based on the health and response times of its child processes.

gwu78 · on Aug 24, 2013

This is the -T option in W.R. Stevens' sock utility.

See Appendix C to his December 15, 1993 book on TCP/IP.

1993.

MalcolmEvershed · on Aug 24, 2013

It seems like this could help solve the thundering herd problem [0][1][2], no?

[0] http://en.wikipedia.org/wiki/Thundering_herd_problem [1] http://stackoverflow.com/questions/15636319/why-is-accept-mu... [2] http://uwsgi-docs.readthedocs.org/en/latest/articles/Seriali...

wmf · on Aug 24, 2013

Problem was already solved: "In modern times, the vast majority of UNIX systems have evolved, and now the kernel ensures (more or less) only one process/thread is woken up on a connection event."

MalcolmEvershed · on Aug 24, 2013

I believe that quote from [2] is referring to simply calling accept(), but modern socket servers use epoll() (or similar) before accept() which I think still has the problem (because I've run strace on nginx and uwsgi and I'm pretty sure I saw all processes wake-up from epoll()). So I'm thinking that with SO_REUSEPORT, each server process would have a different socket to epoll() on, and the kernel would only wake-up one process on a new connection, thus, solving the thundering herd problem for modern servers.

fooyc · on Aug 24, 2013

This is likely to consume more memory, because of copy on write pages (or lack of thereof).

Implementing the prefork model by spawning unrelated processes (by opposition to forking from a common parent process) is likely to consume more memory: each process is unrelated, and do not share copy on write memory pages with other processes.

nullc · on Aug 25, 2013

Shared library code already gets shared, so this may not be as bad as you think.

justincormack · on Aug 25, 2013

You could use SO_REUSEPORT with threads too. Linux threads are processes after all, with tweaked clone() options.

Refefer · on Aug 24, 2013

I'm a bit more worried about the security aspect of it.

Let's say that we are running a server on a port which uses this option to allow multiple processes to bind to it. What's to prevent a rogue process, perhaps with malicious intent, from starting up and siphoning off requests willy nilly? Sounds like a great way to implement a hard to detect MITM attack.

What would be nicer, I think, is if socket reusing was bound not only to the same uid but also to the process listening to it.

takeda64 · on Aug 25, 2013

As I understand you need to have the same EUID to be able to bind to the same port.

subim · on Aug 25, 2013

That's right. This article doesn't mention it, but the LWN article it cited (https://lwn.net/Articles/542629/) does.

pfraze · on Aug 24, 2013

You can mitigate that risk by using one of the first 1024 ports, since they require root access.

IgorPartola · on Aug 25, 2013

So this is inereating, except in the real world your parent process does more than the article implies. The big thing it is in charge of (and the thing that I have seen many of them get wrong) is (a) keeping the child processes running/restating them when they fail and (2) performing graceful config or code reload. The OS has no business doing the latter and would have a very hard time doing the former.

In fact I have seen issues where gunicorn failed miserably simply because it did not handle a bad import in a child process. Tornado as of the latest version I had used (2.0 I think) did not have any ability to check for dead child processes. I am sure there are more examples of this done wrong than right.

This is an interesting option for several use cases but you still need a parent process to monitor things. Perhaps at some point upstart or systemd will get good enough to monitor multiple processes per daemon in real time. Until then, meh.

Edit: actually, one cool thing you can do with this is code reloading. You simply have your parent process start more workers that attach to the same socket, then kill the old ones. That way the idea of code or config reloading doesn't need to be baked into every part of the worker.

sorbits · on Aug 25, 2013

> in the real world your parent process does more than the article implies […] keeping the child processes running/restating […] performing graceful config or code reload […]

The article suggests you let http://supervisord.org/ (or similar) take care of these things.

Amadou · on Aug 25, 2013

Is SO_REUSEPORT really all that much better than a server process that hands off incoming connections to other independent processes via an AF_UNIX socket with sendmsg/recvmsg?

If I understand SO_REUSEPORT right you let the kernel decide everything - access control, receiving process, timing, etc in exchange for not having your own process doing the same thing. Since that simplistic approach is the kind of thing that can be implemented in about 100 lines of user-space code doing file-descriptor sharing with sendmsg/recvmg via AF_UNIX sockets, I don't see the benefit of pushing that complexity into the kernel. Especially since if you want to exercise any greater level of control you'll just have to roll your own AF_UNIX based code anyway.

fexl · on Aug 25, 2013

"in the fork model a number of processes can grow uncontrollably."

You can use setrlimit to prevent that. Plus, your application is likely to have direct control over forking anyway.

bborud · on Aug 24, 2013

Why does the blog posting only mention fork and prefork as options? A very common way to design servers is to do multiplexing IO. The one-connection-per-thread/process isn't the only way.

That being said, this option can simplify things -- removing the necessity of having some moving part to distribute connections across completely independent processes.

jerf · on Aug 24, 2013

"Why does the blog posting only mention fork and prefork as options?"

Because this is a Linux kernel feature involving sharing a socket amongst multiple OS processes, and is therefore only interesting to talk about if you are using multiple OS processes. It's not a generalized primer on all techniques of handling IO.

joosters · on Aug 24, 2013

They aren't mutually exclusive. You can have multiple processes performing non-blocking I/O, as a way of scaling over several cores without multithreading.

bborud · on Aug 24, 2013

Exactly.

buster · on Aug 24, 2013

Cool, i didn't know that would be possible with sockets, sounds like a nice option. Although i wonder how efficient it is, but it may be worthwhile to spawn [number of cores] x [node.js | python | ruby] servers which themselves only run asynchronous functions, greenlets, etc. in a single thread..

mrottenkolber · on Aug 24, 2013

The same model I use in my soon to be released web server. :) Have a thread pool compete for an accept-lock. Performance isn't that bad actually. About the same as thttpd.

halayli · on Aug 24, 2013

nginx already scales by spawning multiple processes. The worker processes share the listening file descriptors from the parent master process which allows the workers to accept connections on the listening fds.

zzzcpan · on Aug 24, 2013

Meh. SO_REUSEPORT doesn't change the way socket servers are written. I was expecting something, like syscall batching for sockets, but not this.

rdtsc · on Aug 24, 2013

They way I understood it, it does change because it simplifies the server. There is no need for one top level listening process/thread. Each separate process/thread can open the listening socket independently.

(Btw, there is another interesting forking-for-client-connection pattern in Erlang. Instead of forking off and handling the client connection in a separate process, instead handle the client connection in the accepting process but fork-off another process to continue accepting. In general, just a process pool, that should be easier to set up with this new feature).

pfraze · on Aug 24, 2013

> There is no need for one top level listening process/thread

And as a result, the user can configure the prefetch pool.

masklinn · on Aug 24, 2013

And more importantly can easily alter it on the fly (just spawn a new "worker" or kill one)

RivieraKid · on Aug 24, 2013

Pretty much, it's just some small unimportant technical detail.

gargoiler00 · on Aug 24, 2013

why would anyone still be using threads or processes these days? :/ hardly scalable or efficient.

jerf · on Aug 24, 2013

You know, this [1] really ought to permanently put away the idea I think you're trying to reference, which is that only "event based" systems can be performant. There's plenty of "thread" or "process" based approaches that do quite well, including I believe the uppermost tier of every benchmark on that site. The idea that threads or processes are intrinsically slow was sheer unmitigated propaganda, and probably not only failed to contain a grain of truth, but are actively false. (Some thread implementations were slower than others, but that turns out to have been the implementations rather than the idea.) Event based systems inevitably have a lot of function calls in them, and that will probably in the end be slower than properly done threads or continuation-based approaches, always, because of that overhead.

[1]: http://www.techempower.com/benchmarks/

rdtsc · on Aug 24, 2013

People measure different things different ways and then draw conclusion (or tweak measurement parameters until it supports their already pre-conceived belief).

Event based system can be more performant in some cases and slow in another cases. If there is not much opportunity for CPU to do any work, then event based system will often outperform threads. One example is proxies. I already gave haproxy as an example, so I'll repeat it here as well. It is single threaded event based by default. It is certainly performant. Why? Because in a simplified model it just shuffles data from one socket to another. Pretty straight forward. Introducing multiple threads and context switches might just thrash caches around and actually make it worse (I have seen that happen).

Now add some CPU work in there. Say make each connection compute something, serialize some JSON. Like in those benchmarks, they use a DB driver get a row, serialize it and return. Ok there is some work. Now it is more likely that multi-threaded will help. But again one can surely tweak CPU affinities, thread pool sizes, hyper-threading BIOS settings, db driver types to really change things up. Threads take up memory. Not an insignificant amount. Now I like green threads, Erlang's processes, Go's goroutines because they are lightweight. (At least Erlang's processes map N:M to CPUs for parallel execution on the host machine).

So I guess my point is you are right that event based are not always and strictly more performant. But I also think in certain cases it can beat multi-threaded code (thread memory size, context switches, cache thrashing). That benchmark there, I wouldn't take it too seriously just like I wouldn't take Language Shootout too seriously.

jerf · on Aug 25, 2013

The whole event-based dogma is that event-based systems are not merely performance-competitive, but performance dominant. If they even tie, but also incur the extra development expense of significantly-increased code complexity, they still lose. If the event-based systems can't stomp thread-based systems in a benchmark, they're unlikely to do it in the real-world either carrying around the extra baggage of complicated code... it's not like event-based code scales up gracefully in size as the problem size increases whereas the (modern [1]!) threading approaches explode in complexity, what with the truth being the exact opposite of that.

Taking benchmarks too seriously is a problem; dismissing them too cavalierly is a problem, too. Those benchmarks may reflect the truth to seven significant digits... but based on what I see in there, I suspect they reflect the truth to about one and a half digits.

I've got some event-based code I manage at work, because it was the best choice. But it wasn't the best choice because of performance, or code complexity, or any of the other putative advantages of event-based systems, it was the best choice due to the local language-use landscape pushing me into a language in which event-based systems are the only credible choice. You know that comment that "design patterns show a weakness in your language?" I don't 100% agree with that, but it's true here; event-based server loops are a sign of a weakness in your language, not a good idea.

[1]: Here defined to a first approximation as "shared little-to-nothing" threading models, rather than the old-school approaches that produced enormous program-state-space complexity.

rdtsc · on Aug 25, 2013

I agree with you, hopefully you see that, but hopefully you can also see why for heavily IO bound application event based systems (basically code woven around a giant epoll/select/poll/kqueue system call) can be faster.

Modern machines are different than those 10-15 years ago. Caches and SMP typologies sometimes play serious roles in what could be an outcome of a benchmark. Threads are often heavyweight memory-wise. That is why the 10K problem had started to be solved better by event based systems.

Even looking at your benchmarks link, I would say more on the top are actually event based. "cpoll" ones look like event based centered around a polling loop. So is openresty -- which is a set of Lua modules working in nginx, also an evented server (but it is also mixed with a set of worker processes from what I understand).

And I like what you said about even if they are the same threaded ones are better. Yes. Not only that, for me it is 10x. Even if threaded ones are 10x slower and that is tolerable, the I would rather pick that. Why? Because code is clearer and matches better which the intuitive breakdown of a problem domain. That is why I like Erlang, Go, Rust and Akka -- actor models just model the world better (a single request is sequential there are clear steps that work in one after another to process it, but there is concurrency between each requests). An actor models that perfectly and I like that.

I also, like you, dealt with an evented promises/futures based system for years and it wasn't fun. It works great for little benchmarks and examples, once it grows it becomes a set of tangled slinkies that only the original writer (me in this case) knows how it works.

eropple · on Aug 24, 2013

> I also think in certain cases it can beat multi-threaded code (thread memory size, context switches, cache thrashing)

Sure. But he's replying to a zealot, so he's using zealot-comprehensible statements.

gargoiler00 · on Aug 24, 2013

> The idea that threads or processes are intrinsically slow was sheer unmitigated propaganda, and probably not only failed to contain a grain of truth, but are actively false.

Threads / processes:

  * Run some code from A
  * Save state, context switch
  * Run some code from B
  * Save state, context switch

  * Deal with locking, synchronisation, etc

vs

  * Run some code.

There is absolutely no instances where [num threads] > [num cores] is as efficient as not using more threads than cores.

jerf · on Aug 25, 2013

Funny, then, you'd think the benchmarks would show that, if it's so obvious, instead of showing the opposite.

The problem is that once you understand what lies behind your glib "run some code", you understand what the problem is. I mean, for one thing, the idea that in a busy server switching to a different event handler which has neither its code nor its data in any processor cache is not itself a "context switch" is a use of the term not necessarily connected to any reality, even if one might pass Computer Science 302 with that answer. Alas, we can not convince our CPUs or RAM to go any faster by arguing at them that they aren't making a "context switch".

But, you know, it's an open benchmark, and the benchmarks themselves aren't all that complicated. Do feel free to submit your event-based handlers that blow the socks off the competition. Bearing in mind that is the standard you've set here. Merely competitive means you've still lost. Nor do I see any "but benchmarks don't mean anything" wiggle room in your statements, because what you're talking about is exactly what is being benchmarked.

lttlrck · on Aug 24, 2013

To take advantage of multiple cores?

gargoiler00 · on Aug 24, 2013

Yeah because I have as many cores as I have concurrent HTTP requests, and obviously it's CPU bound...

erichurkman · on Aug 24, 2013

Sockets are used for a lot more than just serving HTTP requests.

gargoiler00 · on Aug 24, 2013

They certainly are. But my point is that if you're only servicing the same number of IO connections as you have cores, it's not extremely scalable.

Most networking servers should be dealing with hundreds or thousands of concurrent connections.

bdarnell · on Aug 25, 2013

The linked article didn't make this clear, but this feature is mainly designed for process-per-core models, not process-per-connection. The problem you run into with most existing process-per-core systems is that you can't ensure an even distribution of load across the processes without introducing extra overhead. SO_REUSEPORT offers some convenience when changing the number of processes, but the real benefit is that in this mode the kernel uses a better load-balancing scheme.

colanderman · on Aug 25, 2013

To hide I/O latency. You cannot do this effectively without threads without implementing your own scheduler, unless your I/O delays are constant and known a priori.