To Message Bus or Not: Distributed Systems Design (2017) (netlify.com)
290 points by gk1 on May 13, 2019 | 99 comments


In enterprise environments, I more often see overuse of message brokers. I am all for targeted use of messaging systems, but what I usually see is an all or nothing approach where as soon as a team starts using messaging they use it for _all_ interprocess communication.

This has some very painful downsides.

First, traditional message brokers (i.e. queues, topics, persistence, etc.) introduce a potential bottleneck for all communication, and on top of that they are _slow_ compared to just using the network (i.e. HTTP, with no middleman). I've had customers who start breaking apart their monoliths and replacing them with microservices, and since the monolith used messaging they use messaging between microservices as well. Well, the brokers were already scaled to capacity for the old system, so going from "1" monolith service to "10" microservices means deploying 10x more messaging infrastructure. Traditional message brokers also don't scale all that well horizontally, so that 10x is probably more like 20x.

Second, performance tuning becomes harder for people to understand. A core tenet of "lean" is to reduce the number/length of queues in a system to get more predictable latency and to make it easier to diagnose bottlenecks. But everyone always does the opposite with messaging: massive queues everywhere. By the time the alarms go off, multiple queues are backing up and no one is clear on where the actual bottleneck is.

What I would like to see more of is _strategic_ use of messaging. For example, a work queue to scale out processing, but the workers use HTTP or some other synchronous method to call downstream services. Limiting messaging to very few points in the process still gives a lot of the async benefits to the original client, while limiting how often requests hit the broker and making it easier to diagnose performance problems.


I work for a company that uses message queues for just about everything that moves between (different) hardware, and we've yet to experience problems from a properly configured queue setup.

We've used message queues for 30+ years, and I would say things like reliable messaging/guaranteed delivery, exactly-once delivery, as well as not having to care about byte order, have saved us a lot of trouble over the years.

We move data between mainframes, pc servers, and whatever system the customers use, and once the security setup is in place, the queues deliver data reliably.

Without message queues, each application would have to know the details about the receiver, or each client application would have to know the details about the sender, and the implementation would be very specific.

With message queues, you can replace part of your implementation, i.e. migrate it from the mainframe to a unix daemon, and nobody will notice.

That being said, it does come at a heavy cost, and as I started out by saying, we only use it for X-platform messaging. Anything done locally communicates via "something else": files, pipes, shared buffers, or similar. Also, anything idempotent, if using queues at all, is not using "exactly once" delivery, which has a large performance overhead.


An intermediary message broker can give you a HUGE benefit though. A touch more robustness (a service can disappear for a short while and hopefully nobody even notices), you get load balancing "for free" (with carefully designed services), and if you're careful with your framework you can keep failed messages sitting around for you to analyse or even replay later. Most likely with HTTP you're going to need some kind of middle man anyway (load balancers, proxies, etc). Though I agree with your conclusion, and I think it mostly stands for almost all tools. Know your problems and apply the appropriate solutions.


Yes. For simple setups, I'm usually biased towards having fewer moving parts. Queues sound nice until you start thinking about the edge cases with respect to errors, lost messages, rolling restarts, etc. In short, in pretty much every system with queues I've ever seen, this was a headache. There are good solutions for this of course, but they require extra work like checking dead letter queues, devops related to monitoring and managing the queues and dependent systems, etc. I find lots of teams cut corners here and then have to deal with the inevitable fallout of things going wrong.

An alternative that is usually simple to implement is some kind of feed-based system that you poll periodically. This can be dynamic or even static files. Feeds may be cached. Dependent systems can simply poll this. If they go down, they can catch up when they come back. If the feed system goes down, they will catch up when it comes back up. It also scales to large numbers of subscribers. You can do streaming or non-streaming variants of this. You can even poll from a browser.


> Feeds may be cached. Dependent systems can simply poll this. If they go down, they can catch up when they come back.

This is how Kafka works. The "feed" of messages is stored as a log. Clients maintain a reference to their latest position in the log, and pick up where they left off upon reconnecting.
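
A minimal sketch of that consume-from-where-you-left-off behaviour with the kafkajs client (topic and group names here are made up for illustration); the consumer group's committed offset is what lets a restarted client resume at its last position:

    import { Kafka } from "kafkajs";

    const kafka = new Kafka({ clientId: "feed-reader", brokers: ["localhost:9092"] });
    const consumer = kafka.consumer({ groupId: "feed-readers" }); // offsets are tracked per group

    async function run() {
      await consumer.connect();
      // fromBeginning only applies the first time; after that the committed offset wins
      await consumer.subscribe({ topic: "orders-feed", fromBeginning: true });
      await consumer.run({
        eachMessage: async ({ partition, message }) => {
          // message.offset is the position in the log; once it is committed,
          // a crashed or restarted consumer picks up right here
          console.log(partition, message.offset, message.value?.toString());
        },
      });
    }

    run().catch(console.error);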


> Second, performance tuning becomes harder for people to understand. A core tenet of "lean" is to reduce the number/length of queues in a system to get more predictable latency and to make it easier to diagnose bottlenecks. But everyone always does the opposite with messaging: massive queues everywhere. By the time the alarms go off, multiple queues are backing up and no one is clear on where the actual bottleneck is.

This is avoidable, however. It's trivial to aggregate queue depth across nodes and report it to an alerting service.
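
For example, a rough sketch assuming RabbitMQ with the management plugin enabled (URL, credentials and threshold are placeholders); summing depth across every queue is one HTTP call:

    // Poll RabbitMQ's management API and warn on total backlog.
    const THRESHOLD = 10_000;

    async function checkQueueDepth(): Promise<void> {
      const res = await fetch("http://rabbitmq:15672/api/queues", {
        headers: { Authorization: "Basic " + Buffer.from("guest:guest").toString("base64") },
      });
      const queues: Array<{ name: string; messages: number }> = await res.json();
      const total = queues.reduce((sum, q) => sum + q.messages, 0);
      if (total > THRESHOLD) {
        // forward this to whatever alerting service you use
        console.warn(`queue backlog: ${total}`, queues.map(q => `${q.name}=${q.messages}`).join(", "));
      }
    }

    setInterval(() => checkQueueDepth().catch(console.error), 60_000);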


I recommend looking at a standard messaging protocol like AMQP 1.0, which will allow you to implement the request-response pattern in an efficient manner without message brokers. Either "direct" client-server as with HTTP, or via an intermediary such as the Apache Qpid Dispatch Router.

With the dispatch router, you can use pattern matching to specify if addresses should be routed to a broker or be routed directly to a particular service.

This way you can get the semantics that best fit your use case.


> I recommend looking at a standard messaging protocol like AMQP 1.0, which will allow you to implement the request-response pattern in an efficient manner without message brokers.

Honest question: if the goal is to implement the request/response pattern then why should anyone adopt AMQP instead of plain REST/RPC-over-HTTP?


If you don’t use any other pattern than request-response, I agree there is no point.

If you have a mix of pub-sub, work queues and request-response, it could simplify your dependencies perhaps.

Also AMQP 1.0 has some nice async capabilities and acknowledge modes that I believe go beyond what HTTP/2 and gRPC support today.

OTOH I don’t have any real world experience operating such a mix of different communication patterns, so it could be the advantage is insignificant.


> If you have a mix of pub-sub, work queues and request-response, it could simplify your dependencies perhaps.

That's a good point. Indeed if the project is already rolling with a message bus then it wouldn't make much sense to increase complexity just to use a specific message exchange pattern.


E.g. that request might take more time than an HTTP request should. Also, by having workers that consume specific queues/topics, it's easier to scale them with more granularity.


Qpid Dispatch Router/AMQP 1.0 are both very interesting technology. I actually work for Red Hat, and do use Qpid Dispatch a fair bit.

Straight Qpid Dispatch (without waypoint messages to a broker) is basically providing something similar to HTTP, as far as end-to-end acknowledgements and "synchronous" processing (as in both the client and server must be up and stay up for the lifetime of the request). No surprise, this makes it way faster than a traditional broker, while still offering similar delivery guarantees.


Queues and messages are a great way to structure application logic (see Erlang, Go, etc). But all message exchange need not be run through the same (heavyweight) central broker.

In an app I recently worked on we used SNS for heavy lifting (e.g. it is replicated across regions), then a few smaller local queues, including just in-process, in-memory queues. It worked well.

Though yes, you want throttling, backpressure, and good instrumentation.


> First, traditional message brokers (ie queues, topics, persistence, etc) introduce a potential bottleneck for all communication, and on top of that they are _slow_ compared to just using the network (ie HTTP, with no middleman).

How is multiplexing supposed to work with plain HTTP without middleman? One of the main uses for message queues is efficient fan-out (m + n connections instead of m x n).


This article talks about the benefit of a message bus being pub/sub. This definitely helps to decouple the internals of apps, which in turn makes things easier to maintain.

There are many other benefits to using a message bus, and it's a better fit in general to distributed systems. The hard part is understanding how to structure apps to make them compatible with messaging.

Imagine sending out a command to the bus and not knowing when it'll get processed. Perhaps under a second, perhaps next week. You can't expect a reply at that point because the service that sent it may no longer be running.

If you're on Node, try https://node-ts.github.io/bus/

It's a service bus that manages the complexity of rabbit/sqs/whatever, so you can build simple message handlers and complex orchestration workflows


> Imagine sending out a command to the bus and not knowing when it'll get processed

I would love to hear how others are correlating output with commands in such architectures - especially if they can be displayed to users as a direct result of a command. Always felt like I'm missing a thing or two.

It seems the choices are:

* Manage work across domains (sagas, two phase commit, rpc)

* Loosen requirements (At some point in the future, stuff might happen. It may not be related to your command. Deal with it.)

* Correlation and SLAs (correlate outcomes with commands, have clients wait a fixed period while collecting correlating outcomes)

Is that a fair summary of where we can go? Any recommended reading?


My personal answer would be that commands (as defined as "things the issuer cares about the response to") don't belong on message busses, and there's probably an architectural mismatch somewhere. Message busses are at their best when everything is unidirectional and has no loops. If you need loops, and you often do, you're better off with something that is designed around that paradigm. To the extent that it scales poorly, well, yeah. You ask for more, it costs more. But probably less than trying to bang it on to an architecture where it doesn't belong.

You want something more like a service registry like zookeeper for that, where you can obtain a destination for service and speak to it directly. You'll need to wrap other error handling around it, of course, but that almost goes without saying.


I don't know about correlating output with commands, but if you're looking to correlate output with input, one option is to stick an ID on every message, and, for messages that are created in response to other messages, also list which one(s) it's responding to.

I would say that loosening requirements is also a reasonable option. You can't assume that anything downstream will be up, or healthy, or whatever. On a system that's large enough to benefit from a message bus, you can't treat failures as rare exceptions, and trying to build a system that acts like they are is likely to be more expensive than it's worth. For a decent blog post that touches on the subject, see "Starbucks Does Not Use Two-Phase Commit"[1].

[1]: https://www.enterpriseintegrationpatterns.com/ramblings/18_s...


Nice blog post! Certainly puts things into perspective in terms of how one should deal with errors, including sometimes just not caring about them much.


My whole take on this:

Commands can go through message busses and be managed easily or it could just be a sequence of async requests, but regardless of what drives commands and events at that point you should have a very solid CQRS architecture in mind. What should be acknowledged to the client is that the command was published and that's it. The problem is of course eventual consistency but it's a trade-off for being able to handle a huge amount of load by scaling separately both COMMAND handlers which perform data modification, and EVENT handlers that allow the side effects that must occur.

In a typical web app setup I would define a request ID at time of client request. The request creates a COMMAND which carries with it a request ID as well as a command ID. This results in an action and then the EVENT is published with the request ID, command ID, and event ID.

To monitor you collect the data and then look at the timestamp differences to monitor lag and dropped messages. With the events, you get all the data necessary to audit what request, and subsequent command, created a system change. To audit the full data change however, and not just which request caused what change, you need to have a very well-structured event model designed for what you want to audit.
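
A tiny sketch of what that envelope could look like (field names are purely illustrative), so every event can be traced back to the command and request that produced it:

    import { randomUUID } from "crypto";

    // Illustrative envelopes: each event carries the ids of the command and
    // request that caused it, which is what makes lag and audit queries possible.
    interface Command { requestId: string; commandId: string; issuedAt: number; type: string; payload: unknown }
    interface DomainEvent { requestId: string; commandId: string; eventId: string; occurredAt: number; type: string; payload: unknown }

    function commandFromRequest(requestId: string, type: string, payload: unknown): Command {
      return { requestId, commandId: randomUUID(), issuedAt: Date.now(), type, payload };
    }

    function eventFromCommand(cmd: Command, type: string, payload: unknown): DomainEvent {
      return { requestId: cmd.requestId, commandId: cmd.commandId, eventId: randomUUID(), occurredAt: Date.now(), type, payload };
    }

    // occurredAt - issuedAt is the per-command processing lag you monitor.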

You can't guarantee when a command or subsequent event will be processed, but that's fine. That's the whole point around eventual consistency. It's a bit uncomfortable at first, but use the lag monitoring and traceability as a debug tool when needed and really it's no problem. Also just shift the client over to reading from your read-specific projections on refresh or periodically and data will eventually appear for the user. It's the reason sometimes a new order might not appear right away on your order history for instance on Amazon, and in reality it's fine 99% of the time. Never have your client wait on a result. Instead think: how can I design my application to not need to block on anything? It's doable though it is quite hard and if you've only designed synchronous systems it will feel so uncomfortable.

And remember some things should not have CQRS design, backed by a message bus or not. These will be bottlenecks but they might be necessary. The whole system you design doesn't have to stick to a single paradigm. Need transaction IDs to be strictly consistent for a checkout flow to be secure and safe? Use good old synchronous methods to do it.

Core in all of this is data design. If you design your entities, commands, or events poorly, you will suffer the consequences. You will often hear the word "idempotency" a lot in CQRS design. It's because idempotent commands are super important in preventing unintended side effects. Same with idempotent events, if possible. If you get duplicate events from a message bus, idempotency will save your arse if it's a critical part of the system. If it's something mild like an extra "addToCart" command or something, no big deal really, but imagine a duplicated "payOrder" command ;).

To summarize, I correlate output to commands and requests by ensuring there are no unknown side effects, keeping critical synchronous components synchronous, designing architecture that complements the data (not the other way around), and ensuring that the client is designed in such a way that eventual consistency doesn't matter from the user perspective when it comes into play.


Trying to limit this to the AP's question about

> Imagine sending out a command to the bus and not knowing when it'll get processed.

In my systems I separate commands from event handlers based on asynchronicity.

Commands are real-time processors and can respond with anything up to and including the full event stream they created.

Commands execute business logic in the front end and emit events.

Commands execute on the currently modeled state by whatever query model you have in place chosen depending on needs for consistency.

What I suspect @CorvusCrypto is talking about is event handlers, which are in essence commands but are usually asynchronous.

They are triggered when another event is seen but could theoretically happen whenever you like. It could be real time as events are stored or it could be a week later in some subscriber process that batches and runs on an interval.

I separate commands from event handlers like this because commands tend to be very easy to modify and change in the future, they're extremely decoupled in that they just emit events that can easily be tested without having to do a lot of replay or worrying about inter-event coupling.

Event handlers on the other hand depending on type tend to be very particular/temperamental about how and in what order they get triggered.

I also find having a system with a lot of fat event handler logic to have a lot more unknown / hidden complexity, keeping as much of the complexity and business logic in the front end (RPC included) results in a much simpler distributed system.

All this hinges on the fact I'm sticking to strict event sourcing where events are after the fact and simply represent state changes which are then reduced and normalized per system needs.

I would also like to point out, I was careful here not to mention any kind of message bus or publishing because CQRS and event sourcing are standalone architecture choices.

CQRS/ES does not require a message bus, in fact it specifically sucks with a message bus at the core of your design because it forces eventual consistency and it puts the source of truth on itself.

CQRS/ES systems should have multiple message buses and employ them to increase availability and throughput at a trade off with consistency. CQRS/ES should not force you to make this trade.

A message bus is a tool to distribute processing across computers. It is not, and should not be, the central philosophy of your architecture. You should be able to continuously pipe your event store through a persistent RabbitMQ for one system that is bottlenecked by some third party API with frequent downtime problems. And you should be able to continuously pipe your event store through some ZeroMQ setup for fast realtime responsiveness in another system. Whether or not you choose to introduce system-wide inconsistency (or "eventual consistency") in order to pipe your events into your event store is up to you; figure out if the increased availability is worth the trade-off.


The downside of message busses is that they can be really hard to debug for third parties. I run into this in systemd, where a process will hang at startup waiting for a particular signal, and I have to hunt down all of the locations that signal could have been generated from to see which one failed. Much, much harder than the old system where you would find the line you stalled on and then read up a few lines to see what it was trying to do.


It also becomes difficult to reason about what code will actually run when a message goes out. Top level handlers are ok, but they can spawn messages which spawn messages which... lead to me running out of mental capacity.


>Imagine sending out a command to the bus and not knowing when it'll get processed. Perhaps under a second, perhaps next week. You can't expect a reply at that point because the service that sent it may no longer be running.

That's because this is missing the point of a message bus. Namely, communication with a message bus is meant to be one way; there is not supposed to be a response. If you want a timely response, you should set up a load-balanced autoscaling microservice that maybe hooks into some large backend store.


There is no explicit forbiddance of Request/Reply semantics in a message bus. Data on a message bus absolutely does not have to be one way.


When you have an asynchronous function that both requests and listens for a reply, what you really have is two functions. Or for that matter, one that listens and then sends. So yeah, you can go both ways on a message bus, just don't have functions sitting in limbo because of it.


As far as I know, message buses' reply semantics are typically only an acknowledgement that they've received the data. That's the way that Kafka, AWS Kinesis, and Azure Event Hubs work, though I'm not familiar with other message buses. You'll note that almost all architectural diagrams for these message buses, along with the article linked in this thread, show a one-way data flow in/out of a message bus.

Do you know of any message buses that have more complicated reply semantics? I'm thinking maybe the JavaScript world is calling things message buses that the big data / distributed systems world wouldn't consider message buses.


Often times these "Service Bus" type frameworks implement request/reply by creating a queue per host, and in every message adding envelope information that includes the queue that the reply should be sent to. This does allow a request/reply pattern. I've never actually done this, but I imagine in environments where reliability is more important than latency, it probably works pretty well.

One minor nitpick. The patterns you're talking about aren't typically implemented on Kinesis, since there's no concept of topics there. On AWS, you could do something like that with SNS + SQS.


I know the least about Kinesis, but my assumption is that it still has an OK-like response to any batch of events you send to it, does it not?

I think you might be misunderstanding me. Obviously I know about topics. In Kafka and EventHubs (where really you have a namespace containing multiple event hubs corresponding to topics) at least, there is no reply pattern whatsoever beyond an OK. I still don't know which specific distributed message queue platforms you and others are referring to that implement this pattern.


Sorry if I wasn't clear. You aren't relying on just the ack semantics, you're building bidirectional communication on top of software, in a way that's potentially a little unnatural.

Here's an example of how you'd accomplish this with Kafka. Whenever a new instance launches, it could generate a UUID, and create a Kafka topic with a matching name. Any messages it sends that it expects a response to could include a "reply-to" field, with this UUID. When something processes the request, they publish a message to that Kafka topic.
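
A rough sketch of that reply-to idea with kafkajs (topic names, headers and the correlation scheme are assumptions layered on top of Kafka, and it presumes topic auto-creation is enabled or the reply topic is created elsewhere):

    import { Kafka } from "kafkajs";
    import { randomUUID } from "crypto";

    const kafka = new Kafka({ clientId: "requester", brokers: ["localhost:9092"] });
    const producer = kafka.producer();
    const replyTopic = `replies-${randomUUID()}`;            // per-instance reply topic
    const pending = new Map<string, (v: string) => void>();  // correlationId -> resolver

    async function start() {
      await producer.connect();
      const consumer = kafka.consumer({ groupId: replyTopic });
      await consumer.connect();
      await consumer.subscribe({ topic: replyTopic });
      await consumer.run({
        eachMessage: async ({ message }) => {
          const id = message.headers?.["correlation-id"]?.toString();
          if (id) pending.get(id)?.(message.value?.toString() ?? "");
        },
      });
    }

    async function request(topic: string, body: string): Promise<string> {
      const correlationId = randomUUID();
      const reply = new Promise<string>(resolve => pending.set(correlationId, resolve));
      await producer.send({
        topic,
        messages: [{ value: body, headers: { "reply-to": replyTopic, "correlation-id": correlationId } }],
      });
      return reply; // whoever handles the request publishes its answer to the reply-to topic
    }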

Essentially when people are talking about "Service Busses", they're talking about frameworks that implement this and other similar patterns on top of generic queues like RabbitMQ, MSMQ and SQS. One such framework I've personally used is MassTransit.


It's also trivially easy to extend this to Kinesis with a lambda.


Sort of how any Turing machine can solve any problem solvable by any other Turing machine, almost any communication system can be used to implement other patterns of communication. You combine the primitives of the base to provide the structure you want above it.


NATS is a solution I have used in the past that supports Request/Reply semantics (I actually really enjoyed NATS). I am not sure Kinesis fits the traditional mold of a "message bus", but it certainly can be used as such, and I have used it as a message broker extensively. Kinesis with its tight integration with lambda can be trivially modified to add Request/Reply semantics with a lambda, but I will concede your point Kinesis is designed to be many ways in and one way out.


I built a framework in Ruby for both req/resp and pub/sub between two datacenters over RabbitMQ that massively outperformed the HTTPS calls it replaced. Unfortunately the only part I was able to open source was the log formatter.


AMQP 1.0 has request-reply built into the protocol. But it is really simple to roll your own by specifying a reply-to topic in the request message.


Azure Service Bus definitely has this, there's an explicit Reply To field and complicated correlation semantics.


> Imagine sending out a command to the bus and not knowing when it'll get processed. Perhaps under a second, perhaps next week.

Isn’t that the situation on any network?


> Imagine sending out a command to the bus and not knowing when it'll get processed

That's one of the things about tuple spaces[0,1,2] that was described in the book Mirror Worlds[3]. The author gives a lengthy description of a tuple's life. In practice, it really depended on what pattern you used to process the tuples.

0) https://software-carpentry.org/blog/2011/03/tuple-spaces-or-...

1) https://en.wikipedia.org/wiki/Tuple_space

2) http://wiki.c2.com/?TupleSpace

3) https://www.amazon.com/Mirror-Worlds-Software-Universe-Shoeb...


How does Bus compare to Celery?


Celery is a job execution system, nothing more. It can be used synchronously or asynchronously, in a central-bus type broker layout or a different one. It is orthogonal to the architectural patterns being discussed here, but can be used to facilitate many of them when deployed purposefully.


I meant specifically compared to https://node-ts.github.io/bus/, not message bus architecture in general.


Whether to use pub/sub or rest is entirely domain dependent.

So, who cares? It's an implementation detail. Just do whatever makes sense. The most sane systems I've seen make use of both.

> In an architecture driven by a message bus it allows more ubiquitous access to data.

Please stop mistaking design details for architecture. Lots of things allow more ubiquitous access to data.

Talking about this stuff in this way is just going to wind up with you replacing MySQL with Kafka and never actually solving any real problems with your contexts/domains.


What does architecture mean to you? The set of components and how they’ll interact is what it seems to mean in my world.


I'm not sure I have a clear definition.

What I do know is that I avoid confusing tools, frameworks, languages, libraries, or really anything you could call "software" with architecture. They're all just tools that enable me to draw boundaries around contexts.

To attempt to define it, I would say architecture has far more to do with _where_ I draw bounded context lines than how or what tools I use to do so.

I saved this talk[1] in my favorites because he manages to distill ways to induce the right boundaries out of business needs. (surprisingly not an uncle bob talk!)

For a more concrete example I separate a CQRS/ES architecture philosophy from the implementation detail of a message bus in this comment I wrote further down in the thread [2].

[1] https://www.youtube.com/watch?v=ez9GWESKG4I

[2] https://news.ycombinator.com/item?id=19905016


> To attempt to define it, I would say architecture has far more to do with _where_ I draw bounded context lines than how or what tools I use to do so.

Your definition makes no sense at all. There is no distinction. Architecture specifies which components comprise the system, their interfaces, their responsibilities, and how components are expected to interact with each other. You pick certain tools because they are expected to handle certain responsibilities, because they implement certain interfaces, and because they support or enable certain types of interactions. There is no distinction.


Well I did say I'm not sure I had a clear definition..

Distinction wise, didn't you just lay out the same distinction?

And please don't get me wrong, you can have a pub/sub, rest, or event driven architecture. I would just posit that you're basing your architecture on fundamentally the wrong thing. Those sorts of things are design choices, maybe big choices but choices that are much better off being driven by the needs of the problem at hand which in any complex system is many and diverse. Your systems framework should be flexible enough to cater to whatever makes sense with the problem set.

Pushing every single problem into the same rest api, rpc, or pub/sub event system is going to end the same way.

The next time you see someone talking about some kind of architecture driven by some kind of tool replace the tool's name with "Database" and see just how stupid it sounds to even be discussing it.

> Using a database will allow for the same communications, but it’s a little more simple. Single query/result, connection pool, and transactional models are supported out of the box. Service discovery is a matter of just querying the right tables. There is an operational cost of maintaining the database, and possibly having a piece of infrastructure that impacts all services. But, all the production grade databases support clustering, but still things can go wrong and it can lock the whole system up (looking at you MyISAM).

> The primary benefit of a database driven architecture is that data is freely available. Services just provide data and don’t mandate how it is used. You still have the necessary coordination in developing a system; part a generates rows like this and part b will modify it like that. But now you can have any new service start non-destructively consuming those tables. This free flow of data allows for rapid prototyping, simple services crosscutting, intuitive monitoring and easier development.


I've been burned by both but I think I still lean away from event based systems.

REST is more likely to result in a partially ordered sequence of actions from one actor. User does action A which causes thing B and C to happen. In event driven systems, C may happen before B, or when the user does multiple things, B2 is now more likely to happen before C1.

IME, fanout problems are much easier to diagnose in sequential (eg, REST-based) systems. If only because event systems historically didn't come with cause-effect auditing facilities or back pressure baked in.


> I've been burned by both but I think I still lean away from event based systems.

I think this is the right call overall but I don't want you to completely shy away from events as a tool. You can still use event modeling to decouple systems you otherwise would need to couple in an ugly way.

For instance, they're extremely powerful when it comes to retrofitting wanted side effects into legacy systems you're unable or unwilling to redesign around a new business need.

I've used event systems to add massive business value to an order system where the order and its legacy systems remain in a transactional RDBMS, but events are used to augment the order with information from various other systems that we previously had no insight into or way to link back to our RDBMS. (There was just too much data and it was ever growing in scope.)


I make a lot of design decisions not based on what I can understand but on what I’m willing to teach/support.

And I’ve never even witnessed someone teaching event thinking to people, let alone mastered that skill myself. People either self teach event based thinking or struggle. In a teamwork situation that property is a huge con. I have no interest in creating an underclass of developers so I can look good because I’m the only one who can fix certain problems.


Similar feelings for the first point, we differ on the second.

Personally speaking, I never saw any kind of underclass developing with these evented systems techniques, if anything, blockage decreased and teams got decoupled. (caveat that could have just been introduction of solid versioning and maintaining backwards compatibility)

I saw much greater attrition with the onset of unit testing and enforced code reviews. But that also led to much higher quality developers so maybe that had a bit to do with it. (the ones that stayed got better and the ones that joined were stronger)

> People either self teach event based thinking or struggle.

I feel like if you're dumping a new way of thinking on your staff, any staff, and then expect them to self-teach or have zero support, you're asking for problems. I don't think that makes event systems more complicated than REST. REST involves a crap tonne of thinking to get right if you're able to easily blur the lines between RPC and REST, or the client and server code.


Isn't sequentiality just an implementation detail too?

We use Apache Camel to orchestrate messages and you just declaratively state where you want sequential vs multicast / parallel behavior.


I don't mean sequentiality of events, I mean sequential steps in the process. All of the side effects of my button press happening in order. If you're trying to say that's an implementation detail instead of a design element, then I'll counter that by that yardstick, the difference between email and Slack is just an implementation detail.


I mean that is orthogonal to whether you are using messaging or not.

With API calls you default to doing them in order but you can certainly multithread the calls and lose the deterministic order if you want. With messaging you default to sending them off in parallel and not knowing which are received / processed / completed before the others, but the non-default is available there too. You have a message router that knows what has to be ordered and what doesn't and manages the flow. In both cases the non-default requires a bit more work but in both cases it's available and it just depends what your most critical needs are.

One thing though is that as systems scale up and get more complex, you want less and less to have blocking API calls anywhere that you don't need them, as those quickly become your bottlenecks and hard failure points.


Yes but I don't think hinkley was stating otherwise, only that it is generally speaking easier to reason about when things are sequential. Of course, personally I find RPC to be impossible to reason about because you can't really guarantee anything in a distributed system.


Totally agreeing with you.

<rant>

A message bus is to do things like fan-out, fucking messages, not act-as-a-fucking-data-store. Situations that need a message bus are so fucking few and far between I can't believe they posted this article.

Message bus is the fastest way to distribute your monolith... I don't know why people are always trying to get rid of REST like it's the problem (it's not; your code/infrastructure is)...

</rant>


The first sentence already is triggering me.

No, it's not hard. Like most topics in software engineering there are 50+ years of pretty successful development, backed by science, backed by software from the 70s, backed by all seniority levels of engineers.

The problem nowadays is that people don't want to learn the proven stuff. People want to learn the new hip framework. Learning Kubernetes and React.js is much more fun than learning how actual routing in TCP/IP works, right?

The problem is that something can only be hip for 1-5 years. But really stable complex systems like a distributed network can only be developed in a time frame of 5-10 years. Therefore most of the hip stuff is unfinished or already flawed in its basic architecture. And usually if you switch from one hip thing to another, you get the same McDonald's software menus, just with differently flavored ketchup and marketing.

So if you feel something is hard it might be because you are not dealing with the actual problem. For instance you might think about doing Kafka, and that's fine. But be aware that email is shipping more messages than Kafka and has been doing it longer than Kafka.

For instance topologies: There is no point-to-point. There's star, meshed, bus etc. See here: https://en.wikipedia.org/wiki/Network_topology

If you don't know your topology it might be star or mesh. But it's still a topology or a mix of multiple topologies.

And if you develop a distributed system you really need to think about how your network should be structured. Then if you know which topology fits your use case you can go and figure out the way this topology works and what the drawbacks are. Star networks (like k8s) for instance are easy to set up but have a clear bottleneck in the center. A bus (like Kafka) is like a street: it works fine until it is full, and there are sadly some activity patterns where an overloaded bus will cascade and the overload is visible even weeks later (although you have already reduced the traffic), if you don't reset it completely from time to time.

It's not magic. You can look all of it up on wikipedia if you know the keywords. Also there is not a single "good" solution. It depends always on how well the rest of your system integrates with the pros and cons of your topology choice. And if you use multiple topologies in parallel you have a complexity overhead at some point, which is why working in big corps is usually so slow.


> The problem nowadays is that people don't want to learn the proven stuff. People want to learn the new hip framework. Learning Kubernetes and React.js is much more fun than learning how actual routing in TCP/IP works, right?

That's an awfully short-sighted comparison.

There are far more job offers for deploying and managing systems with Kubernetes and for developing front-ends with React than there are for developing TCP/IP infrastructure. It's fun to earn a living and enjoy the privileges of having a high salary, and the odds you get that by studying solved problems that nowadays just work are not that high.


If my livelihood depends on doing bullshit then of course I will also do bullshit. But that doesn't stop me from applying for other jobs or from creating random HN accounts and complaining about it. ;-)

I also found it's not bad everywhere though. If you treat the smart people around you well here and there you will get some opportunity to actually change something.

So what I also try to do is getting people out of the mindset that they are actually doing something reasonable when they do this bullshit circus to pay the rent. It's possible when you are really frustrated to spend a few hours here and there to learn the actual stuff instead of the new hip stuff. And over time you will thereby be able to solve more and more problems with actual solutions.

An example from my own life: At one point I really learned about Ansible, Chef, Puppet etc. Then I learned about actual configurations, improved my knowledge of ssh, bash scripting etc., and in the end I wrote bash scripts that replaced all the Ansible I had used. The results were more flexible, the error messages more readable (thanks to set -x you were able to see what actually went wrong), it was a lot less than 1000 lines of code, and it was a lot of fun to do some actual problem solving for a change.


Besides the distributed case, a message bus is invaluable in building crash-proof applications. I've used a lightweight message bus within the app itself, for better crash recovery on long running tasks. E.g. I need to generate lots of emails and send them out. I would create a small command object for each email and queue it to the message bus, and let the message handler handle email generation one by one. The lightweight command objects can be queued up quickly and the UI can return to the user right away. The slow-running email generation can run in the background. In the event of a system crash, the message bus will be restored automatically and continue on the remaining tasks in the queue.
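
A sketch of the shape of that flow (the queue interface here is a made-up stand-in, not a specific library): one small command per email is enqueued up front, and a handler works through them after the UI has already returned.

    // Hypothetical durable queue client; only the shape of the flow matters here.
    interface Queue {
      publish(msg: object): Promise<void>;
      subscribe(handler: (msg: any) => Promise<void>): void;
    }

    interface SendEmailCommand { type: "SendEmail"; recipient: string; templateId: string }

    // In the request handler: enqueue cheap command objects and return immediately.
    async function requestBulkSend(queue: Queue, recipients: string[], templateId: string) {
      for (const recipient of recipients) {
        const cmd: SendEmailCommand = { type: "SendEmail", recipient, templateId };
        await queue.publish(cmd);
      }
      // The UI can respond now; generation and sending happen in the background.
    }

    // In a worker: generate and send one email per message. If the process crashes,
    // unacked messages are redelivered and work resumes where it left off.
    function startWorker(queue: Queue, send: (cmd: SendEmailCommand) => Promise<void>) {
      queue.subscribe(async msg => {
        if (msg.type === "SendEmail") await send(msg as SendEmailCommand);
      });
    }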


What if your process to generate the initial messages crashes halfway through?


You return the error to the user.

Crash handling is a matter of managing user expectation. If the user hits a button and the UI shows the command as success, the system better ensures the command will complete eventually. If the user hits a button and the UI shows an error, well the user can try again when the system is up.


Ah, let me restate the question: suppose the user has clicked a button to approve sending 5,000 emails; the server process puts 2,000 messages onto a queue for later processing--but then, something occurs to interrupt adding the remaining 3,000 messages to the queue.

Presuming the queue doesn't support any transactionality beyond per-message, and optionally supposing a queue consumer has already started sending some of these emails, how do you recover from the fault and help the user send the rest of the email (without duplicating outbound email)?


You update the status of each message when it's acked. You show a live count of messages sent / total messages on the screen where the user sent it. 2000/5000 sent... if those 3,000 never get sent, it will be obvious to the user.

If you want the user to be able to try re-sending, you can provide that functionality... you'd need to "cancel" the outgoing messages using a separate queue, re-sending when each cancellation is acked.


It seems as though you might be describing a process with state stored in an RDBMS or the like. Which, while a perfectly reasonable approach, is not much like the initially-described case of firing a bunch of "send email to foo@example.com"-type messages into a queue, subsequently to be drained and acted upon by possibly-remote workers.

What I am trying to uncover here is how one might expect to use a queue, on its own, to support a very-much-non-idempotent interruptible non-transactional process.

Also, what does it mean to "cancel" an already-sent email?


I have yet to use it, but isn't Java EE's Batch API used for things like this? It tracks the progress of long running tasks (e.g. the index of the current mail) and can continue from the last sent index in case of some error.


Queue is a component of the solution for this kind of non-transactional process, not the entire thing.


I've solved this problem by adding a status field to my email objects/rows which can assume the values "Unsent, sending, sent, error". I guess it is a bit self-explanatory, but status values are set like this:

- Unsent: default status
- Sending: message dispatched to broker
- Sent: send task/message ack'ed
- Error: send task/message hit any other exception

This of course assumes that your 5,000 emails are not ephemeral and are in a database.

If by "doesn't support any transactionality" you mean ack's aren't possible for one reason or the other then of course you have to go a slightly different route (pun somewhat intended) by publishing a "result" task/message and updating your database rows based on what comes back on this new "task_result" queue.


I guess the OP's point is that if you're storing everything in a db anyway, i.e. one row per user, per email, and a sent/not sent status flag against each row, then why do you need a messaging system? The email-sending system could just poll the db for emails to be sent (by status = not sent), and then update the status after sending, as well as a timestamp for later cleanup. A more robust system is then achieved completely without messaging.

Of course polling is never a great design, but I guess that's the crux of his question.
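
For what it's worth, the polling variant can be made safe for multiple workers with row locking; a sketch with Postgres-flavoured SQL behind a hypothetical db.query() helper (table and column names invented):

    type Db = { query: (sql: string, params?: unknown[]) => Promise<{ rows: any[] }> };

    // FOR UPDATE SKIP LOCKED lets several workers poll the same table without double-sending.
    async function pollOnce(db: Db, send: (row: any) => Promise<void>) {
      const { rows } = await db.query(`
        UPDATE emails SET status = 'sending', picked_at = now()
        WHERE id IN (
          SELECT id FROM emails WHERE status = 'unsent'
          ORDER BY created_at LIMIT 100
          FOR UPDATE SKIP LOCKED)
        RETURNING id, recipient, body`);
      for (const row of rows) {
        await send(row);
        await db.query("UPDATE emails SET status = 'sent', sent_at = now() WHERE id = $1", [row.id]);
      }
    }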


Push vs pull, auto redelivery, scalable, distributed storage vs single DB.


Imo I'd resort to an ACID datastore if permitted but I'm curious if there are common patterns that enable this with messaging systems.


In the newly stated problem, it's a transaction problem rather than a message bus problem. In any case, you don't want to start sending emails while still accepting (queuing) the user's command. I mean, you could do it; you just have to make it clear to the user that the process is best effort and can fail halfway through (setting expectations).

Anyway, there're ways to address the problem. Message bus can certainly help.

If the message bus supports queuing transactions, great. Just mb.beginTrans(), mb.queue(), mb.queue(), ..., mb.commit(). The consumer won't be called until the whole batch is committed. On a crash, the whole batch is aborted. Presumably the user is notified the command failed and a retry is needed.

If queuing transaction is not supported, there're several strategies to deal with it.

1. First method.

1a. Pack the 5000 email id as one message and queue it. This is like squeezing everything into one transaction.

1b. The consumer unpacks the message into the list of ids, walks the list with a cursor, and processes each id.

1c. After processing each id (sent email), saves the cursor pointer, in a file or in a db table.

1d. After a crash, the message bus will re-deliver the whole message to the consumer. Unpack the list. Read the saved cursor pointer to see where the last processed id was. Set the cursor to continue beyond that pointer.

1e. In the worst scenario, the crash occurs after the email is sent but before the cursor pointer is saved. At recovery, a duplicate email is sent. Email cannot be rolled back, so it really cannot participate in a transaction. There's nothing you can do about it. It's an acceptable business wrinkle.

2. Second method.

2a. Queue the email id one by one, attach the user's session id or some pseudo transaction id along with each one (e.g. today's date as the pseudo id).

2b. On a crash, ask the user to re-submit all the email id again, and queue the email id one by one again, with the same pseudo transaction id. You might have duplicate entries queued up.

2c. Have a table recording the processed id's.

2d. The consumer picks up the email id and the pseudo transaction id. Consult the processed table to see if the (email id, pseudo-id) has been processed. If so, skip. Otherwise, process it and save the id in the processed table.

2e. Again in the worst case, a crash occurs after an email sent but before saving the id to the processed table. In that case, a duplicate email is sent out.

3. Third method.

3a. On some task command that's transactional, unlike email, you can set up a transaction to do the task command and updating the cursor pointer at the same time (with method one). A crash would roll back the command and roll back the cursor pointer update.

3b. Same with method two. Set up a transaction on the task command and the insert of the id into the processed table.

3c. If the transactional task command is performed on a separate system, two-phase commit can be used to bind the command and the cursor pointer update (or processed id insertion) together into a transaction.
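
A sketch of method 1's cursor checkpoint on the consumer side (the cursor store and the bus ack are placeholders, not a specific library):

    // Method 1: the whole batch message is redeliverable; a persisted cursor
    // records how far we got so redelivery resumes instead of restarting.
    interface CursorStore {
      load(batchId: string): Promise<number>;               // 0 if never seen
      save(batchId: string, index: number): Promise<void>;
    }

    async function handleBatch(
      msg: { batchId: string; emailIds: string[] },
      cursor: CursorStore,
      sendEmail: (id: string) => Promise<void>,
    ) {
      let i = await cursor.load(msg.batchId);
      for (; i < msg.emailIds.length; i++) {
        await sendEmail(msg.emailIds[i]);
        // Worst case: a crash between send and save means one duplicate email (1e above).
        await cursor.save(msg.batchId, i + 1);
      }
      // Ack the bus message only after the loop completes.
    }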


I strongly suggest anyone interested in this type of architecture read https://www.goodreads.com/search?q=Enterprise+Integration+Pa... - it is a really well thought out catalog of patterns and solutions for message-based systems.


It's weird for me to not see even a single mention of any BEAM languages, such as Erlang or Elixir. They are naturally distributed and have discovery, networking and messaging built into the virtual machine itself.


My main gripe with service buses is that they can be very hard to deploy and test automatically. At least for traditional ‘middleware’ like WebSphere/MQ, Weblogic etc. It potentially adds another monolith to your architecture, which, whilst fancy, may not be required. Using ZeroMQ or similar ‘lightweight’ tech could be a better choice for small teams as it is easy to integrate into containers and testable.


I'm curious to hear others' opinions on using a database as a message queue. One issue I have with most message brokers is that you cannot perform ad hoc queries or manipulations of the queue.

When you've got a backlog situation, it's nice to be able to answer questions like: how many of these queued messages are for/from customer/partner X?


It must be noted that service meshes bring some of the message bus advantages to the point-to-point architecture.


Somewhat, but it's a different use case. I'd say the main difference is that message buses are better for non-urgent, hopefully-soon-but-definitely-eventually workloads, while creating something like a message bus between services within a mesh will impose more urgency.

Anecdotally I've heard that extremely chatty services (like something that approximates a message bus) are considered poor mesh design but I don't really understand why that is the case so long as the service architecture is kept clean


I agree. The advantages I was referring to are things like adding monitoring, which is mentioned in the article.

Interesting point about chatty services :-)


A couple of big reasons I don't like message buses (but I'm open to hearing about why I'm wrong):

-All "requests" will succeed even if malformed

-Couples producers/consumers to the bus (unless you put in the extra work to wrap it in a very simple service)


> -All "requests" will succeed even if malformed

Took me a minute to grok this, but I think you mean, message bus clients can send 'any old shit' and the broker will happily queue it?

A lot of the more 'managed' (abstracted) clients in the statically typed world deal with this by structuring queues by the object type/interface. A bad actor could probably circumvent this if they really wanted, but for normal usage it will at least ensure that objects sent are of the expected type.

This means that in real-world usage, this isn't an issue.


The fix for the first point is client-side filtering. That can be hard to set up if you don't enforce it to begin with. I'm currently dealing with poor/no client-side validation at work and it does become a complete fucking pain dealing with poor upstream schema hygiene.

I don't see why the second point is a bad thing, at all. Message buses are meant to provide a non urgent abstraction layer to simplify the relationship between producers and consumers (if you need urgency, don't use a message bus). It simplifies load balancing the consumer stage and doesn't impose any limits on producers (many of which can be clients released into the wild).


> All "requests" will succeed even if malformed

A good pattern is to use a schema to verify you're sending valid data in the producer (at the very least in a unit test, if not at runtime) - for example, the Confluent-wrapped Kafka ships with a schema registry and painless Avro-based Kafka producers.

If overhead in the producer is stopping you from validating each record at creation time, then you can validate downstream, so long as your consumers agree to only consume the validated data.
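
As a concrete sketch of producer-side validation with the avsc library (the schema and record are invented; the real Confluent schema registry additionally prefixes each payload with a schema id):

    import avro from "avsc";

    // Invented schema for illustration; reject bad records before they reach the broker.
    const orderEvent = avro.Type.forSchema({
      type: "record",
      name: "OrderEvent",
      fields: [
        { name: "orderId", type: "string" },
        { name: "amountCents", type: "long" },
      ],
    });

    function encodeOrThrow(record: unknown): Buffer {
      if (!orderEvent.isValid(record)) {
        throw new Error("refusing to publish malformed OrderEvent");
      }
      return orderEvent.toBuffer(record); // this buffer is what you hand to the producer
    }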


> Couples producers/consumers to the bus (unless you put in the extra work to wrap it in a very simple service)

This is an area where adding a simple layer of abstraction is a good thing - most times, all you want is something like a `Send<T>`/`Send(obj)` method, which really will translate fairly universally across message busses.

I used such a thing recently, switching out RabbitMQ for MQTT - all I really had to do was point to a new implementation of the interface!
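
Something like this is usually all the abstraction that's needed (a sketch; the concrete RabbitMQ/MQTT implementations are left out):

    // The only thing most application code needs to know about.
    interface MessageSender {
      send<T extends object>(topic: string, message: T): Promise<void>;
    }

    // Application code depends on the interface...
    async function publishOrderPlaced(bus: MessageSender, orderId: string) {
      await bus.send("orders.placed", { orderId, placedAt: Date.now() });
    }

    // ...while the concrete transport (RabbitMQ, MQTT, SQS, ...) is picked at wiring time:
    // const bus: MessageSender = new MqttSender(client); // hypothetical implementation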


- The consumer needs to validate the incoming requests. Just like any input sources, the incoming data cannot be trusted until validated.

- You need coupling somewhere anyway. Moving the coupling to the bus lets the consumer and producer evolve more freely.


> - The consumer needs to validate the incoming requests. Just like any input sources, the incoming data cannot be trusted until validated.

That's correct. And the producer will be long gone by the time the consumer attempts to validate that message and rejects it.


Indeed. Here's an example of a case against messaging in distributed systems design: https://www.brisbanetimes.com.au/national/queensland/educati...

A synchronous call would have failed when the operator issued the request, and the onus is on the operator to find alternate ways of contacting the police.


Message bus proponents never build large systems. The only sane way is to be pretty specific about data flow, with a clear mental model shared between developers and ops. A message bus hides producer-consumer relations, and with multiple endpoints it's very hard to reason about the system as a whole.


Some of the most complex systems in the world use a message bus. The entire automotive and aviation industries come to mind.

Being able to see a message between systems and have an analyst figure out what's wrong and what it needs to be is an awesome way for a developer to actually code.


Why do you say that? I'm someone who's worked on a few very large message-based systems in finance for years (100s of devs working together in multiple countries). I found that messaging and workflow orchestration were the things that helped keep things sane.


I've seen people succeed in a big way with message bus architectures and I have seen them fail in a big way.

The master symptom I've seen is that people queue work and that work never gets unqueued. It's amazing how many real-life systems fail with the same symptom.


That's a weird situation, I've never seen that happen personally. A good service bus should avoid this.

For example, https://node-ts.github.io/bus/ subscribes its queue to all message topics that it handles. If, for some reason, a message arrives that no internal handlers exist for, then the message is simply logged and discarded.


Perhaps that could be solved with monitoring the number of pending requests and alerting over a certain threshold?


That diagnoses the problem, sets the stage for solving it, but doesn't actually solve it.


Isn't that an "easy" problem to solve, in the sense that you just increase the processing capacity of the consumer?


The largest system I ever worked on used a message bus for almost everything, and it worked great.

That said, 100% agreed about sanity. There was a lot of time spent on ensuring that all the developers (and there were a lot of them) had a decent shared understanding of everything, and a very deep understanding of anything they worked on or interacted with directly. Meetings to review the current way data flowed and look for ways to clean it up were a regular occurrence, and ops was deeply involved in everything.

The articles that say or imply, "Don't worry about it, because it's easy for everything to talk or listen to everything else!" strike me as having probably been written by someone whose message bus system hasn't been around for all that long.


I see. My original comment was not clear. I also use queues a lot in my systems, but only as an implementation detail, and they help to solve scalability and persistence problems. Message bus proponents like to place a common bus as the central medium to pass data between components, aka fire an event and somebody (often multiple handlers) will take and process it, easy-peasy, ha.


Absolutely not true, I’m afraid. I’ve architected or worked with message bus architectures in large-scale medical payments processing systems, as a central broker for a national-level telecommunications company, multiple Fortune 500 consumer-oriented companies with complex sales workflows, and so on. This doesn’t obviate the need to be clear on producer-consumer relations, as you point out, but that’s a governance issue, not a technical architecture problem, and one that is at least as severe with API-only architectures.


I've worked directly with enormous systems that you have likely personally used that were not only highly coherent and well designed, but also message oriented in architecture.


Are you aware that LinkedIn developed and uses Apache Kafka for critical production services?

I also work on a large production service that uses a message bus. There are certainly amateurish uses/implementations of message buses out there but definitely not all of them




