I’m not sure about a reference, other than their docs [0]. Basically, you’d modify the config to point to the new servers, issue PAUSE to PgBouncer to gracefully drain connections, then RELOAD to pick up the new config, then RESUME to accept new traffic.
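For concreteness, a rough sketch of that sequence from the shell, using PgBouncer's admin console (the pool name `mydb`, the admin user `pgbouncer`, and the default admin port 6432 are assumptions; match them to your pgbouncer.ini):

```sh
# 1. Edit pgbouncer.ini so the [databases] entry points at the new server.
# 2. Drain: PAUSE waits for server connections to be released.
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "PAUSE mydb;"
# 3. Re-read the edited config.
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "RELOAD;"
# 4. Resume; traffic now hits the new server.
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "RESUME mydb;"
```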
This would result in client errors while paused, though, so perhaps not quite the same. To me, a few seconds of downtime is fine, but everyone has their own opinions. EDIT: you could of course also modify your client code (if it doesn’t already) to gracefully retry connections, which would effectively make this zero downtime.
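On the retry point, it can be as little as wrapping connect in a backoff loop; a minimal sketch with psycopg2 (the DSN, attempt count, and backoff are placeholder values, not recommendations):

```python
import time
import psycopg2

def connect_with_retry(dsn, attempts=5, base_delay=0.5):
    """Retry transient connection failures, e.g. during a pooler cutover."""
    for attempt in range(attempts):
        try:
            return psycopg2.connect(dsn)
        except psycopg2.OperationalError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```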
ProxySQL (which I think now supports Postgres) has a global delay option that effectively makes clients think a query is just taking a long time; meanwhile, you can run the same sequence outlined above.
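If I remember right, that's the `mysql-default_query_delay` variable (in milliseconds); something like this from ProxySQL's admin interface, though check the docs for whatever the Postgres-side equivalent is:

```sql
-- Delay every query by 10s so clients see "slow", not "down".
UPDATE global_variables SET variable_value = '10000'
 WHERE variable_name = 'mysql-default_query_delay';
LOAD MYSQL VARIABLES TO RUNTIME;
-- ...repoint the backend, then set it back to '0' and load to runtime again.
```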
If you had HA Bouncers (which hopefully you would), you could cheat a little, as you alluded to in the post, and have one still allow read queries to hit the old DB while cutting over writes on the other, so the impact wouldn't be as large.
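A sketch of that split as the two bouncers' `[databases]` sections (hostnames made up):

```ini
; bouncer-a/pgbouncer.ini -- still serving read traffic from the old DB
[databases]
appdb = host=old-db.internal port=5432 dbname=appdb

; bouncer-b/pgbouncer.ini -- writes cut over to the new DB
[databases]
appdb = host=new-db.internal port=5432 dbname=appdb
```

Note the routing itself has to happen at the app level (reads pointed at bouncer A, writes at bouncer B); PgBouncer doesn't split traffic by statement type.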
> you’d modify the config to point to the new servers, issue PAUSE to PgBouncer to gracefully drain connections, then RELOAD to pick up the new config, then RESUME to accept new traffic.
The function we wrote effectively executes these steps [1]. I think it would look similar if we had used PgBouncer. I could see it being an option, though, if we couldn't scale down to "one big machine".
> This would result in client errors while paused, though, so perhaps not quite the same.
What? Docs say:
> New client connections to a paused database will wait until RESUME is called.
Which fits what I remember from testing PgBouncer as part of automatic failover ages ago: if the connection from PgBouncer to the database dropped, it would block until it reconnected, without the app erroring.
I stand corrected! It may also depend on the application itself, timeouts, etc. I’ve seen errors before when doing this, but now that I think about it, it was on the order of a handful of connections out of thousands, so it was probably poor client handling, or something else.
I think he means already established connections, but I'm not sure.
Edit: not true, actually. PAUSE will wait for the connections to be released (disconnect in session pooling, transaction end in transaction pooling, ...).
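Which of those applies is just the `pool_mode` setting; for illustration, a hypothetical pgbouncer.ini excerpt:

```ini
[pgbouncer]
; When PAUSE considers a server connection releasable:
;   session     -> when the client disconnects
;   transaction -> when the current transaction ends
pool_mode = transaction
```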
[0]: https://www.pgbouncer.org/usage.html