It's interesting that they don't break the problem apart geographically. It's inherent in Uber that you're local. But their infrastructure isn't organized that way. Facebook originally tried to do that, then discovered that, as they grew, friends weren't local. Uber doesn't need to have one giant worldwide system.
Most of their load is presumably positional updates. Uber wants both customers and drivers to keep their app open, reporting position to Master Control. There have to be a lot more of those pings than transactions. Of course, they don't have to do much with the data, although they presumably log it and analyze it to death.
The complicated part of the system has to be matching of drivers and rides. Not much on that yet. Yet that's what has to work well to beat the competition, which is taxi dispatchers with paper maps, phones, and radios.
Uber is pretty formidable in building and growing a two-sided market. I suspect it's tuned continuously, at high resolution (in space and time), with levers I mostly don't know. And that's got to be a big contribution to the complexity of the stack.
Think of it this way. A standard e-commerce site, or SaaS with low-touch marketing... there are a crazy number of KPIs to monitor, loads of levers (e.g. what's the right discount to fix basket abandonment?). The instrumentation to track all these conversion funnels (and to run A/B tests to see what works) is half the job.
But at least we have a common understanding of the metrics and the levers -- for SaaS, say, there are tons of similar services, and the knowledge is shared in the community.
Uber? How do you grow while maintaining market liquidity every evening of every week? If you artificially hike demand from passengers in a particular neighbourhood (say with coupons), does word of mouth amongst potential drivers work to increase the driver pool before the passengers get frustrated and move to Lyft?
All of this is new. So how do you create the tech to track not just all the data you need, but all the data you might need, plus the capabilities to do tests to figure out what levers to pull? Hard.
I have no particular insight into this. But my guess is that Uber isn't flying blind - their growth has been no accident - and the complexity of their tech is due to instrumentation, not operations.
Create a global Cassandra cluster with regional datacenters.
Use one keyspace per region
Use per-keyspace replication to only replicate that region's data locally, and to one or more additional datacenters
Have stateless app servers colocated with Cassandra in each DC handling all local traffic
Run Spark on top of Cassandra to do analytics, or to do the ETL to a dedicated analytics system
Optionally have a single "master" DC, with replicas of all data from all locations, that doesn't serve end-user traffic but exists to allow efficient cross-region analytics.
Profit (optional step)
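To sketch steps 1-3 concretely (the DC names, replication factors, and table below are all made up for illustration):

```python
# Minimal sketch of the per-region keyspace layout above, using the
# DataStax Python driver (pip install cassandra-driver). DC names,
# replication factors, and the table are invented for illustration.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])  # any contact point in the local DC
session = cluster.connect()

# One keyspace per region: NetworkTopologyStrategy keeps 3 replicas in
# the region's own DC and ships 1 copy to an optional analytics "master" DC.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS region_us_west
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'us_west': 3, 'analytics': 1}
""")

# Colocated app servers use LOCAL_QUORUM so reads and writes never
# leave the local datacenter.
read_trip = SimpleStatement(
    "SELECT * FROM region_us_west.trips WHERE trip_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
```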
And yes, the company I work for (Datastax) has a product and services to help make it simple.
That's an interesting approach. One question though: in our use case users often travel from city to city and country to country. How do you model that if you are only using local DCs and local replication?
I think the regional keyspaces would have to be caches -- denormalize it, basically. Pop/refresh people into the geo-based caches as they moved around. Truth sits behind it, centralized (perhaps partitioned in some way that makes sense globally but is sub-optimal from a regional cache perspective). Might not be worth it -- hard to know from here. :)
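A toy sketch of that read path, with hypothetical stand-ins for the regional cache and the central store:

```python
# Toy sketch of the read path described above: a per-region cache in
# front of a centralized source of truth. `regional_cache` and
# `central_store` are hypothetical stand-ins for whatever backs each layer.
def get_profile(user_id, region, regional_cache, central_store):
    profile = regional_cache.get((region, user_id))
    if profile is None:
        # Miss: the user just arrived in this region (or the entry expired).
        # Fall back to the source of truth and re-warm the regional cache.
        profile = central_store.get(user_id)
        regional_cache.set((region, user_id), profile, ttl=3600)
    return profile
```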
afaik, this is how Facebook does it, but with regional sources of truth.
If you signed up for FB in Paris and move to San Francisco, your master profile lives in Europe in perpetuity and you'll use your regional cache forever in the USA.
The number of people moving far away from their home DCs should be a reasonably small fraction of the total for it not to matter.
As far as I'm aware, Hailo (>3 backend devs, not quite 100s) did exactly this as well, and the ex-Hailo devs I've spoken to considered it a pretty bad move. It took them ages to refactor into a global system if I remember rightly.
All it takes is for one server in one datacenter to be slightly different. Or perhaps you had a bugfix that needed to go out for users in one area, but you couldn't take the risk of a flaky deploy for the areas that didn't need it; now you've got a deploy that will be a lot more complicated and error-prone than simply looping the same deployment over every location.
> All it takes is for one server in one datacenter to be slightly different
Don't do that. No one is allowed to ssh to boxes. If you need to enforce it by blowing up and rebuilding all servers once per week, do that.
> perhaps you had a bugfix that needed to go out for users in one area, but you couldn't take the risk of a flaky deploy for the areas that didn't need it
Feature flags. Default off, but flip on a new path of code for a set of users.
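A minimal sketch of the pattern; the in-memory dict stands in for a real config service, and the dispatch functions are hypothetical:

```python
# Minimal feature-flag sketch: every new code path ships dark (default
# off) and gets flipped on for a set of users, with no redeploy needed
# once the dict is backed by a config service.
FLAGS = {
    "new_dispatch_path": {"enabled": False, "allow_users": {42, 1337}},
}

def flag_on(name, user_id):
    flag = FLAGS.get(name)
    if flag is None:
        return False  # unknown flags default to off
    return flag["enabled"] or user_id in flag["allow_users"]

def dispatch_v1(user_id): ...  # existing path
def dispatch_v2(user_id): ...  # new path, dark until the flag flips

def dispatch(user_id):
    if flag_on("new_dispatch_path", user_id):
        return dispatch_v2(user_id)
    return dispatch_v1(user_id)
```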
You should deploy so often that it is routine. Deploy 50 times a day. You will find bugs at first, but eventually you should get to the point that you could deploy for every single commit and no one would notice. (Now they may not be a good idea depending on your risk tolerance and other things, but you should be ABLE to deploy every single commit).
Not saying these are simple things to do, but if you are approaching servers with a devops mindsets, you literally should not care about number of servers or datacenters.
> Don't do that. No one is allowed to ssh to boxes. If you need to enforce it by blowing up and rebuilding all servers once per week, do that.
Yep, this is change control 101. I used to have a boss who would go in and edit sprocs on the production server and never get them into source control. I finally just encrypted everything on the server so the only way to push out new code was to commit to source control. A sledgehammer, yes, but sometimes it is required.
Use Terraform or some other tool to codify your entire infrastructure.
>You should deploy so often that it is routine. Deploy 50 times a day. You will find bugs at first, but eventually you should get to the point that you could deploy for every single commit and no one would notice. (Now they may not be a good idea depending on your risk tolerance and other things, but you should be ABLE to deploy every single commit).
This is a great point. At first it is terrifying, but then you realize that a deployment is so easy that bugs are generally not a big deal. Bugs happen, so the goal should be to shorten the time between bug found and fix deployed.
> Don't do that. No one is allowed to ssh to boxes. If you need to enforce it by blowing up and rebuilding all servers once per week, do that.
I'd love to do that. I'd love to have no access to production servers, but ultimately that requires far more work to get right than Ansible configuring the same machines again and again. It also means you can't use dedicated hardware as easily, which restricts performance. It's a great situation to be in, but difficult to get to and requires a non-trivial amount of overhead.
At my place of work we deploy somewhere between 5 and 40 times on any given work day (on a team of 5 engineers). That's because we've managed to engineer a reliable and fast deployment process, but that took a long time to get right. It's powerful, but the overhead, particularly on a small team who are under pressure in a startup environment, can be quite large.
I'm not saying you're wrong, in terms of best practice I completely agree, but when the tradeoff is between sales/acquisition/product market fit/etc, and having a 'smooth' devops process, in many cases, the latter must come second.
You can devops dedicated hardware. It's a little bit different but not that much. Heck there are boot2docker and such that let you just run docker on bare metal.
This is so weird. I hear this all the time. At my place of employment, we all have the ssh keys into our EC2 instances, but no one configures them. Ever. Period. Those ssh keys are purely for either validating changes in a test environment (like to .ebextensions) or diagnosing production issues (why did Puma fall over this time? why isn't syslog output making it to loggly?).
Of course, we lean heavily on Elastic Beanstalk; autoscaling kills old instances daily since we scale from 2 to 18 and back to 2 instances in a 24-hour period across about 9 microservices.
So, if this ssh into boxes and change things is common, it means people aren't doing auto-scaling? THAT is scary.
I'm very very pro devops. But I am cautious about auto scale.
It adds a lot of complexity. If your app has predictable load it may simplify things a lot to not autoscale. Think a B2B app with manual account creation. You know your user levels. Zero reason to turn on autoscaling.
...so you run the servers you need for peak load 24/7? Ew. So much money just thrown away because "it's too complicated". We just scale based off of Network Out and only really spent a week finding the right threshold.
How long does your stack take to add a server? 3 minutes?
I have created a lot of services that couldn't have crappy perf for 3 minutes every time the load scales. Especially since in many apps the most expensive part is storage and DBs, which cannot autoscale.
So say you spend 1k a month on your DBs and $100 for peak web load. You could spend a week fixing autoscaling bugs and try to save $25 by scaling web nodes down... But then you still have bad perf every time load spikes. I would not do that to save $25. I would question a company I work for that did that.
You need to be scaling up and down for hundreds of dollars per swing and have big spike loads before you add the complexity of auto scale.
We're sensitive enough that we preemptively scale as traffic "appears" to be increasing. So we add 5 EC2 instances, not at crisis levels of traffic, but "Hmmm...I feel a tingling in my extremities". We then remove one instance at a time if traffic falls below "Not doing anything" levels. The time between scaling actions is 15 minutes. Since ASG's go to remove older instances first, we don't end up getting charged a full hour for instances that are up for less than an hour that often.
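For the curious, roughly what that looks like with boto3 (the group name, policy names, and threshold are all invented):

```python
# Rough boto3 sketch of the policy above: scale out by 5 on an early
# NetworkOut signal, scale in by 1, with 15 minutes between actions.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="tingling-in-extremities",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=5,   # add 5 instances before crisis levels
    Cooldown=900)          # 15 minutes between scaling actions

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="drain-slowly",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,  # remove one instance at a time
    Cooldown=900)

cloudwatch.put_metric_alarm(
    AlarmName="traffic-appears-to-be-increasing",
    Namespace="AWS/EC2",
    MetricName="NetworkOut",  # the metric we scale on
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=50_000_000,     # bytes per period; invented, needs tuning
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]])
```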
Admittedly, our web traffic is very US Business Hours centric and peaks predictably between 3 and 4 in the afternoon.
Also, we're operating more at a scale of $40,000/month for peak traffic capacity 24/7 and $25,000/month once I got autoscaling worked out. So...yeah. I guess the scale for savings matters. :-)
> All it takes is for one server in one datacenter to be slightly different
But doesn't a global system still run on multiple DCs, at least for redundancy?
> perhaps you had a bugfix that needed to go out for users in one area, but you couldn't take the risk of a flaky deploy for the areas that didn't need it
But if you have a single global system, you can't even make that decision.
To be clear, I'm not arguing against it, I'm just trying to understand why, since my first instinct would be to divide geographically as well.
> But doesn't a global system still run on multiple DCs, at least for redundancy?
If you've got 2 levels of separation - servers and 'groups' (whether they are datacenters, or whatever) - you've got 2 levels at which that special casing needs to happen. If you only have 1 level - servers - i.e. one deployment, even if that's across multiple datacenters, you only have 1 place to special case. I'd say that's easier.
> But if you have a single global system, you can't even make that decision.
Good point, but my point was that it will be simpler and less error prone in general. You might not be able to push the bugfix, or you might have to risk the deploy globally, but I think either would be better in the long run for a simpler deployment. It is a trade-off though.
"Yet that's what has to work well to beat the competition, which is taxi dispatchers with paper maps, phones, and radios."
In some part (maybe even a large part) of the world, yes. But markets where taxis come from many different companies (or where ordering one in advance is more common for other reasons) already have fairly sophisticated technology. Let's not promote the myth of startup exceptionalism. Uber has modern (but not futuristic) technology; the real difference is in the business model.
You know the saying: "I didn't have time to write a short letter, so I wrote a long one."
This is the kind of technology stack you end up with when you try to move fast. I'm sure that if more time and thought had been put into it, it would have been more elegant and simple. But who has time these days?
I had similar thoughts. "Wow. All those moving parts. Each one of which could fail." Each new piece of unique technology added means that its probability of failure gets multiplied against what you already have.
Well, to some extent, they do. Their geofencing [0] service takes in coordinates and returns the geofences these coordinates fall into. They make this faster by pruning irrelevant geofences:
>> Instead of indexing the geofences using R-tree or the complicated S2, we chose a simpler route based on the observation that Uber’s business model is city-centric; the business rules and the geofences used to define them are typically associated with a city. This allows us to organize the geofences into a two-level hierarchy where the first level is the city geofences (geofences defining city boundaries), and the second level is the geofences within each city.
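That pruning is easy to sketch. Here's a toy two-level lookup using shapely; the coordinates and names are invented, and Uber's actual service is written in Go:

```python
# Toy version of the two-level lookup from the quote: resolve the city
# geofence first, then test only that city's inner geofences.
# Requires shapely (pip install shapely).
from shapely.geometry import Point, Polygon

# Level 1: city-boundary geofences.
CITIES = {
    "sf": Polygon([(-122.55, 37.55), (-122.30, 37.55),
                   (-122.30, 37.85), (-122.55, 37.85)]),
}
# Level 2: geofences within each city (airports, event zones, ...).
CITY_GEOFENCES = {
    "sf": {"sfo_airport": Polygon([(-122.41, 37.60), (-122.35, 37.60),
                                   (-122.35, 37.64), (-122.41, 37.64)])},
}

def matching_geofences(lng, lat):
    p = Point(lng, lat)
    for city, boundary in CITIES.items():
        if boundary.contains(p):  # level 1: which city is this point in?
            return [name for name, fence in CITY_GEOFENCES[city].items()
                    if fence.contains(p)]  # level 2: that city's fences only
    return []

print(matching_geofences(-122.38, 37.62))  # ['sfo_airport']
```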
Sharding your application geographically adds quite a bit of complexity and requires a lot of work developing support infrastructure to manage load balancing, failover, and placement. One of the advantages of SOA is that different services can have different architectures.
To be precise, we do do geographic sharding in the services that benefit from it, but avoid it in the services that don't.
Also note that the assumption of region based partitioning doesn't extend to all applications. Analytics, for example, may want to dice and slice the data along different dimensions. Partitioning is a convenient abstraction for managing marketplace scale, as you mentioned, but inconvenient elsewhere :).
Riders and drivers are also not local. I travel a lot, and yet my star rating, profile picture and payment details work regardless of whether I'm in the Bay Area, Berlin or DC. Further, I've heard of Uber drivers giving rides to other regions, e.g. SFO airport to Sacramento (apparently fairly common as the Sacramento airport has limited service and is expensive).
I'd love to know how many people are responsible for devops/operations/app at various stages of any company's journey. Wikipedia says Uber employs 6,500 people so if even 15% of that is on the tech side of the business that's still 1,000+ people allocated to tech. I think this metric would be a useful reality check for a "modern" SaaS project with 3-10 people that's trying to emulate a backend structure similar to the big league.
There are 20+ complex tools listed in the stack, and running a high-visibility production system would require a high level of expertise with most of them. Docker, Cassandra, React, ELK, WebGL are not related in required skills/knowledge at all (as, for example, Go and C are). Is it 5 bright guys and girls managing everything, like the React team within Facebook? Or a team dedicated just to log analytics?
I don't know the numbers, but at least some of Uber's tech employees are working on things that aren't directly connected to the app and rides, like mapping and self-driving cars.
One of their recruiters contacted me a while back, and it sounds like they're working on some really neat stuff, but I don't agree with all their business practices, so I didn't pursue it :-/ In any case, he pointed to their website: https://www.uberatc.com/
That's all bloat. Pure and simple. At the end of the day Uber just does routing and basic allocation. It's a simple operations problem that has been solved since the 70s and no one back then needed ELK, Docker, Cassandra, etc.
I've seen this bloat everywhere. It is usually a result of internal politics and posturing by management types. The kinds of people Steve Jobs would have called B and C players. Now the actual people operations is another matter entirely but the tech stack definitely doesn't need to be that complicated.
I have to admit, I could see this running for a city the size of SF on a desktop machine under the table at the taxi depot. Uber has 11,000 drivers in SF, but probably only a few thousand are on at any one time. A ride takes a few minutes, so if you figure 3,000 active drivers and 4 rides per hour, that's only about 3 ride transactions per second. You have a transaction at ordering, one at ride start, and one at ride end. Plus you have tracking of where all the active drivers are, pinging maybe once a minute. That adds up to only 10-20 TPS. You can offload the routine web and app stuff to some front end machines. And you want to do some analysis every minute or two to see where there are "surges".
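Spelling out that arithmetic, with the same assumptions:

```python
# Back-of-envelope math from the paragraph above, same assumptions.
active_drivers = 3_000
rides_per_driver_per_hour = 4
txns_per_ride = 3  # one at ordering, one at ride start, one at ride end

rides_per_sec = active_drivers * rides_per_driver_per_hour / 3600
ride_tps = rides_per_sec * txns_per_ride
pings_per_sec = active_drivers / 60  # drivers pinging about once a minute

print(f"{rides_per_sec:.1f} rides/s -> {ride_tps:.0f} ride TPS, "
      f"plus ~{pings_per_sec:.0f} cheap position writes/s")
# 3.3 rides/s -> 10 ride TPS, plus ~50 cheap position writes/s
```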
The only non-trivial part of this is assigning drivers to rides.
It's easy to imagine the simplest stack that can serve the core features of any service, and that is well served by a single box. What's missing from the picture is the infrastructure to replicate this 500 times by separate teams, monitoring all of it, backup, auditing, aggregating customer and business metrics, back-office systems, and more. Plus the fact that these things always grow organically and embed a host of imperfect decisions - the imaginary system will always be better designed.
Couldn't agree more. It's all the invisible details that cause the load.
It's a much more trivial example, but I think it highlights the point well: we have pages in the app I work on that would respond in ~100ms, but might have a single sentence on them that takes another 100ms to generate because of the complex data relationships involved in figuring out what that sentence needs to say. The 'request handler' might be 20 lines of code, with a 50 line util function to generate that line of text. No armchair architect will ever take into account things like that, but the end result is a page that is just a bit more personalised to the user and therefore improves their experience.
In an app of any real size, I imagine there are anywhere from hundreds to many thousands of tiny little details like this that all together drastically increase the amount of power needed to run a service.
> No armchair architect will ever take into account things like that
An armchair architect would say it's not needed. They would question whether spending 50% of your response time generating a single sentence is in any way worth it, and wonder what kind of architectural mistakes led to that.
The problem with this line of reasoning is that it implies the business exists to serve the software. Unless you work at a tech-focused non-profit, the software actually exists to serve the business.
> the software actually exists to serve the business.
Sure it does, but the business also wouldn't exist without the tech in Uber's case (and a lot of other cases). And it's going to be your head on the line when you keep adding these 100ms sentences because the business wants them for no good reason, your page takes 3 seconds to load, and nobody buys anything from the site.
You're making the assumption that the additional features slowing down the service aren't adding value.
More common is a "death by 1000 cuts" scenario where the various causes of slowness are apparent to the developers, but quite difficult to remove because they've become necessary to the continued success of the business.
> You're making the assumption that the additional features slowing down the service aren't adding value.
No, I'm questioning whether the value added is greater than the value lost, and in this hypothetical example clearly not. So it's your job to point that out to whoever and not silently obey.
We saw this in the 90s, people with a whole rack of machines running Java or Perl CGIs to serve a site with less traffic than we were doing with a single, ordinary box running NSAPI. You need loads of scaffolding that you mention, only if you are trying to fit a square peg into a round hole.
When I use these services, it doesn't always give me the closest car of the ones shown. It's all about which driver accepted my request first. Isn't it a fairly easy calculation of "all cars within X km or the nearest Y cars", and then whichever driver taps first gets matched with you?
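The naive version really is just a distance sort. A sketch, assuming the "nearest Y cars within X km, first tap wins" model described above:

```python
# Offer the request to the nearest Y drivers within X km; whichever
# driver taps accept first gets matched. All parameters are invented.
import math

def haversine_km(lat1, lng1, lat2, lng2):
    # Great-circle distance between two (lat, lng) points, in km.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def candidate_drivers(rider, drivers, max_km=3.0, top_y=8):
    # rider: (lat, lng); drivers: list of (driver_id, lat, lng).
    nearby = sorted(
        (haversine_km(rider[0], rider[1], lat, lng), did)
        for did, lat, lng in drivers)
    return [did for dist, did in nearby if dist <= max_km][:top_y]
```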
> At the end of the day Uber just does routing and basic allocation.
The thing is, though, that that algorithm is easily copied, as evidenced by Lyft etc. So really Uber's business model needs to be all about differentiation, marketing, analytics and prediction, otherwise they won't survive.
I think analytics and other fuzzy avenues are where those technologies shine.
It sounds like you have no idea how credit card transactions were or are processed. They are almost exclusively written to files and batched once a day, even today. Back in the 1970s it was even worse because there was no real-time authorization.
This is helpful information if accurate, but please edit incivility like "It sounds like you have no idea" out of your comments here. The site guidelines ask you to omit this sort of thing, so please post civil and substantive comments only.
He or she is correct that the cash settlements were and are batched, but authorizations, reservations, etc. were online; to the mainframe that's just another transaction.
The reason settlements were batched was to save on wire fees.
I love how any armchair quarterback on HN can sit back and dismiss the work of thousands of engineers as bloat, with no actual qualifications of their own.
Build a multi-billion dollar company that has satisfied customers around the world, and then let's hear what you have to say.
Let me guess, you also came up with the idea for Google Adsense and the iPhone in high school, right?
The business idea is a separate concern. A solid business model can withstand all sorts of abuse and incompetence at all levels. Several eBay CEOs, despite their best efforts to destroy the company, have been unable to do so. Similarly for PayPal and a few other companies that have excellent product/market fit.
Each of those companies survives despite the best efforts of 1000s of engineers and managers to over-engineer and justify their salaries. So the logic of "1000s of people have worked on this so it must be valuable" is incorrect reasoning. The more pertinent question is how do these companies survive despite all the over-engineering that is happening? Once you ask that question you are almost surely led to the conclusion that the technology is not as relevant as people would like to think and inefficiencies at the technology level have very little effect on actual business outcomes when the product itself provides value people are willing to pay for.
Like bureaucracy, the complexity of your tech stack grows to accommodate the number of people available to work on it. Go crazy on the hiring, and what do you expect all those people to do all day?
That's a very uncharitable view, and it could be argued the other way around, that the hiring occurs to support the need for more people in engineering. I'm not certain which way around it goes with Uber, but I've seen both.
Exactly. I saw a presentation by one of their tech guys and was very surprised by it. First, by the number of IT people they have; second, by the work they do. It looked like inventing problems and solving them for the sake of problems and solutions, i.e. no business value in it.
I disagree. Even on Hacker News, people rarely express such absurd things with so much confidence. You fail to take into account many of the following:
* Extremely high volume. Uber has indicated elsewhere that they receive upwards of a few hundred thousand requests per second on just one service. Please show me the logistics stack that did this in the 70s.
* Yes, building the first version of something is extremely cheap and easy. But being able to improve it becomes harder and harder. Especially given high volume, modern companies need sophisticated analytical tools that provide reliable data to both technical and non-technical staff. Please show me the analytics stack that was able to ingest, store, and analyze terabytes of business data in realtime from the 70s.
* Reliability. Modern web applications need to fail gracefully and be debugged quickly. Please show me the logistics and routing stack that was capable of extremely high uptime while being deployed constantly and serving hundreds of thousands of requests per second from the 70s.
* Extensibility. Businesses need to extend to new markets. Moves like this often invalidate past assumptions. In order to support business flexibility, modern engineers deliberately invest considerable time into building decoupled components that can be reused as platforms instead of stuffed into a monolithic codebase. Please show me the operations and routing stack that could easily be reconfigured to enable such products as Amazon Web Services, Uber's external API, the Google Maps API, or Uber EATS—from the 70s.
To make this more concrete, I worked on a routing stack at another company which probably works similarly to Uber's ETA systems. When considering these things, it's important to keep in mind the dependency tree of each new problem set and the work required to make those dependencies work reliably at big scale.
To give you an idea of what this area alone entails:
1. Machine learning.
- wiring together and improving algorithms: linear regression to begin with, then random forests, then neural networks.
- ensuring data required for learning is reliably available and correctly computed.
- tools to launch, deploy, test these models.
2. Working with map data in memory many times larger than what fits onto the smallest consumer laptop.
- how do you handle updates of data?
- what if you want to use different data sets in different places, because they're more accurate?
- how do you debug errors in the data without visual tools (hint: it's really hard and time consuming)?
- how do you optimize loading this data into memory without requiring hours to deploy your application?
- where do you even store this data?
3. Requests per second in the hundreds of thousands and latency requirements (in order to ensure the app responds quickly) hovering around 10ms.
- how do you profile complex distributed applications?
- what optimizations are available to make graph search faster (hint: A* isn't fast enough; a baseline sketch follows this list)?
- how hard is it to implement these optimizations?
4. Data science and data science tools
- Visualizations!
- again, reliable data pipelines
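For reference, here's the baseline that item 3 calls too slow: textbook A* with a straight-line heuristic over a toy road graph (the graph/coords format is invented). Production systems layer heavy precomputation, e.g. contraction hierarchies, on top of something like this:

```python
# Textbook A* over a road graph: the baseline that isn't fast enough
# at six-figure requests per second.
import heapq
import math

def a_star(graph, coords, start, goal):
    """graph: {node: [(neighbor, edge_km), ...]}; coords: {node: (lat, lng)}."""
    def h(n):  # crude straight-line distance, ~111 km per degree
        (la1, ln1), (la2, ln2) = coords[n], coords[goal]
        return math.hypot(la1 - la2, ln1 - ln2) * 111
    frontier = [(h(start), 0.0, start, [start])]
    settled = {}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if settled.get(node, float("inf")) <= g:
            continue  # already expanded via a route at least as cheap
        settled[node] = g
        for nbr, edge_km in graph.get(node, []):
            heapq.heappush(frontier, (g + edge_km + h(nbr),
                                      g + edge_km, nbr, path + [nbr]))
    return None
```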
That's about what one team works on over the course of a year. Note the dependencies we have here:
1. We assume access to cloud infrastructure that doesn't require us to do all of our own devops.
2. We assume mature and automatically scaling data infrastructure: that Kafka and Storm have been set up and tuned to a degree that we don't have to worry about it. In reality, Kafka alone requires a team of at least a dozen at LinkedIn to keep up with the maintenance, operations, and optimization burden of keeping up with scale.
3. We assume mature and scalable service-oriented architecture tooling—if a call to another service is slow, I should be able to see on a dashboard which service is slow, how frequently it's slow, why it's slow (if it depends on another service), etc.
and countless other things I could spend days enumerating for you but I guess it'd be wasted on you because you're pretty convinced you already solved these problems in the 70s, so why am i wasting my breath
It's important to counter the trivializing sort of dismissal that people often post to HN (the old "I could build Twitter in a weekend" and whatnot). We want the culture to move more toward thoughtful, substantive critique. So your detailed argument here, based on experience, is valuable. Please don't spoil it by becoming uncivil like this:
> I guess it'd be wasted on you because you're pretty convinced you already solved these problems in the 70s, so why am i wasting my breath
With that your comment does more harm than good: it poisons the atmosphere and detracts from your substantive contribution.
You're definitely not "wasting your breath" even if you fail to persuade the other person not to be snarkily dismissive, because the real audience for a comment like yours is everybody else: i.e. the rest of us who are curious about how (in this case) Uber operates and why things might be the way they are. That audience needs to see both good information about the challenges involved (as opposed to this-has-been-trivial-since-the-70s) and a good example of how to patiently respond to a trivializing comment with a thoughtful one. It's bad if, instead, you give us a reason to wince and an example of replying to a dismissive comment with a rude one.
How are you going to discuss all that in a substantive manner? All anyone can do in the limited time frame of a HN discussion is draw parallels to previous job experiences or previous user experiences.
My own experiences are more aligned with the sentiment expressed in the "dismissal" comment.
Mine too, but that doesn't make it a good comment. In fact its first paragraph is almost a parody of the know-it-all internet comment.
There are a zillion ways to make the same kind of argument thoughtfully. Talking about one's own concrete experiences helps. So does not acting like you know everything about somebody else's situation.
Bloat is a problem, and so (in my view) is the kitchen-sink software culture of hauling in libraries and frameworks without thought for overall complexity. But we need to be able to talk about this at a higher level than other-people-are-idiots-compared-to-me. A much higher level.
Point taken. I'll try to do better next time, but it does get old after a while of seeing the same set of mistakes and articles parroted over and over again. Trivial problems blown out of proportion because people don't know the proper science, theory, and history and have opted to re-invent things badly. Uber is especially known for this since they re-invented/re-wrote basic geospatial algorithms in Go and hailed it as innovation.
The dismissal comes from years of reading such articles and then chipping away at the veneer to see what's really underneath and being disappointed every time and then working on such things and experiencing first hand how the bloat comes about.
I understand, but we need you to give us the experience and omit the dismissal. The former can dramatically improve the quality of this site; the latter only degrades it. And the former will actually be persuasive while the latter merely gets people's backs up (or makes them cheer if they happen to hate the same thing) without teaching the reader.
I get irritated at having to repeat the same things over and over, too, but the internet is basically stateless and so (sadly) is the software business. And like everyone, I get peevish when people say/do wrong things and act like they know what they don't. The longer one has been around, the more occasions one has to secrete bile. But it's a humor one must metabolize internally and not release into the community—hard work and not fun at first, but far more rewarding in its effects, and maybe our only chance at creating an actually functional culture.
I've built things that handle less than 1 TPS and things that handle more than a few thousand without any significant memory or CPU load. All of these things have had uptime that has been unmatched by the other systems they have had to interface with, all the while degrading gracefully and handling everything else in between. So let's just say I understand a thing or two about designing fault-tolerant systems that need to operate under high loads and degrade gracefully.
* Re: workload in the 70s. You are missing the point about logistics stacks that handle 1000s of transactions per second. The point is that Uber's problem is self-imposed. Stepping back and thinking about the problem a little will let them handle the same amount of work with 1/10 the hardware costs.
* Re: first version. The first version and the n-th version, when properly designed, require the same set of gradual steps. If you build the first version to throw away, then whose problem is it that you built it that way and need 10x the hardware to handle the workload because of shitty architecture? Again, stepping back and taking a holistic view and thinking a little bit is the trick.
* Re: extensibility. Same deal. Design your architecture properly and you can extend it as far as any business requirement forces it without spending 10x on hardware and software. How do you do this? Same as above. Thinking.
* Re: reliability. See above. Thousands of transactions a second with unmatched uptime. It is more likely the systems I interface with will go down or even for AWS to have an outage than for a properly designed system to fail.
1. Machine learning - already doing it wrong. You've failed to learn from history and instead are following fads and trends. When properly framed, routing/allocation is a linear program, and there are solvers that will solve such problems with millions of variables (a toy example of the allocation view follows this list). Instead you have opted to complicate the problems with the latest fads and trends that are not even suited to the problem you are solving. In essence you've made my point.
2. Consumer laptop? I'd hope the software runs on server grade hardware. Bringing up a consumer laptop as a restriction on memory is a non-sequitur.
3. Hundreds of thousands. Great. I can handle several thousand connections per second on a dinky c4.2xlarge instance with 10-20ms guarantee with a ruby stack. There are plenty of ways to optimize it further but I've never needed to. The literature is full of optimized and distributed graph search algorithms. Operationalizing any one of them wouldn't be much work. How do I know? Because I've done it before.
4. Reliable data pipelines have been a solved problem since Hadoop and friends. Again, this makes my point about bloat.
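To make the allocation point concrete, here's the toy example promised above: batch driver/rider matching framed as a classic assignment problem, solved with the Hungarian algorithm in SciPy (the ETA matrix is invented):

```python
# Driver/rider matching as an assignment problem, the canonical
# operations-research formulation. cost[i][j] = pickup ETA (minutes)
# if driver i is assigned to rider j; the values are invented.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[4.0, 9.0, 7.0],
                 [3.0, 2.0, 8.0],
                 [6.0, 5.0, 1.0]])

drivers, riders = linear_sum_assignment(cost)  # minimizes total ETA
for d, r in zip(drivers, riders):
    print(f"driver {d} -> rider {r}")          # 0->0, 1->1, 2->2
print("total ETA:", cost[drivers, riders].sum())  # 7.0
```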
Re: one team over a year. Seems like you need better engineers or better-designed systems. If you're developing software with more than 100 engineers and the boundaries between teams are so ill-defined that you need more than 10 per team, then that's an organizational problem and a highly inefficient way to do things. How do I know? Worked on teams that gelled and those that didn't. The determining factor was always reducing communication overhead by proper architectural design. The amount of communication overhead was almost directly correlated with software bloat and sprawl.
1. Devops: Solved problem. Chef, Ansible, Puppet. Pick one; they're all the same.
2. Kafka is not good software. Pick something else for your event management pipeline. Heck, build it from scratch. Neither Kafka nor Storm are novel or required. Chances are you've over-engineered it if you are reaching for those and need to step back and think.
3. Simplify your call graph. There is no magic bullet here. No amount of dashboards, logs, and metrics will let you get around an ill-designed and bloated service architecture. Again you've made my point.
Same reason any other software is not good software. Chances are you don't need it and are reaching for a shiny tool. Kafka requires zookeeper and in my experience zookeeper is an operational nightmare. If you need an event bus then there are many out there that are much simpler and easier to maintain operationally with much simpler failure modes.
Don't just reach for something because it has been the most common thing posted on programming forums. The behavioral psychologists and economists consider this a well known cognitive bug.
RabbitMQ is a perfectly fine message bus in pretty much all use cases. Easier to operate and maintain without any extra dependencies, and with much simpler failure modes. A few more: ZeroMQ, SQS, HornetQ, NATS, NSQ, etc. Any one of those will most certainly fulfill whatever use case you have.
The point being, Kafka has a very heavy operational overhead, and you'd better understand what you are getting into and what bargain you're making for the scalability you mention.
I've used RabbitMQ; it most certainly does not fulfill the volume requirements I have.
The fact that you are comparing zeromq to Kafka is pretty good evidence that you have no idea what you are talking about, and are just tossing out names from google. I'm a little disappointed, honestly, I hoped you were aware of something I hadn't heard of.
There are two ways to solve problems in engineering. You either bring the problem closer to your existing solutions by redefining the problem or you keep the problem the same and bring your solutions closer to the problem.
Sounds like you are unwilling to redefine your problem so that it is amenable to solutions that are not kafka.
Yeah, you recommended a sockets library as an alternative to a distributed durable circular buffer. Not obviously clueful. Might as well recommend Nginx as an alternative to JavaScript.
I need to durably handle billions of events per day. No amount of redefining changes the underlying business problem.
Kafka, on the other hand, has been helping me solve that problem for years.
Let's see. I can handle a few million on a single instance and I have yet to hit any memory or CPU limits indicating I can handle 10x of what I'm currently handling. Oh and it's about 100k or more transactions per hour at peak load. Just from basic operational observation and logs. Also, have yet to see any durability issues and I've managed to do it without kafka. So pretty basic math says the entire thing can be scaled to a "few billion" transactions in a pretty straightforward way. Then again I'm more willing to redefine my problems to come up with simpler solutions.
But this discussion has devolved into personal insults at this point. We have nothing to teach each other it seems.
If nothing else, I could teach you that ZeroMQ has nothing to do with queueing or durability.
It's certainly possible that Rabbit has improved in the years since I used it, if it works for your use cases, great. But don't assume that everyone using a popular technology is doing so because of a fad or without understanding the tradeoffs.
You certainly have an interesting argument but it doesn't sound convincing. Can you explain more about how 1970s technology can solve the problems Uber is facing in 2016?
Here you go: https://en.wikipedia.org/wiki/Operations_research. Go to town. The history section and the problems-addressed section are more than sufficient. You can expand further if necessary and decide for yourself whether what Uber does is in any way novel and whether it requires all the bloat.
True, I think their CTO must be confused and easily misled by techies who just want to get the latest buzzwords onto their CVs.
Silly comments such as "we gain true insight from pretty graphics rather than tedious SQL queries" say it all for me.
I suppose they have to burn the insane amount of capital they raised ($50 billion) somehow?
I recommend reading "The Visual Display of Quantitative Information" by Tufte. I would have partially agreed with you before, but I really do think that correct visualisation of data can make it vastly more useful, and as a few other commenters have noted, Uber has a big challenge to differentiate themselves from Lyft and others, and effective use of data could well be one of their differentiators.
Uber is really strapped for engineering talent, especially when it comes to SRE. Many friends and I, working SRE at various Bay Area companies, consistently get hit up for free lunches and interviews. It's really weird considering that their stack doesn't NEED to be this complex...
It probably could be simpler. It seems like with enough engineers, every company I've ever worked at eventually ends up using every technology they can because of the one thing it does well.
This "one thing it does well" business is then presented as : "using the right tool for the right job" and it's difficult to argue against that because the counterpart can easily deride you as a fanatic of some technology, someone not objective enough, etc...
It is however interesting that we used relational databases for virtually everything for decades even though SQL is suboptimal at most things if we take them in isolation. Some will argue that people are now realizing their mistake, but the truth is these companies were successful and we were all getting our paychecks. (PS: I choose to use NoSQL for virtually all my projects)
The real driver shouldn't be the one thing it does well. Many times - if not most of the time - it's preferable to use a tool optimal for the most important parts and suboptimal for the rest. I personally prefer to provision two more instances, than to add two more technology stacks.
> It is however interesting that we used relational databases for virtually everything for decades even though SQL is suboptimal at most things
You have no clue what SQL or ACIDity is. For 99% of the cases SQL/RDBMS is the right choice. You probably think you belong in that 1%, but from your comment, I suspect you do not.
> I choose to use NoSQL for virtually all my projects
That's because you have no important data to store.
When you get to store data that are important to your customers you're gonna have a big revelation.
"NOSQL" doesn't mean "no ACID". There are plenty of NOSQL DBs that are ACID compliant.
And SQL is not the only way to write your queries. There are a lot more QLs.
Even though what you say is true, my comment is still correct and relevant to the OP.
Also, of the NoSQL DBs that support ACID, I wouldn't touch them for any serious work or primary data at least. None of them are battle-tested in the same way Postgres is for example.
And again, people who really need these type of DBs fall into the 1%, and I'm being very generous.
That's quite an attack. I trained for an Expert SQL certification from Microsoft back then, when I was writing 3,000+ line stored procedures to migrate an Access application at a Fortune 40 company. So I know what it is, and I know quite a good deal about RDBMS. I'm not among those who criticize what they don't know.
Regarding the gist of your comment on NoSQL, I haven't been able to convince people coming from where you are with two days of meetings in a row, so I'm fairly confident I'm not going to change your mind on HN.
If you have a clue, as you say, and still believe that it's a good idea to store critical data with NoSQL then I don't know what to say.
Obviously you don't care enough that almost every NoSQL solution out there has been found to make false claims about their guarantees. The billions that have been sunk into the black hole called NoSQL in the last decade are unprecedented.
You don't have to change my mind. I have (and still use) both. And I still maintain that people who use NoSQL for 99% of their projects are making the wrong choice.
I think that goes far beyond database choice. I think many people build SPAs that end up hurting the product over a traditional setup. I think many people use microservices where a monolith would have much better performance and reliability.
As for the basic premise
> that people who use NoSQL for 99% of their projects are making the wrong choice.
It is perhaps kinda right? Some people may really only touch giant data sets, so for them always using NoSQL is smart. The people that write webapps with 12 users? More questionable.
In most cases you can decide whether you need to leave an RDBMS with something like:
1) Do you need to store in the next year > 100GB of data that you need to access in realtime?
2) Do you need in the next year to store > 1TB of data that you need to access in semi-realtime?
3) Do you need in the next year to handle > 1000 writes per second?
4) Do you need in the next year to handle > 1000 reads per second?
Not a perfect guide, and I am sure you can think of edge cases that can still be dealt with in a RDBMS.. but it is a decent starting place. One tricky part is that if you are optimistic, almost any app can check off #3 or #4 (Like Uber but for Baby Strollers). Knowing how to realistically estimate demand for a possibly viral startup is hard.
Another one that I'd add is:
- "Are the records in each table in the hundred of millions? Then most probably you'll do fine with an RDBMS".
If you go above that, or you have operations that will push that number into the billions, then you can offload them into whatever non-RDBMS storage you want and do your thing. But that's the thing with RDBMS: you can always move (or offload part of) your data to a non-RDBMS solution afterwards.
But doing the inverse? I wouldn't want to be in that person's shoes ;)
Does row count matter that much compared to data size? I.e. if I have a billion rows but they are 2 32-bit ints, that isn't a lot of data (8 GB + index). I guess the index starts to get pretty big... but I always just think of raw data size vs. # of rows.
Remember, it's just a rule of thumb. Now... tables with 2 32-bit ints as columns are not exactly typical RDBMS data.
Also, data in RDBMS are... well relational :) Meaning, the rows of just one table are not that important. The data are going to be queried and combined with data from other tables. And I know that typical relational data that consist of hundreds of millions of entries in each table is something that most DBs can handle.
Because in the 2 decades I'm in the industry I see RDBMS make the world spin and NoSQL DBs destroying companies and families.
MongoDB and CouchDB eat data for breakfast, I know that from first-hand experience. And all the other DBs that claim they don't keep cropping up in Aphyr's blog.
I ain't saying that all NoSQL dbs are useless. I'm just saying that proposing and choosing an RDBMS solution is going to be the right choice for 99% of the projects.
Yes, most people think that they belong in that 1% where they have the infrastructure problems and big data of Google, FB and Twitter but.... they don't.
In the last 2 decades in the industry as well I've never lost data with MongoDB, Riak or Cassandra but have with Oracle, DB2 and PostgreSQL. After all databases are just software and there will always be bugs. Some people just get tripped up by different ones.
And you are woefully ignorant to think the RDBMS is the right choice for 99% of projects. Especially since you think that the 1% of remaining users are purely worried about scalability. Hint: think about the schema problems associated with storing auto generated features from deep learning models.
>In the last 2 decades in the industry as well I've never lost data with MongoDB, Riak or Cassandra but have with Oracle, DB2 and PostgreSQL
Yet every test proves otherwise. Also, use Google to see how people have lost data with MongoDB. Mongo is not considered a serious piece of technology by any scientist or engineer I know. Postgres though is universally considered an engineering marvel.
>Hint: think about the schema problems associated with storing auto generated features from deep learning models.
Hint: The problem you mentioned? Even less than 1%
Calling me ignorant doesn't change reality you know.
NoSQL DBs usually target distributed environments.
So... enter CAP theorem. There's no free lunch. People think we can simply throw away half a century's worth of science because JSON and schemaless are teh awesome derp derp.
Implementation is surely an issue, if you take into account that the mongodb guys had to acquire another company [1] in order to overcome their abysmal write performance. And yet there were people, and benchmarks that were trying to tell us that mongo was faster than RDBMS alternatives. All this circa 2009-2012.
You know what's faster than everything? Writing to /dev/null ;)
Anyways, depending on your use case there might be a NoSQL out there that might fill your needs, and it might actually deliver what it claims it can deliver. But it's hard to sift through all the ad-driven, buzzword-ridden infomercials that get thrown around by start-up companies in the DB domain.
Also, DBs are like filesystems; even if the math/science is correct, it needs at least a decade of proven track record before you can say that it works as advertised.
> NoSQL DBs usually target distributed environments. So... enter CAP theorem.
Surely FB is not running MySQL on a single machine. Perhaps I am misunderstanding what you are saying, but saying SQL DBs don't face the issues of distribution seems a little strange.
Distribution comes into the picture from the shape and size of the data, not the data saving/retrieval techniques, yeah?
FB and all big companies are a very bad example. They have a ton of resources and usually they don't use vanilla products, since they have the engineering capacity to support their own forked versions; e.g. see their own version of PHP.
Also distributing reads is easy, writes... not so much. NoSQL systems usually offer distributed writes with the caveat of eventual consistency. RDBMS have referential integrity and other constraints which by definition cannot migrate into a distributed environment. Or at least there's not a one size fits all solution.
> Distribution comes into the picture from the shape and size of the data, not the data saving/retrieval techniques, yeah?
Most definitely not. It has nothing to do with the shape and size of data. Also... there's no such thing as "distribution" in our context, only "distributed", from "distributed computing" [1], and it has everything to do with data saving and retrieval :)
>RDBMS have referential integrity and other constraints which by definition cannot migrate into a distributed environment.
So, use an RDBMS if your data can be handled by a single machine (or you have the resources of FB)? The "99% of people need RDBMS" argument boils down to: 99% of people have data that can be handled by a single-machine RDBMS.
The single machine shouldn't be the deciding factor.
If your application is like most apps (far more reads than writes) then you can easily distribute the load across multiple machines. If you have more writes than reads (quite rare, but still) then scaling an RDBMS will be challenging.
In this case, if eventual consistency is something you can live with, a NoSQL store might be best for you.
Like, what's gonna happen if they have a couple of corrupt records? A minor inconvenience at worst?
Is anyone gonna lose millions? Nah. Anyone gonna die? Nah. Anyone gonna get sued? Naaaaaah
Also, Facebook uses MySQL for their primary data. Pretty sure it's the same for eBay. Don't know about Adobe; I bet it's the same deal there too.
People get so excited when they hear about some big company using X, but they have no clue in what capacity it's used. I can guarantee you that all the data that matters, that needs to be consistent and whole, is in some kind of RDBMS.
MongoDB is used in Facebook for Parse, eBay for analytics and Adobe for Experience Manager.
All are pretty important parts of their business. In particular the latter, which, if there were data loss, would cause the biggest shockwave in the web community.
But no point discussing it with you since you think: Sony Playstation Network, Apple iCloud, Office 365 etc aren't important data to these companies.
Have you actually used Parse? Obviously not, because you wouldn't dare mention that POC in this discussion. Hint: search around about experiences.
There's no point discussing with me, because you can't have a coherent debate. Analytics data is neither critical nor primary. You really have to reread what I said.
what are some of the skills/experience needed to be an SRE?
I've been having a really hard time finding a job due to being a 'jack of all trades' and having no specialty. Just an assumption. I have over a decade of experience building webapps.
I've spent over 500 hrs on interviews over the past 3 months doing countless coding tests/exercises, whiteboard interviews.
I just seem to never get past on-site interviews.
> I have over a decade of experience building web apps.
Were you also running those apps? SRE means you understand the intricacies of running an app too.
When I hire SREs, I look for people who have the following skills, in this order:
1. Leadership under pressure. What I mean is: can you stay cool and calm, and keep everyone around you cool and calm, when everything is melting down?
2. Experience operating a platform. Do you know basics like networking, system startup, system and OS tuning, etc.? Can you diagnose a problem on a running instance?
3. Coding. Can you write decent code, and can you understand good code?
The reason it is in that order is because staying cool under pressure is something I can't really teach you, it's just sort of innate for the most part.
If they made it only as simple as it needs to be, then they couldn't patent very much. Investors want exclusivity, lots of convoluted tech-speak, big grants, etc. :P
Most major apps phone home for a big config object (likely JSON) at the start of a run. This would contain things like car icons, etc. You can see an example of this in the 3p Uber API which has a call to get which car types are available at a given geolocation. This API returns not only the vehicle types (Uber X, Uber Black, etc) but also a jpeg icon representing the car. In this way Uber can roll out new car types in locations without a client side update.
I'm biased for this next part (since I work on the product) but if you're interested in making your app have abilities like this check out Firebase Remote Config (https://firebase.google.com/docs/remote-config/). While setting up your own config service is not rocket science, having a free one with a web UI is pretty nice.
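A generic sketch of the "phone home for config" pattern; the endpoint and payload shape here are hypothetical, not Uber's or Firebase's actual API:

```python
# Client-side config fetch: the app asks the backend what to show for
# the current location. The URL and response fields are hypothetical.
import requests

def fetch_app_config(lat, lng):
    resp = requests.get(
        "https://api.example.com/v1/config",
        params={"latitude": lat, "longitude": lng},
        timeout=5)
    resp.raise_for_status()
    return resp.json()

config = fetch_app_config(37.77, -122.42)
# e.g. config["products"] -> [{"display_name": "uberX",
#                              "image": "https://.../uberx.jpg"}, ...]
# New car types appear in new cities with zero client-side changes.
```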
Thanks for your answer. I suppose the interesting code is on the app side rather than the server side (which basically returns a bunch of JSON).
How do you architect a view layer that's so malleable? For example, even the routes in Uber were shown in rainbow colors during Pride.
Last time I checked, their iOS app pinged the backend around every 10 seconds for a big ol' payload of JSON config etc., which contained most of the A/B stuff and UI config (which car types to display, e.g., since where you are they might not have all of X/taxi/Black/Lux etc).
Most likely just a simple alerts or announcements model that they fetch from the server and display in a pre-allocated section of the iOS view (whatever they call it in iOS), if any. It's actually really good thinking.
They likely already preset views and controllers to react to backend events. If they add an event like "pride" to the backend, the app just has to render the view associated with that event.
I don't think it's that simple. The distinction between "code" and "data" is somewhat arbitrary. I'm sure Uber could get away with a rules engine that supports the cases the parent comment is talking about.
Quite an intricate architecture. I can't help but wonder if all of the complexity and different moving parts are worth it. Does it really make more sense than throwing more resources at a monolithic web service? Clearly the folks at Uber think it does, and they've obviously thought about the problem more than me, but I'd love to understand the reasoning.
"We use Docker containers on Mesos to run our microservices with consistent configurations scalably, with help from Aurora for long-running services and cron jobs."
So much technology, yet I still had to load the site 3 times and fiddle with uMatrix to get the page to scroll. Now, lots of people do silly things with javascript, but on a blog article on your tech stack it doesn't speak well of things.
When I saw the story, it stood distinctly apart on the HN homepage as the only story title with ALL CAPS LOOK AT ME. It was definitely a HN culture faux-pas. Alone by itself this is not a serious indictment, but coming from a company with a reputation for arrogance it seemed to be in particularly poor taste.
The title of the story on HN has since been corrected to normal case.
I'm not an Uber hater, if anything I'm inclined to defend the company. But posting ALL CAPS to HN is either arrogance or carelessness or (most likely) some combination of both. I would not normally pay attention except this is a company which already has a reputation, so maybe it's actually part of the corporate culture? Or maybe I'm overthinking it.
I think you're massively overthinking it. The poster probably just copy pasted the headline directly from the blog when submitting the link. "All caps" is sorta the Uber aesthetic and looks totally normal in the context of the blog post itself.
Sometimes this happens because people use the HN bookmarklet to submit a post and the original article uses all-caps typography for its title (as this one does). But it's so rare for such cases to make the front page that I don't think we need to worry much about it. If it ever becomes a problem it shouldn't be hard to deal with.
All: it's fixed now, so please let's talk about the article rather than title mishaps.
For those of you complaining about the title being all caps, it was done so for aesthetic purposes. Which means somehow the submitter went through the time to uppercase each character of the HN title before submitting.
Sounds like a very solid foundation! I'm glad to see they have a sufficient system in place to continue spamming the heck out of people who never opted into their advertisements in the first place.
/sarcasm
I only wish LE would treat CAN-SPAM seriously and put more resources into criminal enforcement.
I just got rejected from them. I applied for a SE position, but they didn't like me, I guess. They send you this really condescending rejection letter. I showed them my programming language that I built in C from scratch, and also my data structure library where I implement all the common data structures found in high-level languages, built from scratch in C, among many other projects.
I have. It must have been my state school that turned them off. I know I could keep up there, but maybe they also turned me down because I'm 5 states away and they thought I wasn't worth the recruiter's time.
edit: downvoter, if you could provide your rationale that would be great.
When I first started interviewing, I got turned down at a lot of companies because they were concerned about my self taught background, often in spite of strong project work and interviews. In spite of so many rejections (2 offers after 17+ interviews), I've been wildly successful in my current job—I received a promotion in the first 6 months and have since held down tech lead roles.
Look, the bottom line is that companies optimize for false negatives. In order to achieve a high accuracy rate, tests must have exceptionally low false positive rates (https://www.math.hmc.edu/funfacts/ffiles/30002.6.shtml), just based on stats. I won't work out the math for you—you sound like you're perfectly capable of plugging numbers into Bayes' theorem—but that implies that even very good engineers are likely to get false-negative rejections at many companies. It does not mean, however, that those companies are necessarily judging you based on unfair criteria, and I don't think it's fair, thoughtful, or mature to indicate otherwise (especially because some of my strongest coworkers at Uber are from less prestigious schools in the Midwest...).
I've interviewed at Uber as well. The truth is, failing or passing an interview is a really bad indicator of how good a coder you are. My Uber interview was like any other interview. Don't take it personally. Keep calm and move on.