In this No Dumb Questions, Phoebe is joined by Stack Overflow’s tech lead for the infrastructure team, Josh Zhang, who manages all of our cloud infrastructure. He teaches her about cloud computing and data centers, the difference between CPUs and GPUs, and what it actually took to migrate Stack Overflow to the cloud.
Phoebe Sajor: Josh, can you tell me in the simplest terms…what is cloud computing?
Josh Zhang: In the simplest terms: someone else’s computer. So historically people would run their own data centers. That would mean you would have to rent a space from a company, manage your own internet, and then manage your own racks, buy your hardware, all that refresh cycle. So it required a lot of specialization. You needed a hardware engineer. Then you needed somebody to lay software onto the hardware. You needed purchasing contracts. All very cumbersome for somebody who’s trying to start up or somebody who doesn’t want to manage all that and be stuck to a location. So cloud computing’s promise was that you don’t have to deal with any of that. It’s software driven. So instead of having a human being take an order, rack the hardware, and do a bunch of physical stuff before an engineer could access it through software, you can just declare it and configure it through a cloud interface. And that made it a lot quicker for smaller companies to get spun up and get access to infrastructure that would usually have a pretty high upfront cost.
PS: I know containers and nodes are the basis of cloud computing, but what exactly are they? How does your data go into them? How does it all work under the hood?
JS: So containers are a new unit of packaging for software. In the very old days you would have a server and you would put an OS on it, whether it’s Linux or Windows, and then you would install your software directly onto the operating system. The software would have package requirements. For instance, we’re a .NET shop so we would have to install .NET onto the server itself and then we run the software on it. The problem is that it’s really expensive. You have an entire server that’s configured in a very specific way for one application. Later on, there was the concept of a virtual machine. So you slice up a single server into multiple smaller servers by dividing the resources. That way you can divide a physical server into smaller pieces because applications generally don’t need an entire server’s worth of resources. That gives you cost savings but also space savings.
That was the next logical step. But after that people realized VMs were very bulky. You have to install an operating system in each virtual machine. That’s repeated and you don’t need all of it. So the concept of Docker basically came about to package software. You take a very minimal OS install, usually some distro of Linux that’s very, very tiny. Then you can add packages into it and it runs inside its own container and doesn’t interfere with other software you need to run. You can run multiple Docker applications on the same computer and they won’t interfere with each other. Instead of splitting up a server into multiple small servers, you’re splitting up a server into multiple tiny little self-contained application run spaces. That’s what Docker does.
But you would need something to orchestrate it because Docker is just the individual application and how you package it. The orchestration layer is called Kubernetes. Kubernetes is basically how you utilize servers to allow different Docker applications to run. Then there’s the concept of pods. On a physical server, if I wanted to have one application on it and that server dies, then your application dies. In the reliability space you would say, “Okay, I want to run two servers.” That’s very expensive, but that way if one dies the other one’s still there. That’s called redundancy. But when you move to the Docker level, that’s called a pod. You have the same application running side by side—a pod of two—so that if one application dies, that’s just one of two in the pod. The other one keeps running. And that’s the kind of thing Kubernetes orchestrates.
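Josh’s “pod of two” can be sketched as a toy reconciliation loop. This is an illustrative plain-Python sketch, not Kubernetes itself; the replica names and desired count are made up for the example:

```python
# Toy sketch (plain Python, not Kubernetes): a supervisor keeps a
# "pod of two" alive by replacing any replica that dies—the kind of
# reconciliation loop Kubernetes runs for you.

DESIRED_REPLICAS = 2  # hypothetical desired replica count

def reconcile(running, next_id):
    """Top up the running replicas to the desired count, naming new ones."""
    replicas = list(running)
    while len(replicas) < DESIRED_REPLICAS:
        replicas.append(f"app-replica-{next_id}")
        next_id += 1
    return replicas, next_id

replicas, next_id = ["app-replica-1", "app-replica-2"], 3
replicas.remove("app-replica-1")                  # one replica dies
replicas, next_id = reconcile(replicas, next_id)  # the orchestrator replaces it
print(replicas)  # ['app-replica-2', 'app-replica-3']
```

The point of the sketch is that the application never drops to zero copies: the surviving replica keeps serving while the lost one is replaced.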
PS: That actually makes a lot of sense. Earlier you said the cloud was basically “using someone else’s computer.” What does that mean? Are we just running everything on AWS’ computers?

JS: So when Amazon was spinning up their data centers, they realized they had a lot of extra capacity and extra servers. So they said, “Hey, we have all this extra room that we’re not doing anything with. Those are just idle servers. What if we give people access to this extra headroom we have since we have to invest in the infrastructure costs anyways?” That’s how AWS was spun off. The idea is you are borrowing a sliver of their servers to run your own applications. You’re taking advantage of Amazon’s scale because they’re buying servers from all the major manufacturers so they can get a better price. They’re offering you a slight discount based on that, and they can make money off of the idle servers they’d have anyway.
PS: This is why it’s good for me to ask these questions because I always thought the cloud was a whole different thing from data centers. But everything’s still running on hardware. Is that why AI companies are buying up towns in Michigan and Texas to create data centers? Are AI companies using so much compute that they are going far beyond what AWS could offer in their own cloud?
JS: Yes. All of the cloud providers are struggling for capacity because there’s also been a shift. Before, everybody was using standard compute on an Intel or AMD CPU. Those have gotten really dense in terms of the number of CPUs per unit. In the data center space, it’s a big rack, and each rack holds a certain amount of height. It’s all standardized into what’s called units. A standard server can be two to three units, but they got small enough that a single unit could have 128 CPUs, 256 CPUs. That’s a lot. And they can cram a bunch of RAM in there, which means that in a tiny space, a lot of people can share that compute. That was how it was for a very long time.
Then the AI boom came and everybody had to move to GPUs, basically processors from NVIDIA. Those are incredibly power intensive, but also they’re big because it’s a different kind of computing power. Those chips are actually very big and I think even at the smallest those are like three to four unit height servers. They don’t hold the same amount of capacity because the compute power required for AI is not the same density as traditional CPU compute.
So you’re talking about data centers that were doing just fine and they were keeping up with everything. Now they have to account for all these bigger servers with more space requirements, more power requirements, and more cooling requirements, all in the same amount of space that they already have. That’s why there’s this data center boom. Everybody has to build more space, get more power and all that so they can accommodate these brand new class of compute that everyone’s using.
PS: I remember before I was at Stack hearing that if they don’t figure out how to get the chips to have a higher capacity we’re all screwed because we’re not going to be able to deal with all of the data that exists on the internet. So, it sounds like AI is pushing that even further because it’s using so much compute. Before we get back to cloud computing, can I ask…what’s the difference between a CPU and a GPU? It sounds like GPUs are more powerful than CPUs.
JS: CPUs are very generalist. Every computer has a CPU because it’s a generalist processing unit. CPU stands for central processing unit. Boiled down, it’s ones and zeros, yeses and nos—binary. It can do anything and everything, but not necessarily one thing specifically very well. Originally, a GPU was used to render graphics, like in video games. NVIDIA was one of the original pioneers in that space and developed GPUs that did math very well—specifically matrix math. NVIDIA figured out, “Hey, we can apply this matrix math power to also process AI workloads.” AI workloads involve a lot of matrix math, so GPUs specialize in that. If you were to tell a GPU to do what a CPU does well, it would do it very slowly. But because it specializes in this specific form of mathematics and that’s where the current compute space is, everyone’s going to GPUs. The current trend is very matrix-math heavy. There’s still a need for standard CPU compute for our normal applications; it’s just that right now the tech space is using a lot of GPUs to crunch the data and get the models out. That’s what’s taking up most of the space.
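The matrix math Josh describes is, at its core, just a grid of independent multiply-and-add operations. Here’s a minimal pure-Python sketch of a matrix multiply; the point is that every output cell can be computed independently, which is exactly the parallelism a GPU exploits:

```python
# Sketch: the core operation behind both graphics and neural networks is
# matrix multiplication. A CPU works through these multiply-adds largely
# one after another; a GPU runs thousands of them at once.

def matmul(a, b):
    """Naive matrix multiply: each output cell is a row-by-column dot product."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

# A 2x2 example: all 4 output cells are independent of each other.
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

In an AI workload the matrices have thousands of rows and columns rather than two, which is why the specialized hardware matters so much.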
PS: I love it. I’ve just learned so much. Back to cloud computing. It sounds like maybe one of the big deals about cloud computing is that it was cheaper. Is that true?
JS: Nope. When we were moving to the cloud we had a senior director who said, “The one thing in the cloud that definitely scales is your bill.” I still say that to this day. So the promise of the cloud is more flexibility. You can scale up and you can scale down very quickly. If we were in a data center and we needed more capacity, I would have to call Dell and buy a new server. It would take so long to ship to me and then I would have to rack it. Then I’d have to configure the software. The lead time is incredibly long. In the cloud, if I want more compute, I just type a command and in like a minute I will have more compute. That’s just how it works.
In theory, if you’re very smart with spinning up and spinning down to only what you need, it might be close to the cost of running a data center. And in a data center, you can choose not to buy new hardware if everything is still running fine on the older hardware, and do other things to optimize costs. But I can say from actual experience on our end, the cloud is not cheaper than running a data center. You’re just allocating the resources differently. To run a data center, you need somebody who knows hardware and hardware installation. Those people are incredibly specialized at doing just that. If you’re in the cloud, a standard engineer can handle most of those workloads, so you don’t need specialty people who just handle one specific thing.
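The trade-off Josh describes can be put into back-of-the-envelope numbers. All the rates and server counts below are made up for illustration; the shape of the comparison is what matters: a data center pays a lower rate but for peak capacity around the clock, while the cloud pays a higher rate only for what’s running:

```python
# Hypothetical numbers only: a rough sketch of why the cloud is flexible
# but not automatically cheaper ("the one thing that definitely scales
# is your bill").

CLOUD_RATE = 1.00  # made-up $/server-hour in the cloud
DC_RATE = 0.60     # made-up amortized $/server-hour in a data center

peak_servers, quiet_servers = 100, 30
peak_hours, quiet_hours = 8, 16  # hours per day

# Data center: you own peak capacity, so you pay for it 24 hours a day.
dc_daily = DC_RATE * peak_servers * (peak_hours + quiet_hours)

# Cloud, if you never scale down vs. if you scale down overnight.
cloud_always_peak = CLOUD_RATE * peak_servers * (peak_hours + quiet_hours)
cloud_autoscaled = CLOUD_RATE * (peak_servers * peak_hours
                                 + quiet_servers * quiet_hours)

print(dc_daily, cloud_always_peak, cloud_autoscaled)  # 1440.0 2400.0 1280.0
```

With these invented rates, the cloud only beats the data center if you actually scale down during quiet hours; left at peak capacity, it costs substantially more.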
PS: Got it. So the big deal was the flexibility, the ability to scale. That’s why everyone and their mom moved to the cloud. Now we can build things easier without having to have a person on site. That used to be you, right? You’d go in and check and make sure all of our stuff is up, and if something went down, you’d check. Would you literally go and look at all of the hardware racks to see what was working and what wasn’t?
JS: So that stuff gets really specialized, but we have monitoring software and other things that we use. I was 20 minutes away from the data center when I lived downtown. It was quite nice. But I still didn’t want to waste the trip for no reason. The hardware manufacturers give you software for monitoring, and they even have predictive software that says, “This hard drive, based on how it’s behaving, is probably going to fail in this amount of time.” All of that has gotten very mature.
PS: What is the process of going from a physical data center to the cloud? How long does it take? How do you prepare for it? What does that look like? We just recently finished doing this at Stack so I know you know it well.
JS: Well, discovery is the biggest part with a data center. At any company, there might be things running that you just don’t know about. You have to figure out everything you care about and the things you don’t. Then you start migrating them and you have to find cloud equivalents. For instance, in our data center we had load balancers, and those were configured in a very data-center-centric way. We had to figure out a cloud analog for that. There are one-to-one cloud analogs, but they’re not efficient and they’re expensive. Some people brute force it by just provisioning servers in the cloud 1:1 with their data center, but you’re going to incur extra cost by doing that. So we took a category of every application we have, decided what’s moving and what’s not moving, and tried to figure out the right path to migrate it.
Kubernetes was one of them. We took our application and had to update it to be containerized. That means everything from the developer all the way to deployment had to become “cloud native.” That was the conversion process. Then there was the migration. The way we did that was we treated the cloud like a third data center. We had a load balancer in the very front that basically asked, “Where does traffic go?” Stack Overflow actually ran in two data centers—one in New York and one in Denver. And we said, “Now, there’s one in the cloud,” and slowly deployed things over to it and tested it. We basically just pointed traffic slowly to the cloud and we monitored traffic and other telemetry. That’s migration. I’m oversimplifying a lot of it because it’s incredibly complex. If I could do it incredibly well I could probably start a consulting firm and make millions because it’s difficult. So many people were involved. But that’s the very high-level idea. And once you’re moved over, honestly, the rest was pretty straightforward. There’s a lot of vendors that’ll just come and take your hardware. That’s more or less what we did. We unracked things and we paid them to take care of it. All of it had to be securely disposed of, so it was all crushed. So we didn’t even have to be careful. At the end we were just cutting cables and throwing stuff around. It was kind of fun.
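The “slowly pointed traffic to the cloud” step can be sketched as weighted routing at the front load balancer. This is a toy illustration, not Stack Overflow’s actual configuration; the backend names and traffic weights are invented:

```python
# Toy sketch of the cutover pattern described above: a front load balancer
# sends a small, adjustable fraction of requests to the cloud while the
# rest still go to the New York and Denver data centers. Weights are made up.
import random

def pick_backend(weights, rng):
    """Choose a backend in proportion to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

weights = {"ny": 0.45, "denver": 0.45, "cloud": 0.10}  # start the cloud small
rng = random.Random(0)  # seeded for a repeatable demo
sample = [pick_backend(weights, rng) for _ in range(10_000)]
print(round(sample.count("cloud") / len(sample), 2))  # roughly 0.10
```

Migrating is then a matter of nudging the cloud weight up while watching telemetry, and rolling it back down if anything looks wrong.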
PS: Is there anything else a newbie like me should know about cloud computing?
JS: A lot of the data centers are closer than you think. A lot of them are just in warehouses—big old warehouses that you just think are normal warehouses. Ours was actually in a high-rise on the 17th floor in Jersey City that had a beautiful view of the Statue of Liberty, which is incredibly weird.
PS: All of the Stack Overflow data was looking out the window and staring at the Statue of Liberty. I’ll have to look out for those empty floors on buildings sometime although I’m sure they won’t let me in.
JS: Yeah, entering a data center is actually super interesting. They’re hyper secure. They use fingerprint and eye scans and mantraps.
PS: That’s like secret spy stuff.
JS: It really is.