When the Financial Industry Regulatory Authority (FINRA), the private regulatory body that oversees US brokerage firms and securities markets, was planning for a Securities and Exchange Commission (SEC) mandate to improve transactional transparency, it initially considered building a $500 million system based on supercomputers and data warehousing. Before the FINRA board greenlighted that proposal, however, the new CIO suggested exploring a public-cloud option, one that eventually exceeded all expectations for less than half the cost. Steve Randich, FINRA’s CIO, discussed his organization’s path-breaking cloud migration, and shared his advice for others on the journey, with McKinsey’s Chhavi Arora, Kaavini Takkar, Venkatesh Lakshminarayanan, and Brendan Campbell. What follows is an edited version of their conversation.
Using the cloud to deliver on a colossal mandate
McKinsey: How did the Financial Industry Regulatory Authority (FINRA) begin its cloud journey?
Steve Randich: We started the cloud journey in 2013, soon after I joined FINRA. There was a financial-services industry initiative called the Consolidated Audit Trail (CAT), an SEC rule that applied to all US equity and option exchanges in response to the lack of transparency leading up to the 2010 flash crash.
The CAT system was proposed to process 80 billion market events a day, collecting all the quotes, orders, and trades and surveilling them for patterns of market manipulation and fraud. Because the system requires five years of history to examine behavior patterns, it needs petabytes of data to be immediately available at all times.
When I joined FINRA, a team was studying the architecture and technology for a new system. They went to the board of directors with a $500 million proposal to build and run it for five years, using a combination of supercomputers and vertically scalable data-warehouse appliances.
As an alternative, I quickly sanctioned a skunkworks effort to examine public-cloud computing to achieve horizontal scale. By the next board meeting, I submitted a rough proposal to build the CAT in the public cloud for half the cost, or $250 million.
At the time, we probably could have gotten a supercomputer to work, but it would have been too expensive and too proprietary. So we looked hard at public-cloud computing and went all in, moving market surveillance into the public cloud by July 2016. We now process more than 300 petabytes of data. In the end, the effort cost less than $200 million.
On-demand computing and security: Reasons for adopting cloud across the entire enterprise
McKinsey: Why did you decide to move the rest of the enterprise to the cloud?
Steve Randich: The CAT initiative was so successful that we went back to the board and said, “Everything we said we were going to get out of the cloud, we got even more. Can we go ahead and move the rest of our applications?” We started in early 2017 and completed the move just before COVID-19 hit.
At this point, all of our data and applications are in the public cloud. Unlike a lot of companies with consistent processing needs, ours are very spiky, because they are based on market behavior. A busy day in CAT can total over half a trillion market events, while a slow one is closer to 200 billion. There’s no way that we would be able to handle 500 billion market events a day using available conventional technology without it being wildly uneconomical. And our data needs keep growing, so the infrastructure cost savings alone are enough to justify it. So on-demand computing, where we can quickly rev up and down to zero, is key and really pays dividends.
The other enhancement is security. When we first made the board presentation, we weren’t pitching the cloud as something more secure than our data center. But we reversed our position when we realized we were encrypting everything within our cloud service provider (CSP), which we weren’t coming close to doing in our data center.
There are probably 20 to 40 more things that our CSP is doing today that we weren’t doing in our data center—and wouldn’t be able to afford today. That’s not to mention all the audits and testing they do to ensure the ongoing resiliency and innovation they’re putting into their security strategy.
Learning how to ‘do cloud right’ from the founders
McKinsey: Developers often describe security issues as “misconfigurations.” Have you encountered that, and if so, how did you manage it?
Steve Randich: We took a pretty sly approach to this problem. In late 2013 and early 2014, we shopped all the major providers. We asked, “How do we do this right?” So we built an internal cloud.
Because we’re a regulator in the financial-services industry and it was early days, they all gave us their best support. We were dealing with the founding engineers of these companies, and they were all in our offices, trying to help us succeed. They taught us how to do it right. Now we’re so far ahead of other cloud users. We have monthly meetings with the top executives at all the CSPs to discuss all the things we’re working on together. Every time I read about somebody having an outage or a security event, it happened because they weren’t building it in the way the engineers intended.
Dealing with vendor lock-in and concentration risk
McKinsey: Some companies have had regulators voice concerns over concentration risk, such as a nation–state-level attack that compromises the control plane for one of the cloud service providers. How have you thought about that?
Steve Randich: That’s actually our board’s number one concern right now, and one that I spend a lot of time on. It’s a concern because it would take us 18 months at least to move from one CSP to another. One thing we have done is to use open-source products whenever possible and as few proprietary products as possible, even if the CSP claims theirs is better, because we don’t want any more vendor lock-in than we already have.
But I don’t see it that differently than the old concentration risk of using mainframes or being in a single data center with a backup. We will always have vendor and concentration risks that we should all be concerned about. I’ll be the first to admit that I was naive when we started this, thinking that we could do a dual-cloud solution. We’ve talked to dozens of companies that tried—and continue trying—to do it. I speak at conferences on this whole multicloud thing, and it’s not there. It’s probably five to ten years away.
Navigating capacity limitations
McKinsey: What significant challenges or roadblocks have you faced in your cloud journey?
Steve Randich: We migrated our whole shop from February 2014 to December 2019, and looking back, it obviously appears easier than it was. I think our biggest challenge was being ahead of everybody else, and as a trailblazer, we didn’t have anybody to consult besides our CSP.
Overcoming resistance to cloud and upskilling talent
McKinsey: Were there any significant organizational changes during your cloud migration? Did you require a talent upgrade?
Steve Randich: That’s a big part of our story, since we had to completely reorganize. We were in a hybrid data center/cloud mode, and there were senior and middle managers who didn’t know which side they wanted to end up on. We found that 50 percent of them were hanging onto the old world, while the other 50 percent were jumping to the new world.
I think we let the 50 percent that were hanging onto the old world stay around too long. That organizational change took too long, and we ended up with too many people in influential management positions passive-aggressively resisting the move. What’s interesting, and kind of funny, is that the people we let go ended up going to other companies as cloud experts.
The second organizational change really came down to talent. At the very beginning, we realized what we were doing was special, since nobody in financial services was doing it, and certainly no regulators. We knew we had to get the word out to attract talent, so we enlisted people to go to conferences, apply for awards, and do as many interviews and media events as we could to associate FINRA’s name with cool technology.
It worked. We went from hiring people from Fannie Mae and Freddie Mac to landing talent from Facebook, Google, Cloudera, and Amazon. Our attrition rate also went up because so many of our talented people were getting hired away, but we’re still able to attract high-caliber talent.
The other thing is that we didn’t rely on any consultants. First of all, there weren’t many consultants with the skills we needed in 2014, so we had engineers come in and train us. We worked very closely with all the founding cloud engineers to build the skills internally, and by 2017, we had about 600 in-house cloud experts.
The importance of automation and agile
McKinsey: Could you talk about any operating-model changes? Many companies moving to the cloud find it means being more agile in order to adopt best practices.
Steve Randich: I agree 100 percent. Overlooking that is a mistake many people make, because the intention of cloud products and services is to remove the labor and automate as much of the IT cycle, operations, and security as possible.
Prior to the cloud, we were already very agile and had probably 80 percent of our build-test-deploy fully automated. But when we met with the CSP engineers, they let us know it wasn’t enough and said, “You need to automate everything: patching, security, everything.”
So that’s what we did. We now have an incredibly automated operation, the likes of which I’ve never seen. I don’t think anyone I’ve talked to at other organizations compares to our level of automation, and there are a lot of cost savings associated with that. Reliability, time to market, and performance have all improved as well.
For example, one of our key surveillance systems immediately ran 400 times faster once we put it in the cloud. This is a query somebody used to start in the morning, and it might finish after lunch. Now it finishes in seconds. When your reliability, costs, time to market, and performance improve, the IT organization enjoys a better partnership and trust with the businesses.
One of my strategic goals is to remove the constraints of technology so that our businesses don’t have to call somebody in IT to get something done. They used to call us with problems such as a wrong feed from a stock exchange, which meant rebuilding the database to rerun the query with the right information. That once required a team of 100 IT operations people doing 24/7 gymnastics to get it right. But now, it’s completely automated, so the business side can do it themselves.
Embrace the public cloud and beware infrastructure obstruction
McKinsey: What advice would you give other companies whose cloud journey isn’t progressing as swiftly and smoothly?
Steve Randich: I’ll base my reply on the 200 or so companies that came through FINRA over those seven years, as I listened to all of their struggles. A big source of their problems stemmed from a failure to align the organization behind the move to the cloud.
When I spoke to people from large financial organizations, many were convinced that their buying power and scale precluded them from seeing any benefits to the cloud. So I told them, “Eventually, you’re going to be wrong, and you’re going to be wrong too late. So accept the fact that you can’t compete with these cloud vendors.”
Many companies also believed they should do a hybrid or a private cloud for data security and control, and I would tell them, “If you’re going to build a private or hybrid cloud, just give the money you’re going to waste to charity instead. You’ve got to embrace public cloud in order to be on the cutting edge. If you build a hybrid or private cloud, you’re just doing what we all used to do, which is to buy a kit that just sits there 24/7 without getting any on-demand scale.”
The other problem I often saw was an internal political struggle. Whoever runs infrastructure tends to be the most influential IT person in the organization in the eyes of the CEO and CFO, because they can take costs out. And what does that mean? They can convince the CEO and CFO that going to the public cloud isn’t such a good idea.
They can pull that off because of their influence. I can’t tell you how many large organizations I saw that taking place in. Developers would come to me and say, “We really want to do this, but our infrastructure person isn’t supportive of it.” So getting around that is absolutely key.