Changing the Engine While Flying: Scaling and Growth in a Cloud World
People today have high expectations of their software and technology, and the users of Axon Evidence are no exception. Our customers demand, and deserve, a product that grows and adapts to the changing world of public safety, enables them to do more with shrinking budgets and resources, and is always available when they need it - 24 hours a day, 7 days a week.
As we expand our platform to cover every officer nationwide, building a system to scale with these challenges puts a tough demand on engineering teams. How do we innovate quickly to help our customers with their changing and growing needs, but also build and scale systems for growth? It seems so daunting a task that the phrase “changing the engine while flying” comes to mind. However, with some creativity, planning, and a bit of discipline, it is possible to upgrade or replace the most complex of our systems mid-flight.
Establish Goals and Milestones
Early on in 2016 it had become apparent that our search infrastructure for Evidence.com was unable to scale to the demands of our growing data volumes. Indexing was inordinately slow, the system required significant maintenance, and recovery from infrastructure issues was painful. As new data types and complexity came onboard, we needed a better and faster search indexing solution.
The team got together at the whiteboard to define both what we wanted out of the search indexing replacement and what we did not want. We needed a significant improvement in indexing performance - full reindexes should take a matter of hours and incremental nightly reindexes a matter of minutes, and our current system was not meeting those requirements.
We also did not want to substantially break search indexing compatibility with the front-end systems already in place. Having the old and new systems be compatible would enable us to quickly alternate between the two in case of unforeseen problems. That meant that instead of switching indexing technologies immediately, we would continue to rely on Apache Solr, our current solution, as our main indexing system, but innovate all of the technology around it to meet our business needs.
Success was easy to measure - could we switch traffic to a new search indexing system without any regression in search capabilities, performance, index size, or result quality?
It may come across as obvious, but first set out to establish why you are making the change, and what success will look like. With the intense rate of innovation in software engineering, it's difficult to resist rewriting something simply to follow the latest design pattern or technology stack that will “magically” solve a lot of your problems. Stop that!
Another pit that I commonly see engineers falling into is designing for an ambiguous future. Sit down and really think about what your needs will be two years from now and try to design your system around that goal. For brand new systems it may be even shorter - a year or less. Accept that you will eventually need to overhaul your first attempt and don't be averse to that reality.
Agility and time-to-market are often much more important than any additional overhead of refactoring. Don't neglect the fact that your engineering skills will also grow during this timeframe - use that opportunity to refactor and upgrade along the way.
Experimentation is quite possibly the most important, and least practiced, task in engineering - and most likely, your team is doing it wrong.
Now I know what you're saying - “Our team experiments! We try everything out and build a prototype to prove our ideas.” Successful experimentation requires much more than that. You need to establish a hypothesis, collect data, and communicate results to ensure that your new system design has a chance to meet the goals you determined earlier in the process. Well-designed experiments, although time-consuming, will net huge wins down the line.
For our search indexing replacement project, we opted to build a system modeled after Command Query Responsibility Segregation (CQRS) - a design pattern that, while potentially complex, had the best chance of adapting our data structures to our requirements.
We designed and ran several experiments to validate key design criteria such as:
- Can we extract data at a rate fast enough to meet our SLA?
- Would the CQRS modifications to our service architecture negatively impact performance during bulk and incremental search indexing operations?
- What was the average time to refactor an existing service to integrate it into the new system, and how many potential and actual bugs were introduced during the process?
The team spent a few weeks designing and building these experiments - writing benchmarks and testing the design against real-world data spanning back several years. It helped us significantly improve several aspects of our approach and head off potential bugs early on.
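A throughput experiment of this kind can be sketched in a few lines. This is a minimal illustration, not the benchmark the team actually wrote: the function names, the injected clock, and the record counts are all hypothetical, and a real run would read from production-like data rather than a generator.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_throughput(
    extract: Callable[[], Iterable[dict]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float]:
    """Run one extraction pass and return (record_count, records_per_second)."""
    start = clock()
    count = sum(1 for _ in extract())
    elapsed = clock() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def meets_target(total_records: int, rate: float, budget_hours: float) -> bool:
    """Hypothesis check: would a full reindex at this rate fit in the budget?"""
    projected_hours = total_records / rate / 3600
    return projected_hours <= budget_hours


def sample_extract():
    """Hypothetical stand-in for the real data extraction path."""
    for i in range(10_000):
        yield {"id": i}


if __name__ == "__main__":
    count, rate = measure_throughput(sample_extract)
    # 50 million records and a 4 hour budget are made-up numbers.
    print(count, round(rate), meets_target(50_000_000, rate, budget_hours=4))
```

Stating the target as a function makes the hypothesis explicit before any data is collected, which is the part most prototypes skip.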
Investing in Falsework
Modern cloud software development mimics the building industry in many ways. Just as you wouldn't normally tear down a home to remodel your kitchen, you generally do not replace your entire cloud software stack in one fell swoop.
In construction, falsework refers to the use of temporary structures to support objects until they can fully stand on their own. In software you often need to build such temporary structures to allow seamless testing and integration of your new concepts and support controlled rollouts.
In our old search system, services that needed data indexed would write messages directly to a queue describing what needed to be indexed, and a worker would pull those messages off the queue and process them into Apache Solr.
The new indexing system would behave similarly, but with a few significant changes. First off, the message format needed to be changed to support additional use cases. Second, due to the use of a new high-speed streaming API for bulk indexing and a Quality of Service (QoS) feature to ensure that UI updates are processed before background work, we needed additional queues to support the system.
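One simple way to get "UI updates before background work" out of several queues is strict priority: a worker always drains the highest-priority queue that has anything in it. The sketch below assumes that policy and uses in-memory deques; the class names `ui` and `background` are hypothetical, and the source does not say which scheduling policy the real system uses.

```python
from collections import deque


class QosQueues:
    """One FIFO queue per QoS class, served in strict priority order."""

    def __init__(self, classes):
        # `classes` is ordered from highest to lowest priority.
        # Dicts preserve insertion order, so iteration follows priority.
        self._queues = {name: deque() for name in classes}

    def put(self, qos_class, message):
        self._queues[qos_class].append(message)

    def get(self):
        """Return the next message from the highest-priority non-empty queue,
        or None when every queue is empty."""
        for queue in self._queues.values():
            if queue:
                return queue.popleft()
        return None


if __name__ == "__main__":
    queues = QosQueues(["ui", "background"])
    queues.put("background", "reindex agency 17")
    queues.put("ui", "title edited on evidence 42")
    print(queues.get())  # the UI update is served first
    print(queues.get())
```

Strict priority can starve background work under sustained UI load, so a production system may need weighting or rate limits on top of this.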
Our plan for replacing search indexing involved a multi-stage rollout. The first phase would deploy the new system into production and have it run in parallel to the existing system, but in write-only mode to a second data store. This would allow us to monitor the system over several days and check for any errors in processing. The second phase would roll customers into the new search engine indexing infrastructure slowly, giving us the ability to quickly switch back in case of any errors.
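The second phase, rolling customers over gradually with a fast way back, can be sketched as a percentage gate with a kill switch. This is an assumption about how such a gate might look, not a description of our rollout tooling; `Rollout`, `bucket`, and the customer identifiers are hypothetical names.

```python
import hashlib


def bucket(customer_id: str, buckets: int = 100) -> int:
    """Map a customer to a stable bucket in [0, buckets).

    A cryptographic hash is used only because it is stable across
    processes and restarts, unlike Python's built-in hash()."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


class Rollout:
    """Decides, per customer, whether to serve from the new indexer."""

    def __init__(self, percent: int = 0, kill_switch: bool = False):
        self.percent = percent          # 0..100, raised as confidence grows
        self.kill_switch = kill_switch  # True sends everyone back to the old path

    def use_new_indexer(self, customer_id: str) -> bool:
        if self.kill_switch:
            return False
        return bucket(customer_id) < self.percent


if __name__ == "__main__":
    rollout = Rollout(percent=10)
    print(rollout.use_new_indexer("agency-001"))
    rollout.kill_switch = True  # switch back quickly if errors appear
    print(rollout.use_new_indexer("agency-001"))
```

Because each customer's bucket never changes, raising the percentage only adds customers; nobody flips back and forth between systems as the rollout widens.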
The team decided to build a new micro-service to receive search indexing requests over Apache Thrift and enqueue them into one or more processing queues. This buffer between the producer and consumer allowed us to transform and multicast incoming requests into multiple formats and locations. Replacing all of the old queue-based code with this new service call wasn't the most glamorous work, but it did offer us the ability to run multiple versions of search indexing and queues in our QA environment during development. Today we use this feature to do controlled upgrades of the indexing system without having to take any downtime.
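The transform-and-multicast idea can be illustrated without Thrift or a real queue. In this sketch the router, both message formats, and the field names are invented for illustration; plain Python lists stand in for the queues, and the real service's wire format is not described in this post.

```python
import json


def to_legacy_message(request: dict) -> str:
    """Hypothetical old format: just the identifier of the entity to index."""
    return request["entity_id"]


def to_v2_message(request: dict) -> str:
    """Hypothetical new format: a versioned JSON envelope."""
    return json.dumps({"version": 2, **request}, sort_keys=True)


class IndexRequestRouter:
    """Accepts a request once and fans it out to every registered route."""

    def __init__(self):
        self._routes = []  # list of (transform, sink) pairs

    def add_route(self, transform, sink):
        self._routes.append((transform, sink))

    def submit(self, request: dict) -> int:
        """Deliver the request to each route; return how many were written."""
        for transform, sink in self._routes:
            sink(transform(request))
        return len(self._routes)


if __name__ == "__main__":
    old_queue, new_queue = [], []  # stand-ins for real message queues
    router = IndexRequestRouter()
    router.add_route(to_legacy_message, old_queue.append)
    router.add_route(to_v2_message, new_queue.append)
    router.submit({"entity_id": "evidence-42", "operation": "upsert"})
    print(old_queue, new_queue)
```

Producers only ever call `submit`, so adding, removing, or reformatting a downstream queue becomes a routing change rather than a change in every service.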
In a 24x7 cloud environment, continuity of service is extremely important. Good scaffolding and rollout infrastructure is critical to maintaining your SLA. Don't let your fear of the “old codebase” scare you from making targeted improvements to the old system in support of your replacement.
Good scaffolding also enables you to build known replacement points in your infrastructure. Replacing the queue-based system with a journal (such as Apache Kafka) is now a straightforward and lower-risk change, which helps future-proof the design.
Determine the Right Amount of Change
One of the tenets of software engineering at Google is the concept of frequent rewrites. Software is often replaced every few years to cut away unneeded complexity or requirements which are no longer applicable.
When rewriting search indexing, we felt that it might be important for our growing business to move to a different search indexing technology that offered some advantages in automatically growing large search clusters and supporting new data structures. Unfortunately, making this change would have required a substantial rewrite of our search query engine to use the new system's search syntax, and would have forced us to release the search changes in a single big-bang cutover.
Instead of taking on this complexity up front, we decided to invest in writing our own domain-specific search language (using the flexible and easy Scala parser-combinators) that transpiles into Apache Solr query syntax. This approach not only gave us the flexibility to eventually replace the search indexer and query language, it also (somewhat accidentally) launched a new feature! Several customers now take advantage of our search DSL through the Evidence.com public API platform. This enables them to quickly integrate our software with their agency's specific tools and processes and has been quite a hit for those customers who use it.
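To make the transpiling idea concrete, here is a toy version. It is not our DSL: the grammar (`field = "value"` clauses joined by `and` / `or`, with parentheses) is invented for illustration, and a small hand-written recursive-descent parser in Python stands in for Scala parser-combinators. The output uses only the fielded phrase, `AND`, `OR`, and grouping forms of Solr's standard query parser.

```python
import re

# One token per match: punctuation, a double-quoted string, or a bare word.
# Strings may not contain quotes or backslashes, so no escaping is needed.
_TOKEN = re.compile(r'\s*(?:(?P<punct>[()=])|"(?P<string>[^"\\]*)"|(?P<word>\w+))')


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


class _Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def take(self, kind, value=None):
        token = self.peek()
        if token[0] != kind or (value is not None and token[1] != value):
            raise ValueError(f"expected {value or kind}, got {token[1]!r}")
        self.pos += 1
        return token[1]

    def _group(self, operand, keyword, solr_op):
        parts = [operand()]
        while self.peek() == ("word", keyword):
            self.pos += 1
            parts.append(operand())
        # Always parenthesize multi-operand groups so precedence is explicit.
        return parts[0] if len(parts) == 1 else "(" + f" {solr_op} ".join(parts) + ")"

    def expr(self):    # expr := term ("or" term)*
        return self._group(self.term, "or", "OR")

    def term(self):    # term := factor ("and" factor)*
        return self._group(self.factor, "and", "AND")

    def factor(self):  # factor := "(" expr ")" | word "=" string
        if self.peek() == ("punct", "("):
            self.pos += 1
            inner = self.expr()
            self.take("punct", ")")
            return inner
        field = self.take("word")
        self.take("punct", "=")
        return f'{field}:"{self.take("string")}"'


def to_solr(query):
    """Transpile the toy DSL into a Solr query string."""
    parser = _Parser(tokenize(query))
    result = parser.expr()
    if parser.pos != len(parser.tokens):
        raise ValueError("unexpected trailing input")
    return result


if __name__ == "__main__":
    print(to_solr('status = "active" and (owner = "smith" or owner = "jones")'))
    # (status:"active" AND (owner:"smith" OR owner:"jones"))
```

Because callers only ever see the DSL, the emitter at the bottom is the single place that would change if the backing search engine were swapped.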
When designing systems, strong interfaces and data contracts are more important than making the absolute correct algorithmic, platform, or technology choice. As an engineer, assume that someone will come after you and potentially rewrite large portions of the software to meet new requirements or needs. I have found that by focusing on making the next engineer's job easier, with well-defined interfaces between systems that are easily replaced and code that is designed to be defensive and difficult to misuse, you end up designing a more robust and reliable platform overall.
Following these practices also makes it easier for you to go back to the drawing board and throw out bad ideas. Designing for frequent rewrites helps to keep the sunk cost fallacy at bay.
Looking too far into the future and trying to solve too many problems too soon can often be worse than not doing enough. Two years ago systems such as Axon Fleet were not even on the roadmap. Two years from now our challenges may look substantially different.
So you've invested time and effort into replacing a major component of your infrastructure. The team has run experiments, built microbenchmarks, run synthetic load tests in a stress or staging environment, and built rollout tools that let you precisely control rollouts down to the microsecond. How do you know it's all working?
A system without good metrics is guaranteed to fail, but what and how should you monitor? For Evidence.com we use a combination of real-time and aggregation systems to measure system performance in great detail. A good rule of thumb is to measure and trace all external boundaries - such as overall request time, database calls, and external service calls.
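Measuring at the boundaries can be as simple as wrapping every outbound call in a timer. The sketch below is an in-memory illustration only; `BoundaryMetrics` and the boundary names are hypothetical, and a real deployment would ship these numbers to a metrics or tracing backend instead of keeping them in a dict.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class BoundaryMetrics:
    """Records duration and error counts for each named external boundary."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations = defaultdict(list)  # boundary name -> seconds per call
        self.errors = defaultdict(int)      # boundary name -> failed calls

    @contextmanager
    def timed(self, boundary):
        start = self._clock()
        try:
            yield
        except Exception:
            self.errors[boundary] += 1
            raise  # never swallow the caller's exception
        finally:
            # Failed calls are timed too; slow failures matter.
            self.durations[boundary].append(self._clock() - start)


if __name__ == "__main__":
    metrics = BoundaryMetrics()
    with metrics.timed("database.query"):
        time.sleep(0.01)  # stand-in for a real database call
    with metrics.timed("search.select"):
        time.sleep(0.02)  # stand-in for a real search request
    for name, samples in metrics.durations.items():
        print(name, len(samples), f"{max(samples):.3f}s")
```

Naming each boundary consistently is what later lets a dashboard line up request time against the database and service calls that made it up.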
It's important that your dashboards accurately represent the experience of customers using the site, with metrics that span systems. As systems become more complex and interconnected, platforms such as Zipkin or OpenTracing are also great tools for debugging complex call graphs and measuring site performance.
I personally obsess over dashboards. A good dashboard helps the engineers on the project quickly pinpoint any potential system failures. A great dashboard is self-explanatory enough that any engineer within your organization can use it to diagnose service problems.
Modern cloud systems require constant care and feeding. With so many moving parts, and with internal and external dependencies constantly in flux, it's easy for failures in upstream systems or unanticipated behavioral changes to significantly impact your system or exacerbate conditions which were once rare. You can't abandon systems, and a culture of continuous improvement is key to exceeding customer expectations. Evidence.com is designed to build and deploy quickly, allowing engineers to safely update the system several times per day. Continuous integration and deployment are important tools to maintain this culture - nothing is more rewarding to an engineer than seeing their hard work deployed and in the hands of their customers or peers.
The era of the cloud has opened up new opportunities and challenges for software engineering. Our complex modern world, with its global user bases, puts new demands on our technology to always be there for us, available at a moment's notice. However, with some discipline, skill, and planning, you too can change the engine while flying.