Every exchange has to find a balance between performance (latency and throughput) and reliability. The two are always at odds: the fastest possible system will also be the least reliable. Take a matching engine as an example: we could keep all the state in memory on a single server and avoid writing anything to disk to get maximum performance; however, all of that data would be lost if the machine crashed.
Exchanges face unique requirements: they must be crazy fast, with consistent sub-millisecond latency, and at the same time nothing should be lost when something fails. The exchange should stay operational even when several machines shut down. So how exactly do exchanges deal with this?
Sharding as a Low-Hanging Fruit
A usual answer to scaling-related technical problems is ‘sharding’, or ‘partitioning’. Indeed, if we could somehow split the state and the work between several independent machines, we would reduce the impact of any single failure. In addition, this would improve performance, as each partition would do less work and therefore run faster.
That said, regular exchanges are centralized by nature: for each instrument, there is a single order book containing the orders of all customers. This book is extremely hard to split between machines, so we have no option but to keep it on a single server. It means that in the absence of some additional technology, such an exchange will not survive the failure of the machine hosting the order book. Exchanges do, however, usually run multiple segments of instruments, with each segment serving only part of the instrument universe.
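Segmentation of this kind can be sketched as a simple routing function. This is a hypothetical illustration, not how any particular exchange partitions its universe: the segment names and the CRC-based assignment are made-up assumptions. The key invariant is that every order for a given instrument lands on the same segment, because that instrument’s entire book lives on one server.

```python
import zlib

# Illustrative segment names (assumption, not a real deployment).
SEGMENTS = ["segment-a", "segment-b", "segment-c"]

def segment_for(instrument: str) -> str:
    """Route an instrument to a segment. A stable hash guarantees that
    all orders for one instrument always hit the same order-book server."""
    return SEGMENTS[zlib.crc32(instrument.encode()) % len(SEGMENTS)]
```

Note the use of `zlib.crc32` rather than Python's built-in `hash`, which is randomized per process and would break the "same instrument, same segment" invariant across restarts.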
Another Obvious Method: Using a Database
If we can’t shard the CLOB (central limit order book), we must take some other measures to ensure that data is not lost and the exchange can endure the loss of a machine (or several machines).
Another obvious idea would be to employ some sort of database: every transaction (e.g. an order or a trade) should be reliably recorded to the database before its results are delivered to the customer. While valid, this solution has its caveats: the database itself must somehow survive crashes, and its latency is usually far greater than a good exchange can afford. In fact, we’re looking at several milliseconds for a single database transaction, while an exchange is typically expected to deliver sub-millisecond end-to-end latency.
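The “record before reply” rule above can be sketched minimally. Here a plain list stands in for the database, and `db_log`, `handle_order`, and the order fields are invented for illustration; in a real system the append would be a durable database commit, and that commit (network hops plus disk sync) is exactly where the milliseconds go.

```python
db_log = []  # stand-in for a durable database table

def handle_order(order: dict) -> str:
    db_log.append(order)           # must be durably recorded first...
    return f"ACK {order['id']}"    # ...only then does the customer see a result
```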
This happens because databases usually require extra network hops and rely on disk persistence, which can be very slow. So it turns out databases are not a good choice for the matching engine of a modern exchange either.
Replicated State Machine as a More Viable Solution
Since we can’t shard the data and we can’t use a database – what options remain? It looks like the only viable solution here is building a replicated state machine. We can treat each order book as a model with a finite set of inputs (say, place an order or cancel all orders) that produce a set of outputs (e.g. trades, market data, or execution reports). Outputs are strictly determined by the internal state and the inputs, and nothing else (we cannot rely on anything else – external clocks, for example).
This model is really useful because we can run several identical instances of a matching engine and copy all the inputs to each of them. If the initial state is the same, the outputs will also be the same – and boom, we now have redundancy! Should any instance fail, the work continues on the remaining ones.
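The determinism argument can be shown with a toy state machine. The class, command names, and fields below are invented for illustration; a real matching engine is vastly more complex, but the property demonstrated is the same: identical initial state plus identical inputs yields identical outputs.

```python
class ToyBook:
    """A toy deterministic order-book state machine (illustrative only)."""

    def __init__(self):
        self.orders = {}  # order_id -> (side, price, qty)

    def apply(self, cmd):
        """Apply one input command; outputs depend only on state + input."""
        if cmd["type"] == "place":
            self.orders[cmd["id"]] = (cmd["side"], cmd["price"], cmd["qty"])
            return [("accepted", cmd["id"])]
        if cmd["type"] == "cancel_all":
            cancelled = sorted(self.orders)
            self.orders.clear()
            return [("cancelled", oid) for oid in cancelled]
        return [("rejected", cmd.get("id"))]

# Two replicas fed the same input sequence emit identical outputs.
inputs = [
    {"type": "place", "id": 1, "side": "buy", "price": 100, "qty": 5},
    {"type": "place", "id": 2, "side": "sell", "price": 101, "qty": 3},
    {"type": "cancel_all"},
]
a, b = ToyBook(), ToyBook()
out_a = [event for cmd in inputs for event in a.apply(cmd)]
out_b = [event for cmd in inputs for event in b.apply(cmd)]
assert out_a == out_b  # redundancy for free, as long as inputs match
```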
Almost every matching engine out there is designed as a replicated state machine, with multiple instances running in a cluster. But many questions remain: how do we ensure that all the machines receive the same input in the same order? What happens when a machine fails? What happens in case of network failures? It turns out there is a family of algorithms responsible for answering these questions, and the problem they solve is usually called “the problem of distributed consensus”. Several protocols exist, the most notable being Paxos and Raft. These algorithms guarantee that a cluster eventually agrees on its inputs – and therefore each machine can produce the same output.
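Why must replicas agree on the *order* of inputs, not just the set? A tiny illustration (the state and keys are made up): with last-writer-wins state, two replicas that apply the same inputs in different orders end up disagreeing, which is exactly the divergence consensus protocols exist to prevent.

```python
def run(inputs):
    """Apply a sequence of (key, value) updates; later entries win."""
    state = {}
    for key, value in inputs:
        state[key] = value
    return state

log = [("best_bid", 100), ("best_bid", 99)]
# Same inputs, different order -> different replica state.
assert run(log) != run(list(reversed(log)))
```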
While theoretically sound, consensus algorithms are notoriously hard to implement from scratch. This is one of the main reasons we don’t see many open-source high-performance matching engines out there: building a reliable, low-latency matching engine requires a lot of effort, testing, and trial and error.
Making Your Matching Engine Swift
Having a good consensus implementation is still not enough to build a fast and reliable matching engine. Here is what usually prevents matching engines from keeping latency at bay:
1. Jitter caused by your programming language. Most modern high-level languages employ some sort of automatic memory management aka “garbage collection”. Programmers are free to allocate memory, and the underlying language runtime is responsible for reclaiming it. While being a really useful feature, this also imposes a latency penalty that can be up to several seconds at a time.
When garbage collection is in progress, the application might stop completely or slow down significantly. This is why modern exchanges are written in relatively low-level languages that give full control over memory allocation (such as C or C++), or require special techniques when written in higher-level languages such as Java.
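One of those special techniques is preallocation: create every object the hot path will ever need at startup and recycle them, so steady-state processing allocates nothing for the collector to reclaim. The sketch below shows the idea in Python purely for illustration (real engines do this in C/C++ or in allocation-free Java); the class names and pool size are assumptions.

```python
class Order:
    """A reusable order slot; __slots__ keeps the object small and fixed."""
    __slots__ = ("id", "price", "qty")

    def __init__(self):
        self.id = self.price = self.qty = 0

class OrderPool:
    """Preallocates all orders up front; the hot path only recycles them."""

    def __init__(self, size: int):
        self._free = [Order() for _ in range(size)]  # all allocation here

    def acquire(self) -> Order:
        return self._free.pop()      # no 'new' on the hot path

    def release(self, order: Order) -> None:
        self._free.append(order)     # recycled, not garbage-collected

pool = OrderPool(1024)
o = pool.acquire()
o.id, o.price, o.qty = 7, 100, 5
pool.release(o)  # the slot goes back for reuse
```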
2. Jitter caused by the network stack or OS. Operating systems and networks are usually optimized for the general case, where multiple processes run at the same time. The OS can pause processes and move them between CPU cores, incurring latency. In situations where every microsecond counts, this is unacceptable; so is any form of virtualization.
To overcome this, exchanges run as close to the hardware as possible: dedicating CPU cores solely to the matching engine application and removing any complex coordination between processes where possible. Specialized networking hardware, such as Mellanox or Solarflare NICs, is also commonly used.
3. Engineering weaknesses. Even with a good technology stack, the engineering must be thorough. The choice of algorithms and data structures is critical: an engineer should exhibit so-called ‘mechanical sympathy’ (a term popularized in software by the LMAX team in the early 2010s) and understand how the hardware works in order to write the most efficient software.
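A miniature example of mechanical sympathy is choosing a memory layout the hardware likes. Below, quantities live in one contiguous array indexed by price tick, so a depth scan walks adjacent cache lines instead of chasing pointers through a map; the tick range and function names are arbitrary assumptions for the sketch.

```python
from array import array

TICKS = 10_000  # illustrative number of representable price levels
qty_at_price = array("q", [0] * TICKS)  # flat, contiguous, cache-friendly

def add_order(price_tick: int, qty: int) -> None:
    qty_at_price[price_tick] += qty

def depth_between(lo: int, hi: int) -> int:
    # A linear scan over contiguous memory: predictable and prefetch-friendly.
    return sum(qty_at_price[lo:hi + 1])

add_order(100, 5)
add_order(101, 3)
```

A hash map keyed by price would offer the same asymptotic complexity, but its scattered nodes cost far more in cache misses, which is the kind of difference profiling on real hardware reveals.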
Some exchanges might utilize FPGA-based solutions: specialized programmable chips that run your logic without the overhead of supporting an operating system, standard interfaces, etc. Little public information is available about these implementations, and they can be extremely complex, as matching engines may contain very sophisticated logic.
At the End of the Day
When building an exchange, it’s extremely hard to find a middle ground between performance and reliability. Basic and obvious approaches like sharding or putting a database in the critical path may seem like a fit at first glance: they reduce the impact of possible failures and can even improve performance, but they come with too many shortcomings. A better decision is to build a replicated state machine and pair it with careful engineering to achieve low latency.
Hopefully, we’ve hit the high spots of building a reliable and fast exchange in this article. You can learn more about our experience in building one from scratch here.