Horizontally Scaling a Risk Calculator for a US Broker-Dealer
Case study

About the Client

The client is a US broker-dealer that provides trading in financial assets through its proprietary platform, originally released in the 1990s. Their offering includes common and preferred stocks, futures contracts, ETFs, Forex, options, cryptocurrency, mutual funds, fixed-income investments, margin lending, and cash management services. The brokerage serves 11 million customers with over $1 trillion in assets under management, and the trading platform processes over 500,000 trades each day.

Business Challenge

The client’s trading platform relied on a legacy risk calculator that was struggling to keep up with steady account growth. On top of that, the number of accounts was expected to double due to a recent merger.

The risk calculator was already operating at near-maximum capacity, and the client had exhausted its computing resources and options for vertical scaling. Allocating huge memory blocks (hundreds of gigabytes of heap per Java process) caused performance issues, most notably long garbage collection pauses.

Considering all of this, the client needed a solution that would increase the system’s performance and reduce risk calculation latency from two minutes to under 15 seconds. Latency is measured from the moment an account joins the risk calculation queue to the moment the calculation is complete.

Solution

Devexperts had previously worked with the client, so we were aware of the system’s specifics: steady account growth and the associated performance slowdown. The client gave us free rein to explore new technologies and algorithms to solve their performance and latency issues.

The client had been distributing calculation tasks among a few dozen hosts. After thoroughly researching the matter, we decided to build the new risk calculator on Apache Ignite’s in-memory data grid and integrate it with the client’s legacy system. This approach would allow the client to drastically increase the number of hosts, improving performance and reducing latency.

An in-memory data grid can unite hundreds of hosts to work on a task jointly; in our case, that task was calculating risk for the client’s accounts. With this solution, parts of the overall workload are distributed uniformly across the hosts, and their calculation results are then collected back. This is the classic map-reduce pattern: through the Apache Ignite compute API, a calculation request is sent out to a large network of hosts, and each host replies with its part of the result that matches the request’s attributes, as sketched below.
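The following is a minimal sketch of this map-reduce pattern on Apache Ignite’s compute API. The account identifiers, batching scheme, and the calculateRisk routine are illustrative placeholders, not the client’s proprietary calculation logic.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.lang.IgniteCallable;

public class RiskMapReduceSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // "Map": split the full account set into per-batch jobs that the grid
            // distributes across the available nodes.
            List<List<String>> accountBatches = List.of(
                List.of("ACC-1", "ACC-2"),
                List.of("ACC-3", "ACC-4"));

            Collection<IgniteCallable<Double>> jobs = new ArrayList<>();
            for (List<String> batch : accountBatches) {
                jobs.add(() -> calculateRisk(batch)); // executes on some grid node
            }

            // "Reduce": collect each node's partial result back on the caller
            // and aggregate them.
            Collection<Double> partialResults = ignite.compute().call(jobs);
            double totalExposure = partialResults.stream()
                .mapToDouble(Double::doubleValue)
                .sum();

            System.out.println("Aggregated risk exposure: " + totalExposure);
        }
    }

    // Placeholder for the proprietary per-batch risk calculation.
    private static double calculateRisk(List<String> accounts) {
        return accounts.size() * 1.0;
    }
}
```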

Introducing this solution allowed the client to process a significantly larger number of accounts, providing lasting opportunities for horizontal scaling.
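For illustration, here is a minimal sketch, with a hypothetical cache name and placeholder logic, of how account state can be kept in a partitioned Ignite cache: adding grid nodes redistributes both the data and the work, and a backup copy of each partition provides fault tolerance.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class PartitionedAccountsSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<String, Double> cfg =
                new CacheConfiguration<>("accountRisk");
            cfg.setCacheMode(CacheMode.PARTITIONED); // data is sharded across all nodes
            cfg.setBackups(1);                       // one backup copy for fault tolerance

            IgniteCache<String, Double> cache = ignite.getOrCreateCache(cfg);
            cache.put("ACC-1", 0.0);

            // Run the calculation on whichever node owns the "ACC-1" partition,
            // so the computation is colocated with the account's data.
            ignite.compute().affinityRun("accountRisk", "ACC-1", () -> {
                // Placeholder: recompute and store this account's risk locally.
                System.out.println("Recalculating risk for ACC-1 on its primary node");
            });
        }
    }
}
```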

When we decided to introduce the in-memory data grid solution, we knew it was not a turnkey fix or a silver bullet for this case, and we went to considerable lengths to adapt it to the client’s needs and goals. The Devexperts team fine-tuned the in-memory data grid and migrated the calculation algorithm and all accounts to the new risk calculator.

Performance metrics and hardware requirements before and after implementation 

Metric                            | Legacy risk calculator                                                 | New risk calculator
Hardware                          | 8 large servers, 48 CPU cores each, hundreds of GB of heap per server | 40 general-purpose servers, 16 CPU cores each; 80 grid nodes total, 48 GB of heap per node
Number of accounts                | 2M                                                                     | 5M
Latency                           | 2 minutes                                                              | 1-2 seconds
GC pauses                         | 15-25% of the time                                                     | 2-3% of the time
Avg. CPU utilization (efficiency) | 30-40%                                                                 | 90-95%
Dynamic horizontal scaling        | No                                                                     | Yes
Fault tolerance                   | No                                                                     | Yes

Results

The client’s legacy system had been limited to dozens of hosts, while the in-memory data grid solution raised that limit to hundreds. On top of that, we were able to preserve the client’s existing algorithms and accounts.

Ultimately, our work reduced the heap memory allocated per host and the number of accounts each host is responsible for, which made the calculation much faster for every financial account: the fewer accounts per host, the quicker each one is processed. The client’s initial latency was 2 minutes, and they needed it to be under 15 seconds. With the in-memory data grid solution and meticulous research and development, we achieved a latency of 1-2 seconds.

The client’s system now processes tens of millions of financial accounts.