An exchange feed for a given financial instrument (stock, future, option, etc.) is a sequence of messages describing two distinct objects: orders and trades.
Orders are offers by market participants to buy (sell) up to a given quantity at or below (above) a specified price. Generally there are a number of orders to buy at low prices, and a number of orders to sell at high prices, since everybody is trying to get a good deal. If at any point the exchange has a buy order and a sell order at the same price, the two orders are 'matched' with each other, producing a trade for the smaller of the two order quantities.
If there are multiple orders that could be matched against each other, the orders are matched first to the best possible price (i.e., the lowest-price sell or the highest-price buy order). In the (very common) event of a tie, orders are matched in time priority (first order to arrive will be filled first). For example, if the standing orders are:
```
price   orders (oldest to newest, B = buy, S = sell)
1100
1075    S 1           // one order to sell up to 1 at 1075
1050    S 10          // one order to sell up to 10 at 1050
1025    S 2    S 5    // one order to sell up to 2 at 1025,
                      // and a second, newer order to sell up to 5
1000    B 9    B 1    // buy up to 9 at 1000, second to buy up to 1
975     B 30          // buy up to 30 at 975
950
```
The best buy order is at a price of 1000, and the best sell order is at a price of 1025. Since no seller is willing to sell low enough to match a buyer, and no buyer is willing to buy high enough to match a seller, there is no match between any of the existing orders.
- If a new buy order arrives for a quantity of 3 at a price of 1050, there will be a match.
- The only sells that are willing to sell at or below a price of 1050 are the S 10, the S 2, and the S 5.
- Since the S 2 and the S 5 are at a better price, the new order will match first against those.
- Since the S 2 arrived first, the new order will match against the full S 2 and produce a trade of 2.
- However, the 1 remaining quantity will still match the S 5, so it matches and produces a trade of 1, and the S 5 becomes an S 4.
- Two trade messages will be generated when this happens, indicating a trade of size 2 at price 1025, and a trade of size 1 at price 1025.
- Two order-related messages will also be generated: one to remove the S 2, and one to note the modification of the S 5 down to an S 4.
The new set of standing orders will be:
```
price   orders
1100
1075    S 1
1050    S 10
1025    S 4
1000    B 9    B 1
975     B 30
950
```
Note that if a new sell order arrives at a price of 1025, it will be placed to the right of the S 4 (i.e., behind it in the queue). Also, although there are only a few price levels shown here, you should bear in mind that buys and sells can arrive at any price level.
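The price-time priority described above can be expressed with a sorted container of price levels, each holding a FIFO queue of resting orders. The following is a minimal sketch of matching an incoming buy order against the sell side; the `Order` struct and the container choices are illustrative assumptions, not the implementation used in this project.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

struct Order { uint64_t id; uint64_t qty; double price; };
struct Trade { uint64_t qty; double price; };

// Sell side: best (lowest) price first; each level is a FIFO queue (time priority).
using SellSide = std::map<double, std::deque<Order>>;

// Match an incoming buy against resting sells, best price first, oldest order first.
std::vector<Trade> matchBuy(SellSide &sells, Order buy)
{
    std::vector<Trade> trades;
    while (buy.qty > 0 && !sells.empty() && sells.begin()->first <= buy.price) {
        auto &level = sells.begin()->second;
        Order &resting = level.front();
        const uint64_t traded = std::min(buy.qty, resting.qty);
        trades.push_back({traded, resting.price});      // trade at the resting price
        buy.qty -= traded;
        resting.qty -= traded;
        if (resting.qty == 0) level.pop_front();        // fully filled: remove the order
        if (level.empty()) sells.erase(sells.begin());  // empty level: remove the price
    }
    // Any remaining buy.qty would rest on the buy side (not shown here).
    return trades;
}
```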
The types of messages that can arrive on the exchange feed are as follows:
```
Order: action,orderid,side,quantity,price    (e.g., A,123,B,9,1000)
  action   = A (add), X (remove), M (modify)
  orderid  = unique positive integer to identify each order;
             used to reference existing orders for remove/modify
  side     = B (buy), S (sell)
  quantity = positive integer indicating maximum quantity to buy/sell
  price    = double indicating max price at which to buy / min price at which to sell

Trade: action,quantity,price    (e.g., T,2,1025)
  action   = T (trade)
  quantity = amount that traded
  price    = price at which the trade happened
```
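As a hedged illustration of how such lines might be decomposed (the `Message` struct and `parseMessage` name are mine, not part of the original spec), a straightforward comma-split parser could look like this:

```cpp
#include <exception>
#include <sstream>
#include <string>
#include <vector>

struct Message {
    char action = '?';            // A, X, M or T
    unsigned long orderId = 0;    // unused for trades
    char side = '?';              // B or S, unused for trades
    unsigned long quantity = 0;
    double price = 0.0;
    bool valid = false;
};

// Minimal sketch: split on ',' and convert fields; returns valid=false on malformed input.
inline Message parseMessage(const std::string &line)
{
    Message msg;
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);

    try {
        if (fields.size() == 3 && fields[0] == "T") {   // Trade: T,quantity,price
            msg.action = 'T';
            msg.quantity = std::stoul(fields[1]);
            msg.price = std::stod(fields[2]);
            msg.valid = true;
        } else if (fields.size() == 5 &&
                   (fields[0] == "A" || fields[0] == "X" || fields[0] == "M")) {
            msg.action = fields[0][0];                  // Order: action,orderid,side,quantity,price
            msg.orderId = std::stoul(fields[1]);
            msg.side = fields[2].empty() ? '?' : fields[2][0];
            msg.quantity = std::stoul(fields[3]);
            msg.price = std::stod(fields[4]);
            msg.valid = (msg.side == 'B' || msg.side == 'S');
        }
    } catch (const std::exception &) {
        msg.valid = false;                              // corrupted numeric field
    }
    return msg;
}
```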
The following set of messages builds the initial book from the example in the Background section above. Note that order 100004 is canceled, so it doesn't appear in the book above.
```
A,100000,S,1,1075
A,100001,B,9,1000
A,100002,B,30,975
A,100003,S,10,1050
A,100004,B,10,950
A,100005,S,2,1025
A,100006,B,1,1000
X,100004,B,10,950   // cancel
A,100007,S,5,1025
```
The following sequence of additional messages represents the new order and subsequent trades, removes, and modifies that happen in the second part of the Background example above:
```
A,100008,B,3,1050
T,2,1025
T,1,1025
X,100008,B,3,1050
X,100005,S,2,1025
M,100007,S,4,1025   // order is modified down to reflect quantity traded
```
The state of the book after these messages is shown at the end of the Background section.
(1) Given a sequence of messages, as defined above, construct an in-memory representation of the current state of the order book. You will need to generate your own dataset to test your code.
(2) Write out a human-readable representation of the book every 10th message.
(3) Write out a machine-readable sequence of 'midquotes' (average of best buy and best sell order prices) on every message. For example, the correct sequence of midquotes for the messages given in the first part of the Example above is
```
NAN      // no buys yet, so midquote is undefined
1037.5   // = (1075. + 1000.)/2.0
1037.5   // new buy at 975 is lower than best, so midquote doesn't change
1025     // = (1050. + 1000.)/2.0 (new sell is at a better price than old)
1025     // new buy at 950 is lower than best, so midquote doesn't change
1012.5   // = (1025. + 1000.)/2.0 (new sell is again lower than best)
1012.5   // new buy at 1000 changes quantity, but not midquote
1012.5   // cancel of old buy at 950 (less than best) doesn't change midquote
1012.5   // new sell at 1025 adds to quantity, but again doesn't change best
```
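A minimal sketch of the midquote computation, assuming the book keeps one sorted map of price levels per side (the `Levels` layout is an assumption for illustration, not the project's structure):

```cpp
#include <cmath>
#include <map>

// Price level -> aggregate resting quantity (layout assumed for illustration).
using Levels = std::map<double, unsigned long>;

// Returns the midquote, or NAN when either side of the book is empty.
inline double midquote(const Levels &buys, const Levels &sells)
{
    if (buys.empty() || sells.empty())
        return NAN;
    const double bestBuy  = buys.rbegin()->first;   // highest buy price
    const double bestSell = sells.begin()->first;   // lowest sell price
    return (bestBuy + bestSell) / 2.0;
}
```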
(4) Write out the total quantity traded at the most recent trade price on every trade message. For example:
```
message      => output
T,2,1025     => 2@1025
T,1,1025     => 3@1025
T,1,1000     => 1@1000   // quantity resets when new trade price is seen
T,1,1025     => 1@1025   // doesn't remember the previous trades @1025
```
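A minimal sketch of this running total (the class and member names are mine, not from the original code):

```cpp
#include <cstdint>
#include <iostream>

// Tracks the total quantity traded at the most recent trade price;
// the total resets whenever a trade arrives at a different price.
class TradeTracker {
public:
    void onTrade(uint64_t quantity, double price) {
        if (price != lastPrice_) {
            lastPrice_ = price;
            totalQty_ = 0;          // new trade price seen: reset the total
        }
        totalQty_ += quantity;
        std::cout << totalQty_ << '@' << lastPrice_ << '\n';
    }
private:
    double   lastPrice_ = 0.0;
    uint64_t totalQty_  = 0;
};
```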
(5) Your code should be clean, easy to understand, efficient, and robust. Specifically, your program should not crash if it gets:
- a. corrupted messages
- b. duplicated order ids (duplicate adds)
- c. trades with no corresponding order
- d. removes with no corresponding order
- e. best sell order price at or below best buy order price, but no trades occur
- f. negative, missing, or out-of-bounds prices, quantities, order ids
You should note how often each of these cases occurs and print a summary when the program finishes processing all messages.
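One way to track these cases (a sketch; the category names and class are assumptions, not the author's) is a simple array of counters keyed by an enum, printed as a summary at the end:

```cpp
#include <array>
#include <cstddef>
#include <iostream>

// Error categories matching points (a) to (f) above.
enum ErrorKind {
    CorruptedMessage, DuplicateOrderId, TradeWithoutOrder,
    RemoveWithoutOrder, CrossedBookWithoutTrade, BadField, NumErrorKinds
};

class ErrorCounters {
public:
    void record(ErrorKind kind) { ++counts_[kind]; }

    void printSummary(std::ostream &os) const {
        static const std::array<const char*, NumErrorKinds> names = {{
            "corrupted messages", "duplicated order ids",
            "trades with no corresponding order", "removes with no corresponding order",
            "crossed book without trade", "negative/missing/out-of-bounds fields" }};
        for (std::size_t i = 0; i < counts_.size(); ++i)
            os << names[i] << ": " << counts_[i] << '\n';
    }
private:
    std::array<unsigned long, NumErrorKinds> counts_{};  // zero-initialized
};
```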
(6) The following is a stub C++ program to get you started. Feel free to modify it as you see fit. You may use STL.
```cpp
#include <fstream>
#include <iostream>
#include <string>

class FeedHandler
{
public:
    FeedHandler();
    void processMessage(const std::string &line);
    void printCurrentOrderBook(std::ostream &os) const;
};

int main(int argc, char **argv)
{
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <input file>\n";
        return 1;
    }
    FeedHandler feed;
    std::string line;
    const std::string filename(argv[1]);
    std::ifstream infile(filename.c_str(), std::ios::in);
    int counter = 0;
    while (std::getline(infile, line)) {
        feed.processMessage(line);
        if (++counter % 10 == 0) {
            feed.printCurrentOrderBook(std::cerr);
        }
    }
    feed.printCurrentOrderBook(std::cout);
    return 0;
}
```
All benchmarks below were done on a 4-core machine: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz, with:

- Linux: 3.19.0-43-generic #49~14.04.1-Ubuntu
- Compiler: gcc version 5.3.0 20151204 (Ubuntu 5.3.0-3ubuntu1~14.04)
- Python: Python 2.7.6 (default, Jun 22 2015, 17:58:13)
See INSTALL instructions.
- For the last 3.5 years I have been very focused on the new C++ features. Many parts of Boost have become less interesting since the C++11 and C++14 additions (keywords and STL). Therefore this project does not use Boost but relies on C++14, which is cleaner and easier to read.
- Only the order quantity can be modified (neither the price nor the order id; in those cases I expect a cancel followed by a new add).
- The first version was a single-threaded program (reading the input file, parsing messages, order book management, then printing results). Make it simple and make it work, then consider optimization. A good design (KISS) helps to simplify later refactoring (like the usage of `auto`).
- Focus on the latency of parsing messages and printing results.
- Tests of different ways to read input files (`getline` vs `mmap`).
- Unit tests (50% of the overall lines of code) from the beginning, with random generators to cover a much larger set of failure possibilities.
- Second version with two threads to separate the critical path (parsing messages and order book management) from printing results.
- Parse quantity: `std::stoul` [105] vs `Decoder::retreive_unsigned_integer` [45] (in ns); a sketch of this style of hand-rolled parsing follows the benchmark list below.
- Parse price: `std::stod` [531] vs `Decoder::retreive_float` [259] (in ns)
- Convert quantity to string: `std::to_string` [262] vs `Decoder::convert_unsigned_integer` [65] (in ns)
- Convert price to string: `sprintf` [1716] vs `Decoder::convert_float` [172] (in ns)
- Parse good order lines: internal parser [610] vs naive STL impl [1723] (in ns)
- Parse good trade lines: internal parser [317] vs naive STL impl [811] (in ns)
- Read file (test6.txt): line by line [19'254'867], `std::getline` [34'078'561], with `mmap` [16'158'367] (in ns)
- FeedHandler and Reporter tests (see the Design benchmarks comparison section below)
- test1.txt: new (12), modify (2), cancel (6), trade (4)
  - Overall processing time for the single-threaded version is 217 usec.
  - Overall processing time for the dual-thread version is 221 usec, but parsing messages and order book management take 31 usec.
- test5.txt: new (50376), modify (49721), cancel (50052), trade (99773)
  - Overall processing time for the single-threaded version is 0.487432 sec.
  - Overall processing time for the dual-thread version is 0.490797 sec, but parsing messages and order book management take 0.075556 sec.
- test6.txt: new (239893), modify (94891), cancel (239888), trade (200891)
  - Overall processing time for the single-threaded version is 1.474538 sec.
  - Overall processing time for the dual-thread version is 1.499489 sec, but parsing messages and order book management take 0.238893 sec.
- External allocators (gperftools, lockless) with test6.txt:
  - With gperftools 2.4:
    - Overall processing time for the single-threaded version is 1.467972 sec.
    - Overall processing time for the dual-thread version is 1.455903 sec, but parsing messages and order book management take 0.243840 sec.
  - With lockless:
    - Overall processing time for the single-threaded version is 1.47099 sec.
    - Overall processing time for the dual-thread version is 1.448604 sec, but parsing messages and order book management take 0.242458 sec.
- Compiler flags for optimization (default is -O2): -O3 or -Ofast: nothing noticeable for those tests.
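Returning to the parsing numbers above: the gap between `std::stoul` and the internal decoder is typical of a hand-rolled digit-by-digit parser that skips locale handling and error machinery. Below is a minimal sketch of that technique, not the project's actual `Decoder` code.

```cpp
#include <cstdint>

// Minimal sketch: parse an unsigned decimal integer from [p, end),
// advancing p past the digits. No locale, no exceptions, no overflow check.
inline uint64_t parseUnsigned(const char *&p, const char *end)
{
    uint64_t value = 0;
    while (p != end && *p >= '0' && *p <= '9') {
        value = value * 10 + static_cast<uint64_t>(*p - '0');
        ++p;
    }
    return value;
}
```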
In my past experience, I have run several benchmarks with different data structures in order to come up with the one that is most efficient in terms of access (insertion, deletion and searching) and most user-friendly in terms of readability (further maintenance and team adoption). Of course, there is no obvious perfect solution; it depends on the situation and on the experience we have with the software.
Usually I prefer straightforward STL containers like `map` (for orders) and `vector` (for limits), but here I chose `unordered_map` and `deque` to bring more performance for medium-size order books (around 10k to 100k live orders, which is high even for liquid instruments). Above that size, a `map` would be a better choice in terms of memory; below it, a `vector` would be a bit faster when looping over the limits printed every 10 messages. Only continuous tests and measurements will allow us to keep the best choice.
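A minimal sketch of that layout, assuming one hash map from order id to order plus one sorted `deque` of price limits per side (the struct and field names are illustrative, not the project's):

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <unordered_map>

struct Order { char side; uint64_t qty; double price; };
struct Limit { double price; uint64_t totalQty; };

struct Book {
    // O(1) average lookup by order id for modify/cancel.
    std::unordered_map<uint64_t, Order> orders;
    // Price limits kept sorted so the best price sits at one end, lookups use
    // binary search (std::lower_bound), and printing iterates contiguously.
    std::deque<Limit> buyLimits;    // ascending price; best buy  = back()
    std::deque<Limit> sellLimits;   // ascending price; best sell = front()
};

// Binary search used to locate (or decide where to insert) a price level.
inline std::deque<Limit>::iterator findLimit(std::deque<Limit> &limits, double price)
{
    return std::lower_bound(limits.begin(), limits.end(), price,
                            [](const Limit &l, double p) { return l.price < p; });
}
```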
Once we have written working, clean and efficient code, it is time to consider optimizations when our target is low-latency software. To address this, there are several topics to consider.
Here are some numbers showing the cost of the dual-thread design (due to the intermediate queue and the duplication of order book management) compared to the single-threaded design, but also the expected benefit for the critical path:
- In the single-threaded design (only the FeedHandler managing its orderbook):
Prefill (100000 buy and sell orders) and empty orderbooks performance comparisons (in ns)
Action | Prefill buy | Empty buy | Prefill sell | Empty sell |
---|---|---|---|---|
new order | 647 | 384 | 580 | 390 |
new order same price | 560 | 344 | 549 | 397 |
modify order | 518 | 190 | 501 | 346 |
cancel order | 500 | 268 | 502 | 234 |
- In the dual-thread design (FeedHandler and Reporter linked by the wait-free queue):
Prefill (100000 buy and sell orders) and empty orderbooks performance comparisons (in ns)
Action | Prefill buy | Empty buy | Prefill sell | Empty sell |
---|---|---|---|---|
new order | 856 | 498 | 727 | 551 |
new order same price | 604 | 546 | 629 | 499 |
modify order | 419 | 363 | 421 | 324 |
cancel order | 582 | 337 | 644 | 336 |
- Where is the latency?
  - Print a prefilled (100000) orderbook: [26'964'652] ns
  - Copy a prefilled (100000) orderbook: [2'786'017] ns
  - Insert at the beginning of a prefilled (100000) orderbook: [968] ns

As we can see, Print is the latency killer, whereas adding at the beginning is much faster. The overhead introduced by the dual-thread design is therefore negligible compared to the Print latency cost.
Make the critical path do only critical work.

This means we may consider multithreading if we have enough cores (and optimally cores on different CPUs for the L3 cache). The core count can be checked from C++ with `std::thread::hardware_concurrency()` or from the shell:

```
cat /proc/cpuinfo
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
```
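As a trivial illustration (standard library only, nothing project-specific), the core count can be queried at runtime like this:

```cpp
#include <iostream>
#include <thread>

int main()
{
    // Number of concurrent threads supported by the hardware (0 if unknown).
    std::cout << std::thread::hardware_concurrency() << '\n';
    return 0;
}
```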
Even if the expected solution here doesn't specify any critical processing in the problem statement, I chose to separate the work into two threads (for simplicity of synchronisation):
- The first one parses input messages and manages orders to keep a live limit orderbook.
- The second one receives updates (or events) to rebuild its own copy of the limit orderbook (and keep track of errors) and does all the print tasks, like a logging mechanism.
My first attempt was to develop a simple ring buffer implementation, but because of the different latencies between the two threads and the continuous input, the first thread was spending too much time waiting for the second one. I moved to a simple wait-free queue built from a `deque` and a spinlock as a barrier, which is not optimal but quicker to implement and easier to maintain. Anyway, there is a lot of good reading on the web, such as 1024cores.
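A minimal sketch of that kind of queue, assuming C++14 and an `std::atomic<bool>` spinlock (this illustrates the approach, not the project's actual implementation):

```cpp
#include <atomic>
#include <deque>
#include <utility>

// Single-producer / single-consumer queue guarded by a spinlock.
// Simpler than a lock-free ring buffer, at the cost of short busy waits.
template <typename T>
class SpinlockQueue {
public:
    void push(T item) {
        lock();
        queue_.push_back(std::move(item));
        unlock();
    }

    bool tryPop(T &item) {
        lock();
        if (queue_.empty()) { unlock(); return false; }
        item = std::move(queue_.front());
        queue_.pop_front();
        unlock();
        return true;
    }

private:
    void lock()   { while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked_.store(false, std::memory_order_release); }

    std::deque<T> queue_;
    std::atomic<bool> locked_{false};
};
```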
System architecture considerations:
- When doing this, we have to pay attention to the CPU cache to avoid "false" sharing (thanks Martin Thompson), so I added padding. I could check the cache line size in the program, but I chose to put two paddings of 64 characters, which should cover most current situations.
```cpp
#include <cstddef>
#include <cstdio>

size_t cache_line_size() {
    unsigned int size = 0;
    FILE *p = fopen("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");
    if (p) {
        if (fscanf(p, "%u", &size) != 1)  // %u matches unsigned int
            size = 0;
        fclose(p);
    }
    return size;
}
```
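As an illustration of the padding idea described above (a sketch only; the field names are mine, and the two 64-character pads match what is described, assuming a 64-byte cache line):

```cpp
#include <atomic>
#include <cstdint>

// Keep the producer and consumer indices on separate cache lines so that
// writes by one thread do not invalidate the line read by the other.
struct QueueIndices {
    std::atomic<uint64_t> head{0};
    char pad1[64];                 // assumed 64-byte cache line
    std::atomic<uint64_t> tail{0};
    char pad2[64];
};

// C++11 alternative: alignas(64) on each member instead of manual padding arrays.
```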
- Also, we need to pay attention to concurrent heap memory allocation (gperftools or another allocator can fix this; otherwise an object pool could be implemented).
- I also tried to play with memory page locking (for page faults), but I didn't find a good usage here. I didn't try prefaulting or prefetching either.
- For cache misses, since the beginning I have stored quantity/price tuples in a `deque`, for the efficiency of print loops and binary searches (dichotomy).
- I prefer to use `mmap` to rely on the OS to flush the data from/to disk asynchronously for the input file.
- Branch prediction (likely/unlikely macros; see the sketch after this list). Prefer switches over chains of if/else when possible.
- GCC flags (-mtune=native)
- CPU overclocking
- CPU affinity (`pthread_setaffinity_np`)
- Scheduler (FIFO, RT)
- Isolated CPU (IRQ affinity)
- Network stack (kernel-by-pass)
- SSD (but disk controllers are very efficient, so tests are mandatory).
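The sketch referenced in the branch-prediction bullet above: typical likely/unlikely macros built on GCC's `__builtin_expect` (a common pattern, not the project's exact definitions):

```cpp
// Hint the compiler about the expected truth value of a condition (GCC/Clang).
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Usage: keep the hot path on the predicted branch, e.g.
// if (UNLIKELY(!msg.valid)) { ++errorCount; return; }
```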
Continuous deployment is the key to success in achieving all the targets (doing the right thing, doing the thing right, and delivering fast). Unit tests with coverage are the basic thing to do at the very beginning. RapidCheck is a good tool to go beyond guessing test cases. A simulator or test-case generator (in Python) is the second step of validation. They allow quick and highly available feedback on the forecast delivery.