Q&A: McObject's Steve Graves on Vectors, CPU Cache and Pipelining to Boost Analytics Performance

Blog entry

Trading firms are looking to boost the speed at which they run analytics, in order to provide faster trading signals and more effective risk management. Leveraging vector-based processing, and managing data so that it remains in the CPU cache can help. IntelligentTradingTechnology.com spoke to McObject's CEO Steve Graves for some explanations.

Q: You just released a white paper titled "Pipelining Vector-Based Statistical Functions for In-Memory Analytics" ... so can you start by explaining "Vector-Based"?

[Note: Read the white paper here]

A: Definitions first: in programming, a “scalar” variable or data type holds just one atomic value at a time (think X = 1). In contrast, an array can hold multiple values (think X = 1, 2, 3…). A vector is a type of array (a one-dimensional array). Vector-based programming languages, in use since the 1960s, make coding easier by generalising operations normally applying to scalars, so that they apply transparently to vectors. In other words, working with the (potentially large) set of values in a vector becomes as easy as working with a single value. So, if A and B are vectors, a vector-based programming language might let you code C = A + B to obtain a third vector whose elements are the sum of the corresponding elements of the vectors A and B.

Accordingly, vector-based languages can slash the amount of code required to perform some tasks. Beyond this programming efficiency advantage, and under the right conditions, vector-based languages can also deliver greater processing efficiency – that is, they reduce application latency – by leveraging CPUs’ built-in instruction sets for handling vectors (if present in a given CPU).

A vector-based approach is useful in working with market data, which typically consists of one or more sequences of data (closing prices, daily volume, etc.) that must be crunched in their entirety. Being able to handle these sequences as vectors simplifies the programmer’s job while making the application faster – two very appealing characteristics when it comes to creating the software underlying today’s capital markets.

Yet it is not necessary to adopt a “pure” vector-based language to get these benefits. Scalar languages can be extended with functions that both simplify programming with vectors, and take advantage of the CPU’s specialised features to handle vectors. For example, with the eXtremeDB Financial Edition database system, McObject supplements the C programming language with a library of vector-based functions for statistical processing. The vectors on which these functions operate are sequences of market data; the functions are used to perform operations on, say, a year’s worth of closing prices for a security.

Q: Next, what is meant by 'Pipelining'?

A: The term “pipelining” is used in many computing contexts, in both hardware and in software, referring to output from one element or function serving as input for the next. McObject’s paper describes the technique of pipelining the vector-based statistical functions that I just mentioned. The functions can be combined end-to-end to form an “assembly line” of processing.

This may not sound particularly unusual, since almost any algorithm or program can be viewed as a sort of assembly line that acts on data. But the pipelining described in the paper has a unique purpose: keeping data inside CPU cache as the data is being worked on by the functions. This really revs up performance, because it eliminates a great deal of back-and-forth traffic that would normally occur between CPU cache, on the one hand, and main memory, on the other.

And of course, this benefit just adds to the performance gain that results from working with vectors of market data using vector-based functions.

Q: Can you back up a little and talk about CPU cache on microprocessors? How does it compare to/relate to RAM memory?  How much faster is CPU cache than RAM?

A: Microprocessors have a relatively small amount of integrated memory, called CPU cache. Often this cache is arranged hierarchically (L1, L2, L3), and its purpose is to make sure that data and instructions that are likely to be needed are kept “close to the action” – that is, readily accessible by the CPU’s processor cores. When the CPU needs a piece of instruction code or data, it first looks in this cache. If the needed item is not cached, the CPU must fetch it from main memory (RAM). The route for such transfers between CPU cache and RAM is through the Front Side Bus (FSB) in older Intel hardware, or the Quickpath Interconnect (QPI) in newer Intel hardware.

Because this “door” is of limited bandwidth, passing data through it imposes significant latency. Assuming a processor clock speed of 2.8 gigahertz and (generously) a QPI bandwidth of 800 megahertz, fetching from main memory is like speeding down the Autobahn at 280 kph, then having to slow down to 80 kph for an obstacle, then racing to the next obstacle at 280 kph, and so on. And in many applications for capital markets, these obstacles crop up again and again, adding to latency in working with market data.

Q: Can you give a practical example of how this might occur?

A: Say you want to calculate 5-day and 21-day moving averages for a stock, and detect the prices at which the faster moving average (5-day) crosses over or under the slower one (21-day). We use this example in the white paper, and it is realistic, as some view a cross over (the 21-day averages goes from greater than, to less than, the 5-day moving average) as a market entry point (a buy signal), and a cross under as an exit point (a sell signal).

A software program would start with available market data: closing prices for a given time period. Finding crossover points would typically take five steps: calculate moving averages for both the five- and 21-day “windows”; subtract each 21-day average from the same day’s 5-day moving average, to get the delta between the two; find crossovers using a function that detects the points at which that delta elements cross zero, i.e. goes from positive to negative or vice versa; then map these points to the original sequence of closing prices, to return a sequence of closing prices at which the 5-day and 21-day moving averages crossed.

In the traditional approach to this work, even with a vector-based language, these steps entail creating interim results that are transferred to main memory for temporary storage, then fetched back into CPU cache for further processing. For example, both the 5-day and 21-day moving averages are “materalised” as output in main memory – then they’re brought back into CPU cache in order to subtract one from the other.

Each of these transfers is a bottleneck – to use the analogy above, it’s an obstacle that makes the car speeding down the Autobahn hit the brakes. This delay is multiplied by the number fetches/writes from/to main memory required to move a given set of data back and forth. In other words, processing wouldn’t be just 3-4 times slower than the approach using pipelining.  It would be 3-4 times slower times the number of trips across the QPI/FSB, whereas pipelining reduces the number of transfers for interim results to zero.

Q: So, this is an extreme form of 'in memory' computing? But surely CPU cache is limited in capacity? How do you address that?

A: In-memory database systems (IMDSs) use main memory (RAM) as storage, which can lend performance an order of magnitude faster than traditional DBMSs that use persistent storage. eXtremeDB Financial Edition is, at its core, an IMDS, and IMDS architectures are irreplaceable for crunching market data. But they're practically the “new normal” for analytics. The white paper explores the question, how do you move analytics beyond the status quo in terms of reducing latency? The conclusion is that you have to move in the direction of the processor itself.

One key technique – which we use in eXtremeDB Financial Edition, but didn’t invent – is column-based handling of market data. Database pages are the fundamental unit of input/output in database management systems. They’re the units by which DBMSs (and IMDSs) cause data to be brought from database tables into CPU cache. Conventionally, pages consist of rows and columns.

But most instances of market data – closing prices, for example – will occupy a single column in that table. If you need to perform a financial calculation on that column, and you're fetching data by pages consisting of rows, you wind up “flooding the cache” with irrelevant data (that is, data from all the other columns that make up the rows). To address this, column-based data handling creates pages consisting of columns of data. Animation is a good way to grasp the row vs. column concepts, and for this I recommend this video McObject has created (it also explains pipelining).

Q: Back to pipelining. Does column-based data handling achieve the pipelining you’ve talked about?

A: Pipelining builds on the column-based handling. To enable pipelining, we’ve developed a technology we call tile-based processing of vector elements that works with the database pages discussed above to assemble blocks called tiles as units of transfer and processing within and between the pipeline functions. When a sequence (a column of market data) is retrieved from the database, it is assembled into these blocks, which are sized so that an entire tile always fits in CPU cache.

The process is triggered by the invocation of the first function in the pipeline – let’s call it function A. This causes a tile to be assembled from the database and brought into CPU cache. After function A has done its work on this tile, the function’s output (in the form of a tile of transformed data) is passed to function B. Function B then produces a tile that is passed to function C, and so on. When the final function in the pipeline has finished its work, the fully transformed tile is transferred back to main memory as output. Then a new tile of input data can enter CPU cache, if all the elements of the input sequence (column) did not fit into the first tile.

Note that after function A finishes its work, the tile(s) that served as its input is, for lack of a better term, vaporised. In fact, all of the tiles from various stages in the pipeline disappear as they are replaced by a "more transformed" tile further down the line. This eliminates the need to send interim results "over the transom" between CPU cache and main memory between execution of the different functions making up the pipeline. To use the earlier analogy, the car can speed down the Autobahn, without obstacles.

Q: It seems you have invented a different kind of mousetrap – but does the mechanism you describe do anything that can’t be accomplished by combining a column-oriented database with a vector-based language?

A: With vector-based languages, managing interim results still presents a problem, because operations must typically execute over entire vectors before transformed data is passed along for further processing. If you attempt to operate on two large vectors (e.g. C = A * B), those vectors may be too long to be held in cache. Say the vectors have two million elements of 8-byte double precision floating point numbers. That’s 16 million bytes each. If CPU cache is 8MB, then neither of those vectors will fit in the cache. That implies that the processor has to fetch the vectors in chunks from main memory, and write the interim result. In other words, grab a chunk of A, grab a chunk of B, produce a chunk of C and write it back to main memory.

This becomes increasingly burdensome when pipelines grow more complex – when they resemble a chain of operations (functions) such as C = A * (B + D) / E.  With a vector language, B + D have to be added first, on the entire vectors. The interim result is multiplied by A, creating a new and temporary vector that is then divided by E.  Again, for long vectors, this surely won’t fit in CPU cache. If very long vectors are used, the interim results may not even fit entirely in RAM, in which case they’ll have to be transferred in and out of swap space on the hard disk, with even greater latency than main memory storage. In contrast, with our tile-based processing, the need to operate over entire vectors is eliminated. A portion of a vector or multiple vectors can be processed by all the functions within the chain, with none of the interim results written back into RAM.

Q: How else does your database leverage hardware functionality to improve performance?

A: Another characteristic of eXtremeDB Financial Edition, which complements the efficiencies of pipelining, is that the database system has a very short execution path. This can be seen in its code footprint, which is approximately 150K – much smaller than most DBMSs and IMDSs used in capital markets. Small code size means a given operation is likely to be carried out using fewer CPU instructions, resulting in faster performance. And, in addition to data caches, CPUs also have separate instruction caches. Like data caches, these instruction caches are limited in size. With a small code size, the instructions for any given operation are going to be in cache so that the processor will not have to go "over the transom" into main memory to fetch needed instructions for the operation.