Reducing memory consumption for dynamic tabular data

At FundApps we had a semi-dynamic data schema, which could be defined by our regulatory team in accordance with the market data required to enforce regulations.

Imagine a table of stock holdings like the below, with potentially millions of rows, and possibly a few hundred column headers. The data was sparse, but with much overlap in data according to the asset class.

AssetIdQuantityAssetClassDeltaTotalSharesOutstanding
GB12343900Equityn/a10,000,000
GB29382000Option0.3n/a

We were originally representing the data model with a pretty basic data structure – essentially a collection of dictionaries, one dictionary for each row, to represent the dynamic properties of the schema defined by our regulatory team.  

As our workloads grew in size, one of our team highlighted this perhaps wasn’t the smartest when it came to memory consumption – the keys in our dictionary were roughly the same for every row of table data, and yet we were storing the hash values of those keys for every row:

AssetIdKeys
GB1234AssetId, Quantity, AssetClass, TotalSharesOutstanding
GB2938AssetId, Quantity, AssetClass, Delta

While there is a DataTable in .NET for this kind of data, it’s a pretty heavy-weight and ugly API for what we needed.

Instead of storing similar keys over and over again in rows and rows of dictionary, we switched to a model whereby the rows themselves are simple arrays, with a key indexing into these arrays defined once for the entire set of dictionaries in the collection.

In .NET world we went from something close to

interface IPortfolio {
     IReadOnlyList<IAsset> Assets { get; }
}
interface IAsset {
     IReadOnlyDictionary<string, object> Properties { get; }
}

to something like

interface IPortfolio {
    IReadOnlyDictionary<string, int> PropertyKeys { get; }
    IReadOnlyList<IAsset> Assets { get; }
}
interface IAsset {
    IReadOnlyList<object> Properties { get; }
}

In the actual implementation, we wrapped this up so the underlying data structure was hidden – and consumers could continue to treat the asset properties as if they were a dictionary.

Each ‘SharedKeyDictionary’ was initialised with the same SharedKeyLookup which enabled the keys to be shared across multiple instances of the dictionary.

// when we know the keys upfront, then we can just define them
var sharedKeyLookup = new ReadOnlySharedKeyLookup<string>("key1","key2","key3");
// row data initialised with the same shared key lookup
var dictionary1 = new SharedKeyDictionary<string, object>(sharedKeyLookup)
var dictionary2 = new SharedKeyDictionary<string, object>(sharedKeyLookup)

The dictionaries can now be used as a normal dictionary but we’re sharing the hash table across both instances. By defining the keys up front, the implementation is simple, and thread-safe – but the downside is you are unable to later add elements to the dictionaries for a key that was not in that initial set.

What about when you don’t know the keys up front?

We didn’t actually know the set of columns up front in some scenarios, due to the way we loaded the data into memory.

The first time we attempted to add data for a key that didn’t exist, we needed to add it to our SharedKeyLookup. We also needed to re-size the arrays in the individual rows the first time we added data for a new key.

The downside of this is losing thread-safety. When you call GetEnumerator() the collection is potentially modified by one of the other dictionary instances, if you are continuing to add keys.

In our scenario, the mutation stage and the read stage were distinct so we could ignore this in the implementation – but you could work around this by adjusting the implementation to ‘snapshot’ the current list of keys prior to enumeration.

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.