Reducing memory consumption for dynamic tabular data

At FundApps we had a semi-dynamic data schema, which could be defined by our regulatory team in accordance with the market data required to enforce regulations.

Imagine a table of stock holdings like the one below, with potentially millions of rows and possibly a few hundred columns. The data was sparse, but with significant overlap between rows of the same asset class.

AssetId | Quantity | AssetClass | Delta | TotalSharesOutstanding
GB1234  | 3900     | Equity     | n/a   | 10,000,000
GB2938  | 2000     | Option     | 0.3   | n/a

We were originally representing the data model with a pretty basic data structure – essentially a collection of dictionaries, one dictionary for each row, to represent the dynamic properties of the schema defined by our regulatory team.  

As our workloads grew, one of the team highlighted that this perhaps wasn’t the smartest choice when it came to memory consumption – the keys were roughly the same for every row of table data, and yet each row’s dictionary was storing those keys (and their hashes) all over again:

AssetId | Keys
GB1234  | AssetId, Quantity, AssetClass, TotalSharesOutstanding
GB2938  | AssetId, Quantity, AssetClass, Delta

While .NET does have DataTable for this kind of data, it’s a pretty heavyweight and ugly API for what we needed.

Instead of storing near-identical keys over and over again in row after row of dictionaries, we switched to a model whereby each row is a simple array, with the key-to-index mapping defined once for the entire collection.

In .NET terms, we went from something close to

interface IPortfolio {
    IReadOnlyList<IAsset> Assets { get; }
}
interface IAsset {
    IReadOnlyDictionary<string, object> Properties { get; }
}

to something like

interface IPortfolio {
    IReadOnlyDictionary<string, int> PropertyKeys { get; }
    IReadOnlyList<IAsset> Assets { get; }
}
interface IAsset {
    IReadOnlyList<object> Properties { get; }
}

In the actual implementation, we wrapped this up so the underlying data structure was hidden – and consumers could continue to treat the asset properties as if they were a dictionary.

Each ‘SharedKeyDictionary’ was initialised with the same SharedKeyLookup, which enabled the keys to be shared across multiple instances of the dictionary.

// when we know the keys up front, we can just define them
var sharedKeyLookup = new ReadOnlySharedKeyLookup<string>("key1", "key2", "key3");
// row data initialised with the same shared key lookup
var dictionary1 = new SharedKeyDictionary<string, object>(sharedKeyLookup);
var dictionary2 = new SharedKeyDictionary<string, object>(sharedKeyLookup);

Each dictionary can now be used like a normal dictionary, but the key hash table is shared across both instances. By defining the keys up front, the implementation stays simple and thread-safe – the downside is that you can’t later add an element for a key that wasn’t in that initial set.
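For illustration, here’s a minimal sketch of roughly what that wrapper might look like for the fixed-key case – the class names follow the post, but the shape is an assumption rather than our actual implementation:

using System.Collections.Generic;

// Key -> array index map, built once and shared by every row.
public sealed class ReadOnlySharedKeyLookup<TKey>
{
    private readonly Dictionary<TKey, int> _indexes = new Dictionary<TKey, int>();

    public ReadOnlySharedKeyLookup(params TKey[] keys)
    {
        for (var i = 0; i < keys.Length; i++)
            _indexes[keys[i]] = i;
    }

    public int Count => _indexes.Count;
    public IEnumerable<TKey> Keys => _indexes.Keys;
    public bool TryGetIndex(TKey key, out int index) => _indexes.TryGetValue(key, out index);
}

// One row of data: a flat value array, with no per-row hash table.
public sealed class SharedKeyDictionary<TKey, TValue>
{
    private readonly ReadOnlySharedKeyLookup<TKey> _lookup;
    private readonly TValue[] _values;

    public SharedKeyDictionary(ReadOnlySharedKeyLookup<TKey> lookup)
    {
        _lookup = lookup;
        _values = new TValue[lookup.Count]; // one slot per shared key
    }

    public TValue this[TKey key]
    {
        get
        {
            if (!_lookup.TryGetIndex(key, out var index))
                throw new KeyNotFoundException();
            return _values[index];
        }
        set
        {
            if (!_lookup.TryGetIndex(key, out var index))
                throw new KeyNotFoundException(); // fixed key set: unknown keys are rejected
            _values[index] = value;
        }
    }

    // The remaining IReadOnlyDictionary<TKey, TValue> members (Keys, Values,
    // TryGetValue, GetEnumerator and so on) delegate to _lookup and _values similarly.
}

The expensive per-row dictionary disappears: each row pays only for an array of values, while the single lookup carries the key hashing cost once for the whole collection.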

What about when you don’t know the keys up front?

We didn’t actually know the set of columns up front in some scenarios, due to the way we loaded the data into memory.

The first time we attempted to add data for a key that didn’t exist, we needed to add it to our SharedKeyLookup. We also needed to resize an individual row’s array the first time we added data to it for a new key.

The downside of this is losing thread-safety: if you are still adding keys, the shared key collection can be modified by one of the other dictionary instances while you are enumerating via GetEnumerator().

In our scenario, the mutation stage and the read stage were distinct so we could ignore this in the implementation – but you could work around this by adjusting the implementation to ‘snapshot’ the current list of keys prior to enumeration.
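To make the growable variant concrete, here’s a rough sketch of how the pieces might fit together – again, the names and shapes are assumptions rather than the production code. The lookup hands out a new index the first time it sees a key, each row grows its value array lazily, and enumeration walks a snapshot of the keys taken up front.

using System;
using System.Collections.Generic;
using System.Linq;

// Growable key lookup shared by many rows. Deliberately not thread-safe:
// it assumes a distinct mutation phase, as described above.
public sealed class SharedKeyLookup<TKey>
{
    private readonly Dictionary<TKey, int> _indexes = new Dictionary<TKey, int>();

    public int Count => _indexes.Count;

    public bool TryGetIndex(TKey key, out int index) => _indexes.TryGetValue(key, out index);

    // The first row to use a new key allocates the next index; every other
    // row sharing this lookup resolves the key to the same slot.
    public int GetOrAddIndex(TKey key)
    {
        if (!_indexes.TryGetValue(key, out var index))
        {
            index = _indexes.Count;
            _indexes.Add(key, index);
        }
        return index;
    }

    public TKey[] SnapshotKeys() => _indexes.Keys.ToArray();
}

// One row of data: a value array indexed via the shared lookup.
public sealed class SharedKeyDictionary<TKey, TValue>
{
    private readonly SharedKeyLookup<TKey> _lookup;
    private TValue[] _values;

    public SharedKeyDictionary(SharedKeyLookup<TKey> lookup)
    {
        _lookup = lookup;
        _values = new TValue[lookup.Count];
    }

    public void Set(TKey key, TValue value)
    {
        var index = _lookup.GetOrAddIndex(key);
        if (index >= _values.Length)
            Array.Resize(ref _values, _lookup.Count); // grow this row on first use of a new key
        _values[index] = value;
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        // A key may exist in the lookup but not yet in this row's (shorter) array.
        if (_lookup.TryGetIndex(key, out var index) && index < _values.Length)
        {
            value = _values[index];
            return true;
        }
        value = default(TValue);
        return false;
    }

    // Enumerate against a snapshot of the keys, so keys added later by other
    // rows don't invalidate an in-flight enumeration.
    public IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator()
    {
        foreach (var key in _lookup.SnapshotKeys())
        {
            if (TryGetValue(key, out var value))
                yield return new KeyValuePair<TKey, TValue>(key, value);
        }
    }
}

Note the snapshot only protects enumeration: truly concurrent adds during the mutation phase would still need a lock around the shared lookup.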

Have you taken the Founders Pledge?

In the midst of growing FundApps, I gave far less time to charity than I’d have liked. Despite the generous volunteering policy we established, at a personal level my charity amounted to a few small donations, a little micro-finance, and some occasional volunteering.

I countered this feeling of ‘I should do more’ with a vague notion: if, against all the odds, we successfully exited some day, I’d finally look into which charities really do have meaningful impact and donate a portion of the proceeds to them.

When my co-founder Andrew introduced me to Founders Pledge, the concept totally clicked for me. I could turn my vague plan into a legally binding personal commitment in a matter of minutes. Even better, when the time (hopefully!) comes to donate they’d help me figure out the most effective charities to support the causes I care about.

At the time of writing they’re now close to $1 billion pledged to charity by founders at all stages of company growth. They are funded independently, so there’s no cost to pledgers or the charities. Their evidence-based approach to selecting charities appealed to me and opened my eyes to the world of effective altruism, giving me cause to reflect on the existing charities and micro-finance I had been backing.

Perhaps it’s not the British thing to shout about giving, but if you’re a founder with a similar itch, do go and check them out.

Replacing maternity & paternity with parental leave

A friend of mine recently went to his head of HR to arrange shared parental leave. They had no clue how the process worked, asked him how he was going to cover the work, and warned that it might not be possible. My friend had to calmly point out that this was in fact their problem, not his – and that he was legally entitled to take his leave regardless.

So it sadly came as little surprise when I learnt that shared parental leave take-up in the UK is around 2%. My friend and colleague Pat recently wrote about FundApps’ shift from maternity and paternity pay to a single combined parental leave policy:

  • All new parents received 12 weeks of paid leave, regardless of gender, location, family structure or circumstances
  • The leave is flexible to take at any point in the first year following the child’s birth or adoption
  • All FundAppers are eligible globally with no minimum service requirements
  • Unlimited, paid time off for prenatal, medical or adoption appointments for both mums and dads
  • Flexible options for return to work

It’s well worth a read and I’m proud to be co-founder of a company where we try and do better. Many of the comments were interesting, with a few common themes that I thought I’d share here:

  • So what? 12 weeks isn’t particularly generous. This change is about building a workplace of inclusion and equality for all parents. We’re a 40-person bootstrapped company, so 12 weeks is what we can achieve at the moment. The aim is to increase that when we can.
  • What about after 12 weeks? Presumably this question was mainly from our US audience, but to be clear this is in addition to any statutory leave (up to 12 months in the UK for mothers).
  • What about more flexibility? The flexible options might seem loose because they are – almost everyone in our company works flexibly in some way. We have part-time team members, adjusted hours, adjusted days, working from home, sabbaticals etc. If that means staggering work, job-sharing or setting up remote arrangements, we do this now and would want to speak to our new parents about what will work best for them.
  • Also, a few comments mentioned our pay gap. We have an hourly earnings gap. Many things have negatively influenced that over time; none are excuses. We are improving it and will keep doing so until it’s gone.
  • And a bunch of comments that prove we’re on the internet: “Nobody should be paid for ‘time off’ whether sick, vacationing, or parenting.” “Absolutely idiotic rules… spoiling the work culture.”

I’d love to hear how your own company is approaching this.

Restoring an old bkf backup file on macOS or Windows 10 (/8/7)

I recently realised I had a load of old projects and data sitting in a lovely 100GB bkf file – generated by the ntbackup program that used to ship with Windows XP and Windows 2008 – but no way to access them.

Microsoft released a restore-only version for 2008 R2 / Windows 7, but there was no version of ntbackup I could find that would run on Windows 10.

There were loads of blog and StackExchange posts, but they generally pointed to dodgy (and since dead) downloads of ntbackup.exe. I could have installed Windows XP in a VM, or launched a Windows 2008 R2 instance in Azure/AWS and transferred the files up there, but both felt like a lot of hassle.

Fortunately I found a C-based utility on GitHub called mtftar, which converts an MTF stream into a TAR stream, and someone has generously updated it to compile on both Windows and macOS. Great!

I have Ubuntu running under the Windows Subsystem for Linux, so with build-essential already installed via apt-get, I ran:

> git clone https://github.com/sjmurdoch/mtftar
> cd mtftar
> make
> ./mtftar -?

This lists the various supported command-line arguments. I went for

./mtftar < Backup.bkf | tar xvf -

which extracted the bkf archive straight onto the file system in the same directory. The same commands work on macOS, if you prefer to run there.

Retro-fitting remote working

I wrote my last post about remote working on my way to Gran Canaria almost four years ago. It was the first time I had attempted to work remotely from the team based in London, and the reality was that remote working was really hard!

We aren’t a “remote-first” company, and I knew working with the team when they weren’t used to having to deal with a remote team member was going to be challenging.

Even a slightly flaky internet connection became massively frustrating during calls. Reverse-engineering context from discussions that were happening ‘offline’ was a constant challenge, as was trying to engage the team in the work I was doing from Gran Canaria.

On the plus side, there were no time-zone differences, I met some awesome people, we redesigned the FundApps branding, and I was living one minute from the beach!

However, the challenges of growing a team in person in London were enough that I hadn’t really attempted it since.

How does it look now?

Roll forward several years, and FundApps has grown from 8 people to approaching 50, with offices in London and New York and remote workers in Toronto, Darlington and Auckland.

Now, whether we like it or not, we have to get good at this! It’s still hard, but we’re inching closer to this being a better experience:

  • A decent video conferencing setup. It sounds obvious, but it took a lot of experimentation to find something that worked for us. We’re now using Zoom and a proper speakerphone that works over USB too. If someone has a dodgy connection, they can join the call by phone instead.
  • If there’s one remote person in a meeting, then everyone joins a call from their desks. We don’t do this for everything yet, but it does level the playing field significantly.
  • Face to face time is invaluable, particularly for new starters. Our remote workers had the advantage of having worked in our office for some time – they knew the team, and the culture. Our recent recruits in New York didn’t have that luxury though, so making the time for them to visit London, and for their colleagues to visit New York to build those relationships, has been super valuable.
  • Increase signal to noise. We’re trying to separate discussion from actual decisions so it’s easier to keep track of what’s going on – we’ve had some success using dedicated project channels in Slack (for chat), while ensuring core decisions are recorded more explicitly in GitHub or Google Docs.
  • Prefer async communication if you have team members working in different time zones – try finding a call time that works for folks in Canada, London and New Zealand!

Simple catch-all AWS budgets

We got caught out recently by unexpectedly high usage of AWS CloudWatch, and realised we’d been spending $1,000/month more than expected. After tracking down the cause (one of the team had turned on detailed instance monitoring), I wanted to ensure we had a bit more of a heads-up next time. We had budgets set for all the major services, but not the ‘small’/insignificant ones.

The really simple (and in hindsight, obvious!) solution was just to create a ‘catch-all’ budget covering all the services we didn’t have a separate budget for, even if we’re not currently using them.
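We set ours up in the AWS console, but to illustrate the idea, here’s a rough sketch of the same thing via the AWS SDK for .NET – the account id, limit, threshold, e-mail address and service names below are all placeholders, and you’d want to check the filter values against how the services appear on your own bill:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.Budgets;
using Amazon.Budgets.Model;

class CatchAllBudget
{
    static async Task Main()
    {
        var client = new AmazonBudgetsClient();

        await client.CreateBudgetAsync(new CreateBudgetRequest
        {
            AccountId = "123456789012", // placeholder account id
            Budget = new Budget
            {
                BudgetName = "catch-all-minor-services",
                BudgetType = BudgetType.COST,
                TimeUnit = TimeUnit.MONTHLY,
                BudgetLimit = new Spend { Amount = 100, Unit = "USD" }, // placeholder monthly limit
                // List every service that doesn't already have its own budget,
                // even ones we're not using yet (service names here are placeholders).
                CostFilters = new Dictionary<string, List<string>>
                {
                    ["Service"] = new List<string> { "AmazonCloudWatch", "AWS Lambda" }
                }
            },
            NotificationsWithSubscribers = new List<NotificationWithSubscribers>
            {
                new NotificationWithSubscribers
                {
                    Notification = new Notification
                    {
                        NotificationType = NotificationType.ACTUAL,
                        ComparisonOperator = ComparisonOperator.GREATER_THAN,
                        Threshold = 80, // alert at 80% of the limit
                        ThresholdType = ThresholdType.PERCENTAGE
                    },
                    Subscribers = new List<Subscriber>
                    {
                        new Subscriber { SubscriptionType = SubscriptionType.EMAIL, Address = "ops@example.com" }
                    }
                }
            }
        });

        Console.WriteLine("Catch-all budget created.");
    }
}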

I’d probably move this to Terraform next, but it works well for us so far.

BeyondCorp proxy possibilities on AWS, Google Cloud, Azure

It appears there’s now another tool in the arsenal for those looking at implementing a BeyondCorp-style security model, with the arrival of OIDC authentication support in AWS’s Application Load Balancer. It adds to a growing list of possibilities, at least for HTTP-based services. Who needs a VPN anyway?

The options I’m aware of now include:

  • Bitly’s oAuth2 proxy – a simple open source reverse proxy with OAuth support, written in Go
  • Amazon Application Load Balancer – lets you offload authentication to a separate IdP, and then passes claims via HTTP headers to the proxied application.
  • Google Identity-Aware Proxy – though this only works if the services you are securing live within Google Cloud
  • Azure AD application proxy – Microsoft’s answer to the zero-trust model, with a lightweight proxy that sits within your internal network enabling outbound connectivity to the proxy rather than inbound.
  • CloudFlare Access – a hosted reverse proxy with support for major identity providers like Azure AD and Okta
  • ScaleFT – a commercial zero-trust platform for securing HTTP-based web and SSH-based server access, with a high entry cost (starting at $500/month)
  • Pritunl Zero – a freemium SaaS service offering HTTP- and SSH-based proxying.
  • Duo Beyond

Any others I’m missing? I’d love to hear about folks’ experiences of these.

Not ready to #DeleteFacebook? Here’s some baby steps…

I admit it. I still haven’t taken the plunge to #DeleteFacebook. I can’t remember the last time I posted anything on it, but friends still invite me to events and send me messages via Messenger. Likewise, I haven’t brought myself to use a VPN as standard, or Tor for that matter!

That said, here are some things you might like to try that have stuck for me. Let me know what else I’m missing!

Use Firefox, and enable first-party isolation (the privacy.firstparty.isolate preference in about:config). This isolates cookies so that only the site you were accessing when the cookie was ‘dropped’ can read it, making it harder for the likes of Google and Facebook to track you across multiple sites. If you can’t quite quit the Google Chrome habit, try installing the Privacy Badger extension instead.

Switch to DuckDuckGo for your search. If you’re giving Google everything you’re searching for, you’re just making it easy. While you’re at it, review your Google privacy settings.

Consider changing your DNS to 1.1.1.1. It’s not perfect — you’re trusting CloudFlare rather than your ISP or other DNS provider with your privacy, but it seems like the best of a bad set of options for now. Limitations in other parts of the internet’s protocols, like SNI, and the fact that IP addresses themselves can identify the site you’re accessing, mean this is unlikely to be completely anonymous for the immediate future.

Consider disabling Facebook Platform. The horse has definitely bolted the stable door, but at the very least it’s worth reminding yourself exactly who your data has already been shared with…

Unfollow all your Facebook friends. This does nothing for your privacy, but wonders for your time. My feed is empty — but I can still check up on individual friends when I think of them, and they can still invite me to events.

Are there other habits you’ve successfully changed to increase your privacy online? Would love to hear them!

Exploding cows in Minecraft…

Last weekend I was at Womad festival, helping kids fire exploding cows from catapults in Minecraft. Not my usual line of work as CTO, or typical festival experience for that matter!

I was volunteering with Devoxx4Kids who organise events worldwide where children can develop computer games, program robots and also have an introduction to electronics. CERN had invited Devoxx4Kids to take part in the workshops happening at the Physics Pavilion.

We ran three packed-out workshops across the weekend, with children ranging in age from about 6 to 13. While there was a whole range of knowledge levels, almost everyone was familiar with Scratch — and they most definitely knew far more about Minecraft than me!

Warm up before a session!

The workshops involved writing some Java using Minecraft Forge and Eclipse in order to introduce a catapult into the Minecraft world, understand the impact of angles on how far the catapult could fire, and ultimately throw some surprisingly explosive animals!

As volunteers, we were split around 50:50 between those with a technical background and those without — it wasn’t about showing off our own technical knowledge, more about asking questions and helping the children stay on track with the activity. A particular shout out to Cesar and Dan, whose hard work meant the rest of us could just turn up on the day!

It was humbling to see how well our attendees all tackled the challenge — their thoughtfulness about variable names for their animal of choice and, somewhat more destructively, their delight at choosing how big an explosion to create when it landed!

While it was only a small taster, hopefully it reinforced the realisation (for both parents and children!) that by coding they can actively change the world they experience in these games, and perhaps continue to grow an interest in technology.


Licensing SQL Server in AWS? It’s up to twice as expensive as Azure or Rackspace Cloud.

… and regardless of cloud provider, it’s (probably) costing you 2x what it would on dedicated kit. So AWS could be costing you 4x what it would cost to license on dedicated hardware.

Disclaimer: I am certainly not a SQL Server licensing expert, nor that much of a cloud expert. The purpose of this post is to hopefully prove that I am, in fact, wrong. Please help with this!

Anyone that’s ever had to deal with SQL Server licensing (or indeed any kind of Microsoft licensing) knows what a minefield it is. In the public cloud, all your worries go away (ahem) and you can just wrap the license fee into the monthly cost via the hosting provider’s "service provider license agreement" (SPLA) with Microsoft.

Looking into it further, however, I realised there are actually big discrepancies in how much the different cloud providers charge you for licensing SQL Server itself.

Microsoft Azure: $75–80/core/month
Rackspace Cloud: around $100/core/month
AWS: anything from $136 to $236/core/month

In other words, licensing SQL Server in AWS could cost you up to 2x what it would in Azure or Rackspace Cloud, and up to 4x what it would on dedicated hardware (see Cloud vs Dedicated below).

At the moment, I have no good explanation for this. I’m still waiting for a response from AWS, and I’m hoping that I’m missing something here.

To head off the obvious — I realise there are databases with more cloud-friendly licensing. Unfortunately real life gets in the way, so we’re stuck with SQL Server until we can migrate away from it.

AWS, Rackspace Cloud & Azure compared

If you’ve got this far, I’m assuming you want some detail, so here goes.

Assumptions

  • I am ignoring platform-as-a-service offerings entirely (so SQL Azure and RDS, you’re out)
  • I am only looking at SQL Server Standard
  • I am only considering SQL licensing costs, not the cost of the underlying hardware. For cloud providers that don’t explicitly state a licensing cost (only Azure does this), I’m going to take the cost of a Windows VM with SQL Server on it and subtract the cost of a plain Windows VM (see the worked example after this list).
  • SQL Server is licensed per core, so I’m going to group these calculations by core
  • I’m just looking at monthly pricing with no commitment, as that’s how we’re currently paying for our dedicated kit too.
  • Rackspace has different UK pricing in GBP, but I’ve spot checked the numbers and there was no material difference from their USD pricing, so we’re just going with that.
  • I am looking at prices for Europe-based data centres, in USD. I haven’t checked, but I don’t expect this to vary significantly from the US.
  • I am assuming 730 hours in a month
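To make the per-core methodology above concrete, here’s a tiny worked example – the hourly rates are made-up placeholders, not anyone’s actual list prices:

using System;

class ImpliedSqlLicenceCost
{
    static void Main()
    {
        const decimal hoursPerMonth = 730m;

        // Placeholder hourly prices for the same instance size,
        // with and without SQL Server Standard included.
        const decimal windowsVmPerHour = 0.50m;
        const decimal windowsSqlVmPerHour = 1.30m;
        const int cores = 4;

        // Implied SQL licensing cost = (SQL VM - plain Windows VM), per core, per month.
        var perCorePerMonth = (windowsSqlVmPerHour - windowsVmPerHour) * hoursPerMonth / cores;

        Console.WriteLine($"Implied SQL Server licensing: ${perCorePerMonth:F0}/core/month");
        // With these placeholder numbers: (1.30 - 0.50) * 730 / 4 = $146/core/month.
    }
}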

Data sources

I have taken the prices from these locations: http://www.rackspace.com/cloud/public-pricing, http://azure.microsoft.com/en-gb/pricing/details/virtual-machines/#Windows, http://azure.microsoft.com/en-gb/pricing/details/virtual-machines/#Sql, and http://aws.amazon.com/ec2/pricing/ (as of 22 Jan 2015).

Medium isn’t the best place for displaying tabular data, so I’ve included an image below, and you can download the spreadsheet here.

Azure is the cheapest, working out at around $75–80 per core (regardless of VM size, as they list their SQL prices separately). I’m assuming this is essentially cost price, as I’ve been told elsewhere that a 4-core license via the SPLA scheme is generally around $300/month. Not surprising, given it’s Microsoft licensing their own product.

Rackspace Cloud is more expensive, but very consistent, working out at around $100 per core (regardless of VM size) – roughly a 25% markup. As a point of comparison, on our dedicated kit we pay £420/month for a 6-core license, which works out at £70 (about $106) per core, so this is consistent.

AWS is the big surprise here, with prices ranging from $136/core/month at its cheapest to $236 (m series, followed by r series, followed by c series). That’s a whopping 70–195% markup.

Cloud vs Dedicated

I mentioned earlier another price differential when moving to the cloud. Regardless of which cloud provider you’re using, SQL Server will (probably) cost you 2x what you’re paying for dedicated kit. That’s because SQL Server licensing ignores hyper-threaded cores (PDF) on dedicated kit, but not on virtualized kit, assuming you are only licensing the VM and not the host itself. So if you’re running in a virtualized environment with hyper-threading turned on (Azure, AWS or Rackspace Cloud, for instance), you have two virtual cores for each physical core, and so you’ll need to license twice the number of cores to get (vaguely) equivalent performance (for example, an 8-vCPU cloud VM backed by four physical cores needs an 8-core license, where the equivalent dedicated box needs only four).

Wrapping up

Like I said at the start, I’m no licensing expert, but given the already painful SQL licensing costs, seeing the huge differences we’d be paying in AWS vs other cloud providers is a hard pill to swallow. Please let me know if you spot any mistakes in my calculations!

It would certainly push us away from SQL Server even faster than we were planning, but in the meantime… I’d be interested to hear any insights as to why there are such big variations, and such a huge markup at AWS!