Over the last two weeks we’ve been discussing specific implementations, and one of the more complicated scenarios deserves a post here: data brokers.
When we say data broker, what we mean is an intermediary that specializes in providing persistent, reliable “data hosting” for suppliers. Most suppliers don’t have the in-house team, resources, or desire to maintain an always-on server that can act as a Sandpiper respondent. This is, in fact, how the industry works today, and data brokers serve a crucial role in that capacity. They also often provide additional services, like format conversion and a healthy ecosystem of supplier datasets to choose from.
This leads to a natural human abstraction: we tend to think that, if a customer of ours is receiving our data through a broker, then we are giving that data to our customer. We think that we have a direct data relationship. But we don’t — if we did, we’d send the data directly. Imagine if the customer had issues downloading through the broker’s interfaces. Would they need to talk to us, or to the broker? It’s clear that there are actually two relationships: between us and the broker, and between the broker and the customer.
When we started building Sandpiper, we founded it on this idea of getting down to what’s actually, really happening, discarding as much confabulation and assumption as possible. Sandpiper is a two-actor framework, because the fundamental transfer of product data doesn’t happen as a multicast free-for-all. We transfer data based on our relationships.
So how should a broker operate?

In this diagram, Actor A is a supplier with data to communicate. Actor B is serving as the broker, and Actor C is the consumer.
Actor A and Actor B have a plan to synchronize A’s Pool 1. Actor B can’t turn around and provide that data to Actor C under the same plan, because Actor B doesn’t own it. Instead, the broker needs to copy its replica of A’s data (a.k.a. a snapshot pool) into a new pool that it controls (a.k.a. a canonical pool), and then provide that data under its own agreement with Actor C. This way the source data remains unchanged: any transformations Actor B unknowingly or necessarily carries out as part of its internal infrastructure are applied only to the broker’s own canonical copy, never to A’s snapshot. It also future-proofs the design for our plan to let Sandpiper carry auditing and messaging about data through a long chain of handlers.
Notably, this requires the broker to maintain two copies of the data. Grain UUIDs absolutely cannot be reused: reuse would open an attack vector whereby a malicious party who knew any of another party’s UUIDs could create their own slices referencing them, and the broker would happily pull the full content into the attacker’s pool.
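To make that concrete, here is a minimal sketch of the snapshot-to-canonical copy in Go. The `Grain` and `Pool` types and the `copyToCanonical` function are illustrative assumptions for this post, not the Sandpiper reference implementation; the point is only that every grain re-homed into the broker’s canonical pool gets a freshly minted UUID, while the snapshot pool is left untouched.

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

// Grain is an illustrative stand-in for a Sandpiper grain: an addressable
// unit of payload identified by a UUID.
type Grain struct {
	UUID    string
	Key     string
	Payload []byte
}

// Pool is an illustrative stand-in for a pool of grains.
type Pool struct {
	UUID      string
	Canonical bool
	Grains    []Grain
}

// copyToCanonical re-homes a snapshot pool's grains into a new canonical
// pool owned by the broker. Every grain receives a freshly minted UUID so
// the supplier's grain UUIDs are never reused downstream; the snapshot
// pool itself is never modified.
func copyToCanonical(snapshot Pool) Pool {
	canonical := Pool{
		UUID:      uuid.New().String(),
		Canonical: true,
		Grains:    make([]Grain, 0, len(snapshot.Grains)),
	}
	for _, g := range snapshot.Grains {
		canonical.Grains = append(canonical.Grains, Grain{
			UUID:    uuid.New().String(), // never reuse the supplier's UUID
			Key:     g.Key,
			Payload: g.Payload, // the content itself is carried over unchanged
		})
	}
	return canonical
}

func main() {
	snapshot := Pool{UUID: uuid.New().String(), Grains: []Grain{
		{UUID: uuid.New().String(), Key: "part-1234", Payload: []byte("...")},
	}}
	canonical := copyToCanonical(snapshot)
	fmt.Println(len(canonical.Grains), "grain(s) re-homed into canonical pool", canonical.UUID)
}
```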
Maintaining this separation could be burdensome in large data environments. But:
- Pragmatically, it’s less burdensome than having to divert expert programmers to trace a hidden data modification through network traffic, logs, and transaction histories. Storage is cheap compared to the hard costs of developer salaries and the projected costs of lost business, trust, and opportunity.
- There are well-established strategies for overlapping blob storage. If it becomes an issue, the backend of a broker’s solution can do hash-based deduplication of grain payloads, reducing this overhead to the cost of an 8 KB row rather than the full size of the content (see the sketch just after this list).
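As a rough illustration, a content-addressed blob store keyed by SHA-256 would let the snapshot and canonical grains share a single stored payload. The `BlobStore` type and its methods below are hypothetical, not part of any Sandpiper specification:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// BlobStore deduplicates grain payloads by SHA-256 content hash. Each pool's
// grain rows then only carry the hash, not a second copy of the blob.
type BlobStore struct {
	blobs map[string][]byte
}

func NewBlobStore() *BlobStore {
	return &BlobStore{blobs: make(map[string][]byte)}
}

// Put stores the payload if it is not already present and returns its hash,
// which a grain row can reference in place of the payload itself.
func (s *BlobStore) Put(payload []byte) string {
	sum := sha256.Sum256(payload)
	key := hex.EncodeToString(sum[:])
	if _, exists := s.blobs[key]; !exists {
		s.blobs[key] = payload
	}
	return key
}

// Get retrieves a payload by its content hash.
func (s *BlobStore) Get(key string) ([]byte, bool) {
	b, ok := s.blobs[key]
	return b, ok
}

func main() {
	store := NewBlobStore()
	payload := []byte("grain payload contents")

	// The snapshot grain and the canonical grain reference the same blob.
	snapshotRef := store.Put(payload)
	canonicalRef := store.Put(payload)
	fmt.Println(snapshotRef == canonicalRef, "-> stored once, referenced twice")
}
```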
The broker should maintain one actor for each party it brokers for, because this lets it set that actor’s name and links in a way that makes it clear downstream who it’s brokering for. One piece we still need to decide is exactly what format that should take. It will need to be in place before the final 1.0 release, but I don’t believe it will force any hard changes to the schemas.
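Purely as an illustration of the one-actor-per-brokered-party idea, and not a proposal for that still-open format, a broker’s record might look something like this (every field name here is an assumption):

```go
package main

import "fmt"

// BrokeredActor is a hypothetical record a broker might keep per brokered
// supplier. The exact name/link format is still an open question in the
// spec, so these fields are illustrative assumptions only.
type BrokeredActor struct {
	ActorUUID    string // actor UUID the broker presents to its consumers
	DisplayName  string // should make the brokered-for supplier obvious downstream
	SupplierLink string // some link back to the supplier being brokered for
}

func main() {
	a := BrokeredActor{
		ActorUUID:    "1e8f1c4a-0000-0000-0000-000000000000",
		DisplayName:  "Example Broker (on behalf of Supplier A)",
		SupplierLink: "https://supplier-a.example.com",
	}
	fmt.Printf("%+v\n", a)
}
```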