background-shape

Using Obfuscated SQL Data Streams to Keep Learner Data Private

by Justin Hernandez

Working at Outschool is fun. However, we do take Learner safety seriously and how we handle their private data.

PII (Personally Identifiable Information) and data privacy has become a hot topic among tech companies. There are a wide range of views; from Apple stating that privacy is a human right, to Facebook basing their advertising model on user generated content. PII is at the heart of many tech companies’ business models.

We use PostgreSQL as our main datastore and here we store the PII data necessary to run our application services. Taking care of PII is so critical that we don’t want to share this type of data with a data warehouse. How could we share data without exposing learner PII? We decided to extend our database export service and add a streaming obfuscation layer. Why? It is our duty as a responsible tech company to handle Learner PII with special care.

Our Responsibility

Children are at the center of the toughest data privacy laws. The United States passed the Children’s Online Privacy Act (COPPA) in 1998. This law regulates how websites collect data from children 13 years of age and younger. Violations are often met with heavy fines - similar to what Dinesh (Silicon Valley) learned while acting as CEO for 11 days. In a recent real world example, YouTube paid a record $170 million dollars for COPPA violations in 2019.

Silicon Valley’s Dinesh

One core cultural value of ours is “Stand for Learners”. Keeping this value in mind, in early 2021 we came up with two initiatives to stand for learners’ data privacy. The first one was to make sure that analytic data we collect for minors is never sent to a third party. The second initiative was to remove as much learner data as possible from our in house development and testing systems. As a result of this effort sensitive data in our development environment is obfuscated.

How do we use the obfuscated data? Obfuscated data is used when testing services such as data stores and search pipelines. We need data to run these tests but we don’t need to use “real world” data.

What is Obfuscation?

To get a better understanding of how obfuscation works, let’s think of it in real-world, physical terms. Imagine for a moment that “the data” is an actual person and the obfuscation pieces are window blinds.

Window Blinds

You may be able to tell when a person is standing behind window blinds that are slightly open - but personal details will be hard to discern. How much do they weigh? How long is their hair? What color is their hair? This is how data obfuscation works. Obfuscated data looks and behaves like real data but without personal details.

In addition to this blog post, we created a demo that simulates our database export service - Demo :Obfuscate Leaner PII. This demo utilizes node.js streams to access PostgreSQL and then pipes the data down to column-level obfuscation filters. We make use of node.js streams to keep the export service as performant as possible. Here is a high-level walkthrough of what the Database Export Service does:

  1. AWS Batch initializes the Database Export Service
  2. The export task starts by opening a node.js stream to our PostgreSQL follower cluster
  3. Next, it determines which columns need to be obfuscated and through which filter based on our table to column mappings. The demo code contains 3 example obfuscation filters.
    • Column mappings that do not need to be filtered are ignored and copied directly to the export stream.
  4. Finally, it exports a compressed PII obfuscated database file

Now that we’ve covered how we use the obfuscation library, let’s take a deeper look at the demo code.

Demo: Obfuscate Learner PII Data

For more information on how to run the demo on your local machine, please refer to the README. The main piece of logic that takes apart the PostgreSQL data stream is the class pgCustom.ts. For the sake of brevity and time, however, let’s focus on the implementation flow contained within index.ts.

File: index.ts

Index.ts is instantiated via the scheduled database export service and it initializes via the run() function; run() calls the method exportDb() which then wraps obfuscateDbExport(). In obfuscatePgCustomExport() is where things start to get interesting. Let’s examine this method more closely.


async function obfuscatePgCustomExport({
  logger,
  inputStream,
  outputStream,
  tableMappings,
}: MainDumpProps): Promise<Buffer> {
  const { prelude, reader, head } = await consumeHead(inputStream);
  logger("Header consumed");
  updateHeadForObfuscation(head);

  const formatter = new PgCustomFormatter(prelude);
  const initialHeader = formatter.formatFileHeader(prelude, head);
  outputStream.write(initialHeader);

  const tableCount = countDataBlocks(head);
  logger(`Table export count: ${tableCount}`);

  let dataStartPos = initialHeader.byteLength;
  for (let i = 0; i < tableCount; i++) {
    const dataHead = await reader.readDataBlockHead();
    const bytesWritten = await obfuscateSingleTable({
      logger,
      dataStartPos,
      toc: findTocEntry(head, dataHead.dumpId),
      dataRows: reader.createDataRowIterable(),
      formatter,
      outputStream,
      tableMappings,
    });
    dataStartPos += bytesWritten;
  }

  outputStream.end();
  // Return a header with updated offsets to enable parallel restore
  return formatter.formatFileHeader(prelude, head);
}

Function: obfuscatePgCustomExport

First we check the input stream header information and make sure it’s a valid PostgreSQL stream. After prepping the input stream for obfuscation, we create an output stream to save the obfuscated result. Next, we iterate through each table and pass collected input and output stream information through obfuscateSingleTable().


async function obfuscateSingleTable({
  dataStartPos,
  dataRows,
  formatter,
  outputStream,
  tableMappings,
  toc,
}: TableDumpProps & {
  formatter: PgCustomFormatter;
}) {
  toc.offset = { flag: "Set", value: BigInt(dataStartPos) };
  const dataFmtr = formatter.createDataBlockFormatter(toc);
  const rowCounts = rowCounter();
  let iterator = rowCounts.iterator(dataRows);
  const columnMappings = findColumnMappings(tableMappings, toc);
  iterator = transformDataRows(<ColumnMappings>columnMappings, iterator);

  Readable.from(iterator, { objectMode: true })
    .pipe(createDeflate({ level: 9, memLevel: 9 }))
    .pipe(dataFmtr.transformStream)
    .pipe(outputStream, { end: false });

  await EventEmitter.once(outputStream, "unpipe");

  return dataFmtr.getBytesWritten();
}

Function: obfuscateSingleTable

Within this method, we iterate through the target table’s rows and columns and obfuscate data as needed. We also keep track of the amount of rows we update and the number of bytes written. Don’t forget to keep things tidy and compress your streams - your data consumers will thank you ;).

Oversimplified explanation of what goes on within each PG stream filter:

  1. The filter unpacks the input stream and reads the column’s data
  2. Next it performs the necessary logic to obfuscate the column data
  3. Finally, it packs the data string into the output stream through a serialized buffer

Let’s go over the 3 obfuscation filters included within our demo.

Julia Child Mixing

Obfuscation filters

Within our demo, we created 3 obfuscation filters, they are:

We have also created other filters such as replacing names and location with faker values. Feel free to modify or create your own filters!

Tying it all together

What did we learn from this article? We talked about how costly it can be if you don’t respect children’s online PII data - as YouTube discovered in 2019. Doing our utmost to “Stand for Learners”, we created a Node.js Typescript class named pgCustom.ts that obfuscates SQL data streams on the fly. We also created a demo so you could implement your own streaming services. We created these things because we care about our Learners and want to keep their data private and safe!

If you enjoyed this data deep-dive, please subscribe to our blog and look out for additional articles that cover the intersection of child education and data privacy.


Files Overview

Here is an overview of the files included within our Obfuscate Learner Pii Demo code repository.

About the Author

Justin Hernandez

Howzit! My name is Justin and I’m fascinated with the intersection of empathy and tech. Mahalo for reading 🤙