Using Obfuscated SQL Data Streams to Keep Learner Data Private
by Justin Hernandez
Working at Outschool is fun. However, we do take Learner safety seriously and how we handle their private data.
PII (Personally Identifiable Information) and data privacy has become a hot topic among tech companies. There are a wide range of views; from Apple stating that privacy is a human right, to Facebook basing their advertising model on user generated content. PII is at the heart of many tech companies’ business models.
We use PostgreSQL as our main datastore and here we store the PII data necessary to run our application services. Taking care of PII is so critical that we don’t want to share this type of data with a data warehouse. How could we share data without exposing learner PII? We decided to extend our database export service and add a streaming obfuscation layer. Why? It is our duty as a responsible tech company to handle Learner PII with special care.
Our Responsibility
Children are at the center of the toughest data privacy laws. The United States passed the Children’s Online Privacy Act (COPPA) in 1998. This law regulates how websites collect data from children 13 years of age and younger. Violations are often met with heavy fines - similar to what Dinesh (Silicon Valley) learned while acting as CEO for 11 days. In a recent real world example, YouTube paid a record $170 million dollars for COPPA violations in 2019.
One core cultural value of ours is “Stand for Learners”. Keeping this value in mind, in early 2021 we came up with two initiatives to stand for learners’ data privacy. The first one was to make sure that analytic data we collect for minors is never sent to a third party. The second initiative was to remove as much learner data as possible from our in house development and testing systems. As a result of this effort sensitive data in our development environment is obfuscated.
How do we use the obfuscated data? Obfuscated data is used when testing services such as data stores and search pipelines. We need data to run these tests but we don’t need to use “real world” data.
What is Obfuscation?
To get a better understanding of how obfuscation works, let’s think of it in real-world, physical terms. Imagine for a moment that “the data” is an actual person and the obfuscation pieces are window blinds.
You may be able to tell when a person is standing behind window blinds that are slightly open - but personal details will be hard to discern. How much do they weigh? How long is their hair? What color is their hair? This is how data obfuscation works. Obfuscated data looks and behaves like real data but without personal details.
In addition to this blog post, we created a demo that simulates our database export service - Demo :Obfuscate Leaner PII. This demo utilizes node.js streams to access PostgreSQL and then pipes the data down to column-level obfuscation filters. We make use of node.js streams to keep the export service as performant as possible. Here is a high-level walkthrough of what the Database Export Service does:
- AWS Batch initializes the Database Export Service
- The export task starts by opening a node.js stream to our PostgreSQL follower cluster
- Next, it determines which columns need to be obfuscated and through which filter based on our table to column mappings. The demo code contains 3 example obfuscation filters.
- Column mappings that do not need to be filtered are ignored and copied directly to the export stream.
- Finally, it exports a compressed PII obfuscated database file
Now that we’ve covered how we use the obfuscation library, let’s take a deeper look at the demo code.
Demo: Obfuscate Learner PII Data
For more information on how to run the demo on your local machine, please refer to the README. The main piece of logic that takes apart the PostgreSQL data stream is the class pgCustom.ts. For the sake of brevity and time, however, let’s focus on the implementation flow contained within index.ts.
File: index.ts
Index.ts is instantiated via the scheduled database export service and it initializes via the run() function; run() calls the method exportDb() which then wraps obfuscateDbExport(). In obfuscatePgCustomExport() is where things start to get interesting. Let’s examine this method more closely.
async function obfuscatePgCustomExport({
logger,
inputStream,
outputStream,
tableMappings,
}: MainDumpProps): Promise<Buffer> {
const { prelude, reader, head } = await consumeHead(inputStream);
logger("Header consumed");
updateHeadForObfuscation(head);
const formatter = new PgCustomFormatter(prelude);
const initialHeader = formatter.formatFileHeader(prelude, head);
outputStream.write(initialHeader);
const tableCount = countDataBlocks(head);
logger(`Table export count: ${tableCount}`);
let dataStartPos = initialHeader.byteLength;
for (let i = 0; i < tableCount; i++) {
const dataHead = await reader.readDataBlockHead();
const bytesWritten = await obfuscateSingleTable({
logger,
dataStartPos,
toc: findTocEntry(head, dataHead.dumpId),
dataRows: reader.createDataRowIterable(),
formatter,
outputStream,
tableMappings,
});
dataStartPos += bytesWritten;
}
outputStream.end();
// Return a header with updated offsets to enable parallel restore
return formatter.formatFileHeader(prelude, head);
}
Function: obfuscatePgCustomExport
First we check the input stream header information and make sure it’s a valid PostgreSQL stream. After prepping the input stream for obfuscation, we create an output stream to save the obfuscated result. Next, we iterate through each table and pass collected input and output stream information through obfuscateSingleTable().
async function obfuscateSingleTable({
dataStartPos,
dataRows,
formatter,
outputStream,
tableMappings,
toc,
}: TableDumpProps & {
formatter: PgCustomFormatter;
}) {
toc.offset = { flag: "Set", value: BigInt(dataStartPos) };
const dataFmtr = formatter.createDataBlockFormatter(toc);
const rowCounts = rowCounter();
let iterator = rowCounts.iterator(dataRows);
const columnMappings = findColumnMappings(tableMappings, toc);
iterator = transformDataRows(<ColumnMappings>columnMappings, iterator);
Readable.from(iterator, { objectMode: true })
.pipe(createDeflate({ level: 9, memLevel: 9 }))
.pipe(dataFmtr.transformStream)
.pipe(outputStream, { end: false });
await EventEmitter.once(outputStream, "unpipe");
return dataFmtr.getBytesWritten();
}
Function: obfuscateSingleTable
Within this method, we iterate through the target table’s rows and columns and obfuscate data as needed. We also keep track of the amount of rows we update and the number of bytes written. Don’t forget to keep things tidy and compress your streams - your data consumers will thank you ;).
Oversimplified explanation of what goes on within each PG stream filter:
- The filter unpacks the input stream and reads the column’s data
- Next it performs the necessary logic to obfuscate the column data
- Finally, it packs the data string into the output stream through a serialized buffer
Let’s go over the 3 obfuscation filters included within our demo.
Obfuscation filters
Within our demo, we created 3 obfuscation filters, they are:
- replaceEmailWithColumn
- This method replaces the “local-part” of the user’s email with the column data value from the passed column name. Within our demo passed the “uid” column.
- Example: users-personal-email@mydomain.com → 5320e8cc-606c-45e0-b7e3-4fb2efec2761@mydomain.com
- replaceWithNull
- Replaces the entire column with null.
- Example: “Secret value” → null
- replaceWithScrambledText
- Switches each letter or digit with a random letter or digit
- Example: “Secret pin value: 1234” → “ujmqa hqr mzidl: 8915”
We have also created other filters such as replacing names and location with faker values. Feel free to modify or create your own filters!
Tying it all together
What did we learn from this article? We talked about how costly it can be if you don’t respect children’s online PII data - as YouTube discovered in 2019. Doing our utmost to “Stand for Learners”, we created a Node.js Typescript class named pgCustom.ts that obfuscates SQL data streams on the fly. We also created a demo so you could implement your own streaming services. We created these things because we care about our Learners and want to keep their data private and safe!
If you enjoyed this data deep-dive, please subscribe to our blog and look out for additional articles that cover the intersection of child education and data privacy.
Files Overview
Here is an overview of the files included within our Obfuscate Learner Pii Demo code repository.
- README.md
- How to run the demo on your local machine
- constants.ts
- Shared code constants
- filters.ts
- Custom obfuscation filters. Included within the demo: replaceEmailWithColumn, replaceWIthNull, replaceWithScrambledText.
- Feel free to extend and create your own custom obfuscation filters!
- index.ts
- Main invocation file
- pgCustom.ts
- Custom class that parses and reconstructs PostgreSQL export streams on the fly
- shared.ts
- Shared library logic
- tableMappings.ts
- Exports an object of tables with columns to obfuscation filter mappings. Feel free to extend and add your own tables!
- types.ts
- Typescript types