Deleting Data Without Regrets

by Nuria Ruiz

Deleting data (without regrets) is much harder than it seems at first sight. For example: customer data is read by the outschool.com website from a relational database and OpenSearch but much of this data also exists on a data warehouse where we run our ETLs. In larger systems that same customer data might also be replicated to a key/value storage like Cassandra or DynamoDB. In outschool’s example the primary storage is postgres, but historic versions of the data only exist in the data warehouse. This historic data is not a nice-to-have dataset as, while outschool will continue to work without access to the data warehouse, the historic data is crucial for proper accounting.

Deleting data from our systems should be done in a manner that guards against possible mistakes. While we might have a backup, restoring data from a backup that was accidentally deleted is never easy, no matter the system. There are some obvious precautions to follow when coding scripts that will be used for automated data deletion (like, ahem, make sure that code is well tested) but there are many others that might save the day that are not so obvious. Here are some precautions and procedures that are worth considering.

Do not use a privileged user

Do not schedule scripts that delete data to run under a user with broad privileges. It is easy to create users with limited permissions and ideally the user that deletes data has as few privileges as possible It is easy to make mistakes when scheduling data deletion jobs, and the broader the privileges from the user that executes the deletion the bigger those mistakes can be.

Deleting is a two-step process

When executing frequent deletions according to a criteria (compliance, privacy guidelines) we need two scripts (as explained below).

Soft Deletion

The first script identifies records to delete and soft-deletes them. That is, it marks them as deleted, filling the value of the column deleted_at. This means that the tables that you want to delete from need to have a deleted_at column. This is known as soft deletion.

Execute Guards and Actual deletion

The second script is the one that will actually delete the records in question. This is a script that executes by default as a dry run and prints out what is about to delete. So any script execution that does not include the argument —execute is a no-op. Execution of deletions needs to be explicit, it is not a default.

Before the deletion of any records, guards should be executed. Guards ensure that deletions are sensical. While deletions are executed against the primary, guards might need to be executed against a replica if they are intensive. If guards are not satisfied, the script exits.

Guards can be:

select count(*) of records marked with deleted_at - if this number exceeds a given threshold, the script does not delete; it exits in error. In general, guards are better expressed as a ratio. So, in this case, if the ratio of records to delete to records in the table is too high, the script exits in error.
Look at the creation date of the records we are deleting. Are we deleting data that “seems” too old? Are we deleting data that is “too recent?” This notion of old/recent is one that will depend on the nature of the application. If the date of record creation is older/newer than a threshold, the script should log and exit in error.
Is this data tied to an admin action of some sort (i.e. part of an audit trail currently being investigated). In that case the data should most likely not be deleted, regardless of age/ shape.

Guards are very different from tests. Your code can be working perfectly and according to your design YET there are conditions in your data that you did not anticipate. There will be conditions in your data that you did not anticipate at some point. That is for sure. For example, an error when syncing data from one system to another can leave you with empty data in the date column leading your automated process to think that those records are older than they are if they are filled in with a default date like 1970-01-01 or similar.

By default, executions of scripts that delete data are always a no-op

This is worth reiterating: it should never be the case that a deletion script can delete data when it is being tested. An easy way to avoid this would be adding a parameter like –execute and in the absence of it the script just logs the data that it will delete– thus executing a dry-run. This might be OK to prevent some error scenarios but it does not work for situations where you mean to execute something like this:

delete_data.ts –table customer_messages –retainDays 90 –execute

and yet, when you schedule it on a cron or similar you end up with:

delete_data.ts –tablecustomer_messages –retainDays 9 –execute

Notice that while we intended to retain 90 days of data we scheduled to retain just 9. This is a problem that might be caught by guards …or not.

A good strategy to avoid the scenario described above is making sure scripts can only be executed in prod with the parameters you have tested them with. A way to achieve that is to have the dry run produce a signature (via md5 of parameters or similar) that is required when the –execute parameter is included (credit to https://github.com/marcelrf for this idea).

With this approach, the scenario above will play out like this:

Test deletion script; the dry run execution prints a hash

delete_data.ts –table customer_messages –retainDays 90

          parameter hash : '0736211b7b38f5d1fba53e13d8121f33'

When scheduling, the deletion via cron we would include the corresponding hash:

delete_data.ts –table customer_messages –retainDays 9 –signature 0736211b7b38f5d1fba53e13d8121f33 – execute

When executing, the script will see that the signature does not match the parameters and will exit in error.

Resources

There are many other topics around deletion that you might want to explore, we highly recommend reading the blogpost about Data Deletion on a distributed system by Leah Kissner

About the Author

Nuria Ruiz

Nuria is an Engineer with the Infrastructure pod.