how-to/3-anonymous_dumps
3- Anonymous Dumps
In many situation, what we want is simply to export the anonymized data into another database (for testing or to produce statistics). This is what pg_dump_anon does!
The Story
Paul has a website and a comment section where customers can express their views.
He hired a web agency to develop a new design for his website. The agency asked for a SQL export (dump) of the current website database. Paul wants to \“clean\” the database export and remove any personnal information contained in the comment section.
How it works
Learning Objective
- Extract the anonymized data from the database
- Write a custom masking function to handle a JSON field.
Load the data
DROP TABLE IF EXISTS website_comment CASCADE;
CREATE TABLE website_comment (
id SERIAL PRIMARY KEY,
message JSONB
);
curl -Ls https://dali.bo/website_comment -o /tmp/website_comment.tsv
head /tmp/website_comment.tsv
COPY website_comment
FROM '/tmp/website_comment.tsv'
SELECT
message->'meta'->'name' AS name,
message->'content' AS content
FROM website_comment
ORDER BY id ASC
Activate the extension
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();
SELECT setseed(0);
Masking a JSON column
The \“comment\” field is filled with personal information and the fact the field does not have a standard schema makes our tasks harder.
In general, unstructured data are difficult to mask.
As we can see, web visitors can write any kind of information in the comment section. Our best option is to remove this key entirely because there\’s no way to extract personnal data properly.
We can clean the comment column simply by removing the \“content\” key!
SELECT message - ARRAY['content']
FROM website_comment
WHERE id=1;
First let\’s create a dedicated schema and declare it as trusted. This means the \“anon\” extension will accept the functions located in this schema as valid masking functions. Only a superuser should be able to add functions in this schema.
CREATE SCHEMA IF NOT EXISTS my_masks;
SECURITY LABEL FOR anon ON SCHEMA my_masks IS 'TRUSTED';
Now we can write a function that remove the message content:
CREATE OR REPLACE FUNCTION my_masks.remove_content(j JSONB)
RETURNS JSONB
AS $func$
SELECT j - ARRAY['content']
$func$
LANGUAGE SQL
;
Let\’s try it!
SELECT my_masks.remove_content(message)
FROM website_comment
And now we can use it in a masking rule:
SECURITY LABEL FOR anon ON COLUMN website_comment.message
IS 'MASKED WITH FUNCTION my_masks.remove_content(message)';
Finally we can export an anonymous dump of the table with
pg_dump_anon
:
export PATH=$PATH:$(pg_config --bindir)
pg_dump_anon --help
export PATH=$PATH:$(pg_config --bindir)
export PGHOST=localhost
export PGUSER=paul
pg_dump_anon boutique --table=website_comment > /tmp/dump.sql
Exercices
E301 - Dump the anonymized data into a new database
Create a database named \“boutique_anon\” and transfer the entire database into it.
E302 - Pseudonymize the meta fields of the comments
Pierre plans to extract general information from the metadata. For instance, he wants to calculate the number of unique visitors based on the different IP adresses. But an IP adress is an indirect identifier, so Paul needs to anonymize this field while maintaining the fact that some values appear multiple times.
Replace the remove_content
function with a better one called
clean_comment
that will:
- Remove the content key
- Replace the \“name\” value with a fake last name
- Replace the \“ip_address\” value with its MD5 signature
- Nullify the \“email\” key
HINT: Look at the
jsonb_set()
andjsonb_build_object()
functions
Solutions
S301
export PATH=$PATH:$(pg_config --bindir)
export PGHOST=localhost
export PGUSER=paul
dropdb --if-exists boutique_anon
createdb boutique_anon --owner paul
pg_dump_anon boutique | psql --quiet boutique_anon
export PGHOST=localhost
export PGUSER=paul
psql boutique_anon -c 'SELECT COUNT(*) FROM company'
S302
CREATE OR REPLACE FUNCTION my_masks.clean_comment(message JSONB)
RETURNS JSONB
VOLATILE
LANGUAGE SQL
AS $func$
SELECT
jsonb_set(
message,
ARRAY['meta'],
jsonb_build_object(
'name',anon.fake_last_name(),
'ip_address', md5((message->'meta'->'ip_addr')::TEXT),
'email', NULL
)
) - ARRAY['content'];
$func$;
SELECT my_masks.clean_comment(message)
FROM website_comment;
SECURITY LABEL FOR anon ON COLUMN website_comment.message
IS 'MASKED WITH FUNCTION my_masks.clean_comment(message)';