Problem
TL;DR : use HMAC-SHA256
Every once in a while you need to sanitize personal data in order to comply with local regulations and with common sense.
Unless you work for Kongo Gumi, and even if you are in doubt about whether to keep the data, consider that, as information theory would have it, all data is asymptotically either public or deleted.
Hence the idea described below is in no sense guidance on internal policies. It is just a practical method for achieving one goal: replacing an id that qualifies as personal data with an id that is a purely technical construct, which is handy for machine learning, statistics, and business intelligence.
Q-Day
It’s no longer a question of if, only a question of when. Designing a system today requires taking recent advances in post-quantum cryptography into account.
Hashing
Let’s consider a simple use case with vehicle registration plates: we receive a stream of data from speed cameras and want to understand the patterns that lead to excessive speeding. In this context it is a good idea to replace the license plate number with a surrogate.
The first idea that comes to mind is to hash the plate with SHA-256.
Let’s see what the caveats are.
Input : AA01AA0001
SHA-256 :
:~$ echo -n "AA01AA0001" | sha256sum
557db30f40ad709d5e075eab6b52913faadb40c36f1e29c60cfd40756e9374c7 -
Setup
DISCLAIMER : Only test hashes you are authorized to test. Unauthorized cracking is illegal and unethical.
Let’s use Kali Linux and install hashcat:
sudo apt update
sudo apt install hashcat
hashcat --version
Testing with a four-digit mask:
:~$ echo -n "AA01AA0001" | sha256sum | cut -d ' ' -f1 > hash.txt
:~$ hashcat -m 1400 -a 3 hash.txt AA01AA?d?d?d?d --status
output :
557db30f40ad709d5e075eab6b52913faadb40c36f1e29c60cfd40756e9374c7:AA01AA0001
Session..........: hashcat
Status...........: Cracked
Hash.Mode........: 1400 (SHA2-256)
Hash.Target......: 557db30f40ad709d5e075eab6b52913faadb40c36f1e29c60cf...9374c7
Testing with the full mask takes more time; you can run it yourself to see the estimate on your system.
:~$ hashcat -m 1400 -a 3 hash.txt ?u?u?d?d?u?u?d?d?d?d --status
At this point you could speed up the work with a wordlist or a rainbow table, but that is out of scope for this guide. The main takeaway is that anyone who knows the pattern can recover the original id from its hash, so plain hashing is not suitable for data sanitization.
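To make the point concrete, here is a minimal Python sketch of the same mask attack. It is not how hashcat works internally, just an illustration: it counts the candidates for the full plate pattern and brute-forces the four-digit case. The names `sha256_hex` and `crack_four_digits`, and the fixed `AA01AA` prefix, are assumptions chosen to mirror the example above.

```python
import hashlib
import string

# Keyspace of the full mask ?u?u?d?d?u?u?d?d?d?d:
# four uppercase letters and six digits.
FULL_KEYSPACE = len(string.ascii_uppercase) ** 4 * 10 ** 6


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def crack_four_digits(target_hex: str, prefix: str = "AA01AA"):
    """Try prefix + 0000..9999 and return the matching plate, or None."""
    for n in range(10_000):
        candidate = f"{prefix}{n:04d}"
        if sha256_hex(candidate) == target_hex:
            return candidate
    return None


if __name__ == "__main__":
    target = sha256_hex("AA01AA0001")
    print(f"full-mask candidates: {FULL_KEYSPACE:,}")
    print("recovered:", crack_four_digits(target))
```

The full mask has 26^4 × 10^6 = 456,976,000,000 candidates, which is a small search space by brute-force standards, and even plain Python gets through the four-digit case almost instantly.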
HMAC
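A much better alternative is HMAC with a sufficiently long secret key. Without the key the mask attack shown above does not work, and with a 256-bit key the construction is considered quantum resistant, since Grover’s algorithm at best halves the effective key length. I will use the standard variant, HMAC-SHA256: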
:~$ echo -n 'AA01AA0001' | openssl dgst -sha256 -mac HMAC -macopt hexkey:0000111122223333444455556666777788889999aaaabbbbccccddddeeeeffff
SHA2-256(stdin)= 459d458f308f01b165169bf6ec32d0c8cb2af38be91b26bde73c8121a11451e1
The key above is predictable and therefore for demo purposes only; use a high-entropy random key in a real deployment.
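The same idea can be sketched in Python using only the standard library. The helper name `pseudonymize` is hypothetical, and generating the key with `secrets.token_bytes(32)` is just one reasonable assumption about what a high-entropy key looks like. In practice the key would be created once and kept in a secrets manager, not generated on every run.

```python
import hashlib
import hmac
import secrets


def pseudonymize(plate: str, key: bytes) -> str:
    """Return the HMAC-SHA256 of the plate under the secret key, as hex."""
    return hmac.new(key, plate.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # 32 random bytes (256 bits) from the OS CSPRNG; keep this key secret
    # and reuse it so that surrogates stay stable across runs.
    key = secrets.token_bytes(32)

    a = pseudonymize("AA01AA0001", key)
    b = pseudonymize("AA01AA0001", key)
    c = pseudonymize("AA01AA0002", key)

    print(a)
    print("stable for same plate:", a == b)
    print("differs for other plate:", a != c)
```

Because the mapping is deterministic for a given key, the same vehicle always gets the same surrogate, so speeding patterns remain analyzable, while someone who knows the plate format but not the key cannot enumerate candidates against the output.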
Summary
Using HMAC-SHA256 for data sanitization is more secure than plain hashing with SHA-256, because the data pattern is usually publicly known.
Bonus
Hashing with SHA-256 is not suitable for storing passwords either: the internet remembers the cases where industry leaders suffered severe damage because they stored plain hashes in their databases.
Yes, you could pick a good key derivation function, but it is a much better solution to use the OAuth standard with an authorization server.
Happy Diwali!