A Simple Enough Hashing Algorithm
Data privacy is a big deal. But so is customer identification. Let's talk about it.
Here’s the thing: data privacy is important. Like, extremely important. Like, financially devastatingly important. As developers, we need to take careful measures to follow and respect data privacy laws.
However, problems arise when we’re faced with the need to identify customers without violating data privacy. An excellent example of this actually happens to be something I’ve worked on recently.
As we all know, our code just isn’t perfect. And try as we might to make it as fool-proof as possible, there’s always going to inevitably be a situation where an end-user manages to break something. Of course, we try to implement error handling methods to the best of our ability, but the ability to cover every single imaginable (and unimaginable) edge case is nigh impossible.
Tools have been made to help with this, with a prime example being Sentry, one of the better-known and better-performing error monitoring systems. If you’re unfamiliar with Sentry, it’s essentially nothing more than a glorified try…catch
block with a fancy dashboard to boot.
Okay, it’s actually much, much more than that, but you get my drift.
Now, instead of paying for yet another third-party service, I’m the type of dev that’d much rather build it in-house. So fast forward a week later and my team and I have an internal bootleg version of Sentry that we’ve lovingly dubbed Sentinel up and running, logging all sorts of errors that we would’ve otherwise missed. Perfecto.
This is great and all, and boy howdy is it useful, but there’s one slight problem with it. Without any sort of identification being logged, we have no way of knowing if an error is being produced en masse by a singular John Doe with a massive PEBCAK issue, or if the error is something that’s affecting a larger subset of users and is thus an actual problem to prioritize a solution for. And with identification logging comes privacy violations. No perfecto.
Enter: encryption.
My initial idea was to use bcrypt in order to hash a single identifier from an end-user. I’ve used bcrypt before, and am familiar enough with it that it’d be easy to integrate. However, the hashes generated by bcrypt are a bit… lengthy. This makes it annoyingly difficult to tell at a quick glance if a cluster of errors are generated by one user or the same user.
I could also truncate the bcrypt hash to a handful of characters — a solid enough idea in and of itself, except there’s no real guarantee that the first handful of characters in a hash are going to be truly unique, although the possibility of a collision is incredibly rare.
So instead of using bcrypt and adding yet another dependency, I decided to take the approach of writing a custom hashing algorithm. To do this, I had two requirements.
The algorithm needs to be one-way. It should be impossible to take the output and revert it back to its original textual input.
The end result should be lengthy, yet also short enough that it’s easy to differentiate hashes at a quick glance.
So after doing a bunch of digging around through the interwebs, I came across this nice little post on Stack Overflow, regarding converting a string into a hash. I liked the accepted answer, despite it being a touch outdated, but instead opted to modify one of the other answers due to its simplicity.
The final result isn’t the most efficient thing in the word, but I’m not aiming for efficiency here. I just need a simple solution to fuck up a piece of text beyond recognition or repair. And thus, zcrypt (shut up, I’m bad with names) was born.
export default function zcrypt(data) {
return data.split("").reduce(function (a, b) {
a = (a << 5) - (b.charCodeAt(0) << 2) + b.charCodeAt(0);
return a;
}, 0);
}
Let’s walk through this.
First off, I don’t have many opportunities to use .reduce()
, so I’m happy to take them when I can. Essentially we’re taking a piece of data, and converting it into an array of individual characters and combining them into a single item.
The actual hashing happens on the third line.
a = (a « 5) - (b.charCodeAt(0) « 2) + b.charCodeAt(0);
Now, I wasn’t familiar with what «
meant — this was actually my first time seeing an operator like this in code, so I had to look it up. Fortunately, MDN is the best resource when it comes to JavaScript.
In detail what’s happening is this: a
is the accumulator, a running total if you will, starting off at 0. For each index in our array of data, we take the value of a
, then bitshift it five places to the left. This effectively multiplies a
by 2^5, or 32. b.charCodeAt(0)
gets the ASCII value of the current character. Once we bitshift our initial value by 5, we then subtract from it the current character’s value after it’s been bitshifted to the left twice (2^2, or multiplied by 4). Once that’s done, we then add the original ASCII value of the current character to the accumulator.
From there, we cycle through our array until we’ve exhausted the list and have returned our final hash. The actual length of the final hash will vary depending on the original input, but it’s short enough to be readable while also still being effectively undecipherable.
It’s not the most perfect or efficient solution, there are quite likely better methods out there. However, it serves its purpose quite well — my team is now able to easily identify whether or not a particular error is affecting a large subset of users or if it’s a one-off caused by some guy slapping his keyboard in the wrong way, while also ensuring user data is anonymous.
Perfecto.