Intro to Password Hashing

User accounts are one of the important service our team’s website provides to students, but password storage is one of those things you don’t want to mess up. The problem with password storage is that the obvious ways to do it have some huge problems - and we should get it right the first time.

In this little article I’m going to take you through a few approaches to password storage, and teach you about some important techniques that we will need to make our system secure.

Approach 1: Just store the password

The most obvious way to store a user’s password is just to store their password as-is. Our database might look like this:

username password
benv My_Passw0rd!
noahc d@rkth3m3
mom password123
dad password123

When a user tries to log in, we take their username and their password, and we compare them against what we have in our database. If the password they gave us matches what’s in the database, then the password is correct.

Diagram of Approach 1

This works, but it really isn’t very secure. If anyone ever got access to our database, they would be able to just read all of our user’s passwords and everyone’s information would be compromised. We need a better way.

Approach 2: Hash the password

A much better way to store passwords is to “hash” them first. But what’s “hashing”?

The idea behind hashing is to come up with a unique “fingerprint” for your data, without revealing what the actual data is. When you hash a piece of data, you get a random-looking but unique string of characters that looks nothing like the original. For example, here’s a bunch of strings that I ran through a hashing algorithm called MD5:

original hashed
hello 5d41402abc4b2a76b9719d911017c592
Hello 8b1a9953c4611296a827abf8c47804d7
this string is quite a bit longer 6a0170d5f09421661284844464743b4d

I mentioned that a hash is like a “fingerprint”. If you hash the string “hello”, you will always get the result “5d41402abc4b2a76b9719d911017c592”. And just like a fingerprint, hashes are unique, so no other string will give you that result. Notice how changing a single letter results in a completely different hash! (Also note that every hash is the same length, no matter how long the string is.)

Now, the important thing about hashing is that it is one-way. It is easy to calculate the hash of some data, but impossible to reverse that process except by brute force.

Using hashing for passwords

To use hashes in our password system, we can store the hash of the password instead of the password itself. That way, if somebody gets access to our database, they will only have a list of password “fingerprints” - but not the passwords themselves.

Our database might look like this:

username password_hash
benv 9f823275b248cf57382508db717a103d
noahc 4b83db5e506f2e84cdf27812ad43b7be
mom 482c811da5d5b4bc6d497ffa98491e38
dad 482c811da5d5b4bc6d497ffa98491e38

So how do we check if a user entered the right password? All we have to do is hash the password they gave us before we compare it to the hash in the database. Because hashes are unique, if the hashes match, the passwords match!

Diagram of Approach 2

Is this good enough?

This is definitely an improvement over just storing the passwords in plain text. If anyone gains access to our database now, they will just have a list of hashes instead of a list of passwords, and each hash can only be cracked by brute force.

However, this system still has some problems. Take a look at mom and dad in our database again:

username password_hash
mom 482c811da5d5b4bc6d497ffa98491e38
dad 482c811da5d5b4bc6d497ffa98491e38

See how they have the same hash? Because they have the same hash, we can tell that they have the same password - which might mean that they have a common password that’s easy to guess. It also means that an attacker can crack multiple passwords in your database by figuring out a single hash.

Also, although hashes can only be cracked by brute force, people have spent a lot of time brute-forcing! You can easily download lists of hashes for every common password out there. Using those, if an attacker sees “482c811da5d5b4bc6d497ffa98491e38” in a database, it won’t take them any time to find out that the password is “password123”.

We need an even better way.

Approach 3: Salted hashes

The big problem with hashing alone is that it is too predictable. So what if we add some randomness to it?

One of the problems with Approach 2 was that hashes for common passwords are already well-known. So what if before we hash the user’s password, we tack some random data on the end? After all, everybody knows the hash for “password”, but nobody knows the hash for “passwordWOIUHDBMNFBJYASTIUHKJNGKJDHJSDF”.

This process of attaching random data is called “salting”, because it adds a unique “flavor” to each password. The random string we use is called the “salt”, and it’s different for each user in our database.

Let me take you through an example. Say we want our system to salt my password from the previous examples. It would follow these steps:

  • First it generates some random string to act as the salt - say, “HFGDZFGHMXGKJYHT”.
  • It appends the salt to the password I provided: “My_Passw0rd!HFGDZFGHMXGKJYHT”. The password is now “salted”.
  • Finally, it hashes the salted password and stores both the salt and the hash in our database.

Here’s what our database might look like when we use salting:

username salt password_hash
benv HFGDZFGHMXGKJYHT 445d35ec793288142d78a0c0a94b38db
noahc POIYEGBKNSLDQNMM 71245521003d91875f9485752a76ad6c
mom QDCQVBSDKJGGFEYM b22dcae8cc02084fa54f6a7573ca2816
dad FHZKUFXEWIWSGDIC d69aae987a6fdf22638a5ad587147e65

Note that mom and dad have different hashes now! When we use salting, every user in our database should have a different hash, even if they have the same password. Plus, all that random data should make even common passwords much harder to crack.

So when we’re using salts, how do we actually check if a user’s password is correct? We modify Approach 2 just a little bit - before we hash the given password, we look up the user’s salt and stick it on the end. Everything else is still the same.

Diagram of Approach 3

Also, it’s important to note that the salt doesn’t have to be secret! All the salt does is make the hashes themselves less predictable.

In Conclusion

Approach 3 is the most secure of the three, and it’s good enough for us to use.

Also, we shouldn’t use MD5 for our hashing algorithm, since it’s actually not very secure at all. Instead we should use bcrypt, which forces the computer to work really hard for each hash. It’s very hard to brute-force a password when it takes half a second to do every single hash!

Let me know if you have any questions or need any more examples.

Bonus: A taste of how hash functions work

Hashes use mathematical operations that are easy to perform but hard to reverse. One example is modular addition - in other words, addition that wraps around at a maximum value.

As a primer, the modulo operation in math finds the remainder of division. For example:

  • 5 mod 3 = 2, because 5 / 3 = 1, with a remainder of 2.
  • 10 mod 2 = 0, because 10 / 2 = 5, with no remainder.
  • 7 mod 9 = 7, because 7 / 9 = 0, with a remainder of 7. (This one is a little trickier!)

One interesting thing about the modulo operation is that it makes the output value “wrap around”. For example, let’s look at a bunch of numbers mod 3:

n n mod 3
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
9 0

See how the n mod 3 column just repeats “0 1 2” over and over? When you divide by 3, your remainder can never be greater than 2, so it keeps wrapping around to 0 again.

Now, let’s invent a really bad hashing algorithm!

First, let’s assign a number to each letter of the alphabet. We’ll keep it simple: A is 1, B is 2, C is 3, etc., up to Z is 26. Our hash algorithm will be to add together all the letters of our string, then take that result mod 10.

For example, let’s hash the string “hello”:

  h + e + l  + l  + o
= 8 + 5 + 12 + 12 + 15
= 52

52 mod 10 = 2

So our final hash is 2.

Now, here’s the important question. We know that the hash of a particular string is 7. What is the original string?

(Bonus question: What problems do you think we might have if we used this hashing algorithm?)