What are hash functions and hash values?
A hash function is any function that maps input of varied sizes into a fixed-size output. The values returned by the hash function are called hashes, hash values, hash codes, checksums, or hash digests.
Because the output has a fixed size while the set of possible inputs is unbounded, two different inputs can be mapped to the same output. This is called a hash collision. How easily collisions can be found depends on the hashing algorithm used.
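To make the idea of a collision concrete, here is a minimal sketch using a deliberately weak hash. The function toy_hash is hypothetical and exists only for illustration; it is not one of the algorithms discussed in this article.

```python
def toy_hash(data: bytes) -> int:
    """Deliberately weak, hypothetical hash: sum of the byte values modulo 256."""
    return sum(data) % 256

# The same bytes in a different order add up to the same total, so they collide.
print(toy_hash(b"ab"), toy_hash(b"ba"))
print(toy_hash(b"ab") == toy_hash(b"ba"))  # True
```

Real hashing algorithms are designed so that finding such pairs is far harder than in this toy case.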
In this article, we will use Python to generate random hashes and also discuss how to implement the popular hashing algorithms, namely, Message-Digest (MD) and Secure Hash Algorithm (SHA).
Method 1: Generate Random Hashes with random Module
With the random module, we can specify the number of bits for the hash we wish to generate, and a new random value is produced on each execution. For example:
```python
import random

random_bits = random.getrandbits(128)
hash1 = "%032x" % random_bits
print(hash1)
```
Output (random):
2ad98ac316a093cd32f2b05296d98e3b
The above example generates a 128-bit random number and converts it into hexadecimal using %x string formatting. The format %032x renders the number as a hex string of 32 characters: if the hex representation has fewer than 32 characters, it is padded with zeros on the left. The same idea can be written with hex():
```python
import random

random_bits = hex(random.getrandbits(128))
print(random_bits)
hash1 = random_bits[2:].zfill(32)
print(hash1)
```
Output (random):
0x2d708d53a3e2ec2cb13b852aec465cef
2d708d53a3e2ec2cb13b852aec465cef
The hex() function returns the hexadecimal representation of the randomly generated 128-bit number. Its output starts with the prefix 0x, which we are not interested in, so we remove it by slicing from index 2. Because hex() does not pad its result, zfill(32) adds leading zeros when they are needed. In Python 2, hex() of a long integer also appended a trailing L, which had to be removed with the slice [2:-1]; in Python 3 that slice would cut off the last hex digit.
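To see why the zero padding matters, the sketch below formats a small fixed number in place of a random one. The value 255 is only a stand-in for a random number whose leading bits happen to be zero; the built-in format() call is shown as an equivalent spelling.

```python
small_number = 255  # stand-in for a random value with many leading zero bits

unpadded = "%x" % small_number
padded = "%032x" % small_number
also_padded = format(small_number, "032x")  # same result with the format() built-in

print(unpadded)             # ff
print(padded)
print(len(padded))          # 32
print(padded == also_padded)  # True
```

Without the padding, hashes generated this way would occasionally come out shorter than 32 characters.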
Important note (Are computer-generated random characters really random?)
In most cases, the random characters generated by computers (using modules like random and libraries like NumPy) are not genuinely random; they are pseudo-random. They are produced by a deterministic algorithm, so anyone who knows the generator's seed or internal state can predict the output.
In the world of cryptography, therefore, using general-purpose pseudo-random generators is not recommended; instead, values should come from a cryptographically secure source such as the operating system's random number generator, which Methods 2 and 3 rely on. Random values derived from atmospheric noise can also be generated on this site: https://www.random.org/.
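The predictability described in this note can be demonstrated with a short sketch. The seed value 42 is arbitrary, and random.SystemRandom is shown only as one way to draw from the operating system's source within the random module.

```python
import random

# Seeding the default generator with the same value replays the same "random" hash.
random.seed(42)
first = "%032x" % random.getrandbits(128)
random.seed(42)
second = "%032x" % random.getrandbits(128)
print(first == second)  # True

# SystemRandom draws from the operating system's randomness source instead,
# so it has no seed that could be reused to replay its output.
secure_rng = random.SystemRandom()
secure_hash = "%032x" % secure_rng.getrandbits(128)
print(secure_hash)
```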
Method 2: Using the secrets package
The secrets module was added in Python 3.6. It provides cryptographically strong random values. Its token functions take an optional nbytes argument; if it is omitted, a default is used, which is currently 32 (32 bytes * 8 bits = 256-bit tokens). Examples:
```python
import secrets

hash_hex = secrets.token_hex(nbytes=16)
print(hash_hex)
hash_url = secrets.token_urlsafe(16)
print(hash_url)
hash_bytes = secrets.token_bytes(128 // 8)
print(hash_bytes)
```
Output (random):
2bd8d5ea4040abae052804b905e6445a
XaiTTn1LBBv9JZ_0QQ5WOw
b'CJZ|\xa1\x948\xb2hVgG&\xb48\xc2'
A 128-bit long hash has 16 bytes, so we set nbytes to 16.
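The three functions return the same amount of randomness in different encodings, which is why their outputs differ in length. A small sketch, assuming the same 16-byte size as above:

```python
import secrets

nbytes = 16  # 16 bytes = 128 bits

hex_token = secrets.token_hex(nbytes)      # two hex characters per byte
url_token = secrets.token_urlsafe(nbytes)  # URL-safe Base64 text, padding removed
raw_token = secrets.token_bytes(nbytes)    # raw bytes, not printable text

print(len(hex_token))  # 32
print(len(url_token))  # 22
print(len(raw_token))  # 16
```

Only token_hex() produces output that looks like the hexadecimal hashes from Method 1.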
Method 3: Using os and binascii
The os module has a urandom() function that returns a string of random bytes suitable for cryptographic use. Once the random bytes are generated, they are converted into hex using binascii, as shown below.
```python
import os
import binascii

random_bytes = os.urandom(16)
print(random_bytes)
hash_value = binascii.hexlify(random_bytes).decode()
print(hash_value)
```
Output (random):
b'\x1d3"\xed\x93\xdb\xc3\xab\xaa\xbdb\xad_\xf93?'
1d3322ed93dbc3abaabd62ad5ff9333f
Note: the above code works in both Python 2 and Python 3. In Python 3, hexlify() returns a bytes object, so it is decoded to get a regular string; in Python 2, that step is not necessary.
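If Python 2 compatibility is not needed, the conversion can be done without binascii. The sketch below uses the bytes.hex() method (available since Python 3.5) and checks that it gives the same result as hexlify().

```python
import os
import binascii

random_bytes = os.urandom(16)

via_binascii = binascii.hexlify(random_bytes).decode()
via_method = random_bytes.hex()  # returns a str directly, no decode() needed

print(via_method)
print(via_binascii == via_method)  # True
```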
Method 4: Using string and random modules
From the previous examples, notice that the digests we generated contain letters and numbers. Given that fact, we can create random hashes by randomly picking letters and digits, as follows.
```python
import random, string

hash1 = ''.join(random.sample(string.ascii_letters + string.digits, 32))
print(hash1)
```
Output (random):
3xnroUdRq2kvtAaSWZE0GVsu19TbgBJK
In the example above, string.ascii_letters contains the letters a-z and A-Z, whereas string.digits has the digits 0 through 9. Together these make 62 characters. From these characters, we sample 32 without replacement, so no character appears twice in the result.
If you want hexadecimal output, you need to sample from the hexadecimal characters, that is, “abcdef” and “0123456789”. In this case, it is impossible to sample 32 characters without replacement from the 16 available in hex format. Therefore, we will use sampling with replacement, which the random.choices() function provides. See below.
```python
import random, string

hash1 = ''.join(random.choices("abcdef" + string.digits, k=32))
print(hash1)
```
Output (random):
a07c706a7d1d731de23c763889bfc56a
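Both random.sample() and random.choices() rely on the pseudo-random generator discussed in the note above. Where that matters, one possible variant picks each character with secrets.choice() instead. This is a sketch, and the name hex_alphabet is introduced here only for illustration.

```python
import secrets
import string

hex_alphabet = string.digits + "abcdef"  # the 16 hexadecimal characters

# Pick 32 characters independently (with replacement) from a secure source.
hash1 = "".join(secrets.choice(hex_alphabet) for _ in range(32))
print(hash1)
```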
In the next two methods, we will use Python to implement the two most commonly used hashing algorithms – MD5 and SHA256.
These algorithms were designed for cryptographic purposes, but MD5 has been found to suffer from extensive vulnerabilities, including practical collision attacks. For this reason, it is mainly used as a checksum for detecting accidental data corruption and not as a building block for security.
Method 5: Implementing MD5 hashing in Python
MD stands for Message-Digest algorithm. The number 5 in MD5 is its version number in the MD series. MD5 generates a 128-bit digest from an input message. In Python, this algorithm is available through the hashlib library. Here are some examples:
```python
import hashlib

# initialize string
str2hash = "secret word"
# compute the MD5 hash of the encoded string
result = hashlib.md5(str2hash.encode())
# get the hexadecimal value
print(result.hexdigest())

str2hash = "secret word "
result = hashlib.md5(str2hash.encode())
print(result.hexdigest())

str2hash = "this is my secret key"
result = hashlib.md5(str2hash.encode())
print(result.hexdigest())
```
Output (not random):
74a11ef33c5252edfa87c4eb8b566c2a
fb4de79cd3ef366102d02c7f465cb760
95d8ac53b544781d2e2f4f63567c940d
Unlike the previous methods, the MD5 function is deterministic: the same input always produces the same fixed-size output, and different inputs almost always produce different outputs (hash collisions can happen, but rarely by accident).
Notice that any slight change in the string input, like in “secret word” and “secret word ” (there’s a trailing white space), leads to a significant change in the hash digest. This is called the avalanche effect, a property designed into hashing algorithms to make their output unpredictable.
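The avalanche effect can be measured by counting how many of the 128 output bits differ between the two digests. The sketch below uses a hypothetical helper, md5_bits, that is not part of hashlib. For a well-mixed hash one would expect roughly half of the bits to change, though the exact count depends on the inputs.

```python
import hashlib

def md5_bits(text: str) -> int:
    """Hypothetical helper: the MD5 digest of text as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode()).digest(), "big")

# XOR leaves a 1 in every bit position where the two digests disagree.
diff = md5_bits("secret word") ^ md5_bits("secret word ")
changed_bits = bin(diff).count("1")

print(f"{changed_bits} of 128 bits differ")
```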
MD5 is not limited to strings. Because it operates on bytes, it can hash any data, and it is also used to check the integrity of files.
Two different files are extremely unlikely to have the same MD5 checksum by accident. We can therefore use MD5 to check whether two files have the same content. When downloading files, you can also use MD5 hash values to check if the file downloaded is corrupted or not. Here are some examples:
```python
import hashlib

def md5_of_file(path):
    # Read the file in binary mode and return its MD5 hex digest.
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Same content, different file names
print(md5_of_file('teamviewer_15.31.5_amd64.deb'))
print(md5_of_file('teamviewer_15.31.5_amd64_12.deb'))

# Same file name, different content
print(md5_of_file('/mnt/MountPt1/Upwork Tomasz/test1.txt'))
print(md5_of_file('/home/kiprono/Desktop/test1.txt'))
```
Output (not random):
7263347d62d5cfb37ca879f9ae3740a3
7263347d62d5cfb37ca879f9ae3740a3
3d9571beddd9881d4fa2f808f36b3118
3e7705498e8be60520841409ebc69bc1
Two things to note from the output: two files with the same content have the same checksum even if their names are different, and files that have the same name but differ in content have different MD5 digests.
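The file examples above load each file into memory in a single read. For large files such as installers, a common alternative is to feed the file to the hash object piece by piece. The sketch below assumes a hypothetical helper named file_md5 and a chunk size of 64 KiB, neither of which comes from the examples above, and it runs on two throwaway files so that it is self-contained.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=65536):
    """Hypothetical helper: MD5 hex digest of a file, read chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Two temporary files with identical content but different names.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")
    path_b = os.path.join(tmp, "b.bin")
    for path in (path_a, path_b):
        with open(path, "wb") as f:
            f.write(b"example content" * 10000)
    checksum_a = file_md5(path_a)
    checksum_b = file_md5(path_b)

print(checksum_a == checksum_b)  # True
```

Updating the hash object in chunks gives the same digest as hashing the whole content at once, while keeping memory use bounded by the chunk size.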
Method 6: SHA hashing in Python
SHA (Secure Hash Algorithm) is used just like MD5 but is known to be more secure. SHA is, however, generally more computationally demanding than MD5. Unlike 128-bit MD5, the SHA-2 family comes in several larger digest sizes, that is, 224, 256, 384, and 512 bits. In our case, we will use the SHA256 (256-bit) hashing algorithm in Python.
```python
import hashlib

# Hash a string (it must be encoded to bytes first)
a_string = 'password 124'
hashed_string = hashlib.sha256(a_string.encode('utf-8')).hexdigest()
print(hashed_string)

# Hash two files that have the same content but different names
for path in ('teamviewer_15.31.5_amd64.deb', 'teamviewer_15.31.5_amd64_12.deb'):
    with open(path, 'rb') as f:
        print(hashlib.sha256(f.read()).hexdigest())
```
Output (not random):
9900128e42aed22ea32a6dacecb6eb2361214514b15a88629e34926127955401
1c4850a710b7e3733785fe8d3f0a5a96d0f666e3901657feaa7f5d83c7e05eeb
1c4850a710b7e3733785fe8d3f0a5a96d0f666e3901657feaa7f5d83c7e05eeb
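The different digest sizes can be compared directly. The following sketch uses hashlib.new() to select each algorithm by name and records the digest size in bits; the dictionary name sizes is introduced here only for illustration.

```python
import hashlib

message = "password 124".encode("utf-8")

sizes = {}
for name in ("md5", "sha224", "sha256", "sha384", "sha512"):
    h = hashlib.new(name, message)
    sizes[name] = h.digest_size * 8  # digest_size is given in bytes
    # Each byte becomes two hex characters in hexdigest().
    print(name, sizes[name], len(h.hexdigest()))
```

A 256-bit SHA256 digest is therefore 64 hex characters long, twice the length of an MD5 digest.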
Conclusion
In this article, we have discussed different hashing methods in Python. We discussed methods relying on randomness (methods 1 through 4) and implemented the popular hashing algorithms – MD5 as method 5 and SHA as method 6.
MD5 and SHA can be used to verify the integrity of data and files, for example to check downloads for corruption or to detect changes in local files. Where deliberate tampering is a concern, SHA256 is the safer choice of the two.