Many compression schemes are available across different platforms. This article focuses on extracting .gz, .tar.gz, and .tgz files using Python (we will explain these extensions shortly). We will also cover how to read files from an archive without extracting them to disk.
Before we do that, however, let’s briefly define gz compression and other related terms.
The .gz, .tar.gz, and .tgz Extensions
.gz is the file extension used by gzip (GNU zip), a compression format that is widely used on UNIX-like systems. A .gz file normally holds a single compressed file.
On the other hand, tape archive (tar) is an archival format used for UNIX-like systems. It is generally used with compression formats like gzip, xz or bzip2.
When a tar archive is compressed with gzip, the result is commonly called a “tarball”. Tarball files usually come with .tar.gz or .tgz file extensions.
In simple terms, a tar file is an archive containing multiple files put into one, whereas a gz file is a compressed file.
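To make the distinction concrete, here is a minimal sketch that creates one gzipped file and one tarball in a temporary directory and then inspects both. The helper name `looks_like_gzip` and the sample file names are illustrative assumptions, not part of the original article. The check relies on the two-byte signature (`1f 8b`) that starts every gzip stream and on the standard `tarfile.is_tarfile()` function.

```python
import gzip
import os
import tarfile
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file starts with the gzip signature (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # A plain text file to work with (sample name, chosen for illustration).
    text_path = os.path.join(tmp, "notes.txt")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("hello\n")

    # .gz: a single file, compressed.
    gz_path = text_path + ".gz"
    with open(text_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())

    # .tar.gz: an archive (possibly holding many files) that is then gzip-compressed.
    tgz_path = os.path.join(tmp, "bundle.tar.gz")
    with tarfile.open(tgz_path, "w:gz") as tar:
        tar.add(text_path, arcname="notes.txt")

    # Both files are gzip streams, but only the tarball contains a tar archive.
    gz_is_gzip = looks_like_gzip(gz_path)
    tgz_is_gzip = looks_like_gzip(tgz_path)
    gz_is_tar = tarfile.is_tarfile(gz_path)
    tgz_is_tar = tarfile.is_tarfile(tgz_path)

print("gzip signature:", gz_is_gzip, tgz_is_gzip)
print("tar archive:   ", gz_is_tar, tgz_is_tar)
```

Both files carry the gzip signature, which is why the file extension alone is usually what tells you whether to reach for `gzip` or `tarfile`.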
Note: All the code examples used in this post have been tested on Windows and Linux (Debian). Because they rely only on the Python standard library, they should also work on other platforms, including macOS.
Extracting .gz Files
This section discusses extracting single or multiple GZIP files in a folder.
Example 1: Unzipping a single GZIP file
The unzipping task, in this case, happens in two steps: first, open the GZIP file using the gzip module, and second, write the file’s contents into another file using shutil.
The following example shows how to extract a gzipped README markdown file.
```python
import gzip
import shutil

# Open the GZIP file in "read bytes" mode (rb).
with gzip.open("README.md.gz", "rb") as infile:
    # Write the extracted GZIP file content into
    # outfile using shutil.
    with open("README.md", "wb") as outfile:
        shutil.copyfileobj(infile, outfile)
```
The same code can be shortened by opening both files in a single with statement.
```python
import gzip
import shutil

with gzip.open("README.md.gz", "rb") as infile, open("README.md", "wb") as outfile:
    shutil.copyfileobj(infile, outfile)
```
Example 2: Extracting multiple GZIP files in a folder
This example shows how to extract all .gz files in a given directory. The idea is to loop through all files in the given folder, extracting each GZIP file as discussed in Example 1. Note that the extension check also matches .tar.gz files; those are decompressed to plain .tar archives, not unpacked.

```python
# Extract each gz file in a given directory
import gzip
import os
import shutil


def ExtractAllGZ(directory):
    # os.listdir(directory) returns the names of the files and directories inside it.
    for file in os.listdir(directory):
        # Filter the GZIP files using the file extension.
        if file.endswith(".gz"):
            # Join the folder with the file name to get a valid path to the GZIP file.
            gz_path = os.path.join(directory, file)
            print(file)
            # Strip only the trailing ".gz" to get the path of the extracted file.
            # (str.replace would also remove any ".gz" in the middle of the name.)
            extract_path = os.path.join(directory, file[: -len(".gz")])
            # Open GZIP with gzip and write the content into outfile using shutil.
            with gzip.open(gz_path, "rb") as infile, open(extract_path, "wb") as outfile:
                shutil.copyfileobj(infile, outfile)
            # Uncomment this line if you want to remove the GZIP file.
            # os.remove(gz_path)


# Call the function with the source directory.
ExtractAllGZ("./evernote2")
```
Extracting .tar.gz or .tgz Files
As said before, a tarball (.tar.gz or .tgz files) is an archive consisting of multiple files put together into one. The idea is to extract the archive to get a folder containing some files.
The tarfile module comes in handy in this case. The following syntax shows how to use the tool to read tarball archives.
```python
import tarfile

tar_file = tarfile.open("<path to .tar.gz or .tgz file>")
tar_file.extractall("<destination directory>")
tar_file.close()
```
If the <destination directory> is not provided in the code example above, the archive will be extracted into the current working directory.
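As a hedged variation on the syntax above, the sketch below wraps the extraction in a with statement so the archive is closed automatically, and it passes `filter="data"` to `extractall()` when the running Python exposes `tarfile.data_filter` (newer releases only). The helper name `extract_tarball` and the sample paths are assumptions made for illustration; the sketch builds its own small archive so it can run on its own.

```python
import os
import tarfile
import tempfile


def extract_tarball(archive_path, destination="."):
    """Extract a .tar.gz/.tgz archive into destination (hypothetical helper)."""
    os.makedirs(destination, exist_ok=True)
    # tarfile.open() detects the compression type automatically in read mode.
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can reject unsafe members, such as
            # absolute paths or paths that escape the destination folder.
            tar.extractall(destination, filter="data")
        else:
            tar.extractall(destination)


with tempfile.TemporaryDirectory() as tmp:
    # Build a small sample archive (names chosen for illustration only).
    sample = os.path.join(tmp, "hello.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("hello from the tarball\n")
    archive = os.path.join(tmp, "sample.tgz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(sample, arcname="docs/hello.txt")

    # Extract into a separate folder and list what ended up there.
    out_dir = os.path.join(tmp, "extracted")
    extract_tarball(archive, out_dir)
    extracted_files = sorted(
        os.path.relpath(os.path.join(root, name), out_dir)
        for root, _dirs, names in os.walk(out_dir)
        for name in names
    )

print(extracted_files)
```

The filter check matters mainly for archives from untrusted sources; for archives you created yourself, the plain `extractall()` call behaves the same way.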
Reading Compressed/Archived Files without Extracting Them to Disk
The examples above write the extracted files to disk. What if you don’t want to do that and only want to read the files inside the archive, without extracting them?
The modules we have discussed, gzip and tarfile, can also serve this purpose. Here is an example that reads a gzipped README.md file using the gzip.open() function.
```python
import gzip

# Read without extracting to disk.
with gzip.open("README.md.gz") as infile:
    for line in infile:
        print(line.strip().decode("utf-8"))
```
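If the compressed file is known to be text, one alternative is to open it in text mode so that each line arrives already decoded. The sketch below is a minimal illustration under that assumption: it writes a small gzipped file to a temporary directory first (the content is made up for the example) and then reads it back with mode `"rt"`.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "README.md.gz")

    # Create a small gzipped file so the sketch is self-contained.
    with gzip.open(gz_path, "wt", encoding="utf-8") as outfile:
        outfile.write("# Title\nSome text\n")

    # "rt" means "read text": lines are str objects, so no decode() call is needed.
    with gzip.open(gz_path, "rt", encoding="utf-8") as infile:
        lines = [line.rstrip("\n") for line in infile]

print(lines)
```

Binary mode (the default) remains the safer choice when the encoding of the file is unknown.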
If you want to read the contents of a tar file without untarring it, you can use the <tar>.extractfile(<member>) method from the tarfile module, as shown below.
```python
import tarfile

# Open the tar file as infile.
with tarfile.open("Python-3.11.2.tgz") as infile:
    # Loop through all the members of the tar file - dirs, subdirs and files.
    for member in infile.getmembers():
        # Use the dir() function to get all the attributes you can call on the member.
        # print(dir(member))
        # Check if the member is a file and if it ends with .txt
        if member.isfile() and member.path.endswith(".txt"):
            print(member.path)
            # Extract the content of the text file.
            infile2 = infile.extractfile(member)
            # Read through the lines.
            for line in infile2:
                print(line)
            # You can also read the contents of infile2 with
            # infile2.read()
```
Key functions in the code above:
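- <tar>.getmembers() returns a list of TarInfo objects, one for each directory, subdirectory, and file in <tar>,
- <tar>.extractfile(member) returns a file-like object for reading the member's bytes (without writing it to disk); for members that are not regular files or links, it returns None.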
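When the name of the wanted file is known in advance, looping over every member is not necessary. The following sketch, which assumes a made-up member name `docs/notes.txt` and builds its own sample archive in a temporary directory, looks the member up with `getmember()` and reads it through `extractfile()`.

```python
import io
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "sample.tar.gz")

    # Build a sample archive from in-memory data (illustrative content).
    payload = "first line\nsecond line\n".encode("utf-8")
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="docs/notes.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    with tarfile.open(archive) as tar:
        # getnames() lists member names as strings.
        names = tar.getnames()
        # getmember() raises KeyError if the name is not in the archive.
        member = tar.getmember("docs/notes.txt")
        # extractfile() gives a binary file-like object; nothing is written to disk.
        raw = tar.extractfile(member)
        notes = raw.read().decode("utf-8").splitlines()

print(names)
print(notes)
```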
Conclusion
This guide discussed extracting gzip-compressed .gz files and tarballs (files with .tar.gz or .tgz extensions). We showed that gzipped files can be extracted using the gzip and shutil modules, and that tarballs can be extracted using the tarfile module.
We also showed that you can read the contents of compressed or archived files without extracting them to disk.