Before we talk about compressing files in Linux, let’s first talk about what compression is and, more importantly, why it is used.
Compression is used to encode information so that it takes up less space (space being measured in computer memory in this case). Decompression is the reverse process – reading the compressed information and reconstructing the original.
Let’s say I have 10 letters A in a row:
AAAAAAAAAA
and let’s say I need 1 memory unit to represent the letter A. I also need 1 memory unit to represent any other letter or digit. That means I need 10 memory units in total.
However, I could think: “OK, how can I transfer the same information using fewer memory units?”. One way is to send the information over like this:
A10
provided there is an agreement between me (the sender) and the receiver that A10 means “A repeated 10 times”. That way (assuming, as I stated above, that every letter and digit takes up 1 memory unit) I have represented ten consecutive As with only 3 memory units. The compression ratio – the ratio between the uncompressed and compressed sizes – is 10/3.
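The scheme described above is known as run-length encoding. Here is a minimal sketch of it in Python; the function names are my own, chosen for illustration:

```python
import re

def rle_encode(text: str) -> str:
    """Encode runs of repeated characters as <char><count>."""
    if not text:
        return ""
    out = []
    run_char, run_len = text[0], 1
    for ch in text[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(f"{run_char}{run_len}")
            run_char, run_len = ch, 1
    out.append(f"{run_char}{run_len}")  # flush the final run
    return "".join(out)

def rle_decode(encoded: str) -> str:
    """Reverse the encoding: expand each <char><count> pair."""
    return "".join(ch * int(n) for ch, n in re.findall(r"(\D)(\d+)", encoded))

original = "AAAAAAAAAA"            # 10 memory units
encoded = rle_encode(original)     # "A10" – only 3 memory units
print(len(original) / len(encoded))        # compression ratio: 10/3
print(rle_decode(encoded) == original)     # True – no information lost
```

Note that run-length encoding only pays off when the input actually contains long runs; a string like ABAB would encode as A1B1A1B1 and end up larger than the original.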
Data compression can be lossless or lossy – lossless means that no information is lost (as in our example), while lossy compression means that we lose some information during compression, but can recover a close approximation of the original when decompressing. (Shotts, 2019)
Those are the very basics of compression and why it is used. There is an entire field called Information Theory that deals with compression. There is also the Hutter Prize, which rewards whoever can advance the state of the art in compression (“Hutter Prize,” n.d.). The compression algorithms used today are more elaborate than the basic one I explained above, of course, but you get the idea.