
Text Compression

Text makes up a large part of the files that users of digital technology create: Word or PDF documents, emails, mobile phone text messages (SMS) and web pages, for example. Being able to compress text for storage or transmission is therefore extremely important. Fortunately, files that contain mainly text can be compressed significantly.

As with image compression, many algorithms (methods) have been devised to do this. One important point about text compression is that it must use a lossless method: the method must not discard any data when it compresses. If it did, the data would be incomplete when uncompressed. To quote the Wikipedia article on data compression:

 

Lossless (compression)

The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, therefore compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel–Ziv–Welch) is used in GIF images. Also noteworthy are the LZR (LZ–Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). A current LZ-based coding scheme that performs well is LZX, used in Microsoft's CAB format.


Points to note from this quote: 
  • The DEFLATE method is mentioned in the web browser header towards the top of the compression introduction page. The browser uses it because it is fast at decompressing data, so it is a suitable method as browsers normally receive far more data than they transmit.
  • LZ methods (there is more than one) are amongst the most popular and are used by popular zipping programs, such as 7z and Zip.
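
The table-based model described in the quote can be sketched in a few lines of Python. This is a minimal illustration of LZW, one member of the LZ family, and not the code used by any particular zip program. The function names lzw_compress and lzw_decompress are made up for this example, and the output is left as a plain list of table indices rather than being packed into bits or Huffman encoded.

```python
def lzw_compress(text: str) -> list:
    """Replace repeated strings with indices into a table built on the fly."""
    # The table starts with one entry for every possible byte value.
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for value in text.encode("utf-8"):
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the match while it is already in the table.
            current = candidate
        else:
            # Emit the longest known string, then learn the new, longer one.
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list) -> str:
    """Rebuild the same table from the codes alone and recover the text."""
    if not codes:
        return ""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(output).decode("utf-8")


sample = "the rain in spain stays mainly in the plain " * 20
codes = lzw_compress(sample)
restored = lzw_decompress(codes)
print(len(sample), "characters became", len(codes), "codes")
print("lossless:", restored == sample)
```

Note that the table is never stored with the compressed data: the decompressor regenerates it from the codes it has already read, which is what "generated dynamically from earlier data in the input" means in practice.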

The images below show compression settings for the 7z program. The program was able to take the XHTML mark-up that made up the Wikipedia data compression page and compress it from 116KB, as a text file, to a 19KB 7z archive. That is a compression ratio of about 6.11 (116/19 ≈ 6.11); put another way, the compressed version is about 16.4% (19/116 x 100 ≈ 16.4) of the size of the original. See below.

[Images: uncompressed webpage file size; 7z compressed file size]
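
The arithmetic can be checked with a short Python sketch that uses the 116KB and 19KB figures quoted above. The helper names compression_ratio and percent_of_original are made up for this example. The second half compresses a made-up, highly repetitive sample with Python's lzma module (the same method family 7z uses by default), so its ratio will differ from that of a real web page.

```python
import lzma


def compression_ratio(original_size, compressed_size):
    """How many times smaller the compressed file is."""
    return original_size / compressed_size


def percent_of_original(original_size, compressed_size):
    """Compressed size as a percentage of the original size."""
    return compressed_size / original_size * 100


# Figures quoted in the text: 116KB of XHTML became a 19KB archive.
ratio = compression_ratio(116, 19)
percent = percent_of_original(116, 19)
print(f"ratio {ratio:.2f}, {percent:.1f}% of original")
# prints: ratio 6.11, 16.4% of original

# The same measurement on an invented sample of repetitive mark-up.
sample = ("<p>Text compresses well because words and tags repeat.</p>\n" * 500).encode("utf-8")
packed = lzma.compress(sample)
assert lzma.decompress(packed) == sample  # lossless: nothing was discarded
print(f"sample ratio {compression_ratio(len(sample), len(packed)):.1f}")
```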


Note that 7z uses the LZMA compression method by default, but other methods can be selected. Files can be saved in various formats, such as the popular zip format, and the compression level can be specified. Lastly, an important point: zip programs can be used to 'zip up' any type of data. For example, the XHTML of the Wikipedia page mentioned above, copied and saved as a Word document, measures 221KB (because of the extra formatting information added by Word); compressed, it measures 31.2KB. The images below show some of the compression options that the 7z program offers. Note that, at bottom right, the compressed file can be password protected using AES-256 encryption, which is considered practically infeasible to break by brute force when a strong password is used.

[Images: Formats; Levels; Methods; Results]
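
The effect of choosing a method and a level can be explored with a small Python sketch. It uses three codecs from Python's standard library as stand-ins: zlib (DEFLATE), bz2 and lzma. These calls produce raw compressed streams, not .7z or .zip archives, and the level and preset numbers are Python's, which may not correspond one-to-one with the levels shown in the 7z dialog. The function name compressed_sizes and the sample text are made up for this example.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Compress the same bytes with several methods and levels."""
    return {
        "DEFLATE (zlib), level 1": len(zlib.compress(data, 1)),
        "DEFLATE (zlib), level 9": len(zlib.compress(data, 9)),
        "BZip2, level 9": len(bz2.compress(data, 9)),
        "LZMA, preset 0": len(lzma.compress(data, preset=0)),
        "LZMA, preset 9": len(lzma.compress(data, preset=9)),
    }


# Invented sample: list items that share most of their text.
document = "\n".join(
    f"<li>Item {i}: text compression must be lossless</li>" for i in range(2000)
).encode("utf-8")

sizes = compressed_sizes(document)
print(f"{'original':<26}{len(document):>8} bytes")
for method, size in sizes.items():
    print(f"{method:<26}{size:>8} bytes")
```

Higher levels generally trade compression time for a smaller result, but the gain depends on the data, so it is worth measuring on the files that actually need compressing.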



CS Unplugged activity and resources

  • The following link points to a CS Unplugged activity that shows how the LZ method is used to compress text: text compression activity
  • The following link points to a CS Unplugged web page that contains valuable resources (at the bottom of the page): text compression web page


