
Zip compression question: how do they do it?


anthonyaudi


Has anyone ever seen the following?

A file that is, let's say, 50 MB unzipped, and when it is zipped it goes down to an exaggerated size like 1 MB.

The most extreme case I've ever seen was a 740 MB file compressed down to 980 KB. Once unzipped it came back to the full 740 MB.

The latest one I've seen was an image of a hard drive: 47 GB compressed down to 4.2 GB.

Both were compressed using 7-Zip.

Now I have tried every single option in 7-Zip to make files smaller and I cannot get anywhere near that. I usually get a very, very small difference in size: if my file is 60 MB and I compress it, I get it down to about 58 MB.

I eventually began to believe it had to do with the power of the computer, so I compressed the file using a very high-powered PC, which yielded exactly the same results.

How is it done? Is this some secret of IT that I am not good enough to know?

Any help would be GREATLY appreciated.

Thanks!



Try making a largish file filled with 00's.

Then compress it (still with the same 7-Zip).

Astounding how much it can be compressed, isn't it?
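
If you want to see the effect without even opening 7-Zip, here is a minimal sketch in Python; its standard lzma module uses the same LZMA family of algorithms that 7-Zip uses by default, so the behaviour is comparable even if the exact sizes differ (the 100 MB size is just an arbitrary choice for the test):

import lzma

data = b"\x00" * (100 * 1024 * 1024)       # 100 MB of 00's
packed = lzma.compress(data, preset=6)     # Python's default preset

print(f"original:   {len(data):>12,} bytes")
print(f"compressed: {len(packed):>12,} bytes")
# A run of identical bytes typically compresses into the tens-of-KB range,
# i.e. a ratio well above 99.9%.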

Three rules of thumb:

  1. the compression ratio depends on the contents of the UNcompressed source
  2. the compression ratio depends on how "homogeneous" the UNcompressed source is
  3. each compression tool may have a particularly efficient algorithm for a specific source file format

A "generic use" tool tends to be more or less "symmetric" in computing time, i.e. it must compress in a "reasonable" time and still be able to uncompress in a "reasonable" time.

As a general rule, the more you compress something, the more time it takes (and the more time it takes to uncompress it).

In theory, highly asymmetric compression algorithms can be devised that take ages (hours/days) to compress a source, needing an extremely powerful machine and plenty of resources, and yet can be uncompressed in a "reasonable" time on an average machine.
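
To put some rough numbers on the symmetric vs. asymmetric idea, here is a small sketch, again with Python's lzma module; this is only a stand-in for the general tendency, not for KGB Archiver or PAQ, and the synthetic input means real-world timings will differ:

import lzma
import time

# Build ~20 MB of mildly repetitive "log-like" text to compress.
data = "".join(f"2016-01-01 00:00:{i % 60:02d} INFO request id={i} status=OK\n"
               for i in range(400_000)).encode()

for preset in (1, 9):
    t0 = time.perf_counter()
    packed = lzma.compress(data, preset=preset)
    t1 = time.perf_counter()
    lzma.decompress(packed)
    t2 = time.perf_counter()
    print(f"preset {preset}: {len(packed):>10,} bytes, "
          f"compress {t1 - t0:5.2f} s, decompress {t2 - t1:5.2f} s")

Typically the higher preset spends several times longer compressing while the decompression time barely moves - the same trade-off, pushed to the extreme, is what the "ages to compress, reasonable time to uncompress" tools rely on.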

As an example, the KGB Archiver had (amidst the MANY pieces of senseless FUD spread about it) a period of notoriety:

http://en.wikipedia.org/wiki/KGB_Archiver

http://sourceforge.net/projects/kgbarchiver/

(current results are however better than that)

And - as a side note - this is the reason why compression tests are usually done on a given set of known files:

http://www.maximumcompression.com/data/files/

this may - if the archiver developer is trying to cheat - lead to tools written and optimized explicitly for that given set of files.

The factors involved in judging a compression tool on the same fileset are three:

compression time

compression ratio

uncompression time

to which everyone can give an appropriate "weight", resulting in an "efficiency". The formula used on the mentioned site:

Scoring system: The program yielding the lowest compressed size is considered the best program. The most efficient (read: useful) program is calculated by multiplying the compression + decompression time (in seconds) it took to produce the archive with the power of the archive size divided by the lowest measured archive size. The lower the score the better. The basic idea is that compressor X has the same efficiency as compressor Y if X can compress twice as fast as Y and the resulting archive size of X is 10% larger than the size of Y. (Special thanks to Uwe Herklotz to get this formula right)

is a good way to judge generic compressors.
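
Read literally, that description boils down to something like the sketch below; note that the exponent is my own derivation from the "twice as fast vs. 10% larger" equivalence stated in the quote (1.1 raised to it equals 2), and the sizes and times are invented for illustration:

import math

def efficiency_score(total_time_s, archive_size, lowest_size):
    # Lower is better: (compression + decompression) time, scaled by how far
    # this archive is from the smallest one measured.
    exponent = math.log(2) / math.log(1.1)          # about 7.27
    return total_time_s * (archive_size / lowest_size) ** exponent

lowest = 100_000_000                                         # smallest archive in the test, in bytes
print(efficiency_score(30.0, int(lowest * 1.10), lowest))    # twice as fast, 10% larger
print(efficiency_score(60.0, lowest, lowest))                # slower, but smallest
# Both print ~60, i.e. the two hypothetical compressors come out "equally efficient".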

To make a practical example, if I had to choose:

http://www.maximumcompression.com/data/summary_mf.php

I would use NanoZip - getting a 74.4% compression ratio, compressing 316 MB in 24.1 s and uncompressing in 14.1 s - or FreeArc, rather than have PAQ8 break the 80% compression ratio "wall" but do so in several tens of thousands of seconds, both compressing and uncompressing.

A good example of a highly compressible source file is a log; see:

http://www.maximumcompression.com/data/log.php

around 98% is achieved by many compressors.

A good example of a file that is "difficult" to compress is a JPEG (which is already compressed):

http://www.maximumcompression.com/data/jpg.php

but here PAQ8 takes its revenge, even over a JPEG-specific compressor such as PackJPG.
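
The same contrast is easy to reproduce at home; in the sketch below random bytes stand in for already-compressed data (a real JPEG is not literally random, but to a general-purpose compressor it behaves much the same), while a repeated line stands in for the log:

import lzma
import os

log_like = b"GET /index.html HTTP/1.1 200 1534\n" * 300_000   # ~10 MB, very repetitive
random_like = os.urandom(10 * 1024 * 1024)                     # ~10 MB, already "dense"

for name, blob in (("log-like", log_like), ("random / pre-compressed", random_like)):
    packed = lzma.compress(blob)
    saved = 100 * (1 - len(packed) / len(blob))
    print(f"{name:>25}: {len(blob):>11,} -> {len(packed):>11,} bytes ({saved:.1f}% saved)")

The first line typically reports a saving in the high 90s, the second a saving close to zero (or even a slight growth).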

jaclaz


OK, so I have understood what your post means. I tried making a file of 00's and compressing it.

I made a 140 MB file and used 7-Zip with maximum compression to shrink it down to 20.4 KB.

What confuses me is how people can get complete image files (particularly the one I have) from 47 GB down to 4 GB with 7-Zip. Once it is unzipped it is a full image of a drive.

I know it is 7-Zip because the extension is .7z.

I am currently using PAQ8 to test the compression theory.

EDIT: PAQ8 using default settings came to 51 KB.

I can understand that compression works better with more homogeneous files. What I don't understand is how sometimes there are zips whose compression makes no sense to me. My online backups are not even compressed this much.

Edited by anthonyaudi

An image of an "average" drive (I actually presume you are meaning "disk", but when it comes to this compression topic the disambiguation is irrelevant) is ACTUALLY made mostly of 00's.

A "filled-up-to-the-brim" disk (or partition/drive) image is obviously not compressible to the same extent.

A 3/4-empty (disk or drive) image will actually consist of 00's for 3/4 of its size.

So, if your 47 GB image compresses to around 4 GB, it very likely contains between 8 and 16 GB of (not already compressed) files - i.e., once you take the 00's out of the equation (they compress into KBs, not GBs), you have a compression ratio between 50% and 75% on the rest.
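
For what it's worth, that estimate is just this back-of-the-envelope calculation; the 50% and 75% figures are assumptions about how well the non-empty part of the image compresses, nothing more:

image_size_gb = 47.0
archive_size_gb = 4.2

for saved in (0.50, 0.75):                   # assumed ratio on the non-zero data
    real_data_gb = archive_size_gb / (1 - saved)
    zeros_gb = image_size_gb - real_data_gb
    print(f"at {saved:.0%} saved: ~{real_data_gb:.1f} GB of files, ~{zeros_gb:.1f} GB of 00's")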

jaclaz

