[GSOC] HAMMER2 compression feature week1 report

Sat Jun 22 12:14:06 PDT 2013

Hello everyone,

Here is my report on the progress made during the first week of GSOC's
coding period. I'd be happy to receive your feedback and criticism.

It turned out that I considerably overestimated the time needed for the
prototype application (and possibly underestimated the time needed for
HAMMER2 itself). In any case, the prototype part is done for now. The
algorithm integrated into the prototype application is LZ4, as suggested
by Freddie.

This algorithm turned out to be very convenient to use, and I'm also very
satisfied with its performance. Using the prototype application I ran some
tests; in summary, the results are as follows:

1. Most of the files that we use, like images, documents, audio, etc., are
already compressed to some degree. So this algorithm can't really compress
those in most cases, since it works only on small amounts of data in our
case. At best, about 1-2 blocks out of every 100 are sufficiently
compressed.

2. Plain text files that contain text such as books or mailing lists are
generally not compressible at all with this approach.

3. Some uncompressed files compress very well with this approach. For
example, some TIFF images that I tested had all of their blocks
sufficiently compressed, sometimes spectacularly, such as going from 64KB
to just 1404 bytes (or a 2KB physical block in the actual file system).
That doesn't happen with all files, though, and it seems never to happen
with uncompressed audio (.wav).

4. Another type of file that compresses really well with this approach is
source code. For example, from DragonFly's own sources:
/usr/src/sbin/hammer/cmd_cleanup.c was compressed from 31502 bytes to 13977
bytes. Another example: /usr/src/sbin/md5/md5.c was compressed from 12950
bytes to 7867 bytes.

5. Because some interest was expressed in log files, I tested those too.
They also compress very, very well with this approach. I tested some logs
from my VPS, such as the access log and the error log, and they had all of
their blocks sufficiently compressed, most of them to well below 10000
bytes. I assume that dedicated compression of a whole file would be even
more efficient, but using just file system compression is also very
beneficial.
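
To make the 64KB to 1404 bytes example from point 3 concrete, here is a
minimal sketch in C of how a compressed size could map to a physical block
size. It assumes power-of-two physical block sizes with a 1KB minimum, and
it treats a block as "sufficiently compressed" when it lands in a smaller
physical block than the 64KB logical one. The names phys_block_size,
sufficiently_compressed and both constants are made up for illustration;
they are not taken from the prototype or from HAMMER2.

```c
#include <assert.h>
#include <stddef.h>

#define LOGICAL_BLOCK_SIZE  65536u /* 64KB logical block, as in the report */
#define MIN_PHYS_BLOCK_SIZE 1024u  /* assumption: smallest physical block */

/*
 * Round a compressed size up to the next power-of-two physical block
 * size (assumption: physical blocks come in power-of-two sizes).
 */
static size_t
phys_block_size(size_t compressed_size)
{
	size_t size = MIN_PHYS_BLOCK_SIZE;

	assert(compressed_size > 0);
	while (size < compressed_size)
		size <<= 1;
	return size;
}

/*
 * Hypothetical criterion: a block is "sufficiently compressed" if it
 * fits in a physical block smaller than the logical block.
 */
static int
sufficiently_compressed(size_t compressed_size)
{
	return phys_block_size(compressed_size) < LOGICAL_BLOCK_SIZE;
}
```

Under these assumptions, phys_block_size(1404) gives 2048, which matches
the 2KB figure mentioned in point 3.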

So, the conclusion is that for many of the file types that we use, like
images, .pdfs and documents, this type of compression wouldn't make much
of a difference, even though some parts of those files can be compressed.
However, there is a huge difference for files like source code, logs and
some uncompressed formats. Generally, any file that has an obvious pattern
or many repeated elements compresses well with this approach. Sadly, in my
tests it was not possible to compress things like books or mailing list
archives with it.

So, right now I consider the prototype part done. I'll probably return to
it later to test other algorithms (DEFLATE and LZO), but for now I'll move
on to HAMMER2. The reason is that I'm very satisfied with the performance
of LZ4 and it's very unlikely that other algorithms would outperform it
significantly. I'd like to implement at least one algorithm in HAMMER2
first; once that is done, it will make more sense to consider other
algorithms and test them.

Right now I'm working on the hammer2 utility to implement a new command
that sets the compression mode on a specified directory. Once this is
done, I'll implement LZ4 in HAMMER2 and test it under real-life conditions.

Then, if there is time, I will work on DEFLATE and LZO and ultimately let
the user choose which one to use.

If you wish to check out my prototype application, you can get it from my
repository, branch ‘prototype’ [1]. Alternatively, you can download it
from my VPS [2], if you prefer it that way. The code is a bit rough, but
hopefully it is understandable.

To compile the application, just run 'make'. Then, to perform a test on a
file, run './prototype filename'. All the code in the “lz4” directory was
written by Yann Collet, the author of the LZ4 implementation. I didn't
modify anything in it.

There is also “zero_blocks.c”, a small application that simply creates a
file with several blocks that contain only zeros. It is used to check
zero-block detection for algorithm #1 (zero-checking).
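
For readers who haven't seen it, zero-checking can be as simple as the
sketch below. This is only an illustration of the idea, not the code from
the prototype; the function name is_zero_block and its signature are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if every byte in the block is zero, 0 otherwise.
 * Stops at the first non-zero byte, so blocks with real data are
 * usually rejected after reading only a few bytes.
 */
static int
is_zero_block(const unsigned char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < len; i++) {
		if (buf[i] != 0)
			return 0;
	}
	return 1;
}
```

A block that passes this check doesn't need to be handed to the
compression algorithm at all.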

So, that's all I can report for now.

Thank you for your attention!

Daniel

[1] git://leaf.dragonflybsd.org/~iostream/dragonfly.git
[2] http://project5555.com/dragonflybsd/prototype.tar.gz