Dec 062011
 

compression algorithms and methods have been on my mind lately. i’m still not the best at c++, so please excuse my (horrible) php code- i know my way around php better than c++ at the moment.

i started out with another project i’m working on- a c++-based video streaming server, for netflix-like movie watching of whatever movies are on your hard drive. i was trying to figure out a good way to save bandwidth on the audio/video transfer. i decided to mess around with some base64-based compression. starting out with a directory structure like this:

[slowbro@alta cpp-stream-server]$ ll
total 4
-rwxrwxr-x 1 slowbro slowbro  3016 Dec  6 14:34 compress.php
-rwxrwxr-x 1 slowbro slowbro  9248 Dec  4 00:58 server

my goal is to compress the ‘server’ executable file, which is about 9kb. ‘compress.php’ is my program. first, i made a zip for comparison purposes, and a base-64 encoded version of ‘server':

[slowbro@alta cpp-stream-server]$ cat server | base64 -w0 > server.b64
[slowbro@alta cpp-stream-server]$ zip server.zip server
  adding: server (deflated 65%)
[slowbro@alta cpp-stream-server]$ ll
total 44
-rwxrwxr-x 1 slowbro slowbro  3016 Dec  6 14:34 compress.php
-rwxrwxr-x 1 slowbro slowbro  9248 Dec  4 00:58 server
-rw-rw-r-- 1 slowbro slowbro 12334 Dec  5 13:42 server.b64
-rw-rw-r-- 1 slowbro slowbro  3358 Dec  6 14:45 server.zip

so 65% is the number to beat- or match. approximately 3kb.

starting with the base64 file, i used the condense-repeating method to slim it down. this method replaces repeating characters with a 3-5 character representation. for example, AAAAAAAABBBBBB3FFF5kjdAAAA3v is replaced with A-8B-6|3FFF5kjdAAAA3v this is based on the following rules:

  • the repetition spans more than three characters
  • if the immediate next character after a replacement is a number, append a pipe (|) symbol to the end
  • the replacement must span less space than the original. for example ‘AAAA3′ was not replaced with ‘A-3|3′ because it uses the same amount of space

using these basic rules, i started out with the pass 1 code:

<?php
$rh = fopen("server.b64", 'r');
$wh1 = fopen("server.compress1", 'w+');
$wh2 = fopen("server.compress2", 'w+');
$curpos = 0;
$orig = filesize("server");
$end = filesize("server.b64");

#placeholder

echo "pass 1: condense repeating chars\n";
while($curpos < $end){
    $curpos += 1024;
    $data = trim(fread($rh, 1024));
    $dend = end(str_split($data));
    while($dend == ($d = trim(fread($rh, 1)))){
        $data .= $d;
        $curpos++;
    }
    fseek($rh, $curpos);
    echo "\tread ".strlen($data)." bytes ({$curpos}b read)\n";
    //condense repeating
    if(preg_match_all('{(.)\1{3,}}', $data, &$matches, PREG_OFFSET_CAPTURE)){
        $data = str_split($data);
        $osshift = 0;
        foreach($matches['0'] as $m){
            $l = strlen($m['0']);
            $r = $m['0'][0].'-'.$l.(is_numeric(@$data[$m['1']+$l+$osshift])?'|':'');
            if($r > $l)
                continue;
            array_splice($data, $m['1']-$osshift, $l, str_split($r));
            $osshift += ($l-strlen($r));
        }
        $data = implode("", $data);
    }
    fwrite($wh1, $data);
}

$end = $fsz = filesize("server.compress1");
echo "pass 1 completed: ".round((1-($fsz/$orig))*100,1)."% deflated ({$orig}b -> {$fsz}b)\n\n";

the output was okay- it achieved about 14% compression (based on the original, 9kb file- not the 12kb b64 file):

[slowbro@alta cpp-stream-server]$ ./compress.php
pass 1: condense repeating chars
        read 1036 bytes (1036b read)
        read 1024 bytes (2060b read)
        read 1025 bytes (3085b read)
        read 1024 bytes (4109b read)
        read 1097 bytes (5206b read)
        read 1024 bytes (6230b read)
        read 1025 bytes (7255b read)
        read 1024 bytes (8279b read)
        read 1040 bytes (9319b read)
        read 1024 bytes (10343b read)
        read 1024 bytes (11367b read)
        read 966 bytes (12391b read)
pass 1 completed: 14% deflated (9248b -> 7950b)

the next method i wanted to try was finding similar patterns. for simplicity in testing, i added this argv code where #placeholder is above:

//ARGV
$KEYWORD_LEN = 8;
$MAX_NUM_KW = 50;
$MIN_KW_NUM = 2;
array_shift($argv);
while(count($argv) != 0){
    switch(($p = array_shift($argv))){
        case '-l':
            $KEYWORD_LEN = array_shift($argv);
        break;
        case '-n':
            $MAX_NUM_KW = array_shift($argv);
        break;
        case '-k':
            $MIN_KW_NUM = array_shift($argv);
        break;
        default:
            echo "Unknown Parameter: $p\n";
            exit;
    }
}

i then added the pass 2 code. this code is a little.. messy. too many foreaches, but it’s not the final product and was just for testing. what it does is read from pass 1’s compressed b64, $KEYWORD_LEN bytes at a time, storing the number of times each of them occur in an associative array. it then iterates through that array, removing any that have less than $MIN_KW_NUM occurrences. it iterates through the array once more, adding up to $MAX_NUM_KW keywords to the translation table. finally, it writes the final file, $KEYWORD_LEN bytes at a time, replacing patterns with translations as it goes.

the code:

$curpos = 0;
rewind($wh1);

echo "pass 2: similar patterns\n";
$occ = array();
while($curpos < $end){
    $data = trim(fread($wh1, $KEYWORD_LEN));
    $curpos += strlen($data);
    if(isset($occ[$data]))
        $occ[$data] += 1;
    else
        $occ[$data] = 1;
}
$oc = count($occ);
echo "\t$oc keywords found\n";
arsort($occ);
foreach($occ as $k=>$v){
    if($v <= $MIN_KW_NUM)
        unset($occ[$k]);
}
echo "\t".($oc-count($occ))." keywords dropped\n";
echo "\tbuild translation table..";
$i = 256;
$stop = $i+$MAX_NUM_KW;
foreach($occ as $k=>&$v){
    if($i == $stop){
        unset($occ[$k]);
        continue;
    }
    fwrite($wh2, "$k:".chr($i).",");
    $v = $i;
    $i++;
}
fwrite($wh2, "\n");
echo " done\n";
echo "\ttranslating ".count($occ)." keywords..";
rewind($wh1);
$curpos = 0;
while($curpos < $end){
    $data = trim(fread($wh1, $KEYWORD_LEN));
    $curpos += strlen($data);
    if(isset($occ[$data]))
        fwrite($wh2, chr($occ[$data]));
    else
        fwrite($wh2, $data);
}
echo " done\n";
$fsz = filesize("server.compress2");
echo "pass 2 completed: ".round((1-($fsz/$orig))*100,1)."% deflated ({$orig}b -> {$fsz}b)\n\n";
fclose($wh1);
fclose($wh2);
fclose($rh);
?>

this produced pretty crappy results with the default parameters:

[slowbro@alta cpp-stream-server]$ ./compress.php
pass 1: condense repeating chars
        read 1036 bytes (1036b read)
        read 1024 bytes (2060b read)
        read 1025 bytes (3085b read)
        read 1024 bytes (4109b read)
        read 1097 bytes (5206b read)
        read 1024 bytes (6230b read)
        read 1025 bytes (7255b read)
        read 1024 bytes (8279b read)
        read 1040 bytes (9319b read)
        read 1024 bytes (10343b read)
        read 1024 bytes (11367b read)
        read 966 bytes (12391b read)
pass 1 completed: 14% deflated (9248b -> 7950b)

pass 2: similar patterns
        920 keywords found
        910 keywords dropped
        build translation table.. done
        translating 10 keywords.. done
pass 2 completed: 16.2% deflated (9248b -> 7753b)

…so it was testing time. i exited the script to only output the percentage times 10 (so 16.7 percent became 167), since bash doesn’t play nice with decimals. i then ran the following for bash code:

best=0;
for l in `seq 1 12`;do
  for k in `seq 2 5`; do
    for n in `seq 30 300`; do
      echo -en "$l\t$k\t$n\t";
      p=`./compress.php -l $l -k $k -n $n`;
      echo $p/10 | bc -l;
      if [[ "$p" -gt "$best" ]];then
        besttext="best is l:$l, k:$k, n:$n, p:$p";
        best=$p;
      fi;
    done;
  done;
done;
echo $besttext

after letting that chug for a while, i got:

best is l:2, k:2, n:137, p:311

meaning the best way to run it, at least for this file, was ./compress.php -l 2 -k 2 -n 137, resulting in a 31.1% compression. not bad for just the first two passes, i think. for time comparison, it takes about 0.076s to compress it:

[slowbro@alta cpp-stream-server]$ time ./compress.php -l 2 -k 2 -n 137
311
real    0m0.076s
user    0m0.061s
sys     0m0.015s

zip is a bit more optimized, lol:

[slowbro@alta cpp-stream-server]$ time zip server.zip server
  adding: server (deflated 65%)

real    0m0.002s
user    0m0.001s
sys     0m0.001s

that’s it for now. i’ll continue thinking of ways to compress it, and make a decompression tool to make sure it’s actually recoverable :)