LZ77 - algorithm - resolution_问答_开发者_运维开发者技术经验分享

I was reading about this algorithm... And I coded a class to compress, I have not coded the decompressing class yet...

What do you think about the code?

I think i've got a problema... My codification is : "position | length", but I believe this method will keep me on problems when ill decompressing, because I wont know if the numbers of positions and length are 2, 3, 4 digits... :S

some advice will be accepted... :D

Any suggestions will be accepted.

Main file:

      #include <iostream>
      #include "Compressor.h"

      int main() {
          Compressor c( "/home/facu/text.txt", 3);
          std::cout << c.get_TEXT_FILE() << std::endl;
          std::cout << c.get_TEXT_ENCONDED() << std::endl;
          c.save_file_encoded();
          return 0;
      }

header file :

#ifndef _Compressor_H_
#define _Compressor_H_

#include <utility>
#include <string>

typedef unsigned int T_UI;

class Compressor
{
    public:
    //Constructor
    Compressor( const std::string &PATH, const T_UI minbytes = 3 );

    /** GET BUFFERS **/
    std::string get_TEXT_FILE() const;
    std::string get_TEXT_ENCONDED() const;
    /** END GET BUFFERS **/

    void save_file_encoded();

    private:
    /** BUFFERS **/
    std::string TEXT_FILE; // contains the text from an archive
    std::string TEXT_ENCODED; // contains the text encoded
    std::string W_buffer; // contains the string to analyze
    std::string W_inspection; // contains the string where will search matches
    /** END BUFFERS **/

    T_UI size_of_minbytes;
    T_UI size_w_insp; // The size of window inspection
    T_UI actual_byte;

    std::pair< T_UI, T_UI> v_codes; // Values to code text

    // Utilitaries functions
    void change_size_insp(){ size_w_insp =  TEXT_FILE.length() ; }
    bool inspection_empty() const;
    std::string convert_pair() const;
    // Encode algorythm
    void lz77_encode();
};

#endif

implementation file :

#include <iostream>

#include <fstream>
using std::ifstream;
using std::ofstream;

#include <string>

#include <cstdlib>

#include <sstream>

#include "Compressor.h"

Compressor::Compressor(const std::string& PATH, const T_UI minbytes)
{
    std::string buffer = "";
    TEXT_FILE = "";
    ifstream input_text( PATH.c_str(), std::ios::in );
    if( !input_text )
    {
    std::cerr << "Can't open the text file";
    std::exit( 1 );
    }
    while( !input_text.eof() )
    {
    std::getline( input_text, buffer );
    TEXT_FILE += buffer;
    TEXT_FILE += "\n";
    buffer.clear();
    }
    input_text.close();
    change_size_insp();
    size_of_minbytes = minbytes;
    TEXT_ENCODED = "";
    W_buffer = "";
    W_inspection = "";
    v_codes.first = 0;
    v_codes.second = 0;
    actual_byte = 0;
    lz77_encode();
}

std::string Compressor::get_TEXT_FILE() const
{
    return TEXT_FILE;
}

std::string Compressor::get_TEXT_ENCONDED() const
{
    return TEXT_ENCODED;
}

bool Compressor::inspection_empty() const
{
    return ( size_w_insp != 0 );
}

std::string Compressor::convert_pair() const
{
    std::stringstream out;
    out << v_codes.first;
    out << "|";
    out << v_codes.second;
    return out.str();
}

void Compressor::save_file_encoded()
{
    std::string path("/home/facu/encoded.txt");
    ofstream out_txt( path.c_str(),std::ios开发者_如何转开发::out );
    out_txt << TEXT_ENCODED << "\n";
    out_txt.close();
}

void Compressor::lz77_encode()
{
    while( inspection_empty() )
    {
    W_buffer = TEXT_FILE.substr( actual_byte, 1);
    if( W_inspection.find( W_buffer ) == W_inspection.npos )
    {
        // Cant find any byte from buffer
        TEXT_ENCODED += W_buffer;
        W_inspection += W_buffer;
        W_buffer.clear();
        ++actual_byte;
        --size_w_insp;
    }
    else
    {
        // We founded any byte from buffer in inspection
        v_codes.first = W_inspection.find( W_buffer );
        v_codes.second = 1;
        while( W_inspection.find( W_buffer ) != W_inspection.npos )
        {
        ++actual_byte;
        --size_w_insp;
        v_codes.second++;
        W_inspection += TEXT_FILE[actual_byte - 1];
        W_buffer += TEXT_FILE[actual_byte];
        }    
        ++actual_byte;
        --size_w_insp;
        if( v_codes.second > size_of_minbytes )
        TEXT_ENCODED += convert_pair();
        else
        TEXT_ENCODED += W_buffer;
        W_buffer.clear();
    }
    }
}

Thank you!

Im coding the descompressing class :)

I generally recommend writing the decompressor first, and then writing the compressor to match it.

I recommend getting a compressor and corresponding decompressor working with a fixed-size copy items first, and only afterwards -- if necessary -- tweak them to produce/consume variable-size copy items.

Many LZ77-like algorithms use a fixed size in the compressed file to represent both the position and length; often one hex digit for length and 3 hex digits for position, a total of 2 bytes.

The "|" between the position and the copy-length is unnecessary.

If you are really trying to implement the original LZ77 algorithm, your compression algorithm needs to always emit the fixed-length copy-length (even when it is zero), the fixed-length position (when the length is zero, you might as well stick zero here also), and a fixed-length literal value.

Some LZ77-like file formats are divided into "items" that are either a fixed-length copy-length,position pair, or else one or more literal values. If you go that route, the compressor must somehow first tell the decompressor whether the upcoming item represents literal value(s) or a copy-length, position pair. One of many ways to do this is to reserving a special "0" position value that, rather than indicating some position in the output decompressed stream like all other position values, instead indicates the next few literal values in the input compressed file.

Nearly all LZ77-like algorithms store an "offset" backwards from the current location in the plaintext, rather than a "position" forwards from the beginning of the plaintext. For example, "1" represents the most recently-decoded plaintext byte, not the first-decoded plaintext byte.

How is it possible to for the decoder to tell where one integer ends, and the next one begins, when the compressed file contains a series of integers? There are 3 popular answers:

Use a fixed-length code, where you've set in stone at compile-time how long each integer will be. (simplest)
Use a variable-length code, and reserve a special symbol like "|" to indicate end-of-code.
Use a variable-length prefix code.
other approaches, such as range coding. (most complicated)

https://en.wikibooks.org/wiki/Data_Compression

Jacob Ziv and Abraham Lempel; A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, 23(3), pp.337-343, May 1977.