reed-solomon: parity data contains information beyond R-S symbol size

Question

james224 opened this issue 8 years ago

Hi @pjkundert. I am following the code given below. Encoding a file works well with any codec whose symbols are larger than 8 bits, but when decoding I get this error: "Error after 25 bytes: reed-solomon: parity data contains information beyond R-S symbol size". Here is my code:

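#include <cstdint>   // uint16_t
#include <cstring>   // strcmp
#include <string>
#include <iostream>
#include <fstream>
#include <memory>
#include <stdexcept> // std::logic_error
#include <vector>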

#include "ezpwd/rs"
#include "ezpwd/output"
//#define RSDECODE
constexpr size_t paritysize = 5; 
constexpr size_t chunksize = 20; 
ezpwd::RS<511, 511 - paritysize> rs;

int main(int argc, const char **argv) {
    // Set 'inp' to std::cin, or a newly opened file (which will be released at end of scope)
    std::unique_ptr<std::ifstream> inpfile;

    if (argc > 1 and strcmp(argv[1], "-")) {
        inpfile.reset(new std::ifstream(argv[1], std::ifstream::binary));
    }
    std::istream & inp(inpfile ? *inpfile : std::cin);

    // Set 'out' to std::cout, or a newly removed and opened file (released at end of scope)
    std::unique_ptr<std::ofstream> outfile;
    if (argc > 2 and strcmp(argv[2], "-")) {
        std::remove(argv[2]);
        outfile.reset(new std::ofstream(argv[2], std::ifstream::binary));
    }
    std::ostream & out(outfile ? *outfile : std::cout);
    // Read, R-S encode/decode and write chunks
    std::string chunk;
    size_t inptotal = 0;
    try {
        while (inp) { // 'til error/eof
#if defined( RSDECODE )
            // R-S Decode

            chunk.resize(chunksize + paritysize);
            inp.read(&chunk.front(), chunksize + paritysize);
            std::cout << chunk;
            chunk.resize(inp.gcount());
            inptotal += inp.gcount();
            std::cout << inptotal;
            if (chunk.size() == 0)
                continue;
            if (chunk.size() < paritysize + 1)
                throw std::logic_error(std::string()
                    << "Insufficient data for an "
                    << rs << " encoded chunk");
            //  std::cout<<chunk;
            // R-S decode (raises std::exception on R-S decode failure), then remove parity
            std::vector<uint16_t> data(chunk.begin(), chunk.end());
            rs.decode(data);
            data.resize(data.size() - paritysize);
#else
            // R-S Encode

            chunk.resize(chunksize);
            inp.read(&chunk.front(), chunksize);

            std::vector<uint16_t> data(chunk.begin(), chunk.end());


            chunk.resize(inp.gcount());
            inptotal += inp.gcount();
            // std::cerr << "Read:  " << chunk.size() << std::endl;
            if (chunk.size() == 0)
                continue;

            //   std::vector<uint16_t> data(chunk.begin(), chunk.end());
            rs.encode(data);

#endif
            // std::cerr << "Wrote: " << chunk.size() << std::endl;

            std::copy(data.begin(), data.end(), std::ostreambuf_iterator<char>(out));
            //  out << std::vector<uint8_t>(data.begin(), data.end());
        }
    } catch (std::exception &exc) {
        std::cerr << "Error after " << inptotal << " bytes: " << exc.what() << std::endl;
        return 1;
    }
    return 0;
}

I don't know why this codec is behaving weirdly on decoding while working well on encoding.
Any help would be appreciated
Thanks,
James

james224 · Answer 1 · Mon Apr 04 2016 15:58:38 GMT+0800 (China Standard Time)

@pjkundert The problem is very specific, which is why I don't have any option other than to contact you. Any help would be highly appreciated.

Perry Kundert · Answer 2 · Mon Apr 04 2016 20:34:02 GMT+0800 (China Standard Time)

Hi, James;

The root of the problem is that you are mixing 8-bit data (the characters from the file) with a 16-bit container (the std::vector<uint16_t>). Each uint16_t is 2 bytes long. Filling it requires taking 2 characters from the file (a process called "deserialization"): one goes in the low 8 bits, the second is shifted into the high 8-bit position. Likewise, when you store the data to the file (a process called "serialization"), you must take the 16 bits of data in each uint16_t value in the vector, and output it in 8-bit chunks into the file, in exactly the inverse of how you read it from the file.
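As a rough illustration of that mismatch, here is a minimal sketch with no ezpwd dependency. The helper names (write_as_chars, read_as_symbols, fits_symbol) are made up for this example. It mimics what the posted program does: symbols are written through a char-based output iterator, which keeps only the low 8 bits, and read back by converting each char to uint16_t, which sign-extends wherever char is signed. This is a plausible reading of why the decoder complains about parity data, not something confirmed in the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Largest value a 9-bit symbol, as used by an RS(511,...) codec, can hold.
constexpr uint16_t symbol_max = 511;

// Hypothetical helper: what copying uint16_t values through a char-based
// output iterator amounts to. Only the low 8 bits of each symbol survive.
std::string write_as_chars(const std::vector<uint16_t> &symbols) {
    std::string bytes;
    for (uint16_t s : symbols)
        bytes.push_back(static_cast<char>(s & 0xff));
    return bytes;
}

// Hypothetical helper: what std::vector<uint16_t>(chunk.begin(), chunk.end())
// does. Each char is converted on its own; where char is signed, bytes
// >= 0x80 sign-extend to 0xFFxx.
std::vector<uint16_t> read_as_symbols(const std::string &bytes) {
    return std::vector<uint16_t>(bytes.begin(), bytes.end());
}

// True when the value fits in a 9-bit symbol.
bool fits_symbol(uint16_t s) {
    assert(symbol_max == 511);
    return s <= symbol_max;
}
```

Feeding a 9-bit parity value such as 0x1A5 through write_as_chars and then read_as_symbols never returns 0x1A5: the ninth bit is lost on the way out, and on platforms where char is signed the byte 0xA5 comes back as 0xFFA5, which no longer fits in 9 bits. Plain ASCII payload bytes survive the same trip unchanged, which may be why the encode step appears to work.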

You would use an RS(511,...) codec because you have data symbols with exactly 9 bits of data. (Or, you have more bits of data, but you are only "protecting" the low 9 bits with a Reed-Solomon encoding.)

Is that really true? Are you protecting data that consists of 9-bit data elements? Or, are you just trying to use a larger R-S codec to process a larger chunk at a time, since the 8-bit R-S codecs RS(255,...) will only operate on codewords (data chunks) up to 255 total bytes (and actually less, because that includes the R-S parity bytes, so the payload is 255-parity)?
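The sizes involved can be worked out with a small sketch. The function names here are invented for illustration, and stored_bytes assumes each symbol is stored in a whole number of bytes, as discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Symbols per codeword for an R-S codec with `bits`-bit symbols: 2^bits - 1.
constexpr std::size_t codeword_symbols(unsigned bits) {
    return (std::size_t(1) << bits) - 1;
}

// Payload symbols left in one codeword after reserving `parity` symbols.
inline std::size_t payload_symbols(unsigned bits, std::size_t parity) {
    assert(parity < codeword_symbols(bits));
    return codeword_symbols(bits) - parity;
}

// Bytes needed in a file for `symbols` symbols, assuming every symbol is
// padded up to a whole number of bytes (9-bit symbols take 2 bytes each).
constexpr std::size_t stored_bytes(unsigned bits, std::size_t symbols) {
    return symbols * ((bits + 7) / 8);
}
```

With 5 parity symbols, an 8-bit codec carries 255 - 5 = 250 payload bytes per codeword, while a 9-bit codec carries 511 - 5 = 506 payload symbols, and a full 511-symbol codeword occupies 1022 bytes on disk.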

I think that you need to understand the problem you are trying to solve. C++ is a great language -- but it is extremely pedantic; you need to precisely know the exact bit-wise layout of your data, and determine precisely how you wish to serialize/deserialize it to 8-bit characters, if you wish to send it to a file.

The general layout of your changes is OK. If you are indeed working with 9-bit symbols, I presume that each 9-bit symbol is stored as 2 bytes of the data file. So, when you are storing 511 9-bit symbols in the file, this will take 511*2 == 1022 bytes to store a single codeword. If you store it in standard "network" (big-endian) byte order, you'll be storing the highest-order 8 bits of the 16-bit uint16_t data first, followed by the low-order 8 bits. Therefore, to serialize each data element, you'd use something like:

uint8_t hi = (datum >> 8) & 0xff;
uint8_t lo = (datum >> 0) & 0xff;

and then you'd write the hi and lo bytes to the file. Later, when loading the file, you'd load a chunk of twice as many bytes as symbols, and deserialize each pair of 8-bit characters in the chunk back into a symbol:

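std::vector<uint16_t> data;
data.resize( chunk.size() / 2 ); // one symbol per pair of bytes (511 for a full 1022-byte codeword)
for ( size_t symbol = 0; symbol < data.size(); ++symbol ) {
    uint8_t hi = chunk[symbol*2+0]; // convert to unsigned 8-bit first, so no sign-extension occurs
    uint8_t lo = chunk[symbol*2+1];
    uint16_t sym = ( uint16_t( hi ) << 8 ) | uint16_t( lo );
    data[symbol] = sym;
}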

Note that we're converting each signed character in chunk into unsigned 8-bit data first (to ensure that no sign-extension occurs on any later casts), and then casting it to the target 16-bit data type uint16_t, and then bit shifting it into position, and "or"-ing it back together, to place the bits into the correct positions.
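Putting the two halves together, a round trip might look like the sketch below. The serialize and deserialize names are made up for this example and are not part of ezpwd. It assumes big-endian ("network") order with 2 bytes per symbol, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Write each 16-bit symbol as two bytes, high-order byte first.
std::string serialize(const std::vector<uint16_t> &symbols) {
    std::string bytes;
    bytes.reserve(symbols.size() * 2);
    for (uint16_t datum : symbols) {
        bytes.push_back(static_cast<char>((datum >> 8) & 0xff));
        bytes.push_back(static_cast<char>((datum >> 0) & 0xff));
    }
    return bytes;
}

// Rebuild the symbols from pairs of bytes, the exact inverse of serialize.
std::vector<uint16_t> deserialize(const std::string &bytes) {
    assert(bytes.size() % 2 == 0); // every symbol occupies exactly 2 bytes
    std::vector<uint16_t> symbols(bytes.size() / 2);
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        // Go through uint8_t first so a signed char cannot sign-extend.
        uint8_t hi = static_cast<uint8_t>(bytes[i * 2 + 0]);
        uint8_t lo = static_cast<uint8_t>(bytes[i * 2 + 1]);
        symbols[i] = static_cast<uint16_t>((uint16_t(hi) << 8) | uint16_t(lo));
    }
    return symbols;
}
```

In a program like the one in the question, the encoder would call serialize on the R-S encoded vector before writing it out, and the decoder would read 2 bytes per symbol and call deserialize before handing the vector to the codec.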

Anyway, that's a basic tutorial on serialization. None of this is tested, of course ;)

Cheers,

james224 · Answer 3 · Mon Apr 04 2016 21:08:57 GMT+0800 (China Standard Time)

@pjkundert you guessed right. I am just trying to use a larger R-S codec to process a larger chunk at a time. Thank you very much for the detailed answer. If possible, could you please make this code compatible with RS(511,k) or larger codecs? The objective is just to process larger chunks; the data does not necessarily consist of 9-bit symbols.

Perry Kundert · Answer 4 · Mon Apr 04 2016 21:12:54 GMT+0800 (China Standard Time)

I would challenge you to make the code changes necessary to use the larger codecs, and handle 16-bit symbols! This is a good exercise for you.

You should look into how Compact Discs use Reed-Solomon encoding, to get an idea of how to process large blocks of data, using small 8-bit Reed-Solomon codecs. Hint: Interleaving...
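To give a flavour of the hint, here is a minimal sketch of a plain block interleaver. It is not the cross-interleaved scheme Compact Discs actually use, and the interleave and deinterleave names are made up for this example. Consecutive bytes of a large block are dealt round-robin into `depth` rows; each row could then be protected by its own 8-bit codec, as long as it fits that codec's payload of 255 minus the parity size.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Deal the bytes of `data` round-robin into `depth` rows: byte i goes to
// row i % depth. A burst of up to `depth` consecutive damaged bytes then
// touches each row at most once.
std::vector<std::string> interleave(const std::string &data, std::size_t depth) {
    assert(depth > 0);
    std::vector<std::string> rows(depth);
    for (std::size_t i = 0; i < data.size(); ++i)
        rows[i % depth].push_back(data[i]);
    return rows;
}

// Inverse of interleave: byte i is element i / depth of row i % depth.
std::string deinterleave(const std::vector<std::string> &rows) {
    std::size_t total = 0;
    for (const std::string &row : rows)
        total += row.size();
    std::string data(total, '\0');
    for (std::size_t i = 0; i < total; ++i)
        data[i] = rows[i % rows.size()][i / rows.size()];
    return data;
}
```

On the write side each row would be R-S encoded separately; on the read side each row is decoded (and corrected) on its own before deinterleave reassembles the original block.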

james224 · Answer 5 · Mon Apr 04 2016 22:02:11 GMT+0800 (China Standard Time)

I have already put a lot of effort into this, and opened an issue as a last resort, because i think it should work like i am doing right now. Encoding works perfectly with the larger codec and chunk size, which confuses me even more. But decoding gets stuck with the error "parity data contains information beyond R-S symbol size". I assumed the decoder side should simply be the reverse of the encoding process. I don't know what I am missing there.

Perry Kundert · Answer 6 · Mon Apr 04 2016 22:17:12 GMT+0800 (China Standard Time)

I would challenge you to reconsider your approach.

In programming, there are only two amounts of effort; A) the amount that produces a correct result that you understand completely, and B) inadequate effort.

It does not work as you are doing it right now, because the approach is incorrect, for the reasons that I described. There are several errors, the major error being that the number of bits in each character is 8, but the number of bits in each Reed-Solomon symbol is 9 (and hence, you must use a 16-bit data unit, containing your 9-bit symbol, to perform the Reed-Solomon encode/decode). Therefore, you must store one R-S symbol in 2 file characters.

You are not doing this. You must use (something like) the serialization/deserialization code I gave you. Until you do this, while you may think that your code "should work like i am doing right now", you are mistaken.

In programming, humility is a key attribute. Your job as a programmer is (mostly) to find out (as quickly as possible) that you are wrong. And then, to discover why you are wrong, and to correct the miscomprehension that led to the root cause of your error. This is how you develop.

Eventually, (after 10,000 hours or so), your understanding of A) the problem domain, and B) the tools you use, becomes much closer to actual reality. Never 100% correct, but close enough that you make fewer errors, and (sometimes), mercifully, your programs work as you intended.

james224 · Answer 7 · Tue Apr 05 2016 02:17:17 GMT+0800 (China Standard Time)

You are very much right, sir. But sometimes in life you have to do work that is beyond your capacity. Frankly speaking, if you had not been so helpful I would not even have considered using the ezpwd R-S codec. I have read the previous issues and got an idea of how to use it. Using your code, I have completed most of the work assigned to me. But at this point I have been asked to apply R-S parity at the video GOP (Group of Pictures) level. One GOP consists of more than 10 frames, which eventually forced me to use a larger codec so that a GOP's worth of data could fit into one R-S block. I don't have the option of changing libraries now, because most of the work is done. BTW, why does encoding work without serialization in my case? This is a real twist for me.

Perry Kundert · Answer 8 · Tue Apr 05 2016 21:31:56 GMT+0800 (China Standard Time)

I have improved the rsencode.C example to support symbol sizes greater than 8 bits.