sepinf-inc / IPED

IPED Digital Forensic Tool. It is open-source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in corporate investigations by private examiners.

DecoderException: Odd number of characters from ExportCSVTask if resuming processing

DeveloperNamedMax opened this issue · comments

When my processing finished, I received this error:

2024-02-07 10:18:24	[INFO]	[engine.graph.GraphGenerator]			Running post generation statements.
2024-02-07 10:18:24	[INFO]	[engine.graph.GraphGenerator]			Running CREATE INDEX ON :EVIDENCIA(category)
2024-02-07 10:18:26	[INFO]	[engine.graph.GraphGenerator]			Running CREATE INDEX ON :EVIDENCIA(source)
2024-02-07 10:18:26	[INFO]	[engine.graph.GraphGenerator]			Running CREATE INDEX ON :EVIDENCIA(evidenceId)
2024-02-07 10:18:26	[INFO]	[engine.graph.GraphGenerator]			Finished running post generation statements in 1641ms.
2024-02-07 10:18:28	[INFO]	[engine.graph.GraphGenerator]			Grouping TELEFONE contacts.
2024-02-07 10:18:28	[INFO]	[engine.graph.GraphGenerator]			Grouped 0 TELEFONE contacts.
2024-02-07 10:18:29	[INFO]	[engine.graph.GraphGenerator]			Grouping EMAIL contacts.
2024-02-07 10:18:29	[INFO]	[engine.graph.GraphGenerator]			Grouped 0 EMAIL contacts.
2024-02-07 10:18:29	[INFO]	[engine.graph.GraphGenerator]			Grouping FACEBOOK contacts.
2024-02-07 10:18:29	[INFO]	[engine.graph.GraphGenerator]			Grouped 0 FACEBOOK contacts.
2024-02-07 10:18:29	[INFO]	[engine.graph.GraphServiceImpl]			Shutting down neo4j service.
2024-02-07 10:18:30	[INFO]	[engine.graph.GraphTask]			Generating graph database finished.
2024-02-07 10:18:30	[INFO]	[engine.graph.GraphTask]			Compressing graph CSVs...
2024-02-07 10:18:30	[INFO]	[engine.graph.GraphTask]			Compressing graph CSVs finished.
2024-02-07 10:20:36	[INFO]	[engine.core.Worker]			Worker-0 finished.
java.lang.IllegalArgumentException: Invalid hash string microsoft-windows-printing-wfs-fod-package-Wrapper~31bf3856ad364e35~am1].slice(0, -1)
	at iped.utils.HashValue.<init>(HashValue.java:23)
	at iped.engine.task.ExportCSVTask.finish(ExportCSVTask.java:263)
	at iped.engine.core.Worker.finishTasks(Worker.java:135)
	at iped.engine.core.Worker.run(Worker.java:300)
Caused by: org.apache.commons.codec.DecoderException: Odd number of characters.
	at org.apache.commons.codec.binary.Hex.decodeHex(Hex.java:97)
	at org.apache.commons.codec.binary.Hex.decodeHex(Hex.java:77)
	at iped.utils.HashValue.<init>(HashValue.java:20)
	... 3 more
2024-02-07 10:20:36	[INFO]	[engine.core.Worker]			Worker-3 finished.
2024-02-07 10:20:36	[INFO]	[engine.core.Worker]			Worker-5 finished.
2024-02-07 10:20:36	[INFO]	[engine.core.Worker]			Worker-6 finished.
2024-02-07 10:20:36	[INFO]	[engine.core.Worker]			Worker-7 finished.
2024-02-07 10:20:36	[INFO]	[engine.core.Worker]			Worker-1 finished.
2024-02-07 10:20:36	[INFO]	[engine.core.Worker]			Worker-2 finished.
2024-02-07 10:20:36	[INFO]	[engine.core.Worker]			Worker-4 finished.
2024-02-07 10:20:36	[ERROR]	[app.processing.Main]			Processing Error: 
java.lang.IllegalArgumentException: Invalid hash string microsoft-windows-printing-wfs-fod-package-Wrapper~31bf3856ad364e35~am1].slice(0, -1)
	at iped.utils.HashValue.<init>(HashValue.java:23) ~[iped-utils-4.1.5.jar:?]
	at iped.engine.task.ExportCSVTask.finish(ExportCSVTask.java:263) ~[iped-engine-4.1.5.jar:?]
	at iped.engine.core.Worker.finishTasks(Worker.java:135) ~[iped-engine-4.1.5.jar:?]
	at iped.engine.core.Worker.run(Worker.java:300) ~[iped-engine-4.1.5.jar:?]
Caused by: org.apache.commons.codec.DecoderException: Odd number of characters.
	at org.apache.commons.codec.binary.Hex.decodeHex(Hex.java:97) ~[commons-codec-1.15.jar:1.15]
	at org.apache.commons.codec.binary.Hex.decodeHex(Hex.java:77) ~[commons-codec-1.15.jar:1.15]
	at iped.utils.HashValue.<init>(HashValue.java:20) ~[iped-utils-4.1.5.jar:?]
	... 3 more

Version 4.1.5
Any simple fix for this?
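
For context, the failure above is easy to reproduce in isolation: a hex string with an odd number of characters cannot map to whole bytes, so decoding fails before any hash comparison happens. Below is a minimal plain-JDK sketch; the decodeHex method is an illustrative stand-in for commons-codec's Hex.decodeHex, which HashValue uses, not the library code itself.

```java
public class OddHexDemo {
    // Decode a hex string into bytes; rejects odd-length input,
    // mirroring the check that produces "Odd number of characters."
    public static byte[] decodeHex(String s) {
        if (s.length() % 2 != 0) {
            throw new IllegalArgumentException("Odd number of characters.");
        }
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(decodeHex("0e3f9d").length); // prints 3
        try {
            // The corrupted CSV column in the report is not a hash at all;
            // its odd length is simply the first check it fails.
            decodeHex("microsoft-windows-p");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "Odd number of characters."
        }
    }
}
```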

Have you used --append or --continue command line options?

Any simple fix for this?

Disabling ExportCSVTask in IPEDConfig.txt should avoid this error...

When it finishes, I would appreciate if you could export all properties of microsoft-windows-printing-wfs-fod-package-Wrapper~31bf3856ad364e35~am1].slice(0, -1) file and send them to me.

Disabling ExportCSVTask in IPEDConfig.txt should avoid this error...

This is achieved by setting exportFileProps = false.
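
Assuming the default IPEDConfig.txt layout, the change would look like this (the comment line is illustrative):

```
# IPEDConfig.txt
# Disable ExportCSVTask (skips generating the properties CSV)
exportFileProps = false
```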

When it finishes, I would appreciate if you could export all properties of microsoft-windows-printing-wfs-fod-package-Wrapper~31bf3856ad364e35~am1].slice(0, -1) file and send them to me.

It would be interesting to have the CSV file. My guess is that it was corrupted/incomplete.

PS: Sorry, I closed it by accident while typing this post.

Have you used --append or --continue command line options?

My current setup:

At the moment I am appending to an existing case.
Since I have a large amount of data, I made a .bat script that runs the IPED script and, when IPED closes, does the same thing 10 more times. The idea was that if any error appeared, a second run might not hit the same error. (Not that I have had any errors except the one I am describing.)

The command, in simplified form, is: "iped.exe -d data.dd -o output -profile forensic --portable --continue --append"

Disabling ExportCSVTask in IPEDConfig.txt should avoid this error...

Tried this, and it works!

When it finishes, I would appreciate if you could export all properties of microsoft-windows-printing-wfs-fod-package-Wrapper~31bf3856ad364e35~am1].slice(0, -1) file and send them to me.

Here I have questions. How does "FileList.csv" work? Does it append to the end of the file every time IPED is run? I am asking because for some reason it is 500 GB in size. I would have copied the properties from that file. I can try to read the file line by line to find it, but that would take a while.

Is there any other way I can export all properties of the file?

Does it append to the end of the file every time IPED is run?

Yes, it appends to the existing file, one line per item. And when you use the --continue option, it has a special procedure to remove duplicated lines (the exception happened in that part of the code).

I am asking because for some reason it is 500 GB in size.

That is pretty odd. For example, a case with 10 million items I am working on has a ~4 GB CSV.
How many items are there in your case?

I would have copied the properties from that file. I can try to read the file line by line to find it, but that would take a while.

Can you try getting the first ~1000 lines and the last ~1000 lines (or any pieces from the head and from the tail of the file, not necessarily containing full lines)?
And try to use grep to find lines with 31bf3856ad364e35.
If you can send me these results privately, it will probably help to find out what is going on.

That is pretty odd. For example, a case with 10 million items I am working on has a ~4 GB CSV. How many items are there in your case?

Currently about ~26M. My other cases were also only 1-4 GB in size. Might the issue be that the multiple reruns added to the list, but since the exception happened before duplicated lines were deleted, the file just grew? Can I just regenerate the list for the whole case?

Can you try getting the first ~1000 lines and the last ~1000 lines (or any piece, not necessarily containing full lines)? And try to use grep to find lines with 31bf3856ad364e35. If you can send me these results privately, it will probably help to find out what is going on.

Started searching; will post when I get results.

Currently about ~26M. My other cases were also only 1-4 GB in size.

With 26 M items, the CSV should be something around 10 GB, so there is something definitely wrong with it.

Might the issue be that the multiple reruns added to the list, but since the exception happened before duplicated lines were deleted, the file just grew?

As far as I know (@lfcnassif is more familiar with that part of the code), the duplicated lines should be a relatively small number, likely from an abruptly terminated execution.

Can I just regenerate the list for the whole case?

I don't know if there is an easy way of doing that.

Started searching; will post when I get results.

Thanks!

Hi @DeveloperNamedMax and @wladimirleite, sorry for my delay.

Might the issue be that the multiple reruns added to the list, but since the exception happened before duplicated lines were deleted, the file just grew?

As far as I know (@lfcnassif is more familiar with that part of the code), the duplicated lines should be a relatively small number, likely from an abruptly terminated execution.

Actually not, because ExportCSVTask overrides the processIgnoredItem() method to always return true (we want to write the properties of ignorable files, i.e. files with known hashes, and of duplicated files, if they are being ignored), so it writes the properties of already committed items again. This explains why @DeveloperNamedMax's CSV reached 500 GB... We could use a control flag different from "ignorable" for committed items to avoid huge temporary CSVs like that.

Reading the code, it really is possible to corrupt the properties CSV after an abrupt termination, like the process crashing, being killed, or an abrupt power off. I'll try to reproduce the situation... I think using an auxiliary file to track the last commit point (the CSV size), or writing it to the beginning of the temp CSV, would allow rolling back the CSV to the last healthy state when resuming the processing.
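
That commit-point idea could be sketched like this. Everything below is illustrative (file names, sidecar-file layout); it is not the actual implementation proposed for IPED:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CsvRollbackSketch {
    // Record the CSV size at each commit point in a small sidecar file.
    static void saveCommitPoint(Path csv, Path marker) throws IOException {
        Files.writeString(marker, Long.toString(Files.size(csv)));
    }

    // On resume, truncate the CSV back to the last recorded (healthy) size,
    // dropping any partially written tail from an abrupt termination.
    static void rollback(Path csv, Path marker) throws IOException {
        long committed = Long.parseLong(Files.readString(marker).trim());
        try (FileChannel ch = FileChannel.open(csv, StandardOpenOption.WRITE)) {
            if (ch.size() > committed) {
                ch.truncate(committed);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("filelist", ".csv");
        Path marker = Files.createTempFile("filelist", ".commit");
        Files.writeString(csv, "\"a.txt\",\"0e3f9d25\"\n");
        saveCommitPoint(csv, marker);                // healthy state committed
        Files.writeString(csv, "\"b.txt\",\"0e3f",  // simulated crash mid-write
                StandardOpenOption.APPEND);
        rollback(csv, marker);
        System.out.print(Files.readString(csv));     // only the committed line remains
    }
}
```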

Can I just regenerate the list for the whole case?

If you are not ignoring known files or duplicated files, you can generate a similar CSV from the Options button in the UI, although the exported properties will be a bit different.

Finally got file lines containing the referenced file inside the 500GB FileList.csv:

"microsoft-windows-printing-wfs-fod-package-Wrapper~31bf3856ad364e35~am1].slice(0, -1),
    base: allParts[2],
    ext: allParts[3],
    name: allParts[2].slice(0, allParts[2].length - allParts[3].length)
  };
};



// Split a filename into [root, dir, basename, ext], unix version
// 'root' is just a slash, or nothing.
var splitPathRe =
    /^(\/?|)([\s\S]*?)((?:\.{1,2}|[^\/]+?|)(\.[^.\/]*|))(?:[\/]*)$/;
var posix = {};


function posixSplitPath(filename) {
  return splitPathRe.exec(filename).slice(1);
}


posix.parse = function(pathString) {
  if (typeof pathString !== 'string') {
    throw new TypeError(
        "Parameter 'pathString' must be a string, not " + typeof pathString
    );
  }
  var allParts = posixSplitPath(pathString);
  if (!allParts || allParts.length !== 4) {
    throw new TypeError("Invalid path '" + pathString + "'");
  }
  allParts[1] = allParts[1] || '';
  allParts[2] = allParts[2] || '';
  allParts[3] = allParts[3] || '';

  return {
    root: allParts[0],
    dir: allParts[0] + allParts[1].slice(0, -1),
    base: allParts[2],
    ext: allParts[3],
    name: allParts[2].slice(0, allParts[2].length - allParts[3].length)
  };
};


if (isWindows)
  module.exports = win32.parse;
else /* posix */
  module.exports = posix.parse;

module.exports.posix = posix.parse;
module.exports.win32 = win32.parse;
{
  "name": "path-parse",
  "version": "1.0.6",
  "description": "Node.js path.pa

This code continues for 180k+ lines and ends with:

Machines/WinDev2102Eval/WinDev2102Eval-disk1.vmdk/vol_vol3/Windows/System32/catroot2/{XXXXXX-XXXX-XXXX-XXXX-XXXXXXXX}/catdb>>Table_HashCatNameTableSHA1.4_col-HashCatNameTable_CatNameCol_row5893.data","0e3f9d25a7e728c8e30642548b3d005a"

Adding that FileList.csv occasionally has sections full of "\x00" bytes.

Thanks @DeveloperNamedMax! That is really odd; the CSV shouldn't contain code fragments like that. I'll try to take a look at this in the next few days.

Finally got file lines containing the referenced file inside the 500GB FileList.csv: [same CSV fragment as above, continuing for 180k+ lines]

Adding that FileList.csv occasionally has sections full of "\x00" bytes.

I have no idea how that code got into the CSV, since all content written to the CSV is escaped and any newlines and control chars are filtered out. Anyway, I submitted #2151 to try to avoid corrupting the CSV, which could result in aborting errors like the one reported here initially.
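
Escaping of the kind described might look like this sketch (illustrative only, not IPED's actual code): control characters, including line breaks, are dropped and embedded quotes are doubled, so a properly written field can never span multiple lines.

```java
public class CsvEscapeSketch {
    // Quote a CSV field, dropping \n, \r, NUL and other control chars
    // and doubling embedded quotes (illustrative, not IPED's code).
    public static String escapeField(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            if (Character.isISOControl(c)) {
                continue; // filter out control chars and line breaks
            }
            if (c == '"') {
                sb.append('"'); // double embedded quotes
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeField("a\nb\u0000c\"d")); // prints "abc""d"
    }
}
```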