nemerle / dcc

This is a heavily updated version of the old DOS executable decompiler DCC

Long signatures

lab313ru opened this issue · comments

I have read readsig.txt and found that signatures are currently 23 bytes long. Is that true?
If so, is it possible to create longer signatures? And why 23? How many signatures does DCC miss because of this (and because of collisions)?

The DCCS file format hard-wires it to 23 bytes. Sprinkled in the DCC code you find:

#define  PATLEN 23

Goodness knows why the original authors chose that size; it's likely to avoid excessive file sizes on a project that may have been developed on an x86 MS-DOS environment.

You could try changing PATLEN to a larger number and then running makedsig on a *.LIB file, but the resulting signature file will be incompatible with older DCCS files.
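
To get a feel for how many functions collide at a given PATLEN, something like the following could be run over function bodies extracted from a LIB file. This is only a sketch: `colliding_groups` and the `functions` mapping are hypothetical names, extracting the bytes from the LIB is not shown, and the zero padding of short functions is an assumption rather than what makedsig necessarily does.

```python
from collections import defaultdict

PATLEN = 23  # the value hard-wired in dcc, per the #define above


def colliding_groups(functions, patlen=PATLEN):
    """Group function names whose first `patlen` code bytes are identical.

    functions: dict mapping function name -> bytes of its code.
    """
    by_prefix = defaultdict(list)
    for name, code in functions.items():
        # Assumption: functions shorter than patlen are zero-padded.
        key = code[:patlen].ljust(patlen, b"\x00")
        by_prefix[key].append(name)
    return [sorted(names) for names in by_prefix.values() if len(names) > 1]
```

Calling it with a few different `patlen` values on the same library would show how quickly the collisions fall off as signatures get longer.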

What I want: to regenerate the signature libs so that they cover all functions, with almost no collisions. Then I want to apply them in IDA.

You'll need access to the original LIB files from the various compilers to accomplish this, since only the first 23 bytes are available in the existing DCCS files.

I understand.

Another question: signature models:
dccb3c.sig
dccb3l.sig

"l" and "c" - what the difference?

The naming convention used here is: dcc<v><n><m> where

  • v is the vendor of the compiler library (b= Borland m= Microsoft)
  • n is the version of the library
  • m is the x86 memory/pointer model, where c is "compact", s is "small", m is "medium" and l is "large". For more about the x86 memory models, see https://en.wikipedia.org/wiki/Intel_Memory_Model
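
As an illustration of the naming convention, here is a small sketch that splits a signature file name into its parts. `parse_sig_name` and `SigName` are hypothetical names, not part of dcc, and only the vendor and model letters listed above are recognised; anything else is reported as unknown.

```python
import re
from typing import NamedTuple, Optional

# Only the letters mentioned in this thread; other letters are flagged as unknown.
VENDORS = {"b": "Borland", "m": "Microsoft"}
MODELS = {"c": "compact", "s": "small", "m": "medium", "l": "large"}


class SigName(NamedTuple):
    vendor: str
    version: str
    model: str


def parse_sig_name(filename: str) -> Optional[SigName]:
    """Split a name like 'dccb3l.sig' into vendor, version and model."""
    m = re.fullmatch(r"dcc([a-z])(\d+)([a-z])\.sig", filename.lower())
    if m is None:
        return None
    vendor, version, model = m.groups()
    return SigName(
        VENDORS.get(vendor, f"unknown ({vendor})"),
        version,
        MODELS.get(model, f"unknown ({model})"),
    )
```

For example, `parse_sig_name("dccb3l.sig")` gives Borland, version 3, large model.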

OK. Suppose there is some compiler lib file. Which model will be selected, and what criteria are used when naming it? Is the memory model recorded inside the lib?

When generating the signatures using makedsig, the user herself has to know what vendor, version and model the LIB file was compiled with.

But makedsig only asks for libname as a parameter.

If you look at makedsig.cpp, you'll find the usage:

"This program is to make 'signatures' of known c and tpl library calls for the dcc program.\n"
"It needs as the first arg the name of a library file, and as the second arg, the name "
"of the signature file to be generated.\n"
"Example: makedsig CL.LIB dccb3l.sig\n"
"      or makedsig turbo.tpl dcct4p.sig\n"

So it's the user's responsibility to provide a correct file name for the .sig file.

Ah, I see. dcc selects the correct sig file, and I should provide the correct file name.

Exactly so.

AFAIK there is no identification information contained inside lib/tpl files.

Reko will probably use a variant of this scheme, but the mapping of signature files may be happening in the configuration file to avoid dependencies on the naming of the signature files themselves.

Yup, the format of signature files could be made a bit more robust:

{
    "Vendor": "Borland",
    "CompilerName" : "TurboC 3.0",
    "Language": "C",
    "Version": "3.0",
    "SignatureBlocks": [{
    "Model": "Large",
    "SigLength": 29,
    "Signatures": []
    }, {
    "Model": "Small",
    "SigLength": 23,
    "Signatures": []
    }]
}

and makedsig could be made to work with this to 'add'/'update' signatures inside these files.

Makedsig asks me for "Seed:". What is it?

And a second question: how do I merge signatures from different lib files?

Consider using a schema as well, so a JSON parser can identify what kind of data this is:

{
    "$schema":  "urn:executable:signature",
    "Vendor": ....
} 

Merging signatures from different lib files should be done by the relevant decompilers when they "ingest" the JSON described above. I.e. there should be a function
LoadSignaturesFromFiles: list<filename> => internal-signature-representation that collects all relevant metadata and "cooks" it as appropriate.
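
A rough sketch of what such a function could look like, assuming the JSON layout proposed earlier in this thread (Vendor, Version, SignatureBlocks, Model, Signatures). The internal representation chosen here, a dict keyed by (vendor, version, model), is just one possibility and not what dcc or Reko actually do.

```python
import json
from pathlib import Path


def load_signatures_from_files(filenames):
    """Merge the signature blocks of several JSON files into one dict.

    Keys are (vendor, version, model); values are lists of signatures.
    """
    merged = {}
    for filename in filenames:
        doc = json.loads(Path(filename).read_text(encoding="utf-8"))
        for block in doc.get("SignatureBlocks", []):
            key = (doc.get("Vendor"), doc.get("Version"), block.get("Model"))
            bucket = merged.setdefault(key, [])
            for sig in block.get("Signatures", []):
                if sig not in bucket:  # drop exact duplicates across files
                    bucket.append(sig)
    return merged
```

Signatures generated from different lib files of the same compiler and model end up in the same bucket, which is the merge being asked about.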

This work is underway on the Reko project: there are at least three signature file formats that Reko is aware of, and I'm making it so that they all get unified internally. It would be cool if dcc and Reko could interoperate on this level.

The DCC signature file format creates a perfect hash. The algorithm they are using requires a random number generator (RNG). The Seed: prompt is asking you for a seed for the RNG. Not sure why this is provided explicitly, perhaps for making sure, during development, that the hashtable is getting created correctly and reproducibly. Just enter some number < 32637 and you should be OK.

.lib files (and .obj files) are OMF files. Sadly, they have no magic number at the beginning, so you have to depend on file extensions to figure out what's inside. This is why I'm suggesting the $schema above -- so that both humans and computers can figure out the contents of the file.

Common signature format: agreed - will try to flesh it out and post it here.

Also, consider looking at the Yara format. It's not JSON, but we could consider making a JSON compatible version.

John, should we consider other pattern schemes ?

Once upon a time I had some fun with an xbox emulator that used pattern matching to identify SDK functions, and rewrote it to use a pre-generated per-SDK TRIE ( string with wildcards )

As for YARA, I think their pattern matching language is not a very good match for our purposes ?

What we might consider is pattern disambiguation by symbol names ?

given two patterns with the same signature:

FuncA:  12 43 65 [xx xx xx xx] 44 55 66 ...    where [xx xx xx xx] is reference to symbol FuncX
FuncB:  12 43 65 [xx xx xx xx] 44 55 66 ...    where [xx xx xx xx] is reference to symbol FuncY

we would be unable to correctly locate those patterns in the binary, but if previously we managed to locate either FuncX, or FuncY then we could use those to augment the pattern matcher?
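
To make the idea concrete, a minimal sketch of the disambiguation step. The function name and data shapes are made up for illustration; resolving where the 4-byte reference actually points is assumed to have happened elsewhere.

```python
def disambiguate(candidates, resolved_target):
    """Narrow down patterns that share the same bytes by their symbol reference.

    candidates: dict of function name -> name of the symbol its pattern refers to.
    resolved_target: name of the already-located symbol that the reference in the
                     binary points to, or None if that target is not yet known.
    """
    if resolved_target is None:
        return sorted(candidates)  # nothing to go on, still ambiguous
    return sorted(f for f, sym in candidates.items() if sym == resolved_target)
```

With the two patterns above, `disambiguate({"FuncA": "FuncX", "FuncB": "FuncY"}, "FuncX")` leaves only FuncA. This suggests matching in passes: locate the unambiguous functions first, then revisit the ambiguous ones.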

Reko uses another signature format, provided by @halsten, for identifying packers and unpackers. It is again different:

<SIGNATURES>
  <ENTRY>
    <NAME>Microsoft Visual C++ 7</NAME>
    <COMMENTS />
    <ENTRYPOINT>????4100000000000000630000000000??00??????????00??00??????????????????????????????????00??00??00??????????????????????????????00????20????00??00??????????????00??????????????????????00??00??????00??????????????00??00??00??00??00??00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????????????????????????????00??????????????????????00??00??00??????00??00??00??00??00??00</ENTRYPOINT>
    <ENTIREPE />
  </ENTRY>
  <ENTRY>
    <NAME>Microsoft Visual C++ 8.0</NAME>
    <COMMENTS>
    </COMMENTS>
    <ENTRYPOINT>4883EC28E8????00004883C428E9????FFFFCCCCCCCCCCCCCCCCCCCCCCCCCCCC</ENTRYPOINT>
    <ENTIREPE>
    </ENTIREPE>
  </ENTRY>

As you can see it is just yet another variant with its own benefits and flaws. It's easy enough to add parsers to these simple formats. The hard part is building an efficient automaton from the patterns in order to scan the decompiled image fast enough. My recent commits in Reko have introduced a suffix array implementation that lets me locate a pattern in O(log n) time, where n is the size of the binary file. Once that work is complete, I should be able to rip through any signature file format (like the one above, the one you're proposing, or the Amiga index hunks, or the DCC signature files) and in O(p * log n) time find all matching signatures located in the file, where p is the number of patterns. Any way to decrease p -- say by partitioning signature files based on detected compiler manufacturer and version -- is of course highly beneficial.
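
For readers unfamiliar with the technique, here is a toy sketch of suffix-array lookup. It is not Reko's implementation: the construction is the naive sort (fine only for small inputs), it handles exact byte patterns without wildcards, and each comparison costs up to the pattern length, so a lookup is O(m log n) for a pattern of m bytes.

```python
def build_suffix_array(data: bytes):
    """Naive construction: sort all suffix start offsets by the suffix bytes."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def find_all(data: bytes, sa, pattern: bytes):
    """Return the sorted offsets of every exact occurrence of `pattern`."""
    m = len(pattern)
    # Lower bound: first suffix whose m-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

A pattern with wildcards could presumably be handled by searching for its longest wildcard-free run this way and verifying the rest of the pattern at each hit.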

My intent with Reko is to be able to handle as many formats as possible, but drawing the line when it gets too complex and distracts me from actual decompilation :-)

IDA understands BCC's libs format. (plb utility).

@lab313ru I don't believe we can use any of their tooling in our open source projects though ?

We can't, yes. But the idea of FLIRT signatures is a good one to reuse.
https://www.hex-rays.com/products/ida/tech/flirt/in_depth.shtml

And my goal was to use bcc signatures in IDA, so...)

@lab313ru: are FLIRT signatures stored as text, or as a binary format described somewhere? I have no access to IDA so I can't go check myself.

Hmm.. I think, pat description is only IDA SDK-inner.

But, it is not problem to rewrite signmake to use max length for symbol names and for pattern length.

Maybe, for the moment, it would be better to let makedsig read a list of lib files?
Then add the signatures from them to a map, and process them as before?

I mean combining symbols from many lib-files.

Patterns need to be specified:

  • Should characters other than hex digits be allowed? For instance it's convenient to allow spaces in the pattern strings since they may be coming from other tools.
  • Should wildcard patterns be allowed? What character should be used for wildcards? I've seen '?' and '.', and don't see any reason for not allowing both.

Updated with:
EBNF-like definition for PATTERN definition

PATTERN :  ("Offset" Number (MATCH_BYTES | SYM_REF_NAME))+
MATCH_BYTES : (HEX_BYTE | WILDCARD)+
HEX_BYTE : "0x" HEX_DIGIT HEX_DIGIT
WILDCARD : "." | "?"
SYM_REF_NAME :  Ident

Although a more compact representation of MATCH_BYTES might be in order?

If it's OK to assume hexadecimal representation and 8-bit bytes, you could get rid
of the "0x" which adds nothing but padding in that case. Reko has a couple of megabytes
of signature files donated by @halsten which all look like this: AD3351?????AEB1A2?????. It appears
to be widely used in the community, and it would be nice to provide support for it.
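
A sketch of how such strings could be parsed and matched, assuming two hex digits per byte, spaces ignored, and both '?' and '.' accepted as wildcards. In the example above the wildcards fall on half-bytes ("?A", "2?"), so each byte is stored as a (mask, value) pair. The function names are made up for illustration.

```python
def compile_pattern(text: str):
    """Turn 'AD33 51??' into a list of (mask, value) pairs, one per byte."""
    digits = "".join(text.split())  # spaces are allowed and ignored
    if len(digits) % 2:
        raise ValueError("pattern must contain an even number of nibbles")
    compiled = []
    for i in range(0, len(digits), 2):
        mask = value = 0
        for ch in digits[i:i + 2]:
            mask <<= 4
            value <<= 4
            if ch not in "?.":  # both wildcard characters are accepted
                mask |= 0xF
                value |= int(ch, 16)  # raises ValueError on a non-hex character
        compiled.append((mask, value))
    return compiled


def matches_at(compiled, data: bytes, offset: int = 0) -> bool:
    """Check whether the compiled pattern matches `data` starting at `offset`."""
    if offset < 0 or offset + len(compiled) > len(data):
        return False
    return all((data[offset + i] & mask) == value
               for i, (mask, value) in enumerate(compiled))
```

With this, `matches_at(compile_pattern("7F45..46"), b"\x7fELF")` is true, since the third byte is a wildcard.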

Here's my take on a pattern file format, generalizing a little because not all emitters
of machine code are compilers (think obfuscators and packers)

{
    // The defaults if nothing else has been specified
    "Tags": {
        "Vendor": "Borland",
        "Product": "Turbo C",
        "Version": "2.0",
        "Target_machine": "x86-16",
        "Endianness": "little",
        "SourceLanguage": "C"
    },
    "Patterns": [
        {
            "Tags": {
                "Version": "3.0"
            },
            //  4-byte reference to a symbol
            "Match": [ "AAbbCC??D1e2", { "symref": "foo", "size": 4 }, "Fa" ],

            "Result": { "symbol": "malloc" }
        }
    ]
}

Here is a pattern that could be used to identify a binary as an MS-DOS EXE or ELF

{
    "Patterns": [
        {
            // must be at start of file. Not specifying offset means "anywhere"
            "Offset": 0,
            "Match": ["4D5A"],
            "Result": { "imagefile": "MzExecutable" }
        },
        {
            "Offset": 0,
            "Match": ["7F454C46"],
            "Result": { "imagefile": "ElfExecutable" }
        }
    ]
}

It would be cool if "Offset" could be specified to not only be a fixed number of bytes
from the start of file, but a special symbol "$EntryPoint" which would be the starting point
of the program as defined by the image format (PE, ELF etc)

{
   "Offset": "$EntryPoint",
   "Match":  ["7F3A39....A3B8"],
   "Result": { "Packer": "FileCrusher", "Vendor": "Packers'R'us", "Version": "0.3" }
}
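
A sketch of how a matcher might treat "Offset", assuming the image loader hands over the entry point as a file offset. `resolve_offset` and `rule_matches` are hypothetical names, and only wildcard-free string entries in "Match" are handled here.

```python
def resolve_offset(spec, entry_point=None):
    """Map an "Offset" value to a file offset; None means 'match anywhere'."""
    if spec is None:
        return None
    if spec == "$EntryPoint":
        if entry_point is None:
            raise ValueError("image loader did not supply an entry point")
        return entry_point
    return int(spec)


def rule_matches(rule, data: bytes, entry_point=None) -> bool:
    """Apply one pattern rule (a dict shaped like the JSON above) to `data`."""
    needle = bytes.fromhex("".join(rule["Match"]))  # wildcard-free patterns only
    offset = resolve_offset(rule.get("Offset"), entry_point)
    if offset is None:
        return needle in data
    return data.startswith(needle, offset)
```

The same function covers the fixed-offset rules for MZ and ELF, since a numeric "Offset" passes straight through.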

We might want/need to add Compiler_Flags as a required tag, since patterns for Debug/Release and Small/Medium/Large builds will differ.

OK, I've extended/updated the EBNF for the PATTERN and DATA parts to incorporate your suggestions:

PATTERN:

PATTERN :  PATTERN_ID? ("Offset" OFFSET_SPEC (MATCH_BYTES | SYM_REF_NAME))+ | "@" PATTERN_REF;
PATTERN_ID : Ident;
OFFSET_SPEC : Number | "$EntryPoint";
PATTERN_REF : Ident;
MATCH_BYTES : "[" (HEX_BYTE | WILDCARD)+ "]";
HEX_BYTE : HEX_DIGIT HEX_DIGIT;
WILDCARD : "." | "?";
SYM_REF_NAME : Ident;

DATA:

DATA:          (SYMBOL_DEF META_DEF?) | META_DEF;
META_DEF:      "Meta" FREEFORM_DATA;
SYMBOL_DEF:    "Symbol" SYMBOL_NAME ("Typedef" C_TYPEDEF)?;
SYMBOL_NAME:   "Name" Ident; // Ident is a raw symbol name - no demangling should be done here
C_TYPEDEF:     QuotedString; // C typedef extended with custom calling convention attributes
FREEFORM_DATA: (Ident "=" QuotedString)+; // comments, links to documentation, etc.

As for FREEFORM_DATA - it could be extended into:

META_ENTRY: PACKER_SPEC | LOADER_SPEC | FREEFORM_DATA;
PACKER_SPEC: "Packer" QuotedString;
LOADER_SPEC: "Loader" QuotedString;
FREEFORM_DATA: (Ident "=" QuotedString)+;