Adding files with no extension to be searched

Question

NathanaelA opened this issue 7 years ago

Nathanael Anderson commented 7 years ago

Is there a way to tag "Podfile" as a Cocoapod file? It has no extension, so I'm not sure how to add it to the include/classifier/database.json file... I'd rather have it come up as a known file...

Ben Boyter · Answer 1 · Tue May 02 2017 10:06:23 GMT+0800 (China Standard Time)

Open the ./include/classifier/database.json file in your editor of choice.

Add a new entry like the following (changing pod to whatever extension you need):

{
    "language": "Cocoapod",
    "extensions": [
      "pod"
    ],
    "keywords": []
}

You can leave keywords empty. Be sure to check that the file is still valid JSON. You can use a command-line tool like jq to validate it, like so:

jq . database.json

If it's a common format, please post it back here and I will include it in the default list.
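If jq is not available, the same validity check can be sketched in Python with the standard json module. This is only an illustration: validate_database is a hypothetical helper name, and the sample strings assume the database is a top-level JSON array of entries like the one above.

```python
import json


def validate_database(text):
    """Return (True, parsed data) if text is valid JSON, else (False, error message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        # Report where the parser gave up so the bad entry is easy to find.
        return False, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"


# Assumption: database.json is a JSON array of entries shaped like the one above.
good = '[{"language": "Cocoapod", "extensions": ["pod"], "keywords": []}]'
# A trailing comma is a common editing slip and is not valid JSON.
bad = '[{"language": "Cocoapod", "extensions": ["pod"], "keywords": [],}]'

print(validate_database(good)[0])
print(validate_database(bad)[0])
```

To check the real file, read it first, for example with open("database.json").read(), and pass the text to the helper.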

Ben Boyter · Answer 2 · Tue May 02 2017 10:08:46 GMT+0800 (China Standard Time)

Just realised I misread this. You want to add a file without an extension.

Currently there is no way to do this. I will need to look at fixing this.

TODO - Add support for files without extension to be classified.

quasarea · Answer 3 · Tue May 02 2017 16:14:08 GMT+0800 (China Standard Time)

Have some fortran without extensions here too ;)

Ben Boyter · Answer 4 · Tue May 02 2017 19:13:09 GMT+0800 (China Standard Time)

Hmmm, for that situation you would need to rely totally on the keyword checks. This specific issue can be solved by just looking for an explicit filename.

Keyword checks are something that I have been playing around with locally. The idea is that if a file doesn't match anything based on extension, keywords are used to guess what filetype it is. It will slow things down considerably when indexing due to the additional processing overhead. Assuming it's a minority of files, though, it shouldn't be a huge issue.
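A rough sketch of what such a keyword fallback could look like, in Python. This is not the actual searchcode server code: guess_language is a hypothetical name, the scoring rule (count how many of an entry's keywords occur in the file) is an assumption, and the keyword lists are made up for the example.

```python
def guess_language(content, database):
    """Return the language whose keywords occur most in content, or None."""
    # Split on whitespace only; a real implementation would tokenize properly.
    words = set(content.lower().split())
    best_language, best_score = None, 0
    for entry in database:
        score = sum(1 for kw in entry.get("keywords", []) if kw.lower() in words)
        if score > best_score:
            best_language, best_score = entry["language"], score
    return best_language


# Hypothetical entries; the keyword lists are illustrative only.
database = [
    {"language": "Fortran", "extensions": ["f90"],
     "keywords": ["program", "implicit", "subroutine", "end"]},
    {"language": "Python", "extensions": ["py"],
     "keywords": ["def", "import", "self", "return"]},
]

source = "program hello\n  implicit none\n  print *, 'hi'\nend program hello"
print(guess_language(source, database))
```

Because every unmatched file has to be read and compared against every entry, the cost grows with both file size and database size, which is the indexing overhead mentioned above.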

Nathanael Anderson · Answer 5 · Wed May 03 2017 00:15:26 GMT+0800 (China Standard Time)

Boyter, another way at least to deal with some of these would be something like this:

{
    "language": "Cocoapod",
    "keywords": [],
    "fixedName": "Podfile"
},
{
    "language": "License",
    "keywords": [],
    "fixedName": "License",
    "ignore": true
}

Two additional JSON attributes: fixedName and ignore...

This would not fix quasarea's Fortran case, but it would solve both the "License" and "Podfile" issues and allow people to easily ignore files in the classifier file... ;-) And rather than adding gitignore and npmignore to the "binary" file types, I could also add them to the classifier and set "ignore": true on them... Might be a more universal, cleaner fix for most issues, as everything is together then.

Then your keyword checks could go into effect after this point to cover things like Fortran files w/o an extension...
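To make the proposal concrete, here is a small Python sketch of how a lookup on the two proposed attributes might behave. Both fixedName and ignore are only suggestions in this thread, and classify_by_fixed_name is a hypothetical helper, not part of the product.

```python
def classify_by_fixed_name(filename, database):
    """Return (language, ignore) for an exact filename match, or None."""
    for entry in database:
        if entry.get("fixedName") == filename:
            # Entries without an "ignore" attribute are indexed as usual.
            return entry["language"], entry.get("ignore", False)
    return None


# The two proposed entries from the snippet above.
database = [
    {"language": "Cocoapod", "keywords": [], "fixedName": "Podfile"},
    {"language": "License", "keywords": [], "fixedName": "License", "ignore": True},
]

print(classify_by_fixed_name("Podfile", database))
print(classify_by_fixed_name("License", database))
print(classify_by_fixed_name("solver", database))
```

A None result is the point where the keyword checks would take over.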

Ben Boyter · Answer 6 · Wed May 03 2017 06:06:52 GMT+0800 (China Standard Time)

That was the plan for your specific case: the fixedName thing. I would probably make it an array though, just to cover things like COPYING and LICENSE both generally being license files.

I was going to keep the ignores inside the properties file though. I will have a think about it in this case though. It might make more sense for specific types for them to live in the database file.

Nathanael Anderson · Answer 7 · Wed May 03 2017 06:19:36 GMT+0800 (China Standard Time)

The only reason I suggest moving ignore to the database is that it keeps everything in the same file. Then people don't have to go between places... If the CPU hit is minor, I would actually move all the binary files into that same database with the ignore flag... Keeping things consistent makes it easier to configure and should simplify your code... ;-)

Ben Boyter · Answer 8 · Wed May 03 2017 06:21:26 GMT+0800 (China Standard Time)

Valid reasons. The main issue is during upgrades. It's a little more painful to migrate your own changes into the database file.

The CPU hit should in theory be nothing, thankfully. I might do it though, as I can see it being a better solution in the long term.

Ben Boyter · Answer 9 · Thu May 04 2017 06:42:13 GMT+0800 (China Standard Time)

So I was looking into this, and it turns out some of it is already done. The problem is that the database field name "extensions" isn't very descriptive. If you add the following

{
    "language": "Cocoapod",
    "extensions": [
      "cocoapod"
    ],
    "keywords": []
}

to the database, a file named cocoapod will be classified correctly. I made it such that if a file has no extension (no . in its name), then the filename itself is treated as the extension. An example of this already happening is Jenkins Buildfiles, whose entry looks like this:

{
    "language": "Jenkins Buildfile",
    "extensions": [
      "jenkinsfile"
    ],
    "keywords": []
}

I will need to update the KB with this detail and probably add it as part of a readme in the directory itself.
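The filename fallback described above can be sketched in a few lines of Python. This is an illustration rather than the actual implementation: effective_extension and classify are hypothetical names, and lowercasing the filename before matching is an assumption based on the lowercase entries shown.

```python
def effective_extension(filename):
    """Return the text after the last dot, or the whole name when there is no dot."""
    name = filename.lower()  # assumption: matching is case-insensitive
    if "." in name:
        return name.rsplit(".", 1)[1]
    return name


def classify(filename, database):
    """Return the language of the first entry listing the effective extension."""
    ext = effective_extension(filename)
    for entry in database:
        if ext in entry.get("extensions", []):
            return entry["language"]
    return None


# The two entries quoted in this answer.
database = [
    {"language": "Cocoapod", "extensions": ["cocoapod"], "keywords": []},
    {"language": "Jenkins Buildfile", "extensions": ["jenkinsfile"], "keywords": []},
]

print(classify("Jenkinsfile", database))
print(classify("cocoapod", database))
print(classify("notes.txt", database))
```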

I will, however, be adding a check which tries to guess the file type when nothing else matches. It will not be 100% accurate, as it will be based on the most common keywords in the database.

The ignore functionality is also something I will be adding.

I have also added Cocoapod to the database to save you the effort of having to do this yourself in the future: b141810

Ben Boyter · Answer 10 · Thu May 04 2017 16:01:54 GMT+0800 (China Standard Time)

Logic to guess the file type when nothing else matches has been added. It can be enabled by setting the property

deep_guess_files=true

in the searchcode.properties file.

Ben Boyter · Answer 11 · Fri May 19 2017 05:08:13 GMT+0800 (China Standard Time)

KB documentation updated:

https://searchcodeserver.com/knowledge-base/how-to-add-files-to-be-recognised.html