Improving on Filesystem

Question

Improving on Filesystem

nochso opened this issue 8 years ago · comments

Currently the Filesystem class is only used by JsonReporter.
However I'd like to suggest extracting an interface from it and letting any scanner related classes use that.
I think it will definitely come in handy with the git version. For example you could avoid checking out entire working directories before scanning them. Instead just implement a GitFilesystem that has access to a repository and implements at least read and fileExists.

That way, scanning huge histories should become quite easy. Use getModifiedFiles of Gitter\Repository to find changed files only. Then use GitFilesystem to retrieve only modified files.
Should there be files that were not changed but still need analyzing (like super classes), GitFilesystem can easily fetch them when this is implement: #73

I'm not too set on "not checking the repository" out, so an alternative could be letting GitFilesystem handle the checkout or clone to a temp directory.

What do you think?

Tom Rochette · Answer 1 · Tue Jan 26 2016 08:34:57 GMT+0800 (China Standard Time)

I'm not too sure I see how that example you gave would work. My guess is that checking out a full git branch is probably more easy/effective than pulling out a list of files from a specific commit.

Overall, I don't see what we'd gain. If you can expose your idea a little more, maybe I'll understand what you're trying to achieve with it.

Marcel Voigt · Answer 2 · Wed Jan 27 2016 05:01:31 GMT+0800 (China Standard Time)

Assuming that

you want to analyze long histories of active projects
generally fewer files are modified than there are stale/untouched files
repositories could be of any size or age, might contain binaries, documentation, assets etc.

Analyzing those in one go, that's a lot of disk writes that are wasting time and I/O.

For example here: https://github.com/tomzx/gitter/blob/php-semver-checker/lib/Gitter/Repository.php#L628-L639
You're already using this to prepare a list of modified files. But then you check out everything anyway, read the whole working directory and filter it again.

Keeping the assumptions in mind, I'm sure running git diff-tree and a few git shows is more effective as you're only reading and git is pretty good at that.

Adding --name-status to git diff-tree lets you know whether a file was added, modified or deleted.
If it was added, we only need to get it out of R2. Deleted: get it out of R1. If it was modified we retrieve both versions, then parse and forget them.

You could even analyze repositories without ever checking anything out, keeping a bare repository which is quite handy.

Added bonus: Users can analyze without modifying their working directory at all making it friendlier.

Anyway, I'm more interested in allowing to do that by using a tiny Filesystem interface/abstraction so that I'm not forced to work with actual files.

Tom Rochette · Answer 3 · Wed Jan 27 2016 12:54:50 GMT+0800 (China Standard Time)

First, the reason I ask for the list of modified files is not to save time on checking out (which is generally very fast with git). The goal is to avoid READING/SCANNING all the files in the repository, which would take a lot of time and is essentially why this was put in place. On project like Symfony it would save a considerable amount of time from not having to parse ~5k files per compare operation.

Your idea sounds interesting in theory, but I'm skeptical we'll see any performance improvement in practice. It also seems to put a lot of burden on us to have to code a system which will do all these git operations while we can "already" just checkout the desired commits.

I'm however open to the idea, but I'd have to see it implemented and benchmarked.

Although, reading your last sentence, it sounds like you want to replace the Filesystem wrapper by something like Flysystem. Is that correct?

Marcel Voigt · Answer 4 · Thu Jan 28 2016 05:07:16 GMT+0800 (China Standard Time)

Well, it doesn't need to be a whole abstraction layer like Flysystem. I'll think about it some more and mess around with the -git version to see what might make sense.

Marcel Voigt · Answer 5 · Thu Jan 28 2016 05:13:31 GMT+0800 (China Standard Time)

The goal is to avoid READING/SCANNING all the files in the repository, which would take a lot of time and is essentially why this was put in place.

But by checking them out, you are reading all the files from the repository AND writing them to disk. On a related note: IMO both findFromString calls in git-checker's CompareCommand could be dropped right now for example. Can be expensive without exclude-* configuration and it should be enough to rely on getModifiedFiles.

Tom Rochette · Answer 6 · Thu Jan 28 2016 07:21:55 GMT+0800 (China Standard Time)

I should probably make it more clear that when I said read/scan, I meant the process of actually opening the file for reading and using PHPParser to parse its content.

I do think you are right that there could probably be some improvement done in order not to scan all the files in the included directory, but only those that are modified. We currently rely on the Finder to take care of exclusion for us, since the modified files can obviously return us an immense list of files we have no interest in.