boyter / searchcode-server

The official home of searchcode-server, where you can run searchcode locally. Note that master is generally unstable in the sense that it is not a release; check the releases page for release versions: https://github.com/boyter/searchcode-server/releases

Home Page: https://searchcodeserver.com/

Index goes to queued not even close to completing crawl

MaxDiOrio opened this issue

Hi,

We have a filesystem based repository with upwards of 60,000 files in it, as well as a dozen or more Git based repositories that are working fine.

The FS based repo, though, won't finish crawling. It goes from Queued to Indexing and back to Queued, never once changing the number of files indexed. It's currently stuck at 8936 and has been for almost 24 hours now. The logs don't show me anything useful.

Any thoughts on how to get past this?

We actually started by indexing this data as a locally mounted CIFS share, and that was very slow. Then I added additional file extensions and reindexed, and it didn't pick up those files. So I wanted to try rsync'ing the repo locally, since that should be significantly faster to index. That brings us to where we are now.

Thanks!

Max

For added info - I built the master branch and it's still doing the same thing.

Here's the tail of the searchcode-server-0.log when the index just stops.

Dec 19, 2017 1:34:52 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/GlobalAmerican/20091217/GlobalAmerican.createIdav.py
Dec 19, 2017 1:34:52 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/output.20110207.txt
Dec 19, 2017 1:34:52 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/runCreateIdav.bat
Dec 19, 2017 1:34:52 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20111103/createIdav.py
Dec 19, 2017 1:34:53 PM java.util.logging.LogManager$RootLogger log
INFO: Closing writers

It takes a 10-minute break, then starts indexing again. This is with the log level at Info.

Hi Max,

So the easiest way to diagnose this would be to get a copy of the file system. I'm going to assume that's not an option, though, and provide some other details.

Looking at the log, it appears to be working as expected, at least up to a certain point. For the moment, can you try changing the following option in the searchcode.properties file and then restarting the service?

log_indexed=true

This should drop a log file into the logs directory with the project name as the filename and .log as the extension. This will allow you to see whether searchcode is even able to see the full 60,000 files. The file should contain an entry for every file it was able to process, whether it indexed or excluded it, and the reasons for this.

If you notice that there aren't 60,000 files in there, you can try tweaking the following value,

follow_links=true

Which should allow searchcode to follow the symlinks and the like. Then repeat the above steps. It is set to false by default as that tends to be more reliable.

My guess at this point is that somewhere in the file system there is a link or something odd that is causing the process to not be able to pick up all files.

The 10-minute break, BTW, is set by the value,

check_filerepo_changes=600

Which controls how often the file process threads run.

Ben

Hey Ben,

Funny thing is, I have log_indexed=true. All other repos generate their logs, except for the filesystem repo I'm having a problem with. Could it be that this log file isn't created until after the first indexing completes?

I have no symlinks in this repo. It's a straight rsync backup from a CIFS share, so symlinks aren't supported. There is nothing but regular files here.

Max

In that case, can you quickly grep the logs for the string "indexDocsByPath walkFileTree"? It would appear that it is crashing out somewhere.

Would you be happy to build from source to get additional logging for this?

Grepping returns nothing.

I built from the git master earlier - so I have no problem building again. Thanks!

Sorry, rubbish internet in Australia. It takes longer to push a simple change than to make it.

02eac6a

That commit adds the relevant log message. I will have a look at improving the logs for all of these right now while waiting for a result from you. Just build from master again and you should be set.

Rebuilt and deployed. So I should be able to grep the searchcode-server-0.log file for IndexFileRepoJob?

IndexFileRepoJob.execute would be the error line we are looking for, but IndexFileRepoJob should work just as well. Hopefully it will not take too long to track down.

So far it's gone between queued and indexing a few times, and no IndexFileRepoJob.execute is being logged. When I first started the index yesterday, it indexed close to 9,000 files in only a few minutes.

Info is the right log level to be on, right?

Each time, just before it goes to queued, the log shows:

Dec 19, 2017 5:16:02 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/runCreateIdav.bat
Dec 19, 2017 5:16:02 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20111103/createIdav.py
Dec 19, 2017 5:16:02 PM java.util.logging.LogManager$RootLogger log
INFO: Closing writers

It sits queued for about 10 minutes then starts up again. I'm wondering what's causing it to pause like this.

The system has 4GB of RAM assigned to it, and about 230MB free. I can add RAM if needed, but I don't really see much memory pressure.

Info will just dump out all of the logs, so in your case, yes. We are looking for a severe error to appear in the logs, though.

This is interesting. It should be dumping out the error at this point. Give me a few minutes; I'm going to pull out some code for you to try against the file location.

OK, could you try the following Java code?

https://gist.github.com/boyter/44c45d9d587120e57f29841b34ca82dc

You just need to run it like so,

javac Main.java
java Main DIRECTORY > output.txt

Replace DIRECTORY with the path to your directory and output.txt with whatever you want the output file to be named. This should allow us to work out whether the problem is in the way the Java walk works or elsewhere.
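
For reference, here is a minimal sketch of what that Main.java likely does (an assumption - the gist itself is the source of truth): walk the tree with Files.walkFileTree and print one line per file visited, so the output line count can be compared against a find.

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class Main {
    public static void main(String[] args) throws IOException {
        // Walk the directory given as the first argument, printing every file visited.
        Files.walkFileTree(Paths.get(args[0]), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                System.out.println(file); // one output line per file
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                System.err.println("failed: " + file); // keep walking on errors
                return FileVisitResult.CONTINUE;
            }
        });
    }
}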

126,018 lines in the output file.

Is that about how many files you have in the directory? I.e., a number similar to what this returns:

find . | wc -l

Would you be able to compare it against this version?

https://gist.github.com/boyter/5b5d5fc9ceec3f2ab78b163fbb27a30f

This one includes the options to follow links.
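
Assuming the same visitor as in the sketch above, the relevant difference is the walkFileTree overload that takes a set of visit options (this fragment needs java.util.EnumSet and java.nio.file.FileVisitOption imports):

// Same walk, but following symbolic links this time.
Files.walkFileTree(Paths.get(args[0]),
        EnumSet.of(FileVisitOption.FOLLOW_LINKS), // follow symlinks
        Integer.MAX_VALUE,                        // no depth limit
        visitor);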

That's annoying. The core loop is the same, with no exception being thrown either. I will have a look through the code and add additional logging to try and catch this one.

It's odd, because inside the SearchcodeFileVisitor class, which is responsible for walking the path, the visitFile override explicitly try/catches over all the logic that processes the file, at which point it should print out "indexDocsByPath walkFileTree" due to the following:

Singleton.getLogger().warning("ERROR - caught a " + ex.getClass() + " in " + this.getClass() + " indexDocsByPath walkFileTree with message: " + ex.getMessage() + " for file " + file.toString() + " in path " + file + " in repo " + this.repoResult.getName());

It's possible, however, that it is crashing so hard that the thread cannot even log. I'll add more logging around everything and we can try to track it down.
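
One possibility consistent with that: if the catch clause targets Exception, a JVM Error such as OutOfMemoryError would bypass it entirely, which would explain a crash the thread cannot even log. A hypothetical sketch of the structure (the real class may differ):

// Hypothetical sketch of the visitFile structure described above,
// assuming the catch clause targets Exception rather than Throwable.
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
    try {
        // ... all the logic that reads, classifies and indexes the file ...
    } catch (Exception ex) {
        // This is where "indexDocsByPath walkFileTree" gets logged, but an
        // OutOfMemoryError is an Error, not an Exception, so it would
        // propagate straight past this handler without being logged.
        Singleton.getLogger().warning("... indexDocsByPath walkFileTree ...");
    }
    return FileVisitResult.CONTINUE;
}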

"find . | wc -l" returns 146,863

The second Main.java code yields the same 126,018 lines as the first one.

Very odd. I think there is some sort of hard crash happening on the thread. Maybe an out of memory error or something similar is causing the issue. I will continue to add logging around where everything happens so we can hopefully work out what is causing the issue.

Ok, I have added some additional logging to help with this. If you pull the latest and compile from source again, and this time only add the file repository, it should help track down the issue.

You can grep the logs at info level for your GitHub name, "MaxDiOrio", to find it. It will probably be quite verbose, but what I am hoping to see is that every time it tries to run, it ends up crashing out on a specific file or line repeatedly.

Got one - it wasn't logged with my name, but I got this while running from the console, not as a service. This happened when the job went to queued.

Dec 21, 2017 10:23:45 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/runCreateIdav.bat
Dec 21, 2017 10:23:45 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20111103/createIdav.py
Dec 21, 2017 10:23:46 AM java.util.logging.LogManager$RootLogger log
INFO: Closing writers
[QuartzScheduler_Worker-21] ERROR org.quartz.core.JobRunShell - Job DEFAULT.updateindex-file-0 threw an unhandled Exception:
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuffer.append(StringBuffer.java:367)
        at java.io.BufferedReader.readLine(BufferedReader.java:370)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:176)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:398)
        at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:75)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:272)
        at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[QuartzScheduler_Worker-21] ERROR org.quartz.core.ErrorLogger - Job (DEFAULT.updateindex-file-0 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.OutOfMemoryError: Java heap space]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuffer.append(StringBuffer.java:367)
        at java.io.BufferedReader.readLine(BufferedReader.java:370)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:176)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:398)
        at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:75)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:272)
        at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        ... 1 more

Stopped at the same number of files indexed too - 8936.
And it stops at the exact same file every time. Nothing special about the file; it's just a regular Python file.

Ah, so my guess was correct then: a memory issue. There are a few ways you can resolve this.

Edit the searchcode-server.sh or searchcode-server.bat file and add the Xmx and Xms arguments.

In your case I would set it such that Xmx is about 70% of the available RAM on the system. Note that JVM flags must come before -jar and the value needs a unit suffix, so something like,

exec java -Xmx2048m -jar searchcode-1.3.12.jar "$@"

You can also lower the value of the following properties,

max_document_queue_size=1000
max_document_queue_line_size=1000000

The last one I would consider, if none of the above work, is to set the searchcode.properties property low_memory to true and restart your instance. This should be the method of last resort, as it lowers memory usage at the cost of indexing performance.
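
Putting those together, a hedged example of what the relevant searchcode.properties entries might look like (values are illustrative, not recommendations; comments must sit on their own lines in a properties file):

# Illustrative values only; tune to your system.
# Fewer documents held in memory at once:
max_document_queue_size=500
# Fewer total lines queued for indexing:
max_document_queue_line_size=500000
# Last resort: trades indexing performance for lower memory use:
low_memory=true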

I suspect one of the other fixes, in particular the Xmx value setting, will resolve it. Let me know how it goes.

I was able to replicate the issue by tweaking the Xmx value and queue size, so I'm fairly confident that setting Xmx to a larger value will solve this one for you.

The Xmx setting doesn't seem to do much. I set it to 5735 (there's 8GB of RAM in the server now), and it still bombed with the out of memory error. I still have 5GB of RAM free.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21733 root      20   0 5732668 920776  20524 S 216.9 11.5   4:33.86 java

I just dropped the queue size from 2000 to 1000 and the document queue line size from 2000000 to 1000000, and I'm running again.

Strange thing is I'm now getting a few negative indexing times.

It's still reverting back to queued and pausing at this point, but so far not erroring. I need to head out for now.

Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Memory Usage: free memory: 1,851,496, allocated memory: 1,938,944, max memory: 1,938,944, total free memory: 1,851,496, memory usage: 85MB
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: cleanMissingPathFiles doClean 2 0
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Successfully processed writing index success for artemis-parent
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file ./repo/artemis-parent/prepare_release.cmd
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file ./repo/artemis-parent/tag_release.cmd
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Closing writers

Nothing seems to be helping at all. It always goes back to queued, although this time it is not throwing OOM heap errors after setting it to -Xmx7168m.

I am seeing memory messages scroll by that might be contradictory; I'm not sure what type of memory they're referencing.

INFO: Memory Usage: free memory: 1,886,374, allocated memory: 1,970,176, max memory: 1,970,176, total free memory: 1,886,374, memory usage: 81MB

Spoke too soon:


Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Memory Usage: free memory: 1,235,337, allocated memory: 1,626,624, max memory: 1,780,736, total free memory: 1,389,449, memory usage: 382MB
Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/FutureElectronics/20100910/createIdav.py
Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Element14/20100702/createIdav.py
Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/runCreateIdav.bat
Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20111103/createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/restoreDigikey.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Farnell/20100819/scrapelog.txt
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/output.20110207.txt
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Gates/20090330/createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Festo/20090812/CreateIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Farnell/20100819/updateIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Farnell/20100819/createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Grainger/20090219/grainger.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Grainger/20110517/grainger.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/GlobalAmerican/20091217/GlobalAmerican.createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Gates/legacy/createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Closing writers
[QuartzScheduler_Worker-12] ERROR org.quartz.core.JobRunShell - Job DEFAULT.updateindex-file-0 threw an unhandled Exception:
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuffer.append(StringBuffer.java:367)
        at java.io.BufferedReader.readLine(BufferedReader.java:370)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:176)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:394)
        at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:71)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:268)
        at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[QuartzScheduler_Worker-12] ERROR org.quartz.core.ErrorLogger - Job (DEFAULT.updateindex-file-0 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.OutOfMemoryError: Java heap space]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuffer.append(StringBuffer.java:367)
        at java.io.BufferedReader.readLine(BufferedReader.java:370)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:176)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:394)
        at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:71)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:268)
        at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        ... 1 more

Even with 8GB of RAM, Xmx set to 7GB, the queue settings set to what you have above, and low_memory set to true, it's still giving out of memory errors. This time it stopped at 8517 documents. This feels like a memory leak.

That's worrying. I went through an exercise to solve memory leaks with a very large repository of over 200 GB just a few months ago. Would it be possible for you to quickly try the community edition and see if you get the same issue?

https://searchcode.com/static/searchcode-server-community.tar.gz

This will let me know if it's something that has been introduced since then or if it was always there.

I'll try to replicate this myself. Last question: are you using the OpenJDK, the Oracle JDK, or some other Java runtime?

Same thing with the community edition. That's what we were running to begin with. The strange thing is that once we get the OOM error, it restarts the crawl again. Why can't it just pick up where it left off? And since it restarts the crawl, the OOM error isn't fatal - the memory frees up in what I assume is a garbage collection.

openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

Here's a JMX monitor of the process. It covers the time span from application launch, through indexing, an OOM at approximately 9:38:30, a forced reindex, and another OOM at 9:41:30.

[screenshot: JMX heap usage across both OOMs]

Here's a heap dump taken about 10 seconds before the out of memory.
heapdump-1514299503814.zip

Here I set the min and max heap thinking it may not be able to grow the heap fast enough. Doesn't help. You can see that it's not even remotely coming close to the max heap size when it fails with the OOM error.

[screenshot: heap usage with fixed min and max heap]

I'm just going to keep spamming :) I turned on verbose GC logging and the heap dump on OOM.

[GC (Allocation Failure) [PSYoungGen: 1328283K->27708K(1359360K)] 1383753K->345635K(4155904K), 0.1584173 secs] [Times: user=0.52 sys=0.08, real=0.16 secs]
Dec 26, 2017 10:39:03 AM java.util.logging.LogManager$RootLogger log
INFO: Closing writers
[GC (Allocation Failure) [PSYoungGen: 638348K->3728K(1362432K)] 2005106K->1371038K(4158976K), 0.0120329 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 3728K->3312K(1361408K)] 1371038K->1371046K(4157952K), 0.0134192 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
[Full GC (Allocation Failure) [PSYoungGen: 3312K->0K(1361408K)] [ParOldGen: 1367734K->1071418K(2796544K)] 1371046K->1071418K(4157952K), [Metaspace: 27792K->27788K(1075200K)], 0.1566229 secs] [Times: user=0.41 sys=0.02, real=0.15 secs]
[GC (Allocation Failure) [PSYoungGen: 0K->0K(1361920K)] 1071418K->1071418K(4158464K), 0.0052416 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
[Full GC (Allocation Failure) [PSYoungGen: 0K->0K(1361920K)] [ParOldGen: 1071418K->1069608K(2796544K)] 1071418K->1069608K(4158464K), [Metaspace: 27788K->27588K(1075200K)], 0.1833277 secs] [Times: user=0.59 sys=0.00, real=0.18 secs]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid28910.hprof ...
Heap dump file created [1106770083 bytes in 5.432 secs]
[QuartzScheduler_Worker-16] ERROR org.quartz.core.JobRunShell - Job DEFAULT.updateindex-file-0 threw an unhandled Exception:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuffer.append(StringBuffer.java:367)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:153)
at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:392)
at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:72)
at java.nio.file.Files.walkFileTree(Files.java:2670)
at java.nio.file.Files.walkFileTree(Files.java:2742)
at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:266)
at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[QuartzScheduler_Worker-16] ERROR org.quartz.core.ErrorLogger - Job (DEFAULT.updateindex-file-0 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.OutOfMemoryError: Java heap space]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuffer.append(StringBuffer.java:367)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:153)
at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:392)
at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:72)
at java.nio.file.Files.walkFileTree(Files.java:2670)
at java.nio.file.Files.walkFileTree(Files.java:2742)
at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:266)
at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
... 1 more
^C
Heap
PSYoungGen total 1361920K, used 252478K [0x000000076ab00000, 0x00000007c0000000, 0x00000007c0000000)
eden space 1326080K, 19% used [0x000000076ab00000,0x000000077a18f8c0,0x00000007bba00000)
from space 35840K, 0% used [0x00000007bba00000,0x00000007bba00000,0x00000007bdd00000)
to space 35840K, 0% used [0x00000007bdd00000,0x00000007bdd00000,0x00000007c0000000)
ParOldGen total 2796544K, used 1069608K [0x00000006c0000000, 0x000000076ab00000, 0x000000076ab00000)
object space 2796544K, 38% used [0x00000006c0000000,0x000000070148a048,0x000000076ab00000)
Metaspace used 28119K, capacity 28630K, committed 28928K, reserved 1075200K
class space used 2767K, capacity 2932K, committed 3072K, reserved 1048576K

Analyzing the heap dump for memory leaks says:
The thread org.quartz.simpl.SimpleThreadPool$WorkerThread @ 0x6c00e0820 QuartzScheduler_Worker-16 keeps local variables with total size 1,076,664,392 (98.47%) bytes.
The memory is accumulated in one instance of "char[]" loaded by "<system class loader>".

Here's the OOM Heap Dump.
java_pid28910.zip

Ah I have a theory.

I suspect that there is a large file inside that repository you are trying to index that has no newlines. Because the read pulls the file in line by line (the depth setting controls how many lines are kept), a file with no newlines forces it to buffer one enormous line, so I suspect it loads partially and crashes out. I'll have a play around with this idea over the next few days.

The reason it does not continue is that if it does crash out, it assumes there was an issue and tries again. It works off the timestamp of the last successful run against the files. Reprocessing like this is quite fast, hence why it works this way.

Just an additional note - I changed the number of file processors down to 1, and it still does it, and it stops on a properly formatted Perl script file. Like I said though - each time I remove and re-add the repo, it stops at a different number of indexed files. Still - can't wait to try something else.

In the above, it's this line in the stack trace that intrigues me,

at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:153)

I suspect you are hitting an issue with this portion of code,

public List<String> readFileLinesGuessEncoding(String filePath, int maxFileLineDepth) throws IOException {
        List<String> fileLines = new ArrayList<>();
        BufferedReader bufferedReader = null;
        String line;

        try {
            bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), guessCharset(new File(filePath))));

            int lineCount = 0;
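            // NOTE: readLine() buffers an entire line in memory before
            // returning, so a huge file containing no newlines exhausts the
            // heap here before the maxFileLineDepth check below is reached.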
            while ((line = bufferedReader.readLine()) != null) {
                lineCount++;

                fileLines.add(line);

                if (lineCount == maxFileLineDepth) {
                    return fileLines;
                }
            }
        }
        finally {
            IOUtils.closeQuietly(bufferedReader);
        }

        return fileLines;
    }

I just need to replicate the issue with a file and I'll implement a fix.

So I was able to replicate this. I used the following Python script to create a large file, ~1 GB in size, with no newlines.

with open('no_newlines', 'w') as myfile:
    for _ in range(1000000000):  # ~1 GB, all on a single line
        myfile.write("a")

Running the method against it, I get the following exception :)

java.lang.OutOfMemoryError: Java heap space

	at java.util.Arrays.copyOfRange(Arrays.java:3664)
	at java.lang.String.<init>(String.java:207)
	at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:567)
	at java.nio.CharBuffer.toString(CharBuffer.java:1241)
	at java.util.regex.Matcher.toMatchResult(Matcher.java:250)
	at java.util.Scanner.match(Scanner.java:1294)
	at java.util.Scanner.hasNextLine(Scanner.java:1502)
	at com.searchcode.app.util.Helpers.readFileLines(Helpers.java:155)
	at com.searchcode.app.util.HelpersTest.testReadFileLinesIssue168(HelpersTest.java:35)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at junit.framework.TestCase.runTest(TestCase.java:176)
	at junit.framework.TestCase.runBare(TestCase.java:141)
	at junit.framework.TestResult$1.protect(TestResult.java:122)
	at junit.framework.TestResult.runProtected(TestResult.java:142)
	at junit.framework.TestResult.run(TestResult.java:125)
	at junit.framework.TestCase.run(TestCase.java:129)
	at junit.framework.TestSuite.runTest(TestSuite.java:252)
	at junit.framework.TestSuite.run(TestSuite.java:247)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:86)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

So what needs to happen is that the method needs to be changed to read by bytes rather than by newlines. This is one of those things I considered at the time but figured would be unlikely to cause issues. Anyway, I am in the middle of that now and running through all the usual tests to ensure it works 100% as before. The moment I have merged it into master I'll let you know here so you can try things out again.
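
To illustrate the direction, here is a hedged sketch only (readFileLinesCapped is a hypothetical name; the actual patch lives in the Issue168 branch): cap the number of characters read up front, then split into lines afterwards, so a single enormous line can never exhaust the heap. It assumes the same Helpers class context as the method above, including guessCharset.

    // Hypothetical replacement sketch; the real fix is in the Issue168 branch.
    public List<String> readFileLinesCapped(String filePath, int maxFileLineDepth, int maxChars) throws IOException {
        // Read at most maxChars characters, regardless of line structure.
        StringBuilder contents = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(filePath), guessCharset(new File(filePath)))) {
            char[] buffer = new char[8192];
            int read;
            while (contents.length() < maxChars && (read = reader.read(buffer)) != -1) {
                contents.append(buffer, 0, Math.min(read, maxChars - contents.length()));
            }
        }

        // Split into lines afterwards, honouring the existing line-depth cap.
        List<String> fileLines = new ArrayList<>();
        for (String line : contents.toString().split("\\r?\\n")) {
            fileLines.add(line);
            if (fileLines.size() == maxFileLineDepth) {
                break;
            }
        }

        return fileLines;
    }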

If you would like to get a jump on things, the new code is sitting in the Issue168 branch: https://github.com/boyter/searchcode-server/tree/Issue168

Just working through the usual tests before merging in.

Ok, merged in. All looks alright to me, and there is a new test to cover this case as best I can.

@MaxDiOrio If you are able to build from master and try again that should work for you.

Testing now - I might have found an unrelated bug. On the Admin repository list, the Edit button kicks off a reindex instead of allowing an edit.

                <td>
                    <button class="btn btn-xs btn-danger delete" data-id="VSS" name="delete" type="submit"><span class="glyphicon glyphicon-remove" aria-hidden="true"></span> Delete</button>
                </td>
                <td>
                    <button class="btn btn-xs btn-default reindex" data-id="VSS" name="reindex" type="submit"><span class="glyphicon glyphicon-refresh" aria-hidden="true"></span> Reindex</button>
                </td>
                <td>
                    <button class="btn btn-xs btn-default reindex" data-id="VSS" name="reindex" type="submit"><span class="glyphicon glyphicon-edit" aria-hidden="true"></span> Edit</button>
                </td>

We have a successful index! 100,050 files in 408 seconds.

New issue - I went to implement this on the production side... I deleted the existing repo, went to add the new one, and now I'm getting a 500 Internal Error when I click add repository, but nothing is getting logged that I can see.

Glad to hear the bug is resolved.

Yeah... well, it's master :) As this is a pretty serious issue I will be looking to move the code into a release candidate, so it should be resolved soon. After it runs through the usual tests I'll mark it off as a release. Going to keep this open as a reference till done.

So, looking into it. The Edit button behaviour was just because I was still working on the code to allow editing of repositories; hence it's just a placeholder. The deletion and adding is an issue though. It looks like the repositories are never deleted, which is annoying. Will dig in further.

Editing of repositories... well, the fields that can be edited are in master now. So you should be able to update things.

I have not been able to replicate the deletion/adding issue unless it is running on Windows. It seems Windows locks the files for a long period of time; waiting long enough cleans it up. I suspect that since you are using Linux, it might be that it's taking a long time to clean out the index itself, which is blocking the update.

I will keep trying to replicate the issue though.

Great! Going to close this down and do some shilling then :)

Do please consider buying the fully supported version https://searchcodeserver.com/pricing.html - using the source as you are limits you to 5 users - but I'm not going to push, since you were very helpful in assisting me with the bug. Thanks very much. If not, I'd love to get a testimonial. If you are not comfortable with putting it here, just email me at ben@boyter.org