boyter / searchcode-server

The official home of searchcode-server, where you can run searchcode locally. Note that master is generally unstable in the sense that it is not a release; check the releases page for release versions: https://github.com/boyter/searchcode-server/releases

Home Page: https://searchcodeserver.com/

Index goes to queued not even close to completing crawl

MaxDiOrio opened this issue

Hi,

We have a filesystem based repository with upwards of 60,000 files in it, as well as a dozen or more Git based repositories that are working fine.

The FS based repo, though, won't finish crawling. It goes from Queued to Indexing and back to Queued, never once changing the number of files indexed. It's currently stuck at 8936 and has been for almost 24 hours now. The logs don't show me anything useful.

Any thoughts on how to get past this?

We actually started by indexing this data as a locally mounted CIFS share, and that was very slow. Then I added additional file extensions and reindexed, and it didn't pick up those files. So I wanted to try rsync'ing the repo locally, since that should be significantly faster to index. That brings us to where we are now.

Thanks!

Max

For added info - I built the master branch and it's still doing the same thing.

Here's the tail of the searchcode-server-0.log when the index just stops.

Dec 19, 2017 1:34:52 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/GlobalAmerican/20091217/GlobalAmerican.createIdav.py
Dec 19, 2017 1:34:52 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/output.20110207.txt
Dec 19, 2017 1:34:52 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/runCreateIdav.bat
Dec 19, 2017 1:34:52 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20111103/createIdav.py
Dec 19, 2017 1:34:53 PM java.util.logging.LogManager$RootLogger log
INFO: Closing writers

It takes a 10-minute break, then starts indexing again. This is with the log level at Info.

Hi Max,

So the easiest way to diagnose this would be to get a copy of the file system. I'm going to assume that's not an option, though, and provide some other details.

Looking at the log, it appears to be working as expected, at least up to a certain point. For the moment, can you try changing the following option in the searchcode.properties file and then restarting the service?

log_indexed=true

This should drop a log file into the logs directory with the project name as the filename and .log as the extension. This will allow you to see whether searchcode is even able to see the full 60,000 files. The file should contain an entry for every file it was able to process, whether it indexed or excluded it, and the reasons for this.

If you notice that there aren't 60,000 files in there, you can try tweaking the following value,

follow_links=true

Which should allow searchcode to follow the symlinks and the like. Then repeat the above steps. It is set to false by default as that tends to be more reliable.

My guess at this point is that somewhere in the file system there is a link or something odd that is causing the process to not be able to pick up all files.

The 10-minute break, BTW, is set by the value,

check_filerepo_changes=600

Which controls how often the file process threads run.

Ben

Hey Ben,

Funny thing is, I have log_indexed=true. All other repos generate their logs, except for the filesystem repo I'm having a problem with. Could it be that this log file isn't created until after the first indexing completes?

I have no symlinks in this repo. It's a straight rsync backup from a CIFS share, so symlinks aren't supported. There is nothing but regular files here.

Max

In that case, can you quickly grep the logs for the string "indexDocsByPath walkFileTree"? It would appear that it is crashing out somewhere.

Would you be happy to build from source to get additional logging for this?

Grepping returns nothing.

I built from the git master earlier - so I have no problem building again. Thanks!

Sorry, rubbish internet in Australia. It takes longer to push a simple change than to make it.

02eac6a

That commit adds the relevant log message. I will have a look at improving the logs for all of these right now while waiting for a result from you. Just build from master again and you should be set.

Rebuilt and deployed. So I should be able to grep the searchcode-server-0.log file for IndexFileRepoJob?

IndexFileRepoJob.execute would be the error line we are looking for, but IndexFileRepoJob should work just as well. Hopefully it will not take too long to track down.

So far it's gone between queued and indexing a few times, and no IndexFileRepoJob.execute is being logged. When I first started the index yesterday, it indexed close to 9,000 files in only a few minutes.

Info is the right log level to be on, right?

Each time, just before it goes to queued, the log shows:

Dec 19, 2017 5:16:02 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/runCreateIdav.bat
Dec 19, 2017 5:16:02 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20111103/createIdav.py
Dec 19, 2017 5:16:02 PM java.util.logging.LogManager$RootLogger log
INFO: Closing writers

It sits queued for about 10 minutes then starts up again. I'm wondering what's causing it to pause like this.

The system has 4GB of RAM assigned to it, and about 230MB free. I can add RAM if needed, but I don't really see much memory pressure.

Info will just dump out all of the logs, so in your case, yes. We are looking for a severe error to appear in the logs, though.

This is interesting. It should be dumping out the error at this point. Give me a few minutes; I'm going to pull out some code for you to try against the file location.

OK, could you try the following Java code?

https://gist.github.com/boyter/44c45d9d587120e57f29841b34ca82dc

You just need to run it like so,

javac Main.java
java Main DIRECTORY > output.txt

Replace DIRECTORY with the path to your directory and output.txt with whatever you want the output file to be named. This should allow us to work out whether the problem is in the way the Java walk works or elsewhere.
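
For reference, here is a minimal sketch of what that Main.java likely does (an assumption - the gist itself is the source of truth): walk the tree with Files.walkFileTree and print one line per file visited, so the output line count can be compared against a find.

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class Main {
    public static void main(String[] args) throws IOException {
        // Walk the directory given as the first argument, printing every file visited.
        Files.walkFileTree(Paths.get(args[0]), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                System.out.println(file); // one output line per file
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                System.err.println("failed: " + file); // keep walking on errors
                return FileVisitResult.CONTINUE;
            }
        });
    }
}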

126,018 lines in the output file.

Is that about how many files you have in the directory? I.e., a number similar to what this returns:

find . | wc -l

Would you be able to compare it against this version?

https://gist.github.com/boyter/5b5d5fc9ceec3f2ab78b163fbb27a30f

This one includes the options to follow links.
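
Assuming the same visitor as in the sketch above, the relevant difference is the walkFileTree overload that takes a set of visit options (this fragment needs java.util.EnumSet and java.nio.file.FileVisitOption imports):

// Same walk, but following symbolic links this time.
Files.walkFileTree(Paths.get(args[0]),
        EnumSet.of(FileVisitOption.FOLLOW_LINKS), // follow symlinks
        Integer.MAX_VALUE,                        // no depth limit
        visitor);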

That's annoying. The core loop is the same, with no exception being thrown either. I will have a look through the code and add additional logging to try and catch this one.

It's odd, because inside the SearchcodeFileVisitor class, which is responsible for walking the path, the visitFile override explicitly try/catches over all the logic that processes the file, at which point it should print out "indexDocsByPath walkFileTree" due to the following:

Singleton.getLogger().warning("ERROR - caught a " + ex.getClass() + " in " + this.getClass() + " indexDocsByPath walkFileTree with message: " + ex.getMessage() + " for file " + file.toString() + " in path " + file + " in repo " + this.repoResult.getName());

It's possible, however, that it is crashing so hard that the thread cannot even log. I'll add more logging around everything and we can try to track it down.
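
One possibility consistent with that: if the catch clause targets Exception, a JVM Error such as OutOfMemoryError would bypass it entirely, which would explain a crash the thread cannot even log. A hypothetical sketch of the structure (the real class may differ):

// Hypothetical sketch of the visitFile structure described above,
// assuming the catch clause targets Exception rather than Throwable.
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
    try {
        // ... all the logic that reads, classifies and indexes the file ...
    } catch (Exception ex) {
        // This is where "indexDocsByPath walkFileTree" gets logged, but an
        // OutOfMemoryError is an Error, not an Exception, so it would
        // propagate straight past this handler without being logged.
        Singleton.getLogger().warning("... indexDocsByPath walkFileTree ...");
    }
    return FileVisitResult.CONTINUE;
}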

"find . | wc -l" returns 146,863

The second Main.java code yields the same 126,018 lines as the first one.

Very odd. I think there is some sort of hard crash happening on the thread. Maybe an out of memory error or something similar is causing the issue. I will continue to add logging around where everything happens so we can hopefully work out what is causing the issue.

Ok, I have added some additional logging to help with this. If you pull the latest and compile from source again, and this time only add the file repository, it should help track down the issue.

You can grep the logs at info level for your GitHub name, "MaxDiOrio", to find it. It will probably be quite verbose, but what I am hoping to see is that every time it tries to run, it ends up crashing out on a specific file or line repeatedly.

Got one - it wasn't logged with my name, but I got this while running from the console, not as a service. This happened when the job went to queued.

Dec 21, 2017 10:23:45 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/runCreateIdav.bat
Dec 21, 2017 10:23:45 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20111103/createIdav.py
Dec 21, 2017 10:23:46 AM java.util.logging.LogManager$RootLogger log
INFO: Closing writers
[QuartzScheduler_Worker-21] ERROR org.quartz.core.JobRunShell - Job DEFAULT.updateindex-file-0 threw an unhandled Exception:
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuffer.append(StringBuffer.java:367)
        at java.io.BufferedReader.readLine(BufferedReader.java:370)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:176)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:398)
        at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:75)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:272)
        at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[QuartzScheduler_Worker-21] ERROR org.quartz.core.ErrorLogger - Job (DEFAULT.updateindex-file-0 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.OutOfMemoryError: Java heap space]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuffer.append(StringBuffer.java:367)
        at java.io.BufferedReader.readLine(BufferedReader.java:370)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:176)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:398)
        at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:75)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:272)
        at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        ... 1 more

Stopped at the same number of files indexed too - 8936.
And it stops at the exact same file every time. Nothing special about the file; it's just a regular Python file.

Ah, so my guess was correct then: a memory issue. There are a few ways you can resolve this.

Edit the searchcode-server.sh or searchcode-server.bat file and add the Xmx and Xms arguments.

In your case I would set it such that Xmx is about 70% of the available RAM on the system. Note that JVM flags must come before -jar and the value needs a unit suffix, so something like,

exec java -Xmx2048m -jar searchcode-1.3.12.jar "$@"

You can also lower the value of the following properties,

max_document_queue_size=1000
max_document_queue_line_size=1000000

The last one I would consider, if none of the above work, is to set the searchcode.properties property low_memory to true and restart your instance. This should be the method of last resort, as it lowers memory usage at the cost of indexing performance.
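
Putting those together, a hedged example of what the relevant searchcode.properties entries might look like (values are illustrative, not recommendations; comments must sit on their own lines in a properties file):

# Illustrative values only; tune to your system.
# Fewer documents held in memory at once:
max_document_queue_size=500
# Fewer total lines queued for indexing:
max_document_queue_line_size=500000
# Last resort: trades indexing performance for lower memory use:
low_memory=true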

I suspect one of the other fixes, in particular the Xmx value setting, will resolve it. Let me know how it goes.

I was able to replicate the issue by tweaking the Xmx value and queue size, so I'm fairly confident that setting Xmx to a larger value will solve this one for you.

The Xmx setting doesn't seem to do much. I set it to 5735 (there's 8GB of RAM in the server now), and it still bombed with the out of memory error. I still have 5GB of RAM free.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21733 root      20   0 5732668 920776  20524 S 216.9 11.5   4:33.86 java

I just dropped the queue size from 2000 to 1000 and the document queue line size from 2000000 to 1000000, and I'm running again.

Strange thing is I'm now getting a few negative indexing times.

It's still reverting back to queued and pausing at this point, but so far not erroring. I need to head out for now.

Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Memory Usage: free memory: 1,851,496, allocated memory: 1,938,944, max memory: 1,938,944, total free memory: 1,851,496, memory usage: 85MB
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: cleanMissingPathFiles doClean 2 0
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Successfully processed writing index success for artemis-parent
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file ./repo/artemis-parent/prepare_release.cmd
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Indexing file ./repo/artemis-parent/tag_release.cmd
Dec 21, 2017 5:06:31 PM java.util.logging.LogManager$RootLogger log
INFO: Closing writers

Nothing seems to be helping at all. It always goes back to queued, although this time it is not throwing OOM heap errors after setting it to -Xmx7168m.

I am seeing memory messages scroll by that might be contradictory; I'm not sure what type of memory they're referencing.

INFO: Memory Usage: free memory: 1,886,374, allocated memory: 1,970,176, max memory: 1,970,176, total free memory: 1,886,374, memory usage: 81MB

Spoke too soon:


Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Memory Usage: free memory: 1,235,337, allocated memory: 1,626,624, max memory: 1,780,736, total free memory: 1,389,449, memory usage: 382MB
Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/FutureElectronics/20100910/createIdav.py
Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Element14/20100702/createIdav.py
Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/runCreateIdav.bat
Dec 22, 2017 10:34:14 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20111103/createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/restoreDigikey.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Farnell/20100819/scrapelog.txt
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Digikey/20110502/output.20110207.txt
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Gates/20090330/createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Festo/20090812/CreateIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Farnell/20100819/updateIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Farnell/20100819/createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Grainger/20090219/grainger.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Grainger/20110517/grainger.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/GlobalAmerican/20091217/GlobalAmerican.createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Indexing file /code/vssshadow/Production/ContentOps/Python/datafeed/Gates/legacy/createIdav.py
Dec 22, 2017 10:34:15 AM java.util.logging.LogManager$RootLogger log
INFO: Closing writers
[QuartzScheduler_Worker-12] ERROR org.quartz.core.JobRunShell - Job DEFAULT.updateindex-file-0 threw an unhandled Exception:
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuffer.append(StringBuffer.java:367)
        at java.io.BufferedReader.readLine(BufferedReader.java:370)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:176)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:394)
        at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:71)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:268)
        at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[QuartzScheduler_Worker-12] ERROR org.quartz.core.ErrorLogger - Job (DEFAULT.updateindex-file-0 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.OutOfMemoryError: Java heap space]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuffer.append(StringBuffer.java:367)
        at java.io.BufferedReader.readLine(BufferedReader.java:370)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:176)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:394)
        at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:71)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:268)
        at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        ... 1 more

Even with 8GB of RAM, Xmx set to 7GB, the queue settings set to what you have above, and low_memory set to true, it's still giving out of memory errors. This time it stopped at 8517 documents. This feels like a memory leak.

That's worrying. I went through an exercise to solve memory leaks with a very large repository of over 200 GB just a few months ago. Would it be possible for you to quickly try the community edition and see if you get the same issue?

https://searchcode.com/static/searchcode-server-community.tar.gz

This will let me know if it's something that has been introduced since then or if it was always there.

I'll try to replicate this myself. Last question: are you using the OpenJDK, the Oracle JDK, or some other Java runtime?

Same thing with the community edition. That's what we were running to begin with. The strange thing is that once we get the OOM error, it restarts the crawl again. Why can't it just pick up where it left off? And since it restarts the crawl, the OOM error isn't fatal - the memory frees up in what I assume is a garbage collection.

openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

Here's a JMX monitor of the process. It covers the time span from application launch, through indexing, an OOM at approximately 9:38:30, a forced reindex, and another OOM at 9:41:30.

[screenshot: JMX heap usage across both OOMs]

Here's a heap dump taken about 10 seconds before the out of memory.
heapdump-1514299503814.zip

Here I set the min and max heap thinking it may not be able to grow the heap fast enough. Doesn't help. You can see that it's not even remotely coming close to the max heap size when it fails with the OOM error.

[screenshot: heap usage with fixed min and max heap]

I'm just going to keep spamming :) I turned on verbose GC logging and the heap dump on OOM.

[GC (Allocation Failure) [PSYoungGen: 1328283K->27708K(1359360K)] 1383753K->345635K(4155904K), 0.1584173 secs] [Times: user=0.52 sys=0.08, real=0.16 secs]
Dec 26, 2017 10:39:03 AM java.util.logging.LogManager$RootLogger log
INFO: Closing writers
[GC (Allocation Failure) [PSYoungGen: 638348K->3728K(1362432K)] 2005106K->1371038K(4158976K), 0.0120329 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 3728K->3312K(1361408K)] 1371038K->1371046K(4157952K), 0.0134192 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
[Full GC (Allocation Failure) [PSYoungGen: 3312K->0K(1361408K)] [ParOldGen: 1367734K->1071418K(2796544K)] 1371046K->1071418K(4157952K), [Metaspace: 27792K->27788K(1075200K)], 0.1566229 secs] [Times: user=0.41 sys=0.02, real=0.15 secs]
[GC (Allocation Failure) [PSYoungGen: 0K->0K(1361920K)] 1071418K->1071418K(4158464K), 0.0052416 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
[Full GC (Allocation Failure) [PSYoungGen: 0K->0K(1361920K)] [ParOldGen: 1071418K->1069608K(2796544K)] 1071418K->1069608K(4158464K), [Metaspace: 27788K->27588K(1075200K)], 0.1833277 secs] [Times: user=0.59 sys=0.00, real=0.18 secs]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid28910.hprof ...
Heap dump file created [1106770083 bytes in 5.432 secs]
[QuartzScheduler_Worker-16] ERROR org.quartz.core.JobRunShell - Job DEFAULT.updateindex-file-0 threw an unhandled Exception:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuffer.append(StringBuffer.java:367)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:153)
at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:392)
at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:72)
at java.nio.file.Files.walkFileTree(Files.java:2670)
at java.nio.file.Files.walkFileTree(Files.java:2742)
at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:266)
at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[QuartzScheduler_Worker-16] ERROR org.quartz.core.ErrorLogger - Job (DEFAULT.updateindex-file-0 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.OutOfMemoryError: Java heap space]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuffer.append(StringBuffer.java:367)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:153)
at com.searchcode.app.jobs.repository.IndexBaseRepoJob.getCodeLines(IndexBaseRepoJob.java:392)
at com.searchcode.app.jobs.repository.SearchcodeFileVisitor.visitFile(SearchcodeFileVisitor.java:72)
at java.nio.file.Files.walkFileTree(Files.java:2670)
at java.nio.file.Files.walkFileTree(Files.java:2742)
at com.searchcode.app.jobs.repository.IndexBaseRepoJob.indexDocsByPath(IndexBaseRepoJob.java:266)
at com.searchcode.app.jobs.repository.IndexFileRepoJob.execute(IndexFileRepoJob.java:89)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
... 1 more
^C
Heap
PSYoungGen total 1361920K, used 252478K [0x000000076ab00000, 0x00000007c0000000, 0x00000007c0000000)
eden space 1326080K, 19% used [0x000000076ab00000,0x000000077a18f8c0,0x00000007bba00000)
from space 35840K, 0% used [0x00000007bba00000,0x00000007bba00000,0x00000007bdd00000)
to space 35840K, 0% used [0x00000007bdd00000,0x00000007bdd00000,0x00000007c0000000)
ParOldGen total 2796544K, used 1069608K [0x00000006c0000000, 0x000000076ab00000, 0x000000076ab00000)
object space 2796544K, 38% used [0x00000006c0000000,0x000000070148a048,0x000000076ab00000)
Metaspace used 28119K, capacity 28630K, committed 28928K, reserved 1075200K
class space used 2767K, capacity 2932K, committed 3072K, reserved 1048576K

Analyzing the heap dump for memory leaks says:
The thread org.quartz.simpl.SimpleThreadPool$WorkerThread @ 0x6c00e0820 QuartzScheduler_Worker-16 keeps local variables with total size 1,076,664,392 (98.47%) bytes.
The memory is accumulated in one instance of "char[]" loaded by "<system class loader>".

Here's the OOM Heap Dump.
java_pid28910.zip

Ah I have a theory.

I suspect that there is a large file inside that repository you are trying to index that has no newlines. Because the read pulls the file in line by line (the depth setting controls how many lines are kept), a file with no newlines forces it to buffer one enormous line, so I suspect it loads partially and crashes out. I'll have a play around with this idea over the next few days.

The reason it does not continue is that if it does crash out, it assumes there was an issue and tries again. It works off the timestamp of the last successful run against the files. Reprocessing like this is quite fast, hence why it works this way.

Just an additional note - I changed the number of file processors down to 1, and it still does it, and it stops on a properly formatted Perl script file. Like I said though - each time I remove and re-add the repo, it stops at a different number of indexed files. Still - can't wait to try something else.

In the above, it's this line in the stack trace that intrigues me,

at com.searchcode.app.util.Helpers.readFileLinesGuessEncoding(Helpers.java:153)

I suspect you are hitting an issue with this portion of code,

public List<String> readFileLinesGuessEncoding(String filePath, int maxFileLineDepth) throws IOException {
        List<String> fileLines = new ArrayList<>();
        BufferedReader bufferedReader = null;
        String line;

        try {
            bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), guessCharset(new File(filePath))));

            int lineCount = 0;
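            // NOTE: readLine() buffers an entire line in memory before
            // returning, so a huge file containing no newlines exhausts the
            // heap here before the maxFileLineDepth check below is reached.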
            while ((line = bufferedReader.readLine()) != null) {
                lineCount++;

                fileLines.add(line);

                if (lineCount == maxFileLineDepth) {
                    return fileLines;
                }
            }
        }
        finally {
            IOUtils.closeQuietly(bufferedReader);
        }

        return fileLines;
    }

I just need to replicate the issue with a file and I'll implement a fix.

So I was able to replicate this. I used the following Python script to create a large file, ~1 GB in size, with no newlines.

with open('no_newlines', 'w') as myfile:
    for _ in range(1000000000):  # ~1 GB, all on a single line
        myfile.write("a")

Running the method against it, I get the following exception :)

java.lang.OutOfMemoryError: Java heap space

	at java.util.Arrays.copyOfRange(Arrays.java:3664)
	at java.lang.String.<init>(String.java:207)
	at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:567)
	at java.nio.CharBuffer.toString(CharBuffer.java:1241)
	at java.util.regex.Matcher.toMatchResult(Matcher.java:250)
	at java.util.Scanner.match(Scanner.java:1294)
	at java.util.Scanner.hasNextLine(Scanner.java:1502)
	at com.searchcode.app.util.Helpers.readFileLines(Helpers.java:155)
	at com.searchcode.app.util.HelpersTest.testReadFileLinesIssue168(HelpersTest.java:35)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at junit.framework.TestCase.runTest(TestCase.java:176)
	at junit.framework.TestCase.runBare(TestCase.java:141)
	at junit.framework.TestResult$1.protect(TestResult.java:122)
	at junit.framework.TestResult.runProtected(TestResult.java:142)
	at junit.framework.TestResult.run(TestResult.java:125)
	at junit.framework.TestCase.run(TestCase.java:129)
	at junit.framework.TestSuite.runTest(TestSuite.java:252)
	at junit.framework.TestSuite.run(TestSuite.java:247)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:86)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

So what needs to happen is that the method needs to be changed to read by bytes rather than by newlines. This is one of those things I considered at the time but figured would be unlikely to cause issues. Anyway, I am in the middle of that now and running through all the usual tests to ensure it works 100% as before. The moment I have merged it into master I'll let you know here so you can try things out again.
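
To illustrate the direction, here is a hedged sketch only (readFileLinesCapped is a hypothetical name; the actual patch lives in the Issue168 branch): cap the number of characters read up front, then split into lines afterwards, so a single enormous line can never exhaust the heap. It assumes the same Helpers class context as the method above, including guessCharset.

    // Hypothetical replacement sketch; the real fix is in the Issue168 branch.
    public List<String> readFileLinesCapped(String filePath, int maxFileLineDepth, int maxChars) throws IOException {
        // Read at most maxChars characters, regardless of line structure.
        StringBuilder contents = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(filePath), guessCharset(new File(filePath)))) {
            char[] buffer = new char[8192];
            int read;
            while (contents.length() < maxChars && (read = reader.read(buffer)) != -1) {
                contents.append(buffer, 0, Math.min(read, maxChars - contents.length()));
            }
        }

        // Split into lines afterwards, honouring the existing line-depth cap.
        List<String> fileLines = new ArrayList<>();
        for (String line : contents.toString().split("\\r?\\n")) {
            fileLines.add(line);
            if (fileLines.size() == maxFileLineDepth) {
                break;
            }
        }

        return fileLines;
    }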

If you would like to get a jump on things, the new code is sitting in the Issue168 branch: https://github.com/boyter/searchcode-server/tree/Issue168

Just working through the usual tests before merging in.

Ok, merged in. All looks alright to me, and there is a new test to cover this case as best I can.

@MaxDiOrio If you are able to build from master and try again that should work for you.

Testing now - I might have found an unrelated bug. On the Admin repository list, the Edit button kicks off a reindex instead of allowing an edit.

                <td>
                    <button class="btn btn-xs btn-danger delete" data-id="VSS" name="delete" type="submit"><span class="glyphicon glyphicon-remove" aria-hidden="true"></span> Delete</button>
                </td>
                <td>
                    <button class="btn btn-xs btn-default reindex" data-id="VSS" name="reindex" type="submit"><span class="glyphicon glyphicon-refresh" aria-hidden="true"></span> Reindex</button>
                </td>
                <td>
                    <button class="btn btn-xs btn-default reindex" data-id="VSS" name="reindex" type="submit"><span class="glyphicon glyphicon-edit" aria-hidden="true"></span> Edit</button>
                </td>

We have a successful index! 100,050 files in 408 seconds.

New issue - I went to implement this on the production side... I deleted the existing repo, went to add the new one, and now I'm getting a 500 Internal Error when I click add repository, but nothing is getting logged that I can see.

Glad to hear the bug is resolved.

Yeah... well, it's master :) As this is a pretty serious issue I will be looking to move the code into a release candidate, so it should be resolved soon. After it runs through the usual tests I'll mark it off as a release. Going to keep this open as a reference till done.

So, looking into it. The Edit button behaviour was just because I was still working on the code to allow editing of repositories; hence it's just a placeholder. The deletion and adding is an issue though. It looks like the repositories are never deleted, which is annoying. Will dig in further.

Editing of repositories... well, the fields that can be edited are in master now. So you should be able to update things.

I have not been able to replicate the deletion/adding issue unless it is running on Windows. It seems Windows locks the files for a long period of time; waiting long enough cleans it up. I suspect that since you are using Linux, it might be that it's taking a long time to clean out the index itself, which is blocking the update.

I will keep trying to replicate the issue though.

Great! Going to close this down and do some shilling then :)

Do please consider buying the fully supported version https://searchcodeserver.com/pricing.html - using the source as you are limits you to 5 users - but I'm not going to push, since you were very helpful in assisting me with the bug. Thanks very much. If not, I'd love to get a testimonial. If you are not comfortable with putting it here, just email me at ben@boyter.org