boyter / searchcode-server

The official home of searchcode-server, where you can run searchcode locally. Note that master is generally unstable in the sense that it is not a release. Check releases for release versions: https://github.com/boyter/searchcode-server/releases

Home Page: https://searchcodeserver.com/

parser gets paused when low_memory=false

quasarea opened this issue · comments

Indexing seems much faster with low_memory=false, but it makes indexing suspend and I am not sure how to push it any further.

In log I can see:
INFO: Mon Mar 20 13:10:35 GMT 2017: codeIndexLinesCount 1148817 larger then 1000000
INFO: Mon Mar 20 13:10:32 GMT 2017: Pausing parser.

There is no such case with low_memory=true.

Also, with low_memory=true CPU usage is a steady 25% the whole time (one of 4 cores I guess), but with low_memory=false it stops using CPU at all when the buffer hits the limit.

So if you have low_memory=false then what happens is that searchcode adds processed files to a queue which is occasionally flushed to disk in a single transaction. When low_memory=true each file is written to disk as it is processed, which is slower (lots more disk IO) but, since there is no buffering, uses less memory.
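To make the difference concrete, here is a minimal sketch (illustrative only; the class, method and threshold names are made up and this is not the actual searchcode code):

// Illustrative sketch of the buffering difference between the two modes.
import java.util.ArrayList;
import java.util.List;

public class BufferingSketch {
    static final boolean LOW_MEMORY = false;  // searchcode's low_memory property
    static final int FLUSH_THRESHOLD = 1000;  // roughly max_document_queue_size
    static final List<String> queue = new ArrayList<>();

    static void writeToDisk(List<String> docs) {
        // stand-in for a single index transaction
        System.out.println("flushing " + docs.size() + " documents");
    }

    static void process(String doc) {
        if (LOW_MEMORY) {
            writeToDisk(List.of(doc));               // one write per file: slower, less memory
        } else {
            queue.add(doc);                          // buffer in memory
            if (queue.size() >= FLUSH_THRESHOLD) {
                writeToDisk(new ArrayList<>(queue)); // occasional batched write
                queue.clear();
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 2500; i++) process("file" + i);
        if (!queue.isEmpty()) writeToDisk(queue);    // final flush
    }
}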

The indexing is currently limited to a single core so that is why you are seeing a pegged 25% (#34). The queue needs to be in place because with enough repository processors or complex files the indexer can lag behind the number of things to index.

The lines in the log you are referring to show the inbuilt buffering portion of searchcode. You can however tweak these values through the properties file.

max_document_queue_size=1000
max_document_queue_line_size=100000
max_file_line_depth=10000

The first is the maximum number of files that will be added to the queue before the indexer pauses. By default it is set to 1,000 files.

The second is the number of lines in those files. A count is kept of the number of lines that are in each file in the queue and when the sum of them is over the value the indexer is paused. NB this is a soft cap that can be breached a few times before the pause kicks in.

The last is the number of lines that searchcode will read from a file for indexing. Generally files over 10,000 lines are not that useful for search results and just slow things down. This value dictates how much the soft cap of max_document_queue_line_size can be breached.
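Putting the three values together, a rough sketch of how the soft cap behaves (illustrative only; the names are invented and this is not the real implementation):

// Rough sketch of the line-count soft cap.
public class SoftCapSketch {
    static final int MAX_QUEUE_LINE_SIZE = 100_000; // max_document_queue_line_size
    static final int MAX_FILE_LINE_DEPTH = 10_000;  // max_file_line_depth
    static int queuedLines = 0;                     // roughly codeIndexLinesCount

    // the parser checks this before adding another file to the queue
    static boolean parserShouldPause() {
        return queuedLines > MAX_QUEUE_LINE_SIZE;
    }

    // each file contributes at most MAX_FILE_LINE_DEPTH lines, which is why the
    // cap is "soft": it can be overshot by up to that amount per file
    static void queueFile(int linesInFile) {
        queuedLines += Math.min(linesInFile, MAX_FILE_LINE_DEPTH);
    }

    // the indexer decrements the count as it writes files out of the queue
    static void fileIndexed(int linesInFile) {
        queuedLines -= Math.min(linesInFile, MAX_FILE_LINE_DEPTH);
    }

    public static void main(String[] args) {
        queueFile(95_000);                                // capped to 10,000 lines
        for (int i = 0; i < 10; i++) queueFile(9_500);
        System.out.println(queuedLines + " pause=" + parserShouldPause()); // 105000 pause=true
        fileIndexed(9_500);
        System.out.println(queuedLines + " pause=" + parserShouldPause()); // 95500 pause=false
    }
}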

If you have enough memory on the machine (and this is a case by case basis) you can increase the first two to have a larger buffer. When the indexer is busy and falling behind, that is usually because it ran into a bunch of complex/large files which cause it to do more work. Once it runs into a batch of files it can work through quickly it catches up, indexing larger chunks at a time.

BTW the number of files that the indexer will work on at any one time is 1000

if (count >= 1000) { // Only index 1000 documents at most each time
and is something that should probably be made configurable.

This is now configurable through the properties file with this change 1810b05

Let me know if you need more information on this one, otherwise feel free to close it down.

Going to close this as the question has been answered.

I still have this issue with initial indexing, using only file repo and settings like:

number_file_processors=10
spelling_corrector_size=10000
index_queue_batch_size=1000
max_document_queue_size=1000
max_document_queue_line_size=100000
max_file_line_depth=10000

For two days it has been the same:

May 02, 2017 2:33:28 PM java.util.logging.LogManager$RootLogger log
INFO: codeIndexLinesCount 120706 larger then 100000
May 02, 2017 2:33:28 PM java.util.logging.LogManager$RootLogger log
INFO: Pausing parser.

It seems to be stuck at 120706 and does not proceed any further; the only solution I have found is to do the initial indexing with low_memory=true.

Not sure what more I could provide to make it possible to narrow down the cause.

That is indeed odd. The only thing I can think of is that somehow the parser is choking on a piece of code and hitting the heap issue. Can you check the console output and see if you have a heap exception thrown? This would explain why it is stuck.

If that is indeed the case you will either need to increase the amount of RAM that searchcode can get access to via the Xmx setting, or use low_memory=true.
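For example, if you launch the jar directly (the jar name here is only a placeholder; use whatever your release ships with, or edit the bundled start script instead):

java -Xmx2g -jar searchcode-server.jar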

Log is on INFO level and that is the only message I'm getting (one set every second or so)

So it is running then. That is indeed odd. I am going to take a guess here that getting a copy of the code to run this against is impossible. I will have a look over the logic and see if there is anything that can be done.

For the moment you can try using low_memory=true to get past it.

Would you be willing to try out a special build?

Can you try increasing the amount of RAM available to searchcode using Xmx? No idea what your current setting is, but if you set it to 2G that may resolve the issue.

Sure, just make a branch and I can run it.

low_memory=true does the trick; it adds a few files every second, while without it no progress was visible at all after the initial ~1000 files.

Xmx is already set to 8G, using 64-bit Java, and as far as I can tell it never used more than 6G.

Is this using a GIT, SVN or File repository type?

OK cool.

Can you build from the following branch https://github.com/boyter/searchcode-server/tree/issue85

Then start fresh and run it again. Make sure the logs are set to INFO level. What we are interested in is the additional logging in the index thread, b53dc62

We can grep the logs for "issue85" which will indicate which is the last file it ran into issues with and hopefully allow me to replicate whatever it is doing that is causing things to choke.

Seems to be doing its normal job (repeating this procedure 89 times):

INFO: issue85 Indexing file ${FullFilePath}
INFO: issue85 adding to spelling corrector
INFO: issue85 end adding to spelling corrector
INFO: issue85 getFileName codeCleanPipeline
INFO: issue85 end getFileName codeCleanPipeline
INFO: issue85 getFileName splitKeywords pipeline
INFO: issue85 end getFileName splitKeywords pipeline
INFO: issue85 getContents splitKeywords
INFO: issue85 end getContents splitKeywords
INFO: issue85 getContents codeCleanPipeline
INFO: issue85 end getContents codeCleanPipeline
INFO: issue85 getContents findInterestingKeywords
INFO: issue85 end getContents findInterestingKeywords
INFO: issue85 getContents findInterestingCharacters
INFO: issue85 end getContents findInterestingCharacters
INFO: issue85 updateDocument
INFO: issue85 end updateDocument

up to a point; after indexing 89 files it goes nuts, and no issue85 lines appear anymore

INFO: Closing writers
INFO: File Indexer Indexing
... traversing through some binary files
INFO: codeIndexLinesCount 104546 larger then 100000
INFO: Pausing parser.
... another binary file: Appears to be binary will not index appears in extension blacklist
INFO: codeIndexLinesCount 104782 larger then 100000
INFO: Pausing parser.
... the cycle has begun

I can confirm that it does parallel indexing, using different file repos (M:\ and C:\ appear), so I would say it is not a dodgy file but rather the indexer actually choking; maybe it has something to do with me having more than one file repo? You could make the indexer more verbose, but I don't think it will help us much. Also this 89 files is a different number than last time, which was more like 1005, and I reverted those settings to the defaults:

spelling_corrector_size=10000
index_queue_batch_size=1000
max_document_queue_size=1000
max_document_queue_line_size=100000
max_file_line_depth=10000

I used 10x those values previously.

There are no new, unfinished INFO: issue85 Indexing file ${FullFilePath} lines at all

Would it be possible for you to report on what the last file you see is before the problem starts? I.e. have a look at the file that appears in the very last

INFO: issue85 Indexing file ${FullFilePath}

and see if there is anything unusual about it.

Also can you set the following property,

number_file_processors=1

This will ensure that only a single thread will be processing the file repositories. Preferably do 1 file repository at a time, waiting each time, or if you are able to identify which file it is from the start, just that file.

Limiting to a single processor solves this issue; hope it will give you some idea of where it can be located (still over 30 file repos).

That is an interesting result. A single file processor should make no difference, I only suggested it because it would make sure the logs were clean.

How many were you running? Could you try pulling the latest build of branch issue85 and trying again please? This has a tweak to the file walker which may resolve the issue in full for you. You will need to ensure though that the searchcode.properties file has the value

follow_links=false

This is the default, but don't override it to true.

I was running 10. Maybe it's just too much? Anyway, with 10 it takes a few seconds to replicate this issue.

Again there is a cycle:

INFO: Indexing file ${FILEa}
INFO: issue85 Indexing file ${FILEa}
INFO: issue85 adding to spelling corrector
INFO: issue85 end adding to spelling corrector
INFO: issue85 getFileName codeCleanPipeline
INFO: issue85 end getFileName codeCleanPipeline
INFO: issue85 getFileName splitKeywords pipeline
INFO: issue85 end getFileName splitKeywords pipeline
INFO: issue85 getContents splitKeywords
INFO: issue85 end getContents splitKeywords
INFO: issue85 getContents codeCleanPipeline
INFO: issue85 end getContents codeCleanPipeline
INFO: issue85 getContents findInterestingKeywords
INFO: issue85 end getContents findInterestingKeywords
INFO: issue85 getContents findInterestingCharacters
INFO: issue85 end getContents findInterestingCharacters
INFO: issue85 updateDocument
INFO: issue85 end updateDocument

And then about 50 lines like:

INFO: File Indexer Indexing ${REPOa}
INFO: cleanMissingPathFiles doClean 0 0
INFO: cleanMissingPathFiles doClean 0 0
INFO: File Indexer Indexing ${REPOb}
INFO: File Indexer Indexing ${REPOc}
INFO: Unable to guess encoding type or file is empty ${FILE_x}
INFO: Appears to be minified will not index  ${FILE_y}
INFO: Appears to be binary will not index appears in extension blacklist ${FILE_z}

To finally start producing only this output:

INFO: codeIndexLinesCount 100004 larger than 100000
INFO: Pausing parser.

So the Pausing parser output is to be expected.

I think I know what the issue may be based on the above though. I suspect that the codeIndexLinesCount is not being reset when it is finding binary or minified files in this case. I will have a look to see if this is the cause.

So the issue is indeed that codeIndexLinesCount is not being reset for some reason when there are multiple threads for the parsers. Odd because the calls to modify this state are synchronized. I have a feeling that the values being used to set and unset this value in this case are not in sync, and somehow it ends up not being able to ever reduce the value below 100000 which blocks the indexing.
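As a rough illustration of that failure mode (simplified, not the actual code): if the amount the parser adds for a file is not exactly what the indexer later subtracts for it, the counter drifts upward until it can never fall back under the threshold and the parser stays paused.

// Illustrative sketch of how an add/subtract mismatch can wedge the counter
// above the pause threshold, even though every call is synchronized.
public class CounterDriftSketch {
    static final int PAUSE_THRESHOLD = 100_000;
    static int codeIndexLinesCount = 0;

    static synchronized void parserAdds(int lines)     { codeIndexLinesCount += lines; }
    static synchronized void indexerRemoves(int lines) { codeIndexLinesCount -= lines; }
    static synchronized boolean parserPaused()         { return codeIndexLinesCount > PAUSE_THRESHOLD; }

    public static void main(String[] args) {
        // if the figure added per file disagrees with the figure removed per file,
        // the counter leaks a little on every file
        for (int i = 0; i < 5_000; i++) {
            parserAdds(50);      // e.g. capped/estimated line count
            indexerRemoves(25);  // e.g. lines actually indexed after filtering
        }
        // the counter is now stuck above the threshold even though every file was indexed
        System.out.println(codeIndexLinesCount + " paused=" + parserPaused());
    }
}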

I am going to add a button to allow resetting of this value from within the admin page for this.

Would it be possible for you to set the value

max_document_queue_line_size

in the properties file to something larger? Say 1000000. If you have a lot of RAM this should have little impact on how things are running, and should allow it to get past your issue for the moment.
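That is, in searchcode.properties:

max_document_queue_line_size=1000000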

I am going to scrutinize the code in question and add some more tests for it.

I can survive with low_memory = true for the time being (as I did till now).
Not sure if this is better than single indexing thread or large max_document_queue_line_size.
At least you are on the right track and I'm not crazy ;p

It's one of those things that I am slowly decoupling over time while adding more test coverage. As with all software projects it has evolved over time and the internal design needs to change. Once I have the refactor in place I should be able to resolve a lot of these issues (crosses fingers).

Hoping that this is resolved by #110 and #129, however I have added the button to reset this anyway in commit 32bf717

Should mention that the value is now displayed on the admin page as well, so it's possible to identify this issue without having to look in the logs.

2017-09-11 15_23_43-admin

Seems like no good news; with 10 file indexers and default values for the memory related options it gets paused after a minute or two and never recovers.

sqlite_file=searchcode.sqlite
server_port=8080
index_location=../index/
facets_location=../index/facet/
repository_location=./repo/
trash_location=../trash/
check_repo_chages=14400
check_filerepo_changes=14400
spelling_corrector_size=10000
max_document_queue_size=1000
max_document_queue_line_size=100000
max_file_line_depth=10000
only_localhost=false
low_memory=false
git_binary_path=C:/Program Files/Git/cmd/git.exe
use_system_git=false
api_enabled=false
api_key_authentication=true
svn_enabled=false
svn_binary_path=/usr/bin/svn
owasp_database_location=./include/owasp/database.json
highlight_lines_limit=5000
binary_extension_white_list=
binary_extension_black_list=3gp,7z,apk,avi,bak,bin,blines,bmp,book,bz2,cab,calc,cfs,chm,classclass,com,cur,db,dgn,dib,dll,dm,dmg,dml,doc,docx,dot,dwf,elf,emf,emz,eot,exe,fbr,flac,flv,fm,gfb,gif,gz,hlp,ico,idx,iso,jar,jpeg,jpg,lib,m4a,m4v,mid,mov,mp3,mp4,mpg,mpp,msg,msi,msm,nupkg,obj,out,pdb,pdf,picfile,plt,png,ppt,pptx,psd,pspimage,pyc,rar,resources,rlib,rm,rpm,rvf,rvm,save,scc,sdb,sfx,shx,so,sqlite,stl,tar,tgz,tif,tiff,tlb,tmp,TPL,ttf,tx,txv,vcd,vip,vsd,wav,wixlib,wixpdb,wma,wmf,wmv,woff,woff2,xls,xlsm,xlsx,xpm,yuv,z,zip,zipx
directory_black_list=bin,lost+found
number_git_processors=2
number_svn_processors=2
number_file_processors=10
default_and_match=true
log_indexed=false
follow_links=false
deep_guess_files=false
host_name=localhost:8080

Using Reset Index Lines brings Index Lines Count to 0, but Repo Parser Status remains paused.
Using Pause / Unpause Indexer alternates Repo Adder Status but does not affect Repo Parser Status
Setting:

low_memory=true

makes it work: Documents Indexed and Words in Spelling Corrector keep growing, while Index Lines Count and Index Queue Count remain at 0.

OK, what I think is happening (and I have seen this myself) is that there is a single file bogging down the parser. In this case it seems to be made worse by the parallel processing. I am going to add some additional logging to see if we can narrow down the exact file that is causing the issue.

Assuming this is the case I will need to add a timeout on the current process to catch this situation. At that point it will log that the file could not be indexed because it was taking too long.
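A common way to do that in Java (a sketch of the general pattern only, not the actual searchcode change; the file names and deadline are made up):

// Sketch of a per-file indexing timeout using an ExecutorService.
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class IndexTimeoutSketch {
    static final ExecutorService pool = Executors.newSingleThreadExecutor();

    static void indexWithTimeout(String fileName, Runnable indexTask, long timeoutSeconds) {
        Future<?> future = pool.submit(indexTask);
        try {
            future.get(timeoutSeconds, TimeUnit.SECONDS); // wait up to the deadline
        } catch (TimeoutException e) {
            future.cancel(true);                          // interrupt the stuck worker
            System.out.println("Unable to index " + fileName + " as it was taking too long");
        } catch (InterruptedException | ExecutionException e) {
            System.out.println("Unable to index " + fileName + ": " + e);
        }
    }

    public static void main(String[] args) {
        indexWithTimeout("quick.java", () -> { /* fast file */ }, 2);
        indexWithTimeout("pathological.min.js", () -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        }, 2);
        pool.shutdownNow();
    }
}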

I will put the required logic into master and let you know when it's done. Hopefully this will help narrow down if this is indeed the issue. I can sometimes replicate this locally.

So this has been added in. If you could have a look at the fine level logs (not clickable, just make the level fine)

http://localhost:8080/admin/logs/?level=fine

This might help diagnose the issue. I am looking to see if there are any that take longer than, say, the below example,

FINE: Tue Sep 12 08:32:59 AEST 2017: ForkJoinPool.commonPool-worker-1 190: indexDocument:parallelStream:updateDocument:0.001057959

I would not expect there to be many above ~0.001, but if so there may be an issue there. Please ignore slowness from the "GIT Blame Info Time:" portion. I am aware of the performance issue there and am working on it now.

Stopped logging 10 minutes ago, attached file.
SearchCodeFineLog.txt

Yep looks like that might be the issue. I will need to add some more debugging to be sure though. Will add that and comment back in here if you are able to verify for me.

Sorry about this, but it's the only way I can debug without actually looking at the repository that you are using.

That is on me, sorry, because I can't show you the repository. I'm happy to help you track it down.

Ok, going to start adding some very verbose logging to diagnose this one. Would you be comfortable with me logging out details such as the following?

  • file extension
  • file length
  • number of ascii characters

I can probably get away with just the length but the others would be helpful as well.

Have added additional details to the fine log in 35e3413. @quasarea if you could build from that commit on master and try it out, then we should be able to either work out which file is causing the issue or rule out one source of the issue.

What I am hoping to see is a file with something like the below,

FINE: Fri Sep 22 08:23:19 AEST 2017: ForkJoinPool.commonPool-worker-1 50: indexDocument:updateDocument:end:filename=./repo/searchcode-server/assets/sales/gif_crush.sh:0.001681328
FINE: Fri Sep 22 08:23:19 AEST 2017: QuartzScheduler_Worker-16 26: buildDocument:codeCleanPipeline:end:length=15296:0.001703491
FINE: Fri Sep 22 08:23:19 AEST 2017: ForkJoinPool.commonPool-worker-0 54: buildDocument:codeCleanPipeline:start:length=3940
FINE: Fri Sep 22 08:23:19 AEST 2017: ForkJoinPool.commonPool-worker-1 50: indexDocument:updateDocument:start:linescount=7:filename=./repo/searchcode-server/assets/sales/gif_crush.sh
FINE: Fri Sep 22 08:23:19 AEST 2017: QuartzScheduler_Worker-20 30: GIT Blame Info Time:end:assets/sales/gif_crush.sh:0.071189408
FINE: Fri Sep 22 08:23:19 AEST 2017: QuartzScheduler_Worker-20 30: GIT Blame Info Time:start: assets/sales/gif_crush.sh

But with it missing the :end: portion. Hopefully then I can get you to either let me look at the file or I can write some code to hopefully work out why the parsers are choking on it.

This log has no new entries even though I left it running for about 30 minutes. There were 2 indexers per type set and quartz was set to 2 as well.

SearchCodeFineLog-2.txt

there are exactly the same number of :start: as :end: entries :(

I did a few runs like this, and it is totally file independent; once it was indexing C# unit tests, another time some C++ project, and another time just text files.

That is really really odd. It suggests that the issue is not with the indexing or the processing pipeline because they start and finish.

When this happens, if you look at the admin screen, let me know what you see for the following values:

Index Lines Count: 0 
Index Queue Count: 0
free memory: 424,930
allocated memory: 854,016
max memory: 1,864,192
total free memory: 1,435,106
memory usage: 419MB 

Although checking the above shows it looks fine. It just looks like the indexer is blocked for some reason. I have added some additional logging into master for this.

2017-09-25 10_37_34-admin

With 2 indexers on the repos it sometimes happens while indexing files (so the most recent 1 or 2 files have no ":end:"), but sometimes it does not. I have not been able to find a case where one of the indexers stops far earlier than the other.

Another approach:

  • taken searchcode repo content, copied into 4 folders: SearchCodeA, SearchCodeB,...
  • cleaned index and logs
  • removed all previous repositories (sqlite editing) and configured new ones (was not able to remove the last 39 out of 102 repos via the admin panel; they were still there after page refresh, possibly due to the indexers being stuck already)

Console started (actually started a second time as I wanted to keep index recreation out of the logs):

Java HotSpot(TM) 64-Bit Server VM warning: JVM cannot use large page memory because it does not have enough privilege to lock pages in memory.
[main] INFO spark.staticfiles.StaticFilesConfiguration - StaticResourceHandler configured with folder = /public
[main] INFO org.quartz.impl.StdSchedulerFactory - Using default implementation for ThreadExecutor
[main] INFO org.quartz.core.SchedulerSignalerImpl - Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl
[main] INFO org.quartz.core.QuartzScheduler - Quartz Scheduler v.2.2.2 created.
[main] INFO org.quartz.simpl.RAMJobStore - RAMJobStore initialized.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler meta-data: Quartz Scheduler (v2.2.2) 'QuartzScheduler' with instanceId 'NON_CLUSTERED'
  Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
  NOT STARTED.
  Currently in standby mode.
  Number of jobs executed: 0
  Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 2 threads.
  Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.

[main] INFO org.quartz.impl.StdSchedulerFactory - Quartz scheduler 'QuartzScheduler' initialized from default file in current working dir: 'quartz.properties'
[main] INFO org.quartz.impl.StdSchedulerFactory - Quartz scheduler version: 2.2.2
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[main] INFO org.quartz.core.QuartzScheduler - Scheduler QuartzScheduler_$_NON_CLUSTERED started.
[Thread-1] INFO org.eclipse.jetty.util.log - Logging initialized @1560ms
[Thread-1] INFO spark.embeddedserver.jetty.EmbeddedJettyServer - == Spark has ignited ...
[Thread-1] INFO spark.embeddedserver.jetty.EmbeddedJettyServer - >> Listening on 0.0.0.0:8080
[Thread-1] INFO org.eclipse.jetty.server.Server - jetty-9.3.6.v20151106
[Thread-1] INFO org.eclipse.jetty.server.ServerConnector - Started ServerConnector@68be9c5f{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
[Thread-1] INFO org.eclipse.jetty.server.Server - Started @1951ms

2017-09-25 11_14_22-repository list

2017-09-25 11_15_05-admin

SearchCodeFineLog-3.txt

restarted service and run again:

2017-09-25 11_19_51-admin

SearchCodeFineLog-4.txt

Config for that instance:
SearchCodeConfig.zip

Then I switched to low_memory=true and restarted the service; the whole thing started indexing happily.

I have to wonder if this is something that only happens on Windows. I certainly have never encountered anything like this on macOS or GNU/Linux to date.

While the code was written such that it should work on Windows, it's not the main target platform. I might have a go at digging out a Windows machine and giving it a go there to see if I can replicate the issue.

So good news, I managed to replicate the issue at least once. It has a lot of lines and files to index but everything is paused.

screen shot 2017-09-26 at 8 39 53 am

Will attempt to see if I can replicate this and then track down the root cause.

So I was able to reliably cause this to happen. I am also pretty confident I have the fix.

Strictly speaking it is not a bug in the implementation. You were running into a Thread starvation issue.

The quartz.properties file in the base directory should only need to be modified when changing the searchcode.properties values of number_git_processors, number_svn_processors and number_file_processors. By default searchcode spawns 10 background threads which are used for repository processing and internal processing logic. searchcode uses 5 of these threads itself, leaving 5 for background repository processing tasks. If you adjust the number of repository processors higher than that, you should increase the value of org.quartz.threadPool.threadCount, up to a maximum of 100.
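For example, with the configuration quoted above (number_git_processors=2, number_svn_processors=2, number_file_processors=10) that is 14 repository processing threads, plus the roughly 5 that searchcode needs for itself, so around 19 threads in total. The default pool of 10 cannot cover that, which is why raising org.quartz.threadPool.threadCount (the example further down uses 50) unblocks things.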

I had assumed that quartz would yield the other sleeping threads (which happens when they hit the file limits). Going to fiddle around with the priority of quartz for this.

To get around this you can edit the quartz.properties file in the root directory like the below. Any value up-to 100 should work.

org.quartz.threadPool.class=org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount=50

Try it again. It should resolve it.

Stupid Github. I didn't tell you to close this.

Anyway the commit that closed this issue has some changes.

  • The background jobs have priority set. Indexing has max spawn priority with lowest thread priority
  • The index job no longer exits. It sits in a loop that sleeps on every iteration for 100ms
  • Updated the documentation to make it more explicit that changing the values for number_git_processors should be accompanied by a change to the quartz.properties file

The above is not required to resolve it in your case @quasarea but it couldn't hurt to get the latest version anyway.

I verified that this works by setting the number of quartz threads to 2 (one for the index and the other for everything else) and letting it run for a while. Going to stick this one in test since it looks like that has resolved the issue for me.

Hmm, I wanted to ask you how System Statistics / Threads relates to the quartz threads, but never considered it important :) actually now it's around 66-69, but setting quartz to 50 definitely helped.

I was considering an issue with the Windows implementation too :)

I will keep it running for a while and see if it still gets stuck, hopefully not ;)

The system statistics just shows the estimated number of threads used by the JVM process, http://docs.oracle.com/javase/7/docs/api/java/lang/Thread.html#activeCount()

The internal web-server reserves some (about 20-30 I think) and there are about 3 background processes that searchcode uses internally which will impact the number. You can then configure the number of GIT/SVN/File processors to get a rough idea of how many you need.

It actually will increase by as many CPUs as you have on occasion, now that the indexing runs in parallel. If you have enough GIT/SVN/File processors and enough IO then you can actually peg 8 cores pretty easily (see below). In fact, if you hit the "Pausing parser." message in the log frequently, it suggests you need to improve the CPU of your server now :)

dkd_xcxvwaar0xy

I should mention a lot of the above text is now out of date. For example

if (count >= 1000) { // Only index 1000 documents at most each time

The above check has been removed. Everything that is queued to be indexed will be done in a single batch now. The only time the queue will be limited is when it hits the soft caps of the below.

max_document_queue_size=1000
max_document_queue_line_size=100000

Going to make the call that this has been resolved. Previously I could trigger or disable it at will, and with the new code that is no longer possible.