ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query

getpapers 'JavaScript heap out of memory error'

J-E-J-S opened this issue

I am hoping to do an extensive mine (~1 million papers ultimately). I am attempting a 100k-paper mine first, but I am running into memory errors, which I would guess are due to the number of papers it attempts to download.
The error I receive is:

 getpapers -q biotechnology -a -k 100000 -o 100000_biotechnology
info: Searching using eupmc API
(node:14840) Warning: Accessing non-existent property 'padLevels' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
info: Found 1030125 results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.5 reported by api
info: Limiting to 100000 hits
Retrieving results [==========--------------------] 34% (eta 532.9s)info: EuPMC gave us the wrong hitcount. We've already found all the results
info: Duplicate records found: 33709 unique results identified
info: Saving result metadata

<--- Last few GCs --->

[14840:0000020982D837B0]   308443 ms: Scavenge 1962.9 (2049.8) -> 1961.1 (2064.3) MB, 21.1 / 0.0 ms  (average mu = 0.938, current mu = 0.624) allocation failure
[14840:0000020982D837B0]   308534 ms: Scavenge 1975.6 (2064.3) -> 1975.0 (2065.8) MB, 10.6 / 0.0 ms  (average mu = 0.938, current mu = 0.624) allocation failure
[14840:0000020982D837B0]   308568 ms: Scavenge 1976.9 (2065.8) -> 1975.1 (2079.8) MB, 23.3 / 0.0 ms  (average mu = 0.938, current mu = 0.624) allocation failure


<--- JS stacktrace --->

FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory
 1: 00007FF7F0CA058F napi_wrap+109311
 2: 00007FF7F0C452B6 v8::internal::OrderedHashTable<v8::internal::OrderedHashSet,1>::NumberOfElementsOffset+33302
 3: 00007FF7F0C46086 node::OnFatalError+294
 4: 00007FF7F151153E v8::Isolate::ReportExternalAllocationLimitReached+94
 5: 00007FF7F14F63BD v8::SharedArrayBuffer::Externalize+781
 6: 00007FF7F13A084C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1516
 7: 00007FF7F138B48B v8::internal::NativeContextInferrer::Infer+59243
 8: 00007FF7F13709BF v8::internal::MarkingWorklists::SwitchToContextSlow+57327
 9: 00007FF7F138460B v8::internal::NativeContextInferrer::Infer+30955
10: 00007FF7F137B72D v8::internal::MarkCompactCollector::EnsureSweepingCompleted+6269
11: 00007FF7F138385E v8::internal::NativeContextInferrer::Infer+27454
12: 00007FF7F13877EB v8::internal::NativeContextInferrer::Infer+43723
13: 00007FF7F1391042 v8::internal::ItemParallelJob::Task::RunInternal+18
14: 00007FF7F1390FD1 v8::internal::ItemParallelJob::Run+641
15: 00007FF7F13648D3 v8::internal::MarkingWorklists::SwitchToContextSlow+7939
16: 00007FF7F137BBDC v8::internal::MarkCompactCollector::EnsureSweepingCompleted+7468
17: 00007FF7F137A424 v8::internal::MarkCompactCollector::EnsureSweepingCompleted+1396
18: 00007FF7F1377F88 v8::internal::MarkingWorklists::SwitchToContextSlow+87480
19: 00007FF7F13A65D1 v8::internal::Heap::LeftTrimFixedArray+929
20: 00007FF7F13A86B5 v8::internal::Heap::PageFlagsAreConsistent+789
21: 00007FF7F139D961 v8::internal::Heap::CollectGarbage+2033
22: 00007FF7F139BB65 v8::internal::Heap::AllocateExternalBackingStore+1317
23: 00007FF7F13B5E06 v8::internal::Factory::AllocateRaw+166
24: 00007FF7F13C9824 v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithFiller+84
25: 00007FF7F13C9775 v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArray+69
26: 00007FF7F12161B1 v8::internal::LayoutDescriptor::Trim+2065
27: 00007FF7F121A339 v8::internal::LayoutDescriptor::Trim+18841
28: 00007FF7F1216313 v8::internal::LayoutDescriptor::Trim+2419
29: 00007FF7F1219C48 v8::internal::LayoutDescriptor::Trim+17064
30: 00007FF7F1219B72 v8::internal::LayoutDescriptor::Trim+16850
31: 00007FF7F12D37F7 v8::base::TimeDelta::operator!=+11847
32: 00007FF7F12CFD20 v8::internal::TimedHistogram::Stop+16976
33: 00007FF7F12CECFC v8::internal::TimedHistogram::Stop+12844
34: 00007FF7F12D3A1C v8::base::TimeDelta::operator!=+12396
35: 00007FF7F12CFD20 v8::internal::TimedHistogram::Stop+16976
36: 00007FF7F12D02F3 v8::internal::TimedHistogram::Stop+18467
37: 00007FF7F12D25B6 v8::base::TimeDelta::operator!=+7174
38: 00007FF7F1483ED7 v8::internal::Builtins::builtin_handle+81719
39: 00007FF7F1599FCD v8::internal::SetupIsolateDelegate::SetupHeap+464173
40: 00007FF7F15328D2 v8::internal::SetupIsolateDelegate::SetupHeap+40498
41: 00007FF7F15328D2 v8::internal::SetupIsolateDelegate::SetupHeap+40498
42: 00007FF7F15328D2 v8::internal::SetupIsolateDelegate::SetupHeap+40498
43: 00007FF7F15328D2 v8::internal::SetupIsolateDelegate::SetupHeap+40498

Nothing is ultimately downloaded into the output directory, and attempts to restart fail.

Is there any way around this, or a fix? Would it help if I increased the RAM on the machine (currently 8 GB)?

Many Thanks,
James
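
On the RAM question: the GC trace above shows the heap stalling at roughly 2,000 MB, which suggests the process ran into V8's own heap ceiling rather than the machine's 8 GB. A minimal sketch for checking that ceiling, using Node's built-in `v8` module; the helper names `toMiB` and `heapLimitMiB` are made up for illustration and are not part of getpapers:

```javascript
const v8 = require('v8');

// Convert a byte count to whole mebibytes for readable output.
function toMiB(bytes) {
  return Math.round(bytes / (1024 * 1024));
}

// heap_size_limit is the ceiling V8 lets the heap grow to before it aborts
// with "JavaScript heap out of memory".
function heapLimitMiB() {
  return toMiB(v8.getHeapStatistics().heap_size_limit);
}

console.log(`V8 heap limit: ${heapLimitMiB()} MiB`);
```

Saving this to a file and running it once with plain `node` and once with `node --max-old-space-size=7168` should show whether the flag actually raises the limit on a given machine.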

I think the issue may have more to do with the interaction with the EUPMC API.
I attempted to increase the memory available to getpapers (to 7 GB) by invoking it with:

node --max-old-space-size=7168 getpapers.js -q biotechnology -a -k 100000 -o /c/Users/James/Documents/field-dynamics/resources/100000_biotechnology

I receive the following output:

$ node --max-old-space-size=7168 getpapers.js -q biotechnology -a -k 100000 -o /c/Users/James/Documents/field-dynamics/resources/100000_biotechnology
info: Searching using eupmc API
(node:9316) Warning: Accessing non-existent property 'padLevels' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
info: Found 1030125 results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.5 reported by api
info: Limiting to 100000 hits
Retrieving results [=-----------------------------] 5% (eta 1001.9s)info: EuPMC gave us the wrong hitcount. We've already found all the results
info: Duplicate records found: 4990 unique results identified
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
Potentially unhandled rejection [1] Error: EEXIST: file already exists, mkdir 'C:\Users\James\Documents\field-dynamics\resources\100000_biotechnology'
    at Object.mkdirSync (fs.js:987:3)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:72:13)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:78:24)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:79:17)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:79:17)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:79:17)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:79:17)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:79:17)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:79:17)
    at sync (C:\users\james\appdata\roaming\npm\node_modules\getpapers\node_modules\mkdirp\index.js:79:17)

This time the results are written to the directory, but the progress bar appears to halt at 5% of the 100k specified (4990 unique results, as shown in the log) because of this wrong-hitcount message, which may be the real problem.

To add, the total number of available papers with this query is supposedly 1,029,162.
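
As a side note on the `EEXIST` rejection in the log: that is the code Node reports when `mkdir` is called on a path that already exists, presumably here because the output directory was left over from the earlier failed run (an assumption, not confirmed in the log). The sketch below only demonstrates this Node behaviour in a scratch directory; it is not getpapers' or mkdirp's code, and `plainMkdirCode` is a hypothetical helper:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Work inside a fresh temp directory so no real output folder is touched.
const base = fs.mkdtempSync(path.join(os.tmpdir(), 'mkdir-demo-'));
const target = path.join(base, 'out');

fs.mkdirSync(target); // first creation succeeds

// Try a plain mkdir and report either 'ok' or the error code.
function plainMkdirCode(dir) {
  try {
    fs.mkdirSync(dir);
    return 'ok';
  } catch (err) {
    return err.code;
  }
}

console.log(plainMkdirCode(target)); // EEXIST

// With { recursive: true }, an already existing directory is not an error.
fs.mkdirSync(target, { recursive: true });
```

If the leftover directory is indeed the trigger, pointing `-o` at a directory that does not exist yet would be one way to test that.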

Thanks @petermr for a speedy reply. I would actually be very interested in mining papers by date as I am hoping to elucidate trends in biotechnology over time. I don't quite understand your implementation here:

for i = 1,50 {

getpapers -q biotechnology -a -k 30000 -o 1000000_biotechnology

}

I can't see anything in the wiki related to this. Would you mind explaining a bit more about how to refine by date?

Many Thanks,
James
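
One possible reading of "refine by date" is to split the mine into one smaller query per publication year. The sketch below only prints candidate commands and rests on two assumptions: that the Europe PMC query syntax accepts a `FIRST_PDATE:[start TO end]` range, and that getpapers passes the `-q` string to the API unchanged. The helper names, the year range and the output directory pattern are hypothetical; the `-q`, `-a`, `-k` and `-o` flags are the ones used earlier in this thread.

```javascript
// Build a query restricted to a single publication year.
// Assumes Europe PMC's FIRST_PDATE range syntax.
function yearQuery(term, year) {
  return `${term} AND FIRST_PDATE:[${year}-01-01 TO ${year}-12-31]`;
}

// Produce one getpapers command line per year, each with its own
// output directory so the runs do not collide.
function yearlyCommands(term, startYear, endYear, limit) {
  const commands = [];
  for (let year = startYear; year <= endYear; year++) {
    const query = yearQuery(term, year);
    commands.push(`getpapers -q "${query}" -a -k ${limit} -o ${term}_${year}`);
  }
  return commands;
}

// Print the commands for review instead of executing them.
for (const cmd of yearlyCommands('biotechnology', 2015, 2017, 30000)) {
  console.log(cmd);
}
```

Each printed command would cover a much smaller result set than the full query, and the per-year directories would map directly onto a trend-over-time analysis.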