maxlath / load-balance-lines

Parallelize newline-delimited data processing by load balancing lines between multiple processes

Home Page: https://npmjs.com/package/load-balance-lines

load-balance-lines

Install

# Make the executable available in your project's npm scripts as load-balance-lines
# or, outside of npm scripts, as ./node_modules/.bin/load-balance-lines
npm i load-balance-lines
# or globally
npm i -g load-balance-lines

Basic use

Take a large dataset made of atomic records separated by newlines, typically NDJSON.

# Make sure your executable is... executable
chmod +x /path/to/my/executable
# and let's go!
cat data.ndjson | load-balance-lines /path/to/my/executable some args

Or, without the cat command, using input redirection (<):

load-balance-lines /path/to/my/executable some args for the executable < data.ndjson
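
As an illustration of what such an executable could look like, here is a minimal Node.js sketch. It assumes that each sub-process receives its share of the lines on its standard input, which is not spelled out above. The names processLine and run, the sample records, and the choice to keep only the id field are all hypothetical. To use a script like this as the executable, it would also need a shebang line (#!/usr/bin/env node) at the top of the file.

```javascript
const readline = require('readline')
const { Readable } = require('stream')

// Hypothetical per-line transformation: parse one NDJSON record and
// return the line to print, or null to skip empty lines
function processLine (line) {
  const trimmed = line.trim()
  if (trimmed === '') return null
  const record = JSON.parse(trimmed)
  return JSON.stringify({ id: record.id })
}

// Read the given stream line by line and write the results to stdout
async function run (input) {
  const lines = readline.createInterface({ input, crlfDelay: Infinity })
  for await (const line of lines) {
    const result = processLine(line)
    if (result !== null) console.log(result)
  }
}

// Stand-in input so that this sketch runs on its own;
// a real worker would call run(process.stdin) instead
const sample = Buffer.from('{"id":"Q1","label":"universe"}\n\n{"id":"Q2"}\n')
run(Readable.from([sample])).catch(err => {
  console.error(err)
  process.exitCode = 1
})
```

Keeping the per-line logic in its own function makes it easy to test without going through the load balancer.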

Simple demo

See test in the repository.

Real case demo

For the needs of wikidata-rank, we need to parse a full dump of Wikidata:

  • Get the latest dump (currently 31 GB gzipped)
wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
  • Use nice to take as much CPU as is available while leaving priority to other processes
  • Use pigz to decompress it using threads (drop-in replacement for the single-threaded gzip)
nice pigz -d < latest-all.json.gz | nice load-balance-lines /path/to/wikidata-rank/scripts/calculate_base_scores

Options

Number of processes

By default, there are as many processes as CPU cores, but this can be changed by setting the LBL_PROCESSES environment variable:

export LBL_PROCESSES=4 ; cat data.ndjson | load-balance-lines ./my/script
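
The rule described here (environment variable first, CPU core count otherwise) could be sketched as follows. This is an illustration only, not taken from the package's source: the helper name getProcessCount and its handling of invalid values are assumptions.

```javascript
const os = require('os')

// Hypothetical helper: pick the number of worker processes from the
// environment, falling back to the number of CPU cores
function getProcessCount (env = process.env) {
  const requested = parseInt(env.LBL_PROCESSES, 10)
  // Assumption: ignore values that are not positive integers
  if (Number.isInteger(requested) && requested > 0) return requested
  return os.cpus().length
}

console.log(`would start ${getProcessCount()} processes`)
```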

Verbose

By default, the load balancer is silent, to leave stdout free for the sub-processes' output, but you can get some basic information by setting LBL_VERBOSE:

export LBL_VERBOSE=true ; cat data.ndjson | load-balance-lines ./my/script
