
Andromeda

Golang Packages Database

Andromeda analyzes the complete graph of the known Go Universe.

Requirements

  • Golang 1.7 or newer
  • Git v2.3 or newer (to avoid interactive prompts interrupting the crawler)
  • go-bindata
  • stringer (go get -u -a golang.org/x/tools/cmd/stringer)
  • OpenSSL (for automatic retrieval of the SSL/TLS server public key, which remote-crawler feeds to gRPC)
  • xz (for downloading daily snapshots from godoc.org)

Note: the following openssl command retrieves the server's certificate in PEM form (cert.pem):

openssl s_client -showcerts -servername andromeda.gigawatt.io -connect andromeda.gigawatt.io:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM 2>/dev/null
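The output can be redirected to a file and later handed to remote-crawler via its -c flag (the filename here is just an example):

openssl s_client -showcerts -servername andromeda.gigawatt.io -connect andromeda.gigawatt.io:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM 2>/dev/null > cert.pem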

Installation

go get jaytaylor.com/andromeda/...

Backend selection

Additional setup may be required depending on which DB backend you want to use.

BoltDB

No additional packages or work necessary.

RocksDB

  1. Install RocksDB

RocksDB installation instructions.

  2. Install the RocksDB golang package

gorocksdb package installation instructions.

  3. Build andromeda with the RocksDB backend enabled
go get jaytaylor.com/andromeda/...
cd "${GOPATH}/src/jaytaylor.com/andromeda"
go build -o andromeda -tags rocks

PostgreSQL

  1. Install PostgreSQL
apt-get install \
    postgresql \
    postgresql-client \
    postgresql-contrib \
    postgresql-10-prefix
  2. Enable the prefix module

"prefix" module enablement instructions.

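Once the postgresql-10-prefix package is installed, the prefix extension can typically be enabled per-database with psql (the database name below matches the connection strings used later in this README; adjust to your setup):

sudo -u postgres psql -d andromeda -c 'CREATE EXTENSION IF NOT EXISTS prefix;'
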
Development

Requirements

go get -u github.com/golang/protobuf/...
go get -u github.com/gogo/protobuf/...
go get -u github.com/gogo/gateway/...
go get -u github.com/gogo/googleapis/...
go get -u github.com/grpc-ecosystem/go-grpc-middleware/...
go get -u github.com/grpc-ecosystem/grpc-gateway/...
go get -u github.com/mwitkow/go-proto-validators/...

protoc-gen-gorm compatibility is tightly coupled to certain versions of various packages, so it's necessary to use dep to fetch all vendored dependencies.

go get github.com/infobloxopen/protoc-gen-gorm
cd "${GOPATH}/src/github.com/infobloxopen/protoc-gen-gorm"
dep ensure
go get .
  • Regenerating the domain package models:
go generate ./...

How to bootstrap the server

Example

Grab latest seed list from godoc.org:

./download-godoc-packages.sh

Locate the downloaded file and extract it with xz -k -d <filename>.
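For example, assuming the snapshot was saved as archive.godoc.org/packages.20180706.xz (the date portion will differ for your download):

xz -k -d archive.godoc.org/packages.20180706.xz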

Then clean up the input and seed it into andromeda:

./scripts/input-cleaner.sh archive.godoc.org/packages.20180706 \
    | andromeda bootstrap -g - -f text

Installation instructions

Instructions generally live alongside the code within the header of the relevant program, so always check the top of the scripts and source code for installation instructions and per-script documentation.

The exception to this rule is the andromeda binary, where usage instructions are available by running andromeda --help or andromeda <sub-command> --help.

Running remote-crawler as a system service on Windows

Ensure the target user account has been granted the 'Log on as a service' right.

Perform the following steps to edit the Local Security Policy of the computer on which you want to grant the 'Log on as a service' permission:

  1. Log on to the computer with administrative privileges.
  2. Open the 'Administrative Tools' and open the 'Local Security Policy'.
  3. Expand 'Local Policy' and click on 'User Rights Assignment'.
  4. In the right pane, right-click 'Log on as a service' and select Properties.
  5. Click the 'Add User or Group' button to add the new user.
  6. In the 'Select Users or Groups' dialog, find the user you wish to add and click 'OK'.
  7. Click 'OK' in the 'Log on as a service Properties' dialog to save the changes.

Notes:

Ensure that the user which you have added above is not listed in the 'Deny log on as a service' policy in the Local Security Policy.

Example system service installation on Windows

andromeda service crawler install -v --delete-after -s /tmp/src -a <host.name>:443 -c <path-to-letsencrypt-cert.pem> -u .\<windows-username> -p <windows-password>

Avoiding SSD burnout on Windows

A ramdisk partition mount can be used on Windows. The only configuration change required is to set core.symlinks = false in .gitconfig.

See 52830545-git-clone-not-works-with-some-ramdisk-and-ntfs for an explanation of why.
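One way to apply that setting to the crawler user's global .gitconfig:

git config --global core.symlinks false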

Running remote-crawler behind a proxy

Linux

Add the following to ~/.ssh/config:

Host github.com gitlab.com bitbucket.com bitbucket.org code.cloudfoundry.org launchpad.net git.code.sf.net
    ProxyCommand ncat --proxy proxy.example.com:80 %h %p
    Compression yes

macOS

Add the following to ~/.ssh/config:

Host github.com gitlab.com bitbucket.com bitbucket.org code.cloudfoundry.org launchpad.net git.code.sf.net
    ProxyCommand nc -X connect -x proxy.example.com:80 %h %p
    Compression yes

Commands

Add top 1000 most recently committed packages to the crawl queue

Note: the .bolt file is copied first and removed afterwards, presumably because the running andromeda server holds an exclusive lock on the live Bolt database file.

cp -a andromeda.bolt a
andromeda -b a -v stats mru -n 1000 | jq -r '.[] | .path' | xargs -n10 andromeda remote enqueue -a 127.0.0.1:8001 -v -f
rm a

Cross-datastore migrations

Migrate from postgres to bolt
andromeda util rebuild-db \
    -v \
    --driver postgres \
    --db "dbname=andromeda host=/var/run/postgresql" \
    --rebuild-db-driver bolt \
    --rebuild-db-file new.bolt
Migrate from bolt to postgres, filtering out package histories
andromeda util rebuild-db \
    -v \
    --driver bolt \
    --db no-history.bolt \
    --rebuild-db-driver postgres \
    --rebuild-db-file "dbname=andromeda host=/var/run/postgresql" \
    --rebuild-db-filters clearHistories

Cronjobs

5 */6 * * * /home/ppx/go/src/jaytaylor.com/andromeda/scripts/cron/download-godoc-packages.sh >/dev/null 2>&1
15 */6 * * * /home/ppx/go/src/jaytaylor.com/andromeda/scripts/cron/enqueue-godoc.sh >/dev/null 2>&1
45 */12 * * * /home/ppx/go/src/jaytaylor.com/andromeda/scripts/cron/trends.now.sh >/dev/null 2>&1
*/5 * * * * /home/ppx/go/src/jaytaylor.com/andromeda/scripts/cron/github.sh >/dev/null 2>&1

Default Configuration

Default application values can be overridden by a ~/.andromeda.toml or ~/.config/andromeda.toml configuration file (searched in that order; the first one found is used).

It's helpful to have these settings already defined in a configuration file if you use the command-line client much. Specifying long flags like --driver and --db <connection string> over and over gets tiresome!

For available configuration variables, see the example andromeda.toml config file.

To get started, copy it to your home directory:

cp andromeda.toml ~/.andromeda.toml
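After copying, adjust the values to match your environment. As a rough sketch of what the file might contain (the key names below are illustrative guesses, not the real variable names; consult the shipped andromeda.toml for those):

# Illustrative keys only -- see the example andromeda.toml for the actual variable names.
driver = "bolt"
db = "/path/to/andromeda.bolt"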

A non-default configuration file path location may be specified with the -config flag.
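For example (the config path is arbitrary, and the flag is placed before the sub-command like the other global flags shown elsewhere in this README):

andromeda -config /path/to/andromeda.toml stats mru -n 10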

License

(C) All Rights Reserved, Jay Taylor, 2018-2019.

TODOs

  • Elasticsearch backend?
  • Show nested packages listing in sub-packages template (subs don't imply a terminal!).
  • Find a way to include main packages.
  • Re-enable kubernetes (hardcoded as disabled in master.go).

Note: Instead of "repo" or whatever, think about calling a reporoot a "tree". This terminology is used here.

Fancy Data

  • Add a "CanGoGet" attribute to indicate whether a package is buildable via go get. Then provide a search filter to only include such packages.
  • Add deprecated attribute and set based on detection of phrases like "is deprecated" and "superceded by" in README.
  • Attempt topic extraction from READMEs.
  • Add usage popularity comparison between 2 pkgs.
  • Generic fork and hierarchy detection: Find packages with the same pkg name, then use the lastCommitedAt field to derive possibly related packages. Take this list and inspect commit histories to determine if there is commit overlap between the two. Could implement github scraping hacks to verify accuracy.
  • Handle relative imports (x2).
  • Detect and persist whether each import is vendored or not in the reverse-imports mapping data.
  • Add analysis of RepoRoot sub-package paths and import names.
  • Add counts for total number of packages tracked, globally (currently repos are tracked and called "packages" everywhere, ugh..).
  • 1/2 Distinguish between pkg and repo by refactoring what is currently called a "package" in andromeda into a "repo".
  • 2/2 Add alias tracking table.

Remote Crawlers

  • 1/4 Add a --id flag for remote crawlers to uniquely identify them (x2).
  • 2/4 Remote crawlers should track and store their own statistics in a local bolt db file, per crawler-id. For example, keep track of number of crawls done per day, total size of crawled content, number of successful and failed crawls.
  • 3/4 Server-side: Track crawlers by ID, and track when they were last seen, IP addresses, number of packages crawled, number of successful crawls vs errors.
  • 4/4 Provide live-query mechanism for server to ping all crawlers to get an accurate count of actives. Would also be interesting to have the crawlers include their version (git hash) and crawl stats in the response.

Fully Autonomous System

  • Add a queue monitor; when the queue is empty, add the N least recently updated packages to the crawl queue.
  • Add an errors counter to ToCrawlEntry and throw the entry away when the error count exceeds N.
  • Add process-level concurrency support for remote crawlers (to increase throughput without resorting to trying to manage multiple crawler processes per host).

Data Integrity

  • To avoid dropping items across restarts, implement some kind of a WAL and resume functionality (x2, see next item below).
  • Protect against losing queue items from process restarts / interruptions; Add in-flight TCE's to an intermediate table, and at startup move items from said table back into the to-crawl queue.
  • Remote-crawler: Store the crawl result on disk when sending fails; then, when the remote starts, check for a failed transmit and send it. Possible complexity due to the server not expecting that crawl result. May need to expose via a different gRPC API endpoint than Attach.

Operational and Performance

  • Fix -s strangeness, should only specify the base path and auto-append "/src".
  • Consider refactoring "Package" to "Repo", since a go repo contains an arbitrary number of packages (I know, "yuck", but..).
  • Implement Postgres backend.
  • Implement Postgres queue.
  • Implement CockroachDB backend.
  • Implement CockroachDB queue.
  • Add git commit hash to builds, and have the gRPC client send it with requests.
  • Implement pure-postgres native db.Client interface and see if or how much better we can do compared to K/V approach.
  • Implement pending-references updates as a batch job (currently disabled due to low performance). Another way to solve it would be to only save pending references sometimes: just add an extra parameter on the internal save method (went with this; it was very simple to add a single param to the save functions to avoid merging pending references for recursively-triggered saves).
  • Implement a ~/.andromeda/config configuration file to avoid having to pass --driver/--db all the time.
  • Review github cron: verify it is well-behaved and doesn't submit duplicates every run.

Uncategorized

  • Add git version check to crawler (because it's easy to forget to upgrade git!). Note: This is part of the check command, which also verifies availability of the openssl binary.
  • Make it work for repo roots without go files, e.g. github.com/deferpanic/virgo
  • Add a monitor and require that the disk where the DB is stored always has at least X GB free, where X is based on a multiple of the Bolt database file size. This is to ensure things don't get into a state where data cannot be written to the DB or, even worse, the DB becomes corrupted. Remember that DB size may grow non-linearly (need to double-check this, but that is what I recall observing).
  • Move failed to-crawls to different table instead of dropping them outright.
  • 1/2 Expose queue contents over rest API.
  • 2/2 Frontend viewer for queue head and tail contents.
  • Migrate table names to be singular.
  • Add "view on sourcegraph.com" link.
  • Handle "transport: authentication handshake failed: x509: certificate signed by unknown authority" source="crawler/remote.go:150" errors by fetching latest cert.pem.

To locate additional TODOs, run find . -name '*.go' -exec grep TODO {} +

Some of them are only noted in the relevant code region :)
