Datashare

Download

https://datashare.icij.org/

Documentation

Datashare's user guide can be found here: https://icij.gitbook.io/datashare/

Follow new updates and features

@ICIJorg publishes video tweets of new features with the hashtag #ICIJDatashare.

Frontend

This repository contains only the backend of Datashare.

The frontend is available at https://github.com/ICIJ/datashare-client.

Description

Datashare is a free, open-source desktop application developed by the non-profit International Consortium of Investigative Journalists (ICIJ).

Datashare allows investigative journalists to:

access all their documents in one place, locally on their computer, while securing them from potential third-party interference
search PDFs, images, text files, spreadsheets, slides and any other files simultaneously
automatically detect and filter by people, organizations and locations

Translation of the interface

You're welcome to suggest translations on Datashare's Crowdin project: https://crwd.in/datashare. Please contact us if you would like to add a language.

Installing and using

Using with elasticsearch

You can download the script at datashare.icij.org.

To access the web GUI, go to your documents folder, launch path/to/datashare.sh, then open http://localhost:8080 in your browser.

Using only Named Entity Recognition

You can use the Datashare Docker container solely for its name-finding API, which is exposed over HTTP.

Just run:

docker run -ti -p 8080:8080 -v /path/to/dist/:/home/datashare/dist icij/datashare:0.10 -m NER

A bit of explanation:

-w tells Datashare to run the web server. It listens on port 8080, which is why that port is mapped in the docker command
-m NER runs Datashare in stateless mode, without any index
-v /path/to/dist:/home/datashare/dist maps the directory from which the NLP models are read (and to which they are downloaded if they don't exist)

Then query the server with curl:

curl -i localhost:8080/ner/findNames/CORENLP --data-binary @path/to/a/file.txt

The last path segment (CORENLP) is the NLP framework. Choose one of CORENLP, IXAPIPE, MITIE or OPENNLP.
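To compare the frameworks on the same input, the curl call above can be repeated once per framework name. The following is a minimal sketch, assuming the container started by the docker command is listening on localhost:8080; ner_url and query_all are hypothetical helper names and are not part of Datashare.

```shell
# Sketch: send the same text file to each NER framework listed above.
# ner_url and query_all are illustrative helpers, not Datashare commands.

ner_url() {
  # $1 = host:port, $2 = framework name (CORENLP, IXAPIPE, MITIE or OPENNLP)
  printf 'http://%s/ner/findNames/%s' "$1" "$2"
}

query_all() {
  # $1 = host:port, $2 = path to a text file
  for framework in CORENLP IXAPIPE MITIE OPENNLP; do
    echo "== $framework =="
    curl -s "$(ner_url "$1" "$framework")" --data-binary "@$2"
    echo
  done
}

# Usage: query_all localhost:8080 path/to/a/file.txt
```

Nothing runs until query_all is called, so the helpers can be sourced into an interactive shell and invoked once the server is up.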

Extract Text from Files

Implementations

TikaDocument from ICIJ/extract

Apache Tika v1.18 (Apache Licence v2.0)

with Tesseract v4.0 alpha

Support

Tika File Formats

Extract Persons, Organizations or Locations from Text

Implementations

org.icij.datashare.text.nlp.corenlp.CorenlpPipeline

Stanford CoreNLP v3.8.0 (Conditional Random Fields), Composite GPL v3+

org.icij.datashare.text.nlp.ixapipe.IxapipePipeline

Ixa Pipes Nerc v1.6.1 (Perceptron), Apache Licence v2.0

org.icij.datashare.text.nlp.mitie.MitiePipeline

MIT Information Extraction v0.8 (Structural Support Vector Machines), Boost Software License v1.0

org.icij.datashare.text.nlp.opennlp.OpennlpPipeline

Apache OpenNLP v1.6.0 (Maximum Entropy), Apache Licence v2.0

Natural Language Processing Stages Support

`NlpStage`
`TOKEN`
`SENTENCE`
`POS`
`NER`

Named Entity Recognition Language Support

`NlpStage.NER`	`ENGLISH`	`SPANISH`	`GERMAN`	`FRENCH`	`CHINESE`
`NlpPipeline.Type.CORENLP`	X	X	X	(w/ EN)	X
`NlpPipeline.Type.OPENNLP`	X	X	-	X	-
`NlpPipeline.Type.IXAPIPE`	X	X	X	-	-
`NlpPipeline.Type.MITIE`	X	X	X	-	-
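
When scripting against the NER endpoint, it can help to check this table before choosing a framework for a given language. The sketch below transcribes the table into a lookup; supports_ner is a hypothetical helper name, and treating the "(w/ EN)" entry for CORENLP and French as supported is an assumption about what that note means.

```shell
# Sketch: NER language support, transcribed from the table above.
# supports_ner PIPELINE LANGUAGE exits 0 if the pair is marked as supported.
supports_ner() {
  case "$1:$2" in
    CORENLP:ENGLISH|CORENLP:SPANISH|CORENLP:GERMAN|CORENLP:CHINESE) return 0 ;;
    CORENLP:FRENCH) return 0 ;;  # marked "(w/ EN)" in the table; assumed supported
    OPENNLP:ENGLISH|OPENNLP:SPANISH|OPENNLP:FRENCH) return 0 ;;
    IXAPIPE:ENGLISH|IXAPIPE:SPANISH|IXAPIPE:GERMAN) return 0 ;;
    MITIE:ENGLISH|MITIE:SPANISH|MITIE:GERMAN) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: supports_ner OPENNLP GERMAN || echo "pick another framework"
```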

Named Entity Categories Support

`NamedEntity.Category`
`ORGANIZATION`
`PERSON`
`LOCATION`

Parts-of-Speech Language Support

`NlpStage.POS`	`ENGLISH`	`SPANISH`	`GERMAN`	`FRENCH`
`NlpPipeline.Type.CORENLP`	X	X	X	X
`NlpPipeline.Type.OPENNLP`	X	X	X	X
`NlpPipeline.Type.IXAPIPE`	X	X	X	X
`NlpPipeline.Type.MITIE`	-	-	-	-

Store and Search Documents and Named Entities

Implementations

org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer

Elasticsearch v6.1.0, Apache Licence v2.0

Compilation / Build

Requires JDK 8, Maven 3 and a running PostgreSQL server (hostname postgres) with two databases, datashare and test, both writable by user test (password test). You'll also need a running Elasticsearch instance with hostname elasticsearch and a Redis server with hostname redis.

mvn validate
mvn -pl commons-test -am install
mvn -pl datashare-db liquibase:update
mvn test
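
The PostgreSQL requirements for the build could be met roughly as follows. This is a sketch, not the project's own setup script: print_setup_sql is a hypothetical helper, making test the owner of both databases is just one way to give it write access, and the superuser role named postgres in the usage line is an assumption about your server.

```shell
# Sketch only: prints SQL that creates the user and databases the build expects.
# Assumption: owning a database is sufficient "write access" for user test.
print_setup_sql() {
  cat <<'SQL'
CREATE USER test WITH PASSWORD 'test';
CREATE DATABASE datashare OWNER test;
CREATE DATABASE test OWNER test;
SQL
}

# Usage (assumes a superuser role named "postgres" on host "postgres"):
#   print_setup_sql | psql -h postgres -U postgres
```

Printing the statements instead of executing them lets you review them before piping them to psql.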

Keeping the development environment up to date

It is important to keep datashare and datashare-client up to date by pulling from each repository's master branch.

To make sure the updates take effect, run make clean dist locally in each repository.

If dependencies have been updated on datashare-client, run yarn before make clean dist.

If the database models have changed within datashare, run the following commands before make clean dist:

sh datashare-db/scr/reset_datashare_db.sh
mvn -pl commons-test -am install
mvn -pl datashare-db liquibase:update
mvn test

License

Datashare is released under the GNU Affero General Public License.

Bug report, comment or (pull) request

We welcome feedback as well as contributions!

For any bug, question, comment or (pull) request, please contact us at datashare@icij.org.
