FileSystem Crawler for Elasticsearch

Welcome to the FS Crawler for Elasticsearch

This crawler helps to index documents from your local file system or over SSH. It crawls your file system, indexes new files, updates existing ones and removes old ones.

You need to install a version matching your Elasticsearch version:

| Elasticsearch | FS Crawler | Docs      |
|---------------|------------|-----------|
| es-2.0        | 2.0.0      | See below |

Build Status

Thanks to Travis for the build status.

Getting Started

You need to have at least Java 1.8.

fscrawler job_name

FS crawler will read a local file (defaults to ~/.fscrawler/{job_name}.json). If the file does not exist, FS crawler will propose to create your first job.

Once the crawler is running, it will write status information and statistics in:

  • ~/.fscrawler/{job_name}.json
  • ~/.fscrawler/{job_name}_status.json
  • ~/.fscrawler/{job_name}_stats.json

This means that if you stop the job at some point, FS crawler will restart from where it stopped. If needed, you can manually edit or remove those files to restart from scratch.
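For example, assuming a hypothetical job named mydocs, removing its status file (following the naming pattern above) forces the crawler to start over on the next run:

rm ~/.fscrawler/mydocs_status.json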

You can also run:

fscrawler

It will give you the list of existing jobs and will allow you to choose one.

Crawler options

  • --help displays help
  • --silent runs in silent mode. No output is generated.
  • --debug runs in debug mode.
  • --trace runs in trace mode (more verbose than debug).
  • --config_dir defines the directory where jobs are stored instead of the default ~/.fscrawler.
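For example, to run a hypothetical job named mydocs in debug mode, with jobs stored in a custom directory:

fscrawler mydocs --config_dir /path/to/config --debug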

Job file specification

The job file must comply with the following JSON specification:

{
  "name" : "job_name",
  "fs" : {
    "url" : "/path/to/data/dir",
    "update_rate" : "5s",
    "includes": [
      "*.*"
    ],
    "excludes": [
      "*.json"
    ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "store_source" : false,
    "indexed_chars" : "10000"
  },
  "server" : {
    "hostname" : null,
    "port" : 22,
    "username" : null,
    "password" : null,
    "protocol" : "local",
    "pem_path" : null
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200
    } ],
    "index" : "docs",
    "type" : "doc",
    "bulk_size" : 100,
    "flush_interval" : "5s"
  }
}

Settings list

Here is a full list of existing settings:

| Name                         | Default value  | Documentation                        |
|------------------------------|----------------|--------------------------------------|
| name                         |                | the job name (mandatory field)       |
| fs.url                       | "/tmp/es"      | Root directory                       |
| fs.update_rate               | "15m"          | Update rate                          |
| fs.includes                  | null           | Includes and excludes                |
| fs.excludes                  | null           | Includes and excludes                |
| fs.json_support              | false          | Indexing JSON docs                   |
| fs.filename_as_id            | false          | Using filename as elasticsearch _id  |
| fs.add_filesize              | true           | Disabling file size field            |
| fs.remove_deleted            | true           | Ignore deleted files                 |
| fs.store_source              | false          | Storing binary source document       |
| fs.indexed_chars             | 0.0            | Extracted characters                 |
| server.hostname              | null           | Indexing using SSH                   |
| server.port                  | 22             | Indexing using SSH                   |
| server.username              | null           | Indexing using SSH                   |
| server.password              | null           | Indexing using SSH                   |
| server.protocol              | "local"        | Indexing using SSH                   |
| server.pem_path              | null           | Using Username / PEM file            |
| elasticsearch.index          | job name       | Index name                           |
| elasticsearch.type           | "doc"          | Type name                            |
| elasticsearch.bulk_size      | 100            | Bulk settings                        |
| elasticsearch.flush_interval | "5s"           | Bulk settings                        |
| elasticsearch.nodes          | 127.0.0.1:9200 | Node settings                        |

The most simple crawler

You can define the most simple crawler job by writing a ~/.fscrawler/test.json file as follows:

{
  "name" : "test"
}

Every 15 minutes, this will scan all documents available in the /tmp/es dir and index them into the test index with the doc type. It will connect to an elasticsearch cluster running on 127.0.0.1, port 9200.

Note: name is a mandatory field.

Root directory

Define fs.url property in your ~/.fscrawler/test.json file:

{
  "name" : "test",
  "fs" : {
    "url" : "/path/to/data/dir"
  }
}

For Windows users, use a form like c:/tmp or c:\\tmp.
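For example, a minimal job file pointing at a Windows directory:

{
  "name" : "test",
  "fs" : {
    "url" : "c:/tmp"
  }
}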

Includes and excludes

Let's say you want to index only docs matching *.doc and *.pdf, but exclude anything matching resume*. In that case, resume_david.pdf won't be indexed.

Define fs.includes and fs.excludes properties in your ~/.fscrawler/test.json file:

{
  "name" : "test",
  "fs": {
    "includes": [
      "*.doc",
      "*.pdf"
    ],
    "excludes": [
      "resume*"
    ]
  }
}

It also applies to directory names. So if you want to ignore the .ignore dir, just add .ignore as an excluded name.
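For example, to skip the .ignore directory:

{
  "name" : "test",
  "fs": {
    "excludes": [
      ".ignore"
    ]
  }
}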

Update rate

By default, update_rate is set to 15m. You can modify this value using any compatible time unit.

For example, here is a 15 minutes update rate:

{
  "name": "test",
  "fs": {
    "update_rate": "15m"
  }
}

Or a 3 hours update rate:

{
  "name": "test",
  "fs": {
    "update_rate": "3h"
  }
}

Indexing using SSH

You can index files remotely using SSH.

Username / Password

Let's say you want to index from a remote server using SSH:

  • FS URL: /path/to/data/dir/on/server
  • Server: mynode.mydomain.com
  • Username: username
  • Password: password
  • Protocol: ssh (defaults to local)
  • Port: 22 (defaults to 22)

{
  "name" : "test",
  "fs" : {
    "url" : "/path/to/data/dir/on/server"
  },
  "server" : {
    "hostname" : "mynode.mydomain.com",
    "port" : 22,
    "username" : "username",
    "password" : "password",
    "protocol" : "ssh"
  }
}

Using Username / PEM file

Let's say you want to index from a remote server using SSH:

  • FS URL: /path/to/data/dir/on/server
  • Server: mynode.mydomain.com
  • Username: username
  • PEM File: /path/to/private_key.pem
  • Protocol: ssh (defaults to local)
  • Port: 22 (defaults to 22)

{
  "name" : "test",
  "fs" : {
    "url" : "/path/to/data/dir/on/server"
  },
  "server" : {
    "hostname" : "mynode.mydomain.com",
    "port" : 22,
    "username" : "username",
    "protocol" : "ssh",
	"pem_path": "/path/to/private_key.pem"
  }
}

Searching for docs

This is a common use case in elasticsearch: we want to search for something! ;-)

GET docs/doc/_search
{
  "query" : {
    "match" : {
        "_all" : "I am searching for something !"
    }
  }
}

Indexing JSON docs

If you want to index JSON files directly without parsing them with Tika, you can set json_support to true.

{
  "name" : "test",
  "fs" : {
    "json_support" : true
  }
}

Of course, if you did not define a mapping before launching the crawler, Elasticsearch will guess the mapping automatically.
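For example, you could create the index and push your own mapping before starting the crawler. This is just a sketch: mydocs and the price field are placeholders for your own index name and JSON structure.

# Create index
PUT mydocs

# Create the mapping for the doc type
PUT mydocs/doc/_mapping
{
  "doc" : {
    "properties" : {
      "price" : {
        "type" : "long"
      }
    }
  }
}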

Dealing with multiple types and multiple dirs

If you have more than one type, create one crawler per type:

~/.fscrawler/test_type1.json:

{
  "name": "test_type1",
  "fs": {
	"url": "/tmp/type1",
	"json_support" : true
  },
  "elasticsearch": {
    "index": "mydocs",
    "type": "type1"
  }
}

~/.fscrawler/test_type2.json:

{
  "name": "test_type2",
  "fs": {
	"url": "/tmp/type2",
	"json_support" : true
  },
  "elasticsearch": {
    "index": "mydocs",
    "type": "type2"
  }
}

Dealing with multiple types within the same dir

You can also index multiple types from a single dir by running two crawlers that scan the same dir with different includes parameters:

~/.fscrawler/test_type1.json:

{
  "name": "test_type1",
  "fs": {
	"url": "/tmp",
    "includes": [ "type1*.json" ],
	"json_support" : true
  },
  "elasticsearch": {
    "index": "mydocs",
    "type": "type1"
  }
}

~/.fscrawler/test_type2.json:

{
  "name": "test_type2",
  "fs": {
	"url": "/tmp",
    "includes": [ "type2*.json" ],
	"json_support" : true
  },
  "elasticsearch": {
    "index": "mydocs",
    "type": "type2"
  }
}

Using filename as elasticsearch _id

Please note that by default the document _id is generated as a hash of the JSON filename, to avoid issues with special characters in filenames. You can force the _id to be the filename instead, using the filename_as_id attribute:

{
  "name" : "test",
  "fs" : {
    "json_support" : true,
    "filename_as_id" : true
  }
}

Disabling file size field

By default, FS crawler will create a field to store the original file size in bytes. You can disable it using the add_filesize option:

{
  "name" : "test",
  "fs" : {
    "add_filesize" : false
  }
}

Ignore deleted files

If you don't want to remove indexed documents when you remove a file or a directory, you can set remove_deleted to false (default: true):

{
  "name" : "test",
  "fs" : {
    "remove_deleted" : false
  }
}

Ignore content

If you don't want to extract file content but only index filesystem metadata such as filename, date, size and path, you can set index_content to false (default: true):

{
  "name" : "test",
  "fs" : {
    "index_content" : false
  }
}

Advanced

Autogenerated mapping

When the FS crawler detects a new type, it automatically creates a mapping for this type:

{
  "doc" : {
    "properties" : {
      "content" : {
        "type" : "string",
        "store" : "yes"
      },
      "meta" : {
        "properties" : {
          "author" : {
              "type" : "string",
              "store" : "yes"
          },
          "title" : {
              "type" : "string",
              "store" : "yes"
          },
          "date" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "keywords" : {
              "type" : "string",
              "store" : "yes"
          }
        }
      },
      "file" : {
        "properties" : {
          "content_type" : {
              "type" : "string",
              "analyzer" : "not_analyzed",
              "store" : "yes"
          },
          "last_modified" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "indexing_date" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "filesize" : {
              "type" : "long",
              "store" : "yes"
          },
          "indexed_chars" : {
              "type" : "long",
              "store" : "yes"
          },
          "filename" : {
              "type" : "string",
              "analyzer" : "not_analyzed",
              "store" : "yes"
          },
          "url" : {
              "type" : "string",
              "store" : "yes",
              "index" : "no"
          }
        }
      },
      "path" : {
        "properties" : {
          "encoded" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "virtual" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "root" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "real" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          }
        }
      }
    }
  }
}

Creating your own mapping (analyzers)

If you want to define your own mapping (for example, to set analyzers), you can push the mapping before starting the FS crawler:

# Create index
PUT docs

# Create the mapping
PUT docs/doc/_mapping
{
  "doc" : {
    "properties" : {
      "content" : {
        "type" : "string",
        "store" : "yes",
        "analyzer" : "french"
      },
      "meta" : {
        "properties" : {
          "author" : {
              "type" : "string",
              "store" : "yes"
          },
          "title" : {
              "type" : "string",
              "store" : "yes"
          },
          "date" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "keywords" : {
              "type" : "string",
              "store" : "yes"
          }
        }
      },
      "file" : {
        "properties" : {
          "content_type" : {
              "type" : "string",
              "analyzer" : "not_analyzed",
              "store" : "yes"
          },
          "last_modified" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "indexing_date" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "filesize" : {
              "type" : "long",
              "store" : "yes"
          },
          "indexed_chars" : {
              "type" : "long",
              "store" : "yes"
          },
          "filename" : {
              "type" : "string",
              "analyzer" : "not_analyzed",
              "store" : "yes"
          },
          "url" : {
              "type" : "string",
              "store" : "yes",
              "index" : "no"
          }
        }
      },
      "path" : {
        "properties" : {
          "encoded" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "virtual" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "root" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "real" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          }
        }
      }
    }
  }
}

Generated fields

FS crawler creates the following fields:

| Field              | Description                               | Example                                      |
|--------------------|-------------------------------------------|----------------------------------------------|
| content            | Extracted content                         | "This is my text!"                           |
| attachment         | BASE64 encoded binary file                | BASE64 encoded document                      |
| meta.author        | Author if any in document metadata        | "David Pilato"                               |
| meta.title         | Title if any in document metadata         | "My document title"                          |
| meta.date          | Document date if any in document metadata | "2013-04-04T15:21:35"                        |
| meta.keywords      | Keywords if any in document metadata      | ["river","fs","elasticsearch"]               |
| file.content_type  | Content type                              | "application/vnd.oasis.opendocument.text"    |
| file.last_modified | Last modification date                    | 1386855978000                                |
| file.indexing_date | Indexing date                             | "2013-12-12T13:50:58.758Z"                   |
| file.filesize      | File size in bytes                        | 1256362                                      |
| file.indexed_chars | Extracted chars if fs.indexed_chars > 0   | 100000                                       |
| file.filename      | Original file name                        | "mydocument.pdf"                             |
| file.url           | Original file url                         | "file:///tmp/mydir/otherdir/mydocument.pdf"  |
| path.encoded       | MD5 encoded file path (for internal use)  | "112aed83738239dbfe4485f024cd4ce1"           |
| path.virtual       | Relative path from root path              | "mydir/otherdir"                             |
| path.root          | MD5 encoded root path (for internal use)  | "112aed83738239dbfe4485f024cd4ce1"           |
| path.real          | Actual real path name                     | "/tmp/mydir/otherdir/mydocument.pdf"         |

Here is a typical JSON document generated by the crawler:

{
   "file":{
      "filename":"test.odt",
      "last_modified":1386855978000,
      "indexing_date":"2013-12-12T13:50:58.758Z",
      "content_type":"application/vnd.oasis.opendocument.text",
      "url":"file:///tmp/testfs_metadata/test.odt",
      "indexed_chars":100000,
      "filesize":8355
   },
   "path":{
      "encoded":"bceb3913f6d793e915beb70a4735592",
      "root":"bceb3913f6d793e915beb70a4735592",
      "virtual":"",
      "real":"/tmp/testfs_metadata/test.odt"
   },
   "meta":{
      "author":"David Pilato",
      "title":"Mon titre",
      "date":"2013-04-04T15:21:35",
      "keywords":[
         "fs",
         "elasticsearch",
         "crawler"
      ]
   },
   "content":"Bonjour David\n\n\n"
}

Advanced search

You can use the generated meta fields to search on:

GET docs/doc/_search
{
  "query" : {
    "term" : {
        "file.filename" : "mydocument.pdf"
    }
  }
}

Storing binary source document (BASE64 encoded)

You can store the binary document itself in elasticsearch using the store_source option:

{
  "name" : "test",
  "fs" : {
    "store_source" : true
  }
}

In that case, a new stored field named attachment is added to the generated JSON document. If you let FS crawler generate the mapping, it will exclude the attachment field from _source to save some disk space.

That means you need to ask for the attachment field explicitly when querying:

GET mydocs/doc/_search
{
  "fields" : ["attachment", "_source"],
  "query":{
    "match_all" : {}
  }
}

The default generated mapping in this case is:

{
  "doc" : {
    "_source" : {
      "excludes" : [ "attachment" ]
    },
    "properties" : {
      "attachment" : {
        "type" : "binary"
      }
      // ... Other properties here
    }
  }
}

You can also choose not to store the attachment field and keep the attachment in _source instead:

# Create index
PUT docs

# Create the mapping
PUT docs/doc/_mapping
{
  "doc" : {
    "properties" : {
      "attachment" : {
        "type" : "binary",
        "store" : "no"
      }
      // ... Other properties here
    }
  }
}

Extracted characters

By default, FS crawler extracts only the first 100 000 characters. You can set indexed_chars in the FS crawler settings to override this default, for example to 5000:

{
  "name": "test",
  "fs": {
    "indexed_chars": "5000"
  }
}

This number can either be a fixed number of characters, or a percentage using the % sign. A percentage is applied to the file size to determine the number of characters the crawler needs to extract.

If you want to index only 80% of the file size, set indexed_chars to "80%". Of course, if you want to index the full document, you can set this property to "100%". Decimal values are also supported, so "0.01%" is a correct value too.

Compressed files: if your file is compressed, you might need to increase indexed_chars to more than "100%", for example "150%".

If you want to extract the full content with no limit, set indexed_chars to "-1".
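For example, to extract the full content regardless of file size:

{
  "name": "test",
  "fs": {
    "indexed_chars": "-1"
  }
}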

Note: Tika needs to allocate an in-memory data structure to extract text. Setting indexed_chars to a high number will require more memory!

Elasticsearch settings

You can change elasticsearch-related settings within the elasticsearch object.

Index name

By default, FS crawler will index your data in an index whose name is the same as the crawler name (the name property). You can change it by setting the index field:

{
  "name" : "test",
  "elasticsearch" : {
    "index" : "docs"
  }
}

Type name

By default, FS crawler will index your data using doc as the type name. You can change it by setting the type field:

{
  "name" : "test",
  "elasticsearch" : {
    "type" : "mydocument"
  }
}

Bulk settings

FS crawler uses bulk requests to send data to elasticsearch. By default, a bulk is executed every 100 operations or every 5 seconds. You can change these defaults using bulk_size and flush_interval:

{
  "name" : "test",
  "elasticsearch" : {
    "bulk_size" : 1000,
    "flush_interval" : "2s"
  }
}

Node settings

FS crawler uses the elasticsearch transport client to send data to your running cluster. By default, it connects to 127.0.0.1 on port 9300, which is the default when running a local node on your machine.

Of course, in production, you would probably change this and connect to a production cluster:

{
  "name" : "test",
  "elasticsearch" : {
    "nodes" : [
      { "host" : "mynode1.mycompany.com", "port" : 9300 }
    ]
  }
}

You can define multiple nodes:

{
  "name" : "test",
  "elasticsearch" : {
    "nodes" : [
      { "host" : "mynode1.mycompany.com", "port" : 9300 },
      { "host" : "mynode2.mycompany.com", "port" : 9300 },
      { "host" : "mynode3.mycompany.com", "port" : 9300 }
    ]
  }
}

Note: the cluster.name does not have to be set as it's ignored.

License

This software is licensed under the Apache 2 license, quoted below.

Copyright 2011-2015 David Pilato

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.
