googlearchive / gcsbeat

An Elastic Beat to ingest data from Google Cloud Storage

"Directories" are listed as pending files

josephlewis42 opened this issue

Synthetic directory entries are being picked up as processable files by the beat. The expected behavior is for the beat to ignore "directories" entirely. Cloud Storage has no concept of directories; it stores every object in a flat namespace with the whole path as the object's name, so a "directory" can show up as an empty bucket object.

For example, in a bucket that has 5 "directories":

[StorageProvider]	storage/logging.go:36	Fetching file list from server
[GCS:gcsone]	beater/gcsbeat.go:140	Found 5 files, already pending: 0, regex excluded: 5, new: 0

One option would be to skip any zero-length files; another would be to skip entries that have Prefix set in their ObjectAttrs. From the docs:

// Prefix is set only for ObjectAttrs which represent synthetic "directory
// entries" when iterating over buckets using Query.Delimiter. See
// ObjectIterator.Next. When set, no other fields in ObjectAttrs will be
// populated.
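
A minimal sketch of how the two options could be combined, not the beat's actual code. To stay runnable without the client library, `objectAttrs` is a hypothetical stand-in for the `Name`, `Prefix`, and `Size` fields of `storage.ObjectAttrs`, and `isDirectoryEntry` and `filterFiles` are made-up names. It also assumes that placeholder objects have names ending in "/", which is narrower than skipping every zero-length file:

```go
package main

import (
	"fmt"
	"strings"
)

// objectAttrs is a hypothetical stand-in for the fields of
// storage.ObjectAttrs (cloud.google.com/go/storage) used below.
type objectAttrs struct {
	Name   string // full object name, e.g. "logs/2018/app.log"
	Prefix string // set only for synthetic directory entries
	Size   int64  // object size in bytes
}

// isDirectoryEntry reports whether attrs looks like a "directory".
func isDirectoryEntry(attrs objectAttrs) bool {
	// Synthetic entry produced when listing with a delimiter.
	if attrs.Prefix != "" {
		return true
	}
	// Assumption: a zero-length object whose name ends in "/" is a
	// folder placeholder rather than a real (empty) file.
	return attrs.Size == 0 && strings.HasSuffix(attrs.Name, "/")
}

// filterFiles returns only the entries that should be processed.
func filterFiles(listing []objectAttrs) []objectAttrs {
	var files []objectAttrs
	for _, attrs := range listing {
		if !isDirectoryEntry(attrs) {
			files = append(files, attrs)
		}
	}
	return files
}

func main() {
	listing := []objectAttrs{
		{Prefix: "logs/"},
		{Name: "logs/2018/", Size: 0},
		{Name: "logs/2018/app.log", Size: 2048},
		{Name: "empty.log", Size: 0},
	}
	for _, f := range filterFiles(listing) {
		fmt.Println(f.Name)
	}
}
```

With this variant a genuinely empty file such as `empty.log` is still processed. Dropping the trailing-slash check gives the broader "skip any zero-length file" behavior instead.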