drumknott

An experimental digital clerk.

These are just notes on an idea for now: some kind of digital preservation workbench that runs in the browser. A bit like CyberChef, but more of a usable 'workbench' for digital preservation practitioners.

Goal

We need to do basic things like scanning files, checking checksums, and identifying formats. This is often harder than it should be, because installing the necessary software is a pain. Can we run something useful using just the browser? A lot of tools can run in, or be cross-compiled to, WebAssembly (WASM), so maybe we can make something useful that way.
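
As a concrete example, checksums can already be computed client-side with the browser's built-in Web Crypto API. A minimal TypeScript sketch (the #files input id is just made up for illustration):

    // Compute the SHA-512 hash of a file chosen in the browser, via the Web Crypto API.
    // Note: file.arrayBuffer() reads the whole file into memory; streaming hashes
    // would need something like a WASM hash implementation instead.
    async function sha512Hex(file: File): Promise<string> {
      const data = await file.arrayBuffer();
      const digest = await crypto.subtle.digest("SHA-512", data);
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
    }

    // Hook it up to a hypothetical <input type="file" id="files" multiple> element.
    document.getElementById("files")?.addEventListener("change", async (event) => {
      const files = (event.target as HTMLInputElement).files ?? [];
      for (const file of files) {
        console.log(file.name, file.size, await sha512Hex(file));
      }
    });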

Outline

In principle, a pdfcpu-based PDF validator would be pretty simple to build, for example. More generally, we would probably need some kind of standard API/wrapper so that different format-specific WASM modules can be pulled in dynamically, depending on what we find.
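
A rough TypeScript sketch of what that dynamic loading might look like. The registry, module URLs and MIME-type keys below are all hypothetical, and a real module (e.g. a pdfcpu build) would need its own glue code and import object rather than the empty one used here:

    // Hypothetical registry mapping identified formats to format-specific WASM tools.
    const toolRegistry: Record<string, string> = {
      "application/pdf": "/tools/pdf-validator.wasm",
      "image/tiff": "/tools/tiff-checker.wasm",
    };

    // Fetch and instantiate the right module on demand, so the workbench only
    // downloads the tools it actually needs for the formats it encounters.
    async function loadToolFor(mimeType: string): Promise<WebAssembly.Instance | undefined> {
      const url = toolRegistry[mimeType];
      if (!url) return undefined;
      const { instance } = await WebAssembly.instantiateStreaming(fetch(url), {});
      return instance;
    }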

Older Notes

Monitoring your files.

Links to: Preservation With The Cloud

erDiagram
    COLLECTION ||--o{ STORE : storedOn
    COLLECTION ||--|{ ITEM : hasItems
    COLLECTION {
        int id
        string title
    }
    ITEM {
        int id
        int collectionId
        string path
        int expected_size
        string expected_hashSHA512
    }
    STORE {
        int id
        int collectionId
        string remotePrefix
        boolean allowDeletes
    }
    ITEM ||--|{ FILE: hasCopy
    STORE ||--|| FILE: storedOn
    FILE {
        int storeId
        int fileId
        int size
        string hashSHA512
    }

Rough Notes

Collections
ID, Title (unique)
1, UKWA NPLD

Remotes/Stores
ID, Collection_ID, Remote (unique), Allow_Deletes
1, 1, 'hdfs://namenode:8012/heritrix/output', 'false'
2, 1, 'hdfs://h3nn:8012/heritrix/output', 'false'
3, 1, 'hdfs://nlsh3nn:8012/heritrix/output', 'false'
4, 1, 'file://mnt/gluster/heritrix/output', 'true'

(i.e. we manage replica relationships as commands:

  • h020 copy-to nls.h3
  • gluster move-to h3

But the Allow_Deletes field lets us control whether move-to is allowed, so it is not invoked accidentally; see the sketch below.)
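
A minimal sketch of that guard, in TypeScript; the type and field names just mirror the notes above and are not a real API:

    // A store/remote as described above; only the fields needed for the check.
    interface Store { remotePrefix: string; allowDeletes: boolean; }

    // Replica management expressed as explicit commands.
    type ReplicaCommand =
      | { kind: "copy-to"; from: Store; to: Store }
      | { kind: "move-to"; from: Store; to: Store };

    // 'move-to' implies deleting from the source, so refuse it unless the
    // source store explicitly allows deletes.
    function checkCommand(cmd: ReplicaCommand): void {
      if (cmd.kind === "move-to" && !cmd.from.allowDeletes) {
        throw new Error(`move-to refused: deletes are not allowed on ${cmd.from.remotePrefix}`);
      }
    }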

Files
ID, Collection_ID, Path, Verified_Size, Verified_Hash
1, 1, frequent-npld/test.txt, NULL, NULL

Replicas
Remote_ID, File_ID, Size, Last_Seen

Replica_Sizes
Remote_ID, File_ID, Size

Hashes
HASH, Size, ???

Replica_Hashes
Remote_ID, File_ID, HASH?

Questions:

  • Which files are replicated, based on path and size?
  • Which files are replicated, based on path, size and hash?
  • Which files need replication? (See the sketch below.)
  • Which files have changed? What changed, and when?
  • Which files have been indexed, etc.?
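
As an illustration, 'which files need replication?' might be answered with a query along these lines. This is only a sketch: the lower-case table and column names are assumptions based on the rough schema above, and the Db helper is hypothetical:

    interface Db { query(sql: string): Promise<Record<string, unknown>[]>; }

    // Files with fewer size-matching replicas than there are stores in their collection.
    async function filesNeedingReplication(db: Db) {
      return db.query(`
        SELECT f.id, f.path
          FROM files f
          JOIN stores s ON s.collection_id = f.collection_id
          LEFT JOIN replicas r ON r.file_id = f.id
                              AND r.remote_id = s.id
                              AND r.size = f.verified_size
         GROUP BY f.id, f.path
        HAVING COUNT(r.remote_id) < COUNT(s.id)
      `);
    }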

JOBS:

  • Scan source X (in chunks) and update records.
  • Find conflicts/changes and raise them for review.
  • Review conflicts and resolve them (record the correct hash and size, or remove old out-of-date records) ???
    • A file is in one place but not in the other(s) (needs replication).
    • Files differ between places (the correct hash and size need to be established).
    • Files have changed size or hash (need removing to permit an update?).
  • Find files that need replicating, run a job to replicate them, and record the result.
  • Find files that need hash-checking, check them, and record the result.
  • Find files that need indexing, index them, and record the result.
  • ANTIJOB: Interfaces do not permit deletion, only replacement of replicas marked as bad?

A good critical case is how it handles partial replication and recovery...

To check for changes, we need to perform a transaction like:

SQL: BEGIN
SQL: Get the record for this replica (and lock the row?)
CODE: Compare with the record, if any.
SQL: INSERT or UPDATE as needed.
SQL: COMMIT
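
A sketch of that transaction in TypeScript, against a hypothetical Db/Tx interface. The single replicas table with a hash column, and the conflicts table used to raise changes for review, are simplifying assumptions rather than the exact schema above:

    // Hypothetical minimal transactional database handle (not a real library API).
    interface Tx {
      get<T>(sql: string, ...params: unknown[]): Promise<T | undefined>;
      run(sql: string, ...params: unknown[]): Promise<void>;
    }
    interface Db { transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>; }

    // Compare a freshly scanned replica with the stored record, then INSERT or
    // UPDATE inside one transaction; SELECT ... FOR UPDATE locks the row.
    async function recordScan(db: Db, remoteId: number, fileId: number, size: number, hash: string) {
      await db.transaction(async (tx) => {
        const existing = await tx.get<{ size: number; hash: string }>(
          "SELECT size, hash FROM replicas WHERE remote_id = ? AND file_id = ? FOR UPDATE",
          remoteId, fileId);
        if (!existing) {
          await tx.run(
            "INSERT INTO replicas (remote_id, file_id, size, hash, last_seen) VALUES (?, ?, ?, ?, NOW())",
            remoteId, fileId, size, hash);
        } else if (existing.size !== size || existing.hash !== hash) {
          // Something has changed: raise it for review rather than silently overwriting.
          await tx.run(
            "INSERT INTO conflicts (remote_id, file_id, new_size, new_hash) VALUES (?, ?, ?, ?)",
            remoteId, fileId, size, hash);
        } else {
          await tx.run(
            "UPDATE replicas SET last_seen = NOW() WHERE remote_id = ? AND file_id = ?",
            remoteId, fileId);
        }
      });
    }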

Or

KEY(Remote_ID, File_ID, Size), plus a LAST_SCAN_DATE column; each UPSERT refreshes the scan date.

QUERIES can then GROUP BY (Remote_ID, File_ID) and, if COUNT > 1, raise an alarm that there has been a change.
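
A sketch of that alternative, using a PostgreSQL-style ON CONFLICT upsert; the table and column names are again assumptions based on the notes above:

    interface Db {
      run(sql: string, ...params: unknown[]): Promise<void>;
      query(sql: string): Promise<Record<string, unknown>[]>;
    }

    // Each scan UPSERTs on the full key, so an unchanged file just refreshes its
    // last_scan_date, while a changed size produces a second row for the same file.
    async function upsertScan(db: Db, remoteId: number, fileId: number, size: number) {
      await db.run(`
        INSERT INTO replica_sizes (remote_id, file_id, size, last_scan_date)
        VALUES (?, ?, ?, NOW())
        ON CONFLICT (remote_id, file_id, size) DO UPDATE SET last_scan_date = NOW()
      `, remoteId, fileId, size);
    }

    // More than one row per (remote_id, file_id) means the file has changed.
    async function changedFiles(db: Db) {
      return db.query(`
        SELECT remote_id, file_id, COUNT(*) AS versions
          FROM replica_sizes
         GROUP BY remote_id, file_id
        HAVING COUNT(*) > 1
      `);
    }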

How do we specify which copy is the master, i.e. what the authoritative hash is?

Replication rules? Or implement as commands?

  • Remote2 CopyTo Remote3
  • Remote1 MoveTo Remote2 ??? OR
  • Remote1 ClearIf Remote2 (check that the local file hash matches the DB record, retrieve the remote copy's hash(es), and only delete if they all agree; see the sketch below).
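
A sketch of that ClearIf check, purely illustrative; the record shape and the deletion callback are assumptions:

    interface ReplicaRecord { remoteId: number; hash: string; }

    // Only clear the local copy when its freshly computed hash matches the DB
    // record and every other replica reports that same hash.
    async function clearIf(
      localHash: string,                // hash just computed from the local file
      dbHash: string,                   // hash recorded in the database
      otherReplicas: ReplicaRecord[],   // current records for the other remote(s)
      deleteLocal: () => Promise<void>, // hypothetical deletion callback
    ): Promise<boolean> {
      const safeToDelete =
        localHash === dbHash &&
        otherReplicas.length > 0 &&
        otherReplicas.every((r) => r.hash === dbHash);
      if (safeToDelete) {
        await deleteLocal();
        return true;
      }
      return false; // something does not match, so keep the local copy
    }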

How can we be sure replication has been verified?

  • Check that the file is present on the expected number of systems, that the hash has been calculated recently on each of them, and that the hashes matched.
  • Current approach:
    • Copy step, then verify that the hashes match.
    • Re-check the hashes and delete from the 'moved-from' location.

Use rclone lsjson as a normalized format and the main way of operating?

https://rclone.org/commands/rclone_lsjson/

e.g.

{
  "Hashes" : {
     "SHA-1" : "f572d396fae9206628714fb2ce00f72e94f2258f",
     "MD5" : "b1946ac92492d2347c6235b4d2611184",
     "DropboxHash" : "ecb65bb98f9d905b70458986c39fcbad7715e5f2fcc3b1f07767d7c83e2438cc"
  },
  "ID": "y2djkhiujf83u33",
  "OrigID": "UYOJVTUW00Q1RzTDA",
  "IsBucket" : false,
  "IsDir" : false,
  "MimeType" : "application/octet-stream",
  "ModTime" : "2017-05-31T16:15:57.034468261+01:00",
  "Name" : "file.txt",
  "Encrypted" : "v0qpsdq8anpci8n929v3uu9338",
  "EncryptedPath" : "kja9098349023498/v0qpsdq8anpci8n929v3uu9338",
  "Path" : "full/path/goes/here/file.txt",
  "Size" : 6,
  "Tier" : "hot",
}
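
A sketch of normalising that output into replica-style records. The field names follow the example above; note that rclone only includes Hashes when asked (e.g. with --hash), and the available hash types depend on the backend:

    // The subset of an rclone lsjson entry this sketch uses.
    interface RcloneEntry {
      Path: string;
      Name: string;
      Size: number;
      ModTime: string;
      IsDir: boolean;
      MimeType?: string;
      Hashes?: Record<string, string>;
    }

    // Turn a listing for one store into flat replica records (storeId and the
    // output field names are assumptions, loosely following the schema above).
    function toReplicaRecords(storeId: number, listing: RcloneEntry[]) {
      return listing
        .filter((e) => !e.IsDir)
        .map((e) => ({
          storeId,
          path: e.Path,
          size: e.Size,
          modTime: new Date(e.ModTime),
          md5: e.Hashes?.["MD5"],
          sha1: e.Hashes?.["SHA-1"],
        }));
    }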

Implementation Ideas:

License: GNU Affero General Public License v3.0