TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust

Support serializing internal state

Kixiron opened this issue

I'm working on a user-facing application where fast response times are the goal, and I've really been wanting a way to cache the internal state of dataflows so that they can be quickly recreated across restarts. That would make restarts as fast as possible while also skipping work that was already done in previous lifetimes of the program.

@Kixiron have you discovered any way to make this possible? I'm also interested—I would really love a DB that's like SQLite, but differential 😻

I wandered around the code base a bit. I'm not sure it's possible without patching, or without wrapping all the objects to log state, because the subgraph fields are private. These are the areas of interest I saw from digging around:

let mut operator = subscope.into_inner().build(self);

pub struct SubgraphBuilder<TOuter, TInner>
where
    TOuter: Timestamp,
    TInner: Timestamp,
{
    /// The name of this subgraph.
    pub name: String,
    /// A sequence of integers uniquely identifying the subgraph.
    pub path: Vec<usize>,
    /// The index assigned to the subgraph by its parent.
    index: usize,
    // handles to the children of the scope. index i corresponds to entry i-1, unless things change.
    children: Vec<PerOperatorState<TInner>>,
    child_count: usize,
    edge_stash: Vec<(Source, Target)>,
    // shared state written to by the datapath, counting records entering this subgraph instance.
    input_messages: Vec<Rc<RefCell<ChangeBatch<TInner>>>>,
    // expressed capabilities, used to filter changes against.
    output_capabilities: Vec<MutableAntichain<TOuter>>,
    /// Logging handle
    logging: Option<Logger>,
    /// Progress logging handle
    progress_logging: Option<ProgressLogger>,
}

pub subgraph: &'a RefCell<SubgraphBuilder<G::Timestamp, T>>,

paths: Rc<RefCell<HashMap<usize, Vec<usize>>>>,

Worker.paths is also looking very interesting!
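
For what it's worth, a map of that shape can be copied out into plain owned data without touching the `Rc<RefCell<...>>` wrapper itself. The following is only a minimal sketch using the standard library: `SharedPaths`, `snapshot_paths`, and `encode_paths` are hypothetical names, not part of timely, and it assumes you could reach the field at all (the thread above suggests that may require patching). It would also capture only integer identifiers and their path vectors, not any operator state.

```rust
use std::cell::RefCell;
use std::collections::{BTreeMap, HashMap};
use std::rc::Rc;

/// Hypothetical alias with the same shape as the `paths` field quoted above.
type SharedPaths = Rc<RefCell<HashMap<usize, Vec<usize>>>>;

/// Copy the shared map into owned data. A BTreeMap gives a deterministic
/// key order, which HashMap iteration does not.
fn snapshot_paths(paths: &SharedPaths) -> BTreeMap<usize, Vec<usize>> {
    paths
        .borrow()
        .iter()
        .map(|(id, path)| (*id, path.clone()))
        .collect()
}

/// Encode the snapshot as one `id:a,b,c` line per entry.
/// The text format is made up for this sketch.
fn encode_paths(snapshot: &BTreeMap<usize, Vec<usize>>) -> String {
    snapshot
        .iter()
        .map(|(id, path)| {
            let parts: Vec<String> = path.iter().map(|p| p.to_string()).collect();
            format!("{}:{}", id, parts.join(","))
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Going through an owned snapshot means the `RefCell` borrow is released before any encoding or I/O happens, so the worker would not be blocked from mutating the map while the snapshot is written out.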

Unfortunately not. My hopes are mostly on disk-backed differential arrangements, but I don't think there's been much progress towards that.

disk backed differential arrangements

Sameee, I would love that please. Even just applying simple maps would be fine for me right now. Not that the incremental logic is hard to write, but I'd rather not if I don't have to. Have you been exploring what it would take for what you're imagining?

I'm wondering what would happen if I just started applying Serialize & Deserialize to things until something interesting happens 🤣

By and large it's a significantly more complex problem than just adding Serialize to things. Dataflow construction isn't the expensive part of reviving a dataflow; the expense lies in rebuilding indices over massive amounts of data.
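
To make the cost argument concrete, here is a toy sketch in plain Rust. It is not differential dataflow's arrangement code: `Update` and `rebuild_index` are hypothetical names, and the sketch ignores timestamps entirely by collapsing all updates into a single accumulated index. The point it illustrates is only that rebuilding an index from a persisted log of updates has to visit every record, so the work grows with the amount of data rather than with the size of the dataflow graph.

```rust
use std::collections::BTreeMap;

/// One update in a (data, time, diff) style; the data here is a key-value pair.
/// Hypothetical type for this sketch.
type Update = ((u64, String), u64, i64);

/// Rebuild a key-ordered index from a log of updates by accumulating diffs
/// and dropping records whose net count is zero.
/// Every update in the log is visited, so the cost scales with the data.
fn rebuild_index(log: &[Update]) -> BTreeMap<(u64, String), i64> {
    let mut index = BTreeMap::new();
    for ((key, val), _time, diff) in log {
        // Timestamps are ignored in this toy version.
        *index.entry((*key, val.clone())).or_insert(0) += *diff;
    }
    // Records that were added and later retracted cancel out.
    index.retain(|_, count| *count != 0);
    index
}
```

Persisting only the inputs and replaying them through a function like this on every restart repeats all of that work, which is why the discussion above points at persisting the built indices themselves.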