lovoo / goka

Goka is a compact yet powerful distributed stream processing library for Apache Kafka written in Go.

How to use Goka to move data from group table state to a different storage system

devcorpio opened this issue · comments

Hello! 👋

I'm at the stage where I have a system that does the following things:

  • ingests and emits messages to Kafka
  • consumes the messages and performs the aggregations (the API is so easy to use!)

At this point, I would like to move all the aggregations I have to a different database (e.g. Redis, MySQL, PostgreSQL, etc.).

What I would like to do is to have a process that:

  1. reads the aggregated state every N seconds/minutes and performs a set of additional transformations (if needed).
  2. moves that data to a database.
  3. repeats steps 1 and 2 while the process is running.

P.S. If I stop the process and start it again, it should not give me the same aggregations again, so it should keep track of what has already been processed.

Views

I have been checking the views feature, but I believe it's not possible. For instance, I cannot query by a range of timestamps. I have seen IteratorWithRange, but I believe the values the function expects (both are strings) correspond to message keys. I also don't have a way to avoid re-querying data that has already been queried, since there is no "offset" to use.
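
To make that concrete, this is roughly the kind of full-table sweep I would end up doing with a view (just a sketch on my side; the group name "aggregator", the broker address and the User type/codec are made-up placeholders):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/lovoo/goka"
)

// User and UserCodec are placeholders for my aggregated value.
type User struct {
	ClickCount  int `json:"click_count"`
	PageVisited int `json:"page_visited"`
}

type UserCodec struct{}

func (c *UserCodec) Encode(value interface{}) ([]byte, error) { return json.Marshal(value) }
func (c *UserCodec) Decode(data []byte) (interface{}, error) {
	var u User
	err := json.Unmarshal(data, &u)
	return &u, err
}

func main() {
	brokers := []string{"localhost:9092"} // placeholder

	// view over the processor's group table ("aggregator" is a made-up group name)
	view, err := goka.NewView(brokers, goka.GroupTable("aggregator"), new(UserCodec))
	if err != nil {
		log.Fatal(err)
	}
	go view.Run(context.Background())
	// (in a real program I'd wait for view.Recovered() before iterating)

	it, err := view.Iterator()
	if err != nil {
		log.Fatal(err)
	}
	defer it.Release()
	for it.Next() {
		val, err := it.Value()
		if err != nil {
			continue
		}
		// it.Key() is the message key (a user_id); IteratorWithRange only bounds
		// keys like this one, not timestamps, and there is no offset to resume from.
		fmt.Println(it.Key(), val)
	}
}
```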

Probably I'm missing something here, or I am giving views a responsibility they don't have, so maybe I do not understand what their goal is.

Alternative

Perhaps I can achieve what I want by creating a processor that consumes the group table topic directly from Kafka instead of using views (it would meet the requirements of the process I described above), and then control the speed of the consumer/processor via max.poll.records, etc., so the state would be mirrored as expected.

Does that make sense? Do you see any drawbacks related to that approach?

Thanks for your time
Alberto

Hi @devcorpio ,

as far as I understand you need to dump table data to another database. What is not clear is:

  • what triggers the dump?
    • a message update?
    • a trigger outside which dumps the whole table?
  • is the rate of dumping dependent on the individual messages, or can you sweep over the whole table and dump everything regularly?
  • is the table exported 1:1 to the external database, or does the aggregation you mentioned combine multiple entries into one entry in the external database? Maybe it makes sense to aggregate the data for export in another processor and then dump that table instead?
  • what table sizes are we talking about? It makes a difference whether you're iterating over a view with a couple of thousand entries or whether we're talking about millions, which - aggregating in memory - might be too expensive.

Some abstract/exemplary data structures to talk about would be quite helpful I think, maybe that'd clear things up a bit.

Hi @frairon

Thanks a lot for answering!

--

I'm attaching this small diagram:

[diagram: goka-github-issue]

  1. A user interacts with the website (clicks, visits individual pages)
  2. Each interaction event is sent to a server
  3. The server emits messages with that info to Kafka (producer side) -> the key of the message is the user_id
  4. There are processors consuming those messages and aggregating info using Goka's group tables and Persist (with LevelDB local storage) - see the sketch below
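
For reference, the aggregation part (steps 3 and 4) looks roughly like this; it's a simplified, self-contained sketch with made-up topic and type names (Event, User, the codecs and the broker address), not my exact code:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/lovoo/goka"
)

// Placeholder event emitted by the server, keyed by user_id.
type Event struct {
	Type string `json:"type"` // "click" or "page_visit"
}

// Per-user aggregate kept in the group table.
type User struct {
	ClickCount  int `json:"click_count"`
	PageVisited int `json:"page_visited"`
}

type EventCodec struct{}

func (c *EventCodec) Encode(v interface{}) ([]byte, error) { return json.Marshal(v) }
func (c *EventCodec) Decode(d []byte) (interface{}, error) {
	var e Event
	err := json.Unmarshal(d, &e)
	return &e, err
}

type UserCodec struct{}

func (c *UserCodec) Encode(v interface{}) ([]byte, error) { return json.Marshal(v) }
func (c *UserCodec) Decode(d []byte) (interface{}, error) {
	var u User
	err := json.Unmarshal(d, &u)
	return &u, err
}

func main() {
	brokers := []string{"localhost:9092"} // placeholder

	g := goka.DefineGroup("aggregator",
		// "user-interactions" is a made-up stream name
		goka.Input("user-interactions", new(EventCodec), func(ctx goka.Context, msg interface{}) {
			ev := msg.(*Event)

			// load the current aggregate for this user_id (ctx.Key()), or start fresh
			var u *User
			if v := ctx.Value(); v != nil {
				u = v.(*User)
			} else {
				u = new(User)
			}

			switch ev.Type {
			case "click":
				u.ClickCount++
			case "page_visit":
				u.PageVisited++
			}

			// write the aggregate back; Goka persists it in LevelDB and in the
			// group's changelog topic
			ctx.SetValue(u)
		}),
		goka.Persist(new(UserCodec)),
	)

	proc, err := goka.NewProcessor(brokers, g)
	if err != nil {
		log.Fatal(err)
	}
	if err := proc.Run(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```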

My question is related to this:

I would like a "realtime" dashboard app where I can see all the information aggregated over time and for every user. Since I'm expecting hundreds of thousands (or even millions) of different user_ids, I would like to have the same aggregations performed by Goka available in PostgreSQL, so the data will be easy to query, etc.

Imagine that at some point in time the LevelDB state contains something like this:

user_id | value
1       | { click_count: 35, page_visited: 12 }
2       | { click_count: 36, page_visited: 7 }
[a million more...]

So, eventually (the faster the better), I would like to have the very same state available in PostgreSQL. Using views, I would have to iterate over the whole table again and again. I'm saying "again and again" because it's perfectly possible that 2 minutes after the first iteration of the table the click_count for user_id 1 is 39, so I would need to iterate again to make sure that value is transferred to PostgreSQL.

what triggers the dump?

The idea would be to have a process (or processes) that mirrors the state store to PostgreSQL, little by little. For instance: "please, every 15 seconds pick these 1000 records and send them to PostgreSQL".

At first I was thinking of using views, but I would not like to iterate over the whole table again and again (there might be millions of rows), as I mentioned above. (I'm a broken record 😄)

That's why I was thinking about the alternative. Given that Goka keeps a changelog (emitting to the topic-group-table) with all the aggregations that have been happening, the idea would be to do what Goka does with that topic (e.g. during partition rebalancing), but instead of materializing the current state in LevelDB, that state would be materialized in PostgreSQL.

So I would have different processors reading the changelog topic. For doing so, I would like to know if it's possible for Goka to consume messages in "batches" every N seconds/minutes to communicate with PostgreSQL in a "controlled way"; otherwise I would be communicating with PostgreSQL for every message I consume.

I have seen that Kafka provides configs such as max.poll.interval.ms and max.poll.records that might help "slow down" the consumption of messages, but I'm not sure if that's a good idea or whether Goka even allows setting that config.
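
For what it's worth, max.poll.records is a Java-client setting; since Goka sits on top of Sarama, I guess the closest thing would be tweaking the global Sarama config that Goka uses. This is only an assumption on my side, and the fetch/buffer knobs below are not an exact equivalent:

```go
// Rough sketch (my assumption): tune Sarama's fetch behaviour through Goka's
// global config. These values control fetch sizes and wait times, not an
// exact "max.poll.records".
cfg := goka.DefaultConfig()
cfg.Consumer.Fetch.Default = 64 * 1024            // bytes fetched per partition per request
cfg.Consumer.MaxWaitTime = 500 * time.Millisecond // how long the broker may wait to fill a fetch
cfg.ChannelBufferSize = 256                       // messages buffered client-side
goka.ReplaceGlobalConfig(cfg)
```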

I understand that maybe Goka should not be the only tool (or not the tool at all) to perform such state mirroring. I have seen there is a tool named Kafka Connect that might be helpful for this need. I would have to explore it a bit first, though.

--

I believe that with this explanation I have answered the questions you asked. If there are still points to clarify, please let me know.

Thanks a lot for investing your time in helping the community.
Alberto

Hi @devcorpio ,
thanks for this awesome explanation. It did make things much clearer.

First, you can use Kafka Connect for mirroring, but that would mean adding another tool in a different language, which then needs to understand the data in Kafka. If that's not a problem and the approach is fast enough - go for it.

The batching, however, can also be achieved using pure Go/Goka tools. I don't have any experience with modifying the consumer settings like max.poll.... Maybe it's worth a try, but you'd still receive the messages one by one and would need to set up some batching.

In goka, the callback-context provides a function DeferCommit, which skips the auto-commit and instead returns a function that must be called whenever the message should be marked as committed. That way the processor will receive the message again if the dumper crashes or the Postgres insertion fails for some reason. Note that this means you might get duplicates if the dump partly worked but there were errors later in the process.
Working with external services, as in your case, was actually the reason to add this feature, so we can use it here. Here's some example code of how it might work.
Lots of code has been omitted, like error handling, proper starting of the processors, shutdown and all that - it doesn't even compile...
The idea is to show how a processor can be used to store data in an external data structure and use DeferCommit to intercept the processor-commit. You'll figure it out :)

```go
package main

import (
	"context"
	"sync"
	"time"

	"github.com/lovoo/goka"
)

type User struct {
	ClickCount int
	PageVisit  int
}

// to be implemented
type UserCodec struct{}

func defineProc() {
	// placeholders: broker list and a context; cancellation/shutdown is omitted
	brokers := []string{"localhost:9092"}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// define the collector processor that builds the aggregated group table
	collector, _ := goka.NewProcessor(brokers, goka.DefineGroup(
		"collector",
		goka.Input("clicks", ...),
		goka.Input("page_visits", ...),
		goka.Persist(new(UserCodec)),
	))

	// those variables should probably be nicely moved to their own data structure
	var (
		mBatch     sync.Mutex
		batch      = map[string]*User{}
		committers []func(error)
	)

	// commit the current batch
	commit := func(ctx context.Context) {
		// swap the batch and its committers under the lock
		mBatch.Lock()
		batchToWrite := batch
		toCommit := committers
		batch = map[string]*User{}
		committers = nil
		mBatch.Unlock()

		if len(batchToWrite) == 0 {
			return
		}

		// write the batch to Postgres; dbClient (not shown) is the Postgres client
		err := dbClient.Write(ctx, ...)

		for _, c := range toCommit {
			// passing an error will actually not commit the message and the processor
			// will shut down instead and receive the message next time again.
			// If the error should be tolerated, pass nil instead.
			c(err)
		}
	}

	dumper, _ := goka.NewProcessor(brokers, goka.DefineGroup(
		"dumper",
		// consume the collector-processor's group table updates as a stream
		goka.Input(goka.Stream(goka.GroupTable("collector")), new(UserCodec), func(ctx goka.Context, msg interface{}) {
			mBatch.Lock()
			defer mBatch.Unlock()

			// add the committer-function next to the batch, so that the commit-function
			// can call them one by one after the dump to Postgres succeeded (or not)
			committers = append(committers, ctx.DeferCommit())
			// store the update in the batch.
			// This approach wouldn't dump duplicate user-updates. If you want all updates,
			// use a slice instead of the map.
			batch[ctx.Key()] = msg.(*User)
		}),
	))

	// run both processors
	go collector.Run(ctx)
	go dumper.Run(ctx)

	// run a ticker-loop that triggers the dump every x seconds
	interval := 10 * time.Second
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			commit(ctx)
		}
	}
}
```

Maybe this approach could be the solution?

Hi @frairon,

Thanks a lot for the detailed answer!!

Great to know about the existence of DeferCommit and also thanks for the prototype you have written!

I will use the technique you described rather than Kafka Connect! Depending on how the software evolves I will decide whether it makes sense to switch to Kafka Connect in the future; it's too early to know that, so no extra tools if not needed (as you also commented).

P.S. I'm closing the issue since your answer has been very helpful

Cheers,
Alberto