I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service using a Twitter client to supscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.
What I dont understand is given I model my Aggregare root as a List of TwitterMessages for example, after running for some time this aggregare root will be several gigabytes in size. There is no need to store all the TwitterMessages in memory since the goal of this one service is just to persist each incomming message and then publish the message out to Kafka for the next service to process.
How would I model my aggregate root as Persistent Entitie for a stream of messages without it consuming unlimited resources? Are there any example code showing this usage if Lagom?
Event sourcing is a good default go to, but not the right solution for everything. In your case it may not be the right approach. Firstly, do you need the Tweets persisted, or is it ok to publish them directly to Kafka?
Assuming you need them persisted, aggregates should store in memory whatever they need to validate incoming commands and generate new events. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages, rather, it could just be
NotUsed. Each time it gets a command it emits a new event for that Tweet. The thing here is, it's not really an aggregate, because you're not aggregating any state, you're just emitting events in response to commands with no invariants or anything. And so, you're not really using the Lagom persistent entity API for what it was made to be used for. Nevertheless, it may make sense to use it in this way anyway, it's a high level API that comes with a few useful things, including the streaming functionality. But there are also some gotchas that you should be aware of, you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially at a time. So maybe you could expect to handle 20 tweets a second, if you ever expect it to ever be more than that, then you're using the wrong approach, and you'll need to at a minimum distribute your tweets across multiple entities.
The other approach would be to simply store the messages directly in Cassandra yourself, and then publish directly to Kafka after doing that. This would be a lot simpler, a lot less mechanics involved, and it should scale very nicely, just make sure you choose your partition key columns in Cassandra wisely - I'd probably partition by user id.