A library for expressive and efficient service composition

Introduction

Summary

Clump is a Scala library that addresses the problem of knitting together data from multiple sources in an elegant and efficient way.

In a typical microservice-powered system, it is common to find awkward wrangling code to facilitate manually bulk-fetching dependent resources. Worse, this problem of batching is often accidentally overlooked, resulting in n calls to a micro-service instead of 1.

Clump removes the need for the developer to even think about bulk-fetching, batching and retries, providing a powerful and composable interface for aggregating resources.

An example of batching fetches using futures without Clump:

tracksService.get(trackIds).flatMap { tracks =>
  val userIds = tracks.map(_.creator)
  usersService.get(userIds).map { users =>
    val userMap = userIds.zip(users).toMap
    tracks.map { track =>
      new EnrichedTrack(track, userMap(track.creator))
    }
  }
}

The same composition using Clump:

Clump.traverse(trackIds) { trackId =>
  trackSource.get(trackId).flatMap { track =>
    userSource.get(track.creator).map { user =>
      new EnrichedTrack(track, user)
    }
  }
}

Or expressed more elegantly with a for-comprehension:

Clump.traverse(trackIds) { trackId =>
  for {
    track <- trackSource.get(trackId)
    user <- userSource.get(track.creator)
  } yield {
    new EnrichedTrack(track, user)
  }

Problem

The microservices architecture introduces many new challenges when dealing with complex systems. One of them is the high number of remote procedure calls and the cost associated to them. Among the techniques applied to amortize this cost, batching of requests has an important role. Instead of paying the price of one call for each interaction, many interactions are batched in only one call.

While batching introduces performance enhancements, it also introduces complexity to the codebase. The common approach is to extract as much information as possible about what needs to be fetched, perform the batched fetch and extract the individual values to compose the final result. The steps need to be repeated many times depending on how complex is the final structure.

An example of batching using futures:

tracksService.get(trackIds).flatMap { tracks =>
  val userIds = tracks.map(_.creator)
  usersService.get(userIds).map { users =>
    val userMap = userIds.zip(users).toMap
    tracks.map { track =>
      new EnrichedTrack(track, userMap(track.creator))
    }
  }
}

This example has only one level of nested resources. In a complex system, it is common to have several levels:

• timeline
  • track post
    • track
      • creator
  • track repost
    • track
      • creator
    • reposter
  • playlist post
    • playlist
      • track ids
      • creator
  • playlist repost
    • playlist
      • track ids
      • creator
    • reposter
  • comment
    • user
  • user follow
    • follower
    • followee

This structure can also be part of a bigger structure that includes the user's data for instance. Given this scenario, the code that is capable of batching requests in an optimal way is really complex and hard to maintain.

Solution

The complexity comes mainly from declaring together what needs to be fetched and how it should be fetched. Clump offers an embedded Domain-Specific Language (DSL) that allows declaration of what needs to be fetched and an execution model that determines how the resources should be fetched.

The execution model applies three main optimizations:

Batch requests when it is possible;
Fetch from the multiple sources in parallel;
Avoid fetching the same resource multiple times by using a cache.

The DSL is based on a monadic interface similar to Future. It is a Free Monad, that produces a nested series of transformations without starting the actual execution. This is the characteristic that allows triggering of the execution separately from the definition of what needs to be fetched.

The execution model leverages on Applicative Functors to express the independence of computations. It exposes only join to the user but makes use of other applicative operations internally. This means that even without the user specifying what is independent, the execution model can apply optimizations.

Getting started

To use clump, just add the dependency to the project's build configuration.

Important: Change x.x.x with the latest version listed by the CHANGELOG.md file.

SBT

libraryDependencies ++= Seq(
  "io.getclump" %% "clump" % "x.x.x"
)

Maven

<dependency>
    <groupId>io.getclump</groupId>
    <artifactId>clump</artifactId>
    <version>x.x.x</version>
</dependency>

Usage

Example

Example usage of Clump:

import clump.Clump

// Creates sources using the batched interfaces
val tracksSource = Clump.source(tracksService.fetch)(_.id)
val usersSource = Clump.source(usersService.fetch)(_.id)

def renderTrackPosts(userId: Long) = {

  // Defines the clump
  val clump: Clump[List[EnrichedTrack]] = enrichedTrackPosts(userId)

  // Triggers execution
  val future: Future[Option[List[EnrichedTrack]]] = clump.get

  // Renders the response
  future.map {
    case Some(trackPosts) => render.json(trackPosts)
    case None             => render.notFound
  }
}

// Composes a clump with the user's track posts
def enrichedTrackPosts(userId: Long) =
  for {
    trackPosts <- Clump.future(timelineService.fetchTrackPosts(userId))
    enrichedTracks <- Clump.traverse(trackPosts)(enrichedTrack(_))
  } yield {
    enrichedTracks
  }

// Composes an enriched track clump
def enrichedTrack(trackId: Long) =
  for {
    track <- tracksSource.get(trackId)
    creator <- usersSource.get(track.creatorId)
  } yield {
    new EnrichedTrack(track, creator)
  }

The usage of renderTrackPosts produces only three remote procedure calls:

Fetch the track posts list (from timelineService);
Fetch the metadata for all the tracks (from tracksService);
Fetch the user metadata for all the tracks' creators (from usersService).

The final result can be notFound because the user can be found or not.

Sources

Sources represent the remote systems' batched interfaces. Clump offers some methods to create sources using different strategies.

Clump.source

The Clump.source method accepts a function that may return less elements than requested. The output can also be in a different order than the inputs, since the last parameter is a function that allows Clump to determine which is the input for each output.

def fetch(ids: List[Int]): Future[List[User]] = ...

val usersSource = Clump.source(fetch)(_.id)

It is possible to create sources that have additional inputs, but the compiler isn't capable of inferring the input type for these cases. The solution is to use an explicit generic:

def fetch(session: UserSession, ids: List[Int]): Future[List[User]] = ...

def usersSource(session: UserSession) = 
    Clump.source[List[Int](fetch(session, _))(_.id)

Without the explicit generic, the compiler outputs this error message:

missing parameter type for expanded function ((x$1) => fetch(1, x$1))

Clump.sourceFrom

The Clump.sourceFrom method accepts a function that returns a Map with the values for the found inputs.

def fetch(ids: List[Int]): Future[Map[Int, User]] = ...

val usersSource = Clump.sourceFrom(fetch)

It is also possible to specify additional inputs for Clump.sourceFrom:

def fetch(session: UserSession, ids: List[Int]): Future[Map[Int, User]] = ...

def usersSource(session: UserSession) = 
    Clump.sourceFrom[List[Int]](fetch(session, _))

Clump.sourceZip

The Clump.sourceZip methods accepts a function that produces a list of outputs for each provided input. The result must keep the same order as the inputs list.

def fetch(ids: List[Int]): Future[List[User]] = ...

val usersSource = Clump.sourceZip(fetch)

Additional configurations

Some services have a limitation on how many resources can be fetched in a single request. It is possible to define this limit for each source instance:

val usersSource = Clump.sourceZip(fetch).maxBatchSize(100)

The source instance can be also configured to automatically retry failed fetches by using the maxRetries method. It receives a partial function that defines the number of retries for each type of exception. The default number of retries is zero.

val usersSource = 
    Clump.sourceZip(fetch).maxRetries {
      case e: SomeException => 10
    }

Constants

It is possible to create Clump instances based on values.

From a non-optional value:

val clump: Clump[Int] = Clump.value(111)

From an optional value:

val clump: Clump[Int] = Clump.value(Option(111))

From a future:

// This method is useful as a bridge between Clump and non-batched services.
val clump: Clump[Int] = Clump.future(counterService.currentValueFor(111))

It is possible to create a failed Clump instance:

val clump: Clump[Int] = Clump.exception(new NumberFormatException)

There is a shortcut for a constant empty Clump:

val clump: Clump[Int] = Clump.None

Composition

Clump has a monadic interface similar to Future.

It is possible to apply a simple transformation by using map:

val intClump: Clump = Clump.value(1)
val stringClump: Clump[String] = intClump.map(_.toString)

If the transformation results on another Clump instance, it is possible to use flatMap:

val clump: Clump[(Track, User)] =
  tracksSource.get(trackId).flatMap { track =>
    usersSource.get(track.id).map { user =>
      (track, user)
    }
  }

The join method produces a Clump that has a tuple with the values of two Clump instances:

val clump: Clump[(User, List[Track])] =
    usersSource.get(userId).join(userTracksSource.get(userId))

There are also methods to deal with collections. Use collect to transform a collection of Clump instances into a single Clump:

val userClumps: List[Clump[User]] = userIds.map(usersSource.get(_))
val usersClump: Clump[List[User]] = Clump.collect(usersClump)

Instead of map and then collect, it is possible to use the shortcut traverse:

val usersClump: Clump[List[User]] = Clump.traverse(userIds)(usersSource.get(_))

It is possible to use for-comprehensions as syntactic sugar to avoid having to write the compositions:

val trackClump: Clump[EnrichedTrack] =
    for {
      track <- tracksSource.get(trackId)
      creator <- usersSource.get(track.creatorId)
    } yield {
      new EnrichedTrack(track, creator)
    }

Execution

The creation of Clump instances doesn't trigger calls to the remote services. The only exception is when the code explicitly uses Clump.future to invoke a service.

To trigger execution, it is possible to use get:

val trackClump: Clump[EnrichedTrack] = ...
val user: Future[Option[EnrichedTrack]] = trackClump.get

Clump assumes that the remote services can return less elements than requested. That's why the result is an Option, since the input's result may be missing.

It is possible to define a default value by using getOrElse:

val trackClump: Clump[EnrichedTrack] = ...
val user: Future[EnrichedTrack] = trackClump.getOrElse(unknownTrack)

If it is guaranteed that the underlying service will always return results for all fetched inputs, it is possible to use apply, that throws a NotFoundException if the result is empty:

val trackClump: Clump[EnrichedTrack] = ...
val user1: Future[EnrichedTrack] = trackClump.apply()
val user2: Future[EnrichedTrack] = trackClump() // syntactic sugar

When a Clump instance has a collection, it is possible to use the list method. It returns an empty collection if the result is None:

val usersClump: Clump[List[User]] = Clump.traverse(userIds)(usersSource.get(_))
val users: Future[List[User]] = usersClump.list

Composition behavior

The composition of Clump instances takes in consideration that the sources may not return results for all requested inputs. It has a behavior similar to the relational databases' joins, where not found joined elements make the tuple be filtered-out.

val clump: Clump[(Track, User)]
  for {
    track <- tracksSource.get(111)
    user <- usersSource.get(track.creator)
  } yield (track, user)

In this example, if the track's creator isn't found, the final result will be None.

val future: Future[Option[(Track, User)]] = clump.get
val result: Option[(Track, User)] = Await.result(future)
require(result === None)

If a nested clump is expected to be optional, it is possible to use the optional method to have a behavior similar to an outer join.

val clump: Clump[(Track, Option[User])]
  for {
    track <- tracksSource.get(111)
    user <- usersSource.get(track.creator).optional
  } yield (track, user)

Another alternative is to define a fallback by using orElse:

val clump: Clump[(Track, Option[User])]
  for {
    track <- tracksSource.get(111)
    user <- usersSource.get(track.creator).orElse(usersSource.get(track.uploader))
  } yield (track, user)

Filtering

The behavior introduced by the optional fetch compositions allows defining of filtering conditions:

val clump: Clump[(Track, User)]
  for {
    track <- tracksSource.get(111) if(track.owner == currentUser)
    user <- usersSource.get(track.creator)
  } yield (track, user)

Exception handling

Clump offers some mechanisms to deal with failed fetches.

The handle method defines a fallback value given an exception:

val clump: Clump[User] =
    usersService.get(userId).handle {
      case _: SomeException =>
        defaultUser
    }

If the fallback value is another Clump instance, it is possible to use rescue:

val clump: Clump[User] =
    usersService.get(trackCreatorId).rescue {
      case _: SomeException =>
        usersService.get(trackUploaderId)
    }

Scala Futures

Clump uses Twitter Futures by default, but it is possible to use Scala Futures by importing the FutureBridge:

import clump.FutureBridge._

val userClump: Clump[User] = ...
val future: scala.concurrent.Future[User] = userClump.get

It provides bidirectional implicit conversions.

Internals

This section explains how Clump works under the hood.

The codebase is relatively small. The only type explicitly exposed to the user is Clump, but internally there are four in total:

Clump - Defines the public interface of Clump and represents the abstract syntactic tree (AST) for the compositions.

ClumpSource - Represents the external systems' batched interfaces.

ClumpFetcher - It has the logic to fetch from a ClumpSource, maintains the implicit cache and implements the logic to retry failed fetches.

ClumpContext - It is the execution model engine created automatically for each execution. It keeps the state by using a collection of ClumpFetchers.

Take some time to read the code of these classes. It will help to have a broader view and understand the explanation that follows.

Lets see what happens when this example is executed:

val usersSource = Clump.source(usersService.fetch)(_.id)
val tracksSource = Clump.source(tracksService.fetch)(_.id)

val clump: Clump[List[EnrichedTrack]] =
    Clump.traverse(trackIds) { trackId =>
      for {
        track <- tracksSource.get(trackId)
        user <- usersSource.get(track.creatorId)
      } yield {
        new EnrichedTrack(track, user)
      }
    }

val tracks: Future[List[Track]] = clump.list

Sources creation

val usersSource = Clump.source(usersService.fetch)(_.id)
val tracksSource = Clump.source(tracksService.fetch)(_.id)

The ClumpSource instances are created using one of the shortcuts that the Clump object provides. They don't hold any state and allow to create Clump instances representing the fetch. Clump uses the fetch function's identity to group requests and perform batched fetches, so it is possible to have multiple instances of the same source within a clump composition and execution.

Composition

val clump: Clump[List[EnrichedTrack]] =
    Clump.traverse(trackIds) { trackId =>
      ...
    }

The traverse method is used as a shortcut for map and then collect, so this code could be rewritten as follows:

val clump: Clump[List[EnrichedTrack]] =
    Clump.collect(trackIds.map { trackId => 
      ...
    }

For each trackId, a for-comprehension is used to compose a Clump that has the EnrichedTrack:

      for {
        track <- tracksSource.get(trackId)
        user <- usersSource.get(track.creatorId)
      } yield {
        new EnrichedTrack(track, user)
      }

The for-comprehension is actually just syntactic sugar using map and flatMap, so this code is equivalent to:

      tracksSource.get(trackId).flatMap { track =>
        usersSource.get(track.creatorId).map { user =>
          new EnrichedTrack(track, user)
        }
      }

There are three methods being used in this composition:

get creates a ClumpFetch instances that is the AST element representing the fetch. It doesn't trigger the actual fetch, only uses the ClumpFetcher instance to produce a Future that will be executed by the ClumpContext when the execution is triggered.
flatMap creates a ClumpFlatMap instance representing the operation. It just composes a new future that is based on the result of the initial Clump and the result of the nested Clump.
map creates a ClumpMap instance representing the map operation. It composes a new future by applying the specified transformation.

Note that any Clump composition creates a ClumpContext implicitly if it doesn't exist yet. The ClumpContext is maintained using a Local value, that is a mechanism similar to a ThreadLocal but for asynchronous compositions.

Execution

Now comes the most important part. Until now, the compositions only create Clump* instances to represent the operations and produce futures that will be fulfilled when the execution is triggered. You probably have noticed that the Clump instances define three things:

result that has the Future result for the operation
upstream that returns the upstream Clump instances that were used as the basis for the composition
downstream that returns the downstream Clump instances created as a result of the operation

Note that downstream returns a Future[List[Clump[_]]], while upstream returns a List[Clump[_]] directly. This happens because downstream produces Clump instances that are available only after the upstream execution.

These methods are used by the ClumpContext to apply the execution model. It has a collection with all ClumpFetcher instances in the composition.

This is the code that triggers the execution:

val tracks: Future[List[Track]] = clump.list

The list method is just a shortcut to ease getting the value of Clump instances that have a collection. The actual execution is triggered by the get method. It flushes the context and returns the Clump's result.

The context flush is a recursive function that performs simple steps:

If there aren't Clump instances to be fetched, stop the recursion;
If there are Clump instances to be fetched
- Flush all the upstream instances of the current clumps;
- Perform all the fetches among the current Clump instances being executed.
- Flush all the downstream instances, since the pre-requisite to run the downstream is fulfilled (upstream already flushed). Note that the difference from the upstream flush is due the fact that downstream returns a future, but the semantic is the same.

You could consider this a depth-first, upstream-first traversal of the Clump graph.

In case you are wondering why we need this upstream mechanism since we have the Clump instance at hand and could start the execution from it: actually the instance used to trigger the execution isn't the "root" of the composition. For instance:

val clump: Clump[EnrichedTrack] =
    tracksSource.get(trackId).flatMap { track =>
      usersSource.get(track.creatorId).map { user =>
        new EnrichedTrack(track, user)
      }
    }

The clump instance will be a ClumpFlatMap, not the ClumpFetch created by the tracksSource.get(trackId) call. This is the AST behind the clump instance:

                                   +----------> Empty                    
                                   |Up                                   
                            +------+-----+                               
                            | ClumpFetch |                               
               +----------> |  (line 2)  |                               
               |Up          +------+-----+                               
               |                   |Down                                 
               |                   +----------> Empty                    
       +-------+------+                                                  
       | ClumpFlatMap |                                                  
get--> |   (line 2)   |                                   +-------> Empty
       +-------+------+                                   |Up            
               |                                +---------+--+           
               |                   +----------> | ClumpFetch |           
               |                   |Up          |  (line 3)  |           
               |Down        +------+-----+      +---------+--+           
               +----------> |  ClumpMap  |                |Down          
                            |  (line 3)  |                +-------> Empty
                            +------+-----+                               
                                   |Down                                 
                                   +----------> Empty

The steps to execute this composition happen as follows:

The execution is triggered by the get method on the ClumpFlatMap instance
The flush of ClumpFlatMap starts
- ClumpFlatMap upstream is flushed
  - The flush of ClumpFetch starts
    - Upstream is flushed and returns immediately (empty)
    - The fetch is executed to fulfill the prerequisites to the downstream flush (upstream + fetch)
    - Downstream is flushed and returns immediately (empty)
- ClumpFlatMap downstream is flushed
  - The flush of ClumpMap starts
    - Upstream is flushed
      - The flush of the second ClumpFetch starts
        
        Upstream is flushed and returns immediately (empty)
        
        The fetch is executed
        
        Downstream is flushed and returns immediately (empty)
    - Downstream is flushed and returns immediately (empty)

Note that this example has only one Clump instance per flush phase, but normally there are multiple instances. This is what allows Clump to batch requests that are in the same flush phase.

Known limitations

The execution model is capable of batching requests that are in the same level of the composition. For instance, this example produces only one fetch from usersSource:

val clump: Clump[List[EnrichedTrack]] =
    Clump.traverse(trackIds) { trackId =>
      for {
        track <- tracksSource.get(trackId)
        user <- usersSource.get(track.creatorId)
      } yield {
        new EnrichedTrack(track, user)
      }
    }

The next example has two fetches from usersSource: one for the playlists' creators and other for the tracks' creators.

val clump: Clump[List[EnrichedPlaylist]] =
    Clump.traverse(playlistIds) { playlistId =>
      for {
        playlist <- playlistsSource.get(playlistId)
        creator <- usersSource.get(playlist.creator)
        tracks <- 
          Clump.traverse(playlist.trackids) {
            for {
              track <- tracksSource.get(trackId)
              user <- usersSource.get(track.creatorId)
            } yield {
              new EnrichedTrack(track, user)
            }
          }
      } yield {
        new EnrichedPlaylist(playlist, creator, tracks)
      }
    }

Considering that they happen in different levels of the composition, the execution model will execute two batched fetches to usersSource, not one. This limitation is alleviated by the implicit caching if the playlist and tracks have the same creator.

Acknowledgments

Clump was inspired by the Twitter's Stitch project. The initial goal was to have a similar implementation, but the project evolved to provide an approach more adherent to some use-cases we have in mind. See STITCH.md for more information about the differences between Stich and Clump.

Facebook's Haxl paper and the Futurice's blog post about Jobba also were important sources for the development phase.

The project was initially built using SoundCloud's Hacker Time.

Versioning

Clump adheres to Semantic Versioning 2.0.0. If there is a violation of this scheme, report it as a bug.Specifically, if a patch or minor version is released and breaks backward compatibility, that version should be immediately yanked and/or a new version should be immediately released that restores compatibility. Any change that breaks the public API will only be introduced at a major-version release.

License

See the LICENSE-LGPL file for details.

grandbora / clump

Introduction

Summary

Problem

Solution

Getting started

Usage

Example

Sources

Clump.source

Clump.sourceFrom

Clump.sourceZip

Additional configurations

Constants

Composition

Execution

Composition behavior

Filtering

Exception handling

Scala Futures

Internals

Sources creation

Composition

Execution

Known limitations

Acknowledgments

Versioning

License

About

Languages