Tuesday, June 13, 2017

Failure Testing for your private cloud - Introducing GomJabbar

TL;DR Chaos Drills can contribute a lot to your services' resilience, and they're actually quite a fun activity. We've built a tool called GomJabbar to help you run those drills.

Here at Outbrain we manage quite a large scale deployment of hundreds of services / modules, and thousands of hosts. We practice CI/CD, and have implemented quite a sound infrastructure, which we believe is scalable, performant, and resilient. We do, however, experience many production issues on a daily basis, just like any other large scale organization. You simply can't ensure a 100% fault-free system. Servers will crash, run out of disk space, and lose connectivity to the network. Software will experience bugs and erroneous conditions. Our job as software engineers is to anticipate these conditions, and design our code to handle them gracefully.

For quite a long time we were looking into ways of improving our resilience and validating our assumptions, using a tool like Netflix's Chaos Monkey. We also wanted to make sure our alerting system actually triggers when things go wrong. The main problem we were facing is that Chaos Monkey was designed to work with cloud infrastructure, while we maintain our own private cloud.

The main motivation for developing such a tool is that failures have the tendency of occurring when you're least prepared, and at the least desirable time, e.g. Friday nights, when you're out having a pint with your buddies. Now, to be honest with ourselves, when things fail during inconvenient times, we don't always roll up our sleeves and dive in to look for the root cause. Many times the incident will end after a service restart, and once the alerts clear we forget about it.

Wouldn't it be great if we could have "chaos drills", where we could practice handling failures, test and validate our assumptions, and learn how to improve our infrastructure?

Chaos Drills at Outbrain

We built GomJabbar exactly for the reasons specified above. Once a week, at a well-known time, mid-day, we randomly select a few targets where we trigger failures. At this point, the system should either auto-detect the failures and auto-heal, or bypass them. In some cases alerts should be triggered to let teams know that a manual intervention is required.

After each chaos drill we conduct a quick take-in session for each of the triggered failures, and ask ourselves the following questions:
  1. Did the system handle the failure case correctly?
  2. Was our alerting strategy effective?
  3. Did the team have the knowledge to handle, and troubleshoot the failure?
  4. Was the issue investigated thoroughly?
These take-ins lead to super valuable inputs, which we probably wouldn't collect any other way.

How did we kick this off?

Before we started running the chaos drills, there were a lot of concerns about the value of such drills, and the time they would require. Well, since eliminating our fear of production is one of the key goals of this activity, we had to take care of that first.
"I must not fear.
 Fear is the mind-killer.
 Fear is the little-death that brings total obliteration.
 I will face my fear.
 I will permit it to pass over me and through me.
 And when it has gone past I will turn the inner eye to see its path.
 Where the fear has gone there will be nothing. Only I will remain."

(Litany Against Fear - Frank Herbert - Dune)
So we started a series of chats with the teams, in order to understand what was bothering them, and to find ways to mitigate their concerns. So here goes:
  • There's an obvious need to avoid unnecessary damage.
    • We've created filters to ensure only approved targets get to participate in the drills.
      This has a side effect of pre-marking areas in the code we need to take care of.
    • We currently schedule drills via statuspage.io, so teams know when to be ready, and if the time is inappropriate,
      we reschedule.
    • When we introduce a new kind of fault, we let everybody know, and explain what they should prepare for in advance.
    • We started out from minor faults like graceful shutdowns, continued to graceless shutdowns,
      and moved on to more interesting testing like faulty network emulation.
  • We've measured the time teams spent on these drills, and it turned out to be negligible.
    Most of the time was spent on preparations. For example ensuring we have proper alerting,
    and correct resilience features in the clients.
    This is actually something you need to do anyway. At the end of the day, we've heard no complaints about interruptions or wasted time.
  • We've made sure teams, and engineers on call, were not left on their own. We wanted everybody to learn
    from this drill, and when they weren't sure how to proceed, we jumped in to help. It's important
    to make everyone feel safe about this drill, and remind everybody that we only want to learn and improve.
All that said, it's important to remember that we basically simulate failures that occur on a daily basis. It's only that when we do that in a controlled manner, it's easier to observe where our blind spots are, what knowledge we are lacking, and what we need to improve.

Our roadmap - What next?

  • Up until now, this drill was executed via a semi-automatic procedure. The next level is to let the teams run this drill at a fixed interval, at a well-known time.
  • Add new kinds of failures, like disk space issues, power failures, etc.
  • So far, we were only brave enough to run this on applicative nodes, and there's no reason to stop there. Data-stores, load-balancers, network switches, and the like are also on our radar in the near future.
  • Multi-target failure injection. For example, inject a failure to a percentage of the instances of some module in a random cluster. Yes, even a full cluster outage should be tested at some point, in case you were asking yourself.

The GomJabbar Internals

GomJabbar is basically an integration between a discovery system, a (fault) command execution scheduler, and your desired configuration. The configuration contains mostly the target filtering rules, and fault commands.

The fault commands are completely up to you. Out of the box we provide the following example commands (but you can really write your own scripts to suit your platform, needs, and architecture):
  • Graceful shutdowns of service instances.
  • Graceless shutdowns of service instances.
  • Faulty Network Emulation (high latency, and packet-loss).
Upon startup, GomJabbar drills down via the discovery system, fetches the clusters, modules, and their instances, and passes each through the filters provided in the configuration files. This process is also performed periodically. We currently support discovery via Consul, but adding other methods of discovery is quite trivial.

When a user wishes to trigger faults, GomJabbar selects a random target, and returns it to the user, along with a token that identifies this target. The user can then trigger one of the configured fault commands, or scripts, on the random target. At this point GomJabbar uses the configured CommandExecutor in order to execute the remote commands on the target hosts.
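To illustrate the select-then-trigger flow, here is a hypothetical Java sketch (not the actual GomJabbar API): a random approved target is chosen, bound to a token, and the token is later resolved back to the target when the fault command runs.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of target selection with tokens; names are illustrative.
public class TargetSelector {

    private final List<String> targets;
    private final Map<String, String> tokenToTarget = new ConcurrentHashMap<>();
    private final Random random = new Random();

    public TargetSelector(List<String> approvedTargets) {
        // Only targets that already passed the configured filters get in here.
        this.targets = approvedTargets;
    }

    // Pick a random approved target and hand back a token identifying it.
    public Map.Entry<String, String> selectRandomTarget() {
        String target = targets.get(random.nextInt(targets.size()));
        String token = UUID.randomUUID().toString();
        tokenToTarget.put(token, target);
        return Map.entry(token, target);
    }

    // Later, the token is exchanged for the target the fault command should hit.
    public String resolve(String token) {
        return tokenToTarget.get(token);
    }
}
```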

GomJabbar also maintains an audit log of all executions, which allows you to revert quickly in the face of a real production issue, or an unexpected catastrophe caused by this tool.

What have we learned so far?

If you've read so far, you may be asking yourself what's in it for me? What kind of lessons can I learn from these drills?

We've actually found and fixed many issues by running these drills, and here's what we can share:
  1. We had broken monitoring and alerting around the detection of the integrity of our production environment. We wanted to make sure that everything that runs in our data-centers is managed, and in a well-known state (version, health, etc.). We've found that we didn't compute the difference between the desired state and the actual state properly, due to reliance on bogus data-sources. This sort of bug attacked us from two sides: once when we triggered graceful shutdowns, and once for graceless shutdowns.
  2. We've found services that had no owner, became obsolete, and were basically running unattended in production. The horror.
  3. During the faulty network emulations, we've found that we had clients that didn't implement proper resilience features, and caused cascading failures in the consumers several layers up our service stack. We've also noticed that in some cases, the high latency also cascaded. This was fixed by adding proper timeouts, double-dispatch, and circuit-breakers.
  4. We've also found that these drills motivated developers to improve their knowledge about the metrics we expose, logs, and the troubleshooting tools we provide.


We've found the chaos drills to be an incredibly useful technique, which helps us improve our resilience and integrity, while helping everybody learn about how things work. We're by no means anywhere near perfection. We're actually pretty sure we'll find many many more issues we need to take care of. We're hoping this exciting new tool will help us move to the next level, and we hope you find it useful too ;)

Sunday, July 3, 2011

Feature Flags made easy

I recently participated in the ILTechTalk week. Most of the talks discussed issues like Scalability, Software Quality, Company Culture, and Continuous Deployment (CD). Since the talks were hosted at Outbrain, we got many direct questions about our concrete implementations. Some of the questions and statements claimed that Feature Flags complicate your code. What bothered most participants was that committing code directly to trunk requires addition of feature flags in some cases, and that it may make their code base more complex.

While in some cases feature flags may make the code slightly more complicated, it shouldn't be so in most cases. The main idea I'm presenting here is that conditional logic can be easily replaced with polymorphic code. In fact conditional logic can always be replaced by polymorphism.

Enough with the abstract talk...

Suppose we have an application that contains some imaginary feature, and we want to introduce a feature flag. Below is a code snippet that developers normally come up with:
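The original snippet was not preserved, so here is a sketch of what such a conditional implementation typically looks like (class and property names are illustrative, not from the original post):

```java
public class ConditionalApplication {

    // The flag check tends to recur at every call site that touches the
    // feature, which is what drives up cyclomatic complexity.
    static String runImaginaryFeature(boolean newFeatureEnabled) {
        if (newFeatureEnabled) {
            return "new imaginary feature behavior";
        } else {
            return "old imaginary feature behavior";
        }
    }

    public static void main(String[] args) {
        // Typically the flag comes from a system property or a config file.
        boolean enabled = Boolean.getBoolean("imaginaryFeature.enabled");
        System.out.println(runImaginaryFeature(enabled));
    }
}
```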

While this is a legitimate implementation in some cases, it does complicate your code base by increasing its cyclomatic complexity. In some cases the test for activation of the feature may recur in many places in the code, so this approach can quickly turn into a maintenance nightmare.

Luckily, implementing a feature flag using polymorphism is pretty easy. First, let's define an interface for the imaginary feature, and two implementations (old and new):
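The original snippet was not preserved; a minimal sketch of the interface and its two implementations might look like this (names and return values are illustrative):

```java
// The feature contract; callers depend only on this interface.
public interface ImaginaryFeature {
    String doSomething();
}

// The existing behavior, kept alive behind the flag.
class OldImaginaryFeature implements ImaginaryFeature {
    @Override
    public String doSomething() {
        return "old behavior";
    }
}

// The new behavior, rolled out when the flag selects it.
class NewImaginaryFeature implements ImaginaryFeature {
    @Override
    public String doSomething() {
        return "new behavior";
    }
}
```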

Now let's use the feature in our application, selecting the implementation at runtime:
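Again the original snippet was lost; a sketch of the runtime selection might look like this (the tiny interface and implementations are repeated inline so the snippet is self-contained; names are illustrative):

```java
// Minimal stand-ins for the interface and implementations discussed above.
interface ImaginaryFeature {
    String doSomething();
}

class OldImaginaryFeature implements ImaginaryFeature {
    public String doSomething() { return "old behavior"; }
}

class NewImaginaryFeature implements ImaginaryFeature {
    public String doSomething() { return "new behavior"; }
}

public class PolymorphicApplication {

    private final ImaginaryFeature imaginaryFeature = createImaginaryFeature();

    // Usually abstracted into a factory; kept inline here for brevity.
    private ImaginaryFeature createImaginaryFeature() {
        String className = System.getProperty(
                "imaginaryFeature.implementation.class", "OldImaginaryFeature");
        try {
            return (ImaginaryFeature) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create feature: " + className, e);
        }
    }

    public String run() {
        return imaginaryFeature.doSomething();
    }
}
```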

Here we initialized the imaginary feature member by reflection, using a class name specified as a system property. The createImaginaryFeature() method above is usually abstracted into a factory, but kept as is here for brevity. But we're still not done. Most readers would probably say that the introduction of a factory and reflection makes the code less readable and less maintainable. I have to agree... And apart from that, adding dependencies to the concrete implementations would complicate the code even more. Luckily I have a secret weapon at my disposal. It is called IoC (or DI). When using an IoC container such as Spring or Guice, your code can be made extremely flexible, and implementing feature flags turns into a walk in the park.

Below is a rewrite of the PolymorphicApplication using Spring dependency injection:
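The original XML was not preserved; a sketch of what such a Spring configuration might look like follows (the package names, bean ids, and the placeholder-configurer setup are illustrative assumptions; the property name and default bean follow the post's description):

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">

  <!-- Resolves ${...} placeholders, falling back to system properties -->
  <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer"/>

  <!-- lazy-init ensures only the selected implementation is instantiated -->
  <bean id="oldImaginaryFeature" class="features.OldImaginaryFeature" lazy-init="true"/>
  <bean id="newImaginaryFeature" class="features.NewImaginaryFeature" lazy-init="true"/>

  <bean id="application" class="features.PolymorphicApplication">
    <!-- The referenced bean name is resolved from a system property,
         defaulting to the old implementation -->
    <constructor-arg ref="${imaginaryFeature.implementation.bean:oldImaginaryFeature}"/>
  </bean>
</beans>
```

Note that the `:` default-value separator in placeholders assumes Spring 3.0 or later.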

The Spring code above defines an application and two imaginary feature implementations. By default the application is initialized with the oldImaginaryFeature, but this behavior can be overridden by specifying a -DimaginaryFeature.implementation.bean=newImaginaryFeature command line argument. Only a single feature implementation will be initialized by Spring, and the implementations may have dependencies.

Bottom line is: with a bit of extra preparation, and correct design decisions, feature flags shouldn't be a burden on your code base. By extra preparation, I mean extracting interfaces for your domain objects, using an IoC container, etc, which is something we should be doing in most cases anyway.

Friday, May 14, 2010

Mapping Immutable Value-Objects with Dozer

2 good things happened to me this week (well actually 3, but I will probably blog about the 3rd one later):

The first thing is that I finally managed to convince Dozer to map immutable objects, and the second one is that I found something interesting to blog about ;)

I mentioned in a previous post Dozer’s lack of support for constructor arguments. In general Dozer is aimed at supporting JavaBean to JavaBean mapping, and other usage scenarios seem to be hard to implement. It turns out that the problem can be solved using design-patterns, and a little bit of trickery.

The first step towards the solution is introducing the Builder Pattern. Actually, a form of the Builder Pattern that was introduced by Joshua Bloch at JavaOne. The pattern solves the problem of too many constructors, too many constructor arguments, and the verbosity of object creation while using setters. The pattern is described in detail here: http://ow.ly/1L2JV.

Let’s suppose we are about to map an Address JavaBean to an Immutable Address object. Here are the Address classes:

public class Coordinate {
 private double longitude;
 private double latitude;

 // getters, setters, c'tors, equals(), hashCode(), toString(), etc...
}

public class Address {
 private String country;
 private String state;
 private String city;
 private String street;
 private String zipcode;
 private Coordinate coordinate;

 // getters, setters, c'tors, equals(), hashCode(), toString(), etc...
}

And here’s the immutable address:

public class ImmutableCoordinate {
 private final double longitude;
 private final double latitude;

 private ImmutableCoordinate(Builder builder) {
  this.latitude = builder.latitude;
  this.longitude = builder.longitude;
 }

 public double getLongitude() {
  return longitude;
 }

 public double getLatitude() {
  return latitude;
 }

 public static class Builder {
  private double longitude;
  private double latitude;

  public Builder longitude(double longitude) {
   this.longitude = longitude;
   return this;
  }

  public Builder latitude(double latitude) {
   this.latitude = latitude;
   return this;
  }

  public ImmutableCoordinate build() {
   return new ImmutableCoordinate(this);
  }
 }
}

public class ImmutableAddress {
 private final String country;
 private final String state;
 private final String city;
 private final String street;
 private final String zipcode;
 private final ImmutableCoordinate coordinate;

 private ImmutableAddress(Builder builder) {
  this.country = builder.country;
  this.state = builder.state;
  this.city = builder.city;
  this.street = builder.street;
  this.zipcode = builder.zipcode;
  this.coordinate = builder.coordinate;
 }

 public String getCountry() {
  return country;
 }

 public String getState() {
  return state;
 }

 public String getCity() {
  return city;
 }

 public String getStreet() {
  return street;
 }

 public String getZipcode() {
  return zipcode;
 }

 public ImmutableCoordinate getCoordinate() {
  return coordinate;
 }

 public static class Builder {
  private String country;
  private String state;
  private String city;
  private String street;
  private String zipcode;
  private ImmutableCoordinate coordinate;

  public Builder country(String country) {
   this.country = country;
   return this;
  }

  public Builder state(String state) {
   this.state = state;
   return this;
  }

  public Builder city(String city) {
   this.city = city;
   return this;
  }

  public Builder street(String street) {
   this.street = street;
   return this;
  }

  public Builder zipcode(String zipcode) {
   this.zipcode = zipcode;
   return this;
  }

  public Builder coordinate(ImmutableCoordinate coordinate) {
   this.coordinate = coordinate;
   return this;
  }

  public ImmutableCoordinate getCoordinate() {
   return coordinate;
  }

  public ImmutableAddress build() {
   return new ImmutableAddress(this);
  }
 }
}
Now we can map our mutable class to the Builder of the immutable class, and throw in a custom DozerConverter where nested properties are involved. Below is the mapping for the Address classes:

<?xml version="1.0" encoding="UTF-8"?>
<mappings xmlns="http://dozer.sourceforge.net">
  <configuration>
    <date-format>MM/dd/yyyy HH:mm</date-format>
  </configuration>

  <!-- class names shown unqualified for brevity; Dozer expects fully-qualified names -->
  <mapping>
    <class-a>Coordinate</class-a>
    <class-b>ImmutableCoordinate$Builder</class-b>
    <field>
      <a set-method="setLongitude">longitude</a>
      <b set-method="longitude">longitude</b>
    </field>
    <field>
      <a set-method="setLatitude">latitude</a>
      <b set-method="latitude">latitude</b>
    </field>
  </mapping>

  <mapping>
    <class-a>Address</class-a>
    <class-b>ImmutableAddress$Builder</class-b>
    <field>
      <a set-method="setCountry">country</a>
      <b set-method="country">country</b>
    </field>
    <field>
      <a set-method="setState">state</a>
      <b set-method="state">state</b>
    </field>
    <field>
      <a set-method="setCity">city</a>
      <b set-method="city">city</b>
    </field>
    <field>
      <a set-method="setStreet">street</a>
      <b set-method="street">street</b>
    </field>
    <field>
      <a set-method="setZipcode">zipcode</a>
      <b set-method="zipcode">zipcode</b>
    </field>
    <field custom-converter-id="coordConverter">
      <a set-method="setCoordinate">coordinate</a>
      <b set-method="coordinate">coordinate</b>
    </field>
  </mapping>
</mappings>

And the DozerConverter is a fairly straightforward implementation (I actually use Dozer to do its own job…):

public class CoordinateConverter extends DozerConverter<Coordinate, ImmutableCoordinate> {

 private final Mapper mapper;

 public CoordinateConverter(Mapper mapper) {
  super(Coordinate.class, ImmutableCoordinate.class);
  this.mapper = mapper;
 }

 @Override
 public Coordinate convertFrom(ImmutableCoordinate source, Coordinate destination) {
  return source == null ? null : mapper.map(source, Coordinate.class);
 }

 @Override
 public ImmutableCoordinate convertTo(Coordinate source, ImmutableCoordinate destination) {
  return source == null ? null : mapper.map(source, ImmutableCoordinate.Builder.class).build();
 }
}

Now mapping between the classes is a matter of a single line:

Address address = new Address();
// set set set...
ImmutableAddress immutableAddress = mapper.map(address, ImmutableAddress.Builder.class).build();
And it even works in the opposite direction :D

Although this solution is not as clean as it should have been (there’s still some verbosity, and an obscure need for a getter in some cases), it is still preferable to the piles of code you get when messing with object mapping. This technique may also be easier to sneak into the Dozer code-base than constructor arguments support.

Thursday, April 8, 2010

Sending Emails using Spring-Mail

Some applications are required to send emails. What can I say? These are the things we have to do for money…

The JavaMail API is pretty much boring and a little cumbersome to use. Once again you find yourself fiddling with connection management, exception handling, etc. And once again Spring comes to the rescue :)

Spring-Mail has a neat and easy to use email API, including MIME message support.

Let’s imagine we need to implement a “forgot my password” feature. So here goes.
The EmailFacade:

package mail;

public interface EmailFacade {

    public void sendPasswordReminderEmailTemplate(String user, String password, String email);
}
The implementation:
package mail;

import org.springframework.mail.MailSender;
import org.springframework.mail.SimpleMailMessage;
import org.springframework.util.Assert;

class EmailFacadeImpl implements EmailFacade {

    private final MailSender mailSender;
    private final SimpleMailMessage passwordReminderEmailTemplate;

    public EmailFacadeImpl(MailSender mailSender, SimpleMailMessage passwordReminderEmailTemplate) {
        Assert.notNull(mailSender, "mailSender may not be null");
        Assert.notNull(passwordReminderEmailTemplate,
                       "passwordReminderEmailTemplate may not be null");

        this.mailSender = mailSender;
        this.passwordReminderEmailTemplate = passwordReminderEmailTemplate;
    }

    public void sendPasswordReminderEmailTemplate(String user, String password, String email) {
        // Create a thread safe instance of the template message and customize it
        SimpleMailMessage msg = new SimpleMailMessage(passwordReminderEmailTemplate);
        String formatedText = String.format(passwordReminderEmailTemplate.getText(), user, password);
        msg.setText(formatedText);
        msg.setTo(email);
        mailSender.send(msg);
    }
}
And the Spring Configuration:
  <bean id="emailFacade" class="mail.EmailFacadeImpl">
    <constructor-arg>
      <bean class="org.springframework.mail.javamail.JavaMailSenderImpl">
        <property name="host" value="${emailServerURL}"/>
        <property name="username" value="${emailPrincipal}"/>
        <property name="password" value="${emailPassword}"/>
      </bean>
    </constructor-arg>
    <constructor-arg ref="passwordReminderEmailTemplate"/>
  </bean>

  <bean id="passwordReminderEmailTemplate" class="org.springframework.mail.SimpleMailMessage">
    <property name="from" value="me@mycompany.com"/>
    <property name="subject" value="Password Reminder"/>
    <property name="text">
      <!-- Text template to be used with String.format() -->
      <value><![CDATA[Hi %s,
Your password is %s]]></value>
    </property>
  </bean>

For some use cases it would be better to replace the String.format() call with some templating engine such as Velocity. I left it here for brevity.

Thursday, February 18, 2010

Software Craftsmanship

Uncle Bob talks about Software Craftsmanship and agile: http://java.dzone.com/videos/object-mentors-bob-martin.

I couldn't agree more.

In my opinion software development should become a closed guild. Anyone can write code, but we should strive to make quality code, and we should make our occupation a respectable one!

Ignore the rules, and you’re out ;)

We will not ship shit. Well put Bob.

Friday, February 12, 2010

Event Horizon

When working in an agile environment, being able to control parts of the architecture like layers, conventions, etc, is usually desired, while letting the teams make most of the design decisions. Easier said than done. In this model the architect envisions the initial architecture and communicates it to the team. The architecture evolves over time according to the needs, while the architect shapes it to keep things simple and “right”.

Keeping track of what’s going on in a large code base is virtually impossible. Even the most skilled developers might miss architecture violations, while performing code reviews for a big chunk of code.

In order to solve this I use a powerful tool called Structure-101 (I call it s101). The s101 client on its own is brilliant for analyzing the code base and defining / communicating the desired architecture. However, having to manually check out the latest code, check for new violations, see what has changed, and notify the developer who created new violations, is tedious and time consuming. Luckily s101 comes with a command line utility called S101Headless, which can be integrated into your nightly build. The S101Headless utility allows you to test for new / existing structural violations / increased code complexity, and publish a new snapshot.

The strategy that I use with s101 is as follows. First I analyze the code base, define the architecture diagrams, and extract recommendations for refactoring. Later, by integrating s101 into the nightly build, I control the evolution of the architecture. Legacy code bases usually contain many design tangles which are hard to get rid of, and usually you won’t get the resources for refactoring… Still it is easy to seal the complexity and isolate it from the “happy code”. S101 allows you to break the build only when new architectural violations show up.

The current version of S101Headless is somewhat awkward to use with Ant, even though it is much better than the previous version. The utility consumes an XML file containing the operations you want executed. But hey, the arguments are usually dynamic, especially in a build environment. The documentation suggests utilizing the echo task in order to write the XML file to disk. While this approach allows you to use the Ant variables, it makes your XML file less readable. Here’s how I do it:

First you need a template file (s101headless-template.xml):
<?xml version="1.0" encoding="UTF-8"?>
<headless version="1.0">

        <operation type="check-architecture">
            <argument name="output-file" value="@REPORTS_DIR@/arch-violations.csv"/>
            <argument name="onlyNew" value="true"/>
        </operation>

        <operation type="publish">
            <argument name="rpkey" value="c0d3s1ut"/>
        </operation>

        <argument name="local-project" value="@S101_LOCAL_PROJECT@">
            <override attribute="classpath" value="@CLASSPATH_OVERRIDE@"/>
        </argument>
        <argument name="repository" value="@S101_REPOSITORY@"/>
        <argument name="project" value="@S101_PROJECT@-snapshots"/>

</headless>

In the Ant build file I use copy with a filter chain to render the XML for the S101Headless utility:
<target name="prepareHeadlessFile" depends="setup_classpath">
        <copy file="s101headless-template.xml" tofile="s101headless.xml" overwrite="true">
                <filterchain>
                        <replacestring from="@CLASSPATH_OVERRIDE@" to="${my-jars-path}"/>
                        <replacestring from="@REPORTS_DIR@" to="${s101.reports.dir}"/>
                        <replacestring from="@S101_PROJECT@" to="${s101.project.name}"/>
                        <replacestring from="@S101_REPOSITORY@" to="${s101.repository}"/>
                        <replacestring from="@S101_LOCAL_PROJECT@" to="${s101.local.project}"/>
                </filterchain>
        </copy>
</target>

<target name="s101" depends="prepareHeadlessFile">
        <mkdir dir="${s101.reports.dir}"/>
        <java classname="com.headway.assemblies.seaview.headless.S101Headless" fork="true" errorproperty="s101.failure" resultproperty="s101.result.code" maxmemory="512m" dir="${s101.java.home}">
                <classpath>
                        <pathelement path="${s101.java.home}/structure101-java-b586.jar"/>
                        <pathelement path="${basedir}"/>
                </classpath>
                <arg value="${basedir}/s101headless.xml"/>
        </java>
        <!-- fail if there were errors in the S101Headless execution -->
        <condition property="s101.violations.found">
                <not>
                        <equals arg1="${s101.result.code}" arg2="0"/>
                </not>
        </condition>
        <fail if="s101.violations.found" message="${s101.failure}"/>
</target>
I leave the Ant classpath and properties setup as an exercise to the reader ;)

As a complement to the nightly build analysis, s101 also offers an IDE plugin, which connects to the Structure-101 repository, and can display real time errors / warnings when new violations are detected in the code. IMHO the Eclipse plugin is still in its infancy, and requires some improvements. In the near future (I hope), it will provide real time, 24/7, protection against evil code :D

BTW, can anyone guess why I called this post “Event Horizon”?

Thursday, December 24, 2009

Referencing Spring beans in Properties collections

It is sometimes desired to reference Spring beans when injecting properties into some bean. The problem is that the <props> tag doesn’t support references – it supports plain string values only.
While trying to resolve such issue I stumbled onto a discussion in the Spring-Source community forums: http://forum.springsource.org/showthread.php?t=63939.
While adding non-string values into Properties is problematic, adding string values that originate from a bean-value, or a bean’s-property-value, or something similar is a valid use case. I posted my solution in the thread, but just for your convenience, here goes:
In general, as long as you are using string values, you can safely replace the <props> tag with <map>:
  <props>
    <prop key="foo">blahhhh</prop>
    <prop key="bar">arrrrgh</prop>
  </props>
Is the same as
  <map>
    <entry key="foo" value="blahhhh"/>
    <entry key="bar">
      <bean class="java.lang.String">
        <constructor-arg value="arrrrgh"/>
      </bean>
    </entry>

    <!-- and you can even do -->
    <entry key="baz" value-ref="someBean"/>
  </map>