Going with the flow: monitoring microservices

As our next-generation financial services platform takes shape, we’re rethinking our assumptions about tracing and monitoring. Story by Paul Spiller.

In 2016 our CTO, Clayton Locke, set the ball rolling on the development of The Engagement Platform, a next-generation financial services platform based on a microservices architecture. The software is accompanied by a new vision for how we produce user interfaces, called UX Evolution, and a bespoke DevOps deployment capability called Continuously Available Deployment.

With so many things changing we’ve had to look again at how we get some important but unsung jobs done. Top of the list? Monitoring.

A different kind of problem

The decentralisation of microservices delivers flexibility, scalability and resilience. It’s a natural fit for our Agile development approach and our federated Client Delivery teams, and it makes deployments smaller, faster, more frequent and less risky.

There’s no better way to architect the Engagement Platform but that doesn’t mean it’s all plain sailing. One of the trickier problems we’ve faced has been figuring out the right way to keep track of what all that decentralised software is doing.

The problems of software monitoring magnify as you fragment a system – and a microservices architecture isn’t just fragmented, it’s fragmented into pieces that are designed to care as little about each other as possible.

In a monolithic architecture the pathways through the system all occur inside a small and tightly controlled collection of software. Crudely, if that software is running then the system is up and if it’s not then the system is down. There are subtleties of course: “running” doesn’t necessarily imply “running well” or “doing what it’s supposed to do”. But it’s still the most straightforward scenario for software monitoring.

In our microservices architecture the pathways through the system can touch tens (perhaps hundreds one day) of independent and loosely coupled components. The collection of components involved will also change depending on what the software is being asked to do. It’s easier for complex, fragmented systems like this to harbour problems and it’s harder to figure out exactly what’s gone wrong when they happen.

Components can fail because they have a fault themselves, or because of a fault cascading down from an upstream component or dependency. In an environment like that what should you watch, and how? Or, to put it bluntly, if something’s broken how will you know what to restart?

Of course, the right answer depends on what you want to know and why.

A different kind of monitoring

Our thinking about what to monitor in the Engagement Platform has coalesced around a number of dualities.

Our monitoring will have two audiences: we want to stay on top of what our software is doing, and we want to create dashboards for customers so that they can see what they want and need to know about what their systems are doing.

We want to get two very different kinds of information from our monitoring too: technical metrics and business metrics. Technical metrics cover things like bandwidth or latency that tell sysadmins about a system’s performance and the stresses being applied to it. Business metrics describe the higher-level functionality and software features such as the number of logins that have occurred over a given period.

Metrics like login numbers give us a view on the software’s business impact, but they’re an important health check too. Just because a system is running and its components are busy, that doesn’t mean it’s OK (like a livelock in a multithreaded system where everything’s working but nothing’s getting done). It’s why we added positive and negative monitoring to our list. I want to know what isn’t happening – if, all of a sudden, nobody’s logging in, I want to know.
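As a rough illustration of negative monitoring, here is a minimal sketch of a check that flags when an expected business event stops arriving. The `AbsenceDetector` class, its method names and the five-minute window are all assumptions made for the example, not part of the Engagement Platform.

```python
from collections import deque


class AbsenceDetector:
    """Flags when an expected business event (e.g. a login) stops happening.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window_seconds: float, min_events: int = 1):
        self.window_seconds = window_seconds
        self.min_events = min_events
        self._timestamps = deque()

    def record(self, timestamp: float) -> None:
        """Note that one event happened at the given time (in seconds)."""
        self._timestamps.append(timestamp)

    def is_silent(self, now: float) -> bool:
        """Return True if fewer than min_events occurred in the last window."""
        cutoff = now - self.window_seconds
        # Discard events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) < self.min_events


# Assumed window: alert if nobody has logged in for five minutes.
logins = AbsenceDetector(window_seconds=300)
logins.record(timestamp=1000.0)

print(logins.is_silent(now=1100.0))  # False: a login was seen 100s ago
print(logins.is_silent(now=1400.0))  # True: the last login is 400s old
```

A real check would probably need a threshold that varies with the time of day, since a quiet five minutes at 3am means something different from a quiet five minutes at lunchtime.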

Perhaps the most important outcome of our monitoring rethink, though, was the realisation that what we started out thinking of as ‘monitoring’ turned into two very different things: ‘monitoring’ and ‘tracing’.

Monitoring is a top-down view of a system, or aspects of it, over time. You’re trying to look at the woods and not the trees.

Tracing is the opposite: it’s observing how a single request passes through the system. When you’re trying to find out what went wrong in a system, where perhaps 15 separate components are working together to get something done, tracing is invaluable.

A lot of good work has already been done in the field of tracing. Companies that rely heavily on microservices, such as Uber and Airbnb, are lining up behind Twitter’s open-source Zipkin tracer, so we’re looking at that with a great deal of interest.
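To make the idea of following a single request concrete, here is a minimal sketch of the general technique: every component that handles a request records a span carrying the same trace ID. This is not Zipkin’s API. The `Tracer` and `Span` classes and the component names are hypothetical, invented for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One component's part in handling a request (hypothetical type)."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this span
    parent_id: Optional[str]  # the span that called this one, if any
    component: str


@dataclass
class Tracer:
    """In-memory span collector, for illustration only."""
    spans: List[Span] = field(default_factory=list)

    def start_span(self, component: str, parent: Optional[Span] = None) -> Span:
        # A child inherits its parent's trace ID; a root span starts a new trace.
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        parent_id = parent.span_id if parent else None
        span = Span(trace_id, uuid.uuid4().hex, parent_id, component)
        self.spans.append(span)
        return span

    def path(self, trace_id: str) -> List[str]:
        """Components touched by one request, in the order they were recorded."""
        return [s.component for s in self.spans if s.trace_id == trace_id]


tracer = Tracer()
root = tracer.start_span("api-gateway")
auth = tracer.start_span("auth-service", parent=root)
tracer.start_span("accounts-service", parent=auth)
other = tracer.start_span("api-gateway")  # an unrelated request gets its own trace

print(tracer.path(root.trace_id))
# ['api-gateway', 'auth-service', 'accounts-service']
```

In a real deployment the trace ID would travel between services in a request header, and spans would be sent to a collector rather than held in memory.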

A different kind of storage

In rethinking our approach to monitoring we’ve identified many different types of data we want to access – and that, in turn, has caused us to rethink our approach to storage.

“When relational databases are used inappropriately, they exert a significant drag on application development.”

– Martin Fowler, ThoughtWorks

The CAP theorem has it that a distributed computer system can provide no more than two out of three of Consistency, Availability and Partition Tolerance, so no single database can be the best fit for everything. What sort of database you need for any given data depends on things like its characteristics, how often you want to read it or write to it, performance, scale, cost and business criticality.

We’re taking that idea, known as Polyglot Persistence, and applying it to the different types of data we capture for monitoring and tracing. Tracing demands detailed, time series data. Monitoring can generate huge amounts of data but it doesn’t all need to be kept; it can be safely aggregated and compressed.
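As a sketch of what “aggregated and compressed” can mean in practice, the example below rolls raw samples up into fixed time buckets and keeps only summary statistics. The `downsample` function, the one-minute bucket size and the latency figures are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def downsample(
    samples: Iterable[Tuple[float, float]], bucket_seconds: int
) -> Dict[int, Dict[str, float]]:
    """Roll (timestamp, value) samples up into fixed-width time buckets.

    Each bucket keeps only count, min, max and mean, so the raw samples
    can be discarded afterwards.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)

    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Made-up latency samples: (seconds since start, latency in ms).
raw_latency_ms = [(0, 10.0), (20, 30.0), (59, 20.0), (60, 50.0), (90, 70.0)]
summary = downsample(raw_latency_ms, bucket_seconds=60)

print(summary[0])   # {'count': 3, 'min': 10.0, 'max': 30.0, 'mean': 20.0}
print(summary[60])  # {'count': 2, 'min': 50.0, 'max': 70.0, 'mean': 60.0}
```

Five samples become two summary rows here. The trade-off is that individual data points are lost, which is acceptable for a top-down monitoring view but not for tracing.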

Changing the way we think about capturing and storing metrics is just one of the things my team is working on, but it’s a great example of how our decision to adopt a microservices architecture is driving innovation and renewal up and down the Intelligent Environments software stack.
