Monitoring Micro-Service Applications across Hybrid Clouds using Istio service mesh multi-clusters, Kiali observability, Zipkin tracing, Prometheus events and Grafana visualisations

Most enterprises have complex application deployments across their own internal data centres and commercial clouds. I am using Google Cloud Platform and AWS in this example. Where I work, we traditionally monitored logs and configured alarms for network and infrastructure monitoring. This approach was disjointed and slow to react. The enterprise moved to cloud hosting with elastic scalability a few years ago which led to multiple stove pipes of monitoring capability and a heavy dependency on VPC interconnects. We wanted to move to a multi-cloud environment whilst maintaining the benefits of a centralised technology operations centre.

We quickly realised that we had specific workloads running in different environments with no common mechanism for monitoring & reporting. This led us to examine open-source monitoring architectures based on Netflix’s Keystone Pipeline. Our requirements were for a universal data visualisation and observation of our application based on Grafana, Zipkin and Kiali.

Logical architecture and open source technologies

This architecture is based on open-source projects that we can use across GCP, AWS and internally. Everything is predicated on Docker containers and Kubernetes container orchestration. Istio provides the policy and load-balancing functions of a service mesh and gRPC provides the low latency integrations between the micro-services. These technologies provide the enablers for the monitoring & visualisation capabilities of Kiali, Zipkin and Grafana.

The following diagram shows the open-source component architecture to support different internal data centres (one for IT running Pivotal and one for mobile network IT running Openstack), Google App Engine and AWS Kubernetes service EKS on EC2. This logical architecture has the intention of a single pane of glass for service management toolkit technologies.

Open Source Monitoring Toolset across Hybrid Clouds

Achieving a single pane of glass across multiple clouds requires an aggregation function that can integrate the control planes of multiple Kubernetes clusters. Istio achieves this by supporting multicluster deployments across hybrid clouds, deploying a control plane to each Kubernetes cluster. Kiali can provide service mesh observability of an Istio multi-cluster environment. A Helm variable, global.remoteZipkinAddress, can be used to connect Zipkin distributed tracing to the Istio cluster.

All of this together enables a Kubernetes control plane on each hybrid cloud environment to be interconnected to the master visualisation technology operations centre environment.

The traffic flow of a Kubernetes ingress allows the ELB, using gRPC, to integrate the multiple clusters where the Prometheus collection agents are deployed. These can then be aggregated through the Prometheus server in the logical control plane.

Note that the Helm Tiller deployments to each cluster support the multi-cluster control plane as described here.

Kubernetes and Istio Mixer Control Plane for Multicluster Deployments

Prometheus provides the time series of events for the multiple clusters that can then be queried by any Grafana server which treats storage backends as time series data (Data Source). Each Data Source has a specific Query Editor that is customized for the features and capabilities that the particular Data Source exposes. Grafana can also consume StackDriver, CloudWatch and Ceilometer for Openstack.
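
As a rough illustration of what a Data Source query looks like underneath Grafana, the sketch below builds an instant query against the Prometheus HTTP API (/api/v1/query) and reads a per-cluster value out of the JSON response. The host name, the `cluster` label and the sample response are assumptions for illustration only; they are not taken from our deployment.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def values_by_label(response_body: str, label: str) -> dict:
    """Map a label value (e.g. a cluster name) to the sample value of each series."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    out = {}
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _, raw = series["value"]
        out[series["metric"].get(label, "unknown")] = float(raw)
    return out


# Hand-written sample in the shape of an instant-vector response.
# The "cluster" label is an assumption: it must be attached by your own
# Prometheus configuration (for example as an external label).
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "gcp"}, "value": [1546300800, "3"]},
            {"metric": {"cluster": "aws"}, "value": [1546300800, "2"]},
        ],
    },
})

if __name__ == "__main__":
    # Hypothetical server address; "up" is the built-in target health metric.
    print(build_query_url("http://prometheus.example:9090", "sum by (cluster) (up)"))
    print(values_by_label(SAMPLE, "cluster"))
```

In practice Grafana issues these queries itself; the point here is only that each cluster shows up as one labelled series that a single dashboard can render side by side.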

In conclusion:

  • Istio, Helm & Tiller can manage a multi-cluster hybrid cloud deployment
  • moving to a hybrid cloud requires a visualisation of complex integrations which is where Istio and Kiali service mesh observability are strong
  • hybrid cloud monitoring can be achieved by deployment of agents including Prometheus collection agents to individual clusters and connected to a Prometheus server which in turn is rendered by a Grafana server
  • Zipkin provides distributed tracing and integrates with the Istio managed cluster

One point not described is the requirement for a technical inventory that describes the individual micro-services and the toolsets that can be deployed to each container, but I’ll save that for another blog.

Finally, there are technology alternatives to Kiali, Zipkin, Grafana and Prometheus, such as Logstash & ELK, FluentD and commercial solutions like Datadog.

Enterprise Architect’s Guide to Cloud Licencing Models


Licencing does not become less difficult when moving to cloud models, including SaaS, and with the possible proliferation of services it can become difficult for the Enterprise to govern. As with any type of licence agreement the Enterprise must know the agreement they have signed, the implications of the licensing model and the interaction with other 3rd party contracts. Monitoring of Service and Usage is paramount. The monitoring must relate back to the agreement and be within the dominion of the Enterprise. Every element of your organization’s software licensing must be managed under an onsite software agreement; but it must also include agreements for the software potentially being used externally as well.

Enterprise Architecture must understand the types of licencing models in the Cloud and how they affect the Enterprise and its customers. The following blog describes my experiences with cloud licences and the different models:

  • IT Cost as a Percentage of Revenue: Optimal Spend
    • Many Enterprises use IT Cost as a Percentage of Revenue to understand the OPEX costs of their IT against corporate revenue. This model works for larger enterprises with stable revenues.
    • For the start up the services can be used immediately and the model can scale according to demand. The challenge can be that it is difficult to scale on utilisation if the revenue decreases and therefore IT Cost as a Percentage of Revenue can peak.
    • Even within start-ups the Enterprise Architect must be aware of the ability to divest as well as invest in new technologies.
  • Hosted vs. On Premises: Software Asset Management
    • One of the biggest advantages of moving to Cloud or SaaS based applications is the reduced hardware infrastructure and personnel costs required to run business applications. An externally hosted infrastructure or more pertinently a hybrid model requires the inventorying of hardware, applications and licences.
    • New Software License Optimization tools are required that allow organizations to accurately inventory virtualized cloud environments
    • In a hosted model the software and infrastructure licence costs are bundled. Normally the costs are competitive but in certain scenarios, such as storage, it is possible to find a better deal through internal hosting. The Enterprise Architect must logically decompose the physical architecture to understand the optimal cloud deployment model and consider it as part of the Enterprise’s cloud architecture.
  • Subscription vs. Perpetual: Licence model cadence
    • The perpetual licencing model is well understood; the Enterprise has formal RFPs and set renewal dates for perpetual licences. The cadence with a Cloud model is faster. Subscriptions renew monthly and the Enterprise needs to ensure they are not over-spending, or heading towards over-spend, in any monthly period.
    • The Enterprise Architect must manage their IT estate of Cloud services closely because the barrier to entry to the Cloud is much lower than with perpetual licences. Without formal RFPs, the Enterprise will enter into multiple subscriptions for the same services or will licence services that may be underused.
    • The role of the Enterprise Architect for cloud governance is critical; without strong governance the precedent of point cloud solutions can spread across the Enterprise.
  • Usage-based Software License Models: Pay for what you eat but you’ve got to rent the plate
    • Cloud has made usage-based pricing more popular. Models that seem simple at inception become increasingly complex as your Enterprise’s requirements develop.
    • Usage based pricing models are complex as the cost to serve does not always align to the cost to use and determining the value of the service can then become very complex.
    • The Enterprise Architect provides benefit by understanding the value of the Software Licence Models. The EA needs to be familiar with the different types of software licencing models and their pitfalls. This includes both the licencing models and the possible legal and regulatory issues.
  • Accurate Forecasting of Costs: Roadmap use
    • On-premise perpetual licences provide predictable pricing and no surprises. The accurate forecasting of future spend in the Cloud is a challenge as the pricing models can change, usage changes, and there are not as many controls over growth or capacity demands. Enterprises need to be much more diligent about making sure their licensing costs are optimized, transparent, and predictable.
    • The Enterprise Architect has the foresight on the system roadmap and must understand the Cloud usage model. Here the EA must work closely with the finance team to predict the expected growth in the licencing model and to have a strategic roadmap for key scenarios.
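
Two of the points above reduce to simple arithmetic, sketched below: IT Cost as a Percentage of Revenue, and a linear run-rate projection of a monthly subscription bill. The figures are invented for illustration, and the linear projection is an assumption; real usage-based bills are rarely this smooth.

```python
def it_cost_pct_of_revenue(it_cost: float, revenue: float) -> float:
    """IT cost expressed as a percentage of revenue for the same period."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return 100.0 * it_cost / revenue


def projected_month_spend(spend_to_date: float, day_of_month: int,
                          days_in_month: int) -> float:
    """Project month-end spend by assuming the daily run rate stays constant."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must be within the month")
    return spend_to_date / day_of_month * days_in_month


def heading_for_overspend(spend_to_date: float, day_of_month: int,
                          days_in_month: int, monthly_budget: float) -> bool:
    """True when the projected month-end spend exceeds the budget."""
    projected = projected_month_spend(spend_to_date, day_of_month, days_in_month)
    return projected > monthly_budget


if __name__ == "__main__":
    # Invented figures: IT cost stays at 50,000 while revenue halves,
    # so the percentage doubles even though nothing changed in IT.
    print(it_cost_pct_of_revenue(50_000, 1_000_000))
    print(it_cost_pct_of_revenue(50_000, 500_000))
    # Invented figures: 4,000 spent by day 10 of a 30-day month, budget 10,000.
    print(projected_month_spend(4_000, 10, 30))
    print(heading_for_overspend(4_000, 10, 30, 10_000))
```

The first pair of calls shows why the percentage can peak when revenue falls and the estate cannot scale down; the second shows the kind of mid-month check that a monthly subscription cadence calls for.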


Edge SDN as a Service

Not all micro-services can be stateless lambda functions. Some services must maintain state. A good example is the management of autonomous vehicle platooning functions across multiple radio network sites.

A challenge for this distributed statefulness is that if the stateful micro-services are running in a specific container, how does the SDN controller manage networking to that container? This requires attaching the SDN networking at the container level rather than the host level, something that is possible with Amazon EC2 Container Service.

If Tier-1 telcos are serious about providing Network as a Service or Edge Compute as a Service then they must provide the join between data centre and network operator. To do this they can either be the edge landlord to Amazon, Google and Facebook or, if they are truly ambitious, provide an SDN Edge.

Charles Gibbons is talking about Future of NFV / SDN at Digital Transformation World this week in Nice:

Cloud Migration of a Legacy IT Estate

There are many things to consider when migrating a legacy IT estate to the cloud. The first, though, must be the motivations and expected benefits. Many organisations have many decades of developed software running on private infrastructure, and migration to the cloud is something they think they should do.

Migrating an estate to the cloud incurs a significant cost hurdle as new functions are required just to support migration activities. Often the benefit is minimal as only limited efficiencies can be found from closing (or worse partial closing) of legacy applications and data centres.

What is needed is a target systems architecture aligned to business benefits and vertical product supporting IT Stacks.

The systems architecture should reflect management of intermediary states between internal hosting and public cloud. The management of intermediary estates can easily increase an organisation’s run cost; for example, if Corporation A decides to migrate all of its channels’ IT to a public cloud it will need to build an integration from public to private infrastructure, lease a connection between new and old sites, provide a security wrap and identity management function across internal and external clouds and finally support the operations for managing these new systems.

The benefits required to justify all of these new cloud enablement functions will be high. This does not mean it should never be done but the business must address how benefits like improved time to market will be substantively realised.

A TOGAF business architecture should be included before migrating as migration for the sake of hosting will only ever be a platform change. The balance has to be on how much change your organisation can stomach in a single move. Always consider that the SaaS services you are considering will probably be more configurable than your legacy estate. So don’t fall into the myth of business architecture as business change does not always have to be front loaded.

A Reference Architecture for Cloud Operational Support Systems

Most telecoms operators have multiple stove piped networks, each with a specific historic associated OSS. All CSPs want the agility of Web Scale firms and view OSS and Cloud provision as complementary technologies. The challenge for CSPs is to move from legacy vertical pillars to a horizontal platform model. Trying to achieve this with a simple OSS refresh will be a mere shim. For CSPs to be revolutionary they must consider the viability of a Cloud OSS as a way of externalising the orchestration & management of their network resources.

Currently it’s quite easy to find major components of a SaaS BSS (for example Salesforce). However, it is much harder to find an equivalent within the OSS domain. The primary reason for the lack of SaaS in this domain is the nicheness of OSS (discussed previously here IoT Don’t Need No OSS). This nicheness is changing as AWS, GCP and Azure essentially offer IoT OSS. There’s currently no ONAP SaaS; but I wouldn’t be surprised if ONAP matured into a SaaS offering at some point. The other major area of concern is security, which can be mitigated through policy & control. Lastly there are concerns around throughput / latency of Resource Performance Management, which is a specific topic covered later.

There’s also increasing CSP interest in Open Source OSS (OSS2 maybe?) with Open Source Mano, ONAP and the new TM Forum ODA architecture (for which I’m partly responsible). These OSS’ provide functions that are componentised in their design.

I’ve personally been looking at putting together a best of breed architecture based on OSM, ONAP and some Netflix OSS on a cloud-hosted environment to support multiple operational networks. In doing this work I’m trying to understand the following questions:

  1. What is a suitable logical architecture for a Cloud OSS?
  2. And if it can’t all be externally hosted then what would be a suitable logical hybrid architecture?

In order to answer these questions let’s decompose the functions of OSS and compare which parts are most suitable for being cloud hosted. Let’s break it down (using eTOM’s service and resource domains) into nine logical packages for further investigation.

Functions for Cloud OSS

I’ve categorised by Cloud Nativeness (how easy is it to port these functions to the Cloud and how many SaaS offerings are available) against Network Interactivity (be it throughput of data or proximity to element managers). It is fairly self-evident that certain functions are cloud native (service management) whilst others (order to activation) both require close deployment to the network and have specific security constraints.

By grouping the logical functions we end up with three groups: Cloud Native Solutions (those that already run well in the cloud), Not Cloud Native Solutions (those that can’t be externalised to the Cloud) and a middle group of Either / Or that could be either internally managed or externalised.
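
The grouping can be sketched as a simple two-axis classification. The nine package names follow the sections of this post, but the 1 to 5 scores and the thresholds below are illustrative assumptions of mine, chosen only so that the output reproduces the three groups; they are not measurements.

```python
from collections import defaultdict

# (cloud_nativeness, network_interactivity), each scored 1 (low) to 5 (high).
# All scores are hypothetical and exist only to illustrate the method.
PACKAGES = {
    "Service & Incident Management": (5, 2),
    "Network & Service Operations": (4, 2),
    "Field Management": (5, 1),
    "Resource Plan & Build": (4, 1),
    "Strategy": (4, 1),
    "Order to Activation": (1, 5),
    "Performance Management": (2, 5),
    "MANO for NFV / SDN": (3, 3),
    "Machine Learning & Autonomic Remediation": (3, 3),
}


def classify(cloud_nativeness: int, network_interactivity: int) -> str:
    """Assign a package to one of the three groups (thresholds are assumptions)."""
    if network_interactivity >= 4:
        # Needs to sit close to the network, whatever SaaS offerings exist.
        return "Not Cloud Native"
    if cloud_nativeness >= 4:
        return "Cloud Native"
    return "Either / Or"


def group_packages(packages: dict) -> dict:
    """Return {group name: [package names]} for every scored package."""
    groups = defaultdict(list)
    for name, (nativeness, interactivity) in packages.items():
        groups[classify(nativeness, interactivity)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for group, names in group_packages(PACKAGES).items():
        print(f"{group}: {', '.join(names)}")
```

Network Interactivity is checked first on purpose: a function that must sit next to the network stays internal even if SaaS offerings exist for it.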


The Either / Or group is the newest area, covering Machine Learning, Autonomics and MANO for NFV / SDN. These could be either natively deployed (for example a local deployment of FlinkML on top of a performance management solution) or a cloud hosted solution (e.g. Google Cloud Platform’s TensorFlow deployment).

Cloud Native Solutions: 

Service and Incident Management systems include perennial favourites ServiceNow, BMC Remedy & Cherwell. As cloud hosted solutions, these tools require feeds from alarm management systems. Whilst the architecture orients itself to data streaming and machine learning, the incident management system handles fewer tickets and works more on auto-remediation. This model necessitates the closed loop remediation function to sit within the network. I would expect a streaming flow in and out of the network boundary and this will obviously be the biggest of the pipes (and the most risky). Network & Service Operations provides a specialism for service & incident management and includes the resource alarm management; solutions like EMC Smarts & IBM Netcool increasingly offer cloud based operation consoles for alarm management tools.

Field Management systems together with Resource Plan and Build are easily managed from a public cloud. These systems have limited access to the operational network and normally have to manage internal and 3rd party resource to complete field operations. Systems like ESRI and Trimble fit in this space. These systems predominantly need access to resource and service inventories, and resource tools (such as HR systems, maps and skills bases).

Strategy systems are an interesting case of specialist planning, delivery and product lifecycle tools within eTOM. They cover service development & retirement, capability delivery and strategic planning. These functions are all equally loosely coupled to the network, so they require inventory detail, resource detail and a big data store of network performance. But they can be hosted externally and are not mission critical systems. So for our OSS these should be Cloud Native.

Not Cloud Native Solutions:

Order 2 Activation covers the activation systems for management of the network, which are either subscription based or resource activation. There is a distinction here between the provisioning controller and the intent based network choreographer (passing intent and policy to the network).

Performance Management: real-time operational systems, predominantly taking data streams from the network, require local deployment as network functions predominantly require low latency if incidents are to be immediately managed.

The Interesting Either / Or Group:

MANO for NFV / SDN can either be a localised solution or can be cloud hosted in the case of a master orchestrator implementing intent based models. This model makes sense when the orchestration involves third party service orchestration. This is partially covered by the TM Forum ODA. The challenge would be organising the split of VNF Management from NFV Orchestration. The security controls will need to close off the attack vector to the client VNF Manager running inside a CSP’s network.

It is likely that CSPs will investigate this model going forward as they look to benefit from the opportunity of providing Mobile Edge Compute as an integrated PaaS.

Machine Learning & Autonomic Remediation is partially dependent upon the NFVO cloud architecture, as remediation needs services to be exposed in order to implement remediation. If the NFVO is already cloud hosted then remediation is a natural continuation of its capabilities. The Machine Learning capability is a driver for the remediation engine, constantly looking for situational improvements for specific conditions. Machine Learning can be deployed locally on a CSP’s own infrastructure or use the scaling capabilities of tools like TensorFlow on GCP. The decision CSPs make here will be about scaling the intelligence to provide usable conditions that can be implemented within the remediation engine. A CSP with good skills in this area will have a technology advantage.

Next Steps:

I will be updating this stream as I believe there is a genuine future for a Cloud Native OSS. So please keep following this blog and ping me @apicrazy if you’re on the same journey.




Bringing IT (OSS) all together

I try and fit components together logically so that they can make the most of what the technology offers. I work predominantly in the OSS world on new access technologies like 5G and implementations like the Internet of Things. I want to achieve not just the deployment of these capabilities but also to let them operate seamlessly. The following is my view of the opportunity of closed-loop remediation.

For closed-loop remediation there are two main tenets: 1. you can stream all network event data into a machine learning engine and apply an algorithm like K-Nearest Neighbour; 2. you can expose remediation APIs on your programmable network.

All of this requires a lot of technology convergence but: What’s actually needed to make everything convergent?


Let’s start with Streaming. Traditionally we used SNMP for event data, traps & alarms, and when that didn’t work we deployed physical network probes. Now it’s Kafka stream-once implementations, where streams of logs from virtualised infrastructure and virtualised functions are parsed in a data streaming architecture into different big data persistence stores.

The Machine Learning engine (I’m keenest on FlinkML at the moment) works on the big data persistence, providing the largest possible corpus of event data. The ML K-NN can analyse network behaviour and examine patterns that are harder for human operation teams to spot. It can also predict timed usage behaviours and scale the network accordingly.
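
To make the K-NN idea concrete, here is a minimal pure-Python sketch that labels a new network observation by majority vote among its k nearest historical observations and then looks up a remediation. It is not FlinkML; the two features, the training points, the labels and the remediation names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical history: (cpu_utilisation 0..1, packet_loss_percent) -> label.
HISTORY = [
    ((0.20, 0.1), "normal"),
    ((0.30, 0.2), "normal"),
    ((0.25, 0.0), "normal"),
    ((0.90, 2.0), "congested"),
    ((0.85, 3.0), "congested"),
    ((0.95, 2.4), "congested"),
]

# Hypothetical mapping from predicted state to a remediation action name.
REMEDIATIONS = {"normal": "no_action", "congested": "scale_out"}


def knn_predict(history, observation, k=3):
    """Return the majority label among the k nearest historical points."""
    if not 1 <= k <= len(history):
        raise ValueError("k must be between 1 and the number of samples")
    # Euclidean distance from the observation to every stored point.
    nearest = sorted(history, key=lambda item: math.dist(item[0], observation))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def choose_remediation(observation, k=3):
    """Predict the network state and look up the matching remediation."""
    return REMEDIATIONS[knn_predict(HISTORY, observation, k)]


if __name__ == "__main__":
    print(choose_remediation((0.90, 2.5)))
    print(choose_remediation((0.22, 0.1)))
```

In a real pipeline the features would need scaling so that one axis does not dominate the distance, and the chosen action would be sent to whatever remediation API the programmable network exposes.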

I am increasingly looking at Openstack and Open Source Mano as an NFVO platform orchestrating available virtualised network functions. The NFVO can expose a customer facing service or underlying RFSs. But to truly operate, the ML should have access to the RFS layer. This is the hardest part and is dependent upon the underlying design pattern implementation of the Virtual Network Functions. This though is a topic for another blog post.