A Reference Architecture for Cloud Operational Support Systems

Most telecoms operators have multiple stovepiped networks, each with its own historic OSS. All CSPs want the agility of web-scale firms and view OSS and Cloud provision as complementary technologies. The challenge for CSPs is to move from legacy vertical pillars to a horizontal platform model. Trying to achieve this with a simple OSS refresh will be a mere shim. For CSPs to be revolutionary they must consider the viability of a Cloud OSS as a way of externalising the orchestration & management of their network resources.

Currently it’s quite easy to find major components of a SaaS BSS (for example Salesforce). However, it is much harder to find an equivalent within the OSS domain. The primary reason for the lack of SaaS in this domain is the nicheness of OSS (discussed previously in IoT Don’t Need No OSS). This nicheness is changing as AWS, GCP and Azure essentially offer an IoT OSS. There’s currently no ONAP SaaS, but I wouldn’t be surprised if ONAP matured into a SaaS offering at some point. The other major area of concern is security, which can be mitigated through policy & control. Lastly, there are concerns around the throughput and latency of Resource Performance Management, which is a specific topic covered later.

There’s also increasing CSP interest in Open Source OSS (OSS² maybe?) with Open Source MANO, ONAP and the new TM Forum ODA architecture (for which I’m partly responsible). These OSSs provide functions that are componentised in their design.

I’ve personally been looking at putting together a best-of-breed architecture based on OSM, ONAP and some Netflix OSS in a cloud-hosted environment to support multiple operational networks. In doing this work I’m trying to answer the following questions:

What is a suitable logical architecture for a Cloud OSS?
And if it can’t all be externally hosted then what would be a suitable logical hybrid architecture?

In order to answer these questions let’s decompose the functions of OSS and compare which parts are most suitable for being cloud hosted. Let’s break it down (using eTOM’s service and resource domains) into nine logical packages for further investigation.

I’ve categorised these packages by Cloud Nativeness (how easy it is to port the functions to the Cloud and how many SaaS offerings are available) against Network Interactivity (be it throughput of data or proximity to element managers). It is fairly self-evident that certain functions are cloud native (service management) whilst others (order to activation) require close deployment to the network and have specific security constraints.

By grouping the logical functions we end up with three groups: Cloud Native Solutions (those that already run well in the cloud), Not Cloud Native Solutions (those that can’t be externalised to the Cloud) and a middle group of Either / Or that could be either internally managed or externalised.
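The grouping can be sketched as a small classifier over the two axes. This is only an illustration: the 1 to 5 scores, the thresholds and the names `OssPackage`, `classify` and `group_packages` are my assumptions, chosen so that the output matches the three groups described in this post. They are not measured values.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    CLOUD_NATIVE = "Cloud Native"
    NOT_CLOUD_NATIVE = "Not Cloud Native"
    EITHER_OR = "Either / Or"


@dataclass(frozen=True)
class OssPackage:
    name: str
    cloud_nativeness: int       # 1 = hard to port .. 5 = SaaS readily available (illustrative)
    network_interactivity: int  # 1 = loosely coupled .. 5 = tight, low-latency coupling (illustrative)


def classify(pkg: OssPackage) -> Placement:
    """Place a package in one of the three groups (thresholds are assumptions)."""
    if pkg.cloud_nativeness >= 4 and pkg.network_interactivity <= 2:
        return Placement.CLOUD_NATIVE
    if pkg.cloud_nativeness <= 2 and pkg.network_interactivity >= 4:
        return Placement.NOT_CLOUD_NATIVE
    return Placement.EITHER_OR


# The nine logical packages, with hypothetical scores on each axis.
PACKAGES = [
    OssPackage("Service & Incident Management", 5, 2),
    OssPackage("Network & Service Operations", 4, 2),
    OssPackage("Field Management", 5, 1),
    OssPackage("Resource Plan & Build", 4, 1),
    OssPackage("Strategy", 4, 1),
    OssPackage("Order to Activation", 1, 5),
    OssPackage("Performance Management", 2, 5),
    OssPackage("MANO for NFV / SDN", 3, 4),
    OssPackage("Machine Learning & Autonomic Remediation", 3, 3),
]


def group_packages(packages):
    """Return a dict mapping each Placement to the package names in that group."""
    groups = {placement: [] for placement in Placement}
    for pkg in packages:
        groups[classify(pkg)].append(pkg.name)
    return groups


if __name__ == "__main__":
    for placement, names in group_packages(PACKAGES).items():
        print(f"{placement.value}: {', '.join(names)}")
```

A CSP would substitute its own scoring. The useful part is making the two axes explicit so that the placement of each package can be argued over.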

The Either / Or group is the newest area, covering Machine Learning, Autonomics and MANO for NFV / SDN. These could be either natively deployed (for example a local deployment of FlinkML on top of a performance management solution) or a cloud-hosted solution (e.g. Google Cloud Platform’s TensorFlow deployment).

Cloud Native Solutions:

Service and Incident Management systems include perennial favourites ServiceNow, BMC Remedy & Cherwell. As cloud-hosted solutions, these tools require feeds from alarm management systems. As the architecture orients itself to data streaming and machine learning, the incident management system handles fewer tickets and works more on auto-remediation. This model necessitates the closed-loop remediation function sitting within the network. I would expect a streaming flow in and out of the network boundary, and this will obviously be the biggest of the pipes (and the most risky). Network & Service Operations provides a specialism for service & incident management and includes resource alarm management; solutions like EMC Smarts & IBM Netcool increasingly offer cloud-based operations consoles for alarm management.

Field Management systems together with Resource Plan and Build are easily managed from a public cloud. These systems have limited access to the operational network and normally have to manage internal and third-party resources to complete field operations. Systems like ESRI and Trimble fit in this space. These systems predominantly need access to resource and service inventories, and resource tools (such as HR systems, maps and skills bases).

Strategy systems are an interesting case of specialist planning, delivery and product lifecycle tools within eTOM. They cover service development & retirement, capability delivery and strategic planning. These functions are all loosely coupled to the network: they require inventory detail, resource detail and a big data store of network performance, but they can be hosted externally and are not mission-critical systems. So for our OSS these should be Cloud Native.

Not Cloud Native Solutions:

Order to Activation covers the activation systems for management of the network, which are either subscription-based or resource activation. There is a distinction here between the provisioning controller and the intent-based network choreographer (which passes intent and policy to the network).

Performance Management covers real-time operational systems that predominantly take data streams from the network. They require local deployment, as network functions predominantly require low latency if incidents are to be immediately managed.
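The latency argument comes down to simple arithmetic: the time to process a measurement plus the round trip to wherever the system is hosted has to fit inside the time allowed to react. A minimal sketch follows; the function name `fits_reaction_budget` and all the millisecond figures are hypothetical and only there to show the comparison.

```python
def fits_reaction_budget(reaction_budget_ms: float,
                         processing_ms: float,
                         round_trip_ms: float) -> bool:
    """Return True if processing plus network round trip fits the reaction budget."""
    return processing_ms + round_trip_ms <= reaction_budget_ms


# Hypothetical figures: 50 ms allowed to react, 20 ms to analyse a sample.
REACTION_BUDGET_MS = 50.0
PROCESSING_MS = 20.0

# Assumed round-trip times for an on-net deployment and a public cloud region.
local_ok = fits_reaction_budget(REACTION_BUDGET_MS, PROCESSING_MS, round_trip_ms=2.0)
cloud_ok = fits_reaction_budget(REACTION_BUDGET_MS, PROCESSING_MS, round_trip_ms=60.0)

if __name__ == "__main__":
    print(f"local deployment fits budget: {local_ok}")
    print(f"cloud deployment fits budget: {cloud_ok}")
```

With these assumed numbers the on-net deployment fits and the cloud-hosted one does not, which is the shape of the argument above. Real budgets depend on the network function and the incident type.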

The Interesting Either / Or Group:

MANO for NFV / SDN can either be a localised solution or can be cloud hosted in the case of a master orchestrator implementing intent-based models. This model makes sense when the orchestration involves third-party service orchestration. This is partially covered by the TM Forum ODA. The challenge would be organising the split between VNF Management and NFV Orchestration. The security controls will need to guard against attacks on the client VNF Manager running inside a CSP’s network.

It is likely that CSPs will investigate this model going forward as they look to benefit from the opportunity of providing Mobile Edge Compute as an integrated PaaS.

Machine Learning & Autonomic Remediation is partially dependent upon the NFVO cloud architecture, as remediation needs services to be exposed before it can act on the network. If the NFVO is already cloud hosted then remediation is a natural continuation of its capabilities. The Machine Learning capability is a driver for the remediation engine, constantly looking for situational improvements for specific conditions. Machine Learning can be deployed locally on a CSP’s own infrastructure or use the scaling capabilities of tools like TensorFlow on GCP. The decision CSPs make here will be about scaling the intelligence to provide usable conditions that can be implemented within the remediation engine. A CSP with good skills in this area will have a technology advantage.
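To make the relationship concrete, here is a minimal sketch of a remediation engine that maps detected conditions onto actions exposed by an orchestrator. Everything in it is hypothetical: the class names, the condition names and the `scale_out` / `restart` functions stand in for whatever services a real NFVO would expose, and no real ONAP or OSM API is being called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Condition:
    """A situation detected by the ML layer (names are illustrative)."""
    name: str
    resource: str


class RemediationEngine:
    """Maps condition names to remediation actions exposed by the orchestrator."""

    def __init__(self) -> None:
        self._actions: Dict[str, Callable[[str], str]] = {}

    def register(self, condition_name: str, action: Callable[[str], str]) -> None:
        self._actions[condition_name] = action

    def handle(self, condition: Condition) -> Optional[str]:
        action = self._actions.get(condition.name)
        if action is None:
            return None  # no automated remedy known; fall back to raising a ticket
        return action(condition.resource)


# Stand-ins for services an NFVO might expose (hypothetical, not a real API).
def scale_out(resource: str) -> str:
    return f"scale-out requested for {resource}"


def restart(resource: str) -> str:
    return f"restart requested for {resource}"


engine = RemediationEngine()
engine.register("cpu_saturation", scale_out)
engine.register("process_hung", restart)

if __name__ == "__main__":
    print(engine.handle(Condition("cpu_saturation", "vnf-1")))
    print(engine.handle(Condition("unknown_fault", "vnf-2")))
```

The point of the sketch is the dependency: the engine can only act through actions that have been registered, which is why remediation follows wherever the NFVO's services are exposed.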

Next Steps:

I will be updating this stream as I believe there is a genuine future for a Cloud Native OSS. So please keep following this blog and ping me @apicrazy if you’re on the same journey.


Published by mustnotgrumble
