Goals of the case study

  • To specify the information that needs to be exchanged between Warren and the client Data Center (DC), in order to decide whether the requirements of the DC can be fulfilled with the current feature set of Warren.

  • To gather all necessary data, taking into account a wide variety of technical possibilities that will satisfy the service model and commercial goals of the DC.

  • For Warren ... for the discovery of the features required by the DC, together with business value vs development effort estimates, to provide maximum business value for the DC.

Introduction

Necessary information from the Data Center can be gathered with a set of simple, unambiguous questions that are divided into three subsets, based on the goal they are meant to achieve. Each subset can be expressed as a more general "umbrella question":

  1. What are the current infra- and software components in use and what role do they play in the plans for the future?

  2. What services is the DC offering and what does it expect from Warren?

  3. What is the DC's level of commitment to the cooperation with Warren and which 3rd party software systems will run alongside Warren?

Once all subset questions are answered, ambiguity and misunderstanding over the decisive technical factors will be cleared. The technical clarity achieved is the groundwork for a successful cooperation.

Answers to the above questions are vital for gathering information that affects the following areas of Warren's development:

  1. Architectural Decisions:

    1. Which external libraries, components and standards to use to meet the requirements of the majority of DCs in Warren's client target groups.

    2. How to architect features across components to provide the value Warren aims to offer, while maintaining the quality of services and the necessary processes the DC had before adopting Warren.

  2. Business Value and Marketing:

    1. Can Warren guarantee the functionality it claims to offer?

    2. Can Warren provide the functionality at a sufficient level of reliability in a particular domain of service for the DC?

    3. Will the development effort for Warren be in proportion to the business value the developed functionality is forecast to provide?

    4. Do (or will) the features and functionality of Warren correspond to the actual requirements of the DC?

Trade-Offs between Availability vs Locality and Software vs Hardware Defined Control

There are two fundamental tradeoffs that a DC needs to decide on:

  • Availability vs Locality

  • Software vs Hardware-Defined control

To analyze the possible trade-off decisions for a DC and to make the input directly usable in Warren's development process, the hypothetical DC stack is divided into the following functional domains: hardware, firmware, software. All three domains have common properties that correlate to Warren's software components. The Network and Storage stacks that the DC has previously adopted have the biggest influence on future development and technical on-boarding.

Network and Storage are tightly coupled, as decisions in one domain are influenced by the properties of the other domain. Once the Network and Storage domains are analysed in full, decision-making in the following two trade-offs is possible.

Availability vs Locality

 The biggest trade-off decision is in multi-site computing (thus distributed cache and storage are simultaneously good and evil).

  1. Availability in this context denotes:

    1. Spatial continuity - data or service is concurrently available to recipients/consumers in different locations rather than just one. Example: many Virtual Machines using the same database that resides in distributed storage.

    2. Temporal continuity - data or service is kept available even in case of soft- or hardware failures ("High Availability").

    In cloud computing both spatial and temporal continuity may seem desirable, but the downside is lower delivery speed and higher latency. For example, spatial availability built on distributed storage without high-end hardware may not have the optimal latency for storage-sensitive applications.

    To assure application high availability, several software and hardware redundancies are involved. This requires buying and maintaining additional hardware, which raises the Total Cost of Ownership.

  2. Locality denotes the physical distance of a functional domain from compute (CPU, RAM) resources (local storage vs distributed storage).

    Locality is also not free from redundancy costs, as High Availability metrics are achieved by involving both distributed and redundant resources. Luckily that redundancy is usually on the sub-server level and is therefore less expensive. Local storage has lower latency, which is desirable, but its total storage capacity is limited. In addition, as data on local storage is not available to outer devices without additional control and services, additional data duplication demands are introduced on top of the requirements for High Availability. This means extra development work and more costs.

Software- vs Hardware-Defined Domains of Control

This can mostly be described as:

  1. Software-Defined* - slow, yet flexible, automatically reconfigurable and easily portable

  2. Hardware-Defined* - high speed, low portability, automatic configuration is limited or impossible

...

The general tendency is towards the concept of a “solely software-defined DC”, largely because of the automation and management benefits it offers. An exception to this tendency is the popularity of bare-metal provisioning, as demand for direct control over hardware still exists. Some types of applications require it for speed and for independence from general software-level system failures.

Clearly Defined Roles of DC System Administration and Warren System Support

The role of DC system administrators depends on the size of the DC, the nature and complexity of the infrastructure, regional peculiarities, the job description and other factors. In the cooperation between the DC and Warren, a strict distinction can be made between the DC system administrator role and the Warren system support role. It is important to note that Warren's system support role differs from third party software support. Where exactly the border lies depends on the DC and the installation requirements and needs to be discussed.

Please note:

  • The roles and responsibilities must be addressed thoroughly before the final decision to adopt Warren.

  • It is a mistake to assume that supportability is a second-grade matter that can be addressed after production-grade systems are set up.

  • As a matter of fact, it is one of the most important parts of service provisioning, so it deserves a chapter in the development documentation!

  • Complex software systems must be developed with efficient observability and support in mind.


Considerations and Goals in the Network Domain

There are several factors in the DC's network setup that dictate what needs to be thought through in the development process of Warren. Such factors include:

Network Topology

Network topology defines the traffic between components, servers and racks, as well as the traffic between the DC and the internet. It also sets the physical extendability properties of the DC, so we need to consider:

  • How will the automated hardware discovery process be handled?

  • What deployment schema should be used when implementing new nodes?

  • Which system components are involved in discovery and deployment processes?

  • How will any non-positive results of discovery and deployments be handled?

Topology types:

  • tree

  • Clos

  • fat-tree

  • torus

  • etc.

We are obviously not able to fine-tune Warren's setup for every type of topology. Topology is not a standalone factor, and the set of variables in such an analysis is large, which makes pre-analysis too costly compared to the business value of the expected outcome. However, we can target discovery and deployment solutions that cover the topologies most widely used by DCs with a sufficient degree of quality. Service reliability and availability metrics cannot be calculated purely theoretically in a platform that is under continuous heavy development. Thus these metrics will rather be defined during the DC adoption and on-boarding process.

The presumption is that the most widely used topologies in Warren's probable target DC client group are fat-tree and various forms of Clos. Based on that, most optimizations are made for these topology types.
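To give a feeling for what "physical extendability" means for one of these topologies, the following minimal sketch computes the size of a standard k-ary fat-tree built from identical k-port switches. The function name and the returned keys are illustrative assumptions, not part of Warren; the formulas are the textbook fat-tree construction and say nothing about how Warren itself models topology.

```python
def fat_tree_capacity(k: int) -> dict:
    """Sizes of a k-ary fat-tree built from identical k-port switches.

    Hypothetical helper for illustration only: k pods, each with k/2 edge
    and k/2 aggregation switches, (k/2)^2 core switches, k^3/4 hosts.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
        "hosts": k * half * half,  # k^3 / 4
    }
```

Because the host count grows with the cube of the switch port count, extending such a DC usually means adding whole pods, which is the kind of step the automated discovery and deployment process has to cope with.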

Services Offered by DC and Nature of End-User Applications

Services offered by the DC (emphasis on RAM vs CPU vs local or distributed storage), and thus the resource needs of end-user applications, need to be discussed and understood. Like the network-related questions, this point depends on quantity and volume, and it deserves careful consideration and planning so that suggestions on hardware needs can be made.

Data Traffic Volumes Between DC Hardware Racks

Data traffic volumes between hardware devices are measured to define the efficiency of Warren's service system. If a larger DC installation is considered, traffic between silos is measured as well. The larger the data flow between hardware devices, the bigger a problem it tends to be.


Warren components placement in DC

Based on the network and storage requirements, several issues can be predicted that stem from poorly planned placement of Warren control plane components:

  1. Network congestion or TCP incast in top-of-rack switches.
    TODO: REASONS
    One possible solution (in the case of 2 physical nodes for control) is to host them in different racks. If the problems above occur in one rack, all control components can then be recovered on the node in the other rack. It is also wise to keep hypervisors in a separate rack, because of the type and nature of their network traffic and because of the danger of out-of-control virtualized resources, whether caused by poor configuration or by the applications they host.
  2. Low availability markers
    Depending on the network topology of the DC, the reasons in item 1, but also hardware failures, may cause unsatisfactory MTBF and MTTR. For example, if both control nodes are placed in one rack, a rack-level hardware failure causes at least a short total downtime during the diagnostics/repair process. That is of course not the case when Warren is used more heavily than in a 3-node tryout.
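The rack-level failure argument in item 2 can be made concrete with a small sketch. It uses the common steady-state relation availability = MTBF / (MTBF + MTTR) and assumes, purely for illustration, that racks fail independently and that the control plane stays up as long as at least one rack hosting a control node is up. The function names are hypothetical and the model ignores node-level failures and recovery time.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def control_plane_availability(rack_availability: float, racks_used: int) -> float:
    """Probability that at least one rack hosting a control node is up.

    Assumption (not from the source): rack failures are independent.
    """
    if not 0.0 <= rack_availability <= 1.0:
        raise ValueError("rack_availability must be between 0 and 1")
    if racks_used < 1:
        raise ValueError("racks_used must be at least 1")
    return 1.0 - (1.0 - rack_availability) ** racks_used


# Placeholder figures, chosen only to make the arithmetic easy to follow.
rack = availability(mtbf_hours=999.0, mttr_hours=1.0)
same_rack = control_plane_availability(rack, racks_used=1)
split_racks = control_plane_availability(rack, racks_used=2)
```

With both control nodes in one rack the control plane is only as available as that rack, while splitting them across two racks multiplies the failure probabilities instead.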

On the other hand, keeping such a level of separation between nodes certainly increases in-DC traffic between racks. So there are no absolute rules for component placement; it depends on the already existing setup, the nature of the provided services and the median/peak traffic levels in the racks. It's a two-fold problem: first there is the traffic generated by applications and end-users, and secondly the traffic generated by Warren as a management system.

The goal of Warren is to reallocate resources to minimize internal DC traffic. In rare cases this can destabilise the network flow for a short period of time. In cases where client application related data flow is causing problems, system management related data flow must always take priority, even if client throughput decreases further as a result. The purpose is to restore the system's previous state or to maximise the efficiency of the limited resources available at a given point in time.
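As a rough illustration of the quantity being minimized, the sketch below sums the bytes of all flows whose endpoints are placed in different racks. The flow tuples, VM names and rack names are invented for the example, and this is not Warren's reallocation algorithm, only a way to compare two candidate placements.

```python
from typing import Dict, Iterable, Tuple

# (source VM, destination VM, bytes transferred) - a hypothetical flow record
Flow = Tuple[str, str, int]


def inter_rack_bytes(flows: Iterable[Flow], vm_rack: Dict[str, str]) -> int:
    """Sum the bytes of all flows whose two endpoints sit in different racks."""
    return sum(size for src, dst, size in flows if vm_rack[src] != vm_rack[dst])


# Placeholder data: three VMs exchanging traffic, placed in two racks.
flows = [("vm-a", "vm-b", 500), ("vm-a", "vm-c", 100), ("vm-b", "vm-c", 50)]
before = {"vm-a": "rack-1", "vm-b": "rack-2", "vm-c": "rack-1"}
after = {"vm-a": "rack-1", "vm-b": "rack-1", "vm-c": "rack-2"}

print(inter_rack_bytes(flows, before))  # 550
print(inter_rack_bytes(flows, after))   # 150
```

Moving the two VMs with the heaviest mutual flow into the same rack lowers the inter-rack volume, at the price of the migration traffic itself, which is one reason the network flow can be briefly destabilised.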

Pre-Existing DC SDN Solution

In general, all SDN systems are based on the same principles and are, for the most part, derived from two prevalent frameworks for SDN generation. There are several protocols for network device configuration (OpenFlow is still the most dominant one). Almost all the necessary routing protocols are supported by all major SDN solutions. This means that any pre-existing SDN solution the DC already uses should not cause drastic adoption or installation problems at the connection level, which does not mean that integration is a trivial task!

However, there is an exception to that hypothetical balance - the security domain. All SDN systems implement one or more security domains, whether client-level or system-wide. Configuring 2 or more SDN systems to cooperate simultaneously in that domain might be more time-consuming than configuring the whole system to adopt a new one.

Storage domain

Warren's storage domain is based on distributed storage. In this chapter we explain why we have chosen it and how it compares to other popular solutions.

Storage holds the most valuable information - client application data - so its impact on reliability and Quality of Service is immense. In comparison, a network outage only affects the availability of data, whereas storage problems can lead to permanent data loss.

Distributed Storage

Pros: reliable, multi-functional, scalable
Cons: expensive

The cost of distributed storage (which may also be shared-distributed) comes from the fact that it is usually (not always - one exception is HCI) implemented as one or more separate clusters. So there are three main types of cost and an additional, optional one:

  • Upfront cost - the devices themselves, including a dedicated network for storage (fixed cost)

  • Repair/management costs + cost of space (fixed over a long period of time)

  • Energy cost (usually fixed over a long period of time, with seasonal spikes)

  • Optional license cost when a commercial distributed storage system is used

When these are summed up and divided into monthly payments over a period that equals the service life of the server units, the energy cost is by far the highest. So although it may seem wiser to reuse old server hardware for storage clusters, it is actually not. It is much wiser to buy new, specially configured, low power consumption hardware that may even come with a distributed storage system installed and configured. Such specially configured devices offer another benefit - fast cluster extendability.
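The monthly comparison described above can be sketched as follows. All parameter names are assumptions made for this example, and the function only performs the division into monthly amounts; it does not contain any real price or power figures.

```python
def monthly_storage_costs(
    upfront: float,
    yearly_upkeep: float,
    avg_power_kw: float,
    price_per_kwh: float,
    service_life_years: float,
    yearly_license: float = 0.0,
) -> dict:
    """Break the cost types of a storage cluster down into monthly amounts.

    Hypothetical helper: the upfront cost is spread evenly over the service
    life, and energy is estimated from the average power draw.
    """
    if service_life_years <= 0:
        raise ValueError("service_life_years must be positive")
    months = service_life_years * 12
    hours_per_month = 730.0  # 8760 hours per year / 12 months
    return {
        "upfront": upfront / months,
        "upkeep": yearly_upkeep / 12,
        "energy": avg_power_kw * hours_per_month * price_per_kwh,
        "license": yearly_license / 12,
    }
```

Calling it with a DC's own figures and taking `max(costs, key=costs.get)` on the result shows which cost type dominates for that installation, and makes it easy to compare reused hardware against new low-power hardware.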

A typical distributed storage solution implements both object and block storage. When these are implemented as separate clusters, object storage can also serve as a base for backup or disaster recovery implementations, in addition to its main purpose.

The reliability advantage compared to shared or local storage should be obvious.

Shared Storage

Pros: cheaper, faster
Cons: half-baked reliability

If the infrastructure includes a direct-attached storage unit used as a shared storage solution, there is a high chance that the vendor has included software that operates the device. There may even be a distributed solution working in this unit, but it must be kept in mind that this kind of storage is distributed within the device itself. If the storage device fails, all the data is still unreachable - the data protection works at the disk level.

To raise the protected sphere to the rack level, several such storage units must be placed in one rack. Now, if the infrastructure contains more than 2 racks (which should be the normal case for a DC), why are these units not separated from the compute units to form an autonomous distributed storage cluster? One answer might be performance. As off-the-shelf storage units usually include “real RAID” controllers (with a dedicated CPU and cache) and the connection to compute units is direct (not over the network), the performance may be significantly higher than distributed storage could offer.

Local Storage

Pros: cheapest
Cons: not fastest

Nowadays, the cost per TB of a single disk is very low compared to the same capacity implemented in the form of an advanced storage device. However fast the single disk might be, it cannot compare with a direct-attached, performance-tuned shared storage system.

Arguments that local storage is less expensive in terms of network resources than the two other options above are not exactly correct if you value the data on those disks. To be prepared for hardware failures, one has to back up the data constantly, and a backup is meaningless if it is not stored outside the machine/cluster. This doesn't mean that local disks placed in servers cannot be used. There are many uses that need caching or swapping, and local storage is a perfect fit for such needs.

