Goals of this case study

  • To specify the information that needs to be exchanged between Warren and the client Data Center (DC) in order to decide whether the DC's requirements can be met with the current Warren feature set.

  • To extract that data while taking into account the wide variety of options that could ultimately satisfy the DC's main goal.

  • For the Warren side, to group the required features by demand and development effort, so that development delivers the best business value to the largest number of clients.


Introduction

All the information Warren needs from the DC side can and should be requested as a set of simple, unambiguous questions. These questions, in turn, can be divided roughly into three subsets, based on the goal they are meant to achieve. Each subset can be expressed as a more general "umbrella question":

  1. What infrastructure and software components are currently in use, and what role do they play in the plans for the future?

  2. What services does the DC offer, and what does it expect from Warren?

  3. How committed is the DC to cooperating with Warren, and which third-party software systems will run alongside Warren?

If both sides clarify these topics, the ambiguity and misunderstanding around the decisive factors that form the backbone of successful cooperation should be minimized.

These questions are vital to gather the information that affects the following topics in Warren development:

    1.  Architectural decisions:
      1. Which external libraries, components, and standards should be used to cope with the requirements of the majority of DCs in the target group?
      2. How should functionality be designed in the component systems so that we provide the value we claim to offer, without degrading the quality of the services and processes that existed before Warren adoption?
    2.  Marketing content and business value:
      1. Can we actually offer the functionality we are claiming to offer?
      2. Can we offer the functionality at a sufficient level of reliability in a particular service domain of a DC?
      3. Are we doing it in a sensible way, i.e. is the development effort proportionate to the actual value the result provides?
      4. Do all the features and functionality we provide, or will provide, match the actual requirements?
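
The effort-versus-value question above, together with the goal of grouping features by demand and development effort, can be sketched as a simple ranking. This is only an illustration: the `FeatureRequest` type, the feature names, and all numbers are hypothetical, and "demand per unit of effort" is just one possible scoring rule, not a method stated by Warren.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRequest:
    """A feature gathered from DC questionnaires (all fields are illustrative)."""
    name: str
    requesting_dcs: int      # how many client DCs asked for it (demand)
    effort_person_days: int  # rough development estimate, must be > 0


def rank_by_value(features):
    """Order features by demand per unit of effort, highest first."""
    return sorted(
        features,
        key=lambda f: f.requesting_dcs / f.effort_person_days,
        reverse=True,
    )


# Hypothetical input data.
features = [
    FeatureRequest("shared-storage support", requesting_dcs=6, effort_person_days=30),
    FeatureRequest("second SDN integration", requesting_dcs=2, effort_person_days=60),
    FeatureRequest("bare-metal provisioning", requesting_dcs=4, effort_person_days=40),
]

for feature in rank_by_value(features):
    print(feature.name)
```

A ratio is used so that a cheap feature wanted by many DCs ranks above an expensive feature wanted by few; a real prioritization would likely weigh additional factors.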


Conflicting nature in service requirements

To enhance the analysis result and make it directly usable as an input to the development process, let's partition the hypothetical DC stack (hardware, firmware, software) into functional domains that share common properties with the corresponding Warren components. Two domains of the DC functional stack that exist before Warren adoption are more influential than the others, both for future development and for the adoption process: Network and Storage. They are also tightly coupled, as decisions in one domain depend heavily on the properties of the other. The connection between these two domains shows up best in the decision-making process, as two fundamental trade-offs: availability vs locality, and software- vs hardware-defined domains of control.

Availability vs Locality

This is the biggest trade-off in multi-site computing, which is why distributed cache and storage are both good and evil at the same time.

  1. Availability in this context denotes:
    1. Spatial - data or a service is concurrently available to recipients/consumers in different locations rather than just one (for example, many virtual machines using the same database that resides in distributed storage)
    2. Temporal continuity - data or a service is kept available even in case of software or hardware failures ("High Availability")

    Both of these aspects may seem very desirable, especially in cloud computing, but the downside is delivery speed in various forms. For example, distributed storage without high-end hardware may not offer low enough latency for latency-sensitive applications. Also, keeping the application availability rate high involves software redundancy and several levels of hardware redundancy, which means buying additional devices and keeping them constantly running.
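
To make the redundancy cost concrete, the sketch below estimates how many redundant copies of a component are needed to reach a target availability. It assumes that failures are independent, which real hardware sharing a rack or a power feed often violates, so the result is an optimistic lower bound. The function names and the figures in the usage line are illustrative.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies, assuming independent failures.

    The group is down only when every copy is down at the same time.
    """
    return 1.0 - (1.0 - single) ** replicas


def replicas_needed(single: float, target: float) -> int:
    """Smallest replica count whose combined availability reaches `target`."""
    if not (0.0 < single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availabilities must be strictly between 0 and 1")
    replicas = 1
    while combined_availability(single, replicas) < target:
        replicas += 1
    return replicas


# Hypothetical figures: devices that are each up 90% of the time, 99.5% wanted.
print(replicas_needed(single=0.9, target=0.995))
```

Every additional replica is another device to buy and keep running, which is the cost side of the trade-off described above.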

  2. Locality denotes the physical distance of a functional domain from the compute resource (local storage vs distributed storage)

    While High Availability is achieved by involving distributed and redundant resources, locality is not free from redundancy cost either; however, that redundancy is usually at the sub-server level and so is less expensive. Local storage also has much lower latency, but its total capacity is very limited. And because data on this storage is not available to other devices without additional control and services, it creates an additional need for data duplication beyond the one meant for "High Availability".

Software- vs Hardware-defined domains of control

This can mostly be described as: 

  1. SD* - slow, but flexible, automatically reconfigurable and easily portable
  2. HD* - high speed, low portability, automatic configuration is limited or impossible

The general tendency is towards the concept of a “software-defined DC”, largely because of the automation and management benefits it offers. The exception to this tendency is the popularity of bare-metal provisioning, which could be explained by the still-existing demand for direct control over hardware (required by some types of applications), for independence from general software-system failures, and for speed.

The responsibility borderline between the administration and the support

The system administrator's role depends on the size of the company, the nature and complexity of the system itself, regional peculiarities, the job description, and many other factors. However, there is usually a strict distinction between the roles of system administrator and system support personnel. The latter, in turn, differ from third-party support software. Where exactly that borderline is drawn varies, and that is the reason it should be discussed.

This is one of the topics that must be addressed thoroughly before a final agreement on Warren adoption. It is a mistake to assume that supportability is a second-rate matter that can be addressed after a production system is up and running. As a matter of fact, it is such an important part of service provisioning that it deserves its own chapter in the development documentation! Complex software systems must be developed with efficient observability and support in mind.


Considerations in the network domain

There are several factors in a DC's network setup that dictate what we need to think through in the Warren application development process. Such factors include:

Network topology (tree, Clos, fat-tree, torus, etc.)

This aspect defines the network traffic between components, servers, and racks, as well as between the DC and the internet. It also sets the DC's physical extendability properties; thus, we need to consider:

  • How the automated discovery process will be handled

  • What deployment schema to use when implementing new nodes

  • Which components are involved in such processes

  • How unsuccessful outcomes of such processes will be handled.

Obviously, we cannot fine-tune our setup for every topology type: topology is not a standalone factor, so the set of variables in such an analysis is large, and the analysis is too costly compared to the business value of the outcome. But we can target a solution that covers the topologies most used in DCs with a sufficient degree of quality. Service reliability and availability metrics cannot be calculated purely theoretically for a platform that is under heavy development; they will instead be deduced from the DC adoption process. The current assumption is that the most widely used topologies in the probable target DC group are fat-tree and various forms of Clos. Based on that, most optimizations are made for these two topology types.
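
As a rough aid for the extendability question, the sketch below sizes a k-ary fat-tree, the textbook construction built entirely from identical k-port switches. Whether a given DC follows this exact construction is an assumption, and the function name `fat_tree_size` is illustrative.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts of a k-ary fat-tree built from k-port switches.

    There are k pods. Each pod holds k/2 edge and k/2 aggregation switches,
    and each edge switch connects k/2 hosts. (k/2)^2 core switches join the pods.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
        "hosts": k * half * half,  # equals k^3 / 4
    }


# Compare the capacity of two switch port counts.
for ports in (4, 48):
    print(ports, fat_tree_size(ports))
```

Because the host count grows with the cube of the port count, the switch model chosen at the start largely fixes how far such a network can be extended without a redesign, which is why discovery and node deployment need to be planned against the topology.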

Nature of applications and services offered by DC

Although both this point and the next seem trivial compared to real problem magnets like network topology, adopting an SDN solution, or, better yet, consolidating different SDN solutions, this has become a major issue in public clouds (and presumably also in private ones, where such issues usually do not materialize as a series of scientific papers). Like almost all network-related considerations (except perhaps SDN), this one is also quantity-dependent.

In-DC traffic amount between racks

The larger the data flows between hardware devices, the bigger the problem tends to be. This traffic (and also in-DC traffic between silos, if a larger DC is under consideration) is what measures the efficiency of the service system (Warren). The problem is two-fold: first, there is the traffic generated by the clients; second, the traffic generated by Warren as a management system. The goal of Warren is to reallocate resources to minimize in-DC traffic, and in rare cases it can, by doing so, destabilize the network flow for a short period of time. Management flow must always take precedence when client flow is causing problems, even if that decreases client throughput further, because its purpose is to restore the previous state, or at least to maximize efficiency with the currently limited amount of available resources.
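
The precedence rule can be illustrated with a strict-priority queue in which management messages are always served before client messages. This is a minimal sketch of the ordering rule only, not of how Warren or any switch actually schedules traffic; the class name, the constants, and the payload strings are hypothetical.

```python
import heapq
import itertools

MANAGEMENT, CLIENT = 0, 1  # a lower number is served first


class FlowQueue:
    """Strict-priority queue: management traffic always leaves before client traffic."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a class

    def push(self, traffic_class, payload):
        heapq.heappush(self._heap, (traffic_class, next(self._order), payload))

    def pop(self):
        _, _, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self):
        return len(self._heap)


queue = FlowQueue()
queue.push(CLIENT, "vm-42 backup chunk")
queue.push(MANAGEMENT, "migrate vm-17 to rack B")
queue.push(CLIENT, "vm-7 db replication")

drained = [queue.pop() for _ in range(len(queue))]
print(drained)
```

Strict priority can starve client traffic while management messages keep arriving, which matches the statement above that client throughput may decrease further during recovery.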

Existing SDN solution

In general, all SDN systems are based on the same principles and are, for the most part, derived from two prevalent frameworks for SDN generation. There are several types of protocols for network device configuration, among which OpenFlow is still the most dominant. Almost all of the needed routing protocols are also supported by all major SDN solutions.

To conclude the above, no drastic problems should arise at the connection level (which doesn't mean it's a trivial task!). However, there is an exception to that hypothetical balance: the security domain. All SDN systems implement one or more security domains, whether at the client level or system-wide. Configuring two or more SDN systems to cooperate simultaneously in that domain might be more time-consuming than configuring the whole system to adopt a new one.


Considerations in the storage domain

Warren storage domain consists of three options:

  • Distributed storage 

  • Shared storage 

  • Local storage

To determine the right solution, one must consider several factors that are required to implement a particular storage type. As storage holds the most valuable part, client data, the choice has a direct impact on reliability and QoS. After all, a network outage only affects the availability of data, whereas storage problems may lead to permanent data loss.

Distributed - Expensive, but reliable, multi-functional, and scalable

The cost of distributed storage (which may also be shared-distributed) comes from the fact that it is usually (not always; one exception is HCI) implemented as one or more separate clusters. So there are three main types of cost and an additional, optional one:

  • Upfront cost - the devices themselves, including a dedicated network for storage (fixed cost)

  • Repair/management costs + cost of space (fixed over a long period of time)

  • Energy cost (usually fixed over a long period of time with its seasonal spikes)

  • Optional license cost when a commercial distributed storage system is used

When these are summed up and divided into monthly payments over a period equal to the service life of the server units, the highest one by far is the energy cost. So although it seems wiser to reuse old server hardware for storage clusters, it is actually not. It is much wiser to buy new, specially configured, low-power hardware that may even come with a distributed storage system installed and configured. Such specially configured devices offer another benefit: fast cluster extendability.
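
The monthly breakdown described above can be sketched as follows. Every figure in the example (prices, wattage, service life, upkeep) is hypothetical and chosen only to show the calculation; the outcome depends entirely on the real values for a given DC.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12


def monthly_cost(upfront, service_life_months, upkeep_per_month,
                 power_watts, price_per_kwh, license_per_month=0.0):
    """Break the cost of one storage node into monthly components."""
    energy = power_watts / 1000 * HOURS_PER_MONTH * price_per_kwh
    parts = {
        "upfront": upfront / service_life_months,  # purchase spread over service life
        "upkeep": upkeep_per_month,                # repair, management, space
        "energy": energy,
        "license": license_per_month,              # optional, commercial systems only
    }
    parts["total"] = sum(parts.values())
    return parts


# Hypothetical figures: a reused old server vs a new low-power storage node.
old_server = monthly_cost(upfront=0, service_life_months=60, upkeep_per_month=10,
                          power_watts=600, price_per_kwh=0.25)
new_node = monthly_cost(upfront=3000, service_life_months=60, upkeep_per_month=10,
                        power_watts=150, price_per_kwh=0.25)

print(f"old: {old_server['total']:.2f}/month, new: {new_node['total']:.2f}/month")
```

With these assumed inputs, the reused server costs more per month despite having no purchase price, because its energy component dominates. With a low enough energy price the comparison would tip the other way.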

A typical distributed storage solution implements both object and block storage. When these are implemented as separate clusters, the object storage can, in addition to its main purpose, also serve as a base for backup or disaster recovery implementations.

The reliability advantage compared to shared or local storage should be obvious.

Shared - cheaper, faster, half-baked reliability

If the infrastructure includes a direct-attached storage unit used as a shared storage solution, there is a high chance that the vendor has included software that operates the device. There may even be a distributed solution working in this unit, but it must be kept in mind that this kind of storage is distributed only within the device itself. If the storage device fails, all the data is still unreachable, because the data protection works only at the disk level.

To raise the protected sphere to the rack level, several such storage units must be placed in one rack. Now, if the infrastructure contains more than two racks (which should be the normal case for a DC), why are the storage units not separated from the compute units to form an autonomous distributed storage cluster? One answer might be performance. As off-the-shelf storage units usually include “real RAID” controllers (with their own CPU and cache) and the connection to the compute units is direct (not over the network), the performance may be significantly higher than what distributed storage could offer.

Local - cheapest, not necessarily fastest

Nowadays, the cost per TB of a single disk is very low compared to the same capacity in the form of an advanced storage device. However fast a single disk might be, it cannot compete with a direct-attached, performance-tuned shared storage system.

Arguments that local storage costs less in network resources than the two other options above are not exactly correct if you value the data on those disks. To be prepared for hardware failures, one has to back up the data constantly, and a backup is meaningless if it is not stored outside the machine/cluster. This doesn't mean that local disks already placed in servers cannot be used. Many workloads need caching or swapping, and local storage is a perfect fit for such needs.


Warren components placement in DC

Based on the network and storage requirements, several issues can be predicted that stem from poorly planned placement of Warren control plane components, such as:

  1. Network congestion or TCP incast in rack-top switches.
    One possible solution (in the case of two physical control nodes) is to host them in different racks. If the problems above occur in one rack, this leaves the ability to recover all control components on the node in the other rack. It is also wise to keep hypervisors in a separate rack, due to the type and nature of their network traffic but also because of the danger of out-of-control virtualized resources, whether caused by poor configuration or by the applications they host.

  2. Low availability metrics
    Depending on the network topology of the DC, the causes listed in item 1, but also hardware failures, may lead to unsatisfactory MTBF and MTTR. For example, if both control nodes are placed in one rack, a rack-level hardware failure causes at least a short total downtime during the diagnostics/repair process. That is, of course, not the case when Warren is used more heavily than in a three-node tryout.

On the other hand, keeping such a level of separation between nodes certainly increases in-DC traffic between racks. So there are no absolute rules for component placement; rather, it depends on the existing setup, the nature of the provided services, and the median/peak traffic levels in the racks.
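
The effect of rack placement on availability can be estimated with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below compares two control nodes in one rack against two nodes in separate racks. It assumes independent failures and that the control plane is up as long as one node is reachable; all MTBF and MTTR figures and the function names are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def control_plane_availability(node: float, rack: float, same_rack: bool) -> float:
    """Availability of a two-node control plane that needs at least one node up."""
    if same_rack:
        # The rack is a shared single point of failure for both nodes.
        return rack * (1 - (1 - node) ** 2)
    # Each node sits behind its own rack; both paths must fail to lose control.
    return 1 - (1 - rack * node) ** 2


# Hypothetical figures: nodes fail about every 990 h, racks about every 9990 h.
node = availability(mtbf_hours=990, mttr_hours=10)
rack = availability(mtbf_hours=9990, mttr_hours=10)

shared = control_plane_availability(node, rack, same_rack=True)
split = control_plane_availability(node, rack, same_rack=False)
print(f"same rack: {shared:.6f}, separate racks: {split:.6f}")
```

The model ignores the extra inter-rack traffic that separation causes, so it captures only the availability side of the placement decision.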

