Will FaaS mean the end of servers?

A few years ago there were many articles about how containers would mean the end of servers.   From a technical standpoint Function as a Service (FaaS) and containers both run on servers.   So simple answer no it does not mean the end of servers.   I have seen a lot of rumbling around FaaS of late.   Those who have heard me speak on automation know I am all about functions, modular blocks and FaaS, as long as you can keep the servers secure and safe from internal and outside damage. Most outside damage comes from room vibrations which can be easily monitored by SpotSee.   We do need to break code down to simplest terms to encourage innovation and re-use.   FaaS has a place in your overall design.   Application design continues to pivot away from monolithic design to more micro-service models.   FaaS is part of that pie.   When considering any of these strategies the same overall design challenges exist:

  • Data persistence
  • Data gravity
  • Security

Data persistence:

No matter how stateless your environment, sooner or later data is involved.  There are some exceptions, but they are rare.   The internet runs on data.  The real value is identifying you as a user and selling that data in bulk, not the $0.99 you paid for the app.    Applications exist to do something and then keep state, or to record your reactions; either way, the data needs to be stored.  FaaS is stateless.  So somewhere in the pie we need state: something to orchestrate the next step and provide the value to the user and the developer.  Where you store this data depends on the application; whether it's a simple text file or a shared-nothing database, someone is keeping the data.   Let's just be honest that 90% of the world still lives on a relational database (Oracle, MS-SQL, MySQL), with a small portion using a shared-nothing database (Cassandra, etc.).  This persistence layer has all the same concerns as any other non-immutable infrastructure.   If you lose all your copies, you lose data.   Even with every function of an application as a FaaS, you still need a database.   The challenge of persistence means you have to live in both worlds, persistent and non-persistent.  It's important to consider the manageability of both these worlds when you consider implementing new technologies.
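To make the point concrete, here is a minimal sketch of a stateless function that must lean on an external store for every bit of state. The handler shape and names are mine (not any specific FaaS provider's API), and sqlite stands in for whatever persistence layer you actually run, from a text file to Cassandra:

```python
import sqlite3

def handler(event, db_path="orders.db"):
    """Hypothetical FaaS handler: the function keeps nothing between
    invocations, so all state lives in the external database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, total REAL)"
    )
    # The only memory of past invocations is what the database holds.
    conn.execute(
        "INSERT OR REPLACE INTO orders VALUES (?, ?)",
        (event["order_id"], event["total"]),
    )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    conn.close()
    return {"status": "stored", "orders_seen": count}
```

Run the function twice and the second invocation only "remembers" the first because the database does; kill the function between calls and nothing is lost, which is exactly the division of labor between the stateless and persistent worlds.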


Data gravity:

The idea of FaaS or stateless design is that I can deploy anywhere.  While this is technically true, you want your application/functions to be close to the persistent data to preserve performance.   That means you either need to replicate data in real time between everywhere you want to operate, or operate in the same locality as your stateless functions.   Shared-nothing databases have massive concerns with write amplification; confirming a write across long distances introduces unacceptable latency into every write.   Sharding of these databases is touted as the solution, using synchronous writes in the same location for redundancy.  Sharding is possible, but it's complex, and you still have latency when the data needed is not local.   Now we have created an M.C. Escher puzzle with our application architecture.   Gravity of data will continue to drive location more than the features or functionality of a location.   It's an instant world, and no one is going to wait for anything anymore.
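Some back-of-the-envelope math shows why distance dominates. The numbers below are illustrative round-trip times I picked, not benchmarks; the model is simply that a synchronous write cannot be acknowledged until the slowest replica confirms:

```python
def sync_write_latency_ms(local_disk_ms, replica_rtts_ms):
    """Write ack time = local commit + round trip to the farthest
    synchronous replica (illustrative model, assumed numbers)."""
    return local_disk_ms + max(replica_rtts_ms, default=0.0)

# Replica in the same datacenter vs. a coast-to-coast replica (~70 ms RTT).
same_dc  = sync_write_latency_ms(0.5, [0.3])
cross_us = sync_write_latency_ms(0.5, [70.0])
print(same_dc, cross_us)  # roughly 0.8 ms vs 70.5 ms per write
```

Every single write pays that penalty, which is why the functions end up moving to the data rather than the other way around.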



Security:

While not as interesting as the bling of FaaS, security is a real concern.  Unless you plan on running your FaaS inside your private datacenter, it's a concern.   Your functions hold data in memory to do their work, and the function is running on a server.  Like all multi-tenant situations, how do we avoid having a bad or untrusted actor access our data in flight?   Anyone who has worked at a multi-tenant provider understands this challenge.   Cloud providers have long wrapped containers in lightweight isolation layers to keep tenants apart (instead of shared worker nodes).  I personally don't know what measures providers have taken to isolate FaaS offerings, but you do have to consider how you will ensure there is not a hacker running a buffer overflow and reading your function's memory.


At the end of the day, what is old is new and what is new is old.   FaaS, containers, virtual machines, physical servers, laptops and phones all have the same fundamental application challenges.  These all provide options.   You may be considering a FaaS strategy for many reasons.  My point is: don't ignore good design principles just because it's new technology.

Does understanding the cost of IT really matter?

Welcome to my new form of click-bait titles.   I have been thinking about this for a while.   I get to see a lot of different enterprise environments as a solution architect for VMware.   It's been great to experience all these customers' challenges and help them on their journey.   Years ago I was very focused on helping organizations understand the cost of IT.   It was very important to me to identify the total cost of resources.   IT has long been able to quantify capital expenses due to hardware costs: all you have to do is take your bill of materials and divide it by the logical resource element (GBs of storage, RAM, etc.).  Operating expenses have long plagued IT organizations because staff normally multi-task and dislike tracking individual work.   This whole exercise was supposed to give you a seat at the business table by talking the language of business (money).  I have written this article to argue why I believe cost justification means you are losing the battle with your business.

Why does cost justification mean you are losing the battle?

When I first started in IT I was in charge of personal computers for a law firm.   Every three years we had a bake-off between major PC vendors where they would parade their latest prize pig in front of us.   In the end the winner was determined by who was willing to go the lowest on price.   There was no loyalty, there was only price.   There was no value add.   Then a strange thing happened: my customers became interested in PC features which were reflected in aesthetic preference.   They didn't care about the hardware specifications.

They wanted MacBooks and clean desks that all matched:

They wanted something that looked nice and was a status symbol.   They didn't care about price.  Suddenly management no longer cared about price; they wanted the MacBook.   I did a justification based on the cost of IT repairs on Macs in a last-ditch effort to head off the Apple invasion.   I didn't win.

Quite simply put, there is always going to be another prize pig that will beat you on price, either by reducing features or by buying at an economy of scale that you cannot.   So when IT is 100% based on cost justification, you are always going to lose.   Please don't assume I am saying cost does not matter; it does.  But it cannot be the only factor in a choice, because the farm is world-wide now.   When your business is left with a pure cost justification between private and public cloud, you might lose.

Perception is reality

MacBooks are better, right?  Well, it depends on who you ask.   Perception determines reality.   My customers perceived that MacBooks are better because all the cool kids were using them.   The truth is the hardware is pretty good but expensive compared to other x86 machines.   Most IT professionals love technology.   They cannot wait to be on the bleeding edge.   Infrastructure, however, has become on the whole very risk averse for a number of reasons:

  • When something fails they are the first to be called
  • Mountains of technical debt
  • Weak development unit testing
  • Sheer sprawl of divergence in the environment

This risk aversion has made infrastructure people the old man of technology yelling "get off my lawn" while all the cool kids with MacBooks text "get off my lawn" right back.   It's time for infrastructure to return to their roots and embrace change once again or be left behind.   It's a software-defined world: learn how to use an API today.  Learn about the cloud providers, because they are part of your future.

Digital Transformation

Digital is thrown around too much; it's the cloud of the 2000's (not to be confused with the cloud of the 2010's).   I believe digital transformation is defined as IT aligning with the business.   If you start your alignment by assigning a bill or trying to use cost as leverage, you cannot assume it will help the relationship.   The key element of any relationship is not power manipulation (cost); it's achieving mutual shared goals.   It's working together to solve problems as a single entity.  At VMworld 2018 I spoke along with Craig Fletcher about this relationship, and we compared it to a marriage.   I suggest that a lot of infrastructure teams are heading for a divorce.  You, IT, have to take the first step to fix the relationship, because if you don't, the business will go find another suitor (can you spell AWS?).    Allow me to suggest four things to consider in your digital journey:

  • Spend more time understanding whether your efforts impact revenue
  • Create a state of IT anonymous survey
  • Create cross functional teams to address business challenges
  • Embrace change and cross cloud capabilities


How can you make IT agile?

Every single day I hear the new magic word from IT groups: I need improved agility.   It reminds me of how people talked about going to the cloud.   Agility is a capability, not a destination.  It's hard to measure, which is the first challenge.  I believe when customers ask for agility they are actually asking for business relevance.  If you are closely aligned with the business, you should be able to respond to the business as needed.

In order to illustrate my point I am going to use a story from my childhood.   As a young child of eight I used to play the video game Test Drive.

This early computer car simulator allowed you to drive very high-end sports cars in exotic locations using your keyboard.     One day my father was watching me play.   I was frustrated that I could not beat my computer opponents in the race.  My father wisely said, "Son, the problem is you wait until you are already in the turn to begin to turn.  As a driver, I turn long before I get into the turn."  This wise counsel has stuck with me in life.   If you don't start to turn gradually in advance, then you have to slow down to make the turn.  In my case I was having to brake down to 10 kph in order to make the turn.    I was being reactive to the turn instead of proactive.

I think this illustrates a common challenge with IT.   IT is focused on building the best sports car, assuming that if the car is fast enough they will be able to meet the business needs.    Without visibility into the business, how can your sports car make the turns without a massive slow-down?   I firmly believe that change is constant and to be expected, even more so in IT.   So the challenge is: how do we slow down building a shiny sports car and still maintain velocity?


Signs of IT focus on sports car

How can you tell that an organization has been too focused on building the best sports car?   I suggest the following may be signs of a problem:

  • Business wants IT to cut costs year after year while the business is growing
  • Digital initiatives creating a bi-modal IT (Leaving traditional IT behind)


How does IT become business focused?

Focusing on the business has been a challenge for traditional IT.   They have normally been buffered away from the business units by development.   Development continues to add value to the business by changing to meet its needs.   They have the ability to change with velocity because they talk to the business.   In order to solve this issue, traditional IT needs a business focus.   Here are some suggestions:

  • Read your company's 10-K and understand what is important to the C-level
  • Spend time talking to development and the business to understand how each project affects revenue – once you understand the revenue potential of a project, use that data to market your impact
  • Get a real understanding of cost – you need to understand the CapEx and OpEx cost of projects and actions so you can project them to the business
  • Start to track SLAs and report on them
  • Track other critical metrics and report on them
  • Marketing of your service is critical – no one cares about your shiny car; they want to know your capabilities aligned with the business

Taking these actions will gradually put you in a place where the business includes you in the discussion.   Your role, once aligned with the business, is to say yes to revenue projects and guide them into cost-effective IT solutions.

Never use brakes

When I first started driving at 16, I would accelerate quickly on the freeway and brake a lot.  My mother suggested that I was driving incorrectly.  Her father had a simple driving goal: "You should never have to use your brakes.  You should anticipate slow-downs in advance and only use brakes when you come to a stop."  My grandfather never drove in Paris, Italy, or New York!   His advice aligns with the goal of understanding the business: once aligned, we should not have to come to a complete stop; instead we gradually adjust to meet needs.

Hard work

Being aligned with the business is a lot of work.   It requires that you build IT as a service instead of reactive IT.  Proactive IT:

  • Has a road map plan for the next 18 months (but allows for turns)
  • Has robust historical metrics around business critical metrics and cost to make business informed choices
  • Understands how every project aligns to revenue
  • Spends more time planning than implementing
  • Has robust standards for the service and aligns to them

Let me know what you think… am I up in the night?

Should IT build a castle or a mobile home?

So I have many hobbies to keep my mind busy during idle times, like when driving a car.   One of my favorite hobbies is to identify the best candidate locations to live in if the Zombie apocalypse were to happen.   As I drive between locations I see many different buildings, and I attempt to rate large buildings by their Zombie-proof nature.   There are many things to consider in the perfect Zombie defense location, for example:

  • Avoiding buildings with large amounts of windows or first floor windows
  • Buildings made of materials that cannot be bludgeoned open, for example stone
  • More than one exit but not too many exits
  • A location that can be defended on all sides and allows visible approach

There are many other considerations, like proximity to water and food, but basically I am looking for the modern equivalent of a castle.

OK, what does this have to do with IT?

Traditional infrastructure is architected like a castle: its primary goal is to secure the perimeter and be very imposing to keep people out.   During a zombie attack this model is great, until they get in; then it becomes a graveyard.   IT architects, myself included, spend a lot of time considering all the factors required to build the perfect castle.   There are considerations like:

  • Availability
  • Recoverability
  • Manageability
  • Performance
  • Security

All of these have to be considered, and as you add another wing to your castle, every one of these elements of design must be reconsidered for the whole castle.  We cannot add a new wing that bridges the moat without extending the moat.   Our drive to build the perfect castle has created a monolithic drag.   While development teams move from annual releases to quarters or weeks or days, we continue to attempt to control the world from a perimeter-design perspective.   If we could identify all possible additions to the castle at the beginning, we could potentially account for them.   This was true in the castle days: there were only so many ways to get into the castle and so many methods to break in.    Even worse, the castle provided lots of nooks and crannies for zombies to hide in and attack when you least expect it.  This is the challenge with the Zombie attack: they don't follow the rules.  They just might create a ladder out of zombie bodies and get into your castle (World War Z style).   If we compare zombies to the challenges being thrown at IT today, the story holds.    How do we deal with constant change and the unknown?   How do we become agile to change?   Is it by building a better castle?

Introducing the mobile home


Today I realized that the perfect solution to my Zombie question is the mobile home.   We can all assume that I need a place to sleep, something that I can secure with reasonable assurance.   I can reinforce the walls and windows of a mobile home, and I gain something I don't have with a castle: mobility.  I can move my secured location and goods to new locations.  My mobile home is large enough to provide for my needs without providing too many places for zombies to hide.  IT needs this type of mobility.   Cloud has provided faster time to market for many enterprises, but in reality you are only renting space in someone else's castle.    There are all types of methods to secure your valuables from mine, but in reality we are at the mercy of the castle owner.   What if my service could become a secured mobile home?  That would provide the agility I need in the long run.   The roach motel is alive and well in cloud providers today: many providers have no cross-provider capabilities, while others provide tools to transform the data between formats.   My mobile home needs to be secure and not be reconfigured each time I move between locations looking for resources or avoiding attack.   We need to reconsider IT as a secured mobile home and start to build this model.   Some functions to consider in my mobile home:

  • Small enough to provide the required functions (bathroom, kitchen and sleeping space or in IT terms business value) and not an inch larger than required
  • Self-contained security that encircles the service
  • Mobility without interruption of services

Thanks for reading my rant.  Please feel free to provide your favorite zombie hiding location or your thoughts on the future of IT.


Why I took a pay cut to work at VMware

Warning:  This is a love rant for VMware.  You might want to skip it if you are looking for the normal technical details of my blog.

Great title, eh?  Really catchy, intended to get you to read, and it's true.    Two months ago I left a great job with IBM to work with VMware, and took a pay cut to do it.   When you switch jobs there are lots of reasons, and money is only part of the deal.  You might leave a job for the following reasons:

  • Too much travel
  • Bad situation with management
  • No career growth potential
  • Money
  • A new challenge
  • etc…

The reasons are often a combination of these and other factors.   I left my job as a Senior VMware Architect for IBM to work as a Solutions Architect for VMware.   I wanted to take this blog post to explain why I took the job.   Some years ago I was a happy yet bored systems administrator.   My boss suggested that I attend an industry conference as a perk.   I went to VMworld.   I came back from the conference really excited about the future of VMware and the cloud.    I was invigorated by the energy and the vision of VMware's executives.   I continued to learn about their technology and found it very refreshing in the market.   This led me to refocus my career away from Linux and toward VMware technology.



I have taken two runs at VMware jobs in the past and did not make it.   Each time I was interviewed by multiple people, and each of those people took the interview time to teach me new skills and help shape my thinking for success.   I love that attitude.   It's a simple attitude: we are stronger as a team than as individuals.    Most of the company has this attitude, and I love it.


VMware's technology continues to push the limits of the traditional datacenter while proving real business value.    They have proven to be innovative both through acquisitions (NSX) and through internal research and development (vSAN).   I love this approach.  Too many companies stop internal research and lose the innovative spirit; this is simply not true at VMware.

Career Growth

VMware takes career growth seriously.  Two weeks after I started with the company, my boss asked me for specific goals that can affect my bonus structure.   These goals are recorded and tracked, and VMware is serious about enabling me to meet them.   Managers seem to be interested in retaining talent by supporting growth and interests.

New Challenges

Yes, it's true: I am a challenge junkie.  It's what caused me to get two VCDXs in two years.  Once a goal is on paper I am a nut case about achieving it.   Working for VMware as a Solutions Architect represents some new challenges and lots of learning, which does keep me going for a little while.

It’s all great when you are new

I completely agree that I am very young in the company, so my view is narrow.   I believe the future is very good for VMware, and I am excited to join them on the journey.

Storage in Virtualization is it a real problem?

As a VMUG leader and a double VCDX, I have seen one technology trend only increase over the years: the number of storage vendors!   Last year at our VMUG UserCon, every sponsor looking for a presentation slot was a storage vendor.   We had to choose between storage vendors and other storage vendors; I would have killed for another type of vendor.  In past years we had presentations from backup vendors, management tools, monitoring tools and IT service companies.   Now it's all storage companies.   As a double VCDX I get contacted by start-up companies looking to sell their products to VMware customers.  Some are well-known companies, others are still in stealth, but they all have the same request: how do we get VMware guys to buy our awesome technology?  Almost all of these companies are using a Super Micro white-box solution with some secret sauce.  The sauce is what makes them different; some are web-scale, while others are all-flash or boast awesome dedupe ratios.   All are attempting to address some segment of storage problems.  It raises the question: is there really a storage problem?


What does storage provide?

Storage essentially provides two things that virtualization professionals care about:

  • Capacity (Space to store information)
  • Performance (divided into IOPS and latency)
    • IOPS – input/output operations per second; the number of commands you can shovel into the system
    • Latency – how long it takes to shovel each I/O end to end

There are layers of software that each vendor provides in order to improve these metrics, for example dedupe for capacity or hot-blocking for performance.   Essentially, the role of a storage system is to provide these two things.
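Capacity, IOPS and latency are not independent: they are tied together by how many I/Os you can keep in flight (Little's Law). A tiny illustrative helper, with names and numbers of my own choosing rather than any vendor's spec:

```python
def achievable_iops(outstanding_ios, latency_s):
    """Little's Law for storage: sustained IOPS = I/Os in flight / latency.
    Illustrative model; real paths add queuing effects on top."""
    return outstanding_ios / latency_s

# 32 outstanding I/Os at 1 ms each caps the path around 32,000 IOPS,
# no matter how fast the disks behind it are.
print(achievable_iops(32, 0.001))
```

This is why a latency improvement (cache, flash) shows up directly as an IOPS improvement at the same concurrency, a relationship worth keeping in mind through the rest of this article.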

How has virtualization made it worse?

Virtualization has made management of these metrics a challenge.   In traditional storage, a single entity controls a LUN or mount.  It runs an application that has certain predictable usage patterns for the LUN.   For example, a web server does a lot of reads and a few writes.   We can identify and classify this usage pattern and thus "right-size" the LUN to meet these needs, in terms of both capacity and performance.   Virtualization created a new pattern: lots of guest servers with different applications sharing the same LUN.   This makes the usage metrics pretty wild.   The storage system has no idea what the virtual machines are doing beyond a bulk understanding of reads and writes.   This seems like a problem, but in reality the storage system just sees reads and writes and does not care, unless capacity or performance for that LUN is exhausted.    This issue might drive the acquisition of more performant storage in order to meet the needs of our new "super LUNs," but in most cases it just takes advantage of unused capacity on a storage array.

What does desktop virtualization have to do with storage?

Desktop virtualization taught us a very important lesson about storage: during boot, operating systems do a lot of IOPS.   Operating systems are 90% idle except during a boot, when lots of reads and some writes put pressure on disk.  Desktop virtualization introduced a new pattern of pressure.  At eight and nine AM everyone would boot up their virtualized desktop (spawning new desktops and booting the OSes), putting massive pressure on storage.   This caused storage systems to fail, and if storage was shared with traditional server virtualization, everything failed.   Traditional storage vendors' solution to this problem was to buy a bigger array with more cache and capacity.  This created stranded capacity and was a huge CapEx expenditure, when desktop virtualization was "supposed" to save us money.
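The boot-storm arithmetic is brutal even with made-up but plausible numbers (mine, not a sizing guide): a desktop that idles at a handful of IOPS can demand thousands while booting, and everyone boots at once.

```python
def aggregate_iops(desktops, iops_each):
    """Total IOPS hitting shared storage when all desktops behave the same.
    Illustrative arithmetic with assumed per-desktop numbers."""
    return desktops * iops_each

print(aggregate_iops(500, 2_000))  # 8 AM: 500 desktops booting at ~2,000 IOPS each
print(aggregate_iops(500, 10))     # mid-morning: the same desktops idling at ~10 IOPS
```

An array sized for the idle steady state is off by orders of magnitude at 8 AM, which is exactly why "buy a bigger array" was such an expensive answer.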

Role of Cache

The rise of SSD has provided a dramatic improvement in the size of cache available in arrays.   Cache provides ultra-fast disk for initial writes and common reads, reducing latency and improving IOPS.   I remember the days when 1 GB of cache was awesome; these days arrays can have 800 GB cache solutions or more.   Cache allows you to buy larger and slower capacity disks while getting better performance for the virtualized application.    Cache is a critical component in today's storage solutions.

How to solve desktop virtualization

Vendors saw a gap in technology: desktop virtualization was not being served by traditional array vendors.    This gap can be defined as:

  • The array was not meeting my performance needs without buying more arrays
  • I need to separate my IOPS for desktop virtualization away from servers


This gave rise to two solutions:

  • Hyper-converged infrastructure
  • All Flash arrays




Hyper-converged infrastructure

Hyper-converged infrastructure has many different definitions depending on who you ask.  For the purpose of this article, it's a combination of x86 hardware with local hard drives.  This combination provides the compute and a software-based clustered storage solution for virtualization.    The local hard drives on each compute node contribute to the required cluster file system.  This model has long been used by large service providers like Google and Amazon.  These are normally implemented for ESXi over NFS.  The market leader at this time is Nutanix, who really cut their teeth solving desktop virtualization problems.  They have since moved successfully into traditional server virtualization.   Their success has encouraged other vendors to enter the market, including SimpliVity (OmniCube) and VMware (Virtual SAN).   Each vendor has some mix of the secret sauce to address a perceived problem.   It's beyond the scope of this article to compare these solutions, but they all take advantage of at least one SSD drive as a per-compute-node cache.   This local cache can be very large compared to traditional arrays, with some solutions using 1 TB or more of local cache.   Each compute node serves as a storage controller, allowing a scale-out approach to capacity and performance.  Hyper-converged solutions have seen huge growth in the market and do effectively resolve the desktop problem, depending on scale.  They do introduce a new problem: balanced scalability.   Simply put, I may need additional storage without needing more controllers or compute capacity, but in order to get more storage I have to buy more nodes.   This balanced-scale issue is addressed by vendors providing different mixes of storage and compute nodes.


All Flash Arrays

With the rise of SSD, the cost keeps getting lower.   So traditional array vendors started producing all-flash arrays.   Flash provides insane amounts of IOPS per disk, but lower capacity.  Each month the capacity increases and the cost of SSD drops, making the all-flash array (AFA) a very real, cost-effective solution.   Years ago I was asked to demo a newly emerging flash solution called RamSan.  The initial implementation promised 150,000 IOPS in a single 2U unit.   I was tasked with testing its limits.  I wanted to avoid artificial testing, so I threw a lot of VMware database workloads at the array (all test workloads, of course).   I quickly found out that the solution might be able to do 150,000 IOPS, but my HBAs (2 per host) did not have enough queue depth to deliver 150,000 IOPS.   All-flash arrays introduced some new problems:

  • Performance bottleneck moved from the disk to the controller on the array
  • Capacity was costly
  • New bottlenecks like queue depth could be an issue

I remember buying 40 TB of SSD in a more recent array.  The SSD drives combined were capable of 300K IOPS, while the controllers could not push more than 120K IOPS; a single controller was able to do 60K.   Quickly the controllers became my problem, one I could not overcome short of buying a new array with additional controllers.    Traditional array vendors struggled with this setup, bound by their controller architecture, and a number of startup vendors entered the market with scale-out controllers.  All-flash solutions can potentially solve the desktop problem, but at a steep cost.
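Both stories above are instances of the same rule: the path delivers the minimum of its stages. A sketch with numbers echoing the anecdotes (300K-capable disks, two 60K controllers, queue-depth-limited HBAs); the model and figures are illustrative, not a sizing formula:

```python
def effective_iops(disk_iops, controller_iops, hba_count, queue_depth, latency_s):
    """The slowest stage in the I/O path wins. HBA ceiling comes from
    Little's Law: in-flight I/Os / latency. Illustrative model only."""
    hba_iops = hba_count * queue_depth / latency_s
    return min(disk_iops, controller_iops, hba_iops)

# 300K-IOPS flash behind 2x60K controllers, reached via two HBAs
# at queue depth 32 and ~1 ms latency: the HBAs cap the path near 64K.
print(effective_iops(300_000, 120_000, hba_count=2, queue_depth=32, latency_s=0.001))
```

Swap in a bigger queue depth and the controllers become the cap; swap in more controllers and the disks finally matter. Chasing any one number in isolation just moves the bottleneck.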


Problem with both solutions

All solutions suffer from the same problems:

  • Stranded capacity in IOPS or storage capacity (more of either than you need)
  • Storage controllers cannot meet performance needs

All of these issues happen because of a lack of understanding of the true application metrics.   vCenter understands the application metrics; the array understands reads and writes at a LUN level.   This lack of understanding of each virtual machine as an independent element does not allow the administrator to increase the priority or preference of individual machines.  Hyper-converged solutions have two additional challenges:

  • Increased network bandwidth for data replication (an issue Fibre Channel arrays avoid but NAS arrays share)
  • Blades rarely have enough space for multiple hard drives

The value proposition for hyper-converged is that you can replace your costly array with compute nodes full of hard drives.  This is a real cost savings, but only if you are due for a technology refresh on both compute and storage, and your budgets are aligned and agreed to spend on hyper-converged.  Getting the storage team to give up funds for hard drives in compute nodes can be a hard proposition.


How to understand the smallest atomic unit

Lots of vendors understand this problem and have different ways of approaching it, including:

  • VVols
  • Local compute cache
  • NFS

Essentially, to understand the small you have to understand the individual files and how they are connected.   The VMFS file system holds all of this information; block-based arrays only understand block-based reads and writes.   Individual files are invisible to a block-based array.



VVols

Developed by VMware, VVols provide a translation method for block-based storage systems using protocol endpoints.  These protocol endpoints run on the storage controllers, or in line with the controllers, to allow the array to understand the file system and individual files.   This translation allows the array to act upon a single virtual machine on a LUN instead of operating on the whole LUN.   We can apply performance policies, snapshots and all array operations to individual virtual machines.   This is a great solution, but it has two problems:

  • The protocol endpoints much like controllers have scalability issues if not implemented correctly
  • Vendor adoption has been very slow


Local compute cache

This process adds SSD or RAM and creates a cache for virtual machine reads and writes.  This cache can be assigned to individual machines or shared across the whole compute node.  This method understands individual machines and accelerates reads and writes.   In order to cache writes it’s critical that the writes be redundant, so normally a write has to be committed to at least two different compute nodes’ caches before it is acknowledged to the operating system.  This ensures that the data is protected during a single compute node failure.   The current leader providing read and write cache solutions like this is PernixData.  This process ensures local performance enhancement at the lowest atomic level but does endure some of the common challenges of hyper converged, including:

  • Every compute node must have local SSD to accelerate solution
  • Network bandwidth is used for replication (meaning you need more 10Gb, or you have to share it)
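The write path such a cache has to implement can be sketched roughly as follows. This is my own simplified illustration, not PernixData's actual implementation; all names here are hypothetical. The key idea is that a write is acknowledged to the guest OS only after the required number of cache copies exist.

```python
# Hypothetical sketch of a replicated write-back cache: a write is
# acknowledged only after the local node AND at least one peer node
# hold a copy, so a single compute-node failure cannot lose data.

class CacheNode:
    def __init__(self, name):
        self.name = name
        self.cache = {}          # block address -> data

    def commit(self, block, data):
        self.cache[block] = data
        return True              # committed to local SSD/RAM


def replicated_write(local, peers, block, data, copies=2):
    """Ack the guest OS write only once `copies` nodes hold the data."""
    committed = 1 if local.commit(block, data) else 0
    for peer in peers:
        if committed >= copies:
            break                # stop once enough replicas exist
        if peer.commit(block, data):
            committed += 1
    if committed < copies:
        raise IOError("not enough cache replicas; failing the write")
    return "ACK"                 # safe to acknowledge to the guest


node_a, node_b, node_c = CacheNode("a"), CacheNode("b"), CacheNode("c")
print(replicated_write(node_a, [node_b, node_c], block=42, data=b"x"))
```

Note how the peer commits are exactly where the extra network bandwidth in the bullet above comes from: every cached write crosses the wire at least once before the guest sees an acknowledgment.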


NFS

NFS has been around for years.  It’s a method for sharing a file system to Linux and Unix hosts.   VMware supports it natively, and it’s the only supported file system (other than VMware VSAN) that is not running VMFS.  VMs on NFS are files on the NFS file system.  This gives the storage array / server full understanding of the individual files.   This exposure can be a huge advantage when looking at backup products and site to site replication.    Until NFS version 4 support (vSphere 6) there were a number of drawbacks to NFS, including multipathing.  Those have been removed, and NFS provides the full object based storage solution that VVols promise.   Scalability can be a problem, with a maximum number of virtual machines and objects on a single datastore, or with capacity limits around controllers.   NFS based solutions are network based and thus create network workload.  In addition, NFS natively does not provide any per-file performance enhancement; it just deals with IO in and out.   Lots of vendors have implemented solutions to enhance NFS.

What is best and does it solve the issue?

I started this post with the question: is there a problem with storage?   Well, lots of vendors seem to think so and want to sell us stuff to solve the issue.   From my experience, I suggest we have a few issues:

  • Backup is a major mess; in vSphere it’s hard to manage and keep working without constant care and feeding
  • Storage arrays don’t have any understanding of the lowest atomic unit and thus cannot protect us from bad neighbors on the same lun; this becomes more of an issue in large hosting environments
  • Performance (IOPS) is rarely the issue, except in specific use cases or small business, thanks to oversized arrays
  • Queue depth is rarely the problem, except in specific use cases
  • Capacity seems to be the buzz problem, and the price per GB just keeps getting lower every year


I believe we need to get to object based storage so we can solve the backup problem.  Doing VDP backups or lun snapshots does not allow management at the lowest atomic unit.  The current model causes crashes and outages and struggles to work well.  It’s not a product issue; it’s an implementation and technology issue that needs a dramatic change to resolve.

Local knowledge at the lowest level

The object I manage is a virtual machine.  My storage array friend manages a lun with multiple virtual machines (sometimes hundreds – yes, I am looking at you, NFS).  Until we manage at the same atomic level we will have problems aligning policies and performance.   I think policy based enforcement with shares is a great way to go… something like SIOC that is enforced by the array.    Hot blocks, all flash, etc. are all fixes that attempt to get around the essential communication gap between the hypervisor and the array.   Future storage cannot be bound by two storage controllers; it needs to scale to meet needs.   The hyper converged folks have a big advantage on this problem.    The future of storage is not block, except in mixed enterprise environments (I am looking at you, mainframe).   You need to get comfortable with network based storage and architect for it.   Buy switches and interfaces on your compute just for storage traffic; don’t mix it.  Architect a super highway to your storage that is separate from your normal network traffic.
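To make the shares idea concrete, here is a rough sketch of share-based IOPS allocation in the spirit of SIOC. This is my own simplification, not VMware's actual algorithm: during contention each VM gets IOPS proportional to its shares, clamped by any per-VM limit.

```python
# Toy model of share-based IOPS allocation (SIOC-like, simplified):
# each VM's slice of the array's IOPS is proportional to its shares,
# then capped by an optional hard limit.

def allocate_iops(total_iops, vms):
    """vms: dict of name -> {"shares": int, "limit": int or None}"""
    total_shares = sum(v["shares"] for v in vms.values())
    result = {}
    for name, v in vms.items():
        fair = total_iops * v["shares"] / total_shares
        limit = v["limit"]
        result[name] = min(fair, limit) if limit else fair
    return result


pool = {
    "sql01":  {"shares": 2000, "limit": None},   # high priority VM
    "web01":  {"shares": 1000, "limit": None},
    "test01": {"shares": 1000, "limit": 500},    # capped tenant
}
print(allocate_iops(20000, pool))
```

One deliberate simplification: the IOPS that test01 leaves on the table because of its limit are not redistributed here, whereas a real scheduler would hand the surplus back to the uncapped VMs. The point is only that the array, not just the hypervisor, needs to understand and enforce this kind of per-VM policy.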


If performance is your issue, then solve it locally; don’t buy another array.  Local cache will save you a lot.   Scale up solutions in arrays or hyper converged are both options, but local SSD will be a lot cheaper than a rip and replace, and easier on management costs.

What should I choose?

It depends on your needs.   If I was presented with a green field that was going to be running all virtualized workloads today, I would seriously consider hyper converged.  Storage arrays are more mature but move a lot slower on updates.  I would move toward a more software defined solution instead of one installed in hardware.   I think that central understanding of the lowest atomic unit is critical going forward.   If you have a mixed storage environment or an investment in fiber channel, large arrays with cache make sense.   If you are looking to solve VDI issues I would consider hyper converged or lots of cache.   The future is going to hold some interesting times.  I need storage to provide the following:

  • No controller lock-in; it needs to scale to meet my needs
  • It needs to understand the virtual machine individual identity
  • It should include backup and restore capabilities to the VM level
  • It has to include data at rest encryption (yes I didn’t mention this but it’s huge)
  • Policy based performance (allocate shares, limits and reservations)
  • Include methods to move the data between multiple providers (move in and out of cloud)

Does it sound like a unicorn… yep it is… Someone go invent it and sell it to me.


Perfect deployments of OS with automation

I have spent the last few years working in enterprise shops and enjoying the challenges they bring.   I find a number of my peers are hired for a single use case or implementation and then leave.  Staying with an infrastructure past a single implementation allows me to enjoy all that brownfield IT has to offer.   It’s a completely different challenge.   Almost everyone I talk to, and everywhere I work, is trying to solve the same basic problem: do more with less and more automation.  Everyone wants the Amazon easy button without the security or off premises challenges of AWS.   In order to make it into the cloud they need organizational and operational change.  The first place almost everyone focuses is operating system deployments.   There are a number of models available, and I thought I would share some of my thoughts on them.


Cloning and templates

This model has been made available by VMware.  It’s a combination of creating a golden template and some guest customization.  It’s very easy to manage and produces very similar results every time during provisioning.  You have to focus on core shared elements or create a template for each use.  It does have some challenges:

  • How much of our software should we load onto it?  Security software, monitoring agents, etc.  How can we identify only the core shared elements?
  • It does not scale to lots of different templates – keeping an application template for every application kills you.  Imagine updating 100 templates monthly and ensuring, with the application teams, that they are not broken
  • It is a virtual only solution, making physical machine builds a manual or different process
  • It’s a provisioning only process; it has no idea of state after initial implementation

It’s a provisioning only process

This is a big problem for me with a lot of provisioning solutions, not just cloning.  They do initial provisioning but not steady state of the operating system.  This lack of life cycle management does not solve my brownfield issues.  Sure, you have an awesome, initially consistent implementation, but five minutes later you are out of sync with the initial template.   This problem has led me to configuration management in almost every shop I have worked in.   I wish everywhere I worked was a Netflix, with a redeploy-the-microservice-if-it-fails model.  The truth is none of the shops I have worked in have that model.   I have monolithic multi-tier applications that are not going away this year or in the future.

Do I have a life cycle problem or provisioning problem?

Yes, both.   I do not believe that the days of fire and forget operating systems are available to us anymore.   Every server is under a constant state of change, from attackers to patches.  Everything changes.   Changes bring outages when assumptions are made about the configuration of servers.  Early in my career I cannot count the number of outages that were caused by incorrect DNS settings or host files.   These are simple configuration items that were expected to be correct but found, after an outage, to have been changed.    ITIL would have us believe it’s all about change management: we need a CAB and approvals to avoid these issues.   While I am all about documented processes and procedures, I have not found that most host file changes get done via CAB; they get changed ad-hoc or during an outage.   We have to be able to provision, configure, and ensure the configuration stays.

Configuration management and provisioning

Take a look at this scenario:

  • Provisioning agent clones, provisions, duplicates a base operating system
  • Provisioning agent does initial configuration of OS (IP address, sysprep etc..)
  • Provisioning agent, based upon customer selection, provides some unique information to configuration management that enables the understanding of the server role (this is a SQL server, this is Apache, etc.)
  • Provisioning agent installs configuration management agent
  • Configuration management agent checks in with configuration management system and changes all settings (both base settings and server role settings)
  • Configuration management agent continues to ensure that role and base settings are correct for the life of the server
  • Server administrator / application administrator etc uses configuration management agent to adjust settings

This model provides for initial configuration and consistent life cycle management.  It does mean your configuration management agent does the heavy lifting instead of your provisioning agent.
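The handoff above can be sketched in a few lines. This is purely illustrative (the function names, baseline values, and role table are mine, not any particular product's): the provisioning agent does the minimum to get the OS on the network, then the configuration management agent owns the server for the rest of its life.

```python
# Illustrative sketch of the provision -> configure -> converge handoff.
# All names and settings here are hypothetical examples.

def provision(base_image, hostname, ip, role):
    """Provisioning agent: clone/build the OS and do initial setup only."""
    server = {"image": base_image, "hostname": hostname, "ip": ip}
    server["role"] = role                 # e.g. "sql", "apache"
    server["cm_agent"] = True             # install the CM agent last
    return server

BASELINE = {"dns": "10.0.0.2", "ntp": "10.0.0.3"}        # base settings
ROLE_SETTINGS = {"apache": {"package": "httpd", "port": 80}}

def converge(server):
    """CM agent check-in: enforce base + role settings on every run,
    for the life of the server -- not just at first boot."""
    desired = {**BASELINE, **ROLE_SETTINGS.get(server["role"], {})}
    server.update(desired)                # correct any drifted setting
    return server

web = provision("rhel7-base", "web01", "10.0.1.5", role="apache")
web = converge(web)                       # runs on a schedule forever
print(web["package"], web["dns"])
```

The design point is in `converge`: because it rewrites every managed setting from the desired state on each run, an ad-hoc host file or DNS change made during an outage gets corrected at the next check-in instead of lurking until the next failure.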

What about physical?

The model above also works for physical.  You have to move away from cloning and back to provisioning an operating system from PXE boot, but it works very well.  Now you can provision both physical and virtual from the same cloud agent, using consistent life cycle management.

What is the challenge?

For me the challenge has been that whenever I discuss configuration management it gets confused with compliance management.   I believe that configuration management can and should be used for compliance management, but that’s not its primary role.   Compliance is about meeting security standards.  Configuration is about ensuring configuration settings are correct and, if not, correcting them.   I can identify compliance issues and apply the resolution via configuration management, and I can use the configuration management engine to identify things out of compliance that I have since changed to meet compliance.

Replacement for vCenter Heartbeat or not?

Many of my readers will already know that vCenter Heartbeat is being removed from VMware’s product line up.   Support will officially end in 2017, but the end-of-availability notice has already gone out.   Today I attended a VMUG user conference and one of the speakers was Justin King.  He joined VMware when it acquired the company that developed Heartbeat, so he was the perfect person to ask the golden question: now that vCenter Heartbeat is gone, what should I use to protect vCenter?


In the ESX 3.x days it was common to find vCenter running on physical hardware.   This was due to multiple dependencies for HA to operate on vCenter.  It was also about customer confidence.  These issues have been slowly removed.  With the rewrite of FDM in 5.0, the recommendation has become to use a virtual vCenter, and even more so the Linux based appliance.   Heartbeat provided a hot standby for vCenter in the event of failure, but it really took about 30 seconds to take over.   At this point HA takes 15 seconds + reboot and service start to restart vCenter.


Justin’s Take:

Justin provided a really great view that I wanted to share.  Most companies don’t have SLAs on vCenter, and most of the availability features work fine without vCenter, so for most companies the 15 seconds + reboot is not a big deal.   He also suggested that HA application monitoring should be used to restart applications as needed, and that a management cluster is becoming more and more common.   Certainly with NSX and Auto Deploy a management cluster becomes critical.    He also alluded to some new features that may solve the issue.


My Take:

This problem reminds me of the 2TB VMDK limit.  Everyone needed more space, and it felt like VMware took about two years too long to solve it.  For years customers have complained that everything about VMware is redundant except vCenter… I have to agree with them.   Talking to Justin did bring up one critical thought: customer complaints and customer needs are two different things.  Very few customers have SLAs around vCenter… lots of SLAs around virtual machine workloads.   So VMware has focused its energy on customer needs instead of wants.   Will they go to a redundant vCenter solution?  Yes, just not right away.   There are two environments that require vCenter and have SLAs: EUC and cloud.   In both situations your customers will be unable to consume services without vCenter.   I think as both gain more traction the business case for a redundant vCenter will increase.   I think the move toward the appliance also opens up doors to solve this redundancy issue.

Infrastructure Family Pet or Farm Animal

For the last few months I have been thinking a lot about this issue.   Let me start out by introducing our family pet, Alice.   She was a rescue from the local shelter that we added to our family about two years ago.  At the time of rescue she needed a haircut badly… and a bath.   I remember when we brought her home the whole family washed her and babied her to death.   She has turned out to be a great family pet, providing a number of benefits, including exercising our four year old with hours of chase the toy and tug of war.   She provides a valuable service to my family.  She also hops up on my lap at exactly 5:30 PM every night and lick/annoys me until she gets her dinner.   At 7:30 PM she is restless and expects to be let outside for the sole purpose of receiving her daily chew stick.   She requires two walks a day or she has an “accident” in my home.   She is a lot of work.   Not less valuable, but a lot of work.   The real problem with a family pet is they don’t scale… from time to time my children ask about getting a “insert some type of house pet here”, to which my response is the age tested trick of “well, if you walked the dog… maybe…”   I am becoming such an old man.    I have no intention of getting any more family pets because it’s just one more thing I have to manage, and I don’t have time for it.   There is no way I can manage seven family pets, let alone 200.

What happened to Virtual Me and why are you talking about pets?

This is a valid question: has someone stolen my password?   Nope.   I think we can learn a lot from Alice.  I think she has me trained.   Everyone has infrastructure like Alice: it provides valued services but requires lots of care and feeding.  Early in my professional life, after an older admin retired, we kept having issues with a server that forced reboots and outages.   When I asked my retired friend about the problem his response was “that’s because you don’t reboot it every morning”.   I came to learn he had been rebooting the server every morning, including weekends, for five years.   I think the server had him trained pretty well.  Again, the problem is it does not scale.   Imagine having four dogs that all require bones at different times… you could not possibly manage the nightmare.

OK So how did all my servers become pets?

One word: customization.   It’s the bane of human existence.   Imagine taking a boat and turning it into a car… it may operate and provide the basic function of ground transportation, but maintenance will be a huge pain.   Every systems administrator is a tweaker… they love to tinker and make things better.   All of these little changes provide a 2% benefit but cannot be supported by a larger group.   Think about my dog again: if you watch my dog and at 7:30 PM she goes to the back door, you would let her out.   When she does not get her bone she will again ask to go out, and you will let her… rinse and repeat this process about 40 more times and you start to wonder if my dog is crazy.   What has happened is she has an undocumented tweak to the process.   Going outside has nothing to do with getting a bone, except for her they have always followed in succession (in fact they just did again).   These tweaks are almost always not documented, and even if they are you will have to read for an hour to understand them… it’s easy for me to write “7:30 PM let dog out, give her bone”, but you might ask why.  Customization cannot be your friend if you expect to scale up.

I am ready to give up all customization but my customer might need some

You are correct… customization is what we do… but I just told you to give it up.    Here is where farm cattle come into play.   Dairy cows on a dairy farm live out their lives for a single purpose: from a business perspective, to provide milk.   Everything is handled together… when it’s washing time they all get hosed off… when it’s milking time they all go to the barn.   When a cow stops producing milk she is removed and replaced with a producing cow.    We need to treat servers like cows.   They provide a business function or service, and if that function stops working they are replaced, not diagnosed or tweaked or rebooted once a day.   They are removed.   Ten years ago you would have called me crazy, but virtualization has enabled the rip and replace model.   But in IT we are still caring for family pets… we love them and they love us… guess what, we are starting to look like the crazy cat lady.   We have more pets than we can handle, and we scare the business units.   To make matters worse, our pets cost a lot of money, are slow to deploy, and come in some crazy colors.   Business wants a dairy farm while we run the cat lady’s house.

IT is a business

This term gets thrown around a lot.  Everyone is talking about how IT is a business.  We need fewer products and more agility.    Did you miss that?  Fewer products, more agility.   Netflix provides essentially two services: a web portal and movie streaming.  That makes it possible for them to use rip and replace.  They have dairy cows and breeding cows in their little dairy farm.

What can the auto industry teach us?

The auto industry has been facing this problem for a long time.   They started with hand-made cars and moved through automation.  They learned you cannot customize everything, but you can offer options.  We need to offer three models of servers in different types… some are mini-vans, others sports cars.   We can offer customization, but we need to automate and use a modular approach.   Let me say that again: we need to automate and use a modular approach.   We have to move build times from weeks or days to hours or seconds.   We need to automate the life cycle of our services.   If we don’t, we will become the cat lady, not the future oriented technology experts.  We will be replaced by AWS and every cloud provider on the internet.   Not because they offer better products, services or customization, but because they can provide services to businesses in hours at a static cost.


What do you think?

Did I just force you to lose a portion of your soul?   Where do you think IT is heading?  Share and let me know.   Are you ready to buy AWS and give up traditional IT?   I definitely don’t have all the answers… either way, thanks for reading and letting me rant.  I will return you to your normally scheduled content.