2017-08-10

Meaningful Versions with Continuous Everything

Q: How should I version my software? A: Automated!

All continuous delivery processes follow the same basic pattern:

Engineers working on source code, configuration or other content commit their work to a git repository (or another version control system; git serves as the example here). A build system is triggered by the new git commit revision, creates binary and deployment artefacts, and applies the deployments.

Although this pattern exists in many different flavors, at the core it is always the same concept. When we think about creating a version string the following requirements apply:
  • Every change in any of the involved repositories or systems must lead to a new version to ensure traceability of changes.
  • A new version must be sorted lexicographically after all previous versions to ensure reliable updates.
  • Versions must be independent of the process execution times (e.g. in the case of overlapping builds) to ensure a strict ordering of the artefact version according to the changes.
  • Versions must be machine-readable and unambiguous to support automation.
  • For every version component there must be a single source of truth to make it easy to analyse issues and track back changes to their source.
Every continuous delivery process has at least two main players: The source repository and the build tool. Looking at the complete process allows us to identify the different parts that should contribute to a unique version string, in order of their significance:

1. Version from Source Code

The first version component depends only on the source code and is independent of the build tooling. All of its parts must be derived solely from the source code repository. This version is sometimes also called the software version.

1.1 Static Source Version

The most significant part of the version is the one that is set manually in the source code. This can be, for example, a simple VERSION file or a git tag. It typically denotes a compatibility promise to the users of the software. Semantic Versioning is the most common versioning scheme here.

To automate this version one could think about analysing an API definition like OpenAPI or data descriptions in order to determine breaking changes. Each breaking change should increment the major version, additions should increment the minor version and everything else the patch version.

In a continuous delivery world we can often reduce this semantic version to a single number that denotes breaking changes.

1.2 Source Version Counter

Every commit in the source repository can potentially produce a new result that is published. To enforce the creation of a new version with every change, a common practice is adding an automatically calculated and strictly increasing component to the version. For Subversion this is usually the revision. For git we can use the commit count as given by git rev-list HEAD --count --no-merges.

If the project uses git tags then the following git command generates a complete and unique version string: git describe --tags --always --dirty=-changed, which yields something like 2.2-22-g26587455 (22 is the commit count since the 2.2 tag). In this case the version also contains the short commit hash from git, which identifies the exact state.
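As a hedged sketch (the variable names are mine, not part of any standard), a build script could derive both parts of the source version like this:

#!/bin/bash
# Derive the source version purely from the git repository.
set -e
# Strictly increasing counter of non-merge commits:
COMMIT_COUNT=$(git rev-list HEAD --count --no-merges)
# Tag-based description, e.g. 2.2-22-g26587455; falls back to the bare commit hash if there are no tags:
SOURCE_VERSION=$(git describe --tags --always --dirty=-changed)
echo "commit count:   $COMMIT_COUNT"
echo "source version: $SOURCE_VERSION"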

2. Version from Build System

The build tools and automation also influence the resulting binaries. To distinguish builds of the same source version with different build tooling, we make sure that the build tooling also contributes to the resulting version string. To separate it clearly from the version derived from source code, I like to use the old term release for the version contributed by the build tooling.

2.1 Tool Version

All the tooling that builds our software can be summarized with a version number or string. The build system should be aware of its own version and contribute it to the resulting version string.

If your build system doesn't have such a version then this is a sign that you don't practice continuous delivery for the build automation itself. You can leave this version out and rely only on the build counter.

2.2 Build Counter

The last component of the version is a counter that is simply incremented for each build. It ensures that repeated builds from the same source yield different versions. It is important that the build counter is determined at the very beginning of the build process.

If possible, use a build counter that is globally unique - at least for each source repository. Timestamps are not reliable and depend on the quality of the time synchronization.

Versions for Continuous Delivery

If all your systems are continuously built and deployed then there is a big chance that you don't need semantic versioning. In this case you can simplify the version schema to use only the automatic counter version and release:
<git revision counter>.<build counter>
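As a sketch, such a version could be assembled at the start of a build like this (BUILD_NUMBER stands for whatever counter your CI server provides; Jenkins, for example, exposes one under exactly that name):

#!/bin/bash
# Assemble <git revision counter>.<build counter> - a sketch, not a complete build script.
set -e
GIT_COUNT=$(git rev-list HEAD --count --no-merges)
: "${BUILD_NUMBER:?must be provided by the CI server}"
VERSION="${GIT_COUNT}.${BUILD_NUMBER}"
echo "building version $VERSION"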
On the other hand you might want to add more components to your version strings to reflect modularized source repos. For example, if you keep the operational configuration separate from the actual source code then you might want to have a version with three parts, again in simplified form:
<source revision counter>.<config revision counter>.<build counter>
The most important takeaway is to automate the generation of version strings as much as possible. In the world of continuous delivery it becomes possible to see versions as a technical number without any attached emotions or special meaning.

Version numbers should be cattle, not pets.

2017-08-04

Favor Dependencies over Includes

Most deployment tools support dependencies or includes or even both. In most cases one should use dependencies and not includes — here is why.
Includes work on the implementation level and dependencies work at the level of functionality. As we strive to modularize everything, dependencies have the benefit of separating the implementations from each other, while includes actually couple the implementations.

Dependencies work by providing an interface of functionality that the users can rely upon - even if the implementation changes over time. They also create an abstraction layer (the dependency tree) that can be used to describe a large system of components via a short list of names and their dependencies. Dependencies therefore allow us to focus on smaller building blocks without worrying much about the other parts.

To conclude, please use dependencies if possible. They will give you a clean and maintainable system design.

2017-07-21

Web UI Testing Made Easy with Zalenium

So far I was always afraid to mess with UI tests and SeleniumHQ. Thanks to Zalenium, a dockerized "it just works" Selenium Grid from Zalando, I finally managed to start writing UI tests.

Zalenium takes away all the pain of setting up Selenium with suitable browsers and keeps all of that nicely contained within Docker containers (docker-selenium). Zalenium also handles spawning more browser containers on demand and even integrates with cloud-based selenium providers (Sauce Labs, BrowserStack, TestingBot).

To demonstrate how easy it is to get started I set up a little demo project (written in Python with Flask) that you can use for inspiration. The target application is developed and tested on the local machine (or a build agent) while the test browsers run in Docker and are completely independent from my desktop browser.
A major challenge for this setup is accessing the application that runs on the host from within the Docker containers. Docker's network isolation normally prevents this kind of access. The solution lies in running the Docker containers without network isolation (docker run --net=host) so that the browsers running in Docker can access the application running on the host.

Prerequisites

First install the prerequisites. For all programming languages you will need Docker to run the Zalenium and Leo Gallucci's SeleniumHQ Docker containers.
For my demo project, written in Python, you will also need Python 3 and virtualenv. On Ubuntu you would run apt install python3-dev virtualenv.

Next you need to install the necessary libraries for your programming language, for my demo project there is a requirements.txt.

Example Application and Tests

The demo project contains a simple web application that shows two pages with a link from the first to the second.
The integration tests check if the page loads and if the link from the main page to the second page is present. For this check the test loads the main page, locates the link and clicks on it - just like a user accessing the website would do.

Automate

To automate this kind of test we can simply start the Zalenium Docker containers before the actual test runner:
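A rough sketch of such a wrapper follows; the test command and some Zalenium parameters are assumptions on my part, the demo project contains the real script:

#!/bin/bash
# Sketch: create a virtualenv, start Zalenium, run the UI tests, stop Zalenium.
set -e
virtualenv -p python3 venv
. venv/bin/activate
pip install -r requirements.txt

docker pull elgalu/selenium                  # browser image used by Zalenium
docker run -d --name zalenium --net=host --privileged \
    -v /var/run/docker.sock:/var/run/docker.sock \
    dosel/zalenium start
trap "docker rm -f zalenium" EXIT            # always tear down the grid

# Wait until the Selenium hub answers, then run the tests (assumed test layout).
timeout 60 bash -c 'until curl -s http://localhost:4444/wd/hub/status >/dev/null; do sleep 1; done'
python -m unittest discover tests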

This little script creates a new virtualenv, installs all the dependencies, starts Zalenium, runs the tests and stops Zalenium again. On my laptop this entire process takes only 40 seconds (without downloading any Docker images), so I can run UI tests locally without much delay.

I hope that these few lines of code and Zalenium will also help you to start writing UI tests for your own projects.

2017-06-29

Setting Custom Page Size in Google Docs - My First Published Google Apps Script Add-On

While Google Docs is a great productivity tool, it still lacks some very simple and common functionality, for example setting a custom page size. Google Slides and Google Drawings allow setting custom sizes, but Google Docs does not.

Luckily there are several add-ons available for this purpose, for example Page Sizer is a little open source add-on on the Chrome Web Store.

Unfortunately, in many enterprise setups of G Suite, access to the Chrome Web Store and to Google Drive add-ons is disabled for security reasons: the admins cannot white-list individual add-ons and are afraid of add-ons that leak company data. Admins can only white-list add-ons from the G Suite Marketplace.

The Google Apps Script code to change the page size is actually really simple. For example, to set the page size to A1 you need only this single statement:
DocumentApp
  .getActiveDocument()
  .getBody()
  .setAttributes({
    "PAGE_WIDTH": 1684,
    "PAGE_HEIGHT": 2384
  });
To solve this problem for everybody I created a simple Docs add-on Set A* Page Size that adds a menu to set the page size to any of 4A0 - A10.
Users can use this add-on in three ways:
  • Install the add-on in their Google Docs. This works for gmail.com accounts and for G Suite accounts that allow add-on installation, and it makes the add-on available in all their documents.
  • Ask their domain admins to add the add-on from the G Suite Marketplace. This adds the add-on for all users and all their documents in the domain. The source code is public (MIT License) and open for review.
  • Copy & paste the code into their own document — this requires no extra permissions and does not involve the domain admins. It adds the add-on only to the current document.
To use the code in your own document follow these steps:
  1. Copy the code from the Script Editor of this Document into the Script Editor of your own Document.
  2. Close and Open your document.
  3. Use the Set Page Size menu to set a custom page size:
I hope that you will find this little add-on useful and that you can learn something about Google Apps Scripting from it.

2017-06-23

Eliminating the Password of Shared Accounts

Following up on "Lifting the Curse of Static Credentials", everybody should look closely at how they handle shared accounts, robot users or technical logins. Do you really rotate passwords, tokens and keys each time somebody who had access to the account leaves your team or the company? Do you know who has access? How do you know that they didn't pass on those credentials or put them in an unsafe place?

For all intents and purposes, a shared account is like anonymous access for your employees. If something bad happens, the perpetrator can point to the group and deny everything. As an employer you will find it nearly impossible to prove who actually used the password that was known to so many. Or even to prove that it was one of your own employees and not an outside attacker who "somehow" stole the credentials.

Thanks to identity federation and federated login protocols like SAML2 and OpenID Connect it is now much easier to completely eliminate passwords for shared accounts. Without passwords the shared accounts are much less risky. You can actually be sure that only active and authorized users can use the shared accounts, both now and in the future.
The concept is fairly simple. It is based on the federated identity provider separating the login identity used for authentication from the account identity that is the result of authorization.

In the following I use the specific case of shared accounts for GitHub Enterprise (GHE) as an application and G Suite (Google) as the federated identity provider to illustrate the idea. The concepts are however universal and can easily be adapted for other applications and identity providers.

A typical challenge in GitHub (both the public service and on-premise) is the fact that GitHub as an application only deals with users. There is no concept of services or service accounts. If you want to build sustainable automation then you must, according to the GitHub documentation, create a "machine user", which is a regular user re-purposed as a shared account. GitHub even points out that this is the recommended solution, although GitHub users otherwise must be real people according to their Terms of Service.

Normal Logins for Real and Shared Users

Before we deal with shared accounts we first look at the normal federated login process in figure 1. GHE uses SAML2 to delegate user authentication to Google.


Fig 1: Normal Federated User Login
User John Doe wants to use the GHE web interface. He points ➊ his browser to GHE, which does not have a valid session for him. GHE redirects ➋ John to sign in at his company's Google account. If he is not signed in already, he authenticates as his own identity ➌ john@doe.com. Google redirects him back to GHE and signals ➍ to GHE that this user is John Doe. With this authorization from Google, John is now logged in ➎ as @jdoe in the GHE application.

As users sign in to GHE, their respective user account is created if it does not exist. This "Just In Time Provisioning" is a feature of SAML2 that greatly simplifies the integration of 3rd party applications.

The traditional way to introduce shared accounts in this scenario is creating regular users within the identity provider (here Google) and handing out the login credentials (username & password) to the groups of people who need access to the shared account. They can then log in with those credentials instead of their own and thereby become the machine user in the application. For all the involved systems there is no technical difference between a real user and a shared account user; the difference comes only from how we treat them.

The downsides of this approach include significant inconvenience for the users, who have to sign out globally from the identity provider before they can switch users, or use an independent browser window just for that purpose. The biggest threats come from the way the users manage the password and 2-factor tokens of the shared user and from the organization's (in-)ability to rotate these after every personnel change.

Login Shared Users without Password

Many applications (GHE really only serves as an example here) do not have a good concept for service accounts and rely on regular user accounts to be used for automation. As we cannot change all those applications we must accommodate their needs and give them such service users.

The good news is that in the world of federated logins only the user authentication (the actual login) is federated. Each application maintains its own user database. This database is filled and updated through the federated logins, but it is still an independent user database. That means that while all users in the identity provider will have a corresponding user in the application (if they used it), there can be additional users in the application's user database without a matching user in the identity provider. Of course the only way to access (and create) such users is through the federated login. The identity provider must authorize somebody to be such a "local" user when signing in to the application.

To introduce shared accounts in the application without adding them as real users in the identity provider we have to introduce two changes to the standard setup:
  1. The identity provider must be able to signal different usernames to the application during the login process.
  2. The real user must be able to choose which user they will be in the application after the next login.
Figure 2 shows this extended login process. For our example use case the user John Doe is a member of the team Alpha. John wants to access the team's account in GHE to connect their team's git repositories with the company's continuous delivery (CD) automation.
Fig 2: Login to GHE as Team User
For regular logins as himself, John would "just use" GHE as described above. To log in as the team account, John first goes to the GHE User Chooser, a custom-built web application where John can select ➊ which GHE user he wants to be logged in as at GHE. Access to the chooser is of course also protected with the same federated login; the figure omits this detail for clarity.

John selects the team user for his team Alpha. The chooser stores ➋ John's selection (team-alpha) in a custom attribute in John's user data within the identity provider.
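For illustration only (this is not the actual chooser code): with G Suite such a value can be written into a custom user schema via the Admin SDK Directory API and then mapped in the SAML app settings to the attribute that GHE receives. The schema and field names here are invented:

# Sketch: store the chosen GHE username in a custom schema field of John's Google user.
# "SSO" and "github_username" are invented names, $ADMIN_API_TOKEN is a suitable OAuth2 token.
curl -X PATCH \
    -H "Authorization: Bearer $ADMIN_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"customSchemas": {"SSO": {"github_username": "team-alpha"}}}' \
    "https://www.googleapis.com/admin/directory/v1/users/john@doe.com"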

Next John goes as before to GHE. If he still has an active session at GHE he needs to sign out from GHE, but this does not sign him out at the identity provider or all other company applications.

Then John accesses GHE again ➌, which redirects ➍ him to the identity provider, in this example Google. There John signs in ➎ with his own identity john@doe.com. Most likely John still has an active session with Google, so he won't actually see this part: Google will confirm his identity without user interaction.

The identity provider reads ➏ the username to send to the application from the custom attribute. When the identity provider redirects ➐ John back to the application, it also sets the GHE user from this custom attribute. In this case the custom attribute contains team-alpha and not jdoe as it would for a personal login. This redirect is the place where the identity switch actually happens. As a result, John retained his personal identity in Google and is signed in to GHE as his team account ➑ @team-alpha.

The main benefit of this approach is the radical removal of shared account passwords and the solid audit trail for the shared accounts. It applies the idea of service roles to the domain of standard applications that do not support such roles on their own. So far only a few applications have a concept of service roles, most notably Amazon AWS with its IAM Roles. This approach brings the same convenience and security to all other applications as well.

Outlook

Unfortunately this concept only protects the access to the shared account itself, not the access to tokens and keys that belong to such an account. Improving the general security is a step-by-step process, and the user chooser takes us a major step towards truly securing applications like GHE.

The next step would be addressing the static credentials that are generated within the application. In the case of GHE these are the personal access tokens and SSH keys that are the only way for external tools to use the shared account. These tokens and keys are perpetual shared secrets that can easily be copied without anybody noticing.

To get rid of all of this we will have to create an identity mapping proxy that sits in front of the application to translate the authentication of API calls. To the outside (the side that is used by the users and other services) the proxy uses the company authentication scheme, e.g. OAuth2. To the inside (towards the application) it uses the static credentials that the application supports. In order to fully automate this mapping, the proxy also has to maintain those static credentials on behalf of the users so that the users do not need to deal with them at all.

In this scenario there is also no need for a user account chooser as described above: users will have no need to act on behalf of the service accounts; the most interaction will be granting those service accounts permission to access shared resources.

Figure 3 shows how such a proxy for GitHub Enterprise and the company's OAuth2 identity provider, e.g. Google, could be built. It is surely a much larger engineering effort than the user account chooser, but it solves the entire problem of static credentials, not only the problem of shared account passwords.
Fig 3: Identity Mapping Proxy to remove static credentials from API authentication

It really is possible to get rid of static credentials, even for applications where the vendor does not support such ideas. While these concepts can be adapted for any kind of application, the account chooser and identity mapping proxy will be somewhat custom tailored. Depending on the threat model and risk assessment in your own organisation, the development effort might be very cheap compared to the alternative of continuing to live with the risks.

I hope that both application vendors and identity providers will eventually understand that static credentials are the source of a lot of troubles and that it is their job to provide us users good integrations based on centrally managed identities, especially for the integration of different services.

2017-06-16

Using Kubernetes with Multiple Containers for Initialization and Maintenance

Kubernetes is a great way to run applications because it allows us to manage single Linux processes with a real cluster manager. A traditional server with multiple services is typically implemented as a pod with multiple containers that share communication and storage.
Ideally every container runs only a single process. On Linux, most applications have three phases, each typically implemented by a different program or script:
  1. The initialization phase, typically an init script or a systemd unit file.
  2. The run phase, typically a binary or a script that runs a daemon.
  3. The maintenance phase, typically a script run as a CRON job.
While it is possible to put the initialization phase into a Docker container as part of the ENTRYPOINT script, that approach gives much less control over the entire process and makes it impossible to use different security contexts for each phase, e.g. to prevent the main application from directly accessing the backup storage.

Initialization Containers

Kubernetes offers initContainers to solve this problem: Regular containers that run before the main containers within the same pod. They can mount the data volumes of the main application container and "lay the ground" for the application. Furthermore they share the network configuration with the application container. Such an initContainer can also contain completely different software or use credentials not available to the main application.

Typical use cases for initContainers are:
  • Performing an automated restore from backup in case the data volume is empty (initial setup after a major outage) or contains corrupt data (automatically recovering the last working version).
  • Doing database schema updates and other data maintenance tasks that depend on the main application not running.
  • Ensuring that the application data is in a consistent and usable state, repairing it if necessary.
The same logic also applies to maintenance tasks that need to happen repeatedly during the run time of an application. Traditionally CRON jobs are used to schedule such tasks. Kubernetes does not (yet) offer a mechanism to start a container periodically on an existing pod. Kubernetes Cron Jobs are independent pods that cannot share data volumes with running pods.

One widespread solution is running a CRON daemon together with the application in a shared container. This not only breaks the Kubernetes concept but also adds a lot of complexity, as now you also have to take care of managing multiple processes within one container.

Maintenance Containers

A more elegant solution is using a sidecar container that runs alongside the application container within the same pod. Like the initContainer, such a container shares the network environment and can also access data volumes from the pod. A typical application with init phase, run phase and maintenance phase looks like this on Kubernetes:
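A minimal sketch of such a pod (image names, script paths and the volume are placeholders, not the actual manifest):

# Sketch: one pod with an init container, the application and a maintenance
# sidecar sharing a single data volume. All names are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  volumes:
    - name: data
      emptyDir: {}
  initContainers:
    - name: restore
      image: example/backup-tool        # placeholder image
      command: ["/restore-if-empty.sh"] # placeholder script
      volumeMounts:
        - name: data
          mountPath: /data
  containers:
    - name: app
      image: example/app                # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
    - name: maintenance
      image: example/backup-tool        # placeholder image
      command: ["/daily-backup.sh"]     # placeholder script
      env:
        - name: RUNAT
          value: "03:00"
      volumeMounts:
        - name: data
          mountPath: /data
EOF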
This example also shows an S3 bucket that is used for backups. The initContainer has exclusive access before the main application starts. It checks the data volume and restores data from backup if needed. Then both the main application container and the maintenance container are started and run in parallel. The maintenance container waits for the appropriate time and performs its maintenance task. Then it again waits for the next maintenance time and so on.

Simple CRON In Bash

The maintenance container can now contain a CRON daemon (for example Alpine Linux ships with dcron) that runs one or several jobs. If you have just a single job that needs to run once a day, you can also get by with a simple Bash script that takes the maintenance time from the RUNAT environment variable.
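A minimal sketch of such a loop (assuming GNU date; /maintenance.sh is a placeholder for the actual job):

#!/bin/bash
# Poor man's cron: run /maintenance.sh once a day at $RUNAT (HH:MM).
set -e
: "${RUNAT:?please set RUNAT, e.g. 03:00}"
while true; do
    now=$(date +%s)
    next=$(date -d "$RUNAT" +%s)                                   # today at $RUNAT
    [ "$next" -le "$now" ] && next=$(date -d "tomorrow $RUNAT" +%s)
    sleep $(( next - now ))
    /maintenance.sh || echo "maintenance failed" >&2
done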


All that also holds true for other Docker cluster environments like Docker Swarm mode. However you package your software, Kubernetes offers a great way to simplify our Linux servers by dealing directly with the relevant Linux processes from a cluster perspective. Everything else of a traditional Linux system is now obsolete for applications running on Kubernetes or other Docker environments.

2017-06-12

Working with IAM Roles in Amazon AWS

Last week I wrote about understanding IAM Roles; let's follow up with some practical aspects. The following examples and scripts all use the aws-cli, which you should already have installed. The scripts work on Mac and Linux and probably on Windows under Cygwin.

To illustrate the examples I use the case of an S3 backup bucket in another AWS account. For that scenario it is recommended to use a dedicated access role in the target AWS account to avoid troubles with S3 object ownership.

AWS Who Am I?

Sometimes the most important question is to ascertain one's current identity. Luckily the aws-cli provides a command for that:
$ aws sts get-caller-identity
{
    "Account": "123456789",
    "UserId": "ABCDEFG22L2KWYE5WQ:sschapiro",
    "Arn": "arn:aws:sts::123456789:assumed-role/PowerUser/sschapiro"
}
From this we can learn our AWS account and the IAM Role that we currently use, if any.

AWS Assume Role Script

The following Bash script is my personal tool for jumping IAM Roles on the command line:
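A minimal sketch of such a script (it prints the INFO lines to stderr and the credentials to stdout so that eval and --env-file only pick up the variables):

#!/bin/bash
# aws-assume-role: chain one or more IAM Roles and print the resulting
# temporary credentials as environment variable assignments.
set -e
for role in "$@"; do
    # Roles given without an ARN are looked up in the current account.
    if [[ "$role" != arn:aws:iam::* ]]; then
        account=$(aws sts get-caller-identity --query Account --output text)
        role="arn:aws:iam::${account}:role/${role}"
    fi
    creds=$(aws sts assume-role \
        --role-arn "$role" \
        --role-session-name "${USER:-aws-assume-role}" \
        --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
        --output text)
    read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$creds"
    export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
    echo "INFO: Switched to role $role" >&2
done
echo "AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID"
echo "AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY"
echo "AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN"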
It takes any number of arguments, each a role name in the current account or a role ARN. It goes from role to role and returns the temporary AWS credentials of the last role as environment variables:
$ aws-assume-role ec2-worker arn:aws:iam::987654321:role/backup-role
INFO: Switched to role arn:aws:iam::123456789:role/ec2-worker
INFO: Switched to role arn:aws:iam::987654321:role/backup-role
AWS_SECRET_ACCESS_KEY=DyVFtB63Om+uihwuieufzud/w5vm7Lhp3lx
AWS_SESSION_TOKEN=FQoDYXdzEHgaDAgVN…✂…tyHZrYSibmLbJBQ==
AWS_ACCESS_KEY_ID=ABCDEFGFWEIRFJSD6PQ
The first role ec2-worker is in the same account as the credentials with which we start, therefore we can specify it just by its name. The second role is in another account and must be fully specified. If we switched to a third role in the same account we could again use the short form.

Single aws-cli Command

To run a single aws-cli or other command as a different role we can simply prefix it like this:
$ eval $(aws-assume-role \
    ec2-worker \
    arn:aws:iam::987654321:role/backup-role \
  ) aws sts get-caller-identity
INFO: Switched to role arn:aws:iam::123456789:role/ec2-worker
INFO: Switched to role arn:aws:iam::987654321:role/backup-role
{
    "Arn": "arn:aws:sts::987654321:assumed-role/backup-role/sschapiro",
    "UserId": "ABCDEFGEDJW4AZKZE:sschapiro",
    "Account": "987654321"
}
Similarly you can start an interactive Bash by giving bash -i as the command. aws-cli also supports switching IAM Roles via configuration profiles. This is a recommended way to permanently switch to another IAM Role, e.g. on EC2.
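For example, a profile that assumes the backup role could look like this (the profile name is an example; the source profile must hold valid credentials):

# Sketch: add a profile that switches to the backup role and use it once.
cat >> ~/.aws/config <<'EOF'
[profile backup]
role_arn = arn:aws:iam::987654321:role/backup-role
source_profile = default
EOF
aws --profile backup sts get-caller-identity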

Docker Container with IAM Role

The same script also helps us to run a Docker container with AWS credentials for the target role injected:
$ docker run --rm -it \
  --env-file <(
    aws-assume-role \
      ec2-worker \
      arn:aws:iam::987654321:role/backup-role \
    ) \
  mesosphere/aws-cli sts get-caller-identity
INFO: Switched to role arn:aws:iam::123456789:role/ec2-worker
INFO: Switched to role arn:aws:iam::987654321:role/backup-role
{
    "Arn": "arn:aws:sts::987654321:assumed-role/backup-role/sschapiro",
    "UserId": "ABCDEFGRWEDJW4AZKZE:sschapiro",
    "Account": "987654321"
}
This example just calls aws-cli within Docker. The main trick is to feed the output of aws-assume-role into Docker via the --env-file parameter.

I hope that these tools help you also to work with IAM Roles. Please add your own tips and tricks as comments.

2017-06-08

Understanding IAM Roles in Amazon AWS

One of the most important security features of Amazon AWS is IAM Roles. They provide a security umbrella that can be adjusted to an application's needs in great detail. As I keep forgetting the details, I summarize here everything that helps me, together with some useful tricks for working with IAM Roles. This is part one of two.

Understanding IAM Roles

From a conceptual perspective an IAM Role is a sentence like Alice may eat apples: it grants or denies permissions (in the form of an access policy) on specific resources to principals. Alice is the principal, may is the grant, eat is the permission (to eat, but not to look at) and apples is the resource, in this case any kind of apples.
IAM Roles can be much more complex; for example this rather complex sentence is still a very easy-to-read IAM Role: Alice and Bob from Hamburg may find, look at, smell, eat and dispose of apples № 5 and bananas. Here we grant permissions to our Alice and to some Bob from another AWS account, we permit a whole bunch of useful actions and we allow them on one specific type of apples and on all bananas.

On AWS, Alice and Bob will be Principals, either services that run code, like EC2, or IAM Users and Roles; find, look at, smell … are specific API calls on AWS services like s3:GetObject and s3:PutObject; and apples and bananas are AWS resource identifiers like arn:aws:s3:::my-backup-bucket.

IAM Roles as JSON

IAM Roles and Policies are typically shown as JSON data structures. There are two main components:
  1. IAM Role with RoleName and AssumeRolePolicyDocument
  2. IAM Policy Document with one or several policy Statements
The AssumeRolePolicyDocument defines who can use this role and the Statements in the Policy Document define what this role can do.

A typical IAM Role definition looks like this:
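As a sketch with invented names (the structure is what matters), a role with such an AssumeRolePolicyDocument could be created with the aws-cli like this:

# Sketch: create a role that two other IAM Roles may assume.
aws iam create-role --role-name backup-writer \
    --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": [
                "arn:aws:iam::123456789:role/ec2-worker",
                "arn:aws:iam::123456789:role/PowerUser"
            ]},
            "Action": "sts:AssumeRole"
        }]
    }'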
The really important part here is the AssumeRolePolicyDocument, which defines who can actually use the role. In this case there are two other IAM roles that can make use of this role. AWS allows specifying all kinds of Principals from the same or other AWS accounts. So far this Role does not yet allow anything, but it already provides an AWS identity. To fill the Role with life you have to attach one or more Policy Documents to it. They can either be inline and stored within the Role, or they can be separate IAM Policies that are attached to the Role; AWS also provides a large number of predefined policies for common jobs.

A PolicyDocument definition looks like this:
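Again as a sketch (role and policy names invented, the bucket is the one from the example above):

# Sketch: attach an inline policy that allows reading and writing,
# but not deleting, objects in a single backup bucket.
aws iam put-role-policy --role-name backup-writer \
    --policy-name backup-bucket-access \
    --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::my-backup-bucket",
                "arn:aws:s3:::my-backup-bucket/*"
            ]
        }]
    }'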
Here we have one Statement (there could be several) that gives read and write access to a single S3 bucket. Note that it does not allow deleting objects from the bucket as this example is for a backup bucket that automatically expunges old files.

Creating IAM Roles with CloudFormation

We typically create AWS resources through CloudFormation. This example creates an S3 bucket for backups together with a matching IAM Role that grants access to the bucket:
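A hedged sketch of such a template (resource names and the PowerUser ARN pattern are assumptions, not the original stack):

# Sketch: a minimal CloudFormation template with a backup bucket and a role
# that EC2 instances and the PowerUser role may assume.
cat > backup.yaml <<'EOF'
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  BackupBucket:
    Type: AWS::S3::Bucket
  BackupRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
          - Effect: Allow
            Principal:
              AWS: !Sub arn:aws:iam::${AWS::AccountId}:role/PowerUser
            Action: sts:AssumeRole
      Policies:
        - PolicyName: backup-bucket-access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:ListBucket
                  - s3:GetObject
                  - s3:PutObject
                Resource:
                  - !GetAtt BackupBucket.Arn
                  - !Sub '${BackupBucket.Arn}/*'
EOF
aws cloudformation deploy --stack-name backup-demo \
    --template-file backup.yaml --capabilities CAPABILITY_IAM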
The role can be used either by EC2 instances or by the PowerUser role which our people typically have. This allows me to test the role from my desktop during development and for troubleshooting.

Read also Working with IAM Roles in Amazon AWS for the second part of this article with practical aspects and some tips and tricks.

2017-06-02

Root for All - A DevOps Measure?

Who has root access in your IT organization? Do you "do" DevOps? Even though getting root access was once my personal motivation for pushing DevOps, I never considered the relationship between the two until it was triggered by my last conference visit.

Last week I attended the 10th Secure Linux Administration Conference - a small but cherished German event catering to Linux admins - and there were two DevOps talks: DevOps in der Praxis (Practical DevOps) by Matthias Klein and my own DevOps for Everybody talk. I found it very interesting that we both talked about DevOps from a "been there, done it" perspective, although with a very different message.

DevOps ≠ DevOps

For me DevOps is most of all a story of Dev and Ops being equal, sitting in the same boat and working together on shared automation to tackle all problems. My favourite image replaces humans as gateway to the servers with tooling that all humans use to collaboratively deliver changes to the servers. In my world the question of root access is not one of Dev or Ops but one of areas of responsibility without discrimination.

For Matthias DevOps is much more about getting Dev and Ops to work together on the system landscape and on bigger changes, about learning to use really useful tools from the other (e.g. continuous delivery) and about developing a common understanding of the platform that both feel responsible for.

We both agree in general on the cultural aspects of DevOps: it is not a tool you can buy but rather a way of putting the emphasis on people and how they work together with respect and trust, of setting the project/product/company before personal interests, and of ascribing success or failure to a team and not to individuals.

Demystifying Root Access

So why is root access such a big deal? It only lets you do anything on a server. Anything means not only doing your regular job but also means all sorts of blunders or even mischief. I suspect that the main reason for organisations to restrict root access is the question of trust. Whom to trust to not mess things up or whom to trust to not do bad things?

So root access is considered a risk, and one of the simplest risk avoidance strategies is limiting the number of people with root access. If a company were able to blindly trust all its employees with root access without significantly increasing the overall risk, then this might not be such a big topic.

Interestingly root access to servers is handled differently than database access with credentials that can read, write and delete any database entry. I find this very surprising as in most cases a problem with the master database has a much bigger impact than a problem with a server.

Root = Trust

If root access is a question of trust then that gives us the direct connection to DevOps. DevOps is all about people working together and sharing responsibilities. It is therefore most of all about building up trust between all people involved and between the organisation and the teams.

The other question is, do we place more trust into our people or into the automation we build? Traditional trust models are entirely people based. DevOps thinking strongly advocates building up automation that takes over the tedious and the risky tasks. 

If there is a lot of trust then the question of root access becomes meaningless. In an ideal DevOps world all the people only work on the automation that runs the environment. The few manual interactions that are still required can be handled by everyone on the team.

As both trusting people and building trustworthy automation leads to a situation where it is acceptable to grant root access to everyone, you can actually use root access as a simple and clear measurement for your organisation's progress on the DevOps journey.

Current Status

To find out the current status I did a small poll on Twitter:

As I did not ask about the status of DevOps adoption we cannot correlate the answers. What I do interpret from the result is that there is enough difference in root access in different companies to try to learn something from that. And I am very happy to see that 34% of answers give broad root access.

Embrace the Challenge

You will hear a lot of worries or meet the belief that giving root to everybody is impossible in your organisation, e.g. due to regulatory requirements. These are valid fears and one should take them seriously. However, the solution is not to stop automating but rather to incorporate the requirements into the automation and make it so good that it can also solve those challenges.

The question of root access is still very provocative. Use that fact to start the discussion that will lead to real DevOps-style automation, build a dashboard to show root access as a KPI, and start to build the trust and the automation that you need to give root access to everyone.

I'll be happy if you share your own example for the DevOps - root correlation (or the opposite) in the comments.

2017-05-26

Is Cloud Native the new Linux?

The CloudNativeCon + KubeCon Europe 2017 in Berlin was sold out with 1500 participants. I really learned a lot about Kubernetes and the other new and shiny tools that are starting to become mainstream.

To get an introduction into Cloud Native, watch Alexis Richardson in the keynote on "What is Cloud Native and Why Should I care" (slides, video at 12:27). He explained the goal of the Cloud Native Computing Foundation (CNCF) as avoiding cloud lock-in, which is much more to the point than the official charter (which talks about "the adoption of a new computing paradigm"). Alexis chairs the Technical Oversight Committee (TOC) of the CNCF. The Foundation is "projects first", set up similar to the Linux Foundation and already sponsors various Open Source projects.

Linux Lock-In

His remarks got me thinking about the question of lock-in, especially in comparison with Linux. To me it seems that modern IT in the data center already has a pretty strong "lock-in" with Linux. It seems like most public servers on the Internet already run Linux. So what is bad about this lock-in with Linux? Apparently nothing much, really. But do we really have a lock-in with Linux, or actually with a specific Linux distribution? I know very few people who changed their distro, say from Red Hat to Debian. If Red Hat becomes too expensive then people switch to free CentOS instead, but don't want to (or cannot afford to) change all their tooling and system setup.

So even though Linux is - at its core - always Linux, in practice there is a big difference between running an application on Debian, Red Hat, SUSE, Gentoo, Archlinux or others. There are even relevant differences between closely related distributions like Debian and Ubuntu or between Red Hat and CentOS.

So while we talk about the freedom of choice with Linux, we very seldom make use of it. When dealing with commercial software on Linux we also don't require our software vendors to support "our" Linux stack. Instead, we typically accept the Linux distro a software vendor prescribes and feel happy that Linux is supported at all.

Cloud Platforms

So far the cloud landscape is indeed very different from the Linux landscape. With the cloud vendors of today we have completely different and totally incompatible ecosystems. Code written for one cloud, e.g. Amazon AWS, actually does not work at all on another cloud, e.g. Google GCP. I mean here the code that deals with cloud features like deployment automation or that uses cloud services like object storage. Of course the code that runs your own application is always the same, except for the parts interfacing with the cloud platform.

All Linux distributions will give you the exact same PostgreSQL as relational database or the exact same Redis as key-value store. Clouds on the other hand give you different and incompatible implementations of similar concepts, for example AWS DynamoDB and Google Cloud Datastore. That would be as if every Linux distribution shipped a different and incompatible database.

With public clouds we - maybe for the first time - come to a situation where it is possible to build complex and very advanced IT environments without building and operating all of the building blocks on our own. It is a fact that many companies move from self-hosted data centers into the public cloud in order to benefit from the ready services found there. Cloud providers easily out-innovate everybody else with regard to infrastructure automation and service reliability while offering pay-per-use models that avoid costly upfront investments or long commitments.

Cloud Lock-In

However, anyone using a public cloud nowadays will face a very tough choice: Use all the features and services that the cloud provider offers or restrict oneself to the common functions found in every public cloud. One comes with a fairly deep technological buy-in while the other comes with a promise of easily replacing the cloud vendor with another one.

I don't think that this hope holds true. In my opinion the effort spent on operational automation, monitoring and other peripheral topics leads to a similarly deep buy-in with any given cloud vendor. Switching to another vendor will be a disruptive operation that companies will undertake only in case of real need, just like switching Linux distributions.

The same holds true for an environment that utilises all possible services of a cloud vendor. Switching platforms will be costly, painful and only done based on real need. I think that the difference in "lock-in" between using all cloud services and using only basic infrastructure services is only a gradual one and not a difference in principle. Whenever we use any kind of public - or even private - cloud platform there is a smaller or most likely larger amount of lock-in involved.

Cloud Native

If Cloud Native hopes to break the cloud lock-in then the goal must be to develop advanced services that become a de facto standard for cloud services. Once enough vendors pick up on those services, public and private clouds will indeed be as portable and compatible as Linux distributions.

So far Cloud Native is focused mostly on basic infrastructure software and not on advanced services. My hope is that over time this will change and that there will be a standard environment with advanced services, similar to how the Linux and Open Source world gives us a very rich tool chest for almost every problem.

Furthermore, I am not worried about the technical lock-in with today's large cloud vendors like Amazon or Google. While their back-end software is proprietary, the interfaces are public or even Open Source, and they don't prescribe which OS or client we must use to access their services. This is much more than we ever had with traditional commercial vendors who forced us to use outdated software in order to be "supported".

Embracing Change

If we understand the development of our IT environments as an iterative process then it becomes clear that we can always build the next environment on the next cloud platform and migrate our existing environments if there is a real benefit or return on investment. And if there is none then we can simply keep running them as they are. With the current fashion of building micro services, each environment is in any case much smaller than the systems we built 10 or 20 years ago. Therefore the cost of lock-in and of a migration is equally much smaller compared to a migration of an entire data center.

In today's fast paced and competitive world the savings and benefits of quickly developing new environments with advanced services outweigh the risk of lock-in, especially as we know that every migration will make our systems better.