2016-05-03

OSDC 2016 - Hybrid Cloud

The Open Source Data Center Conference 2016 is a good measure of how the industry is changing. Compared to 2014, Cloud topics take up more and more space: both how to build your own on-premise cloud with Mesos, CoreOS or Kubernetes and how to use the public Cloud.

Maybe not surprisingly, I used the conference to present my own findings from two years of Cloud migration at ImmobilienScout24:

After we first tried to find a way to quickly migrate our data centers into the Cloud, we now see that a hybrid approach works better. Data center and cloud are both valued platforms, and we will optimize the costs between them.

Hybrid Cloud - A Cloud Migration Strategy

Do you use the Cloud? Why? What about the 15-year legacy of your data center? How many Enterprise vendors have tried to sell you their "Hybrid Cloud" solution? What actually is a Hybrid Cloud?

Cloud computing is not just a new way of running servers or Docker containers. The interesting part of any Cloud offering is the managed services that provide solutions to difficult problems. Prime examples are messaging (SNS/SQS), distributed storage (S3), managed databases (RDS) and especially turn-key solutions like managed Hadoop (EMR).

Hybrid Cloud is usually understood as a way to unify or standardize server hosting across private data centers and Public Cloud vendors. Some Hybrid Cloud solutions even go as far as providing a unified API that abstracts away all the differences between different platforms. Unfortunately that approach focuses on the lowest common denominator and effectively prevents using the advanced services that each Cloud vendor also offers. However, these services are the true value of Public Cloud vendors.

Another approach to integrating Public Cloud and private data centers is using services from both worlds depending on the problems to solve. Don't hide the cloud technologies but make it simple to use them - both from within the data center and the cloud instances. Create a bridge between the old world of the data center and the new world of the Public Cloud. A good bridge will motivate your developers to move the company to the cloud.

Based upon recent developments at ImmobilienScout24, this talk tries to suggest a sustainable Cloud migration strategy from private data centers through a Hybrid Cloud into the AWS Cloud.
  • Bridging the security model of the data center with the security model of AWS.
  • Integrating the AWS identity management (IAM) with the existing servers in the data center.
  • Secure communication between services running in the data center and in AWS.
  • Deploying data center servers and Cloud resources together.
  • Service discovery for services running both in the data center and AWS.
Most of the tools used are Open Source and this talk will show how they come together to support this strategy.
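
As a small illustration of the IAM bridging point above, here is a minimal sketch (Python with boto3) of how a service in the data center could exchange its identity for temporary AWS credentials via STS and then call a managed AWS service. The role ARN, session name and bucket name are hypothetical placeholders and not the actual ImmobilienScout24 tooling.

    # Minimal sketch: a data center service assumes a bridge role in AWS and
    # uses the resulting short-lived credentials to talk to a managed service.
    import boto3

    def credentials_for_datacenter_service(role_arn, session_name):
        """Exchange the caller's AWS credentials for short-lived role credentials."""
        sts = boto3.client("sts")
        response = sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
        return response["Credentials"]

    # Hypothetical bridge role that data center services are allowed to assume.
    creds = credentials_for_datacenter_service(
        "arn:aws:iam::123456789012:role/datacenter-bridge", "legacy-batch-job")

    # Use the temporary credentials to access a managed AWS service (here: S3).
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print(s3.list_objects_v2(Bucket="example-shared-config").get("KeyCount", 0))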

As soon as the video is published I will update the talk here.

2016-02-19

You can't control internal public data

Every platform has some data that is relevant either for all applications or for many applications in different parts of the platform.

The "obvious" solution to this problem is to make such data internally public or world-readable, meaning that the entire platform can read it.

The "obvious" solution to security in this case is actually having no security beyond ensuring the "are you part of us?" question.

Common implementations of this pattern are world-readable NFS shares, S3 buckets readable by all "our" AWS accounts, HTTP APIs that use the client IP as their sole access control mechanism etc.

This approach is really dangerous and should be used with care. The risks include:

  • You most likely don't know who actually needs the data and who does not. If you ever need to restrict access you will have a very long and tedious job ahead of you.
  • You don't know who accessed the data for which purpose.
  • After a data leak, you probably won't know how it happened.
  • The data is only as safe as the weakest application in your platform.
This last danger is the real problem here. Imagine that you share personally identifiable information with all your internal systems, either out of convenience or even unknowingly. The weakest app in your portfolio will be used to hack your platform, and it will be used to copy all that sensitive data.

It is not a question of "if" but rather of "when", as we can learn from Facebook: Instagram's Million Dollar Bug is a must-read for everybody! And please also read the case study for defense that details the learnings from it.

One key takeaway is the folly of having a central, internally world-readable S3 bucket with the configuration of the entire platform. A simple grep over that data will instantly yield all the other S3 buckets in use, especially those that are internally world-readable. A real attacker would have proceeded to copy all "interesting" data he could find to an S3 bucket of his own. In most setups, this operation would not leave any trace except that the exploited system read all the S3 buckets. Facebook was just very lucky that their flaw was found by a security researcher who played by the rules.

Instead of going the convenient way, it pays to invest early in a minimum level of security, for example by doing everything with a least-privilege model. The result is a detailed list of access grants that you can use for security audits and to draw a risk map. The risk map will give you all the systems that have a risk of leaking the data.
Risk Map: Services 1, 2, 4, 5 and 8 define the safety of the personal data.
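
A minimal sketch of what such a least-privilege grant could look like on S3: instead of making the bucket internally world-readable, the bucket policy lists exactly the roles that may read it. The bucket name and role ARNs below are hypothetical examples, not our actual setup.

    # Minimal sketch: grant read access only to an explicit list of roles
    # instead of making the bucket readable by all internal accounts.
    import json
    import boto3

    bucket = "example-personal-data"           # hypothetical bucket with sensitive data
    allowed_readers = [                        # the explicit list of access grants
        "arn:aws:iam::111111111111:role/service-1",
        "arn:aws:iam::222222222222:role/service-2",
    ]

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowOnlyListedReaders",
            "Effect": "Allow",
            "Principal": {"AWS": allowed_readers},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::%s/*" % bucket,
        }],
    }

    # The same list of grants can feed security audits and the risk map.
    boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
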
Yes, this means a bit more work, but the payoff is almost immediate. Especially if you partition your platform into smaller parts you should be careful not to destroy the security benefits of that partition. At ImmobilienScout24 we use many AWS accounts, among other things to limit the "blast radius" of potential hazards. The blast radius concept applies not only to infrastructure but also to data. Everybody with a partitioned platform should pay attention that the partitioning covers all aspects of the platform and not only the infrastructure level.

Finally, if you are subject to German Data Privacy Laws (Bundesdatenschutzgesetz) then there are special regulations against uncontrolled spread of data (Weitergabekontrolle). Having personally identifiable information that is internally world-readable probably violates this law. If you can't be 100% sure that your internal public data is clean then it is much safer and easier to just automate the access control.

To conclude, I strongly advise making only such data world-readable that you can afford to have leaked into the open.

2016-02-07

Go Faster - DevOps & Microservices

At the microXchg 2016 last week Fred George - who takes pride in having been called a hand grenade - gave a very inspiring talk about how all the things that we do right now have one primary goal:


Go Faster

Reducing cycle time for deployments, automation everywhere, down-sizing services to "microservices", building resilient and fault-tolerant platforms and more are all facets of a bigger journey: provide value faster and find out faster what works and what does not.

DevOps

DevOps is seen by most developers as being an Ops movement to catch up with developers before their jobs become obsolete. Attending various DevOps Days in Germany and the USA, the developers who were also there always complained about the lack of developers and the lack of developer topics. They observed that the conference seemed to be by and for Ops people. Consequently, DevOps conferences usually have two tracks: Methods and Tools.

Methods talks teach us how to do "proper" software development also in infrastructure engineering and to follow agile software development practices. Tools talks try to make us believe that you cannot be a good DevOps unless you use Puppet, Chef or Ansible. The success story talks all emphasize how "DevOps Tools", shared responsibility and a newly formed "DevOps Team" saved the day. In more recent years the tools focus on building private clouds with Docker and on managing distributed storage.

In fact, DevOps is all about being faster through shared responsibility, mutual respect between different knowledge bearers and building cross-functional teams with full vertical responsibility for their work.

Microservices

Microservices is definitely an important hype amongst developers. Seasoned Ops people see it as the obvious thing to do, just like the well-known Unix philosophy teaches:

The Unix philosophy emphasizes building simple, short, clear, modular, and extensible code that can be easily maintained and repurposed by developers other than its creators. The Unix philosophy favors composability as opposed to monolithic design. Source: Wikipedia
Applying all that to systems design is a straight road to microservices. When going from millions of lines of code to just a few thousand, and from 5 applications to 500, the glue code between all those applications suddenly becomes the governing system.

Service discovery, managing large numbers of micro instances, network latency, cyclic dependency trees etc. are all areas of expertise of Ops people. They have been dealing with these questions for the last 20 years!

Microservice success stories, like the one from SoundCloud also shown at microXchg 2016, show how more and more glue and abstraction layers were introduced into the emerging microservices architecture to compensate for the degradation that came along with the exploding complexity of their microservices landscape.

Much of that could also be learned from modern Linux operating systems. A nice example is systemd which drives its own "microservices" revolution, just on a smaller scale within a single Linux computer.

Looking at the tools track of microservices events, it is no surprise to find Docker in a dominating role here as well.

Common Values & Concepts

I don't want to argue that Docker is the common topic that everybody should care about. After all, Docker is just this year's hype implementation of an operating concept. Savvy sysadmins were doing the same thing with chroot or OpenVZ a long time ago. And in a few years we will probably have something even better for the same job.

What really brings these topics together are a lot of shared values and concepts (in no particular order):
  • KISS approach
  • Right-sizing everything to an easily managed size: microservices, two pizza teams, iterative solutions to problems
  • Full stack responsibility
  • Automate everything, especially the glue between all those small components
  • Observe-Orient-Decide-Act loops in different forms and fashions
As long as we keep our core values in mind, the actual technology or methodology doesn't matter so much. We will still achieve our goals. Just much faster.

2016-02-05

Cloud Migration ≈ Microservices Migration

Day two at the microXchg 2016 conference. After listening to yet another talk detailing the pitfalls and dangers of "doing it wrong" I see more and more similarities between the Cloud migration at ImmobilienScout24 and the microservices journey that most speakers present.
The Cloud migration moves us from a large data center into many smaller AWS accounts. A (legacy) monolithic application is cut into many smaller microservices.

Internal data center communication becomes exposed communication between different AWS accounts and VPCs. Internal function calls are replaced with remote API calls. Both require much more attention to security, necessitate an authentication framework and add significant latency to the platform.

A failed data center takes down the entire platform while a failed AWS account will only take down some function. An uncaught exception will crash the entire monolith while a crashed microservice will leave the others running undisturbed.

Internal service dependencies turn into external WAN dependencies. Library dependencies inside the monolith turn into service dependencies between microservices. Cyclic dependencies remain a deadly foe.

Team responsibilities shift from feeling responsible for a small part of the platform to being responsible for entire AWS accounts or only their own microservices.

And much more...

Learnings

If it looks similar, maybe we can learn something from this. I strongly believe that many structural and conceptual considerations apply equally to a Cloud migration and to a microservices journey:
  • Fighting complexity through downsizing.
  • Complexity shifts from inside to outside. New ways to manage this complexity emerge.
  • Keeping latency in check is a key factor for success.
  • Much more advanced tooling is needed to properly handle the scale-out of managed entities.
  • Less centralization of common concerns leads to more wasted effort and resources. Accept this.
  • Success and failure hang on finding the right seams to cut.
  • "Just put it somewhere" usually doesn't work at all.
  • Integration tests become more important and difficult.
I learned a lot at this conference, both about microservices and about the direction our Cloud migration should go.

Please add your learnings in the comments.

2016-02-04

AWS Account Right-Sizing

Today I attended the microXchg 2016 conference in Berlin. I suddenly realized that going to the cloud allows us to ask completely new questions that are impossible to ask in the data center.

One such question is this: What is the optimum size for a data center? Microservices are all about downsizing - and in the cloud we can and should downsize the data center!

In the world of physical data centers the question is usually governed by two factors:

  • Ensuring service availability by having at least two physical data centers.
  • Packing as much hardware into as little space as possible to keep the costs in check.
As long as we are smaller than the average Internet giant there is no point in asking about the optimum size. The tooling we build has to be designed both for large data centers and for having more than one of them. But in the "1, 2, many" series, "2" is just the worst place to be. It entails all the disadvantages of "more than 1" without any of the benefits of "many".

In the cloud the data center is purely virtual. On AWS the closest thing to a "data center" is a Virtual Private Cloud (VPC) in an AWS Region in an AWS Account. But unlike a physical data center, that VPC is already highly available and offers solid redundancy through the concept of Availability Zones.

If an AWS Account has multiple VPCs (either in the same region or in different regions), then we should see it as actually being several separate data centers. All the restrictions of multiple data centers also apply to having multiple VPCs: higher (than local) latency, traffic costs, traversing the public Internet etc.

To understand more about the optimum size of a cloud data center we can compare three imaginary variants. I combine EC2 instances, Lambda functions, Beanstalk etc. all into "code running" resources. IMHO it does not matter how the code runs in order to estimate the management challenges involved.



|                                                  | Small VPC                                   | Medium VPC                                                        | Large VPC                                                                     |
|--------------------------------------------------|---------------------------------------------|-------------------------------------------------------------------|-------------------------------------------------------------------------------|
| Number of code running resources                 | 50                                          | 200                                                                 | 1000                                                                          |
| Number of CloudFormation stacks (10 VMs per stack) | 5                                         | 20                                                                  | 100                                                                           |
| Service discovery                                | manually                                    | simple tooling, e.g. git repo with everything in it                 | elaborate tooling: Etcd, Consul, Puppet ...                                   |
| Which application is driving the costs?          | eyeball inspection - just look at it        | tagging, Netflix ICE ...                                            | complex tagging, maintain an application registry, pay for Cloudhealth ...    |
| Deployment                                       | manually operated CloudFormation is a viable option | simple tooling like cfn-sphere, autostacker24 ...           | multi-tiered tooling like Spinnaker or other large solutions                  |
| Security model                                   | everyone related is admin                   | everyone related is admin, must have strong traceability of changes | probably several levels of access, separation of duties and so on            |
| ... whatever ...                                 | dumb and easy                               | simple                                                              | complex and complicated                                                       |

Having a large VPC with a lot of resources obviously requires much more elaborate tooling while a small VPC can be easily managed with simple tooling. In our case we have a 1:1 relationship between a VPC and an AWS account. Accounts that work in two regions (Frankfurt and Ireland) have 2 VPCs but that's it.

I strongly believe that scaling small AWS accounts together with the engineering teams who use them will still allow us to keep going with simple tooling. Even if the absolute total of code running resources is large, splitting it into many small units reduces the local complexity and allows the responsible team to manage their area with fairly simple tooling. Here we use the power of "many" and invest into handling many AWS accounts and VPCs efficiently.

On the overarching level we can then focus on aggregated information (e.g. costs per AWS account) without bothering about the internals of each small VPC.
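
To make the "power of many" idea a bit more tangible, here is a minimal sketch of the kind of simple tooling this enables: iterate over many small AWS accounts via a cross-account role and collect one piece of aggregated information per account. The account IDs, role name and region below are hypothetical; this is not our actual tooling.

    # Minimal sketch: aggregate information across many small AWS accounts
    # without looking into the internals of each VPC.
    import boto3

    ACCOUNTS = ["111111111111", "222222222222", "333333333333"]  # hypothetical account IDs

    def session_for_account(account_id, role_name="audit"):
        """Assume a cross-account role and return a boto3 session for that account."""
        creds = boto3.client("sts").assume_role(
            RoleArn="arn:aws:iam::%s:role/%s" % (account_id, role_name),
            RoleSessionName="account-overview",
        )["Credentials"]
        return boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )

    # One aggregated number per account, e.g. how many CloudFormation stacks it runs.
    for account_id in ACCOUNTS:
        cfn = session_for_account(account_id).client("cloudformation", region_name="eu-west-1")
        pages = cfn.get_paginator("describe_stacks").paginate()
        stack_count = sum(len(page["Stacks"]) for page in pages)
        print("%s: %d CloudFormation stacks" % (account_id, stack_count))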

I therefore strongly advise keeping your data centers small. This will also nicely support an affordable Cloud Exit Strategy.

2015-12-30

Docker Appliance as Linux Service RPM

Docker provides a convenient way to package entire applications into runnable containers. OTOH in the data center we use RPM packages to deliver software and configuration to our servers.

This wrapper builds a bridge between Docker appliances and Linux services by packaging a Docker image as a Linux service into an RPM package.

The resulting Linux service can simply be used like any other Linux service; for example, start it with service schlomo start.


See the GitHub repo at https://github.com/ImmobilienScout24/docker-service-rpm for code and more details and please let me know if you find this useful.

2015-08-21

Signet Ring = Early 2 Factor Authentication

Photo: A. Reinkober / pixelio.de
I recently met somebody who had a signet ring and suddenly realized that this is a very early form of 2-factor-authentication (2FA):

| Signet Ring                                  | 2FA                                              |
|----------------------------------------------|--------------------------------------------------|
| Unique                                       | Unique                                           |
| Difficult to copy                            | Supposedly impossible to copy                    |
| Seal proves personal involvement of bearer   | 2FA token proves personal interaction of owner   |

The main difference is of course that 2FA is commonly available to everybody who needs it while signet rings were and remain a special feature. But it is still nice to know that the basic idea is several thousand years old.

2015-08-07

Cloud Exit Strategy

As ImmobilienScout24 moves to the cloud a recurring topic is the question about the exit strategy. An exit strategy is a plan for migrating away from the cloud, or at least from the chosen cloud vendor.

Opinions range from "why would I need one?" to "how can we not have one?" with a heavy impact on our cloud strategy and how we do things in the cloud.

When talking about exit scenarios it is worth distinguishing between a forced and a voluntary exit. A forced exit happens due to external factors that don't leave you any choice about when to go. A voluntary exit happens at your own choice, both when and how.

Why would one be forced to have an exit strategy? Simply because running a business on cloud services carries other types of risks compared to running a business in your own data center:
  • Cloud accounts can be disabled for alleged violation of terms
  • Cloud accounts can be terminated
  • There are no guaranteed prices. Running costs can explode as a result of a new pricing model
  • The cloud vendor can discontinue a service that you depend on
  • Lost cloud credentials combined with weak security can be disastrous (learn from Codespaces)
  • If the cloud vendor is down you can either hope and wait or start your website somewhere else, if you were prepared. In the data center you can try all sorts of workarounds and fixes - but you must do that all yourself.
  • ... fill in your own fear and bias against the cloud ...
A voluntary exit can easily happen after some time because:
  • Another cloud vendor is cheaper, better or solves problems that your current vendor doesn’t care about
  • You are bought by another company and they run everything in another cloud, forcing you to migrate
  • ... who knows what the future will bring?
Probably there is no perfect answer that fits everybody. Besides just ignoring the question I personally see two major options:
  1. Use only IaaS (e.g. servers, storage, network) and no PaaS (fancy managed services) from the cloud, so that it is easy to migrate to another cloud vendor or to a private cloud. The big disadvantage is that you won't be able to benefit from all the cool managed services that make the cloud an interesting place to be.
  2. Use many cloud providers or accounts (e.g. matching your larger organisational units) to reduce the "blast radius" and keep the communication between them vendor-independent. If something happens to one of them the damage is limited in scope and everything else keeps working. The disadvantage is that you add complexity and other troubles by dealing with a widely distributed platform.
I prefer the second option because it lays the ground for a voluntary exit while still keeping most of the advantages of the cloud as an environment and ecosystem. In case of a forced exit there is a big problem, but that could be solved with lots of resources. A forced exit for a single account can be handled without harming the other accounts and their products. As another benefit there is not much premature optimization for the exit case.

Whatever you do - I believe that having some plan is better than not having any plan.

2015-07-15

DevOps Berlin Meetup 2015-07

Is Amazon good for DevOps? Maybe yes, maybe no. But for sure the new Berlin office is good for a Berlin DevOps Meetup.

Jonathan Weiss gave a short overview of the engineering departments found there: AWS OpsWorks, AWS Solution Architects, Amazon EC2, Machine Learning.

Michael Ducy (Global Partner Evangelist at Chef Software) talked about DevOps and told the usual story. Michael used goats and silos as a metaphor and built his talk around the famous goat and silo problem. He sees the "IT manufacturing process" as silos (read History of Silos for more about that) and DevOps-minded people as goats: multi-purpose, versatile, smart and stubborn at reaching their goals.
The attendees of the DevOps event probably did not need much convincing, but the talk was nevertheless very entertaining. Michael has an MBA and also gave some useful insights into how organisations evolve into silos and how organisational "kingdoms" develop.

The talk is available as video: 15min from Jan 2015 and 24min from Dec 2013. The slides are available on Slideshare.

As a funny side note it turns out that Amazon even rents out goats: Amazon Hire a Goat Grazer. However it seems that this offer is about real goats and not DevOps engineers.

2015-07-10

ImmobilienScout24 Social Day at the GRIPS Theater

Today I went to the GRIPS Theater (English) instead of the office. Once a year ImmobilienScout24 donates its workforce to social projects, on a so-called Social Day. I used the opportunity to catch a glimpse behind the stage. The theater in turn got a workshop from us about their website and social media channels.

But first we watched a very nice children's show (Ein Fest bei Baba Dengiz) about a German guy who learned respect for foreigners - from another German with a Turkish background. The show was well adapted to the school-age audience.

The theater follows a somewhat unusual concept and places the stage in the middle of the audience:
Photo courtesy of the GRIPS Theater
This was my first visit to the GRIPS Theater, but not the last. Besides a rich children's programme the theater also offers shows for adults and is best known for the show Linie 1.