2016-02-19

You can't control internal public data

In every platform there is some data that is relevant either to all applications or to many applications in different parts of the platform.

The "obvious" solution to this problem is to make such data internally public or world-readable, meaning that the entire platform can read it.

The "obvious" solution to security in this case is actually having no security beyond ensuring the "are you part of us?" question.

Common implementations of this pattern are world-readable NFS shares, S3 buckets readable by all "our" AWS accounts, HTTP APIs that use the client IP as their sole access control mechanism etc.
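
For illustration, here is a hedged sketch of the S3 variant of this pattern; the bucket name and account IDs are made up, but a bucket policy that is readable by all "our" AWS accounts looks roughly like this:

```python
import json
import boto3

# Hypothetical bucket name and account IDs -- the point is that *every*
# account of the platform gets read access, regardless of actual need.
INTERNAL_ACCOUNTS = ["111111111111", "222222222222", "333333333333"]

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "InternallyWorldReadable",
        "Effect": "Allow",
        "Principal": {
            "AWS": ["arn:aws:iam::" + account + ":root" for account in INTERNAL_ACCOUNTS]
        },
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::platform-shared-data",
            "arn:aws:s3:::platform-shared-data/*",
        ],
    }],
}

# One API call and the whole platform can read the bucket.
boto3.client("s3").put_bucket_policy(
    Bucket="platform-shared-data", Policy=json.dumps(policy)
)
```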

This approach is really dangerous and should be used with care. The risks include:

  • You most likely don't know who actually needs the data and who does not. If you ever need to restrict access, you will have a very long and tedious job ahead of you.
  • You don't know who accessed the data for which purpose.
  • After a data leak, you probably won't know how it happened.
  • The data is only as safe as the weakest application in your platform.
This last danger is the real problem here. Imagine that you share personally identifiable information with all your internal systems, either out of convenience or even unknowingly. The weakest app in your portfolio will be used to hack your platform and to copy all that sensitive data.

It is not a question of "if" but rather a question of "when", as we can learn from Facebook: Instagram's Million Dollar Bug is a must-read for everybody! And please also read the case study for defense that details the lessons learned from it.

One key takeaway is the folly of having a central, internally world-readable S3 bucket with the configuration of the entire platform. A simple grep over that data will instantly yield all the other S3 buckets in use, especially those that are internally world-readable. A real attacker would have proceeded to copy all "interesting" data he could find to an S3 bucket of his own. In most setups, this operation would not leave any trace except that the exploited system read all the S3 buckets. Facebook was just very lucky that their flaw was found by a security researcher who played by the rules.
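
To get a feeling for how little effort this reconnaissance takes, here is a minimal sketch (the local path and the regular expression are my own assumptions) that scans a dumped configuration tree for S3 bucket references:

```python
import re
from pathlib import Path

# Hypothetical local copy of the "central configuration" bucket.
CONFIG_DUMP = Path("/tmp/platform-config")

# Match s3://bucket or arn:aws:s3:::bucket style references.
BUCKET_RE = re.compile(r"(?:s3://|arn:aws:s3:::)([a-z0-9][a-z0-9.-]{2,62})")

buckets = set()
for path in CONFIG_DUMP.rglob("*"):
    if path.is_file():
        text = path.read_text(errors="ignore")
        buckets.update(BUCKET_RE.findall(text))

# Every hit is a candidate target for the next copy operation.
print("\n".join(sorted(buckets)))
```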

Instead of going the convenient way, it pays to invest early in a minimum level of security, for example by doing everything with a least-privilege model. The result is a detailed list of access grants that you can use for security audits and to draw a risk map. The risk map will show you all the systems that carry a risk of leaking the data.
Risk Map: Services 1, 2, 4, 5 and 8 define the safety of the personal data.
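
A minimal sketch, assuming boto3 credentials that may read the bucket policies, of how such an access-grant list could be collected automatically; it simply records which principals are allowed on which bucket:

```python
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
grants = {}  # bucket name -> principals that are granted access

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        policy = json.loads(s3.get_bucket_policy(Bucket=name)["Policy"])
    except ClientError:
        continue  # bucket has no policy attached
    principals = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", "*")
        if principal == "*":
            principals.append("*")  # world-readable!
        else:
            aws = principal.get("AWS", [])
            principals.extend(aws if isinstance(aws, list) else [aws])
    grants[name] = sorted(set(principals))

# This mapping is the raw material for security audits and the risk map.
print(json.dumps(grants, indent=2))
```
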
Yes, this means a bit more work, but the payoff is almost immediate. Especially if you partition your platform into smaller parts, you should be careful not to destroy the security benefits of that partitioning. At ImmobilienScout24 we use many AWS accounts, among other things to limit the "blast radius" of potential hazards. The blast radius concept applies not only to infrastructure but also to data. Everybody with a partitioned platform should make sure that the partitioning covers all aspects of the platform and not only the infrastructure level.

Finally, if you are subject to the German Data Privacy Law (Bundesdatenschutzgesetz), then there are special regulations against the uncontrolled spread of data (Weitergabekontrolle). Having personally identifiable information that is internally world-readable probably violates this law. If you can't be 100% sure that your internal public data is clean, then it is much safer and easier to just automate the access control.

To conclude, I strongly advise making only such data world-readable that you can afford to leak into the open.

2016-02-07

Go Faster - DevOps & Microservices

At microXchg 2016 last week, Fred George - who takes pride in having been called a hand grenade - gave a very inspiring talk about how all the things that we do right now have one primary goal:


Go Faster

Reducing cycle time for deployments, automation everywhere, down-sizing services to "microservices", building resilient and fault-tolerant platforms and more are all facets of a bigger journey: Provide value faster and find out faster what works and what does not.

DevOps

DevOps is seen by most developers as being an Ops movement to catch up with developers before their jobs become obsolete. At the various DevOps Days I attended in Germany and the USA, the developers who were also there always complained about the lack of developers and the lack of developer topics. They observed that the conference seemed to be by and for Ops people. Consequently, DevOps conferences usually have two tracks: Methods and Tools.

Method talks teach us how to do "proper" software development also in infrastructure engineering and to follow agile software development practices. Tools talks try to make us believe that you cannot be good at DevOps unless you use Puppet, Chef or Ansible. The success story talks all emphasize how "DevOps tools", shared responsibility and a newly formed "DevOps team" saved the day. In more recent years the tools focus on building private clouds with Docker and on managing distributed storage.

In fact, DevOps is all about being faster through shared responsibility, mutual respect between different knowledge bearers and cross-functional teams with full vertical responsibility for their work.

Microservices

Microservices are definitely a big hype amongst developers. Seasoned Ops people see them as the obvious thing to do, just as the well-known Unix philosophy teaches:

The Unix philosophy emphasizes building simple, short, clear, modular, and extensible code that can be easily maintained and repurposed by developers other than its creators. The Unix philosophy favors composability as opposed to monolithic design. Source: Wikipedia
Applying all that to systems design is a straight road to microservices. When going from millions of lines of code to just a few thousand, and from 5 applications to 500, the glue code between all those applications suddenly becomes the governing system.

Service discovery, managing large numbers of micro instances, network latency, cyclic dependency trees etc. are all areas of expertise of Ops people. They have been dealing with these questions for the last 20 years!

Microservices success stories, like the one from SoundCloud also shown at microXchg 2016, show how more and more glue and abstraction layers were introduced into the emerging microservices architecture to compensate for the degradation that came along with the exploding complexity of their microservices landscape.

Much of that could also be learned from modern Linux operating systems. A nice example is systemd, which drives its own "microservices" revolution, just on a smaller scale within a single Linux computer.

Looking at the tools track of microservices events, it is no surprise to find Docker in a dominating role here as well.

Common Values & Concepts

I don't want to argue that Docker is the common topic that everybody should care about. After all, Docker is just this year's hyped implementation of an operating concept. Savvy sysadmins were doing the same thing with chroot or OpenVZ a long time ago. And in a few years we will probably have something even better for the same job.

What really brings these topics together are a lot of shared values and concepts (in no particular order):
  • KISS approach
  • Right-sizing everything to an easily managed size: microservices, two pizza teams, iterative solutions to problems
  • Full stack responsibility
  • Automate everything, especially the glue between all those small components
  • Observe-Orient-Decide-Act loops in different forms and fashions
As long as we keep our core values in mind, the actual technology or methodology doesn't matter so much. We will still achieve our goals. Just much faster.

2016-02-05

Cloud Migration ≈ Microservices Migration

Day two at the microXchg 2016 conference. After listening to yet another talk detailing the pitfalls and dangers of "doing it wrong" I see more and more similarities between the Cloud migration at ImmobilienScout24 and the microservices journey that most speakers present.
The Cloud migration moves us from a large data center into many smaller AWS accounts. A (legacy) monolithic application is cut into many smaller microservices.

Internal data center communication becomes exposed communication between different AWS accounts and VPCs. Internal function calls are replaced with remote API calls. Both require much more attention to security, necessitate an authentication framework and add significant latency to the platform.

A failed data center takes down the entire platform while a failed AWS account will only take down some function. An uncaught exception will crash the entire monolith while a crashed microservice will leave the others running undisturbed.

Internal service dependencies turn into external WAN dependencies. Library dependencies inside the monolith turn into service dependencies between microservices. Cyclic dependencies remain a deadly foe.

Team responsibilities shift from feeling responsible for a small part of the platform to being responsible for entire AWS accounts or only their own microservices.

And much more...

Learnings

If it looks similar, maybe we can learn something from this. I strongly believe that many structural and conceptual considerations apply equally to a Cloud migration and to a microservices journey:
  • Fighting complexity through downsizing.
  • Complexity shifts from inside to outside. New ways to manage this complexity emerge.
  • Keeping latency in check is a key factor for success.
  • Much more advanced tooling is needed to properly handle the scale-out of managed entities.
  • Less centralization of common concerns leads to more wasted effort and resources. Accept this.
  • Success and failure hangs on finding the right seams to cut.
  • "Just put it somewhere" usually doesn't work at all.
  • Integration tests become more important and difficult.
I learned a lot at this conference, both about microservices and about the direction our Cloud migration should go.

Please add your learnings in the comments.

2016-02-04

AWS Account Right-Sizing

Today I attended the microXchg 2016 conference in Berlin. I suddenly realized that going to the cloud allows us to ask completely new questions that are impossible to ask in the data center.

One such question is this: What is the optimum size for a data center? Microservices are all about downsizing - and in the cloud we can and should downsize the data center!

In the world of physical data centers the question is usually governed by two factors:

  • Ensuring service availability by having at least two physical data centers.
  • Packing as much hardware into as little space as possible to keep the costs in check.
As long as we are smaller than the average Internet giant, there is no point in asking about the optimum size. The tooling that we build has to be designed both for large data centers and for having more than one. But in the "1, 2, many" series, "2" is just the worst place to be. It entails all the disadvantages of "more than 1" without any of the benefits of "many".

In the cloud the data center is purely virtual. On AWS the closest thing to a "data center" is a Virtual Private Cloud (VPC) in an AWS Region in an AWS Account. But unlike a physical data center, that VPC is already highly available and offers solid redundancy through the concept of Availability Zones.
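
As a hedged sketch of that built-in redundancy (region and CIDR blocks are made-up examples), a single VPC simply spreads its subnets over all Availability Zones of its region:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # example region

# One virtual "data center" ...
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

# ... that is spread over every Availability Zone of the region.
zones = ec2.describe_availability_zones()["AvailabilityZones"]
for index, zone in enumerate(zones):
    ec2.create_subnet(
        VpcId=vpc_id,
        CidrBlock="10.0.{}.0/24".format(index),
        AvailabilityZone=zone["ZoneName"],
    )
```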

If an AWS account has multiple VPCs (either in the same region or in different regions), then we should see it as actually being several separate data centers. All the restrictions of multiple data centers also apply to having multiple VPCs: higher (than local) latency, traffic costs, traversing the public Internet etc.

To understand more about the optimum size of a cloud data center, we can compare three imaginary variants. I lump EC2 instances, Lambda functions, Beanstalk etc. all together as "code running" resources. IMHO it does not matter how the code runs in order to estimate the management challenges involved.



| | Small VPC | Medium VPC | Large VPC |
|---|---|---|---|
| Number of code running resources | 50 | 200 | 1000 |
| Number of CloudFormation stacks (10 VMs per stack) | 5 | 20 | 100 |
| Service Discovery | manually | simple tooling, e.g. git repo with everything in it | elaborate tooling, Etcd, Consul, Puppet ... |
| Which application is driving the costs? | Eyeball inspection - just look at it | Tagging, Netflix ICE ... | Complex tagging, maintain an application registry, pay for Cloudhealth ... |
| Deployment | CloudFormation manually operated viable option | Simple tooling like cfn-sphere, autostacker24 ... | Multi-tiered tooling like Spinnaker or other large solutions |
| Security model | Everyone related is admin | Everyone related is admin, must have strong traceability of changes | Probably need to have several levels of access, separation of duty and so on |
| … whatever ... | dumb and easy | simple | complex and complicated |

Having a large VPC with a lot of resources obviously requires much more elaborate tooling, while a small VPC can easily be managed with simple tooling. In our case we have a 1:1 relationship between a VPC and an AWS account. Accounts that work in two regions (Frankfurt and Ireland) have two VPCs, but that's it.

I strongly believe that scaling small AWS accounts together with the engineering teams who use them will still allow us to keep going with simple tooling. Even if the absolute total of code running resources is large, splitting it into many small units reduces the local complexity and allows the responsible team to manage their area with fairly simple tooling. Here we use the power of "many" and invest in handling many AWS accounts and VPCs efficiently.

On the overarching level we can then focus on aggregated information (e.g. costs per AWS account) without bothering about the internals of each small VPC.
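
A minimal sketch of how such an aggregated view could be collected, assuming a cross-account audit role (the role name and account IDs are made up) that can be assumed in every account; it only counts running EC2 instances per account and VPC without looking at their internals:

```python
import boto3

# Hypothetical list of platform accounts and an audit role present in each of them.
ACCOUNTS = ["111111111111", "222222222222"]
AUDIT_ROLE = "PlatformAuditRole"

sts = boto3.client("sts")

for account in ACCOUNTS:
    # Switch into the account via the (assumed) cross-account audit role.
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::{}:role/{}".format(account, AUDIT_ROLE),
        RoleSessionName="right-sizing-report",
    )["Credentials"]
    ec2 = boto3.client(
        "ec2",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    # Count running instances per VPC as a rough proxy for "code running" resources.
    per_vpc = {}
    for reservation in ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]:
        for instance in reservation["Instances"]:
            vpc = instance.get("VpcId", "EC2-Classic")
            per_vpc[vpc] = per_vpc.get(vpc, 0) + 1
    print(account, per_vpc)
```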

I therefore strongly advise keeping your data centers small. This will also nicely support an affordable Cloud Exit Strategy.