Embedding SSH Key in SSH URL

SSH keys are considered to be a security feature, but sometimes they make things more complicated than necessary.

Especially in automation contexts we use SSH keys without a passphrase, which degrades the security of the SSH keys to the level of a plain text password. The one remaining benefit of SSH keys is that an attacker who gains access to the server won't be able to use the keys found there to log in somewhere else. As such SSH keys are still better and more secure than a regular plain text password.

In automation contexts we sometimes have to handle lots of SSH keys, for example with GitHub Deploy Keys. GitHub requires a different SSH key for every repository to ensure that a leaked private key will not lead to a breach of other repositories.

I recently had to configure a Go Continuous Delivery server and it turned out that, unlike Jenkins or TeamCity, it does not support managing SSH keys at all. In order to still be able to use GitHub Deploy Keys with Go CD I created a small SSH wrapper that allows placing the SSH key directly in the git URL like this:

git~LS0tLS1CRUdJTiBP....SDFWENF324DS=@github.com:user/repo.git

(The URL is much longer, depending on the size of your SSH key.) The format is

<username>~<base64-encoded private key>@<host>

I use the ~ character as separator because git tries to interpret a : in this place. The SSH wrapper is installed for git with the help of the GIT_SSH environment variable like this:
# clone GitHub repo with Deploy Keys
$ GIT_SSH=ssh-url-with-ssh-key git clone git~LS0tLS1CRUdJTiBP....SDFWENF324DS=@github.com:user/repo.git

# connect to remote SSH server
$ ssh-url-with-ssh-key user~LS0tLS1CRUdJTiBP....SDFWENF324DS=@host

# create new SSH key pair
$ ./ssh-url-with-ssh-key --create schlomo test
Append this base64-encoded private key to the username:
Public Key:
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBJbEG72fvC/+MM8V9PQ7X4HWkoebB2Rj7k67pmMLUJ9qCqtFDBX3IvJmo2HY60Lmjv7XM4fjWdsHlW33+1zXQjE= schlomo test
See the GitHub repo at https://github.com/schlomo/ssh-url-with-ssh-key for the source code.
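
The username format can be illustrated with a short Python sketch. This is not the code of the wrapper itself (see the repository for that). The function name `split_user_and_key` and the sample key text are made up for this illustration, and the sketch assumes that the key is encoded with standard base64.

```python
import base64
from typing import Optional, Tuple


def split_user_and_key(target: str) -> Tuple[str, Optional[str], str]:
    """Split 'user~<base64 key>@host' into (user, decoded key or None, host)."""
    userinfo, at, host = target.partition("@")
    if not at or not host:
        raise ValueError("expected <user>[~<base64 key>]@<host>")
    # '~' works as a separator because it is not part of the base64 alphabet.
    user, tilde, encoded = userinfo.partition("~")
    key = base64.b64decode(encoded).decode("ascii") if tilde else None
    return user, key, host


# Hypothetical stand-in for a real private key, only to show the round trip.
fake_key = (
    "-----BEGIN OPENSSH PRIVATE KEY-----\n"
    "(not a real key)\n"
    "-----END OPENSSH PRIVATE KEY-----\n"
)
encoded = base64.b64encode(fake_key.encode("ascii")).decode("ascii")

user, key, host = split_user_and_key("git~" + encoded + "@github.com")
print(user, host, key == fake_key)
```

A wrapper built along these lines would presumably write the decoded key to a temporary file with restrictive permissions and hand it to ssh via the `-i` option.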


GUUG-Frühjahrsfachgespräch 2017

I had the honor to attend a new (for me) conference: The spring meeting of the German Unix User Group, this time hosted by the Cybersecurity department of the Darmstadt Technical University.

The conference had about 115 participants and is aimed mostly at admins. The former emphasis on Unix is long gone: all talks except one (about Solaris) were about Linux and Linux-based technologies. Two days of tutorials were followed by two days of talks in two parallel tracks.

Noteworthy talks were the keynote about Jailbreaking WiFi Firmware by Matthias Schulz, Architecture Pattern for Container and Kubernetes by Thomas Fricke and several talks about software defined storage. The ensuing discussion between the speakers representing competing approaches especially helped many attendees to sharpen their own opinions.

Jailbreaking WiFi Firmware

Impressive walk-through of the effort it took to turn Broadcom WiFi chips into WiFi monitors suitable for WiFi hacking. The code is available on seemoo-lab/nexmon and backs the Nexmon Android app which turns a Nexus 5 or Nexus 6P into a WiFi hacking tool.

Besides hacking, custom WiFi firmware enables useful applications, for example adaptive video quality streaming over WiFi so that nearby clients receive a higher quality video signal while far-off clients, which have a lower WiFi signal rate, still receive a lower quality video signal. See APP and PHY in Harmony.

Architecture Pattern for Container and Kubernetes

DevOps for Everybody

My own contribution was a new talk DevOps for Everybody - How the entire company can benefit from DevOps about bringing DevOps ideas to all employees in the company. The main idea is to see DevOps as a technique to use technology to change culture.

Together with the talk I also presented a paper A Workplace Strategy for the Digital Age which can serve as an IT strategy for corporate IT departments that want to apply my ideas.


Ubuntu on Dell Latitude E6420 with NVidia and Broadcom

My company sold old laptops to employees and I decided to use the chance to get an affordable and legally licensed Windows 10 system - a Dell Latitude E6420. Unfortunately the system has a Broadcom WiFi card and also ships with an NVidia graphics card, both of which require extra work on Ubuntu 16.04 Xenial Xerus.

After some manual configuration the system works quite well with a power consumption of about 10-15W while writing this blog article. Switching between the Intel and the NVidia graphics card is simple: it is done with a GUI program and requires a logout and login. For most use cases I don't need the NVidia card anyway.

Windows 10 also works well, although it does not support all devices. However, the combined NVidia / Intel graphics system works better on Windows than on Linux.

In detail, I took the following steps to install an Ubuntu 16.04 and Windows 10 dual boot system.

Step-by-Step Installation


  • Either a wired network connection or a USB wifi dongle that works in Ubuntu without additional drivers.
  • 4GB USB thumb drive or 2 empty DVDs or 1 re-writable DVD
  • 2 hours time

Install Windows

  1. Update the firmware to version A23 (use the preinstalled Windows 7 for this task).
  2. Go through the BIOS setup. Make sure to switch the system to UEFI mode and enable booting off USB or DVD. This really simplifies the multi-OS setup as all operating systems share the same EFI system partition.
  3. Download the Windows 10 media creator tool and use it to create a USB drive or DVD.
  4. Insert the installation media and start the laptop. Press F12 to open the BIOS menu and select the installation media in the UEFI section.
  5. Install Windows 10. In the hard disk setup simply delete all partitions so that Windows 10 will create its default layout.
  6. Let Windows 10 do its job, rebooting several times. Use the provided Windows 7 product key for Windows 10 and let it activate over the Internet.
  7. All basic drivers will install automatically, some question marks remain in the device manager. Dell does not provide official Windows 10 drivers, so one would have to search the internet for specific solutions. However, Dell provides an overview page for Windows 10 on E6420.

      Install Ubuntu

      1. Create the Ubuntu installation media.
      2. Boot the laptop. Press F12 when it starts and select the installation media in the UEFI section of the BIOS menu.
      3. Select "Install Ubuntu" in the boot menu. Choose to install Ubuntu together with Windows. In the disk partitioning dialog reduce the size of the Windows partition to make room for Ubuntu. Leave Windows at least 50GB, otherwise you won't be able to do much with it.
      4. Let Ubuntu finish its installation and boot into Ubuntu.

      Optimize and Configure Ubuntu

      The default installation needs some additional packages to work well. Make sure that Ubuntu has an internet connection (wired or via a supported USB wifi dongle).

      Note: For the Broadcom WiFi adapter there are several possible drivers in Ubuntu. By default it will install the wl driver which was not working well for me and caused crashes. The b43 driver works for me, although the Wifi performance is rather low.

      Note: The HDMI output of the laptop is connected to the NVidia graphics chip. Therefore you can use it only when the system uses the NVidia graphics chip.
      1. Update Ubuntu and reboot:
        sudo apt update
        sudo apt full-upgrade
        sudo reboot
      2. Install the following packages and reboot:
        sudo apt install firmware-b43-installer \
            nvidia-361 nvidia-prime bbswitch-dkms \
            vdpauinfo libvdpau-va-gl1 \
            mesa-utils powertop
      3. Confirm that the builtin WiFi works now.
      4. Add the following line to /etc/rc.local before the exit 0 line:
        powertop --auto-tune
      5. Reboot
      6. Check that 3D acceleration works with NVidia:
        glxinfo | grep renderer\ string
        OpenGL renderer string: NVS 4200M/PCIe/SSE2
      7. Check that VDPAU acceleration works with NVidia:
        vdpauinfo | grep string
        Information string: NVIDIA VDPAU Driver Shared Library  361.42  Tue Mar 22 17:29:16 PDT 2016
      8. Open nvidia-settings and switch to the Intel GPU (you will have to confirm with your password):
      9. Logout and log back in. Confirm that 3D acceleration works now:
        glxinfo | grep renderer\ string
        OpenGL renderer string: Mesa DRI Intel(R) Sandybridge Mobile
      10. Confirm that the NVidia graphics card is actually switched off:
        cat /proc/acpi/bbswitch
        0000:01:00.0 OFF
      11. Confirm that VDPAU acceleration works:
        vdpauinfo | grep string
        libva info: VA-API version 0.39.0
        libva info: va_getDriverName() returns 0
        libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
        libva info: Found init function __vaDriverInit_0_39
        libva info: va_openDriver() returns 0
        Information string: OpenGL/VAAPI/libswscale backend for VDPAU
      12. Check that the power consumption is somewhere between 10W and 15W:
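
The checks in the steps above can also be scripted. The following Python sketch only parses the command output shown in this article; the function names are made up for this illustration, and it assumes that /proc/acpi/bbswitch and glxinfo print exactly the formats quoted above.

```python
from typing import Optional, Tuple


def parse_bbswitch(status_line: str) -> Tuple[str, bool]:
    """Parse a line such as '0000:01:00.0 OFF' into (pci_address, powered_on)."""
    pci_address, state = status_line.split()
    if state not in ("ON", "OFF"):
        raise ValueError("unexpected bbswitch state: " + state)
    return pci_address, state == "ON"


def renderer_from_glxinfo(output: str) -> Optional[str]:
    """Return the value of the 'OpenGL renderer string' line, if present."""
    for line in output.splitlines():
        if line.strip().startswith("OpenGL renderer string:"):
            return line.split(":", 1)[1].strip()
    return None


def intel_mode_ok(bbswitch_line: str, glxinfo_output: str) -> bool:
    """True if the NVidia card is switched off and the Intel GPU renders."""
    _, powered_on = parse_bbswitch(bbswitch_line)
    renderer = renderer_from_glxinfo(glxinfo_output) or ""
    return not powered_on and "Intel" in renderer


# Sample values copied from the steps above.
print(intel_mode_ok(
    "0000:01:00.0 OFF",
    "OpenGL renderer string: Mesa DRI Intel(R) Sandybridge Mobile",
))
```

On a real system the two inputs would come from reading /proc/acpi/bbswitch and from running glxinfo.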




      Lifting the Curse of Static Credentials

      Summary: Use digital identities, trust relationships and access control lists instead of passwords. In the cloud, this is really easy.

      I strongly believe that static credentials are one of the biggest hazards in modern IT environments. Most information security incidents are somehow related to lost, leaked or guessed static credentials; Instagram's Million Dollar Bug is just one very nice example. Static credentials

      • can be used by anyone who has them - friend or foe
      • are typically very short and can even be brute forced or guessed
      • for machine or service users have to be stored in configuration files from where they can be leaked
      • are hard to remember for humans so that they will write them down somewhere or store them in files
      • typically stay the same over a long period of time
      • don't include any information about the identity of the bearer or user
      • are hard to rotate on a regular basis because the change has to happen in several places at the same time
      All those problems disappear if we use digital identities and trust relationships instead of static credentials. Unfortunately static credentials are incredibly easy to use which makes them hard to eradicate.

      Static credentials are from the Middle Ages

      Source: Dr. Pepper ad from 1963 / Johnny Hart
      Back then passwords or watchwords were state of the art. For example membership in a secret club or belonging to a certain town could be proven with a "secret" password. Improvements were the "password of the day" (nice for watchtower situations) or even "challenge response" methods like completing a secret poem or providing a specific answer to a common question.

      Basically everything we do with static credentials, for example a website login, follows exactly those early patterns, even though the real world moved on to identity-based access control in the 19th and 20th centuries. Passports and ID cards certify the identity of the bearer; border control checks whether the passport is valid and whether the person is presenting their own passport, and then decides whether the person is allowed passage. Nobody would even think about granting access to anything in the real world in exchange for just a password.

      So why is IT security stuck in the Middle Ages? IMHO because of convenience and a lack of simple and widespread standards. We see static credentials almost everywhere in our daily business:
      • Website logins - who does not use a password manager? Only very few websites manage without passwords
      • Database credentials - are probably the least rotated credentials of all
      • Work account login - your phone stores that for you
      • SSH keys - key passphrases don't add much security, SSH key security is much underestimated
      • ...
      Sadly, agreeing upon static credentials and manually managing them is still the only universally applicable, compatible and standardized method of access protection that we know in IT.

      Modern IT

      Luckily in professional environments there is a way out. In a fully controlled environment there should be no need for static credentials. Every user and every machine can be equipped with a digital identity whose static parts are stored in secure hardware storage (e.g. YubiKey and TPM). Beyond that all communication and access can be granted based on those digital identities. Temporary grants by a granting authority and access control lists give access to resources. The same identity can be used everywhere thereby eliminating the need for static credentials.

      Kerberos and TLS certificates are well known and established implementations of such concepts. Sadly many popular software solutions still don't support them or make their use unnecessarily complicated. As the need to use certain software typically wins over the wish to have tight security, we users end up dealing with lots of static credentials. The security risk is deemed acceptable as those systems are mostly accessible from inside only. Instagram's Million Dollar Bug of course proves the folly of this thought. A chain of static AWS credentials found in various configuration files allowed exploiting everything:
      Source: Instagram's Million Dollar Bug (Internet Archive) / Wesley
      Facebook obviously did not think about the fact that static AWS credentials can be used by everyone from everywhere.

      The Cloud Challenge

      As we move more and more IT functions into the Cloud the problem of static credentials gains a new dimension: Most of our resources and services are "out there somewhere" and not part of our internal network. There is absolutely no additional layer of security! Anybody who has the static credentials can use them and you won't even notice it.

      Luckily Cloud providers like Amazon Web Services (AWS) also have a solution for that problem: AWS Identity and Access Management (IAM) provides the security backbone for all communication between machines and users on one side and Amazon APIs on the other. Any code that runs on AWS is assigned a digital identity (EC2 Instance Role, Lambda Execution Role) which provides temporary credentials via the EC2 instance metadata interface. Those credentials are then used to authenticate API calls to AWS APIs.

      As a result it is possible to run an entire data center on AWS without the need for static credentials for internal communication. Attackers who gain internal access will not be able to access more resources than the exploited service had access to. Eradicating internal static credentials would therefore have prevented Instagram's Million Dollar Bug.
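
To illustrate what "temporary credentials" means in practice, here is a minimal Python sketch. It assumes the JSON document that the EC2 instance metadata service returns for an instance role and only parses it; the sample values, the function names and the five minute refresh margin are made up for this illustration. AWS SDKs normally do all of this automatically, and newer setups may additionally require a metadata session token.

```python
import json
from datetime import datetime, timedelta, timezone

# Link-local address of the EC2 instance metadata service; the role name is
# appended to this URL. The sketch itself performs no HTTP request.
METADATA_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


def parse_role_credentials(document: str) -> dict:
    """Extract the temporary credentials and their expiry from the JSON document."""
    data = json.loads(document)
    expires = datetime.strptime(data["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "access_key_id": data["AccessKeyId"],
        "secret_access_key": data["SecretAccessKey"],
        "session_token": data["Token"],
        "expires": expires.replace(tzinfo=timezone.utc),
    }


def needs_refresh(credentials: dict, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True once the credentials are within `margin` of expiring."""
    return now >= credentials["expires"] - margin


# Made-up sample document in the shape the metadata service returns.
sample = json.dumps({
    "Code": "Success",
    "Type": "AWS-HMAC",
    "AccessKeyId": "ASIAEXAMPLEKEYID",
    "SecretAccessKey": "example-secret",
    "Token": "example-session-token",
    "Expiration": "2016-06-01T18:00:00Z",
})

creds = parse_role_credentials(sample)
print(needs_refresh(creds, datetime(2016, 6, 1, 12, 0, tzinfo=timezone.utc)))
print(needs_refresh(creds, datetime(2016, 6, 1, 17, 58, tzinfo=timezone.utc)))
```

The important difference to a static credential is the expiry: a leaked copy becomes useless after a few hours without anyone having to rotate anything.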

      Avoid Static Credentials

      In a world of automation static credentials are often a nuisance. They have to be added to all kinds of configuration files while protecting them from as many eyes as possible. In the end, many secrets management solutions only protect the secrets from the admins and casual observers but do not prevent leaked secrets in general. Identity-based security actually helps in automated environments. The problem of static credentials is reduced to just one set for the digital identity. All other communication just uses that identity for authentication.

      Eradicating static credentials and using digital identities not only significantly improves security but also assists in automating everything.

      If you use AWS, start today to replace static AWS credentials with IAM roles. Use the AWS Federation Proxy to provide IAM roles to containers and on-premise servers in order to remove static AWS credentials from both your public and your private cloud environments.

      For your local environment, use Kerberos pass-through authentication instead of service passwords.

      For websites try to use federated logins (e.g. OpenID Connect) and favor those that don't need a password.

      For your own stuff, be sure to enable two-factor authentication and store certificates and private keys in hardware tokens like YubiKey.


      CoreOS Fest 2016 - Containers are production ready!

      The CoreOS Fest 2016 in Berlin impressed me very much: A small Open Source company organizes a 2 day conference around their Open Source tools and even flies in a lot of their employees from San Francisco. A win both for Open Source and for Berlin. And CoreOS also announced that they got new funding of $28M:
      Alex Polvi, CEO of CoreOS
      More interesting for IT people everywhere is the message one can learn here: Container technologies are ready for production. There is a healthy environment of

      In fact, choosing the "right" platform starts to become the main problem for those who still run on traditional Virtualization platforms. On the other hand, IT companies who don't evaluate containers in 2016 will be missing out big time.

      The hope remains that with the now emerging technologies one does not need to build up a team of support engineers just to run the platform.


      OSDC 2016 - Hybrid Cloud

      The Open Source Data Center Conference 2016 is a good measure of how the industry changes. Compared to 2014, Cloud topics take more and more space: both how to build your own on-premise cloud with Mesos, CoreOS or Kubernetes and how to use the public Cloud.

      Maybe not surprising, I used the conference to present my own findings from 2 years of Cloud migration at ImmobilienScout24:

      After we first tried to find a way to quickly migrate our data centers into the Cloud we now see that a hybrid approach works better. Data center and cloud are both valued platforms and we will optimize the costs between them.

      Hybrid Cloud - A Cloud Migration Strategy

      Do you use Cloud? Why? What about the 15 year legacy of your data center? How many Enterprise vendors tried to sell you their "Hybrid Cloud" solution? What actually is a Hybrid Cloud?

      Cloud computing is not just a new way of running servers or Docker containers. The interesting part of any Cloud offering are managed services that provide solutions to difficult problems. Prime examples are messaging (SNS/SQS), distributed storage (S3), managed databases (RDS) and especially turn-key solutions like managed Hadoop (EMR).

      Hybrid Cloud is usually understood as a way to unify or standardize server hosting across private data centers and Public Cloud vendors. Some Hybrid Cloud solutions even go as far as providing a unified API that abstracts away all the differences between different platforms. Unfortunately that approach focuses on the lowest common denominator and effectively prevents using the advanced services that each Cloud vendor also offers. However, these services are the true value of Public Cloud vendors.

      Another approach to integrating Public Cloud and private data centers is using services from both worlds depending on the problems to solve. Don't hide the cloud technologies but make it simple to use them - both from within the data center and the cloud instances. Create a bridge between the old world of the data center and the new world of the Public Cloud. A good bridge will motivate your developers to move the company to the cloud.

      Based upon recent developments at ImmobilienScout24, this talk tries to suggest a sustainable Cloud migration strategy from private data centers through a Hybrid Cloud into the AWS Cloud.
      • Bridging the security model of the data center with the security model of AWS.
      • Integrating the AWS identity management (IAM) with the existing servers in the data center.
      • Secure communication between services running in the data center and in AWS.
      • Deploying data center servers and Cloud resources together.
      • Service discovery for services running both in the data center and AWS.
      Most of the tools used are Open Source and this talk will show how they come together to support this strategy:

      As soon as the video is published I will update the talk here.


      You can't control internal public data

      Everywhere there is some data that is relevant either for all applications or for many applications in different parts of the platform.

      The "obvious" solution to this problem is to make such data internally public or world-readable, meaning that the entire platform can read it.

      The "obvious" solution to security in this case is actually having no security beyond answering the "are you part of us?" question.

      Common implementations of this pattern are world-readable NFS shares, S3 buckets readable by all "our" AWS accounts, HTTP APIs that use the client IP as their sole access control mechanism etc.

      This approach is really dangerous and should be used with care. The risks include:

      • You most likely don't know who actually needs the data and who not. If you ever need to restrict access you will have a very long and tedious job ahead of you.
      • You don't know who accessed the data for which purpose.
      • After a data leak, you probably won't know how it happened.
      • The data is only as safe as the weakest application in your platform.
      This last danger is the real problem here. Imagine that you share personally identifiable information with all your internal systems, either out of convenience or even unknowingly. The weakest app in your portfolio will be used to hack your platform and to copy all that sensitive data.

      It is not a question of "if" but rather a question of "when" as we can learn from Facebook: Instagram's Million Dollar Bug is a must read for everybody! And please also read the case study for defense that detailed the learning from it.

      One key takeaway is the folly of having a central, internally world-readable S3 bucket with the configuration of the entire platform. A simple grep over that data will instantly yield all the other S3 buckets in use and especially those that are internally world-readable. A real attacker would have proceeded to copy all "interesting" data they could find to an S3 bucket of their own. In most setups, this operation would not leave any trace except that the exploited system read all the S3 buckets. Facebook was just very lucky that their flaw was found by a security researcher who played by the rules.

      Instead of going the convenient way it pays to invest early into a minimum level of security, for example doing everything with a least privilege model. The result is a detailed list of access grants that you can use for security audits and to draw a risk map. The risk map will give you all the systems that have a risk of leaking the data.
      Risk Map: Services 1, 2, 4, 5 and 8 define the safety of the personal data.
      Yes, this means a bit more work but the payoff is almost immediate. Especially if you partition your platform into smaller parts you should be careful not to destroy the security benefits of that partition. At ImmobilienScout24 we use many AWS accounts, among other things to have a "blast radius" for potential hazards. The blast radius concept applies not only to infrastructure but also to data. Everybody with a partitioned platform should pay attention that the partitioning covers all aspects of the platform and not only the infrastructure level.

      Finally, if you are subject to German Data Privacy Laws (Bundesdatenschutzgesetz) then there are special regulations against uncontrolled spread of data (Weitergabekontrolle). Having personally identifiable information that is internally world-readable probably violates this law. If you can't be 100% sure that your internal public data is clean then it is much safer and easier to just automate the access control.

      To conclude, I strongly advise making only such data world-readable that you can afford to see leaked into the open.
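
How a list of access grants turns into a risk map can be sketched in a few lines of Python. The service names and grants below are made up, and the sketch assumes that a service which may read from another service can also obtain whatever that service can read.

```python
from collections import deque
from typing import Dict, Set


def risk_map(grants: Dict[str, Set[str]], dataset: str) -> Set[str]:
    """Return all principals that can reach `dataset` directly or indirectly.

    `grants` maps each principal to the resources it may read. A resource is
    either a dataset or another principal (a service it may call).
    """
    # Reverse index: resource -> principals that may read it.
    readers: Dict[str, Set[str]] = {}
    for principal, resources in grants.items():
        for resource in resources:
            readers.setdefault(resource, set()).add(principal)

    # Breadth-first search from the dataset through the reverse index.
    exposed: Set[str] = set()
    queue = deque([dataset])
    while queue:
        resource = queue.popleft()
        for principal in readers.get(resource, set()):
            if principal not in exposed:
                exposed.add(principal)
                queue.append(principal)
    return exposed


# Hypothetical least-privilege grants.
grants = {
    "service-1": {"personal-data"},
    "service-2": {"service-1"},
    "service-3": {"catalog"},
    "service-4": {"service-2", "catalog"},
}
print(sorted(risk_map(grants, "personal-data")))
```

Every service in the result is a place from which the personal data could leak. With an internally world-readable data store the result would simply be "all services".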


      Go Faster - DevOps & Microservices

      At the microXchg 2016 last week Fred George - who takes pride in having been called a hand grenade - gave a very inspiring talk about how all the things that we do right now have one primary goal:

      Go Faster

      Reducing cycle time for deployments, automation everywhere, down-sizing services to "microservices", building resilient and fault-tolerant platforms and more are all facets of a bigger journey: Provide value faster and find out faster what works and what does not.


      DevOps is seen by most developers as being an Ops movement to catch up with developers before their jobs become obsolete. At various DevOps Days in Germany and the USA, the developers who were also there always complained about the lack of developers and the lack of developer topics. They observed that the conference seems to be by and for Ops people. Consequently, DevOps conferences usually have two tracks: Methods and Tools.

      Methods teach us how to do "proper" software development also in infrastructure engineering and to follow agile software development practices. Tools talks try to make us believe that you cannot be a good DevOps unless you use Puppet, Chef or Ansible. The success story talks all emphasize how "DevOps Tools", shared responsibility and a newly formed "DevOps Team" saved the day. In more recent years the tools focus on building private clouds with Docker and on managing distributed storage.

      In fact, DevOps is all about being faster through shared responsibility, mutual respect between different knowledge bearers and building cross-functional teams with full vertical responsibility for their work.


      Microservices is definitely an important hype amongst developers. Seasoned Ops people see it as the obvious thing to do, just like the well-known Unix philosophy teaches:

      The Unix philosophy emphasizes building simple, short, clear, modular, and extensible code that can be easily maintained and repurposed by developers other than its creators. The Unix philosophy favors composability as opposed to monolithic design. Source: Wikipedia
      Applying all that to systems design is a straight road to microservices. When going from millions of Lines-of-Code to just a few thousand and when going from 5 applications to 500 the glue code between all those applications suddenly becomes the governing system.

      Service discovery, managing large amounts of micro instances, network latency, cyclic dependency trees etc. are all areas of expertise of Ops people. They have been dealing with these questions for the last 20 years!

      Microservice success stories, like the one of SoundCloud shown also at the microXchg 2016, show how more and more glue and abstraction layers were introduced into the emerging microservices architecture to compensate for the degradation that came along with the exploding complexity of their microservices landscape.

      Much of that could also be learned from modern Linux operating systems. A nice example is systemd which drives its own "microservices" revolution, just on a smaller scale within a single Linux computer.

      Looking at the tools track of Microservices events it is no surprise to find Docker in a dominating role here as well.

      Common Values & Concepts

      I don't want to argue that Docker is the common topic that everybody should care about. After all, Docker is just this year's hype implementation of an operating concept. Savvy sysadmins were doing the same thing with chroot or OpenVZ a long time ago. And in a few years we will probably have something even better for the same job.

      What really brings these topics together are a lot of shared values and concepts (in no particular order):
      • KISS approach
      • Right-sizing everything to an easily managed size: microservices, two pizza teams, iterative solutions to problems
      • Full stack responsibility
      • Automate everything, especially the glue between all those small components
      • Observe-Orient-Decide-Act loops in different forms and fashions
      As long as we keep our core values in mind the actual technology or methodology doesn't matter so much. We will still achieve our goals. Just much faster.


      Cloud Migration ≈ Microservices Migration

      Day two at the microXchg 2016 conference. After listening to yet another talk detailing the pitfalls and dangers of "doing it wrong" I see more and more similarities between the Cloud migration at ImmobilienScout24 and the microservices journey that most speakers present.
      The Cloud migration moves us from a large data center into many smaller AWS accounts. A (legacy) monolithic application is cut into many smaller microservices.

      Internal data center communication becomes exposed communication between different AWS accounts and VPCs. Internal function calls are replaced with remote API calls. Both require much more attention to security, necessitate an authentication framework and add significant latency to the platform.

      A failed data center takes down the entire platform while a failed AWS account will only take down some function. An uncaught exception will crash the entire monolith while a crashed microservice will leave the others running undisturbed.

      Internal service dependencies turn into external WAN dependencies. Library dependencies inside the monolith turn into service dependencies between microservices. Cyclic dependencies remain a deadly foe.
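
Cyclic dependencies can at least be detected mechanically once the dependencies are written down. The following Python sketch runs a depth-first search over a dependency map; the service names are made up, and it assumes that such a map (service to the services it depends on) is available from some registry.

```python
from typing import Dict, List, Optional


def find_cycle(dependencies: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(service: str) -> Optional[List[str]]:
        state[service] = IN_PROGRESS
        path.append(service)
        for dep in dependencies.get(service, []):
            if state.get(dep, UNSEEN) == IN_PROGRESS:
                # dep is already on the current path: we closed a loop.
                return path[path.index(dep):] + [dep]
            if state.get(dep, UNSEEN) == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[service] = DONE
        return None

    for service in dependencies:
        if state.get(service, UNSEEN) == UNSEEN:
            found = visit(service)
            if found:
                return found
    return None


# Hypothetical service landscape with one cycle.
services = {
    "search": ["listing"],
    "listing": ["billing"],
    "billing": ["search"],
    "mail": [],
}
print(find_cycle(services))
```

The same check works for library dependencies inside a monolith and for WAN dependencies between AWS accounts, as long as the dependency data is written down somewhere.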

      Team responsibilities shift from feeling responsible for a small part of the platform to being responsible for entire AWS accounts or only their own microservices.

      And much more...


      If it looks similar, maybe we can learn something from this. I strongly believe that many structural and conceptual considerations apply equally to a Cloud migration and to a microservices journey:
      • Fighting complexity through downsizing.
      • Complexity shifts from inside to outside. New ways to manage this complexity emerge.
      • Keeping latency in check is a key factor to success.
      • Need much more advanced tooling to properly handle the scale out of managed entities.
      • Less centralization of common concerns leads to more wasted effort and resources. Accept this.
      • Success and failure hang on finding the right seams to cut.
      • "Just put it somewhere" usually doesn't work at all.
      • Integration tests become more important and difficult.
      I learned a lot at this conference, both about microservices and about the direction our Cloud migration should go.

      Please add your learnings in the comments.


      AWS Account Right-Sizing

      Today I attended the microXchg 2016 conference in Berlin. I suddenly realized that going to the cloud allows us to ask completely new questions that are impossible to ask in the data center.

      One such question is this: What is the optimum size for a data center? Microservices are all about downsizing - and in the cloud we can and should downsize the data center!

      In the world of physical data centers the question is usually governed by two factors:

      • Ensuring service availability by having at least two physical data centers.
      • Packing as much hardware into as little space as possible to keep the costs in check.
      As long as we are smaller than the average Internet giant there is no point in asking about the optimum size. The tooling which we build has to be designed for both large data centers and for having more than one. But in the "1, 2, many" series "2" is just the worst place to be. It entails all the disadvantages of "more than 1" without any of the benefits of "many".

      In the cloud the data center is purely virtual. On AWS the closest thing to a "data center" is a Virtual Private Cloud (VPC) in an AWS Region in an AWS Account. But unlike a physical data center that VPC is already highly available and offers solid redundancy through the concept of Availability Zones.

      If an AWS Account has multiple VPCs (either in the same region or in different regions), then we should see it as actually being several separate data centers. All the restrictions of multiple data centers also apply to having multiple VPCs: Higher (than local) latency, traffic costs, traversing the public Internet etc.

      To understand more about the optimum size of a cloud data center we can compare three imaginary variants. I combine EC2 instances, Lambda functions, Beanstalk etc. all into "code running" resources. IMHO it does not matter how the code runs in order to estimate the management challenges involved.

      The three variants are a small VPC, a medium VPC and a large VPC. They differ in the number of code running resources and, at 10 VMs per stack, in the number of CloudFormation stacks. Going from small to large, the management challenges compare as follows:

      • Service Discovery: from simple tooling (e.g. a git repo with everything in it) to elaborate tooling (Etcd, Consul, Puppet ...)
      • Which application is driving the costs? Small: eyeball inspection, just look at it. Medium: tagging, Netflix ICE ... Large: complex tagging, maintaining an application registry, paying for Cloudhealth ...
      • CloudFormation. Small: manually operated is a viable option. Medium: simple tooling like cfn-sphere, autostacker24 ... Large: multi-tiered tooling like Spinnaker or other large solutions.
      • Security model. Small: everyone related is admin. Medium: everyone related is admin, must have strong traceability of changes. Large: probably need to have several levels of access, separation of duty and so on.
      • ... whatever ...: from dumb and easy to complex and complicated

      Having a large VPC with a lot of resources obviously requires much more elaborate tooling while a small VPC can be easily managed with simple tooling. In our case we have a 1:1 relationship between a VPC and an AWS account. Accounts that work in two regions (Frankfurt and Ireland) have 2 VPCs but that's it.

      I strongly believe that scaling small AWS accounts together with the engineering teams who use them will still allow us to keep going with simple tooling. Even if the absolute total of code running resources is large, splitting it into many small units reduces the local complexity and allows the responsible team to manage their area with fairly simple tooling. Here we use the power of "many" and invest into handling many AWS accounts and VPCs efficiently.

      On the overarching level we can then focus on aggregated information (e.g. costs per AWS account) without bothering about the internals of each small VPC.
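
As a small illustration of such an aggregated view, the following Python sketch sums cost line items per AWS account. The account names, the tuple layout and the numbers are made up; real billing data would come from the AWS billing reports.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def costs_per_account(line_items: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Sum the costs per account, ignoring which resource caused them."""
    totals: Dict[str, float] = defaultdict(float)
    for account, _resource, cost in line_items:
        totals[account] += cost
    return dict(totals)


# Hypothetical line items: (account, resource type, monthly cost).
line_items = [
    ("team-search", "ec2", 120.0),
    ("team-search", "s3", 30.0),
    ("team-mail", "lambda", 12.5),
]
print(costs_per_account(line_items))
```

The point of small accounts is that this coarse view is enough on the overarching level: which resource inside an account drives the costs remains the concern of the team that owns the account.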

      I therefore strongly advise keeping your data centers small. This will also nicely support an affordable Cloud Exit Strategy.