2011-06-16

Velocity 2011 - Part 3: Wednesday (2nd day)

My notes on the second conference day at the Velocity Conference.

The keynotes were again a highlight, topped only by the talk Automating for Success: Production Begins in Development, which happened to confirm all my theories about web operations and package-based deployment :-)

Videos are available on the Velocity 2011 Videos page, slides can be found on the Velocity 2011 Speakers Slides and Video page.

Read also about the Workshops and the first day.


Keynotes Thursday

World IPv6 Day: Lessons Learned

Ian Flint, Yahoo
  • http://www.youtube.com/watch?v=T04o6bQN8Ls
  • Last /8 net assigned in 2011
  • NAT is bad for geolocating clients
    • bad for business
    • bad for targeting
  • What is the catch of using IPv6?
    • 0.2% of users have IPv6 so far
    • dual stack setups often have broken IPv6 setups, browsers prefer IPv6
    • OS timeout for switching from IPv6 to IPv4 is long (Linux/Windows 21sec, OS X 75sec, phones no fallback)
  • Chicken/egg problem: Which website will go dual stack first?
    • All of them: 434 participants signed up for World IPv6 Day
    • June 8, 2011
  • Yahoo implementation details for yahoo.com
    • 37 markets
    • served from 10 datacenters
    • Set up IPv6 proxy servers in 7 locations to reduce the risk of turning on IPv6
    • Install 6to4 Relays in all peering points
    • Certify all network gear at scale
    • Retrofit custom global DNS
    • Retrofit DOS protection layer
    • Retrofit Audience Data Pipeline
  • IPv6 Test in 38 languages, user help pages
  • Two 15-minute tests before the IPv6 day
    • first test showed that problematic health checks in the DNS infrastructure routed all India traffic to Santa Clara
  • Planning for Decision Points
    • When would things be bad enough to force Yahoo to roll back?
    • Never do big changes at times of traffic changes
  • Always make sure you can look at things from more than one point of view
  • Practice makes perfect. For a major change always run some tests beforehand
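
The long OS fallback timeouts noted above are what make a broken dual stack setup so painful. A minimal sketch of the idea of trying IPv6 first but giving up quickly and falling back to IPv4 could look like this. The function name, the 2 second timeout and the injectable `resolve`/`connect` callables are my own assumptions for illustration, nothing of this was shown in the talk.

```python
import socket


def connect_prefer_ipv6(host, port, timeout=2.0,
                        resolve=socket.getaddrinfo, connect=None):
    """Try IPv6 addresses first, then IPv4, each with a short timeout."""
    if connect is None:
        def connect(family, sockaddr):
            sock = socket.socket(family, socket.SOCK_STREAM)
            sock.settimeout(timeout)  # short app-level timeout (assumed value)
            try:
                sock.connect(sockaddr)
            except OSError:
                sock.close()
                raise
            return sock

    infos = resolve(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    # Stable sort: AF_INET6 entries first, everything else after.
    infos = sorted(infos, key=lambda info: info[0] != socket.AF_INET6)

    last_error = None
    for family, _type, _proto, _canonname, sockaddr in infos:
        try:
            return connect(family, sockaddr)
        except OSError as exc:  # includes timeouts
            last_error = exc
    raise last_error or OSError("no addresses found for %s" % host)
```

With a broken IPv6 path the caller waits at most `timeout` seconds per IPv6 address before IPv4 is tried, instead of the 21 to 75 seconds mentioned above.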

Facebook Open Compute & Other Infrastructure

Jonathan Heiliger, Facebook
  • http://www.youtube.com/watch?v=urG0dQ7kc3w
  • Very good !!!
  • Growth of users was also matched by growth of innovation and speed of change
    • This is very unusual, usually innovation speed becomes less as companies grow
  • Rundown of the Facebook growth story
    • HPHP brought a great improvement for site performance
    • power consumption became a big issue, decision to look at all parts involved
  • Facebook started building their own datacenter and servers
  • Conclusions:
    • Make audacious bets and iterate quickly
    • Smart and hungry beats large and capable every time
    • Make it work
    • Manage risk with hedges

Velocity Culture

Jon Jenkins, Amazon
  • http://www.youtube.com/watch?v=dxk8b9rSKOo
  • http://assets.en.oreilly.com/1/event/60/Velocity%20Culture%20Presentation.pdf
  • Web performance drives real value for the business
    • Case studies from Bing, Google, Shopzilla, MSN show this
    • Steve Souders did a lot for that
  • What about operations? How does ops provide value for business?
  • What if the size of your server fleet could be totally flexible?
  • Case study 1: Downscaling
    • weekly traffic patterns high and low
    • at Amazon up to 39% server capacity goes to waste
    • for high traffic months this can be even up to 75%
    • Since November 10, 2010 all amazon.com traffic is served by EC2
      • Reduced spending on server capacity
      • Fleet scales dynamically in increments as small as a single host
  • Case study 2: Continuous Deployment
    • Mean time between deployments: 11.2sec
    • 1079 deployments per hour maximum rate in May 2011
    • Deployments roll through server groups
      • Problems: Complex workflow, slow, error scenarios very complex to handle
    • Solution: If capacity is unlimited then one could simply spawn a new set of server groups
    • More and more deployments use this method
      • 75% reduction in outages triggered by software deployments since 2006
      • 90% reduction in outage minutes triggered by software deployments
      • instantaneous automated rollback (switch LB back to old server group)
      • Reduction in complexity, no upgrades on server, just make new servers
  • The Challenge for Velocity 2012
    • save millions of $ by optimizing server utilization
    • become faster and more available by using flexible server capacity
    • Please come back in 2012 and tell your story how ops managed to contribute business value
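
The deployment method from case study 2 (spawn a new server group, switch the load balancer, switch back for rollback) can be sketched roughly like this. This is a toy model under my own assumptions: the `LoadBalancer` class, `deploy()` and the host names are hypothetical and say nothing about how Amazon actually implements it.

```python
class LoadBalancer:
    """Toy load balancer that routes all traffic to one server group."""

    def __init__(self, active_group):
        self.active_group = active_group
        self.previous_group = None

    def switch_to(self, group):
        # Keep the old group around so a rollback is just a pointer swap.
        self.previous_group = self.active_group
        self.active_group = group

    def rollback(self):
        if self.previous_group is None:
            raise RuntimeError("no previous server group to roll back to")
        self.active_group, self.previous_group = self.previous_group, None


def deploy(lb, version, spawn_group, is_healthy):
    """Spawn a fresh group for `version`; switch traffic only if healthy."""
    new_group = spawn_group(version)
    if not is_healthy(new_group):
        return False  # old group keeps serving, nothing to undo
    lb.switch_to(new_group)
    return True


# Usage with hypothetical host names:
lb = LoadBalancer(["v1-host-a", "v1-host-b"])
deploy(lb, "v2",
       spawn_group=lambda v: [v + "-host-a", v + "-host-b"],
       is_healthy=lambda group: len(group) > 0)
lb.rollback()  # traffic is back on the v1 group
```

Existing servers are never upgraded in place, which is where the reduction in complexity comes from: a release is either a new group or no change at all.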

Artur on SSD

Artur Bergman, fastly.com
  • http://www.youtube.com/watch?v=H7PJ1oeEyGg
  • Mac Laptop boot time: 13 seconds
  • If you don’t use SSDs, you waste your life
  • Fastly uses only (or mostly) SSDs in their data center
  • Show this to the boss to get an SSD :-)

Cisco and OpenStack

Lew Tucker, Cisco

State of the Infrastructure

Rachel Chalmers, The 451 Group

Holistic Performance

John Resig
  • http://www.youtube.com/watch?v=WuMEQN7aph0
  • About jQuery
  • Client-side JavaScript performance issues
    • Analyzing performance not trivial, e.g. is wall-clock time relevant? Or CPU consumption?
    • Memory consumption, what about memory leaks?
    • Parse time, the more you download the more to parse
    • Battery consumption (Mobile!)
  • Example: Dictionary Lookups in JavaScript
    • Most solutions optimize for file (download) size
    • Bad parse time
    • Succinct Trie is the best by file size, memory consumption and lookup performance
  • dynaTrace - useful tool to dig into the details
  • jQuery project
    • Bug reports need a reproducible test case
    • Performance enhancements need to be proven through http://jsperf.com
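
To make the dictionary lookup example a bit more concrete, here is a minimal sketch of a plain trie. Note the assumptions: this is an ordinary dict-based trie, not the succinct (bit-packed) encoding from the talk, and it is written in Python rather than JavaScript. It only shows the lookup idea, not the file size or memory benefits.

```python
class Trie:
    """Plain dict-based trie for word lookups (not the succinct variant)."""

    END = ""  # end-of-word marker; cannot collide with a single character

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self.END in node


words = Trie(["car", "cat", "cart"])
print("cat" in words)  # a stored word
print("ca" in words)   # only a prefix, not a stored word
```

Words sharing a prefix share nodes, which is what the succinct encoding then compresses further.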

Lightning Demos

Page Speed

Michael Schneider, Google
  • New work on Page Speed
  • Page Speed Firefox add-on
  • Now also for Chrome
  • Page Speed is a tab in the web inspector
  • Page Speed is a tool to analyze page load times and suggest optimizations
  • http://pagespeed.googlelabs.com online version of Page Speed
    • get mobile report to analyze page load timings for mobile devices
  • Experimental hints about avoiding unnecessary reflows

dynaTrace

Andreas Grabner, dynaTrace

Chrome Developer Tools

Paul Irish, Google Chrome relations team
  • New things
  • Task manager: Right click on a task gives many internals and details, e.g. Number of Goats Teleported
  • JavaScript Performance APIs:
    • performance.timing
    • performance.memory (need --enable-memory-info command line option)
    • window.onerror
    • console.profile() and console.profiles[] - CPU profiling also as a JS object. Can be sent back to the webserver for analysis
    • console.markTimeline() - set markers that show up in the Timeline to help group JS actions
  • Heap Profiler
    • dig into memory consumption
    • snapshot diffs between different states
    • find memory leaks
  • Remote Debugging
    • --remote-debugging-port # command line option
    • Developer Tools run a little web server
    • allows remote analysis
    • This is part of WebKit and should soon be available for all WebKit browsers
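
Once performance.timing values are sent back to the web server they need to be turned into durations. A minimal server-side sketch, assuming the page posts the timing attributes as a plain dict of millisecond timestamps (the beacon format and the function name are my assumptions, the attribute names are the ones from performance.timing):

```python
def summarize_timing(timing):
    """Turn absolute performance.timing timestamps (ms) into durations."""
    start = timing["navigationStart"]
    return {
        "ttfb_ms": timing["responseStart"] - start,
        "dom_ready_ms": timing["domContentLoadedEventEnd"] - start,
        "load_ms": timing["loadEventEnd"] - start,
    }


# Hypothetical beacon payload:
beacon = {
    "navigationStart": 1000,
    "responseStart": 1180,
    "domContentLoadedEventEnd": 1900,
    "loadEventEnd": 2500,
}
print(summarize_timing(beacon))
```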

showslow.com

Sergey Chernyshev, showslow
  • collects performance data from various services and shows it
  • dashboard-like overview and drill down into detail
  • help create a business case for performance optimizations

Cast - The Open Deployment Platform

Paul Querna, Rackspace
  • Deployment as a RESTful API
  • Service Management
    • Start, Stop, Restart
  • Version Management
    • Distribution of release
    • Upgrade
    • Rollback
  • Service Monitoring
    • Logfiles
    • Network Ports
    • Processes
  • Service Coordination
    • ?
  • Open Source
  • http://cast-project.org

Making the Web instant

Arvind Jain & Sreeram Ramachandran, Google
  • Still, most pages take 5 seconds to load
  • How to make it instant?
    • We humans are not as fast as computers
    • It takes about 300ms between onMouseOver and onClick
    • This time can be used to optimize loading by prefetching the content
  • Google search with Google Instant Pages
    • Predict & preload
    • Guess what the user will click and load it while the user still thinks about what to click next
    • Works only on Chrome so far
    • Chrome loads target in hidden frame and replaces frame
  • Instant everywhere
    • Chrome supports preloading pages when typing into the address bar
  • Everybody can use it, web page authors usually know more about the next likely page
    • Instruct the browser that this is the likely next page
  • Beware:
    • This creates more load on the client and on the server!
    • Accounting (ads, analytics) gets more difficult
      • don’t want to count hidden pages that the user never saw
      • Google submitted a proposal to the W3C for a Page Visibility API to determine if a page is actually visible to the user or still hidden
  • Benefit: Better and faster internet browsing experience
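
The accounting problem can be sketched on the server side: count a page load only once it has actually become visible. The event format (a load id plus a state string) and the state names used here are assumptions for illustration, loosely modeled on the idea of a page visibility state.

```python
def count_real_views(events):
    """Count each page load once, and only if it ever became visible.

    `events` is an iterable of (load_id, state) tuples in arrival order.
    """
    seen = set()
    for load_id, state in events:
        if state == "visible":
            seen.add(load_id)
    return len(seen)


events = [
    ("a", "prerender"), ("a", "visible"),  # preloaded, then clicked
    ("b", "prerender"),                    # preloaded, never shown
    ("c", "visible"), ("c", "hidden"), ("c", "visible"),  # tab switching
]
print(count_real_views(events))
```

Load "b" was prerendered but never shown, so it does not count as a page view.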

Wikia: Going Active/Active

Jason Cook, Wikia
  • Active/Active means read everywhere, write in master data center
  • Wikia built on top of MediaWiki
  • Story of Wikia with typical startup problems
  • What about earthquakes? Time To Recover?
  • FULL DR Site
    • In a nuclear bunker
    • In the middle of nowhere in Iowa

Automation for Success: Production begins in Development

Lee Thompson, CTO Travel/Transportation, HP
Damon Edwards, Co-founder DTO Solutions, DevOps Days organizer, DevOps Cafe
  • http://www.slideshare.net/dev2ops/velocity-2011-production-begins-in-development
  • Very good, especially if you believe that Chef and Puppet are not the end of innovation !!!
  • Webtone
    • Clouds
    • DevOps
    • Continuous Deployment/Delivery
    • Lean Startup
  • How to measure DevOps success
    • Alignment - how well do different parts of the organization work together
    • Quality - of processes and deliveries
    • Cycle Time
  • Risk tolerance
    • How much change do you want, how much risk can you tolerate?
    • "Move fast and break things. Unless you are breaking stuff, you are not moving fast enough." - Mark Zuckerberg
  • Webtone utilities
    • Reliable
    • Repeatable
    • Scalable
  • It all starts in Development!
    • But what do we tell them to do?
    • and how do we get them to do it?
  • Share ownership of availability
    • Developers must wear pagers (on-call)
    • Incident command training so everyone knows their roles
    • Notification mechanism?
    • Access provisioning (emergency access for people who usually don’t have it)?
  • Non-functional requirements are first class citizens
  • Strive for parity between dev & prod
    • should be really the same
    • test data fixtures for all environments
    • implement mock services for major infrastructure pieces for Developer users (usually Ops needs to help with this), typically authentication systems.
    • Continuous integration means integrate early
    • Use all the deployment, config and packaging tools in dev
  • Push config management discipline back to Dev
    • Dev is about creating variation, Ops is about eliminating variation
    • Augment deployment toolchain to support the variation
    • Do developers use the tools?
    • Accept config contributions and patches from dev
  • Packaging … it’s not just for the OS
    • high performing web operations organizations need to take change management seriously
    • Strict versioning
    • It’s about being idempotent
    • Transfer packaging responsibility to dev
    • Define the packaging constructs you will support
  • Config is code
    • if it’s code it needs to be managed like code
    • Should be transparent and identical SDLC in both dev and ops
    • Avoid or eliminate asymmetric release processes (config = software)
  • Tailor release artifacts to roles
    • "Small teams make better software"
    • One team stuck should not prevent other teams from releasing (org coupling)
    • Large codebases suffer software entropy effects
    • Build an infrastructure that can reliably manage lots of smaller artifacts
    • Org conflict is a good time to suggest breaking up a codebase into separate concerns
  • Standard management vocabulary
    • Consistent and expected management behaviour
    • Across components and releases
    • "start, stop, status, update, install …"
  • Rollback
    • Rollback that works
    • Tested and proven
    • Test rollback for each release
  • Standard metrics abstractions
    • Dev surface metrics to Ops
    • Use a standard framework
    • https://github.com/codahale/metrics
    • Use standard types (gauge, counter, timer …)
    • Ops knows what to expect and how to visualize
  • Push test ownership to the edges
    • QA = Quality Assurance
    • QA writing tests = bottleneck and avoiding responsibility
    • Test Driven Development
    • Test Driven Operations (yes, you too!)
    • Bottom line: Everyone owns quality
  • Test outside of the box
    • Crowd test, A/B test
    • Simulation
  • Continuous Delivery
    • Delivery Pipelines
    • Continuous Deployment
    • Don’t be too dogmatic, a hybrid model is also good
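
The point about standard metric types (gauge, counter, timer) from the list above can be illustrated with a small sketch. The class and method names are my own and are not the API of the metrics library linked above. The idea is only that Dev exposes a few well-known types so that Ops knows what to expect.

```python
import time
from contextlib import contextmanager


class Counter:
    """Monotonically increasing count, e.g. requests served."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class Gauge:
    """Instantaneous value read on demand, e.g. current queue length."""

    def __init__(self, read):
        self._read = read

    @property
    def value(self):
        return self._read()


class Timer:
    """Records how long operations take, in seconds."""

    def __init__(self):
        self.durations = []

    @contextmanager
    def measure(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations.append(time.perf_counter() - start)

    @property
    def mean(self):
        if not self.durations:
            return 0.0
        return sum(self.durations) / len(self.durations)


# Usage with hypothetical metric names:
queue = ["job-1", "job-2"]
metrics = {
    "requests": Counter(),
    "queue_length": Gauge(lambda: len(queue)),
    "request_time": Timer(),
}
with metrics["request_time"].measure():
    metrics["requests"].inc()
```

Because every component surfaces the same three types, a dashboard can render any metric without knowing anything about the application behind it.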

DevOps Metrics: Measuring the devops gap

Patrick Debois, Andrew Shafer