Velocity 2011 - Part 1: Workshops

My notes on the workshop day at the Velocity Conference.

A lot of Chef stuff, but of course Opscode was a sponsor. The real gems were Decisions in the Face of Uncertainty and Advanced Postmortem Fu and Human Error 101.

Read also about the first and second day.

Workshop: OpenStack

  • Compute Nodes (hardware and virtualization agnostic)
  • Storage Nodes ("Swift") for HTTP Object Storage
  • REST APIs: OpenStack API, EC2 Compatibility Module
  • Open & Modular design
  • Storage
    • http://swift.openstack.org/
    • Distributed Shared-nothing Architecture
    • HTTP only
    • Availability Zones provide independent outage scenarios
    • Put data into >=3 different availability zones
    • Swift is independent of OpenStack and can be used stand-alone
  • OpenStack installation
    • Using Chef to script the installation
    • There are many cookbooks for automated OpenStack setups
    • Mostly Chef advertising, no detailed setup & installation examples :-(
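
The "put data into >=3 different availability zones" rule can be illustrated with a small placement sketch. This is not Swift's actual ring implementation; it is a rendezvous-hashing-style illustration, and the function name `pick_zones` and the zone names are hypothetical.

```python
import hashlib


def pick_zones(object_name, zones, replicas=3):
    """Choose `replicas` distinct zones for an object, deterministically.

    Assumes `zones` contains unique zone names.
    """
    if len(zones) < replicas:
        raise ValueError("need at least as many zones as replicas")
    # Rank every zone by a hash of (zone, object name) and keep the first
    # `replicas` entries. Each object gets its own ranking, so load spreads
    # across zones while every replica lands in a different zone.
    ranked = sorted(
        zones,
        key=lambda zone: hashlib.md5(f"{zone}/{object_name}".encode()).hexdigest(),
    )
    return ranked[:replicas]


# Hypothetical zone names, for illustration only.
ZONES = ["zone-a", "zone-b", "zone-c", "zone-d", "zone-e"]
placement = pick_zones("photos/cat.jpg", ZONES)
```

Because every replica sits in a different zone, losing a single zone still leaves two copies of each object reachable.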

Workshop: Scale Dirty

  • Yet Another Chef Advertising Session

Workshop: Decisions in the Face of Uncertainty

Just Enough Statistics to be Dangerous
  • John Rauser, Principal Quantitative Engineer, Amazon
  • Any numerical information without a stated precision is worthless. All numbers derived from the real world are actually estimates
    • Example: How old is Jeff Bezos?
    • Wrong Answers: 47, 55, …
    • Correct Answers: 42 - 55, 40 - 50, 45 +/- 5
    • Always give a range unless forced to give precise numbers!
    • BTW, Jeff is 47
  • Statistics is a method to calibrate our estimations
  • Measurements are a tool to reduce uncertainty
  • Classroom exercise
    • Participants estimate the answer to 10 questions
    • What would the distribution of right answers (x out of 10 right) look like if the probability of a right answer were the same for each question?
    • The binomial distribution is the most important statistical function that helps us here, see http://www.wolframalpha.com/input/?i=binomial+distribution for the definition
    • The audience shows that it is a badly calibrated estimator for this task. On average people get 4 answers right; according to the binomial distribution the average should be much higher.
    • Conclusion: For each question we have to find a suitable way to calibrate the estimates. → Statistical Inference
  • John told his personal story of how he became interested in statistics and data analysis
  • Before the advent of computers, analytical statistics was the only way to reach results that require lots of calculations.
  • With computers we can use direct simulation to find a simple answer to the question "what happens if we run the experiment many many times"?
  • Statistical Inference
    • Randomize data production, find a random process that generates the data
    • Repeat by simulation
    • Reject any model that does not agree with the data
  • Decisions in the face of uncertainty by the example of estimating the number of business cards in a stack.
    • You had to be there, nice rollercoaster between math, statistics and life experience as a data analyst
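
The classroom exercise and the "run the experiment many many times" idea can be sketched side by side: the analytical binomial distribution and a direct simulation of the same experiment. The notes do not record the probability of a right answer per question, so `P_RIGHT = 0.9` below is purely an assumption for illustration, as are the function names.

```python
import random
from math import comb

N_QUESTIONS = 10
P_RIGHT = 0.9  # assumed per-question probability, not taken from the talk


def binomial_pmf(k, n, p):
    """Analytical probability of exactly k right answers out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def simulate(n, p, trials, seed=0):
    """Direct simulation: run the n-question quiz `trials` times and
    return the observed frequency of each score from 0 to n."""
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(trials):
        right = sum(rng.random() < p for _ in range(n))
        counts[right] += 1
    return [count / trials for count in counts]


analytic = [binomial_pmf(k, N_QUESTIONS, P_RIGHT) for k in range(N_QUESTIONS + 1)]
simulated = simulate(N_QUESTIONS, P_RIGHT, trials=20_000)
expected_right = sum(k * prob for k, prob in enumerate(analytic))
```

Under this assumed probability, a score as low as 4 out of 10 is very unlikely, which is the sense in which the audience turned out to be badly calibrated. The simulated frequencies approach the analytical values as the number of trials grows.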

Workshop: Advanced Postmortem Fu and Human Error 101

  • John Allspaw, etsy.com
  • The "System" you operate also contains people, not only hardware & software
  • A postmortem relies on having good data to analyze
  • Each graph needs to be put into context by marking important events (e.g. deployments)
  • Rich internal communications (IRC, Blog, Twitter) act as a flight recorder, everything is timestamped
  • Define and discuss various crisis patterns
  • Human error is an inevitable by-product of strained complex systems.
  • pre-mortems are better than post-mortems: How to prepare for new features
    • contingency planning
    • what could go wrong?
  • Just culture
    • How to live with and embrace human error
    • The culture required to perform blameless post-mortems
    • Problem: Negligence is often found during an outage, and usually the amount of negligence found corresponds to the severity of the outage
    • Holding people accountable != Blaming people
    • No bad apples, only bad theories of error
    • Increase Accountability by supporting learning
    • Organizational Roots: Accountability = Responsibility + Requisite Authority
  • The culture of an organization has great influence
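
The "flight recorder" idea, where every timestamped source (deployments, IRC, blog, Twitter) feeds one timeline, can be sketched as a merge of sorted event streams. The event tuples and messages below are invented for illustration and do not come from the talk.

```python
import heapq
from datetime import datetime


def merge_timeline(*sources):
    """Merge already time-sorted streams of (timestamp, source, message)
    events into a single chronological timeline."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


# Invented example events, for illustration only.
deploys = [
    (datetime(2011, 6, 14, 10, 0), "deploy", "new release pushed"),
]
irc = [
    (datetime(2011, 6, 14, 9, 58), "irc", "starting the deploy now"),
    (datetime(2011, 6, 14, 10, 3), "irc", "error rate is climbing"),
]

timeline = merge_timeline(deploys, irc)
for timestamp, source, message in timeline:
    print(timestamp.isoformat(), source, message)
```

The same merged timeline can supply the event markers that put each graph into context.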

Workshop: Hadoop

  • Hadoop: Open Source Storage and Processing Engine
    • MapReduce for processing
    • Hadoop Distributed File System (HDFS) for distributed storage
    • Hadoop separates distributed system fault-tolerance code from application logic
  • Gotchas:
    • Configuration and version divergence within a cluster. This can lead to hard-to-catch bugs.
    • Cluster state: Is it up? Is the network partitioned?
  • Cloudera Service and Configuration Manager (SCM)
    • Available to Cloudera customers
    • integrated configuration and service management for hadoop services
    • Process supervision, what processes are running where
    • Configuration management, with hadoop-specific dependencies
    • No plans right now to open source the SCM!
  • Related work
  • Hadoop planning tips:
    • NameNode and JobTracker often on beefier hardware
    • Configure disks as JBOD
    • Gigabit Ethernet
    • Top of rack switches
    • Avoid virtualization
  • Hadoop installation tips:
    • CentOS 5 / RHEL 5 most common
    • Oracle JVM, bugs are known and worked around
    • Mount noatime
    • Adjust swappiness
    • Use Cloudera’s Distribution (CDH3), install as .rpm or .deb
      • Brings all relevant components for the Hadoop ecosystem in a tested and compatible fashion
      • Hue, Oozie, Hive, Flume, Sqoop, Pig, HBase, Zookeeper
  • Hadoop configuration tips:
    • Use source control
    • XML files *-site.xml and hadoop-env.sh
    • most important config items:
      • dfs.name.dir NameNode. Typically two volumes + NFS (mounted correctly)
      • dfs.data.dir DataNodes. One directory per physical hard disk
      • mapred.tasktracker.map.tasks.maximum Max number of maps per machine (1 per core)
      • mapred.tasktracker.reduce.tasks.maximum Max number of reduces per machine (1/3 per core)
    • Hadoop requires DNS with correct reverse lookups.
    • IPv6: Everyone turns it off
    • Watch out for a Secondary NameNode that is not checkpointing: the logs grow forever.
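
The slot-count rules of thumb from the configuration tips (1 map per core, 1/3 reduce per core) can be turned into a small generator for the corresponding *-site.xml properties. This is only a sketch: rounding the reduce count down and never going below 1 are assumptions, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def slot_counts(cores):
    """Rule of thumb from the notes: 1 map slot per core, 1/3 reduce slot
    per core. Rounding down with a minimum of 1 is an assumption."""
    maps = max(1, cores)
    reduces = max(1, cores // 3)
    return maps, reduces


def mapred_site_xml(cores):
    """Render the two TaskTracker slot limits as a Hadoop-style
    <configuration> document and return it as a string."""
    maps, reduces = slot_counts(cores)
    config = ET.Element("configuration")
    for name, value in [
        ("mapred.tasktracker.map.tasks.maximum", maps),
        ("mapred.tasktracker.reduce.tasks.maximum", reduces),
    ]:
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(config, encoding="unicode")


xml_text = mapred_site_xml(cores=8)
print(xml_text)
```

Generating the file from the core count fits the "use source control" tip: the script and its output can both be versioned, which helps against configuration divergence within a cluster.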