2011-06-14

Velocity 2011 - Part 1: Workshops

My notes on the workshop day at the Velocity Conference.

A lot of Chef stuff, but of course Opscode was a sponsor. Real gems were Decisions in the Face of Uncertainty and Advanced Postmortem Fu and Human Error 101.

Also read about the first and second day.

Workshop: OpenStack

  • Compute Nodes (hardware and virtualization agnostic)
  • Storage Nodes ("Swift") for HTTP object storage; "Glance" provides the image registry
  • REST APIs: OpenStack API, EC2 Compatibility Module
  • Open & Modular design
  • Storage
    • http://swift.openstack.org/
    • Distributed Shared-nothing Architecture
    • HTTP only
    • Availability Zones provide independent outage scenarios
    • Put data into >=3 different availability zones
    • Swift is independent of OpenStack and can be used stand-alone (a minimal HTTP client sketch follows this list)
  • OpenStack installation
    • Using Chef to script the installation
    • There are many cookbooks for automated OpenStack setups
    • Mostly Chef advertising, no detailed setup & installation examples :-(
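
Since Swift speaks nothing but HTTP, any HTTP client is a Swift client. Here is a minimal sketch using only Python's standard library; the endpoint, account, and credentials are invented placeholders for a tempauth-style setup, not details from the workshop.

    # Minimal Swift client sketch (hypothetical tempauth endpoint and
    # credentials). Swift is HTTP only: auth, PUT and GET are plain requests.
    import urllib.request

    AUTH_URL = "http://swift.example.com:8080/auth/v1.0"  # placeholder

    # 1. Authenticate; tempauth answers with a storage URL and a token
    #    in the response headers.
    req = urllib.request.Request(AUTH_URL, headers={
        "X-Auth-User": "demo:demouser",  # placeholder account:user
        "X-Auth-Key": "demokey",         # placeholder key
    })
    with urllib.request.urlopen(req) as resp:
        storage_url = resp.headers["X-Storage-Url"]
        auth = {"X-Auth-Token": resp.headers["X-Auth-Token"]}

    # 2. Create a container, then PUT an object into it.
    urllib.request.urlopen(urllib.request.Request(
        f"{storage_url}/demo", method="PUT", headers=auth))
    urllib.request.urlopen(urllib.request.Request(
        f"{storage_url}/demo/hello.txt", data=b"hello swift",
        method="PUT", headers=auth))

    # 3. GET it back over plain HTTP.
    req = urllib.request.Request(f"{storage_url}/demo/hello.txt", headers=auth)
    print(urllib.request.urlopen(req).read())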

Workshop: Scale Dirty

  • Yet Another Chef Advertising Session

Workshop: Decisions in the Face of Uncertainty

Just Enough Statistics to be Dangerous
  • John Rauser, Principal Quantitative Engineer, Amazon
  • Any numerical information without a stated precision is worthless. All numbers derived from the real world are actually estimates
    • Example: How old is Jeff Bezos?
    • Wrong Answers: 47, 55, …
    • Correct Answers: 42 - 55, 40 - 50, 45 +/- 5
    • Always give a range unless forced to give precise numbers!
    • BTW, Jeff is 47
  • Statistics is a method to calibrate our estimates
  • Measurements are a tool to reduce uncertainty
  • Classroom exercise
    • Participants estimate the answer to 10 questions
    • What would the distribution of right answers (x out of 10 right) look like if the probability of a right answer were equal for each question?
    • The binomial distribution is the most important statistical function that helps us here, see http://www.wolframalpha.com/input/?i=binomial+distribution for the definition
    • The audience turned out to be a badly calibrated estimator for this task: on average people got 4 answers right, while according to the binomial distribution the average should be much, much higher.
    • Conclusion: For each question we have to find a suitable way to calibrate the estimates. → Statistical Inference (a simulation sketch follows this list)
  • John told his personal story of how he became interested in statistics and data analysis
  • Before the advent of computers, analytical statistics was the only way to reach results that require lots of calculations.
  • With computers we can use direct simulation to find a simple answer to the question "what happens if we run the experiment many many times"?
  • Statistical Inference
    • Randomize data production, find a random process that generates the data
    • Repeat by simulation
    • Reject any model that does not agree with the data
  • Decisions in the face of uncertainty by the example of estimating the number of business cards in a stack.
    • You had to be there: a nice rollercoaster between math, statistics and life experience as a data analyst
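
To make the calibration point concrete, here is a direct-simulation sketch of the classroom exercise (my own illustration, not John's code): assume each of the 10 range answers is supposed to contain the truth with probability p, and simulate the resulting binomial distribution of hits.

    # Direct simulation of the classroom exercise: 10 questions, each
    # answered with a range that contains the truth with probability p.
    # The histogram of "hits out of 10" is the binomial distribution.
    import random
    from collections import Counter

    def simulate(p=0.9, questions=10, trials=100_000):
        counts = Counter(
            sum(random.random() < p for _ in range(questions))
            for _ in range(trials))
        for hits in range(questions + 1):
            share = counts[hits] / trials
            print(f"{hits:2d} right: {share:6.1%} {'#' * round(share * 60)}")

    simulate()

With p = 0.9 (the confidence level such quizzes typically ask for; the talk did not state the exact framing) nearly all of the mass sits at 8 to 10 correct answers. An audience averaging 4 out of 10 is therefore wildly overconfident: the stated ranges are much narrower than they should be.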

Workshop: Advanced Postmortem Fu and Human Error 101

  • John Allspaw, etsy.com
  • The "System" you operate also contains people, not only hardware & software
  • A postmortem relies on having good data to analyze
  • Each graph needs to be put into context by marking important events (e.g. deployments)
  • Rich internal communications (IRC, Blog, Twitter) act as a flight recorder, everything is timestamped
  • Define and discuss various crisis patterns
  • Human error is an inevitable by-product of strained complex systems.
  • pre-mortems are better than post-mortems: How to prepare for new features
    • contingency planning
    • what could go wrong?
  • Just culture
    • How to live with and embrace human error
    • The culture required to perform blameless post-mortems
    • Problem: Negligence is often "found" during an outage, and the amount of negligence found usually corresponds with the severity of the outage
    • Holding people accountable != Blaming people
    • No bad apples, only bad theories of error
    • Increase Accountability by supporting learning
    • Organizational Roots: Accountability = Responsibility + Requisite Authority
  • The culture of an organization has great influence

Workshop: Hadoop

  • Hadoop: Open Source Storage and Processing Engine
    • MapReduce for processing (a minimal streaming-style word count follows this list)
    • Hadoop Distributed File System (HDFS) for distributed storage
    • Hadoop separates distributed system fault-tolerance code from application logic
  • Gotchas:
    • Configuration and version divergence within a cluster. This can lead to hard-to-catch bugs.
    • Cluster state: is the cluster up? Are there network partitions?
  • Cloudera Service and Configuration Manager (SCM)
    • Available to Cloudera customers
    • Integrated configuration and service management for Hadoop services
    • Process supervision: what processes are running where
    • Configuration management, with Hadoop-specific dependencies
    • No plans right now to open source the SCM!
  • Related work
  • Hadoop planning tips:
    • NameNode and JobTracker often on beefier hardware
    • Configure disks as JBOD
    • Gigabit Ethernet
    • Top of rack switches
    • Avoid virtualization
  • Hadoop installation tips:
    • CentOS 5 / RHEL 5 most common
    • Oracle JVM; its bugs are known and worked around
    • Mount filesystems with noatime
    • Adjust swappiness
    • Use Cloudera’s Distribution (CDH3), install as .rpm or .deb
      • Brings all relevant components for the Hadoop ecosystem in a tested and compatible fashion
      • Hue, Oozie, Hive, Flume, Sqoop, Pig, HBase, Zookeeper
  • Hadoop configuration tips:
    • Use source control
    • XML files *-site.xml and hadoop-env.sh
    • Most important config items (a sample *-site.xml sketch follows this list):
      • dfs.name.dir NameNode. Typically two local volumes + NFS (mounted correctly)
      • dfs.data.dir DataNodes. One directory per physical hard disk
      • mapred.tasktracker.map.tasks.maximum Max number of map tasks per machine (1 per core)
      • mapred.tasktracker.reduce.tasks.maximum Max number of reduce tasks per machine (1/3 per core)
    • Hadoop requires DNS with correct reverse lookups.
    • IPv6: Everyone turns it off
    • Watch out for a secondary NameNode that is not checkpointing: the edit logs then grow forever.
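
To illustrate the point above that Hadoop separates fault-tolerance code from application logic: in the Hadoop Streaming style, a complete word count is just a mapper and a reducer reading stdin. This is my own minimal sketch, not workshop material; Hadoop guarantees the reducer sees its input sorted by key.

    # wordcount.py -- Hadoop Streaming style word count sketch.
    # Hadoop handles distribution, sorting and retries; the
    # application logic is only these few lines.
    import sys
    from itertools import groupby

    def mapper(stdin):
        for line in stdin:
            for word in line.split():
                print(f"{word}\t1")  # emit (word, 1)

    def reducer(stdin):
        pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(n) for _, n in group)}")

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)

It can even be tried without a cluster, because a shell pipeline mimics the framework: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce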
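
And a sketch of how the config items above land in the XML files, following the usual CDH3 split (HDFS settings in hdfs-site.xml, MapReduce settings in mapred-site.xml); all paths and slot counts are invented placeholders.

    <!-- hdfs-site.xml (placeholder paths) -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <!-- NameNode metadata: two local volumes plus an NFS mount -->
        <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <!-- DataNodes: one directory per physical disk -->
        <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
      </property>
    </configuration>

    <!-- mapred-site.xml (slot counts for a hypothetical 8-core box) -->
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>8</value>  <!-- roughly one per core -->
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>3</value>  <!-- roughly a third of the cores -->
      </property>
    </configuration>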