<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Web Admin Blog &#187; hadoop</title>
	<atom:link href="http://www.webadminblog.com/index.php/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.webadminblog.com</link>
	<description>Real Web Admins.  Real World Experience.</description>
	<lastBuildDate>Wed, 25 May 2011 03:02:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Velocity 2009 &#8211; Hadoop Operations: Managing Big Data Clusters</title>
		<link>http://www.webadminblog.com/index.php/2009/07/01/velocity-2009-hadoop-operations-managing-big-data-clusters/</link>
		<comments>http://www.webadminblog.com/index.php/2009/07/01/velocity-2009-hadoop-operations-managing-big-data-clusters/#comments</comments>
		<pubDate>Wed, 01 Jul 2009 21:28:16 +0000</pubDate>
		<dc:creator>Ernest</dc:creator>
				<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Conferences]]></category>
		<category><![CDATA[Velocity 2009]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[velocity]]></category>
		<category><![CDATA[velocityconf]]></category>
		<category><![CDATA[velocityconf09]]></category>

		<guid isPermaLink="false">http://www.webadminblog.com/?p=255</guid>
		<description><![CDATA[Hadoop Operations: Managing Big Data Clusters (see link on that page for preso) was given by Jeff Hammerbacher of Cloudera. Other good references - book: "Hadoop: The Definitive Guide" preso: hadoop cluster management from USENIX 2009 Hadoop is an Apache project inspired by Google's infrastructure; it's software for programming warehouse-scale computers. It has recently been [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.oreilly.com/velocity2009/public/schedule/detail/7624" target="_blank">Hadoop Operations: Managing Big Data Clusters</a> (see link on that page for preso) was given by <a href="http://jeffhammerbacher.com/" target="_blank">Jeff Hammerbacher</a> of <a href="http://www.cloudera.com/" target="_blank">Cloudera</a>.</p>
<p>Other good references -<br />
book: "<a href="http://oreilly.com/catalog/9780596521974/" target="_blank">Hadoop: The Definitive Guide</a>"<br />
preso: <a href="http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/Hadoop-USENIX09.pdf" target="_blank">hadoop cluster management from USENIX 2009</a></p>
<p><a href="http://hadoop.apache.org/" target="_blank">Hadoop</a> is an Apache project inspired by Google's infrastructure; it's software for programming warehouse-scale computers.</p>
<p>It has recently been split into three main subprojects - HDFS, MapReduce, and Hadoop Common - and sports an ecosystem of various smaller subprojects (hive, etc.).</p>
<p>Usually a hadoop cluster is a mess of stock 1 RU servers with 4x1TB SATA disks in them.  "I like my servers like I like my women - cheap and dirty," Jeff did not say.</p>
<p>HDFS:</p>
<ul>
<li>Pools servers into a single hierarchical namespace</li>
<li>It's designed for large files, written once/read many times</li>
<li>It does checksumming, replication, compression</li>
<li>Access is from Java, C, command line, etc.  Not usually mounted at the OS level.</li>
</ul>
<p>MapReduce:</p>
<ul>
<li>Is a fault-tolerant data layer and API for parallel data processing</li>
<li>Has a key/value pair model</li>
<li>Access is via Java, C++, streaming (for scripts), SQL (Hive), etc.</li>
<li>Pushes work out to the data</li>
</ul>
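<p>To make the key/value model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, run_job) are made up for illustration, and a real job would run the map step on the nodes that hold the data.</p>

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit one (key, value) pair per word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: collapse the values for one key into a single result.
    return key, sum(values)


def run_job(lines):
    # Hypothetical driver that runs the three steps in order, in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data big clusters", "data nodes"])
print(counts)
```

<p>The same word-count shape is what you would express through Java, streaming scripts or Hive; only the plumbing changes.</p>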
<p>Subprojects:</p>
<ul>
<li>Avro (serialization)</li>
<li>HBase (like Google BigTable)</li>
<li>Hive (SQL interface)</li>
<li>Pig (language for dataflow programming)</li>
<li>ZooKeeper (coordination for distrib. systems)</li>
</ul>
<p>Facebook used scribe (log aggregation tool) to pull a big wad of info into hadoop, published it out to mysql for user dash, to oracle rac for internal...<br />
Yahoo! uses it too.</p>
<p>Sample projects hadoop would be good for - log/message warehouse, database archival store, search team projects (autocomplete), targeted web crawls...<br />
As boxes you can use unused desktops, retired db servers, amazon ec2...</p>
<p>Tools they use to make hadoop include subversion/jira/ant/ivy/junit/hudson/javadoc/forrest<br />
It uses an Apache 2.0 license</p>
<p>Good configs for hadoop:</p>
<ul>
<li>use 7200 rpm sata, ecc ram, 1U servers</li>
<li>use linux, ext3 or maybe xfs filesystem, with noatime</li>
<li>JBOD disk config, no raid</li>
<li>Java 6 update 14 or later</li>
</ul>
<p>To manage it -</p>
<p>unix utes: sar, iostat, iftop, vmstat, nfsstat, strace, dmesg, friends</p>
<p>java utes: jps, jstack, jconsole<br />
Get the rpm!  www.cloudera.com/hadoop</p>
<p>config: my.cloudera.com<br />
modes - standalone, pseudo-distrib, distrib<br />
"It's nice to use dsh, cfengine/puppet/bcfg2/chef for config management across a cluster; maybe use scribe for centralized logging"</p>
<p><em>I love hearing what tools people are using, that's mainly how I find out about new ones!</em></p>
<p>Common hadoop problems:</p>
<ul>
<li> "It's almost always DNS" - use hostnames</li>
<li> open ports</li>
<li> distrib ssh keys (expect)</li>
<li> write permissions</li>
<li> make sure you're using all the disks</li>
<li> don't share NFS mounts for large clusters</li>
<li>set JAVA_HOME to new jvm (stick to sun's)</li>
</ul>
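<p>Since DNS tops the list, a quick sanity check is to confirm that each node's hostname resolves and that the address maps back to the same short name. The sketch below is my own illustration, not something from the talk; check_host is a hypothetical helper, and the resolvers are passed in as parameters so it can be tried without touching the network.</p>

```python
import socket


def check_host(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems for one hostname (empty if it looks fine)."""
    try:
        address = forward(hostname)
    except OSError as err:
        return [f"{hostname}: forward lookup failed ({err})"]
    try:
        # gethostbyaddr returns (primary name, aliases, addresses).
        reverse_name = reverse(address)[0]
    except OSError as err:
        return [f"{hostname}: reverse lookup of {address} failed ({err})"]
    # Compare short names only, so node1 matches node1.example.com.
    if reverse_name.split(".")[0] != hostname.split(".")[0]:
        return [f"{hostname}: resolves to {address}, which maps back to {reverse_name}"]
    return []
```

<p>Running check_host over every hostname in the cluster from each node should come back empty everywhere before you start blaming hadoop itself.</p>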
<h3>HDFS In Depth</h3>
<p>1.  NameNode (master)<br />
VERSION file shows data structs, filesystem image (in memory) and edit log (persisted) - if they change, painful upgrade</p>
<p>2.  Secondary NameNode (aka checkpoint node) - checkpoints the FS image and then truncates edit log, usually run on a sep node<br />
The new backup node in 0.21 removes the need to write to an NFS mount for HA</p>
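<p>The checkpoint idea can be sketched in a few lines: replay the edit log onto a copy of the filesystem image, then truncate the log. This is a toy model only. The image here is just a set of paths and the edits are hypothetical (operation, path) tuples; the real fsimage and edit log formats are far richer.</p>

```python
def apply_edit(image, edit):
    # Apply one logged operation to the in-memory image.
    op, path = edit
    if op == "create":
        image.add(path)
    elif op == "delete":
        image.discard(path)


def checkpoint(image, edit_log):
    # Replay the edit log onto a copy of the image, then truncate the log.
    new_image = set(image)
    for edit in edit_log:
        apply_edit(new_image, edit)
    edit_log.clear()
    return new_image


image = {"/logs/a"}
edit_log = [("create", "/logs/b"), ("delete", "/logs/a")]
image = checkpoint(image, edit_log)
print(sorted(image), edit_log)
```

<p>The point of checkpointing regularly is that the log stays short, so a restart has less to replay.</p>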
<p>3.  DataNode (workers)<br />
stores data in local fs<br />
stores data in blk_&lt;id&gt; files, round robins through dirs<br />
heartbeat to namenode<br />
raw socket to serve to client</p>
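<p>The round-robin placement of block files across data directories looks roughly like this. It is a simplified sketch under my own assumptions: BlockPlacer and the /data/N directories are invented names, and the real DataNode also takes things like free space into account.</p>

```python
import itertools
import posixpath


class BlockPlacer:
    """Hypothetical helper that hands out block file paths round robin."""

    def __init__(self, data_dirs):
        # Cycle endlessly through the configured data directories.
        self._dirs = itertools.cycle(data_dirs)

    def path_for(self, block_id):
        # Each new block lands in the next directory in the cycle.
        return posixpath.join(next(self._dirs), f"blk_{block_id}")


placer = BlockPlacer(["/data/1", "/data/2", "/data/3"])
paths = [placer.path_for(block_id) for block_id in (101, 102, 103, 104)]
print(paths)
```

<p>With one directory per physical disk (the JBOD setup recommended above), this spreads writes over all the spindles.</p>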
<p>4.  Client (Java HDFS lib)<br />
other stuff (libhdfs) more unstable</p>
<p>hdfs operator utilities</p>
<ul>
<li> safe mode - when it starts up</li>
<li> fsck - hadoop version</li>
<li> dfsadmin</li>
<li> block scanner - runs every 3 wks, has web interface</li>
<li> balancer - examines ratio of used to total capacity across the cluster</li>
<li> har (like tar) archive - bunch up smaller files</li>
<li> distcp - parallel copy utility (uses mapreduce) for big loads</li>
<li> quotas</li>
</ul>
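<p>The balancer's "ratio of used to total capacity" check can be illustrated as follows. This is a sketch of the idea, not the balancer's actual code: classify_nodes, the node names and the 10% threshold are assumptions chosen for the example.</p>

```python
def utilization(used, capacity):
    # Fraction of capacity in use, between 0.0 and 1.0.
    return used / capacity


def classify_nodes(nodes, threshold=0.10):
    # nodes maps a node name to a (used, capacity) pair in the same unit.
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster = utilization(total_used, total_capacity)
    over, under = [], []
    for name, (used, capacity) in sorted(nodes.items()):
        node = utilization(used, capacity)
        if node - cluster > threshold:
            over.append(name)  # candidate to move blocks away from
        elif cluster - node > threshold:
            under.append(name)  # candidate to receive blocks
    return cluster, over, under


nodes = {"dn1": (900, 1000), "dn2": (500, 1000), "dn3": (100, 1000)}
cluster, over, under = classify_nodes(nodes)
print(cluster, over, under)
```

<p>Nodes within the threshold of the cluster-wide ratio are left alone; the rest are where a balancing pass would move blocks from or to.</p>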
<p>has users, groups, permissions - including x, though nothing in HDFS is executable, so x only matters for dirs<br />
hadoop has some access trust issues - used through gateway cluster or in trusted env<br />
audit logs - turn on in log4j.properties</p>
<p>has loads of Web UIs - on namenode go to /metrics, /logLevel, /stacks<br />
non-hdfs access - HDFS proxy to http, or thriftfs<br />
has trash (.Trash in home dir) - turn it on</p>
<p>includes benchmarks - testdfsio, nnbench</p>
<p>Common HDFS problems</p>
<ul>
<li> disk capacity, esp due to log file sizes - crank up reserved space</li>
<li> slow but not dead disks, and NICs flapping down to slow mode</li>
<li> checkpointing and backing up metadata - monitor that it happens hourly</li>
<li> losing write pipeline for long lived writes - redo every hour is recommended</li>
<li> upgrades</li>
<li>many small files</li>
</ul>
<h3>MapReduce</h3>
<p>use Fair Share or Capacity scheduler<br />
distributed cache<br />
jobcontrol for ordering</p>
<p>Monitoring - They use ganglia, jconsole, nagios and canary jobs for functionality</p>
<p>Question - how much admin resource would you need for hadoop?  Answer - Facebook ops team had 20% of 2 guys hadooping, estimate you can use 1 person/100 nodes</p>
<p>He also notes that this preso and maybe more are on <a href="http://www.slideshare.net/jhammerb" target="_blank">slideshare under "jhammerb."</a></p>
<p><em>I thought this presentation was very complete and bad ass, and I may have some use cases that hadoop would be good for coming up!</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.webadminblog.com/index.php/2009/07/01/velocity-2009-hadoop-operations-managing-big-data-clusters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

