{"id":255,"date":"2009-07-01T16:28:16","date_gmt":"2009-07-01T21:28:16","guid":{"rendered":"http:\/\/www.webadminblog.com\/?p=255"},"modified":"2009-07-01T16:32:55","modified_gmt":"2009-07-01T21:32:55","slug":"velocity-2009-hadoop-operations-managing-big-data-clusters","status":"publish","type":"post","link":"https:\/\/www.webadminblog.com\/index.php\/2009\/07\/01\/velocity-2009-hadoop-operations-managing-big-data-clusters\/","title":{"rendered":"Velocity 2009 &#8211; Hadoop Operations: Managing Big Data Clusters"},"content":{"rendered":"<p><a href=\"http:\/\/en.oreilly.com\/velocity2009\/public\/schedule\/detail\/7624\" target=\"_blank\">Hadoop Operations: Managaing Big Data Clusters<\/a> (see link on that page for preso) was given by <a href=\"http:\/\/jeffhammerbacher.com\/\" target=\"_blank\">Jeff Hammerbacher<\/a> of <a href=\"http:\/\/www.cloudera.com\/\" target=\"_blank\">Cloudera<\/a>.<\/p>\n<p>Other good references &#8211;<br \/>\nbook: &#8220;<a href=\"http:\/\/oreilly.com\/catalog\/9780596521974\/\" target=\"_blank\">Hadoop: The Definitive Guide<\/a>&#8221;<br \/>\npreso: <a href=\"http:\/\/wiki.apache.org\/hadoop-data\/attachments\/HadoopPresentations\/attachments\/Hadoop-USENIX09.pdf\" target=\"_blank\">hadoop cluster management from USENIX 2009<\/a><\/p>\n<p><a href=\"http:\/\/hadoop.apache.org\/\" target=\"_blank\">Hadoop<\/a> is an Apache project inspired by Google&#8217;s infrastructure; it&#8217;s software for programming warehouse-scale computers.<\/p>\n<p>It has recently been split into three main subprojects &#8211; HDFS, MapReduce, and Hadoop Common &#8211; and sports an ecosystem of various smaller subprojects (hive, etc.).<\/p>\n<p>Usually a hadoop cluster is a mess of stock 1 RU servers with 4x1TB SATA disks in them.\u00a0 &#8220;I like my servers like I like my women &#8211; cheap and dirty,&#8221; Jeff did not say.<\/p>\n<p>HDFS:<\/p>\n<ul>\n<li>Pools servers into a single hierarchical namespace<\/li>\n<li>It&#8217;s designed for large 
files, written once\/read many times<\/li>\n<li>It does checksumming, replication, compression<\/li>\n<li>Access is from Java, C, command line, etc.\u00a0 Not usually mounted at the OS level.<\/li>\n<\/ul>\n<p>MapReduce:<\/p>\n<ul>\n<li>Is a fault-tolerant data layer and API for parallel data processing<\/li>\n<li>Has a key\/value pair model<\/li>\n<li>Access is via Java, C++, streaming (for scripts), SQL (Hive), etc<\/li>\n<li>Pushes work out to the data<\/li>\n<\/ul>\n<p>Subprojects:<\/p>\n<ul>\n<li>Avro (serialization)<\/li>\n<li>HBase (like Google BigTable)<\/li>\n<li>Hive (SQL interface)<\/li>\n<li>Pig (language for dataflow programming)<\/li>\n<li>ZooKeeper (coordination for distrib. systems)<\/li>\n<\/ul>\n<p>Facebook used scribe (log aggregation tool) to pull a big wad of info into hadoop, published it out to mysql for user dash, to oracle rac for internal&#8230;<br \/>\nYahoo! uses it too.<\/p>\n<p>Sample projects hadoop would be good for &#8211; log\/message warehouse, database archival store, search team projects (autocomplete), targeted web crawls&#8230;<br \/>\nAs boxes you can use unused desktops, retired db servers, amazon ec2&#8230;<\/p>\n<p>Tools they use to make hadoop include subversion\/jira\/ant\/ivy\/junit\/hudson\/javadoc\/forrest<br \/>\nIt uses an Apache 2.0 license<\/p>\n<p>Good configs for hadoop:<\/p>\n<ul>\n<li>use 7200 rpm sata, ecc ram, 1U servers<\/li>\n<li>use linux, ext3 or maybe xfs filesystem, with noatime<\/li>\n<li>JBOD disk config, no raid<\/li>\n<li> Java 6u14+<\/li>\n<\/ul>\n<p>To manage it &#8211;<\/p>\n<p>unix utes: sar, iostat, iftop, vmstat, nfsstat, strace, dmesg, friends<\/p>\n<p>java utes: jps, jstack, jconsole<br \/>\nGet the rpm!\u00a0 www.cloudera.com\/hadoop<\/p>\n<p>config: my.cloudera.com<br \/>\nmodes &#8211; standalone, pseudo-distrib, distrib<br \/>\n&#8220;It&#8217;s nice to use dsh, cfengine\/puppet\/bcfg2\/chef for config management across a cluster; maybe use scribe for centralized 
logging&#8221;<\/p>\n<p><em>I love hearing what tools people are using, that&#8217;s mainly how I find out about new ones!<\/em><\/p>\n<p>Common hadoop problems:<\/p>\n<ul>\n<li> &#8220;It&#8217;s almost always DNS&#8221; &#8211; use hostnames<\/li>\n<li> open ports<\/li>\n<li> distrib ssh keys (expect)<\/li>\n<li> write permissions<\/li>\n<li> make sure you&#8217;re using all the disks<\/li>\n<li> don&#8217;t share NFS mounts for large clusters<\/li>\n<li>set JAVA_HOME to new jvm (stick to sun&#8217;s)<\/li>\n<\/ul>\n<h3>HDFS In Depth<\/h3>\n<p>1.\u00a0 NameNode (master)<br \/>\nVERSION file shows data structs, filesystem image (in memory) and edit log (persisted) &#8211; if they change, painful upgrade<\/p>\n<p>2.\u00a0 Secondary NameNode (aka checkpoint node) &#8211; checkpoints the FS image and then truncates edit log, usually run on a sep node<br \/>\nNew backup node in .21 removes need for NFS mount write for HA<\/p>\n<p>3.\u00a0 DataNode (workers)<br \/>\nstores data in local fs<br \/>\nstores data in blk_&lt;id&gt; files, round robins through dirs<br \/>\nheartbeat to namenode<br \/>\nraw socket to serve to client<\/p>\n<p>4.\u00a0 Client (Java HDFS lib)<br \/>\nother stuff (libhdfs) more unstable<\/p>\n<p>hdfs operator utilities<\/p>\n<ul>\n<li> safe mode &#8211; when it starts up<\/li>\n<li> fsck &#8211; hadoop version<\/li>\n<li> dfsadmin<\/li>\n<li> block scanner &#8211; runs every 3 wks, has web interface<\/li>\n<li> balancer &#8211; examines ratio of used to total capacity across the cluster<\/li>\n<li> har (like tar) archive &#8211; bunch up smaller files<\/li>\n<li> distcp &#8211; parallel copy utility (uses mapreduce) for big loads<\/li>\n<li> quotas<\/li>\n<\/ul>\n<p>has users, groups, permissions &#8211; including x, but there is no execution; it&#8217;s used for dirs<br \/>\nhadoop has some access trust issues &#8211; used through gateway cluster or in trusted env<br \/>\naudit logs &#8211; turn on in log4j.properties<\/p>\n<p>has loads of Web UIs 
&#8211; on namenode go to \/metrics, \/logLevel, \/stacks<br \/>\nnon-hdfs access &#8211; HDFS proxy to http, or thriftfs<br \/>\nhas trash (.Trash in home dir) &#8211; turn it on<\/p>\n<p>includes benchmarks &#8211; testdfsio, nnbench<\/p>\n<p>Common HDFS problems<\/p>\n<ul>\n<li> disk capacity, esp due to log file sizes &#8211; crank up reserved space<\/li>\n<li> slow but not dead disks and flapping NICs to slow mode<\/li>\n<li> checkpointing and backing up metadata &#8211; monitor that it happens hourly<\/li>\n<li> losing write pipeline for long-lived writes &#8211; redo every hour is recommended<\/li>\n<li> upgrades<\/li>\n<li>many small files<\/li>\n<\/ul>\n<h3>MapReduce<\/h3>\n<p>use Fair Share or Capacity scheduler<br \/>\ndistributed cache<br \/>\njobcontrol for ordering<\/p>\n<p>Monitoring &#8211; They use ganglia, jconsole, nagios and canary jobs for functionality<\/p>\n<p>Question &#8211; how much admin resource would you need for hadoop?\u00a0 Answer &#8211; Facebook ops team had 20% of 2 guys hadooping, estimate you can use 1 person\/100 nodes<\/p>\n<p>He also notes that this preso and maybe more are on <a href=\"http:\/\/www.slideshare.net\/jhammerb\" target=\"_blank\">slideshare under &#8220;jhammerb.&#8221;<\/a><\/p>\n<p><em>I thought this presentation was very complete and bad ass, and I may have some use cases that hadoop would be good for coming up!<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hadoop Operations: Managing Big Data Clusters (see link on that page for preso) was given by Jeff Hammerbacher of Cloudera. Other good references &#8211; book: &#8220;Hadoop: The Definitive Guide&#8221; preso: hadoop cluster management from USENIX 2009 Hadoop is an Apache project inspired by Google&#8217;s infrastructure; it&#8217;s software for programming warehouse-scale computers. 
It has recently been [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[82,77,259],"tags":[266,626,125,267,79,260,261],"class_list":["post-255","post","type-post","status-publish","format-standard","hentry","category-cloud-computing","category-conferences","category-velocity-2009","tag-cloudera","tag-conferences","tag-data","tag-hadoop","tag-velocity","tag-velocityconf","tag-velocityconf09"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pfI0c-47","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/posts\/255","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=255"}],"version-history":[{"count":5,"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/posts\/255\/revisions"}],"predecessor-version":[{"id":273,"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/posts\/255\/revis
ions\/273"}],"wp:attachment":[{"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=255"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=255"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.webadminblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=255"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}