
Getting the most from Elastic MapReduce

MapReduce is a technique pioneered by Google for distributing applications across clusters of commodity hardware. It's gaining popularity for its ability to process massive log files.
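The idea behind the technique can be illustrated with a toy, single-machine sketch in Python: a map phase emits key/value pairs from each input record, and a reduce phase combines all values that share a key (here, counting words in log lines). The function and variable names are illustrative only and are not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key, then sum each group's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

log_lines = ["error disk full", "warning disk slow", "error network down"]
counts = reduce_phase(map_phase(log_lines))  # e.g. counts["error"] == 2
```

In a real Hadoop cluster the map and reduce calls run in parallel on separate machines, and the framework handles the shuffle step between them; the logic per record, however, is exactly this simple.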


It's gaining popularity for its ability to process massive log files. The Hadoop implementation of MapReduce is being used to process petabytes of data, and researchers believe it also promises a new paradigm for programming analytic models. MapReduce applications are used for web indexing, data mining, log file analysis, financial analysis, scientific simulation and bioinformatics research.

The Amazon Elastic MapReduce service is an implementation of Hadoop on top of the AWS platform. It was created to simplify the rollout of new MapReduce applications and thus make the technology available to a larger audience. Elastic MapReduce enables more people to run, monitor and control Hadoop jobs through a point-and-click interface.

Under the hood

An Elastic MapReduce instance consists of a single master node and multiple slave nodes that execute the mapping and reducing algorithms. There are two types of slave nodes: core nodes, which manage the data in the distributed file system, and task nodes, which execute the processing. Amazon has recently added the ability to adjust the number of servers in an Elastic MapReduce instance on the fly. Once a job flow has started, you can increase but not reduce the number of core nodes, while the number of task nodes can be increased or decreased as required. Changes to a workflow can be made through the Elastic MapReduce interface, the command line or a Java SDK. For example, a predefined workflow might reduce the number of task nodes as an application moves to a task with lower processing needs. The same tools can also be used to kick off new slave nodes in the event of a failure.

Programming the MapReduce workflow

Developers can interact with Elastic MapReduce via command-line tools, the API or the AWS Management Console. The API and command-line tools allow the most automation and fine-grained control; they can be used to create special job flow or monitoring steps.
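The resizing rules above — core nodes can only grow once a job flow has started, while task nodes can grow or shrink — can be captured in a small model. This is a toy illustration of the constraint, not AWS code; the class and method names are invented for this example.

```python
class JobFlowModel:
    """Toy model of Elastic MapReduce resizing rules (not an AWS API)."""

    def __init__(self, core_nodes, task_nodes):
        self.core_nodes = core_nodes
        self.task_nodes = task_nodes

    def resize_core(self, count):
        # Core nodes hold distributed file system data, so once the job
        # flow has started they can be added but never removed.
        if count < self.core_nodes:
            raise ValueError("core nodes cannot be reduced on a running job flow")
        self.core_nodes = count

    def resize_task(self, count):
        # Task nodes only execute processing, so they can grow or shrink freely.
        if count < 0:
            raise ValueError("task node count cannot be negative")
        self.task_nodes = count

flow = JobFlowModel(core_nodes=4, task_nodes=8)
flow.resize_task(2)   # scale down for a stage with lower processing needs
flow.resize_core(6)   # adding core nodes is allowed
```

The asymmetry exists because removing a core node would discard part of the distributed file system, whereas task nodes are stateless workers.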
The Web console is better suited for watching the progress of a job or for launching or stopping a job flow from a web browser.

There are a variety of tools to help debug new MapReduce instances. The debug job flow window, accessed via the AWS Management Console, can be used to track progress and identify issues. You can also telnet into the AWS server and use your favorite command-line debugger to analyze the job flow. During the development phase, you will want to enable debugging by setting the "Enable Debugging" flag when you create a new job flow with the AWS Management Console. From the command line, pass --enable-debugging and --log-uri when the job flow is created.

One of the biggest challenges with MapReduce is the limited support for legacy code and programming methodologies. Amund Tviet said that developers can use Boto on top of Python to simplify the integration of Elastic MapReduce with other web services. He said this kind of integration opens new doors for parallelizing legacy code.

New MapReduce instances are slow to boot up, noted Joel Duffin. That startup cost adds up during the development cycle, when new instances are repeatedly kicked off; it can be significantly reduced by keeping the cluster alive between job flows. To do so, pass the --alive flag at creation time: elastic-mapreduce --create --alive --log-uri s3://my-example-bucket/logs.
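As a sketch of the Boto route mentioned above, the following boto 2.x snippet launches a job flow with debugging enabled and the cluster kept alive between steps. The bucket and script paths are placeholders, and real AWS credentials (e.g. in a .boto config file) are required, so treat this as an outline rather than a runnable recipe.

```python
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

# Assumes AWS credentials are already configured; all S3 paths are placeholders.
conn = EmrConnection()

step = StreamingStep(name='Log analysis',
                     mapper='s3://my-example-bucket/mapper.py',
                     reducer='s3://my-example-bucket/reducer.py',
                     input='s3://my-example-bucket/input',
                     output='s3://my-example-bucket/output')

jobflow_id = conn.run_jobflow(name='Development job flow',
                              log_uri='s3://my-example-bucket/logs',
                              enable_debugging=True,  # like --enable-debugging
                              keep_alive=True,        # like --alive
                              steps=[step])
```

With keep_alive set, the same cluster can accept further steps without the slow boot of a fresh instance, mirroring the -alive flag in the command-line example.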

Next Steps

Learn more about Amazon Elastic MapReduce

