Ever conduct a search for a product like rollerblades and then notice advertisements for those products on your Facebook or Gmail pages? That ad placement is facilitated by companies like Komli Media, whose ad network platform crunches data on people's past buying decisions and Internet searches to determine which advertisements would be effective on their homepages.
Using the Hadoop framework to analyze big data, the Mumbai, India-based company has precisely targeted online advertisements for resellers of all kinds. Recently, however, Komli Engineering Manager Shailesh Garg found that its Hadoop setup wasn't powerful enough to deliver real-time analysis all the time. The expense of running the ad network platform on-premises was also painful: Cluster processing alone cost about $15,000 a month.
The company was using Hadoop with fixed-size clusters in its data center, along with Amazon Web Services (AWS) tools, to glean insights into campaign performance and management. To meet customers' growing business needs for fast data processing, Garg, a self-described problem solver, knew it was time to invest in something that could scale with demand, such as a managed cluster.
"The analysis we need to do on the data is very ad hoc and spikey in nature," said Garg, also head of analytics for the company. To solve the company's data crunching woes, Garg devised a short list of must-have features he desired in a new tool. It had to handle large amounts of data and analyze it at lightning-fast speeds. And it had to be cost-effective and easy to use.
"We needed a solution where we could scale the Hadoop cluster and add machines in a few minutes," Garg said. "In today's world, that solution is a cloud solution."
The quest for an advanced data analytics tool entailed investigating some of the more popular cloud and big data offerings. Amazon Elastic MapReduce had the kind of data storage Komli was looking for, but it fell short in processing information quickly. During Garg's testing, however, Qubole Data Service delivered quick data handling and auto-scaling functionality -- something he struggled to find in other products. With the decision to use Qubole made, the team began collaborating with AWS. Qubole works by connecting directly to customers' data wherever it resides on the public cloud -- with Amazon S3 being one of those data sources.
With one engineer working with Qubole and AWS, the process of migrating data to AWS S3 and setting up a data pipeline took a month to complete. "In parallel, we changed our application to submit the jobs to Qubole Data Service instead of an on-premises cluster," Garg said.
Looking back on the implementation process, Garg said he wishes he had discussed data transfer in more depth with the Qubole team before embarking on the project. His team didn't know which tool would be best for moving roughly 300 GB to 400 GB of compressed data daily in a way that wouldn't strain finances.
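Garg shortlisted DistCP (distributed copy) and S3DistCP and benchmarked them. DistCP is an open source Hadoop tool for copying large volumes of data within and between clusters; S3DistCP is a similar tool from AWS designed for copying data to and from Amazon S3. "We found S3DistCP very efficient compared to DistCP, almost twice as fast," he said.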
On the other hand, Garg said, the S3DistCP distribution from AWS fell short because it was missing a lot of JARs (Java ARchives). "We were not able to fix it working alone with AWS," Garg said. "We then took help from Qubole, and they provided the right version of dependent JARs." If he had known about the challenges with the JARs in advance, he could have saved a lot of time, he noted.
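To put the daily transfer volume and the reported speedup in perspective, here is a rough back-of-the-envelope sketch in Python. Only the 300 GB to 400 GB daily volume and the "almost twice as fast" figure come from Garg's account; the 50 MB/s baseline throughput and the function name `transfer_hours` are hypothetical and are not numbers or tools from Komli's benchmark.

```python
def transfer_hours(volume_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy volume_gb at a sustained throughput.

    Uses decimal units (1 GB = 1000 MB).
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = volume_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


# Hypothetical baseline throughput; NOT a figure from the article.
ASSUMED_BASELINE_MB_PER_S = 50.0
# Garg reported "almost twice as fast"; 2.0 is a rounded-up assumption.
ASSUMED_SPEEDUP = 2.0

for volume_gb in (300, 400):
    baseline = transfer_hours(volume_gb, ASSUMED_BASELINE_MB_PER_S)
    faster = transfer_hours(volume_gb, ASSUMED_BASELINE_MB_PER_S * ASSUMED_SPEEDUP)
    print(f"{volume_gb} GB/day: {baseline:.2f} h at baseline, {faster:.2f} h at 2x")
```

Whatever the real throughput, a tool that copies twice as fast halves the time a cluster spends on each daily transfer, and therefore the compute time billed for it.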
Despite the hiccup with JARs along the way, Garg said he is satisfied with his experience from start to finish. Most important in Garg's mind, though, is the fact that he can now better meet clients' needs for information in a couple of hours, rather than a day.
What to consider when seeking a big data cloud solution provider
Komli Media's Shailesh Garg learned a lot during his data migration from an on-premises Hadoop cluster to Qubole Data Service. There are four questions he recommends that companies ask when evaluating big data cloud service providers.
- Do you offer around-the-clock customer support? Even the most knowledgeable in-house staff can hit snags when working with a big data cloud service, which is why reaching out to a vendor when a problem arises is key for successful operation. Whether by phone or online chat, Garg said having easy access to technology support, at least for the first few months, is critical.
- Will you be able to support our use cases? Not every vendor will be set up from the start to meet every need an organization has, which is OK. What really matters is how the provider responds to requests and needs. In Komli's case, Qubole didn't offer Pig support over the UI. Garg said that after discussing the situation with Qubole, he was able to get Pig support in three days. "That was pretty quick and also points to the vendor commitment for its customer," he said.
- Do you support a range of technologies? A product that makes perfect sense today may not be the best tomorrow or next week or next year. With offerings and needs always changing, it's important to be working with a provider that can grow with an organization. "Probably today Hive makes sense for you, but maybe [in the] future Presto can work better," Garg said as an example, referring to two SQL query engines for big data.
- Are you flexible? When it comes to moving data from one system to another, it's important for the provider to be able to meet an organization's specific requirements. "Sometimes data migration becomes lock-in for the customer," Garg said. "Choose your options well."
Maxine Giza is the site editor for SearchSOA and can be reached at email@example.com.