This is a guest post by Stephen Verstraete, a manager at Pariveda Solutions, an AWS Advanced Consulting Partner.
Common patterns exist for batch processing and real-time processing of big data. However, we haven’t seen patterns for processing batches of interdependent data in real time. Expedia’s marketing group needed to analyze interdependent data sets as soon as all of the data arrived in order to deliver operational direction to partners. The existing system ran on an on-premises Hadoop cluster, but the team was struggling to meet its internal SLAs. The information was also time-sensitive: the faster the data was processed, the better the operational direction the team could give its partners.
The Pariveda team working at Expedia engaged with Solutions Architects at AWS to solve three distinct challenges: how to deliver analysis results as rapidly as the source data becomes available; how to process data sets that are interdependent but produced at different times; and how to track, for each batch, which of those dependencies have already arrived.
In this blog post, I describe how the Expedia, Pariveda, and AWS teams figured out a unique approach to real-time data processing using AWS Lambda, Amazon DynamoDB, Amazon EMR, and Amazon S3 as building blocks. You’ll learn how to implement a similar pipeline without managing any infrastructure.
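Before diving into the architecture, here is a minimal sketch of the core dependency-tracking idea: each analysis batch depends on several source data sets that can arrive in any order, and the job should fire exactly once, when the last dependency lands. The data set names and the tracker itself are hypothetical illustrations; in the actual pipeline this state would live in DynamoDB, updated by a Lambda function triggered by S3 arrival events.

```python
# Hypothetical illustration of the dependency-tracking pattern:
# record each data set as it arrives and report readiness only
# when the full set of required dependencies is present.

REQUIRED = {"clicks", "impressions", "bookings"}  # hypothetical data sets

def make_tracker():
    """Return a handler that records arrivals and reports batch readiness."""
    arrived = set()

    def on_arrival(dataset: str) -> bool:
        # Ignore data sets this batch does not depend on.
        if dataset not in REQUIRED:
            return False
        arrived.add(dataset)
        # True exactly when every required dependency has landed.
        return arrived == REQUIRED

    return on_arrival

handler = make_tracker()
print(handler("clicks"))       # still waiting on two data sets
print(handler("impressions"))  # still waiting on one
print(handler("bookings"))     # all dependencies present: launch the job
```

In the AWS version of this sketch, the `arrived` set becomes a conditional update against a DynamoDB item keyed by batch, so concurrent Lambda invocations cannot double-fire the downstream EMR job.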