Introduction

MapReduce is a programming model, introduced by Google in 2004, for large-scale data processing on shared-nothing clusters. Because the framework automatically parallelizes execution across a large cluster of commodity machines, users can write their programs without having to implement parallel and distributed processing features themselves. It is widely used thanks to Hadoop, an open-source implementation of the MapReduce framework. We have been working on improving the performance of MapReduce and on analyzing large data with MapReduce.
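As a rough illustration of the programming model, the sketch below runs a word count on a single machine. It only mimics the map, shuffle, and reduce phases in memory and is not how Hadoop or Google's system is implemented; the names `map_words`, `reduce_counts`, and `run_mapreduce` and the sample documents are hypothetical, chosen for this example.

```python
from collections import defaultdict

def map_words(_doc_id, text):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce phase: sum all counts that were grouped under one word."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework (an assumption of this sketch).

    A real cluster would run mappers and reducers in parallel on many
    machines; here the phases simply run one after another.
    """
    # Shuffle: group every intermediate value by its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce: process the keys in sorted order and collect the results.
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results

# Hypothetical input: (document id, document text) pairs.
documents = [("d1", "map reduce map"), ("d2", "reduce shuffle")]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
print(word_counts)
```

The user supplies only the two small functions; grouping by key, and in a real deployment the distribution and fault handling, are left to the framework.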

Members

Ph.D. Alumni

김기성 (Kisung Kim)
이태휘 (Taewhi Lee)

M.S. Alumni

배혜찬 (Hye-Chan Bae)
김혜원 (Hyewon Kim)
김태경 (Tai Kyoung Kim)
송효진 (Hyojin Song)

Publications

Taewhi Lee, Hye-Chan Bae, and Hyoung-Joo Kim. Join Processing with Threshold-based Filtering in MapReduce. Journal of Supercomputing, vol. 69, no. 2, pp. 793-813, 2014.8. [LINK]
Taewhi Lee, Dong-Hyuk Im, Hangkyu Kim, and Hyoung-Joo Kim. Application of Filters to Multiway Joins in MapReduce. Mathematical Problems in Engineering, vol. 2014, Article ID 249418, 11 pages, 2014.3. [LINK]
Taewhi Lee, Kisung Kim, and Hyoung-Joo Kim. Exploiting Bloom Filters for Efficient Joins in MapReduce. Information ― An International Interdisciplinary Journal, vol. 16, no. 8(A), pp. 5869-5885, 2013.8. [PDF]
Hye-Chan Bae, Taewhi Lee, and Hyoung-Joo Kim. Adaptive Join Processing Using Bloom Filters in MapReduce Environments (in Korean). Journal of KIISE: Databases, vol. 40, no. 4, pp. 233-242, 2013.8.
Taewhi Lee, Kisung Kim, and Hyoung-Joo Kim. Join Processing Using Bloom Filter in MapReduce. In Proceedings of the 2012 ACM Research in Applied Computation Symposium (RACS '12), pp. 100-105, San Antonio, TX, USA, 2012.10. [LINK]
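Tai Kyoung Kim, Kisung Kim, and Hyoung-Joo Kim. Efficient SPARQL Query Processing in MapReduce Using Duplication-based Joins and Non-conflicting Joins (in Korean). Journal of KIISE: Databases, vol. 39, no. 4, pp. 246-254, 2012.8. [PDF]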

Reading Materials

MapReduce & Hadoop

MapReduce
Hadoop: The Definitive Guide, O'Reilly
MapReduce Algorithms for Big Data Analysis (VLDB 2012 tutorial slides by Prof. Kyuseok Shim)

Data Stores

Summary
- NoSQL Ecosystem - a good summary article by Jonathan Ellis, project chair for Apache Cassandra. Posted 2009-11-09.
- Visual Guide to NoSQL Systems by Nathan Hurst. Posted 2010-03-15.
- High Performance Scalable Data Stores by Rick Cattell. 2010-04-27.
- NoSQL Databases - NoSQL Introduction And Overview by Christof Strauch, from Stuttgart Media University.
CAP theorem
Distributed file system
- GFS: Google File System
Key-value stores
- Dynamo: Amazon's Highly Available Key-value Store
Document stores
Extensible record stores
- BigTable: Google's distributed storage system for managing structured data
- Cassandra: Facebook's distributed storage system, a marriage of Dynamo and BigTable.
HadoopDB: a hybrid of DBMS and MapReduce technologies

Bookmarks

Bibliography
- UIUC CS 525: Advanced Distributed Systems, Spring 2011
- Mapreduce & Hadoop Algorithms in Academic Papers (3rd update) posted 2010-05-08
- MapReduce paper list by Ashutosh Dutta
- Cassandra reading list by Jonathan Ellis, posted 2009-12-15
UCI ISG Lecture Series on Scalable Data Management
