Introduction

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions" (from Wikipedia).

Members

Ph.D. Candidate

구해모 (Heymo Kou)
이용현 (Yonghyun Lee)
임유빈 (Yubin Lim)

Ph.D. Student

김동효 (Donghyo Kim)

M.S. Student

최병민 (Byeongmin Choi)
강필구 ()
양현식 ()
송창헌 (Changheon Song)

Publications

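구해모, 남창민, 이우현, 이용재, 김형주, "분산 인 메모리 DBMS 기반 병렬 K-Means의 In-database 분석 함수로의 설계와 구현" [Design and Implementation of Parallel K-Means as an In-database Analytic Function on a Distributed In-Memory DBMS], 정보과학회논문지 : 컴퓨팅의 실제, vol. 24, no. 3, pp. 405-112, 2018.3 (pdf)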

Reading Materials

MapReduce & Hadoop

MapReduce
Hadoop: The Definitive Guide, O'Reilly
MapReduce Algorithms for Big Data Analysis (VLDB 2012 tutorial slides by Prof. Kyuseok Shim)

Data Stores

Summary
- NoSQL Ecosystem - a good summary article by Jonathan Ellis, project chair for Apache Cassandra, posted 2009-11-09.
- Visual Guide to NoSQL Systems by Nathan Hurst, posted 2010-03-15.
- High Performance Scalable Data Stores by Rick Cattell, 2010-04-27.
- NoSQL Databases - NoSQL Introduction And Overview by Christof Strauch, from Stuttgart Media University.
CAP theorem
Distributed file system
- GFS: Google File System
Key-value stores
- Dynamo: Amazon's Highly Available Key-value Store
Document stores
Extensible record stores
- BigTable: Google's distributed storage system for managing structured data
- Cassandra: Facebook's distributed storage system, a marriage of Dynamo and BigTable.
HadoopDB: a hybrid of DBMS and MapReduce technologies

Bookmarks

Bibliography
- UIUC CS 525: Advanced Distributed Systems, Spring 2011
- Mapreduce & Hadoop Algorithms in Academic Papers (3rd update) posted 2010-05-08
- MapReduce paper list by Ashutosh Dutta
- Cassandra reading list by Jonathan Ellis, posted 2009-12-15
UCI ISG Lecture Series on Scalable Data Management

Machine Learning

Contents