4005-779                                 Cloud and Large Scale Data Management               Spring 2010-3/Raj

Initial Course Information: Subject to Change

January 2011

Prerequisites

·   4005-771 Database Systems, or 4005-730 Distributed Systems, or Permission of Instructor.

o  A strong programming background in Java, C, or C++ and basic knowledge of client-server computing are required. Students without this background should not take this course.

Course Description

This course will explore approaches to large-scale data management such as cloud data management, high-performance database systems, and data-intensive computing. Seminal work, current research, and modern practices in these areas will be reviewed. Topics include parallel and distributed database system architectures, cloud infrastructures for multi-tenant databases, cloud security and privacy, fault-tolerant data storage, scalable file systems, and parallel query processing and optimization. Additional topics include newer programming models for large-scale data analysis, the MapReduce model and framework, and novel data-intensive computing applications. Students will read and present assigned papers, participate in class discussions, and work on a comprehensive term project.

Course Outcomes

After taking this course, a student will be able to:

·   Explain principles and practical techniques used in cloud and large-scale data management

·   Critique current research and practice in cloud and large-scale data management

·   Synthesize concepts from different research and practice papers, and apply them to design, implement, and demonstrate a project in cloud and large-scale data management

Textbooks

·   No formal textbook. Readings will be selected from research and industry papers.

Usage of AWS (Amazon Web Services)

·   If your project requires an AWS account, you will need a personal credit or debit card to set it up. The card is unlikely to be charged: each account comes with an initial credit that should be sufficient to cover project work. Any charges beyond this initial credit will be applied to your card, but that should happen only if you are careless with AWS resources.

Assessment and Grading

·   Each student team will work with the instructor to present a major cloud and large-scale data management (CLDM) topic in class.

·   Close to half the course grade will be based on the course project.

·   Extensive readings of research papers and industry technical documents are required.

Example topics and papers

(Last year's papers are listed here for illustration. This year's papers will be different.)

Large-scale Data Management:

  • Stonebraker, M., et al. MapReduce and parallel DBMSs: friends or foes? CACM, Jan 2010.

Systems:

  • DeWitt, D. and Gray, J., Parallel database systems. CACM, Jun 1992.

Large-scale transaction processing:

  • Bernstein, P. A. Transaction processing monitors. CACM, Nov 1990.

Cloud data management:

  • Cooper, B. F., et al. PNUTS: Yahoo!'s hosted data serving platform. VLDB, Aug 2008.

Data-intensive computing:

  • Ghemawat, S., Gobioff, H., & Leung, S.-T. The Google file system. SIGOPS OSR, Dec 2003.
  • Dean, J. & Ghemawat, S. MapReduce: a flexible data processing tool. CACM, Jan 2010.

Programming languages and models for large-scale data processing:

  • Olston, C., et al. Pig Latin: a not-so-foreign language for data processing. SIGMOD, 2008.

Other Topics (Papers not listed here):

  • Microsoft Windows Azure White Papers.
  • On-the-cheap data-intensive computing in the sciences.
  • Secure cloud data management.
  • Facebook data engineering.