CS-543 Software Systems and Technologies for Big Data Applications.

Goals

  • Basic principles of modern big data processing frameworks.
  • Programming and use of such frameworks depending on the desired functionality: storage, querying, batch processing, graph processing, streaming, deep learning.
  • Performance optimization when using those frameworks.

Prerequisites

ΗΥ360 and ΗΥ252, or instructor permission.

Instructor

Christos Kozanitis
kozanitis [papaki] ics.forth.gr

Teaching Assistant

Aggelos Sinogeorgos

csdp1266 [papaki] csd.uoc.gr

Class meetings

Monday - Wednesday 16:00 - 18:00.

Classroom

H.206

Office hours

  • Instructor: schedule via email. Please include "543" in the subject line of your email.
  • TA: schedule via email.

Course Text

Assigned paper readings
Online documentation of technologies of interest

Grading

  • Class participation - reading discussion (30%)
  • Programming assignments (30%)
  • Project (40%)
  • Minimum grade requirements apply:
    • A homework average of at least 50%
    • At least 50% on every project deliverable (proposal, oral presentation, final report)

Computation platform

Cloud credits are provided by AWS. Each student receives credit to use the compute and storage services of the Amazon cloud. To receive access to the platform, registered students should send their uoc email address through the submission folder of the first week of class.

Course Material

Introduction

  • Big Data and Data Science
  • A Guide to functional programming with Scala

Apache Spark Architecture and programming model

  • Spark architecture
  • RDDs
  • Spark operators
  • Lazy evaluation
  • Spark SQL
  • Spark tutorial + debugging advice

Introduction to Machine Learning

  • Brief introduction
  • Terminology
  • Supervised vs unsupervised learning
  • Example pipelines
  • Linear algebra review

Distributed Machine Learning

  • Scalability challenges for common problems: linear regression, logistic regression
  • Sparsity
  • Spark MLlib
  • Non-numeric features: One-Hot Encoding (OHE)
  • OHE Sparsity
  • Dimensionality reduction
  • Multidimensional data

Graph Processing

  • Challenges of graph processing
  • Giraph
  • GraphLab
  • GraphX

Streaming

  • Streaming use cases
  • Storm
  • Spark Streaming and Structured Streaming

Storage Systems

  • HDFS
  • NoSQL
  • Cassandra
  • HBase
  • MongoDB

Data Representation

  • Serialization, Deserialization
  • Avro, Thrift, Protocol Buffers
  • Column storage
  • Dremel
  • Parquet

Cluster Management

  • Mesos
  • YARN

Deep Learning

  • Introduction
  • Scale up vs scale out
  • MNIST image recognition
  • TensorFlow