Software Systems and Technologies for Big Data Applications.

Goals

  • Basic principles of modern big data processing frameworks.
  • Programming and use of such frameworks depending on the desired functionality: storage, querying, batch processing, graph processing, streaming, deep learning.
  • Performance optimization when using those frameworks.

Prerequisites

ΗΥ360 and ΗΥ252, or instructor permission.

Instructor

Christos Kozanitis

Teaching Assistants

Konstantinos Solomos, Christoforos Leventis

Class meetings

Monday - Wednesday 18:00 - 20:00 A.113

Office hours

  • Instructor: Monday 17:00-18:00, location TBD
  • TA: TBD

Course Text

Assigned paper readings
Material from all over the web

Grading

  • Class participation - reading discussion (30%)
  • Programming assignments (30%)
  • Project (40%)

Computation platform

The course has received an AWS Educate grant from Amazon Web Services, which we greatly appreciate. Registered students will receive credits to use Amazon cloud services.

Course Material

Introduction
  • Big Data and Data Science
  • A Guide to functional programming with Scala
Apache Spark Architecture and programming model
  • Spark architecture
  • RDDs
  • Spark operators
  • Lazy evaluation
  • Spark SQL
  • Spark tutorial + debugging advice
Introduction to Machine Learning
  • Brief introduction
  • Terminology
  • Supervised vs unsupervised learning
  • Example pipelines
  • Linear algebra review

Distributed Machine Learning
  • Scalability challenges for common problems: linear regression, logistic regression
  • Sparsity
  • Spark MLlib
  • Non-numeric features: One Hot Encoding (OHE)
  • OHE sparsity
  • Dimensionality reduction
  • Multidimensional data

Graph Processing
  • Challenges of graph processing
  • Giraph
  • GraphLab
  • GraphX

Streaming
  • Streaming use cases
  • Storm
  • Spark Streaming and Structured Streaming

Storage Systems
  • HDFS
  • NoSQL
  • Cassandra
  • HBase
  • MongoDB

Data Representation
  • Serialization and deserialization
  • Avro, Thrift, Protocol Buffers
  • Column storage
  • Dremel
  • Parquet

Cluster Management
  • Mesos
  • YARN

Deep Learning
  • Introduction
  • Scale up vs scale out
  • MNIST image recognition
  • TensorFlow