Software Systems and Technologies for Big Data Applications.

Goals

  • Basic principles of modern big data processing frameworks.
  • Programming and use of such frameworks depending on the desired functionality: storage, querying, batch processing, graph processing, streaming, deep learning.
  • Performance optimizations from the use of those frameworks.

Prerequisites

ΗΥ360 and ΗΥ252, or instructor permission.

Instructor

Christos Kozanitis (kozanitis [papaki] ics.forth.gr)

Teaching Assistants

Angelos Sinogeorgos (sinog [papaki] csd.uoc.gr)

Class meetings

Monday - Wednesday 16:00 - 18:00 at H.206

Office hours

  • Instructor: schedule via email. Please include the text "543" in the subject line of your email.
  • TA: schedule via email

Course Text

Assigned paper readings
Online documentation of technologies that we study

Grading

  • Class participation - reading discussion (30%)
  • Programming assignments (30%)
  • Project (40%)

Computation platform

Cloud credits provided by AWS. Each student will receive credit for the compute and storage services of the Amazon cloud. To receive access to the platform, registered students should send their uoc email address through the submission folder of the first week of class.

Course Material

Introduction
  • Big Data and Data Science
  • A Guide to functional programming with Scala
Apache Spark Architecture and programming model
  • Spark architecture
  • RDDs
  • Spark operators
  • Lazy evaluation
  • Spark SQL
  • Spark tutorial + debugging advice
Introduction to Machine Learning
  • Brief introduction
  • Terminology
  • Supervised vs unsupervised learning
  • Example pipelines
  • Linear algebra review

Distributed Machine Learning
  • Scalability challenges for common problems: linear regression, logistic regression
  • Sparsity
  • Spark MLlib
  • Non numeric features: One Hot Encoding (OHE)
  • OHE Sparsity
  • Dimensionality reduction
  • Multi dimensional data

Graph Processing
  • Challenges of graph processing
  • Giraph
  • GraphLab
  • GraphX

Streaming
  • Streaming use cases
  • Storm
  • Spark Streaming - Structured Streaming

Storage Systems
  • HDFS
  • NoSQL
  • Cassandra
  • HBase
  • MongoDB

Data Representation
  • Serialization, Deserialization
  • Avro, Thrift, Protocol Buffers
  • Column storage
  • Dremel
  • Parquet

Cluster Management
  • Mesos
  • YARN

Deep Learning
  • Introduction
  • Scale up vs scale out
  • MNIST image recognition
  • TensorFlow