CS-543 Software Systems and Technologies for Big Data Applications.
Goals
- Basic principles of modern big data processing frameworks.
- Programming and use of such frameworks depending on the desired functionality: storage, querying, batch processing, graph processing, streaming, deep learning.
- Performance optimization when using these frameworks.
Prerequisites
ΗΥ360 and ΗΥ252, or instructor permission.
Instructor
Christos Kozanitis
kozanitis [at] ics.forth.gr
Teaching Assistant
Aggelos Sinogeorgos
csdp1266 [at] csd.uoc.gr
Class meetings
Monday and Wednesday, 16:00 - 18:00.
Classroom
H.206
Office hours
- Instructor: schedule via email. Please include "543" in the subject line of your email.
- TA: schedule via email.
Course Text
Assigned paper readings
Online documentation of technologies of interest
Grading
- Class participation - reading discussion (30%)
- Programming assignments (30%)
- Project (40%)
- Minimum grade requirements apply:
  - A homework average of at least 50%
  - At least 50% on every project deliverable (proposal, oral presentation, final report)
Computation platform
AWS cloud credits. Each student receives credit to use the compute and storage services of the Amazon cloud. To receive access to the platform, registered students should submit their uoc email address through the submission folder of the first week of class.
Course Material
Introduction
- Big Data and Data Science
- A Guide to functional programming with Scala
Apache Spark Architecture and programming model
- Spark architecture
- RDDs
- Spark operators
- Lazy evaluation
- Spark SQL
- Spark tutorial and debugging advice
Introduction to Machine Learning
- Brief introduction
- Terminology
- Supervised vs unsupervised learning
- Example pipelines
- Linear algebra review
Distributed Machine Learning
- Scalability challenges for common problems: linear regression, logistic regression
- Sparsity
- Spark MLlib
- Non-numeric features: One-Hot Encoding (OHE)
- OHE sparsity
- Dimensionality reduction
- Multidimensional data
Graph Processing
- Challenges of graph processing
- Giraph
- GraphLab
- GraphX
Streaming
- Streaming use cases
- Storm
- Spark Streaming and Structured Streaming
Storage Systems
- HDFS
- NoSQL
- Cassandra
- HBase
- MongoDB
Data Representation
- Serialization, Deserialization
- Avro, Thrift, Protocol Buffers
- Column storage
- Dremel
- Parquet
Cluster Management
- Mesos
- YARN
Deep Learning
- Introduction
- Scale up vs scale out
- MNIST image recognition
- TensorFlow