Big Data: Distributed Data Management and Scalable Analytics

Level: Master
Semester: 2nd Semester (February - June)
Language: English
Teacher: Jan Hidders

Course Description

The course is divided into two parts: Big Data Management and Big Data Analytics. The Big Data Analytics part builds on concepts introduced in the Big Data Management part.

Big Data Management

  • Introduction to large scale data processing
    • The data deluge
    • Execution environments
    • Types of parallelism
  • MapReduce
    • HDFS, MR execution model, Google MR
    • YARN, ZooKeeper
  • Spark
    • RDDs, fragmentation
    • Spark execution
  • Streaming data analytics
    • Storm, Spark Streaming
    • Execution consistency
  • Big Data application architectures
  • Consistency and Availability
    • CAP theorem
    • high-availability techniques
    • concurrency control
  • Distributed query evaluation
    • computation models
    • data partitioning, skew
    • distributed join algorithms
  • Stream processing and sublinear algorithms
    • Bloom filters, HyperLogLog sketching

Big Data Analytics

  • Introduction to analytics
    • information extraction from large datasets
    • data mining / machine learning workflow
    • recent deep learning developments
    • open source frameworks
  • Machine learning with MapReduce
    • distributed supervised methods: linear regression, logistic regression, naive Bayes, random forests, support vector machines
    • distributed unsupervised methods: k-means, PCA/SVD
    • distributed feature selection
    • complexity analysis: single-core vs. multicore complexity, communication and parallelism trade-off
  • Online learning, active learning
    • online models: exact vs. approximate updates
    • ensemble methods
    • concept drift
    • semi-supervised learning and active learning
  • Graph analytics
    • collaborative filtering, SVD
    • PageRank
    • clustering and community detection
  • Deep Learning
    • deep neural networks
    • breakthroughs: e.g., image classification, games, text and speech
    • convolution and pooling layers
    • toolboxes: TensorFlow, Theano, Caffe