Course: Big Data analysis with Hadoop and RHadoop

Europe/Ljubljana
Zoom Meeting


Description

This training course will focus on the foundations of “Big Data” processing by introducing the Hadoop distributed computing architecture and providing an introductory-level tutorial on Big Data analysis with Hadoop, RHadoop, and the R libraries parallel, doParallel, foreach, and Rmpi. Although held online, the course will be hands-on, allowing participants to work interactively on real data in the High Performance Computing environment of the University of Ljubljana and on the Vienna Scientific Cluster.
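
To give a flavour of the hands-on sessions, here is a minimal, purely illustrative sketch (not taken from the course material) of a parallel loop in R with the foreach and doParallel libraries mentioned above; the worker count of four and the use of the built-in mtcars data set are arbitrary assumptions:

  # Illustrative sketch: a parallel loop with foreach/%dopar%
  library(doParallel)              # also attaches foreach and parallel

  cl <- makeCluster(4)             # four local worker processes (assumed)
  registerDoParallel(cl)

  # compute the means of 100 bootstrap samples of mtcars$mpg in parallel
  boot_means <- foreach(i = 1:100, .combine = c) %dopar% {
    mean(sample(mtcars$mpg, replace = TRUE))
  }

  stopCluster(cl)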


Organizers


This course is a EuroCC event jointly organised by EuroCC Slovenia, EuroCC Slovakia and EuroCC Austria.

 


Lecturers

  • Prof. Janez Povh, University of Ljubljana, Slovenia (applied mathematics, high performance computing, big data analysis)
  • Lucia Absalon Bautista, University of Ljubljana, Slovenia (big data analysis)
  • Dr. Giovanna Roda, EuroCC Austria, BOKU, and TU Wien, Austria (high performance computing, big data analysis)
  • Liana Akobian, TU Wien, Austria (high performance computing, big data analysis)

 

Programme

  • Wednesday, 19 October
    • 13:00 – 13:15
      Introduction
    • 13:15 – 14:00
      Introduction to Hadoop

      Introduction to Big Data
      The Hadoop Distributed Computing Architecture
      First hands-on exercise on the cluster

    • 14:00 – 14:15
      Break
    • 14:15 – 15:00
      HDFS

      The Hadoop Distributed File System: blocks, partitions, load balancing, replication/erasure coding, fault tolerance, data locality
      Hands-on example: managing data on HDFS

    • 15:00 – 15:15
      Break
    • 15:15 – 16:00
      MapReduce (MR)

      Explaining the MR computing model
      Split / map / sort & shuffle / combine / reduce
      Hands-on demos

    • 16:00 – 16:15
      Break
    • 16:15 – 17:00
      Hands-on exercise with MR
  • Thursday, 20 October
    • 13:00 – 13:15
      Introduction to Day 2
    • 13:15 – 14:00
      Introduction to R

      Connecting to the RStudio web server at HPC@UL
      Creating and running your own R scripts
      Creating, retrieving, and saving data files
      Standard data management operations on data frames
      Data management with dplyr and magrittr (see the code sketches below)

    • 14:00 – 14:15
      Break
    • 14:15 – 15:00
      Advanced and Big Data management with R

      Data manipulation with the apply family of functions: apply, lapply, sapply, vapply, tapply, and mapply
      Big Data management and analysis on a single computing node with functions for efficient parallel loops: parLapply, parSapply, mclapply, and foreach with %dopar% (code sketch below)

    • 15:00 – 15:15
      Break
    • 15:15 – 16:00
      Big Data management and analysis with Rmpi and RHadoop

      Big Data management and analysis on multiple computing nodes with the Rmpi library (code sketch below)
      Preparing and storing big data in HDFS using the rhdfs library (code sketch below)
      Retrieving and managing big data in HDFS with the plyrmr and rhdfs libraries

    • 16:00 – 16:15
      Break
    • 16:15 – 17:00
      Big Data analysis with RHadoop

      Preparing map-reduce scripts for basic data analysis tasks (extreme values, counts, mean values, dispersions, visualisations) using the rhdfs library (code sketch below)

    • 17:00 – 17:05
      Wrap-up
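

Code sketches

The following short sketches are illustrative only: they are not taken from the course material, and all file names, paths, and parameter values in them are assumptions. They are meant to give an idea of the kind of R code covered in the sessions above.

A possible flavour of the "Introduction to R" session: standard data-frame operations with dplyr and the magrittr pipe, using the built-in mtcars data set.

  # Illustrative sketch: data management with dplyr and magrittr
  library(dplyr)     # data manipulation verbs
  library(magrittr)  # the %>% pipe operator

  mtcars %>%
    filter(cyl == 6) %>%                      # keep 6-cylinder cars
    group_by(gear) %>%                        # group by number of gears
    summarise(mean_mpg = mean(mpg),           # average miles per gallon
              n_cars   = n()) %>%
    arrange(desc(mean_mpg))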
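
For the "Advanced and Big Data management with R" session, a sketch of the same computation written sequentially with lapply and then in parallel on one node with parLapply (socket cluster) and mclapply (forked processes); the choice of four cores is an assumption.

  # Illustrative sketch: sequential vs. single-node parallel loops
  library(parallel)

  slow_square <- function(x) { Sys.sleep(0.01); x^2 }

  res_seq  <- lapply(1:100, slow_square)          # sequential baseline

  cl       <- makeCluster(4)                      # socket cluster with 4 workers
  res_sock <- parLapply(cl, 1:100, slow_square)
  stopCluster(cl)

  res_fork <- mclapply(1:100, slow_square,        # forked worker processes,
                       mc.cores = 4)              # Unix-like systems only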
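
For the Rmpi part of the "Big Data management and analysis with Rmpi and RHadoop" session, a sketch of a computation spread over MPI workers; it assumes a working MPI installation against which Rmpi was built, and the number of workers is arbitrary.

  # Illustrative sketch: multi-node parallelism with Rmpi
  library(Rmpi)

  mpi.spawn.Rslaves(nslaves = 8)    # start 8 MPI worker processes (assumed count)

  # apply a function to 1:1000, distributed across the workers
  res <- mpi.parSapply(1:1000, function(x) sqrt(x))

  mpi.close.Rslaves()               # shut the workers down
  mpi.quit()                        # finalise MPI and quit R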
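
For the rhdfs part of the same session, a sketch of moving data to and from HDFS; the HADOOP_CMD path and the file names are assumptions that depend on the cluster configuration.

  # Illustrative sketch: managing data on HDFS with rhdfs
  Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")   # path to the hadoop binary (assumed)
  library(rhdfs)
  hdfs.init()

  # copy a local CSV file into HDFS and list the target directory
  hdfs.put("measurements.csv", "/user/me/data/measurements.csv")
  hdfs.ls("/user/me/data")

  # read the file back from HDFS as raw bytes and split it into lines
  con   <- hdfs.file("/user/me/data/measurements.csv", "r")
  txt   <- rawToChar(hdfs.read(con))
  lines <- strsplit(txt, "\n")[[1]]
  hdfs.close(con)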
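
For the closing "Big Data analysis with RHadoop" session (and as an R-level illustration of the map / sort & shuffle / reduce model from Day 1), a sketch of per-key mean values computed as a map-reduce job. It uses rmr2, the RHadoop package that provides the mapreduce() function, and assumes that HADOOP_CMD and HADOOP_STREAMING are set for the cluster; the toy data are generated on the fly.

  # Illustrative sketch: a basic analysis (group means) as a map-reduce job
  library(rmr2)

  # write 1000 (key, value) pairs with keys 1..5 to HDFS
  input <- to.dfs(keyval(sample(1:5, 1000, replace = TRUE), rnorm(1000)))

  out <- mapreduce(
    input  = input,
    map    = function(k, v) keyval(k, v),          # pass the pairs through
    reduce = function(k, vv) keyval(k, mean(vv))   # mean value per key
  )

  from.dfs(out)                                    # fetch the results back into R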