Overview:
This training course focuses on the foundations of “Big Data” analysis. It introduces the Hadoop distributed computing architecture and provides an introductory-level tutorial on Big Data analysis using Hadoop and RHadoop. Although held online, the course will be hands-on, allowing participants to work interactively with real data in the High Performance Computing environment of the University of Ljubljana.
Description:
The training event will consist of two 4-hour sessions on two consecutive days. The first day will focus on big data management and data analysis with Hadoop. Participants will learn how to (i) move big data efficiently to a cluster and into the Hadoop distributed file system, and (ii) perform simple big data analysis with Python scripts using MapReduce and Hadoop. The second day will focus on big data management and analysis using RHadoop. We will work within RStudio and write all scripts in R, using several state-of-the-art libraries for parallel computation, such as parallel, doParallel and foreach, and libraries for working with Hadoop, such as rmr, rhdfs and rhbase.
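To give a flavour of the first day, the MapReduce pattern can be sketched in a few lines of Python. This is only an illustrative word count, not course material: the function names mapper, reducer and run_local are hypothetical, and the whole pipeline is simulated in one process. On a cluster, the map and reduce steps would typically be separate scripts that read from standard input and write to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(pairs):
    """Reduce step: sum the counts per word; pairs must be sorted by word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    shuffled = sorted(mapper(lines))  # stands in for Hadoop's shuffle phase
    return dict(reducer(shuffled))


if __name__ == "__main__":
    # Hypothetical sample input; real jobs would read files from HDFS.
    sample = ["big data on Hadoop", "big data with Python"]
    for word, count in sorted(run_local(sample).items()):
        print(f"{word}\t{count}")
```

The sort between the two steps matters: the reducer relies on all pairs for the same word arriving next to each other, which is what the framework's shuffle phase guarantees.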
Target audience:
Everyone interested in big data management and analysis
Prerequisite knowledge:
For the first day: basic Linux shell commands, Python
For the second day: basic Linux shell commands and R
Workflow:
The course will be held online via Zoom. Participants will need a local computer to connect to the HPC system at the University of Ljubljana. Before the start of the course they will receive a student account on this supercomputer, and all examples will be run on this machine. They will retain the account for two more weeks so that they can repeat the examples and transfer the data and examples to a local machine.
Skills to be gained:
At the end of the course the student will be able to:
Connect to a supercomputer using the NoMachine tool;
Move big data to a supercomputer and store it in a distributed file system;
Write Python scripts to perform basic data management and data analysis tasks with Hadoop;
Write R scripts to perform basic data management and data analysis tasks with RHadoop libraries such as rmr, rhdfs and rhbase.
Trainers:
Name/Surname | Institution | Description of expertise
Prof. Janez Povh | University of Ljubljana, Slovenia | applied mathematics, high performance computing, big data analysis
Dr. Giovanna Roda | EuroCC Austria, BOKU, and TU Wien, Austria | high performance computing, big data analysis
Liana Akobian | TU Wien, Austria | high performance computing, big data analysis
Organisers:
This course is a EuroCC event jointly organised by EuroCC Slovenia and EuroCC Austria.
Faculty of Computer and Information Science, University of Ljubljana