The training event will consist of two 4-hour sessions on two consecutive days. The first day will focus on big data management and analysis with Hadoop. Participants will learn how to (i) move big data efficiently to a cluster and to the Hadoop distributed file system (HDFS), and (ii) perform simple big data analyses with Python scripts using MapReduce and Hadoop.

The second day will focus on big data management and analysis with R and RHadoop. We will work within RStudio and write all scripts in R, using several state-of-the-art libraries for parallel computation (parallel, doParallel, foreach and Rmpi) as well as libraries for working with Hadoop (rmr, rhdfs and rhbase). Finally, we will show how to run parallel Slurm jobs with R scripts.
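For a flavour of the first-day exercises, here is a minimal word-count mapper/reducer pair for Hadoop Streaming; the file names mapper.py and reducer.py are illustrative, and the actual course examples may differ. Hadoop Streaming runs ordinary scripts that read from standard input and write tab-separated key-value pairs to standard output:

    #!/usr/bin/env python3
    # mapper.py -- emit "word<TAB>1" for every word on standard input.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sum the counts per word; Hadoop Streaming delivers
    # the mapper output sorted by key, so equal words arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A pair like this is launched on the cluster through the Hadoop Streaming jar shipped with Hadoop, passing the two scripts via its -mapper, -reducer and -files options; the exact jar path depends on the installation.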
Target audience
Everyone interested in big data management and analysis.
Prerequisite knowledge
For the first day: basic Linux shell commands and Python
For the second day: basic Linux shell commands and R
Workflow
The course will be held online via Zoom. Participants will need a local machine from which to connect to the supercomputers at the University of Ljubljana and the Vienna Scientific Cluster. Before the course starts, they will receive training accounts on these supercomputers for running all of the examples.
Skills to be gained
At the end of the course, participants will be able to:
- Connect to a supercomputer;
- Move big data to a supercomputer and store it in a distributed file system (see the sketch after this list);
- Write Python scripts that perform basic data management and data analysis tasks with Hadoop;
- Write R scripts that perform basic data analysis tasks with parallel-computation libraries such as parallel, doParallel, foreach and Rmpi, and with RHadoop libraries such as rmr, rhdfs and rhbase.
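As a small illustration of the data-movement skill above, the following sketch copies a local file into HDFS by calling the standard hdfs dfs command-line client from Python; the script name, helper name and all paths are hypothetical:

    #!/usr/bin/env python3
    # put_to_hdfs.py -- copy a local file into HDFS by calling the
    # standard "hdfs dfs" command-line client (assumed to be on PATH).
    import subprocess
    import sys

    def put_to_hdfs(local_path, hdfs_dir):
        # Create the target directory (no error if it already exists) ...
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
        # ... then upload the file, overwriting any existing copy.
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir],
                       check=True)

    if __name__ == "__main__":
        put_to_hdfs(sys.argv[1], sys.argv[2])

It would be run as, for example, python3 put_to_hdfs.py mydata.csv /user/train01/data, with both paths purely illustrative.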