Prof. Dr. Burkhard Rost
Chair: I12 – Department for Bioinformatics and Computational Biology
Technische Universität München
The SuperMUC at the Leibniz-Rechenzentrum (LRZ) is a high-end supercomputer with over 241,000 cores and 6.8 PetaFLOPS, ranking 8th among the fastest supercomputers worldwide. Despite this enormous power, it remains limited when computations depend on reading and writing millions of small files. This limitation stems from the sheer number of simultaneous file handles and from the distribution of data across many physical disks. In addition, fixed block sizes can waste massive amounts of disk space when most files are smaller than the block size. While running PredictProtein, the first internet server for computational biology, which has served millions of researchers worldwide, we have encountered exactly this problem.
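To illustrate the scale of the block-size problem, the following back-of-the-envelope sketch estimates the space lost to internal fragmentation. The 8 MiB block size and the 4 KiB file size are assumptions chosen for illustration, not figures from SuperMUC itself (and real file systems may allocate sub-block fragments, which would reduce the waste):

```python
# Illustrative estimate of disk space wasted when small files each occupy
# at least one full file-system block. Block size and file sizes below are
# assumed values, not measured SuperMUC parameters.

BLOCK_SIZE = 8 * 1024 * 1024  # assumed block size: 8 MiB

def allocated_size(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes actually allocated: file size rounded up to whole blocks."""
    blocks = max(1, -(-file_size // block_size))  # ceiling division, >= 1 block
    return blocks * block_size

def wasted_bytes(file_sizes, block_size: int = BLOCK_SIZE) -> int:
    """Total allocated space minus total payload across all files."""
    return sum(allocated_size(s, block_size) - s for s in file_sizes)

# Example: one million 4 KiB result files
sizes = [4 * 1024] * 1_000_000
waste_tib = wasted_bytes(sizes) / 2**40  # roughly 7.6 TiB of pure overhead
```

Under these assumptions, a million 4 KiB files would consume nearly 8 TiB of allocated space while holding under 4 GiB of actual data.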
For this project, we propose to address the challenge of millions of small files by emulating a file system backed by a database, i.e. a pipeline composed of FUSE (Filesystem in Userspace), GridFS and MongoDB. This approach consolidates the scattered data into a single database and significantly reduces the number of open file handles. The proposed solution will also simplify copying result files, or sharing them with other research institutes for further analysis. Since we emulate a file system, applications running on SuperMUC require no modification. In addition, we will explore the advantages of extracting metadata concurrently, i.e. parsing file content while it is written.
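The mapping from file operations to database operations can be sketched as follows. A plain Python dict stands in for MongoDB/GridFS so the example is self-contained; in the actual pipeline the same callbacks would be wired into FUSE and would call the GridFS put/get API instead. All class and method names here are illustrative, not part of the proposed implementation:

```python
# Minimal sketch of mapping FUSE-style file operations onto a single
# GridFS-like document store. A dict replaces MongoDB/GridFS here so the
# example runs standalone; every name below is hypothetical.

class DBBackedStore:
    """Path-keyed file operations against one 'database' instead of
    millions of individual on-disk files."""

    def __init__(self):
        self._db = {}  # stand-in for a GridFS collection: path -> bytes

    def write(self, path: str, data: bytes) -> int:
        # FUSE write callback: append data to one database entry rather
        # than holding a per-file handle on a physical disk.
        self._db[path] = self._db.get(path, b"") + data
        return len(data)

    def read(self, path: str, size: int, offset: int) -> bytes:
        # FUSE read callback: serve content straight from the database.
        return self._db[path][offset:offset + size]

    def write_with_metadata(self, path: str, data: bytes) -> dict:
        # Concurrent metadata extraction: parse content as it is written,
        # here simply counting bytes and lines as an example.
        self.write(path, data)
        content = self._db[path]
        return {"path": path, "bytes": len(content),
                "lines": content.count(b"\n")}
```

Because the store is addressed by path, copying or sharing a whole result set reduces to exporting one database rather than transferring millions of files.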