Observatoire astronomique de Strasbourg, Université de Strasbourg, CNRS, UMR 7550, 11 rue de l’Université, F-67000 Strasbourg, France

Miscellaneous Information

Miscellaneous Information

Abstract Reference: 30840
Identifier: O13.3
Presentation: Oral communication
Key Theme: 3 New Trends in HPC and Distributed Computing 

3 New Trends in HPC and Distributed Computing

Schaaff Andre, Pineau François-Xavier, Wali Noémie, Nauroy Julien, 

To face the increasing volume of data we will have to manage in the coming years, we test and prototype implementations in the Big Data domain (both data and processing).

The CDS propose a "X-Match" service which does a cross correlation of sources between very large catalogues (10 billions rows).

It is a fuzzy join between two tables of several hundred millions of lines (e.g. 470,992,970 sources for 2MASS). A user can do a cross-match of the (over 10,000) catalogues proposed by the CDS or he can upload his own table (with positions) to cross-match it with these catalogues. It is based on optimized developments implemented on a well-sized server. The area concerned by the cross-match can be the full Sky (which involves all the sources), a cone with only the sources (which are at a certain angular distance from a given position), or a HEALPix cell.

This kind of treatment is potentially "heavy" and requires appropriate techniques (data structure and computing algorithm) to ensure good performances and to enable its use in online services.

Apache Spark seemed very promising and we decided to improve the algorithms, by using this technology in a suitable technical environment and by testing it with large datasets. Compared to Hadoop, Spark is designed to do as much as possible the treatments in the RAM.

We performed comparative tests with our X-Match service. In a first step we used an internal and limited test bed to learn and to gain the necessary experience to optimize the process. In a second step we did the tests with a rented external cluster of servers.

At the end we reached an execution time better than the X-Match service. We will detail this experiment step by step and show the corresponding metrics.

We will focus on the bottleneck we encountered during the shuffle phase of Spark and especially the difficulty to enable the « data co-location » which is a way to decrease the data exchange between the nodes. Following discussions around Spark, we are now working on the hardware definition of a new test bed and on the use of Spark 2.0, which is near the official release. We have also started to evaluate Docker in the frame of "bringing the computing to the data". The encapsulation of the computing algorithms in Docker components could be a solution to deploy quickly and easily a processing environment near the distributed data (based on Hadoop Distributed File System). We propose an overview of the study from a technical and budgetary perspective.