Data Storage and Parallel Access

Part of the document Big data analysis algorithms society 5425 (pp. 277–280)

NGS experiments tend to generate huge amounts of data. However, most of it does not follow well-known and broadly supported formats, such as flat files (e.g. DSV, delimiter-separated values) or hierarchically structured files (e.g. XML or JSON). This is why storing and efficiently manipulating NGS data is not directly supported in HDFS and requires additional adaptation effort.

When analyzing NGS data, most of the time is spent on processing alignment files (the BAM file format [25]). This processing can be decomposed into basic operations, such as filtering, summarizing, and computing the pileup coverage function of mapped short reads using their genome coordinates and record properties. For instance, in RNA-seq studies it is very often necessary to filter out records that do not have decent mapping quality scores, or to calculate base coverage or genomic region coverage. Some of these operations can be parallelized very easily, since they can be computed independently and directly at the level of individual short reads (e.g. counting, or filtering by read flags or quality). Others, such as the coverage function (pileup) or genomic region counts, can be efficiently parallelized at the level of data partitions, i.e. with the data partitioned by genomic coordinates.

Taking into consideration that each BAM file contains typically millions of reads mapped to the genomic positions across the whole genome, both kinds of operations exhibit a very high degree of parallelism.
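The two levels of parallelism can be illustrated with a minimal sketch in plain Python. The tuples below stand in for real BAM records (an actual pipeline would parse alignments with a library such as pysam or Hadoop-BAM); the field layout and thresholds are illustrative assumptions:

```python
# Sketch of the two parallelism levels: read-level filtering vs.
# partition-level pileup. Plain tuples stand in for real BAM records.
from collections import defaultdict

# (read_id, chromosome, start_position, length, mapping_quality)
reads = [
    ("r1", "chr1", 100, 5, 60),
    ("r2", "chr1", 102, 5, 10),   # low-quality read
    ("r3", "chr1", 104, 5, 45),
    ("r4", "chr2", 200, 5, 50),
]

# Read-level operation: quality filtering is embarrassingly parallel,
# because each record can be tested independently of all others.
def passes_quality(read, min_mapq=30):
    return read[4] >= min_mapq

filtered = [r for r in reads if passes_quality(r)]

# Partition-level operation: pileup coverage needs all reads overlapping
# a genomic region, so the data is first partitioned by genomic
# coordinates (here simply by chromosome) and each partition is
# processed independently.
def pileup(partition):
    coverage = defaultdict(int)
    for _, _, start, length, _ in partition:
        for pos in range(start, start + length):
            coverage[pos] += 1
    return dict(coverage)

partitions = defaultdict(list)
for r in filtered:
    partitions[r[1]].append(r)        # partition key = chromosome

coverage_per_chrom = {chrom: pileup(p) for chrom, p in partitions.items()}
print(coverage_per_chrom["chr1"][104])  # covered by r1 and r3 -> 2
```

In a Spark implementation the first operation maps to a `filter` over the RDD of reads, while the second maps to a partition-wise aggregation after repartitioning by genomic coordinates.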

Unfortunately, the BAM file format is not well suited for parallel and distributed computing [50, 55]. This is because it uses a centralised file header and implements a gzip-compatible compression method that is not record-aware, in the sense that it allows records to be split between compression blocks. To overcome these shortcomings, a few approaches have been proposed so far. The most naïve one is to use the SAM format instead, or to convert the BAM file to any other text-based format that can be easily handled by the HDFS built-in libraries. This approach, however, results in several times higher disk requirements and interconnect network load. Text files can be stored compressed in HDFS, but even then only block compressions like bzip2, Snappy or LZO allow files to be splittable and thus eligible for parallel processing. Such an approach would also require an additional, very time-consuming step of transcoding the BAM files to compressed SAM files.
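Why record awareness matters for splittability can be shown with a small sketch. HDFS hands each worker a fixed-size byte range, and with newline-delimited text the reader can resynchronise at the next record boundary (this is essentially what Hadoop's TextInputFormat does); a plain gzip stream offers no such resync point. The toy data and block size below are purely illustrative:

```python
# Minimal sketch of record-aware split reading over fixed-size "blocks".
data = b"read1\tchr1\t100\nread2\tchr1\t102\nread3\tchr2\t200\n"
BLOCK = 16  # artificially small "HDFS block" for illustration

def read_split(buf, start, end):
    """Return complete records whose first byte lies in [start, end)."""
    if start > 0:
        # Skip the tail of a record that began in the previous block;
        # start - 1 handles the case where the block begins exactly
        # on a record boundary.
        start = buf.find(b"\n", start - 1) + 1
    records = []
    while start < end:
        nl = buf.find(b"\n", start)
        records.append(buf[start:nl])   # may read past `end` to finish the record
        start = nl + 1
    return records

splits = [read_split(data, i, min(i + BLOCK, len(data)))
          for i in range(0, len(data), BLOCK)]
records = [r for split in splits for r in split]
print(len(records))  # every record recovered exactly once -> 3
```

Each worker processes one split independently, yet every record is recovered exactly once. BAM's compression blocks lack an analogous in-band marker for resynchronising on record boundaries, which is what makes the format awkward for this scheme.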

Another idea was proposed in [50]: it consists in implementing custom HDFS InputFormats that transparently take care of accessing short reads kept in BAM files stored in HDFS clusters. This method is very convenient, as it only requires copying the alignment files to HDFS storage, without any data preprocessing.

On the other hand, it suffers from the same drawbacks of the BAM format design mentioned above, which in some cases may deteriorate scalability due to contention on the file header segment. This data access method was initially implemented in SparkSeq, as it currently seems to be the best trade-off between convenience of use and performance.

The most recent approach [55] aims at introducing a completely new data format for storing alignment data. The basic idea is to keep read data as self-contained records using Avro serialization. The serialized records are then stored on disk using Apache Parquet, a columnar compressed file format designed for distribution across multiple machines (as in HDFS). Thanks to these optimizations, two appealing goals have been achieved: elimination of the centralized header and better scalability characteristics, as well as reduced disk occupancy: the ADAM files are up to 25 % smaller than BAM without losing any information. Unfortunately, applying the ADAM format in data processing may still be problematic due to the current lack of support for this file format in mapping software. BAM files need to be transcoded to the ADAM format beforehand, and this step may sometimes be more time-consuming than running the analysis on BAM files directly.

Fig. 1 The impact of the HDFS block size selection (x-axis: HDFS block size, from 16 MB to 192 MB, with 64 MB as the default; y-axis: normalized execution time [%], 0–250)

HDFS read performance HDFS read performance is in many cases a key factor for robust NGS data analysis, as many of the algorithms are more I/O- than CPU-bound. Many of these operations require fast sequential read access to data stored in the HDFS cluster. It has been shown in [52] that a proper HDFS block size selection can cut processing time by more than 20 %. As a rule of thumb, increasing the block size from the default value of 64 MB is advisable. However, this parameter should be adjusted in line with the throughput of the underlying storage. Increasing it too much leads to serious performance degradation, as shown in Fig. 1.
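As a sketch of how this tuning could look, the block size can be raised cluster-wide in `hdfs-site.xml`; the 128 MB value below is only an example and should be validated against the storage throughput, as discussed above (the property name follows Hadoop 2.x, where older releases used `dfs.block.size`):

```xml
<!-- hdfs-site.xml: raise the block size from the 64 MB default to 128 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
```

The same setting can also be applied per file at upload time via the generic `-D dfs.blocksize=…` option of the `hdfs dfs -put` command, which allows experimenting without changing the cluster-wide default.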

Another possible optimisation is to take advantage of a feature introduced in HDFS 2.1, called “Short-Circuit Local Reads”. This option allows a DFSClient (e.g. an Apache Spark worker) located on the same machine as the DataNode (which is very often the case) to read locally stored BAM file splits directly from the local disk, rather than over a TCP socket and the DataTransferProtocol. In this way, another gain of 15–20 % in read throughput can be achieved [56].
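A minimal configuration sketch for enabling this feature is shown below; the socket path is an example, and the native `libhadoop` library must be available on both the DataNode and the client for the option to take effect:

```xml
<!-- hdfs-site.xml on both DataNode and client: enable short-circuit
     local reads over a shared UNIX domain socket -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```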

Data movement A very frequent need is to transfer BAM alignment files to the HDFS cluster efficiently, since the alignment and the secondary analysis are done on different systems. To facilitate data transfer between the mapping and analytics environments, one can use a specialized gateway, since a direct copy operation requires that the source host have network access to all DataNodes in the HDFS cluster. Two popular solutions are HttpFS, which enables data requests over a simple HTTP REST protocol, and the HDFS NFS gateway. The latter was introduced in the Hadoop 2.6 release and makes it possible to mount HDFS using the NFSv3 distributed file system protocol [57].
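To illustrate the gateway route, the sketch below builds the REST URL a client outside the cluster would use to fetch a file through HttpFS, which exposes the same REST API as WebHDFS. The host name, port and file path are placeholders:

```python
# Sketch: building a WebHDFS/HttpFS REST URL for a file operation.
# HttpFS listens on port 14000 by default and proxies the request,
# so the client needs network access only to the gateway host.
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=14000, params=None):
    """Build a WebHDFS/HttpFS REST URL for the given file operation."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("gateway.example.org", "/data/sample.bam", "OPEN",
                  params={"user.name": "analyst"})
print(url)
# http://gateway.example.org:14000/webhdfs/v1/data/sample.bam?op=OPEN&user.name=analyst
```

The returned URL can then be fetched with any HTTP client (e.g. curl), with no Hadoop libraries required on the sending side, which is precisely what makes the gateway convenient for moving BAM files from the alignment environment.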

Data security Data security is one of the main concerns in sequencing data analyses performed in the cloud [58, 59], as the cost of data production is still high, and the market value of state-of-the-art biological data or patients’ medical data is inestimable. This is why providing a solid security level for data both in motion and at rest seems to be a crucial factor for the adoption of cloud-based NGS analytics.

Encryption of data in motion has long been available in Hadoop. It encompasses, inter alia, Hadoop RPC, HDFS data transfer and the WebUIs. For data at rest in Hadoop, encryption is available since the recent version 2.6. For transparent encryption, a new abstraction was introduced to HDFS: the encryption zone. An encryption zone is a special directory whose contents are transparently encrypted upon write and transparently decrypted upon read. Each encryption zone is associated with a single encryption zone key, which is specified when the zone is created.
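As a setup sketch, creating an encryption zone boils down to a short sequence of administrative commands; this assumes a running Hadoop KMS, and the key and directory names below are placeholders:

```shell
# Sketch: creating an HDFS encryption zone (Hadoop 2.6+, KMS required)
hadoop key create ngs_zone_key                  # create the zone key in the KMS
hdfs dfs -mkdir /secure/ngs                     # the zone directory must be empty
hdfs crypto -createZone -keyName ngs_zone_key -path /secure/ngs
hdfs crypto -listZones                          # verify the new zone
```

Files subsequently written under `/secure/ngs` are encrypted and decrypted transparently, so analysis code needs no changes.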

In order to enforce security policies at any level (i.e. on files, folders, etc. stored in HDFS), and to manage fine-grained access control to higher-level objects such as Hive or HBase tables and columns, one can take advantage of Apache Ranger [60].

Finally, the easiest way to protect access to the services provided by various Hadoop ecosystem components (such as WebHDFS, Stargate, i.e. the HBase REST gateway, or Hive JDBC) is to use the Apache Knox solution [61].

