

# Lzip file spark code
In this post, we will load a TSV file into a Spark DataFrame. What is the difference between CSV and TSV? Only the separator: a CSV file stores values separated by commas, whereas a TSV file stores values separated by tabs.

Let's say we have a data file with a TSV extension. To read a CSV file you must first create a DataFrameReader and set a number of options: `spark.read.format('csv').option('header','true').load(filePath)`. Here we load a CSV file and tell Spark that the file contains a header row. For a TSV file, also set the separator with `.option('sep', '\t')`. This step is guaranteed to trigger a Spark job, where a Spark job is a block of parallel computation that executes some task.

However, Spark is really slow at reading gzip files, because it cannot parallelize reading a single gzip file. The best you can do is split the data into chunks that are each gzipped. You can then parallelize over the file names to speed things up: `filenames_rdd = sc.parallelize(list_of_files, 100)` followed by `lines_rdd = filenames_rdd.flatMap(lambda path: gzip.open(path))`.

GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZIP compresses data about 30% more than Snappy, and reading GZIP data takes about 2x more CPU than reading Snappy data. LZO focuses on decompression speed at low CPU usage, and offers higher compression at the cost of more CPU. GZip is often a good choice for cold data, which is accessed infrequently, and it is still the better option for longer-term or static storage. Snappy or LZO are a better choice for hot data, which is accessed frequently.

If you need your compressed data to be splittable, check the format: BZip2 is splittable, LZO is splittable once the files are indexed, and Snappy is splittable when it is used inside a container format such as Parquet, Avro, or SequenceFile. A plain GZip file is not splittable. It is worth running tests to see if you detect a significant difference. See extensive research and benchmark code and results in this article (Performance of various general compression algorithms – some of them are unbelievably fast!).
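The chunking advice above can be sketched with the Python standard library alone. This is a minimal sketch, not a definitive implementation: the function names `split_into_gzip_chunks` and `read_gzip_lines`, the `part-00000.tsv.gz` naming scheme, and the chunk size are all hypothetical choices made for illustration. It assumes the input is a UTF-8 text file, and that the chunk paths are reachable from every Spark executor (for example on a shared filesystem).

```python
import gzip
import os
import tempfile


def split_into_gzip_chunks(src_path, out_dir, lines_per_chunk):
    """Split a text file into gzip chunks at line boundaries; return the chunk paths."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    with open(src_path, "rt", encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new gzip file every `lines_per_chunk` lines.
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                # Hypothetical naming scheme, chosen only for this sketch.
                path = os.path.join(out_dir, f"part-{len(chunk_paths):05d}.tsv.gz")
                out = gzip.open(path, "wt", encoding="utf-8")
                chunk_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


def read_gzip_lines(path):
    """Return the lines of one gzip file, without trailing newlines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# With a SparkContext `sc` (assumed to exist), the chunks could be read in parallel:
#   lines_rdd = sc.parallelize(chunk_paths, len(chunk_paths)).flatMap(read_gzip_lines)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "data.tsv")
        with open(src, "w", encoding="utf-8") as f:
            for n in range(10):
                f.write(f"{n}\tvalue{n}\n")
        chunks = split_into_gzip_chunks(src, os.path.join(tmp, "chunks"), 4)
        for chunk in chunks:
            print(os.path.basename(chunk), len(read_gzip_lines(chunk)))
```

Each chunk is an independent gzip file, so every Spark task can decompress its own file without waiting on the others. Unlike the inline one-liner, `read_gzip_lines` closes the file handle and returns decoded strings rather than bytes.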

When Spark switched from GZIP to Snappy by default, this was the reasoning: based on our tests, gzip decompression is very slow (< 100MB/s). Use Snappy if you can handle higher disk usage for the performance benefits (lower CPU + splittable).
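To check the decompression figure on your own data, a rough timing harness like the one below may help. It is only a sketch using the standard library `gzip` module, and it is not the test the Spark developers ran. The function name `gzip_decompress_throughput` and the sample payload are hypothetical, and the result depends heavily on the data and the hardware. Snappy and LZO need third-party packages and are not covered here.

```python
import gzip
import time


def gzip_decompress_throughput(raw: bytes, repeats: int = 3) -> float:
    """Return the best observed gzip decompression rate, in MB/s of uncompressed output."""
    compressed = gzip.compress(raw)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        out = gzip.decompress(compressed)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        # Sanity check: the round trip must reproduce the input exactly.
        assert out == raw
    # Guard against a zero reading from the timer on very small inputs.
    best = max(best, 1e-9)
    return len(raw) / (1024 * 1024) / best


if __name__ == "__main__":
    # Highly repetitive sample data; real files usually compress and decompress differently.
    sample = b"col_a\tcol_b\t12345\n" * 200_000
    print(f"gzip decompression: {gzip_decompress_throughput(sample):.1f} MB/s")
```

Running it on a representative slice of your own files gives a more useful number than synthetic data.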
