Diego Klabjan

CSV to Spark SQL tables

3/20/2015

Recently we were involved in a project that required reading and importing more than 30 csv files into Spark SQL. We started writing Scala code to import the files 'manually', one by one, but we soon realized there was substantial repetition.

As a result we created a nice helper object that takes as input information about the csv files and automatically creates a schema per file. Each csv file must have a header, which dictates the column names in the corresponding table. The user customizes an object that lists the details for each file. For example, for each file the user can specify the following (a sketch appears after the list):

  • The name of the schema
  • The delimiter
  • An optional formatter function that takes a string value from the input csv file and returns the corresponding Scala object used in the SchemaRDD (if no formatter is specified, the values are treated as strings)
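
To make this concrete, here is a minimal sketch of the idea, written against the Spark 1.2-era SchemaRDD API that the post refers to. The names CsvFileInfo, CsvHelper, and registerCsv are hypothetical and chosen for illustration only; they are not the actual API of the spark-csv2sql package, and this sketch keeps every column as a string instead of applying formatters.

    import java.util.regex.Pattern
    import org.apache.spark.sql._

    // Hypothetical per-file description; the real package also accepts
    // an optional formatter per column to convert values from strings.
    case class CsvFileInfo(path: String, schemaName: String, delimiter: String = ",")

    object CsvHelper {
      // Read one csv file, derive the schema from its header row, and
      // register it as a temp table queryable through Spark SQL.
      def registerCsv(sqlContext: SQLContext, info: CsvFileInfo): Unit = {
        val sep = Pattern.quote(info.delimiter)      // split() expects a regex
        val lines = sqlContext.sparkContext.textFile(info.path)
        val headerLine = lines.first()

        // The header dictates the column names; every column stays a
        // string here (a formatter would change the column's type).
        val schema = StructType(
          headerLine.split(sep).map(name => StructField(name, StringType, nullable = true)))

        val rows = lines
          .filter(_ != headerLine)                   // drop the header row
          .map(line => Row(line.split(sep, -1): _*)) // -1 keeps trailing empty fields

        sqlContext.applySchema(rows, schema).registerTempTable(info.schemaName)
      }
    }

With something like this in place, the thirty-plus files reduce to a list of entries and a single loop:

    // assuming sqlContext: SQLContext is already in scope
    val files = Seq(
      CsvFileInfo("data/orders.csv", "orders", delimiter = "|"),
      CsvFileInfo("data/customers.csv", "customers"))
    files.foreach(CsvHelper.registerCsv(sqlContext, _))
    sqlContext.sql("SELECT * FROM orders").collect()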

The code is available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-csv2sql.git and as a package on www.spark-packages.org.


