Recently we were involved in a project that required reading and importing more than 30 CSV files into Spark SQL. We started writing Scala code to 'manually' import the files one by one, but we soon realized there was substantial repetition.
As a result we created a helper object that takes as input information about the CSV files and automatically creates a schema per file. Each CSV file must have a header, which dictates the names of the columns in the corresponding table. The user customizes an object where the details are listed per file; a sketch of what such a configuration might look like follows the list below. For example, for each file the user can specify:
- The name of the schema
- Delimiter
- An optional formatter function that takes a string value from the input CSV file and returns the corresponding Scala object used in the SchemaRDD (by default, i.e., if no formatter is specified, values are treated as strings).
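To give a flavour of the idea, here is a minimal sketch of such a per-file configuration object. The names (`CsvFileInfo`, `CsvFiles`, the column names, and the paths) are purely illustrative and do not necessarily match the actual spark-csv2sql API; see the repository linked below for the real interface.

```scala
// Illustrative sketch only -- names and fields are assumptions, not the
// actual spark-csv2sql API.
case class CsvFileInfo(
  path: String,                                       // location of the CSV file
  schemaName: String,                                 // name of the resulting schema/table
  delimiter: String = ",",                            // field delimiter
  formatters: Map[String, String => Any] = Map.empty  // optional per-column formatters
)

object CsvFiles {
  // One entry per CSV file to import; the header row supplies the column names.
  val files: Seq[CsvFileInfo] = Seq(
    CsvFileInfo("data/customers.csv", "customers"),
    CsvFileInfo("data/orders.csv", "orders", delimiter = ";",
      formatters = Map(
        "order_id" -> ((s: String) => s.toInt),    // parse integers
        "amount"   -> ((s: String) => s.toDouble)  // parse doubles
      ))
  )
}
```

Columns without a formatter stay as strings, so only the columns that need a non-string type have to be listed.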
The code is available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-csv2sql.git and as a package on www.spark-packages.org.