Diego Klabjan

CSV to Spark SQL tables

3/20/2015

Recently we were involved in a project that required reading and importing more than 30 csv files into Spark SQL. We started writing Scala code to import the files 'manually', one by one, but we soon realized there was substantial repetition.

As a result we created a nice helper object that takes as input information about the csv files and automatically creates a schema per file. Each csv file must have a header, which dictates the column names in the corresponding table. The user customizes an object that lists the details for each file. For example, for each file the user can specify the following (a sketch appears after the list):

  • The name of the schema
  • The delimiter
  • An optional formatter function that takes a string value from the input csv file and returns the corresponding Scala object used in the SchemaRDD (if no formatter is specified, the values are treated as strings)
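
To make this concrete, here is a minimal sketch of the idea, written against the Spark 1.2-era SchemaRDD API that the post refers to. The names CsvFileInfo, CsvHelper, and registerCsv are hypothetical and chosen for illustration only; they are not the actual API of the spark-csv2sql package, and this sketch keeps every column as a string instead of applying formatters.

    import java.util.regex.Pattern
    import org.apache.spark.sql._

    // Hypothetical per-file description; the real package also accepts
    // an optional formatter per column to convert values from strings.
    case class CsvFileInfo(path: String, schemaName: String, delimiter: String = ",")

    object CsvHelper {
      // Read one csv file, derive the schema from its header row, and
      // register it as a temp table queryable through Spark SQL.
      def registerCsv(sqlContext: SQLContext, info: CsvFileInfo): Unit = {
        val sep = Pattern.quote(info.delimiter)      // split() expects a regex
        val lines = sqlContext.sparkContext.textFile(info.path)
        val headerLine = lines.first()

        // The header dictates the column names; every column stays a
        // string here (a formatter would change the column's type).
        val schema = StructType(
          headerLine.split(sep).map(name => StructField(name, StringType, nullable = true)))

        val rows = lines
          .filter(_ != headerLine)                   // drop the header row
          .map(line => Row(line.split(sep, -1): _*)) // -1 keeps trailing empty fields

        sqlContext.applySchema(rows, schema).registerTempTable(info.schemaName)
      }
    }

With something like this in place, the thirty-plus files reduce to a list of entries and a single loop:

    // assuming sqlContext: SQLContext is already in scope
    val files = Seq(
      CsvFileInfo("data/orders.csv", "orders", delimiter = "|"),
      CsvFileInfo("data/customers.csv", "customers"))
    files.foreach(CsvHelper.registerCsv(sqlContext, _))
    sqlContext.sql("SELECT * FROM orders").collect()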

The code is available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-csv2sql.git and as a package on www.spark-packages.org.


