I am fetching an RDBMS table over JDBC with some 10-20 partitions using ROW_NUM. From each of these partitions I want to process/format the data and write one or more files out to file storage, splitting on file size: each file must be less than 500MB. How do I write multiple files out from a single partition? The Spark config property 'spark.sql.files.maxRecordsPerFile' won't work for me because rows vary widely in size: each row contains blob data that may be anywhere from a few hundred bytes to 50MB, so I cannot limit the write by record count.
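For context, the read looks roughly like this (a sketch; the URL, table name, and bounds are placeholders for my actual values):

```scala
// Minimal sketch of the partitioned JDBC read; connection details are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-export").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SVC")  // placeholder URL
  .option("dbtable",
    "(SELECT t.*, ROW_NUMBER() OVER (ORDER BY id) AS ROW_NUM FROM my_table t) q")
  .option("partitionColumn", "ROW_NUM")  // numeric column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "10000000")      // placeholder total row count
  .option("numPartitions", "20")         // the 10-20 read partitions mentioned above
  .load()
```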
How do I further split each DB partition into smaller partitions and then write out the files? If I do a repartition, it shuffles data across all executors. I am trying to keep all data within the same executor to avoid the shuffle. Is it possible to repartition within the same executor core (i.e., repartition just the current partition) and then write a single file from each sub-partition? Something like the sketch below is what I have in mind.
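The best idea I have so far is to do the rolling manually inside each partition: track the bytes written and start a new file once the 500MB threshold would be exceeded. A sketch of that, where formatRow and the output path scheme are hypothetical placeholders for my own formatting and storage logic:

```scala
// Sketch: roll output files by accumulated byte size within one partition,
// so no shuffle is needed. formatRow and the /mnt/export path are placeholders.
import java.io.{BufferedOutputStream, FileOutputStream}
import org.apache.spark.TaskContext

val maxBytes = 500L * 1024 * 1024  // 500MB cap per output file

df.rdd.foreachPartition { rows =>
  // Hypothetical row formatter; real logic would handle the blob column.
  def formatRow(row: org.apache.spark.sql.Row): Array[Byte] =
    row.mkString("|").getBytes("UTF-8")

  val partitionId = TaskContext.getPartitionId()
  var fileIndex = 0
  var bytesWritten = 0L
  var out = new BufferedOutputStream(
    new FileOutputStream(s"/mnt/export/part-$partitionId-$fileIndex"))

  // Close the current file and open the next one in the sequence.
  def roll(): Unit = {
    out.close()
    fileIndex += 1
    bytesWritten = 0L
    out = new BufferedOutputStream(
      new FileOutputStream(s"/mnt/export/part-$partitionId-$fileIndex"))
  }

  for (row <- rows) {
    val bytes = formatRow(row)
    // Roll before writing if this row would push the file past the cap.
    if (bytesWritten > 0 && bytesWritten + bytes.length > maxBytes) roll()
    out.write(bytes)
    bytesWritten += bytes.length
  }
  out.close()
}
```

This keeps each partition's data on its executor, but I'm not sure it's the idiomatic way, so I'd welcome a cleaner approach.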
Read JDBC DB and write files of size 500MB