
Databricks Autoloader - dealing with combined files

Developer https://www.devze.com 2022-12-07 22:08 Source: Internet

I'm working with some files that have some complexities:

  • multiple tab files concatenated into 1
  • csv files with some meta data prior to the csv data
  • csv files with an extra row after the header that should be ignored
  • csv files with log information interspersed into the file

My question is whether Auto Loader can split the stream (i.e. 1 input file to 2 or more output files) based on pattern matching, or whether it has some other mechanism for dealing with these scenarios.
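To make the kind of split concrete, here is a minimal sketch in plain Python, not Auto Loader itself. The line patterns (`#` for metadata, a bracketed level for log lines) and the sample rows are hypothetical, since the real file layout isn't shown here, and it doesn't handle the extra row after the header.

```python
import re
from collections import defaultdict

# Hypothetical patterns: assumed for illustration only.
LOG_LINE = re.compile(r"^\[(INFO|WARN|ERROR)\]")
METADATA_LINE = re.compile(r"^#")


def split_lines(lines):
    """Route each line of one input file to a named output bucket."""
    buckets = defaultdict(list)
    for line in lines:
        if LOG_LINE.match(line):
            buckets["log"].append(line)
        elif METADATA_LINE.match(line):
            buckets["metadata"].append(line)
        else:
            # Anything unmatched is treated as CSV data (header included).
            buckets["data"].append(line)
    return dict(buckets)


# Made-up sample standing in for one combined input file.
sample = [
    "# source: sensor-7",
    "# exported: 2022-12-01",
    "id,value",
    "1,10",
    "[INFO] flushing buffer",
    "2,20",
]
result = split_lines(sample)
```

Each bucket would then go to its own output, which is the 1-to-many behaviour I'm asking about.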

Ignoring the metadata using skipRows isn't an option, as I want to retain the metadata in a separate output file. The rescuedDataColumn option doesn't appear to be a valid approach either, as the data doesn't fall into the 3 scenarios identified in the docs, i.e.:

  1. The column is missing from the schema.
  2. Type mismatches.
  3. Case mismatches.