Fix Record Field Splitting
Repairs records whose parsing yields too many or too few field values relative to the defined
number of fields. Embedded delimiters inside some field values produce too many parsed values,
and records broken across multiple lines produce too few values. These situations are categorized into 3 types:
1) big (too many parsed values), 2) small1 (1 value too few), 3) small2 (2 or more values too few). small1 is a
special case because some data systems purposefully drop the final field value of a record when it is empty,
making it null to save storage and memory space. In this case, the record is actually fine but is missing its final
field value.
field value. This can be accepted by having the setting 'allowLastEmpty' = TRUE leading to a default value assigned
to this last record field based on datatype: int and real assigned 0, bool assigned FALSE, others assigned empty string.
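The three categories and the datatype defaults described above can be sketched as follows. This is a minimal illustration, not the function's actual code: the helper names (classify_field_count, default_for_datatype, fill_last_field) are hypothetical, datatypes are assumed to be given as plain strings such as "int" or "bool", and the defaults are assumed to be stored as text because records are strings.

```python
from typing import List


def classify_field_count(n_parsed: int, n_defined: int) -> str:
    """Return 'big', 'small1', 'small2', or 'ok' for a parsed record."""
    if n_parsed > n_defined:
        return "big"      # too many parsed values
    if n_parsed == n_defined - 1:
        return "small1"   # exactly 1 value too few
    if n_parsed < n_defined - 1:
        return "small2"   # 2 or more values too few
    return "ok"


def default_for_datatype(datatype: str) -> str:
    """Default text for a missing final field (assumed string encoding)."""
    if datatype in ("int", "real"):
        return "0"
    if datatype == "bool":
        return "FALSE"
    return ""


def fill_last_field(values: List[str], datatypes: List[str],
                    allow_last_empty: bool = True) -> List[str]:
    """Append the default final value when the record is a small1 case."""
    is_small1 = classify_field_count(len(values), len(datatypes)) == "small1"
    if allow_last_empty and is_small1:
        return values + [default_for_datatype(datatypes[-1])]
    return values
```

For example, with datatypes ["string", "int", "bool"], a record parsed as ["a", "7"] is small1 and would be completed with FALSE, while a record parsed as ["a"] is small2 and is left for the line-joining repair.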
* settings: Dictionary of setting parameters. Both key and value are strings. Settings:
- allow_last_empty: bool whether to allow the last field to be empty (i.e. small1 parsing) and assign it a default value. Default is TRUE.
- is_quoted: bool whether field values may be enclosed by double quotes, as is common when data is exported from spreadsheets and some databases. Default is FALSE.
- has_header: bool whether first non-empty record is a header row of delimited field titles. Default is FALSE.
- ignore_empty: bool whether to ignore empty records. Default is TRUE.
- pin_fields: field titles delimited by pipe (if more than 1) that are pinned, meaning that if a record has too many fields (i.e. big) then these fields will not be shifted as
the algorithm finds the best way to merge values to make the corrected record.
- ignore_start_str: string parts delimited by pipe (if more than 1) that will cause records starting with any one of them to be ignored. Always case insensitive.
A common use for this is to filter out comment lines such as those starting with # or // in which case set to #|//
- delim: name of delimiter to use to parse records: comma, pipe, tab, colon, caret. This is required.
- join_char: token (max 10 chars) or alias to insert when joining lines to remediate parsing. Default is to use nothing.
Aliases include: -comma-, -tab-, -pipe-, -space-, -bslash-, -fslash-, -lparen-, -rparen-,
-lcurly-, -rcurly-, -lsquare-, -rsquare-, -dblquote-, -mathpi-, -mathe-
* srcfields: list of Field objects comprising records
* srcrecs: list of string source records
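As an illustration of the settings parameter, the sketch below builds a dictionary in which every key and value is a string, then runs a few basic checks taken from the descriptions above. The field titles in pin_fields, the "TRUE"/"FALSE" spelling of boolean strings, and the check_settings helper are assumptions made for the example and are not part of the documented API.

```python
# Hypothetical settings dictionary; every key and value is a string.
settings = {
    "delim": "comma",            # required
    "allow_last_empty": "TRUE",  # assumed spelling of boolean strings
    "is_quoted": "TRUE",
    "has_header": "TRUE",
    "ignore_empty": "TRUE",
    "pin_fields": "id|date",     # hypothetical field titles
    "ignore_start_str": "#|//",  # skip comment lines
    "join_char": "-space-",      # alias from the documented list
}

VALID_DELIMS = {"comma", "pipe", "tab", "colon", "caret"}


def check_settings(s):
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    for key, value in s.items():
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"{key!r}: key and value must both be strings")
    if s.get("delim") not in VALID_DELIMS:
        problems.append("delim is required and must name a supported delimiter")
    if len(str(s.get("join_char", ""))) > 10:
        problems.append("join_char must be at most 10 characters")
    return problems


print(check_settings(settings))
```

Running the checks before calling the function gives an early, readable error for a missing delim or a non-string value.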
returns: list of new records. On error, the 0th entry starts with notok: and otherwise the 0th entry is a string of stats and
entry[1] is the header of field titles. In the stats, examples may have a prefix of (nline), with nline being the line number read
(excluding empty and comment lines). It is therefore 1 larger than the line's index
in the Python list (i.e. nline is 1-based while lists are 0-based).
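A caller might unpack the returned list along these lines. This is a sketch under assumptions: split_result and nline_to_index are hypothetical helpers, and the layout beyond entries 0 and 1 (data records starting at index 2) is assumed rather than stated in the description above.

```python
def split_result(result):
    """Separate stats, header, and data records from the returned list."""
    if not result:
        raise ValueError("empty result")
    if result[0].startswith("notok:"):
        # Error case: the 0th entry carries the error message.
        raise ValueError(result[0])
    stats = result[0]
    header = result[1] if len(result) > 1 else ""
    records = result[2:]  # assumption: data records follow the header
    return stats, header, records


def nline_to_index(nline):
    """Convert a 1-based nline from the stats into a 0-based list index."""
    return nline - 1
```

The index from nline_to_index refers to the list of lines actually read, so it only matches positions in the raw input when no empty or comment lines were skipped.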
Definition at line 351 of file remediateparsing.py.