VerityDotNet 1.0
C# library for Verity data profiling, quality control, remediation
VerityDotNet.RemediateParsing Class Reference

Remediates (i.e., fixes) records that parse incorrectly, yielding a number of field values that differs from the number of fields in the schema. Three error types are handled: big (too many field values), small1 (too few by 1), and small2 (too few by 2 or more). This class detects and corrects these errors, producing new, corrected records.

Static Public Member Functions

static List< string > FixRecordFieldSplit (Dictionary< string, string > settings, List< Field > srcFields, List< string > srcRecs)
 Repairs records whose parsed field values are too many or too few relative to the defined number of fields. Errors are categorized as big (too many parsed values), small1 (1 too few values), or small2 (2 or more too few values).
 
static List< string > ShiftArrayEntries (List< string > origVals, List< Field > fields, Dictionary< string, int > hashFields, List< string > pinFlds)
 Shifts an array of parsed field values to correct a record that has more values than defined fields. Uses an algorithm that assesses which field and field + 1 (index positions in the record) are best to join, based on datatypes, formats, and known patterns of root causes of this error.
 

Detailed Description

Remediates (i.e., fixes) records that parse incorrectly, yielding a number of field values that differs from the number of fields in the schema. There are three types of error:
-big: too many field values, typically due to embedded delimiters in some field values
-small1: too few fields by 1, which can be either an error or acceptable, since some databases intentionally drop the last field if it is empty, setting it to null
-small2: too few fields by 2 or more, typically caused by line feeds in fields or problems exporting records across data system types
This class detects and corrects these errors, producing new, corrected records.

Member Function Documentation

◆ FixRecordFieldSplit()

static List< string > VerityDotNet.RemediateParsing.FixRecordFieldSplit ( Dictionary< string, string > settings,
List< Field > srcFields,
List< string > srcRecs )
static

Repairs records whose parsed field values are too many or too few relative to the defined number of fields. Embedded delimiters in some field values cause too many parsed values, and records broken across multiple lines cause too few. These situations are categorized into 3 types: 1) big (too many parsed values), 2) small1 (1 too few values), 3) small2 (2 or more too few values).

small1 is a special case, since some data systems purposefully eliminate the final field value in a record if it is empty, making it null to save storage and memory space. In this case the record is actually fine but is missing its final field value. Such records can be accepted by setting 'allowLastEmpty' = TRUE, which assigns the last record field a default value based on its datatype: int and real are assigned 0, bool is assigned FALSE, and all others are assigned an empty string.
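The categories and the small1 defaults described above can be sketched as follows. This is an illustrative sketch only, not the library's implementation: the class and method names are hypothetical, and the datatype spellings ("int", "real", "bool") and the string form of the defaults are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of VerityDotNet: mirrors the categories described above.
public static class ParseCategorySketch
{
    // Returns "ok", "big", "small1", or "small2" for a parsed value count.
    public static string Classify(int numValues, int numFields)
    {
        int diff = numValues - numFields;
        if (diff == 0) return "ok";
        if (diff > 0) return "big";
        return diff == -1 ? "small1" : "small2";
    }

    // Default for a dropped final field when 'allowLastEmpty' is TRUE.
    // The datatype names and the string form of each default are assumptions.
    public static string DefaultForLastField(string datatype)
    {
        switch (datatype.ToLowerInvariant())
        {
            case "int":
            case "real":
                return "0";
            case "bool":
                return "FALSE";
            default:
                return string.Empty;
        }
    }
}
```

For example, a record that parses to 3 values against a 4-field schema would be classified as small1, and if its last field were an int it would receive the default 0.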

Parameters
settings: Dictionary of setting parameters. Both key and value are strings. Settings:
-allowLastEmpty: bool whether to allow the last field to be empty (i.e. small1 parsing) and assign it a default value. Default is TRUE.
-isQuoted: bool whether field values may be enclosed by double quotes, as is common when data is exported from spreadsheets and some databases. Default is FALSE.
-hasHeader: bool whether the first non-empty record is a header row of delimited field titles. Default is FALSE.
-ignoreEmpty: bool whether to ignore empty records. Default is TRUE.
-pinFields: field titles delimited by pipe (if more than 1) that are pinned, meaning that if a record has too many fields (i.e. big) these fields will not be shifted as the algorithm finds the best way to merge values to make the corrected record.
-ignoreStartStr: string parts delimited by pipe (if more than 1) that cause records starting with any one of them to be ignored. Always case insensitive. A common use is to filter out comment lines, such as those starting with # or //, in which case set to #|//
-delim: name of the delimiter used to parse records: comma, pipe, tab, colon, caret. This is required.
-joinChar: token (max 10 chars) or alias to insert when joining lines to remediate parsing. Default is to use nothing. Aliases include: -comma-, -tab-, -pipe-, -space-, -bslash-, -fslash-, -lparen-, -rparen-, -lcurly-, -rcurly-, -lsquare-, -rsquare-, -dblquote-, -mathpi-, -mathe-
srcFields: list of Field objects comprising records
srcRecs: list of string source records
Returns

List of new records. If an error occurs, the 0th entry starts with "notok:"; otherwise it is a string of stats:
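A minimal usage sketch for the settings and return convention described above. The setting keys come from the parameter list; everything else is an assumption made for illustration: the helper class, the field titles given for pinFields, the exact string spelling of boolean values, and treating a null or empty result as an error.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, not part of VerityDotNet.
public static class FixRecordFieldSplitUsageSketch
{
    // Builds a settings dictionary using the documented keys.
    public static Dictionary<string, string> BuildSettings()
    {
        return new Dictionary<string, string>
        {
            { "delim", "comma" },           // required
            { "allowLastEmpty", "TRUE" },   // boolean spelling is an assumption
            { "isQuoted", "FALSE" },
            { "hasHeader", "TRUE" },
            { "ignoreEmpty", "TRUE" },
            { "ignoreStartStr", "#|//" },   // skip comment lines
            { "pinFields", "id|created" },  // hypothetical field titles
            { "joinChar", "-space-" }       // alias from the documented list
        };
    }

    // True when the returned list signals failure (0th entry starts with "notok:").
    // Treating a null or empty list as an error is a defensive assumption.
    public static bool IsError(List<string> result)
    {
        return result == null || result.Count == 0
            || result[0].StartsWith("notok:", StringComparison.Ordinal);
    }
}

// Usage, assuming srcFields (List<Field>) and srcRecs (List<string>) already exist:
//   List<string> result = RemediateParsing.FixRecordFieldSplit(
//       FixRecordFieldSplitUsageSketch.BuildSettings(), srcFields, srcRecs);
//   if (FixRecordFieldSplitUsageSketch.IsError(result)) { /* inspect result[0] */ }
```

Checking the 0th entry before using the remaining records keeps error handling in one place, since the method reports failure through the returned list rather than through an exception.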

◆ ShiftArrayEntries()

static List< string > VerityDotNet.RemediateParsing.ShiftArrayEntries ( List< string > origVals,
List< Field > fields,
Dictionary< string, int > hashFields,
List< string > pinFlds )
static

Shifts an array of parsed field values to correct a record that has more values than defined fields. Uses an algorithm that assesses which field and field + 1 (index positions in the record) are best to join, based on datatypes, formats, and known patterns of root causes of this error.

Parameters
origVals: array of parsed field values
fields: list of Field objects containing datatype and format specifications
hashFields: Dictionary mapping each lowercase field title to its array index
pinFlds: optional list of field titles that are pinned, meaning their position cannot be changed
Returns
New array of parsed field values. If an error occurs, the 0th entry starts with "notok:"
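The idea of joining a field with its neighbor based on datatypes can be illustrated with a deliberately simplified sketch. It is not the library's algorithm: it handles only one extra value, uses plain datatype names (assumed spellings "int" and "real") in place of Field objects, ignores formats and pinned fields, and simply takes the first adjacent join after which every value fits its field's datatype. All names below are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical illustration, not part of VerityDotNet.
public static class ShiftSketch
{
    // True when the value is acceptable for the (assumed) datatype name.
    static bool Fits(string value, string datatype)
    {
        switch (datatype)
        {
            case "int":
                return long.TryParse(value, NumberStyles.Integer,
                    CultureInfo.InvariantCulture, out _);
            case "real":
                return double.TryParse(value, NumberStyles.Float,
                    CultureInfo.InvariantCulture, out _);
            default:
                return true; // other types accept any text in this sketch
        }
    }

    // Joins values[i] and values[i + 1] with joinText and returns the shorter list.
    static List<string> JoinAt(List<string> values, int i, string joinText)
    {
        var merged = new List<string>(values);
        merged[i] = values[i] + joinText + values[i + 1];
        merged.RemoveAt(i + 1);
        return merged;
    }

    // For a record with exactly one extra value, returns the first adjacent join
    // after which every value fits its field's datatype, or null if none does.
    public static List<string> MergeOneExtra(
        List<string> values, List<string> datatypes, string joinText)
    {
        if (values.Count != datatypes.Count + 1) return null;
        for (int i = 0; i < values.Count - 1; i++)
        {
            var candidate = JoinAt(values, i, joinText);
            bool allFit = true;
            for (int f = 0; f < candidate.Count; f++)
            {
                if (!Fits(candidate[f], datatypes[f])) { allFit = false; break; }
            }
            if (allFit) return candidate;
        }
        return null;
    }
}
```

For instance, with datatypes int, string, int and a record split into "7", "Smith", " John", "42" by an embedded comma, joining positions 0 and 1 would leave a non-integer in the first field, so the sketch settles on joining positions 1 and 2.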
