VerityPy 1.1
Python library for Verity data profiling, quality control, remediation
VerityPy.processing.remediateparsing Namespace Reference

Functions

list shift_array_entries (list origvals, list recflds, dict hashrecflds, list pinflds)
 
list fix_record_field_split (dict settings, list srcfields, list srcrecs)
 

Variables

str DQ = "\""
 

Detailed Description

Remediate Parsing

Remediates (i.e., fixes) records that parse incorrectly, yielding a number of field values that differs from the number of fields in the schema.
There are three types of error:
    big: too many field values, typically due to embedded delimiters in some field values
    small1: too few field values by 1, which can be either an error or acceptable since some databases intentionally drop the last field if it is empty, setting it to null
    small2: too few field values by 2 or more, typically caused by line feeds in fields or problems exporting records across data system types
This module detects and corrects these errors, producing new records that are correct.
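
The three categories can be illustrated with a minimal sketch. It assumes a plain split on the delimiter with no quote handling, and the name classify_record is hypothetical (it is not part of VerityPy):

```python
def classify_record(rec: str, nfields: int, delim: str = ",") -> str:
    """Compare the parsed value count of a record to the schema field count."""
    nvals = len(rec.split(delim))
    if nvals == nfields:
        return "ok"
    if nvals > nfields:
        return "big"       # too many values, e.g. embedded delimiter
    if nvals == nfields - 1:
        return "small1"    # exactly one value missing
    return "small2"        # two or more values missing


# Schema with 3 fields; the second record has an embedded comma.
examples = ["1,Smith,3.5", "1,Smith, John,3.5", "1,Smith", "1"]
labels = [classify_record(r, 3) for r in examples]
print(labels)
```

Only the count comparison is shown here; deciding how to repair each category is a separate step.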

Function Documentation

◆ fix_record_field_split()

list fix_record_field_split ( dict settings,
list srcfields,
list srcrecs )
Fix Record Field Splitting

Repairs records with parsing problems that have too many or too few field values relative to the defined 
number of fields. Embedded delimiters in some field values cause too many parsed values, 
and records broken across multiple lines cause too few values. These situations are categorized into 3 types:
1) big (too many parsed values), 2) small1 (1 too few values), 3) small2 (2 or more too few values). small1 is a 
special case since some data systems purposefully eliminate the final field value in a record if it is empty by 
making it null, thereby saving storage and memory space. In this case, the record is actually fine but is missing its final 
field value. This can be accepted by setting 'allow_last_empty' to TRUE, in which case a default value is assigned 
to this last record field based on datatype: int and real are assigned 0, bool is assigned FALSE, and others are assigned an empty string.
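
The default assignment for the small1 case might look like the following sketch. The helper names are hypothetical, and representing the defaults as the strings "0" and "FALSE" is an assumption; the library's internal representation may differ:

```python
def default_for_datatype(datatype: str) -> str:
    """Default for a dropped final field: 0 for int/real, FALSE for bool, else empty."""
    dt = datatype.lower()
    if dt in ("int", "real"):
        return "0"
    if dt == "bool":
        return "FALSE"
    return ""


def fill_last_field(values: list, datatypes: list) -> list:
    """Append the default only when exactly one value (the last) is missing."""
    if len(values) == len(datatypes) - 1:
        return values + [default_for_datatype(datatypes[-1])]
    return list(values)


print(fill_last_field(["1", "Smith"], ["int", "string", "real"]))
```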

    * settings: Dictionary of setting parameters. Both key and value are strings. Settings:
        - allow_last_empty: bool whether to allow the last field to be empty (i.e. small1 parsing) and assign it a default value. Default is TRUE.
        - is_quoted: bool whether field values may be enclosed by double quotes, as is common when data is exported from spreadsheets and some databases. Default is FALSE.
        - has_header: bool whether first non-empty record is a header row of delimited field titles. Default is FALSE.
        - ignore_empty: bool whether to ignore empty records. Default is TRUE.
        - pin_fields: field titles delimited by pipe (if more than 1) that are pinned, meaning that if a record has too many fields (i.e. big) then these fields will not be shifted as 
                the algorithm finds the best way to merge values to make the corrected record
        - ignore_start_str: string parts delimited by pipe (if more than 1) that will cause records starting with any one of them to be ignored. Always case insensitive. 
                A common use for this is to filter out comment lines such as those starting with # or // in which case set to #|//
        - delim: name of delimiter to use to parse records: comma, pipe, tab, colon, caret. This is required.
        - join_char: token (max 10 chars) or alias to insert when joining lines to remediate parsing. Default is to use nothing. 
                            Aliases include: -comma-, -tab-, -pipe-, -space-, -bslash-, -fslash-, -lparen-, -rparen-, 
                            -lcurly-, -rcurly-, -lsquare-, -rsquare-, -dblquote-, -mathpi-, -mathe-
    * srcfields: list of Field objects comprising records
    * srcrecs: list of string source records
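
As a sketch of how the settings dictionary could be assembled: every key and value is a string, as stated above. The exact spelling accepted for booleans ("true" here) and the field titles in pin_fields are assumptions for illustration. The call itself is left as a comment so the block runs without VerityPy installed; the import path follows the namespace name of this page.

```python
settings = {
    "delim": "comma",             # required: comma, pipe, tab, colon, or caret
    "allow_last_empty": "true",   # accept small1 records and fill a default
    "is_quoted": "true",          # values may be wrapped in double quotes
    "has_header": "true",         # first non-empty record holds field titles
    "ignore_empty": "true",       # skip empty records
    "pin_fields": "id|amount",    # hypothetical field titles, pipe-delimited
    "ignore_start_str": "#|//",   # skip comment lines
    "join_char": "-space-",       # alias inserted when joining broken lines
}

# Basic sanity checks before handing the dictionary to the library.
allowed_delims = {"comma", "pipe", "tab", "colon", "caret"}
settings_ok = (
    all(isinstance(k, str) and isinstance(v, str) for k, v in settings.items())
    and settings["delim"] in allowed_delims
)
print(settings_ok)

# from VerityPy.processing import remediateparsing
# newrecs = remediateparsing.fix_record_field_split(settings, srcfields, srcrecs)
```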

returns: list of new records. If error, 0th entry starts with the prefix notok: 
    Otherwise, 0th entry is a string of stats and entry[1] is the header of field titles. For stats, examples may have a prefix of (nline), 
    with nline being the line number read (excluding empty and comment lines), which is therefore 1 larger than the line's index 
    in the Python list (i.e. nline is 1-based while lists are 0-based).
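
A caller might unpack the returned list along these lines. The helper name split_result is hypothetical, the sample list is made up for illustration (the real stats string has its own format), and it is assumed that corrected records start at index 2:

```python
def split_result(result: list) -> tuple:
    """Separate stats, header and records, raising if the call reported an error."""
    if not result:
        raise ValueError("empty result list")
    if result[0].startswith("notok:"):
        raise RuntimeError(result[0])
    stats = result[0]
    header = result[1] if len(result) > 1 else ""
    records = result[2:]   # assumption: records follow the header
    return stats, header, records


# Placeholder data standing in for a real return value.
sample = ["<stats string>", "id,name,score", "1,Smith,3.5", "2,Jones,0"]
stats, header, records = split_result(sample)
print(header, len(records))
```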

Definition at line 351 of file remediateparsing.py.

◆ shift_array_entries()

list shift_array_entries ( list origvals,
list recflds,
dict hashrecflds,
list pinflds )
Shifts array of parsed field values to correct having more values than defined fields. 
    Uses algorithm assessing best field and field + 1 (index positions in record) to join 
    based on datatypes, formats and known patterns of root causes of this error.

    * origvals: array of parsed field values
    * recflds: list of Field objects containing datatype and format specifications
    * hashrecflds: Dictionary of field title lowercase to its array index
    * pinflds: optional list of field titles that are pinned meaning their position cannot be changed
    
returns: new array of parsed field values. If error, 0th entry starts with notok:
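
To make the idea of joining a field with field + 1 concrete, here is a deliberately simplified stand-in. It is not the library's algorithm: it scores each candidate join only by how many values match a datatype name ("int", "real", anything else treated as string), ignores formats and pinned fields, and uses hypothetical function names.

```python
def matches(value: str, datatype: str) -> bool:
    """True if the value can be read as the given datatype (strings always match)."""
    v = value.strip()
    try:
        if datatype == "int":
            int(v)
        elif datatype == "real":
            float(v)
    except ValueError:
        return False
    return True


def merge_extra_values(values: list, datatypes: list, delim: str = ",") -> list:
    """Join adjacent values until the count equals the number of fields."""
    if not datatypes:
        raise ValueError("at least one field is required")
    vals = list(values)
    while len(vals) > len(datatypes):
        best, best_score = None, -1
        for i in range(len(vals) - 1):
            # Candidate: rejoin positions i and i + 1 with the delimiter.
            cand = vals[:i] + [vals[i] + delim + vals[i + 1]] + vals[i + 2:]
            score = sum(matches(v, dt) for v, dt in zip(cand, datatypes))
            if score > best_score:
                best, best_score = cand, score
        vals = best
    return vals


merged = merge_extra_values(["12", "Smith", " John", "3.5"], ["int", "string", "real"])
print(merged)
```

In this example only the join at the name field leaves every value consistent with its datatype, so that candidate is chosen.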

Definition at line 24 of file remediateparsing.py.

Variable Documentation

◆ DQ

str DQ = "\""

Definition at line 22 of file remediateparsing.py.