VerityPy 1.1
Python library for Verity data profiling, quality control, remediation
VerityPy.processing.refinedata Namespace Reference

Functions

list do_refine (list outfields, list transforms, dict settings, list lookup_dicts, list srcfields, list srcrecs)
 

Variables

str DQ = "\""
 
str LF = "\n"
 

Detailed Description

Refinery to process analysis and transforms for a full data set.

Function Documentation

◆ do_refine()

list do_refine (list outfields, list transforms, dict settings, list lookup_dicts, list srcfields, list srcrecs)
Refines a set of data records to correct, normalize, and enrich them.

outfields-list of Field objects for the output fields that will be assigned values in each output record. 
    Each Field object can specify datatype and format rules. These rules are applied after the field's transform 
    (if one is defined) or to the original value passed through (if no transform is defined).
    - title: field name
    - datatype: int, real, bool, date, string. For date, a format should be specified (see fmt_date); otherwise it is set to ISO yyyyMMdd
    - fmt_strlen: integer number of characters (>0) if a fixed size is required. Ignored if < 0
    - fmt_strcase: (upper, lower, empty)
    - fmt_strcut: (front, back, empty). Side to cut characters from if the value is longer than the specified fmt_strlen. Default is back.
    - fmt_strpad: (front, back, empty). Side to add characters to if the value is shorter than the specified fmt_strlen. Default is back.
    - fmt_strpadchar: single character or character alias (-space-, -fslash-, -bslash-, -tab-). Character added to lengthen a value that is shorter than the specified fmt_strlen. Default is _
    - fmt_decimal: number of decimal digits (0-N). Ignored if < 0
    - fmt_date:
        + without time part: yyyymmdd, yymmdd, yyyyddmm, yyddmm, mmddyyyy, mmddyy, ddmmyyyy, ddmmyy
        + (mmm= month abbreviation like Jan) yyyymmmdd, yyyyddmmm, ddmmmyyyy, ddmmmyy
        + (month= full month name like January) yyyymonthdd, yyyyddmonth, ddmonthyyyy, ddmonthyy
        + with time part: append one of the suffixes Thhmmss, Thhmm, Thh to a date format above, e.g. mmddyyyyThhmm for 11282024T1407 or 11/28/2024T14:07
        + (S is a space) Shhmmss, Shhmm, Shh, e.g. 11-28-2024 14:07
        + with time zone: if a time zone is required at the end of the time part, add the suffix Z, e.g. mmddyyyyThhmmZ 11282024T1407
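
The string length rules above (fmt_strlen, fmt_strcase, fmt_strcut, fmt_strpad, fmt_strpadchar) can be sketched as follows. This is an illustration only and not VerityPy code: the helper apply_string_format and its parameter names are hypothetical, the order in which case, cut, and pad are applied is an assumption, and a length of 0 is treated as "no fixed size" because the description does not cover that case.

```python
# Illustrative sketch only; not part of VerityPy.
# Aliases listed in the fmt_strpadchar description.
PAD_ALIASES = {"-space-": " ", "-fslash-": "/", "-bslash-": "\\", "-tab-": "\t"}


def apply_string_format(value, strlen=-1, strcase="", strcut="back",
                        strpad="back", strpadchar="_"):
    """Apply case, cut, and pad rules to one string value (hypothetical helper)."""
    if strcase == "upper":
        value = value.upper()
    elif strcase == "lower":
        value = value.lower()

    if strlen <= 0:
        # No fixed size requested (treating 0 like a negative value is an assumption).
        return value

    pad = PAD_ALIASES.get(strpadchar, strpadchar)
    if len(value) > strlen:
        # "front" removes leading characters, anything else removes trailing ones.
        value = value[-strlen:] if strcut == "front" else value[:strlen]
    elif len(value) < strlen:
        fill = pad * (strlen - len(value))
        value = fill + value if strpad == "front" else value + fill
    return value


print(apply_string_format("abc", strlen=5))                   # abc__
print(apply_string_format("abcdefg", strlen=3, strcut="front"))  # efg
```

With the defaults described above (cut and pad at the back, pad character _), a three-character value padded to five characters gains two trailing underscores.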

transforms-list of transform objects

settings-dictionary with various settings. delim is required; the others are optional or have defaults:
    delim- delimiter for parsing records, as comma, tab, pipe, caret, hyphen
    delim_out- optional delimiter for output records, as comma, tab, pipe, caret, hyphen. Default is to use delim.
    is_quoted- bool whether some field values are enclosed in double quotes to allow the delimiter within a field value. Default is True.
    has_header- bool whether the first used line is delimited field titles. Default is True.
    use_comment- bool whether to use comment lines (starting with # or //) or ignore them. Default is False, so they are ignored.
    normalize- bool whether to apply datatype and format rules to each output field value. Default is True. 
        If datatype is int or real and the field value is not numeric, the value is set to 0 for int and 0.00 for real. 
        If datatype is bool and the field value is neither true nor false, the value is set to false. 
        If datatype is date and the field value is not in date format, the value is set to an empty string.
    embed_delim- new character(s) that replace delim when a field value contains delim. Default is a space.
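
As an illustration, a settings dictionary using the keys listed above might look like the sketch below. The key names come from the list; the specific values are examples only, and passing Python bool values (rather than, say, the strings "true" and "false") is an assumption based on the word "bool" in the descriptions.

```python
# Example settings dictionary; values are illustrative, not recommended defaults.
settings = {
    "delim": "comma",       # required: name of the input delimiter
    "delim_out": "pipe",    # optional: falls back to delim when omitted
    "is_quoted": True,      # some values may be wrapped in double quotes
    "has_header": True,     # first used line holds the field titles
    "use_comment": False,   # lines starting with # or // are ignored
    "normalize": True,      # apply datatype and format rules to output values
    "embed_delim": " ",     # replaces delim found inside a field value
}
```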

lookup_dicts-list of LookUpDict objects. These should be made from files or arrays prior to invoking this method

srcfields-list of Field objects, in the order in which fields appear in input records when parsed using the delimiter specified in settings

srcrecs-list of strings, each of which is one input record. Default is to ignore empty lines and those 
    beginning with # or // as comments. This can be overridden with the setting use_comment. 
    If the setting has_header is True (which is the default), then the first used line must be a delimited line of field titles.
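
The line selection described for srcrecs can be sketched as follows. This is not the library's implementation: the helper used_lines is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
# Illustrative sketch only; not part of VerityPy.
def used_lines(srcrecs, use_comment=False):
    """Return the input lines that count as "used" under the described rules."""
    kept = []
    for line in srcrecs:
        if not line.strip():
            continue  # skip empty lines (whitespace-only counted as empty: assumption)
        if not use_comment and line.startswith(("#", "//")):
            continue  # skip comment lines unless use_comment is set
        kept.append(line)
    return kept


recs = ["# exported sample", "id,name", "", "// note", "1,Ann"]
print(used_lines(recs))  # ['id,name', '1,Ann']
```

With has_header left at its default, the first element of the filtered list ("id,name" here) is the line that would be read as field titles.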

RETURNS outrecs as a list of refined data records, including a header of delimited field names. If an error occurs, the 0th entry will start with notok:
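
A caller might check the returned list along these lines. The helper split_outrecs is hypothetical; it assumes the header is the 0th entry of a successful result and treats an empty list as an error, neither of which is stated above.

```python
# Illustrative sketch only; not part of VerityPy.
def split_outrecs(outrecs):
    """Split a do_refine result into (header, records), raising on a notok: result."""
    if not outrecs:
        # The behavior for an empty result is not documented; raising is an assumption.
        raise ValueError("no records returned")
    if outrecs[0].startswith("notok:"):
        raise RuntimeError(outrecs[0])
    return outrecs[0], outrecs[1:]


header, rows = split_outrecs(["id|name", "1|Ann", "2|Bo"])
print(header)  # id|name
print(rows)    # ['1|Ann', '2|Bo']
```

In practice the argument would be the list returned by do_refine(outfields, transforms, settings, lookup_dicts, srcfields, srcrecs).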

Definition at line 21 of file refinedata.py.

Variable Documentation

◆ DQ

str DQ = "\""

Definition at line 17 of file refinedata.py.

◆ LF

str LF = "\n"

Definition at line 18 of file refinedata.py.