VerityDotNet 1.0
C# library for Verity data profiling, quality control, remediation
VerityDotNet.RefineData Class Reference

Refines a data record into a new output record, performing datatyping, formatting, transforms, and error remediation.

Static Public Member Functions

static QualityAnalysis DoRefine (List< Field > outFields, List< Transform > transforms, Dictionary< string, string > settings, List< Field > srcFields, List< string > srcRecs, ref List< string > outRecs)
 Generates output data records from input data records, using transforms to set or modify values and the specified datatypes and formats to normalize the output. Each submitted record produces one output record unless the special remediation algorithm AutoStitch is used to rebuild a good record from multiple partial source records that were broken by improper parsing.
 
static QualityAnalysis DoRefine (List< Field > outFields, List< Transform > transforms, Dictionary< string, string > settings, List< Lookup.LookUpDict > lookupDicts, List< Field > srcFields, List< string > srcRecs, ref List< string > outRecs)
 Generates output data records from input data records, using transforms to set or modify values and the specified datatypes and formats to normalize the output. Each submitted record produces one output record unless the special remediation algorithm AutoStitch is used to rebuild a good record from multiple partial source records that were broken by improper parsing. Lines are right-trimmed, so trailing whitespace is removed.
 

Detailed Description

Refines a data record into a new output record, performing datatyping, formatting, transforms, and error remediation.

Member Function Documentation

◆ DoRefine() [1/2]

static QualityAnalysis VerityDotNet.RefineData.DoRefine ( List< Field > outFields,
List< Transform > transforms,
Dictionary< string, string > settings,
List< Field > srcFields,
List< string > srcRecs,
ref List< string > outRecs )

Generates output data records from input data records, using transforms to set or modify values and the specified datatypes and formats to normalize the output. Each submitted record produces one output record unless the special remediation algorithm AutoStitch is used to rebuild a good record from multiple partial source records that were broken by improper parsing.

Parameters
outFields: list of Field objects for the output fields that will be assigned values in each output record. Each Field can specify datatype and format rules, which are applied after the field's transform (if one is defined) or after the original value is passed through (if no transform is defined).
- title: field name
- datatype: int, real, bool, date, string. For date, there should be an entry specifying the format; otherwise it is set to ISO yyyyMMdd.
- fmt_strlen: integer number of characters (greater than 0) if a fixed size is required. Ignored if less than or equal to 0.
- fmt_strcase: (upper, lower, empty)
- fmt_strcut: (front, back, empty). Side to cut characters from if the value is longer than the specified fmt_strlen. Default is back.
- fmt_strpad: (front, back, empty). Side to add characters to if the value is shorter than the specified fmt_strlen. Default is back.
- fmt_strpadchar: single character or character alias (-space-, -fslash-, -bslash-, -tab-). Character to add if needed to make a too-short string meet the specified fmt_strlen. Default is _
- fmt_decimal: number of decimal digits (0-N). Ignored if less than 0.
- fmt_date:
  without time part: yyyymmdd, yymmdd, yyyyddmm, yyddmm, mmddyyyy, mmddyy, ddmmyyyy, ddmmyy
  with month abbreviation (mmm, like Jan): yyyymmmdd, yyyyddmmm, ddmmmyyyy, ddmmmyy
  with full month name (month, like January): yyyymonthdd, yyyyddmonth, ddmonthyyyy, ddmonthyy
  with time part: add a suffix to the above date formats, either Thhmmss, Thhmm, Thh (like mmddyyyyThhmm for 11282024T1407 or 11/28/2024T14:07) or, where S is a space, Shhmmss, Shhmm, Shh (like 11-28-2024 14:07)
  with time zone: if a time zone is required at the end of the time part, add suffix Z, like mmddyyyyThhmmZ 11282024T1407
transforms: list of Transform objects
settings: dictionary with various settings.
- isCaseSens: bool, whether case sensitive. Default is false.
- isQuoted: bool, whether some field values are enclosed in double quotes to allow the delimiter within the field value. Default is true.
- hasHeader: bool, whether the first used line is a delimited line of field titles. Default is true.
- normalize: bool, whether to apply datatype and format rules to output field values. Default is true. If datatype is int or real and the field value is not numeric, the value is set to 0 for int and 0.00 for real. If datatype is bool and the field value is neither true nor false, the value is set to false. If datatype is date and the field value is not in date format, the value is set to an empty string.
- delim: delimiter for parsing records, as comma, tab, pipe, caret, hyphen
- delimOut: optional delimiter for output records (default is to use delim), as comma, tab, pipe, caret, hyphen
- embedDelim: new character(s) for replacing delim when a field contains delim. Default is a space.
- maxThreads: optional. Default 40. String of an integer value giving the maximum number of threads to use when multi-threading is allowed.
- nRecsPerThreadMin: optional. Default 500 (min is 1). Minimum number of records to send to each thread if using multi-threading.
- nRecsPerThreadMax: optional. Default 100000 (max is 1e6). Maximum number of records to send to each thread if using multi-threading.
- useThreads: bool. Default is false. Multi-threading will be used if the license is active.
- license: optional string of the VerityDotNet license. Must be active to use multi-threading.
- licenseId: required when license is used. Id used to make the license string; it is used to decrypt the license.
- debug: (info, trace, "") to collect log messages
srcFields: list of Field objects, in the order in which fields appear in the input records when parsed using the delimiter specified in settings
srcRecs: list of strings, each of which is one input record
outRecs: list of strings, each of which is one output record. Passed by reference so that changes made in this method are returned to the calling method.
Returns
QualityAnalysis report. If an error occurred, its status starts with notok:
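
The fixed-length string rules listed under outFields (fmt_strlen, fmt_strcut, fmt_strpad, fmt_strpadchar) can be illustrated with a small stand-alone sketch. This is not VerityDotNet code: the class FixedLengthSketch and its Apply method are hypothetical names introduced here, and the sketch assumes that cutting from a side means removing characters from that side.

```csharp
using System;

// Hypothetical helper, NOT part of VerityDotNet. It only mirrors the
// documented fmt_strlen / fmt_strcut / fmt_strpad / fmt_strpadchar rules.
public static class FixedLengthSketch
{
    public static string Apply(string value, int strLen,
        string cut = "back", string pad = "back", char padChar = '_')
    {
        // fmt_strlen <= 0 is ignored: no fixed size is enforced.
        if (strLen <= 0) return value;

        if (value.Length > strLen)
        {
            // Too long: drop characters from the front (keep the tail)
            // or from the back (keep the head). Anything but "front"
            // falls back to the documented default, back.
            return cut == "front"
                ? value.Substring(value.Length - strLen)
                : value.Substring(0, strLen);
        }

        // Too short (or exact length): pad on the chosen side.
        return pad == "front"
            ? value.PadLeft(strLen, padChar)
            : value.PadRight(strLen, padChar);
    }
}
```

Under these assumptions, Apply("ABCDEFGH", 5) gives "ABCDE", Apply("ABCDEFGH", 5, cut: "front") gives "DEFGH", and Apply("AB", 5) gives "AB___". How the library itself orders case conversion, cutting, and padding is not stated in this reference.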

◆ DoRefine() [2/2]

static QualityAnalysis VerityDotNet.RefineData.DoRefine ( List< Field > outFields,
List< Transform > transforms,
Dictionary< string, string > settings,
List< Lookup.LookUpDict > lookupDicts,
List< Field > srcFields,
List< string > srcRecs,
ref List< string > outRecs )

Generates output data records from input data records, using transforms to set or modify values and the specified datatypes and formats to normalize the output. Each submitted record produces one output record unless the special remediation algorithm AutoStitch is used to rebuild a good record from multiple partial source records that were broken by improper parsing. Lines are right-trimmed, so trailing whitespace is removed.

Parameters
outFields: list of Field objects for the output fields that will be assigned values in each output record. Each Field can specify datatype and format rules, which are applied after the field's transform (if one is defined) or after the original value is passed through (if no transform is defined).
- title: field name
- datatype: int, real, bool, date, string. For date, there should be an entry specifying the format; otherwise it is set to ISO yyyyMMdd.
- fmt_strlen: integer number of characters (greater than 0) if a fixed size is required. Ignored if less than or equal to 0.
- fmt_strcase: (upper, lower, empty)
- fmt_strcut: (front, back, empty). Side to cut characters from if the value is longer than the specified fmt_strlen. Default is back.
- fmt_strpad: (front, back, empty). Side to add characters to if the value is shorter than the specified fmt_strlen. Default is back.
- fmt_strpadchar: single character or character alias (-space-, -fslash-, -bslash-, -tab-). Character to add if needed to make a too-short string meet the specified fmt_strlen. Default is _
- fmt_decimal: number of decimal digits (0-N). Ignored if less than 0.
- fmt_date:
  without time part: yyyymmdd, yymmdd, yyyyddmm, yyddmm, mmddyyyy, mmddyy, ddmmyyyy, ddmmyy
  with month abbreviation (mmm, like Jan): yyyymmmdd, yyyyddmmm, ddmmmyyyy, ddmmmyy
  with full month name (month, like January): yyyymonthdd, yyyyddmonth, ddmonthyyyy, ddmonthyy
  with time part: add a suffix to the above date formats, either Thhmmss, Thhmm, Thh (like mmddyyyyThhmm for 11282024T1407 or 11/28/2024T14:07) or, where S is a space, Shhmmss, Shhmm, Shh (like 11-28-2024 14:07)
  with time zone: if a time zone is required at the end of the time part, add suffix Z, like mmddyyyyThhmmZ 11282024T1407
transforms: list of Transform objects
settings: dictionary with various settings.
- isCaseSens: bool, whether case sensitive. Default is false.
- isQuoted: bool, whether some field values are enclosed in double quotes to allow the delimiter within the field value. Default is true.
- hasHeader: bool, whether the first used line is a delimited line of field titles. Default is true.
- normalize: bool, whether to apply datatype and format rules to output field values. Default is true. If datatype is int or real and the field value is not numeric, the value is set to 0 for int and 0.00 for real. If datatype is bool and the field value is neither true nor false, the value is set to false. If datatype is date and the field value is not in date format, the value is set to an empty string.
- delim: delimiter for parsing records, as comma, tab, pipe, caret, hyphen
- delimOut: optional delimiter for output records (default is to use delim), as comma, tab, pipe, caret, hyphen
- embedDelim: new character(s) for replacing delim when a field contains delim. Default is a space.
- maxThreads: optional. Default 40. String of an integer value giving the maximum number of threads to use when multi-threading is allowed.
- nRecsPerThreadMin: optional. Default 500 (min is 1). Minimum number of records to send to each thread if using multi-threading.
- nRecsPerThreadMax: optional. Default 100000 (max is 1e6). Maximum number of records to send to each thread if using multi-threading.
- useThreads: bool. Default is false. Multi-threading will be used if the license is active.
- license: optional string of the VerityDotNet license. Must be active to use multi-threading.
- licenseId: required when license is used. Id used to make the license string; it is used to decrypt the license.
- debug: (info, trace, "") to collect log messages
lookupDicts: list of LookUpDict objects. These should be made from files or arrays before invoking this method.
srcFields: list of Field objects, in the order in which fields appear in the input records when parsed using the delimiter specified in settings
srcRecs: list of strings, each of which is one input record. By default, empty lines and lines beginning with # or // are ignored as comments. This can be overridden with the setting useComments. If the setting hasHeader is true (the default), the first used line must be a delimited line of field titles.
outRecs: list of strings, each of which is one output record. Passed by reference so that changes made in this method are returned to the calling method.
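
The default handling of srcRecs described above (right trim, then skip empty lines and lines starting with # or //) can be approximated by the following stand-alone sketch. It is only an illustration: SourceLineSketch and UsableLines are hypothetical names, not VerityDotNet members, and the sketch assumes the comment markers are tested at the very start of the line, which this reference does not spell out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, NOT part of VerityDotNet. It approximates which
// source lines count as "used" under the documented defaults.
public static class SourceLineSketch
{
    public static List<string> UsableLines(IEnumerable<string> srcRecs)
    {
        var usable = new List<string>();
        foreach (string raw in srcRecs)
        {
            // Right trim only: trailing whitespace goes, leading stays.
            string line = (raw ?? string.Empty).TrimEnd();

            // Empty (or whitespace-only) lines are ignored.
            if (line.Length == 0) continue;

            // Lines beginning with # or // are treated as comments.
            if (line.StartsWith("#", StringComparison.Ordinal) ||
                line.StartsWith("//", StringComparison.Ordinal)) continue;

            usable.Add(line);
        }
        return usable;
    }
}
```

For the input lines "# comment", "", "id,name  ", "// note", "1,Ann", the sketch keeps "id,name" and "1,Ann". With hasHeader left at its default of true, the first kept line ("id,name") would be the line of field titles.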
Returns
QualityAnalysis report. If an error occurred, its status starts with notok:
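
The fallback values that the normalize setting applies to invalid field values can be sketched as below. This is an assumption-laden illustration rather than the library's implementation: NormalizeFallbackSketch and Apply are hypothetical names, .NET's own parsers stand in for whatever checks VerityDotNet performs, and valid values are returned unchanged because the sketch covers only the documented fallbacks.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, NOT part of VerityDotNet. It reproduces only the
// documented fallback values for values that fail their datatype check.
public static class NormalizeFallbackSketch
{
    public static string Apply(string datatype, string value,
        string dateFormat = "yyyyMMdd")
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        switch (datatype)
        {
            case "int":
                // Not numeric: int falls back to 0.
                return double.TryParse(value, NumberStyles.Float, inv, out _)
                    ? value : "0";
            case "real":
                // Not numeric: real falls back to 0.00.
                return double.TryParse(value, NumberStyles.Float, inv, out _)
                    ? value : "0.00";
            case "bool":
                // Neither true nor false: falls back to false.
                return bool.TryParse(value, out _) ? value : "false";
            case "date":
                // Not in the expected date format: falls back to empty string.
                return DateTime.TryParseExact(value, dateFormat, inv,
                    DateTimeStyles.None, out _) ? value : "";
            default:
                // string and unknown datatypes pass through unchanged.
                return value;
        }
    }
}
```

With these assumptions, Apply("int", "abc") gives "0", Apply("real", "x") gives "0.00", Apply("bool", "yes") gives "false", and Apply("date", "2024-11-28") gives an empty string because the value does not match yyyyMMdd.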

The documentation for this class was generated from the following file: