VerityDotNet 1.0
C# library for Verity data profiling, quality control, remediation
Refines a data record into a new output record by applying datatyping, formatting, transforms, and error remediation.
Static Public Member Functions
static QualityAnalysis | DoRefine (List< Field > outFields, List< Transform > transforms, Dictionary< string, string > settings, List< Field > srcFields, List< string > srcRecs, ref List< string > outRecs) |
Generates output data records from input data records using transforms to set or modify values, and specified datatypes and formats to normalize output data. The set of records submitted will be processed one per output record unless the special remediation algorithm AutoStitch is used to rebuild a good record from multiple partial source records broken due to improper parsing. | |
static QualityAnalysis | DoRefine (List< Field > outFields, List< Transform > transforms, Dictionary< string, string > settings, List< Lookup.LookUpDict > lookupDicts, List< Field > srcFields, List< string > srcRecs, ref List< string > outRecs) |
Generates output data records from input data records using transforms to set or modify values, and specified datatypes and formats to normalize output data. The set of records submitted will be processed one per output record. This changes if the special remediation algorithm AutoStitch is used to rebuild a good record from multiple partial source records broken due to improper parsing. Lines are right trimmed so ending whitespace is removed. | |
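static QualityAnalysis DoRefine (List< Field > outFields, List< Transform > transforms, Dictionary< string, string > settings, List< Field > srcFields, List< string > srcRecs, ref List< string > outRecs)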
Generates output data records from input data records using transforms to set or modify values, and specified datatypes and formats to normalize output data. The set of records submitted will be processed one per output record unless the special remediation algorithm AutoStitch is used to rebuild a good record from multiple partial source records broken due to improper parsing.
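Because the settings argument is a Dictionary< string, string >, every option is passed as a string, including booleans and thread counts. The following is a minimal sketch of building such a dictionary. The class name SettingsSketch is hypothetical, and the key spellings without a leading hyphen and the lowercase "true"/"false" values are assumptions drawn from the settings list for this method, not confirmed library behavior.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of VerityDotNet.
public static class SettingsSketch
{
    public static Dictionary<string, string> Build()
    {
        // Assumption: keys are the setting names without a leading hyphen,
        // and every value (including bools) is supplied as a string.
        return new Dictionary<string, string>
        {
            ["delim"] = "pipe",       // parse input records on the pipe character
            ["delimOut"] = "comma",   // write output records with commas
            ["hasHeader"] = "true",   // first used line holds the field titles
            ["isQuoted"] = "true",    // double quotes may enclose field values
            ["normalize"] = "true",   // apply datatype and format rules
            ["embedDelim"] = " ",     // replacement for delimiters inside a field
            ["useThreads"] = "false", // multi-threading needs an active license
            ["debug"] = ""            // no log collection
        };
    }
}
```

The result of `SettingsSketch.Build()` would then be passed as the settings argument of DoRefine.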
outFields | List of Field objects for the output fields that will be assigned values in each output record. Each Field can specify datatype and format rules, which are applied after the field's transform (if one is defined) or to the passed-through original value (if no transform is defined):
  - title: field name
  - datatype: int, real, bool, date, string. For date, there should be an entry specifying the format (see fmt_date); otherwise it is set to ISO yyyyMMdd.
  - fmt_strlen: integer number of characters (greater than 0) if a fixed size is required. Ignored if less than or equal to 0.
  - fmt_strcase: upper, lower, or empty
  - fmt_strcut: front, back, or empty. Side to cut characters from if the value is longer than the specified fmt_strlen. Default is back.
  - fmt_strpad: front, back, or empty. Side to add characters to if the value is shorter than the specified fmt_strlen. Default is back.
  - fmt_strpadchar: single character or character alias (-space-, -fslash-, -bslash-, -tab-). Character to add when a string is too short to meet the specified fmt_strlen. Default is _
  - fmt_decimal: number of decimal digits (0 to N). Ignored if less than 0.
  - fmt_date:
    - without time part: yyyymmdd, yymmdd, yyyyddmm, yyddmm, mmddyyyy, mmddyy, ddmmyyyy, ddmmyy
    - with month abbreviation (mmm, like Jan): yyyymmmdd, yyyyddmmm, ddmmmyyyy, ddmmmyy
    - with full month name (month, like January): yyyymonthdd, yyyyddmonth, ddmonthyyyy, ddmonthyy
    - with time part: add one of these suffixes to the date formats above: Thhmmss, Thhmm, Thh (for example mmddyyyyThhmm for 11282024T1407 or 11/28/2024T14:07), or, where S is a space, Shhmmss, Shhmm, Shh (for example 11-28-2024 14:07)
    - with time zone: if a time zone is required at the end of the time part, add suffix Z, like mmddyyyyThhmmZ 11282024T1407 |
transforms | list of transform objects |
settings | Dictionary with various settings:
  - isCaseSens: bool, whether is case sensitive. Default is false.
  - isQuoted: bool, whether some field values are enclosed in double quotes to allow the delimiter within the field value. Default is true.
  - hasHeader: bool, whether the first used line is delimited field titles. Default is true.
  - normalize: bool, whether to apply datatype and format rules to the output field value. Default is true. If datatype is int or real and the field value is not numeric, the value is set to 0 for int and 0.00 for real. If datatype is bool and the field value is neither true nor false, the value is set to false. If datatype is date and the field value is not in date format, the value is set to an empty string.
  - delim: delimiter for parsing records, given as comma, tab, pipe, caret, or hyphen.
  - delimOut: optional delimiter for output records (default is to use delim), given as comma, tab, pipe, caret, or hyphen.
  - embedDelim: new character(s) for replacing delim when a field contains delim. Default is a space.
  - maxThreads: optional. Default 40. String of an integer value that is the maximum number of threads to use when multi-threading is allowed.
  - nRecsPerThreadMin: optional. Default 500 (min is 1). Minimum number of records to send to each thread if using multi-threading.
  - nRecsPerThreadMax: optional. Default 100000 (max is 1e6). Maximum number of records to send to each thread if using multi-threading.
  - useThreads: bool. Default is false. Multi-threading will be used if the license is active.
  - license: optional string of the VerityDotNet license. Must be active to use multi-threading.
  - licenseId: required when license is used. Id used to make the license string; it is used to decrypt the license.
  - debug: info, trace, or "" to collect log messages |
srcFields | list of Field objects, in the order matching the fields of each input record when it is parsed using the delimiter specified in settings |
srcRecs | list of strings each of which is one input record |
outRecs | list of strings each of which is one output record. Passed by reference so changes made in this method are returned to calling method |
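The fixed-length string rules listed for outFields (fmt_strlen, fmt_strcut, fmt_strpad, fmt_strpadchar) can be illustrated with a small standalone sketch. StrLenSketch.Apply is a hypothetical helper written for this page, not a VerityDotNet call, and it only mirrors the behavior as the parameter list describes it; the library's actual handling of edge cases may differ.

```csharp
using System;

// Hypothetical helper, not part of VerityDotNet: mimics the documented
// fmt_strlen / fmt_strcut / fmt_strpad / fmt_strpadchar rules.
public static class StrLenSketch
{
    public static string Apply(string value, int strLen,
        string cut = "back", string pad = "back", char padChar = '_')
    {
        // fmt_strlen <= 0 means no fixed size is enforced.
        if (strLen <= 0 || value.Length == strLen) return value;

        if (value.Length > strLen)
        {
            // Too long: cut characters from the configured side.
            return cut == "front"
                ? value.Substring(value.Length - strLen)
                : value.Substring(0, strLen);
        }

        // Too short: add the pad character on the configured side.
        return pad == "front"
            ? value.PadLeft(strLen, padChar)
            : value.PadRight(strLen, padChar);
    }
}
```

With the documented defaults (cut and pad at the back, pad character _), `StrLenSketch.Apply("abc", 5)` gives `abc__` and `StrLenSketch.Apply("abcdefg", 5)` gives `abcde`.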
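static QualityAnalysis DoRefine (List< Field > outFields, List< Transform > transforms, Dictionary< string, string > settings, List< Lookup.LookUpDict > lookupDicts, List< Field > srcFields, List< string > srcRecs, ref List< string > outRecs)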
Generates output data records from input data records using transforms to set or modify values, and specified datatypes and formats to normalize output data. The set of records submitted will be processed one per output record. This changes if the special remediation algorithm AutoStitch is used to rebuild a good record from multiple partial source records broken due to improper parsing. Lines are right trimmed so ending whitespace is removed.
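When normalization is enabled, the settings documentation for this method states fallback values for fields that do not match their datatype: 0 for int, 0.00 for real, false for bool, and an empty string for date. The sketch below restates those rules as standalone code. NormalizeSketch.Fallback is a hypothetical helper, and the parsing calls it uses (long.TryParse, double.TryParse, bool.TryParse, DateTime.TryParseExact) are assumptions about what counts as a valid value, not the library's own checks.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of VerityDotNet: restates the documented
// fallback values applied when a field value does not match its datatype.
public static class NormalizeSketch
{
    public static string Fallback(string datatype, string value,
        string dateFormat = "yyyyMMdd")
    {
        var inv = CultureInfo.InvariantCulture;
        switch (datatype)
        {
            case "int":   // not numeric -> 0
                return long.TryParse(value, NumberStyles.Integer, inv, out _)
                    ? value : "0";
            case "real":  // not numeric -> 0.00
                return double.TryParse(value, NumberStyles.Float, inv, out _)
                    ? value : "0.00";
            case "bool":  // neither true nor false -> false
                return bool.TryParse(value, out _) ? value : "false";
            case "date":  // not in the expected date format -> empty string
                return DateTime.TryParseExact(value, dateFormat, inv,
                    DateTimeStyles.None, out _) ? value : "";
            default:      // string and unknown datatypes pass through unchanged
                return value;
        }
    }
}
```

For example, `NormalizeSketch.Fallback("int", "abc")` returns `0`, while `NormalizeSketch.Fallback("int", "42")` returns the value unchanged.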
outFields | List of Field objects for the output fields that will be assigned values in each output record. Each Field can specify datatype and format rules, which are applied after the field's transform (if one is defined) or to the passed-through original value (if no transform is defined):
  - title: field name
  - datatype: int, real, bool, date, string. For date, there should be an entry specifying the format (see fmt_date); otherwise it is set to ISO yyyyMMdd.
  - fmt_strlen: integer number of characters (greater than 0) if a fixed size is required. Ignored if less than or equal to 0.
  - fmt_strcase: upper, lower, or empty
  - fmt_strcut: front, back, or empty. Side to cut characters from if the value is longer than the specified fmt_strlen. Default is back.
  - fmt_strpad: front, back, or empty. Side to add characters to if the value is shorter than the specified fmt_strlen. Default is back.
  - fmt_strpadchar: single character or character alias (-space-, -fslash-, -bslash-, -tab-). Character to add when a string is too short to meet the specified fmt_strlen. Default is _
  - fmt_decimal: number of decimal digits (0 to N). Ignored if less than 0.
  - fmt_date:
    - without time part: yyyymmdd, yymmdd, yyyyddmm, yyddmm, mmddyyyy, mmddyy, ddmmyyyy, ddmmyy
    - with month abbreviation (mmm, like Jan): yyyymmmdd, yyyyddmmm, ddmmmyyyy, ddmmmyy
    - with full month name (month, like January): yyyymonthdd, yyyyddmonth, ddmonthyyyy, ddmonthyy
    - with time part: add one of these suffixes to the date formats above: Thhmmss, Thhmm, Thh (for example mmddyyyyThhmm for 11282024T1407 or 11/28/2024T14:07), or, where S is a space, Shhmmss, Shhmm, Shh (for example 11-28-2024 14:07)
    - with time zone: if a time zone is required at the end of the time part, add suffix Z, like mmddyyyyThhmmZ 11282024T1407 |
transforms | list of transform objects |
settings | Dictionary with various settings:
  - isCaseSens: bool, whether is case sensitive. Default is false.
  - isQuoted: bool, whether some field values are enclosed in double quotes to allow the delimiter within the field value. Default is true.
  - hasHeader: bool, whether the first used line is delimited field titles. Default is true.
  - normalize: bool, whether to apply datatype and format rules to the output field value. Default is true. If datatype is int or real and the field value is not numeric, the value is set to 0 for int and 0.00 for real. If datatype is bool and the field value is neither true nor false, the value is set to false. If datatype is date and the field value is not in date format, the value is set to an empty string.
  - delim: delimiter for parsing records, given as comma, tab, pipe, caret, or hyphen.
  - delimOut: optional delimiter for output records (default is to use delim), given as comma, tab, pipe, caret, or hyphen.
  - embedDelim: new character(s) for replacing delim when a field contains delim. Default is a space.
  - maxThreads: optional. Default 40. String of an integer value that is the maximum number of threads to use when multi-threading is allowed.
  - nRecsPerThreadMin: optional. Default 500 (min is 1). Minimum number of records to send to each thread if using multi-threading.
  - nRecsPerThreadMax: optional. Default 100000 (max is 1e6). Maximum number of records to send to each thread if using multi-threading.
  - useThreads: bool. Default is false. Multi-threading will be used if the license is active.
  - license: optional string of the VerityDotNet license. Must be active to use multi-threading.
  - licenseId: required when license is used. Id used to make the license string; it is used to decrypt the license.
  - debug: info, trace, or "" to collect log messages |
lookupDicts | list of LookUpDict objects. These should be made from files or arrays prior to invoking this method |
srcFields | list of Field objects, in the order matching the fields of each input record when it is parsed using the delimiter specified in settings |
srcRecs | list of strings, each of which is one input record. By default, empty lines and lines beginning with # or // are ignored as comments. This can be overridden with the setting useComments. If the setting hasHeader is true (the default), the first used line must be a delimited line of field titles. |
outRecs | list of strings each of which is one output record. Passed by reference so changes made in this method are returned to calling method |
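The default handling of srcRecs described for this overload (right trimming, skipping empty lines, and skipping lines that begin with # or //) can be pictured with the sketch below. SourceRecordSketch.UsedLines is a hypothetical helper, not a library function. Whether the library trims before testing for emptiness, or checks the comment prefix before or after trimming, is an assumption here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of VerityDotNet: shows which source lines
// would count as "used" under the documented default comment handling.
public static class SourceRecordSketch
{
    public static List<string> UsedLines(IEnumerable<string> lines)
    {
        var used = new List<string>();
        foreach (var raw in lines)
        {
            // Right trim only: leading whitespace is kept.
            var line = raw.TrimEnd();

            // Empty lines are ignored.
            if (line.Length == 0) continue;

            // Lines beginning with # or // are treated as comments.
            if (line.StartsWith("#", StringComparison.Ordinal) ||
                line.StartsWith("//", StringComparison.Ordinal)) continue;

            used.Add(line);
        }
        return used;
    }
}
```

With hasHeader left at its default of true, the first element of the returned list would be the line expected to hold the delimited field titles.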