VerityPy 1.1
Python library for Verity data profiling, quality control, remediation
Functions | |
str | convert_char_aliases (str strin) |
str | extract_char_aliases (str strin, list ignore) |
bool | is_math_alias (str valnum) |
str | get_math_alias (str valnum) |
str | delim_get_char (str delimin) |
str | convert_special_notation (str strin) |
list | split_quoted_line (str line_in, str delim) |
str | detect_datatype (str strin) |
list | assign_datatype_to_fields (list datatype_dist_fields, dict settings) |
str | assign_datatype (dict datatype_dist, dict settings) |
bool | is_field_its_datatype (str dtype, str fieldval, str datefmt="") |
str | is_field_its_format (str fieldval, field.Field fld, bool allow_empty=False) |
Variables | |
str | DQ = "\"" |
str | LF = "\n" |
str | LCURLY = "{" |
str | RCURLY = "}" |
str | LSQUARE = "[" |
str | RSQUARE = "]" |
str | COMMA = "," |
dict | char_aliases |
dict | char_aliases_reverse |
Record Functions
Various worker functions to process data records.
char_aliases is a dictionary that maps each alias to its actual token, which allows disruptive characters to be specified in transforms. For example, -lsquare- is the alias for the left square bracket [, a character that has special meaning in Python and in the coded data strings used within the classes, and therefore cannot be included within a simple string value. In this case, use -lsquare-, which is interpreted and replaced during processing.
str assign_datatype (dict datatype_dist, dict settings)
Uses the distribution of detected datatypes for a field to determine the most likely datatype to assign to it. This uses threshold settings and knowledge from curated data sets across multiple domains and data systems.
datatype_dist: dict with keys [string, int, real, date, bool, empty], where each value is the number of instances. This should come from the results of analyzequality.do_qualityinspect().
settings: dict with keys for various settings, including:
- include_empty: bool, whether to include the number of empty values in the statistical calculation. Default is True.
- minfrac: real number, minimum threshold given either as a percentage (any value greater than 1) or as a fraction (0-1). Default is 0.75.
returns: string with the datatype (string, int, real, date, bool), or empty if it cannot be determined. Starts with notok: if an error occurs.
Definition at line 487 of file recfuncs.py.
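The threshold idea described above can be illustrated with a minimal sketch. It is not the code in recfuncs.py: pick_datatype is a hypothetical helper, it ignores the curated-knowledge rules the library mentions, and it assumes only that minfrac values above 1 are percentages and that include_empty adds the empty count to the denominator.

```python
def pick_datatype(datatype_dist, include_empty=True, minfrac=0.75):
    """Hypothetical helper: dominant datatype, or "" if none reaches minfrac."""
    # Values above 1 are read as percentages, following the minfrac description.
    threshold = minfrac / 100.0 if minfrac > 1 else minfrac
    candidates = ("string", "int", "real", "date", "bool")
    counts = {name: datatype_dist.get(name, 0) for name in candidates}
    total = sum(counts.values())
    if include_empty:
        # Assumption: empty values only enlarge the denominator.
        total += datatype_dist.get("empty", 0)
    if total == 0:
        return ""
    best = max(candidates, key=lambda name: counts[name])
    return best if counts[best] / total >= threshold else ""


dist = {"string": 2, "int": 90, "real": 3, "date": 0, "bool": 0, "empty": 5}
print(pick_datatype(dist))              # int (90 of 100 values)
print(pick_datatype(dist, minfrac=95))  # prints an empty line: 0.90 < 0.95
```

With include_empty set to False the denominator in this example drops from 100 to 95, so the same counts give a higher fraction for int.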
list assign_datatype_to_fields (list datatype_dist_fields, dict settings)
Uses a list of distributions of detected datatypes, one per field, to determine the most likely datatype to assign to each field. This uses threshold settings and knowledge from curated data sets across multiple domains and data systems.
datatype_dist_fields: list of dict objects, one per field, each with keys [string, int, real, date, bool, empty], where each value is the number of instances. This should come from the results of analyzequality.do_qualityinspect().
settings: dict with keys for various settings, including:
- include_empty: bool, whether to include the number of empty values in the statistical calculation. Default is True.
- minfrac: real number, minimum threshold given either as a percentage (any value greater than 1) or as a fraction (0-1). Default is 0.75.
returns: list of strings with the datatype (string, int, real, date, bool) per field, or empty where it cannot be determined. The 0th entry starts with notok: if an error occurs.
Definition at line 445 of file recfuncs.py.
str convert_char_aliases (str strin)
Finds character aliases in a string and converts them to their characters, for example -comma- to a comma (,). Returns the new string. Starts with notok: if an error occurs.
Definition at line 73 of file recfuncs.py.
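As an illustration of alias conversion, the sketch below replaces a few aliases with their characters. ALIAS_TO_CHAR and replace_aliases are hypothetical names, and the table is only a subset of the aliases listed on this page; the actual char_aliases dictionary may hold more entries or be processed differently.

```python
# Hypothetical subset of an alias table; not the library's char_aliases dict.
ALIAS_TO_CHAR = {
    "-comma-": ",",
    "-pipe-": "|",
    "-tab-": "\t",
    "-lsquare-": "[",
    "-rsquare-": "]",
    "-lcurly-": "{",
    "-rcurly-": "}",
}


def replace_aliases(text):
    """Replace every known alias in text with its character."""
    for alias, char in ALIAS_TO_CHAR.items():
        text = text.replace(alias, char)
    return text


print(replace_aliases("a-comma-b-lsquare-0-rsquare-"))  # a,b[0]
```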
str convert_special_notation (str strin)
Convert VerityX special notation
Converts the VerityX product special notations into their mapped strings. Returns the decoded string, or the original value if not matched.
Notations:
-comma- -> ,
-tab- -> \t
-space- -> " " (a single space)
-pipe- -> |
-bslash- -> \\
-fslash- -> /
-lparen- -> (
-rparen- -> )
-lcurly- -> {
-rcurly- -> }
-lsquare- -> [
-rsquare- -> ]
-mathpi- -> math.pi value
-mathe- -> math.e value
-crlf- -> \r\n
-lf- -> \n
Definition at line 177 of file recfuncs.py.
str delim_get_char (str delimin)
Converts the name of a delimiter into its character.
The delimiter can be given as a word or as the character itself for: comma, pipe, tab, colon, caret, hyphen, which become the characters (, | \t : ^ -). If it is not one of these, the return value is 'false:xxx'.
Definition at line 144 of file recfuncs.py.
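A lookup of this kind could look like the sketch below. DELIM_NAMES and delim_to_char are hypothetical names, and the text placed after false: is an assumption, since this page does not say what the library puts there.

```python
# Hypothetical name-to-character table built from the list above.
DELIM_NAMES = {
    "comma": ",",
    "pipe": "|",
    "tab": "\t",
    "colon": ":",
    "caret": "^",
    "hyphen": "-",
}


def delim_to_char(delim):
    """Return the delimiter character for a name, or pass a known character through."""
    if delim in DELIM_NAMES:
        return DELIM_NAMES[delim]
    if delim in DELIM_NAMES.values():
        return delim
    # Assumption: echo the rejected input after the 'false:' prefix.
    return "false:" + delim


for name in ("pipe", "^", "semicolon"):
    result = delim_to_char(name)
    if result.startswith("false:"):
        print("unsupported delimiter:", name)
    else:
        print(repr(result))
```

Checking for the false: prefix before using the result keeps an unsupported name from being mistaken for a delimiter.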
str detect_datatype (str strin)
Detect Value Datatype
Detects a value's datatype by looking for known patterns of characters and evaluating the likely datatype. Returns the datatype, or a string starting with notok: if an error occurs.
Definition at line 350 of file recfuncs.py.
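Pattern-based detection can be sketched with a few regular expressions. This is only an illustration of the general approach: guess_datatype and its patterns are assumptions (ISO dates only, true/false booleans, an "empty" label for blank input) and do not reproduce the patterns in recfuncs.py.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
_PATTERNS = [
    ("bool", re.compile(r"^(true|false)$", re.IGNORECASE)),
    ("int", re.compile(r"^[+-]?\d+$")),
    ("real", re.compile(r"^[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]


def guess_datatype(value):
    """Return a datatype name for value, falling back to 'string'."""
    value = value.strip()
    if not value:
        return "empty"  # assumption: blank values get their own label
    for name, pattern in _PATTERNS:
        if pattern.match(value):
            return name
    return "string"


for sample in ("42", "3.14", "2024-01-31", "True", "hello"):
    print(sample, "->", guess_datatype(sample))
# 42 -> int
# 3.14 -> real
# 2024-01-31 -> date
# True -> bool
# hello -> string
```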
str extract_char_aliases (str strin, list ignore)
Finds troublesome characters in a string and converts them into aliases.
ignore: list of characters not to convert, such as ["|",","]
Returns the new string. Starts with notok: if an error occurs.
Definition at line 89 of file recfuncs.py.
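The reverse direction, from characters to aliases with an ignore list, might be sketched as follows. CHAR_TO_ALIAS and to_aliases are hypothetical names and the table is a small subset; the library's char_aliases_reverse dictionary is not reproduced here.

```python
# Hypothetical subset of a character-to-alias table.
CHAR_TO_ALIAS = {
    ",": "-comma-",
    "|": "-pipe-",
    "[": "-lsquare-",
    "]": "-rsquare-",
    "{": "-lcurly-",
    "}": "-rcurly-",
}


def to_aliases(text, ignore=()):
    """Replace troublesome characters with aliases, skipping those in ignore."""
    out = []
    # Walk character by character so hyphens inside inserted aliases
    # are never examined again.
    for ch in text:
        if ch in CHAR_TO_ALIAS and ch not in ignore:
            out.append(CHAR_TO_ALIAS[ch])
        else:
            out.append(ch)
    return "".join(out)


print(to_aliases("a,b|c[0]", ignore=["|", ","]))  # a,b|c-lsquare-0-rsquare-
```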
str get_math_alias (str valnum)
Checks if the string is -mathpi- or -mathe-, which are Verity aliases.
If so, returns the Python value math.pi or math.e as a string. Otherwise, returns the original string, unless an error occurs, in which case the result starts with notok:reason
Definition at line 122 of file recfuncs.py.
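The described behaviour can be mimicked in a few lines. MATH_ALIASES and math_alias_to_string are hypothetical names, and the use of str() for formatting is an assumption; the library may format the numbers differently.

```python
import math

# Hypothetical table for the two math aliases named above.
MATH_ALIASES = {"-mathpi-": math.pi, "-mathe-": math.e}


def math_alias_to_string(valnum):
    """Return the constant as a string for a math alias, else the input unchanged."""
    if valnum in MATH_ALIASES:
        return str(MATH_ALIASES[valnum])
    return valnum


print(math_alias_to_string("-mathpi-"))  # 3.141592653589793
print(math_alias_to_string("12.5"))      # 12.5
```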
bool is_field_its_datatype (str dtype, str fieldval, str datefmt="")
IsFieldItsDatatype
Determines if a field's value is in its specified datatype.
dtype: field's defined datatype (int, real, bool, date, string)
fieldval: field value
datefmt: date format if checking for a date
returns: bool
Definition at line 598 of file recfuncs.py.
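A check of this shape can be approximated with Python's built-in conversions. The sketch below is not the library's logic: value_matches_datatype is a hypothetical name, and it assumes datefmt is a datetime.strptime pattern such as "%Y-%m-%d", which may not match the date format notation VerityPy expects.

```python
from datetime import datetime


def value_matches_datatype(dtype, fieldval, datefmt=""):
    """Hypothetical check of a value against a datatype name."""
    try:
        if dtype == "string":
            return True
        if dtype == "int":
            int(fieldval)
            return True
        if dtype == "real":
            float(fieldval)
            return True
        if dtype == "bool":
            return fieldval.strip().lower() in ("true", "false")
        if dtype == "date":
            # Assumption: datefmt uses strptime directives.
            datetime.strptime(fieldval, datefmt)
            return True
    except ValueError:
        # Raised by int(), float() and strptime() on non-conforming input.
        return False
    return False  # unknown datatype name


print(value_matches_datatype("int", "12"))                         # True
print(value_matches_datatype("int", "12.5"))                       # False
print(value_matches_datatype("date", "2024-02-30", "%Y-%m-%d"))    # False
```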
str is_field_its_format (str fieldval, field.Field fld, bool allow_empty=False)
IsFieldItsFormat
Determines if a field value conforms to its defined format (if set).
fieldval: field value to check
fld: Field object
allow_empty: bool, whether empty values (e.g. null) are allowed
returns: string as bool:message, where bool is true or false and message is the reason. If an error occurs, starts with notok:message
Definition at line 640 of file recfuncs.py.
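Because the result is a prefixed string rather than a bool, callers need to split it before branching. The helper below is a hypothetical convenience, not part of the library, and the sample strings are invented for illustration.

```python
def parse_status(result):
    """Split a 'status:message' string into (status, message)."""
    # partition() splits at the first colon only, so colons in the
    # message are preserved.
    status, _, message = result.partition(":")
    return status.strip().lower(), message


for sample in ("true:", "false:value too long", "notok:missing field"):
    status, message = parse_status(sample)
    if status == "notok":
        print("error:", message)
    elif status == "true":
        print("format ok")
    else:
        print("format mismatch:", message)
```

The same split applies to the other functions on this page that signal errors with a notok: prefix.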
bool is_math_alias (str valnum)
Checks if the string is -mathpi- or -mathe-.
Returns bool
Definition at line 108 of file recfuncs.py.
list split_quoted_line (str line_in, str delim)
Decompose quoted record line.
line_in: string data record
delim: name of delimiter (comma, pipe, tab, colon)
Returns list of parsed values. If an error occurs, the 0th entry starts with notok:
Definition at line 247 of file recfuncs.py.
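Splitting a record whose quoted values contain the delimiter can be illustrated with Python's standard csv module, used here as a stand-in technique rather than the parser in recfuncs.py. DELIMS and split_quoted are hypothetical names, and whether the library strips the enclosing quotes the way csv does is an assumption.

```python
import csv

# Hypothetical mapping for the delimiter names listed above.
DELIMS = {"comma": ",", "pipe": "|", "tab": "\t", "colon": ":"}


def split_quoted(line, delim_name):
    """Split one record line, honouring double-quoted values."""
    if delim_name not in DELIMS:
        # Mirrors the page's convention of flagging errors in the 0th entry.
        return ["notok:unknown delimiter " + delim_name]
    reader = csv.reader([line], delimiter=DELIMS[delim_name], quotechar='"')
    return next(reader)


print(split_quoted('1,"Smith, John",NY', "comma"))  # ['1', 'Smith, John', 'NY']
```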
dict char_aliases |
Definition at line 42 of file recfuncs.py.
dict char_aliases_reverse |
Definition at line 59 of file recfuncs.py.
str COMMA = "," |
Definition at line 40 of file recfuncs.py.
str DQ = "\"" |
Definition at line 34 of file recfuncs.py.
str LCURLY = "{" |
Definition at line 36 of file recfuncs.py.
str LF = "\n" |
Definition at line 35 of file recfuncs.py.
str LSQUARE = "[" |
Definition at line 38 of file recfuncs.py.
str RCURLY = "}" |
Definition at line 37 of file recfuncs.py.
str RSQUARE = "]" |
Definition at line 39 of file recfuncs.py.