VerityPy 1.1
Python library for Verity data profiling, quality control, remediation
VerityPy.processing.recfuncs Namespace Reference

Functions

str convert_char_aliases (str strin)
 
str extract_char_aliases (str strin, list ignore)
 
bool is_math_alias (str valnum)
 
str get_math_alias (str valnum)
 
str delim_get_char (str delimin)
 
str convert_special_notation (str strin)
 
list split_quoted_line (str line_in, str delim)
 
str detect_datatype (str strin)
 
list assign_datatype_to_fields (list datatype_dist_fields, dict settings)
 
str assign_datatype (dict datatype_dist, dict settings)
 
bool is_field_its_datatype (str dtype, str fieldval, str datefmt="")
 
str is_field_its_format (str fieldval, field.Field fld, bool allow_empty=False)
 

Variables

str DQ = "\""
 
str LF = "\n"
 
str LCURLY = "{"
 
str RCURLY = "}"
 
str LSQUARE = "["
 
str RSQUARE = "]"
 
str COMMA = ","
 
dict char_aliases
 
dict char_aliases_reverse
 

Detailed Description

Record Functions

Various worker functions to process data records.

char_aliases is a dictionary that maps each alias to its actual character, which allows 
disruptive characters to be specified in transforms. For example, -lsquare- is the left square 
bracket [, which has special meaning in Python and in the coded data strings used within the 
classes and therefore cannot be included within a simple string value. In this case, use 
-lsquare-, which will be interpreted and replaced during processing.
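
As a rough illustration of the alias idea, the sketch below replaces alias tokens with their characters. It does not call VerityPy: ALIASES is a local copy of part of the table listed under Variable Documentation, and replace_aliases is a hypothetical helper, not the library's convert_char_aliases().

```python
# Local copy of part of the alias table; the real one is
# VerityPy.processing.recfuncs.char_aliases.
ALIASES = {
    "-comma-": ",",
    "-space-": " ",
    "-tab-": "\t",
    "-pipe-": "|",
    "-bslash-": "\\",
    "-fslash-": "/",
    "-lparen-": "(",
    "-rparen-": ")",
    "-lcurly-": "{",
    "-rcurly-": "}",
    "-lsquare-": "[",
    "-rsquare-": "]",
    "-dblquote-": "\"",
}


def replace_aliases(text: str) -> str:
    """Hypothetical helper: swap every alias token for its character."""
    for alias, char in ALIASES.items():
        text = text.replace(alias, char)
    return text


print(replace_aliases("items-lsquare-0-rsquare-"))  # items[0]
```

Writing the alias in a transform keeps the bracket out of the coded string until processing time, which is the purpose described above.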

Function Documentation

◆ assign_datatype()

str assign_datatype ( dict datatype_dist,
dict settings )
Uses the distribution of detected datatypes for a field to determine the most likely 
datatype to assign to it. This uses threshold settings and knowledge from 
curated data sets across multiple domains and data systems.

datatype_dist: dict object with keys [string, int, real, date, bool, empty]
    where each value is the number of instances. This should come from the results of analyzequality.do_qualityinspect()
settings: dict with keys for various settings including
    - include_empty: bool whether to include the number of empty values in the statistical calculation. Default is True
    - minfrac: real number minimum threshold as either a percentage (any value greater than 1) or a fraction (0-1). Default is 0.75

returns: string with datatype (string, int, real, date, bool) or empty if it cannot be determined. Starts with notok: if an error occurs

Definition at line 487 of file recfuncs.py.
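
The threshold logic described for assign_datatype() might look roughly like the sketch below. This is an assumption-laden illustration, not the library's code: pick_datatype is a hypothetical name, and the real function also applies knowledge from curated data sets, which is not modelled here. Only the documented settings (include_empty, minfrac) and their defaults are used.

```python
def pick_datatype(dist: dict, settings: dict) -> str:
    """Hypothetical sketch: return the dominant datatype, or "" if none qualifies."""
    include_empty = settings.get("include_empty", True)
    minfrac = float(settings.get("minfrac", 0.75))
    if minfrac > 1:
        # Values greater than 1 are read as percentages, per the docs above.
        minfrac /= 100.0

    types = ("string", "int", "real", "date", "bool")
    counts = {t: int(dist.get(t, 0)) for t in types}
    total = sum(counts.values())
    if include_empty:
        total += int(dist.get("empty", 0))
    if total == 0:
        return ""

    # Most frequent datatype must reach the minimum fraction to be assigned.
    best = max(types, key=lambda t: counts[t])
    return best if counts[best] / total >= minfrac else ""
```

With include_empty left at True, many empty values lower every datatype's share, so a sparse field is less likely to be assigned a datatype.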

◆ assign_datatype_to_fields()

list assign_datatype_to_fields ( list datatype_dist_fields,
dict settings )
Uses a list of distributions of detected datatypes, one per field, to determine the most likely 
datatype to assign to each field. This uses threshold settings and knowledge from 
curated data sets across multiple domains and data systems.

datatype_dist_fields: list of dict objects with keys [string, int, real, date, bool, empty] 
    where each value is the number of instances for that field. 
    This should come from the results of analyzequality.do_qualityinspect()
settings: dict with keys for various settings including
    - include_empty: bool whether to include the number of empty values in the statistical calculation. Default is True
    - minfrac: real number minimum threshold as either a percentage (any value greater than 1) or a fraction (0-1). Default is 0.75

returns: list of strings with the datatype (string, int, real, date, bool) per field (or empty if it cannot be determined). 
    0th entry starts with notok: if an error occurs

Definition at line 445 of file recfuncs.py.

◆ convert_char_aliases()

str convert_char_aliases ( str strin)
Finds character aliases in a string and converts them to their characters, such as -comma- to ,
Returns the new string. Starts with notok: if an error occurs

Definition at line 73 of file recfuncs.py.

◆ convert_special_notation()

str convert_special_notation ( str strin)
Convert VerityX special notation

Converts the VerityX product special notations into their mapped 
strings. Returns decoded string or original value if not matched

Notations:
-comma-    ->  ,
-tab-      ->  \t
-space-    ->  ' ' (a single space)
-pipe-     ->  |
-bslash-   ->  \\
-fslash-   ->  /
-lparen-   ->  (
-rparen-   ->  )
-lcurly-   ->  {
-rcurly-   ->  }
-lsquare-  ->  [
-rsquare-  ->  ]
-mathpi-   ->  math.pi value
-mathe-    ->  math.e value
-crlf-     ->  \r\n
-lf-       ->  \n

Definition at line 177 of file recfuncs.py.

◆ delim_get_char()

str delim_get_char ( str delimin)
Converts name of delim into its character

The delimiter can be given as a word or as the character itself: comma, pipe, tab, colon, 
caret, hyphen become the characters , | \t : ^ - respectively. Any other value returns 'false:xxx'

Definition at line 144 of file recfuncs.py.
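
A minimal sketch of the name-to-character lookup described for delim_get_char() follows. The names DELIMS and delim_to_char are hypothetical, case-insensitive matching is an assumption, and the text after false: is a placeholder because the documentation only shows it as xxx.

```python
# Delimiter names and characters listed in the description above.
DELIMS = {
    "comma": ",",
    "pipe": "|",
    "tab": "\t",
    "colon": ":",
    "caret": "^",
    "hyphen": "-",
}


def delim_to_char(delim: str) -> str:
    """Hypothetical sketch: accept a delimiter name or the character itself."""
    key = delim.lower()  # assumption: names are matched case-insensitively
    if key in DELIMS:
        return DELIMS[key]
    if delim in DELIMS.values():
        return delim
    # Placeholder message; the real text after "false:" is not documented.
    return "false:unrecognized delimiter"
```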

◆ detect_datatype()

str detect_datatype ( str strin)
Detect Value Datatype

Detect a value's data type by looking for known patterns of 
characters and evaluating likely datatype.
Returns the datatype, or a string starting with notok: if an error occurs

Definition at line 350 of file recfuncs.py.
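
Pattern-based detection of the kind described for detect_datatype() could be sketched as below. The actual patterns used by the library are not documented here, so everything in this block is an assumption: guess_datatype is a hypothetical name, the regular expressions are illustrative, only the ISO date layout YYYY-MM-DD is recognized, and returning "empty" for blank input is a guess based on the datatype_dist keys.

```python
import re
from datetime import datetime


def guess_datatype(value: str) -> str:
    """Hypothetical sketch: classify a value by simple character patterns."""
    text = value.strip()
    if not text:
        return "empty"
    if text.lower() in ("true", "false"):
        return "bool"
    if re.fullmatch(r"[+-]?\d+", text):
        return "int"
    # A real number here must contain a decimal point (illustrative rule).
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?", text):
        return "real"
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumed date layout
        return "date"
    except ValueError:
        return "string"
```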

◆ extract_char_aliases()

str extract_char_aliases ( str strin,
list ignore )
Finds and converts troublesome characters into aliases in a string
ignore: list of characters to not extract such as ["|",","]
Returns new string. Starts with notok: if error occurs

Definition at line 89 of file recfuncs.py.

◆ get_math_alias()

str get_math_alias ( str valnum)
Checks if string is -mathpi- or -mathe-

Returns the string form of the Python value math.pi or math.e if the string is -mathpi- or -mathe-, which are Verity aliases. 
Otherwise, returns the original string unless an error occurs, in which case the result starts with notok:reason

Definition at line 122 of file recfuncs.py.
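
The behavior described for the two math alias functions can be approximated with a small lookup. This sketch uses hypothetical names (MATH_ALIASES, is_math_token, resolve_math_token) and omits the notok: error path, whose triggers are not documented.

```python
import math

# Same two entries as in char_aliases: values are strings, not floats.
MATH_ALIASES = {"-mathpi-": str(math.pi), "-mathe-": str(math.e)}


def is_math_token(value: str) -> bool:
    """Hypothetical sketch: True only for -mathpi- or -mathe-."""
    return value in MATH_ALIASES


def resolve_math_token(value: str) -> str:
    """Hypothetical sketch: return the numeric string, or the input unchanged."""
    return MATH_ALIASES.get(value, value)
```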

◆ is_field_its_datatype()

bool is_field_its_datatype ( str dtype,
str fieldval,
str datefmt = "" )
IsFieldItsDatatype
Determines if a field's value is in its specified datatype

    dtype: field's defined datatype (int, real, bool, date, string)
    fieldval: field value
    datefmt: date format if checking for a date
    returns: bool

Definition at line 598 of file recfuncs.py.

◆ is_field_its_format()

str is_field_its_format ( str fieldval,
field.Field fld,
bool allow_empty = False )
IsFieldItsFormat
Determines if field value conforms to its defined format (if set)

    fieldval: field value to check
    fld: Field Object
    allow_empty: bool whether empty values (e.g. null) are allowed
    returns: string as bool:message with bool = (true, false) and message = reason. If an error occurs, starts with notok:message

Definition at line 640 of file recfuncs.py.
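
Because is_field_its_format() returns a string rather than a bool, callers need to split the status from the message. The helper below is a hypothetical sketch of that step, assuming the result always has the documented status:message shape; it does not call VerityPy.

```python
from typing import Tuple


def parse_format_result(result: str) -> Tuple[str, str]:
    """Hypothetical helper: split 'true:msg', 'false:msg' or 'notok:msg'."""
    # partition splits on the first colon only, so the message may contain colons.
    status, _, message = result.partition(":")
    status = status.strip().lower()
    if status not in ("true", "false", "notok"):
        raise ValueError(f"unexpected result string: {result!r}")
    return status, message
```

A caller would typically treat notok as a processing error and false as a data quality finding.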

◆ is_math_alias()

bool is_math_alias ( str valnum)
Checks if string is -mathpi- or -mathe-
Returns bool

Definition at line 108 of file recfuncs.py.

◆ split_quoted_line()

list split_quoted_line ( str line_in,
str delim )
Decompose quoted record line. 
line_in: string data record
delim: name of delimiter (comma, pipe, tab, colon)
Returns list of parsed values. If error, 0th entry starts with notok:

Definition at line 247 of file recfuncs.py.
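
For comparison, quoted-line splitting of the kind described for split_quoted_line() can be done with Python's standard csv module. This is a swapped-in technique, not the library's implementation: split_quoted and DELIM_CHARS are hypothetical names, and the notok: message text is a placeholder.

```python
import csv

# Delimiter names accepted by the description above.
DELIM_CHARS = {"comma": ",", "pipe": "|", "tab": "\t", "colon": ":"}


def split_quoted(line: str, delim: str) -> list:
    """Hypothetical sketch: split one record, keeping quoted delimiters intact."""
    if delim not in DELIM_CHARS:
        return ["notok:unknown delimiter " + delim]
    reader = csv.reader([line], delimiter=DELIM_CHARS[delim], quotechar='"')
    # A single input line yields at most one parsed row.
    return next(reader, [])
```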

Variable Documentation

◆ char_aliases

dict char_aliases
Initial value:
= {"-comma-": ",",
   "-space-": " ",
   "-tab-": "\t",
   "-pipe-": "|",
   "-bslash-": "\\",
   "-fslash-": "/",
   "-lparen-": "(",
   "-rparen-": ")",
   "-lcurly-": "{",
   "-rcurly-": "}",
   "-lsquare-": "[",
   "-rsquare-": "]",
   "-mathpi-": str(math.pi),
   "-mathe-": str(math.e),
   "-dblquote-": "\""
  }

Definition at line 42 of file recfuncs.py.

◆ char_aliases_reverse

dict char_aliases_reverse
Initial value:
= {",": "-comma-",
   "\t": "-tab-",
   "|": "-pipe-",
   "\\": "-bslash-",
   "/": "-fslash-",
   "(": "-lparen-",
   ")": "-rparen-",
   "{": "-lcurly-",
   "}": "-rcurly-",
   "[": "-lsquare-",
   "]": "-rsquare-",
   "\"": "-dblquote-"
  }

Definition at line 59 of file recfuncs.py.

◆ COMMA

str COMMA = ","

Definition at line 40 of file recfuncs.py.

◆ DQ

str DQ = "\""

Definition at line 34 of file recfuncs.py.

◆ LCURLY

str LCURLY = "{"

Definition at line 36 of file recfuncs.py.

◆ LF

str LF = "\n"

Definition at line 35 of file recfuncs.py.

◆ LSQUARE

str LSQUARE = "["

Definition at line 38 of file recfuncs.py.

◆ RCURLY

str RCURLY = "}"

Definition at line 37 of file recfuncs.py.

◆ RSQUARE

str RSQUARE = "]"

Definition at line 39 of file recfuncs.py.