Objects#

Summary#

Python objects are used in several ways in this library. There are object classes to define the portfolio of transform functions organized by categories, to load and use lookup dictionaries to encode/decode values, to define transforms and their multiple operations per source and enrichment field, and to collate the settings and results from analyzing data records.

Optypes Module#

OpCat#

Description#

Category for transform operations grouped by either type of action performed or datatype it is applied to. Categories include: assignment, conditional, numeric, text, date.

Members#

  • category: string title of this category

  • funcs: list of OpFunc objects for this category

Constructor#

In module optypes.py: OpCat(category:str): category title is required

OpFunc#

Description#

Function definition for defining transform operations. This is intended to be used for the specification of available transform operations.

Members#

  • title: string name of function such as ifEq

  • desc: string description

  • param1req: bool true/false whether param1 is required to be set

  • param1typ: string datatype of param1 if it is restricted to one type

  • param2req: bool true/false whether param2 is required to be set

  • param2typ: string datatype of param2 if it is restricted to one type

  • param3req: bool true/false whether param3 is required to be set

  • param3typ: string datatype of param3 if it is restricted to one type

Constructor#

In module optypes.py: OpFunc(title:str): title is required

Transform Module#

Transform#

Description#

Transform object for modifying source or enrichment field value. A transform contains a sequence of operations that may include conditional testing of values and referenced fields. It operates on the field whose name is the title of the transform. This field may be a source data field or an enrichment field added to the output record.

Members#

  • title: string name of field transform operates on

  • ops: list of Op objects

Constructor#

In module transform.py: Transform(title:str): title is required

Methods#

get_json(add_lf:bool, add_quote:bool)

Get JSON string of transform properties.

  • add_lf: whether to add line feed at ends of each JSON property (default false)

  • add_quote: whether to enclose keys and values in double quotes (default false)

Op#

Description#

Operation (Op) for transforms. Each Op has a title which is the name of the function. It specifies the function and its category along with several possible parameters (param1, param2, param3) that depend on the specific function.

Members#

  • title: string name of function to be performed

  • category: string name of function category

  • param1: string parameter value which be unused, optional or required depending on specific function

  • param2: string parameter value which be unused, optional or required depending on specific function

  • param3: string parameter value which be unused, optional or required depending on specific function

  • order: integer value to order execution of operations for this transform. Default is -1 with system assigning values at runtime after sorting values which can be discontinuous

  • p1list: list for internal use only

  • p2list: list for internal use only

  • p3list: list for internal use only

Constructor#

In module transform.py: Op(title:str, param1:str=””, param2:str=””, param3:str=””, order:int=-1)

  • title is required

  • others optional

LookUp Module#

Lookup dictionary for transform processing. Transforms have an operation (Op) that allows assigning a value based on looking up the current value in a dictionary. 1, 2, 3 keys are allowed with the replacement value coming from the following column. This Op is in transform_types for category=assignment, function=lookup.

The description of this function and how it uses the lookup is:

Assigns a value from a lookup list based on matching values to keys where keys can use wildcards. The match can be made to one, two, or three source fields in the record with field 1 always the current value while fields 2 and 3 are optional if set in Param2. Leave Param2 empty to use only 1 field as the match value. All selected fields must match their respective conditions for a lookup result to be assigned. The conditions and the result to assign are in an object for each list entry with properties: key1, key2, key3, value defined as example {‘key1’:’top*’,’key2’:’*blue*’,’key3’:’*left’,’value’:’Orange’}. Conditions can use front and/or back wildcard (*) like top*, *night, *state* to allow token matching. To use multiple conditions for the same replacement value ( OR condition ), enter them as separate list entries. To use AND and NOT conditions, use special notations as delimiters: top*-and-*night-not-*blue* which means a match requires both top* and *night be true as well as no instances of the token blue.

  • Param1: title of list that has been pre-loaded into array of lists as part of initialization.

  • Param2: Fields 2 and 3 both of which are optional and if both supplied use pipe to delimit as with color|position. For this example, current value must start with top (key1 condition), the field color must contain blue, and the field position must end with left. All of these must be true for a match in which case the value of Orange is assigned.

LookUpDict#

Description#

Dictionary with keys (either 1,2,3 field values) mapped to replacement value. The keys can use wild cards and also special notations for AND and NOT conditions. In practice, this is made with a convenience method make_lookup_from_file such as:

lookup_dict= lookup.make_lookup_from_file(title, file_uri, "pipe", iscasesens, nkeys)

Members#

  • title: string used in function ‘lookup’ to select which LookUpDict to use

  • is_case_sens: if false then all text changed to lowercase

  • num_keys: integer number of keys used 1-3

  • delim: delimiter (comma, pipe, tab, colon)

  • fields: list of field titles which must correspond to columns in data set

  • recs:list of LookUpRec objects

Constructor#

In module lookup.py: LookUpDict()

LookUpRec#

Description#

Record within a lookup dictionary. Each lookup record has 1-3 keys with matching conditions. Each key value can use wildcards at front and/or back. In addition, each key can have a combined condition using more than one token joined using special notation for boolean AND and NOT conditions. For example, a key might be top*-and-*food*-not-*juice which means the field value being checked must statisfy starting with ‘top’, containing ‘food’, and not ending with ‘juice’. Therefore, each key is parsed into lists that are aligned: AND tokens and for each corresponding front_wild and back_wild lists. Similarly, NOT tokens and wild lists. After parsing, key1_and is a list of tokens minus front and back wildcards if they were supplied, and if so, they are in correlated lists key1_and_front_wild and key1_and_back_wild.

Members#

  • key1: single or combined value(s) to check

  • key2: optional. single or combined value(s) to check

  • key3: optional. single or combined value(s) to check

  • key1_and: list of AND conditions for key1 (at least 1). System made.

  • key1_not: list of NOT conditions for key1 (0 or more). System made.

  • key2_and: optional. list of AND conditions for key2 (at least 1 if used). System made.

  • key2_not: optional. list of NOT conditions for key2 (0 or more). System made.

  • key3_and: optional. list of AND conditions for key3 (at least 1 if used). System made.

  • key3_not: optional. list of NOT conditions for key3 (0 or more). System made.

  • key1_and_front_wild: bool for key1_and entry if it had front wildcard *. System made.

  • key1_and_back_wild: bool for key1_and entry if it had back wildcard *. System made.

  • key1_not_front_wild: bool for key1_not entry if it had front wildcard *. System made.

  • key1_not_back_wild: bool for key1_not entry if it had back wildcard *. System made.

  • key2_and_front_wild: bool for key2_and entry if it had front wildcard *. System made.

  • key2_and_back_wild: bool for key2_and entry if it had back wildcard *. System made.

  • key2_not_front_wild: bool for key2_not entry if it had front wildcard *. System made.

  • key2_not_back_wild: bool for key2_not entry if it had back wildcard *. System made.

  • key3_and_front_wild: bool for key3_and entry if it had front wildcard *. System made.

  • key3_and_back_wild: bool for key3_and entry if it had back wildcard *. System made.

  • key3_not_front_wild: bool for key3_not entry if it had front wildcard *. System made.

  • key3_not_back_wild: bool for key3_not entry if it had back wildcard *. System made.

  • result: string final value. System made.

Constructor#

In module lookup.py: LookUpRec()

QualityAnalysis Module#

QualityAnalysis#

Description#

Packaging object for both settings used and results from an analysis of data records.

Members#

  • title: name for this object usually the name of the job run to make it

  • status: system assigned value which will be notok:reason if there is an error

  • numrecs: integer count of records used

  • maxuv: maximum number of unique values to collect per field with remainder into category ‘-other-’. Default=50

  • is_case_sens: bool whether values are case sensitive. Default=False

  • is_quoted: bool whether source records can have quoted values. Default= False

  • has_header: bool whether source records have header line as first non-empty, non-comment line. Default= True

  • extract_fields: bool whether fields should be extracted from header line instead of supplied in ‘fields’ list. Default= False

  • delim: name of delimiter to parse source records (comma, tab, pipe, colon, caret)

  • delim_char: character for delim used in code

  • fields: list of field objects which have attributes for title, datatype, and formatting

  • hash_fields: dictionary key= field title lower case with value= list index

  • field_uniqvalues: list correlated to fields. Each entry is a descending sorted list of

    uniquevalue tuples with each tuple (uv,count) where uv= string of unique value and count= integer number of instances. A maximum number of values are kept (default=50) with additional grouped into -other-

  • field_quality: list correlated to fields. String of an integer 0-100 as a quality metric computed from discovered field characteristics

  • field_datatype_dist: list of field datatype distributions correlated to fields list. Each field has counts for detected datatypes (int, real, bool, date, string, empty).

  • rec_size_dist: dictionary of record sizes (byte lengths) to counts. Max 100 sizes.

  • rec_parse_errs: dictionary of parsing errors (number fields after parsing relative to defined fields) by type as small1 (1 too few fields), small2 (2 or more missing fields), big (1 or more too many fields). Also, has keys for lists of example records small1_recs, small2_recs, big_recs (each max 50).

  • rec_parse_dist: dictionary of number of parsed fields to count

  • spec_char_dist: dictionary of special characters and their counts.

    Special characters are (some use aliases as dictionary keys): tab, !, doublequote, #, <, >, [, ], backslash, ^, {, }, ~, ascii_[0-31, 127-255], unicode_[256-65535]

  • spec_char_dist_field: list correlated to fields[] with each being a dictionary

    of special characters to their counts for that specific field. Same organization as in spec_char_dist

  • spec_char_examples: list of examples of discovered special characters. Each entry is

    (nline)[sp char list]record with nline being the number line read from source data (excluding empty and comments lines) and is therefore 1 larger than the line’s index in the Python list (i.e. nline is 1 based while lists are 0-based). [sp char list] comma delimited string of each special character found in the record such as [spchar1,spchar2]lineIn. A single field can have more than 1 special character. For example, input line (pipe delimited) as record line #5 (although actual file line number could be larger due to comments and empty lines) and data = = !dog|{House}|123^456 will be stored as an example as (5)[!,{,},^]!dog|{House}|123^456

  • covalues: list of dictionaries of field combinations to collect unique value information. Each entry is for one covalue with keys- field1, field2, field3 (optional), number (number of fields either 2 or 3), title = field1,field2,field3

  • covalue_uniqvalues: correlated to covalues array. Similar to field unique values.

  • err_stats: dictionary of properties and counts.
    • numrecs_err: number records with any kind of error

    • numrecs_err_datatype: number records with datatype error

    • numrecs_err_fmt: number records with format error

    • fields_err_datatype: dictionary of fields with datatype errors and counts

    • fields_err_fmt: dictionary of fields with format errors and counts

  • err_datatype_examples: list of delimited fields within records with datatype errors.

    Syntax is: (nline)[fieldinfo]|[fieldinfo]….. where [fieldinfo] is fieldTitle:reason:fieldValue . fieldValue will be set to -empty- if the actual value is empty. nline is the number line read from source data (excluding empty and comments lines) and is therefore 1 larger than the line’s index in the Python list (i.e. nline is 1 based while lists are 0-based).

  • err_fmt_examples: list of delimited fields within records with format errors.

    Syntax is: (nline)[fieldinfo]|[fieldinfo]….. where [fieldinfo] is fieldTitle:reason:fieldValue . fieldValue will be set to -empty- if the actual value is empty. nline is the number line read from source data (excluding empty and comments lines) and is therefore 1 larger than the line’s index in the Python list (i.e. nline is 1 based while lists are 0-based).

Constructor#

In module qualityanalysis.py: QualityAnalysis()

Methods#

get_json(add_lf:bool=False)

Constructs array of JSON strings for components of this object.

  • add_lf: if True then line feed added at end of each entry. This is unnecessary if returned array is printed as line per entry.

  • Returns string list with first entry starting with ‘notok:’’ if error

Field Module#

Field#

Description#

Field Object containing attributes for title, datatype, formatting

Members#

  • title: field’s title

  • datatype: if used specifies datatype (int, real, date, bool, string)

  • fmt_strcase: for datatype=string, optionally specifies a value is upper or lower case. (upper,lower,””)

  • fmt_strlen: for datatype=string, optionally specifies a required integer length (1-n). If value < 1 then this is ignored.

  • fmt_strcut: for datatype= string when strlen>0 and value length larger then chars removed from either front or back. Default is back (front,back)

  • fmt_strpad: for datatype= string when strlen>0 and value length shorter then chars added to either front or back. Default is back (front,back)

  • fmt_strpadchar: for datatype= string when padding uses this character. Must be 1 character or use one of names (space, fslash, bslash, tab). Default is _

  • fmt_decimal: for datatype=real, optionally specifies a required integer number of decimal places (0-n)

  • fmt_date: for datatype=date, optionally specifies a required date format as one of
    • mmddyy, mmdyy, mdyy, mmddyyyy, mmdyyyy, mdyyyy,

    • ddmmyy, ddmyy, dmyy, ddmmyyyy, ddmyyyy, dmyyyy,

    • yymmdd, yymmd, yymd, yyyymmdd, yyyymmd, yyyymd,

    • yyyyddd (3 digit day number within year),

    • yyyyMMMdd, ddMMMyyyy (MMM = 3 letter month title like ‘JAN’),

    • ‘MONTHdd,yyyy’, ‘ddMONTH,yyyy’, yyyyMONTHdd, ddMONTHyyyy, yyMONTHdd, ddMONTHyy (MONTH full title),

    • *dmmyyyy, mm*dyyyy, *mddyyyy, dd*myyyy (*= can be 1 or 2 characters)

  • mapto: when using this Field as an output (i.e. target) field then this specifies if it is

    mapped to a source field which enables both renaming source fields and adding enrichment fields

  • parse_error_action: Handle empty field values due to parsing errors. Used in Refining Data as:
    • value to assign like ‘NA’ to denote parse error

    • set to ‘-ignore-’ which causes the field value to remain as empty since

      no transform nor normalize will be done

    • set to either ‘-use-’ or ‘’ which causes the empty field value to continue to

      transform and normalization routines. Note the transform function ifEmpty and ifNotEmpty can be used to set field specific values.

Constructor#

In module field.py: Field(title)

Methods#

get_json()

Constructs JSON string for this object.

CoValue#

Description#

CoValue object to define 2 or 3 fields for joint value analysis

Members#

  • title: title which is concantenation of field titles using _ to join them

  • field1: required first field title

  • field2: required second field title

  • field3: optional third field title

  • field1_index: first field’s array index assigned by function

  • field2_index: second field’s array index assigned by function

  • field3_index: third field’s array index assigned by function

  • numfields: number of fields to use either 2 or 3

Constructor#

In module field.py: CoValue(title)

Methods#

get_json()

Constructs JSON string for this object.