VerityPy 1.1
Python library for Verity data profiling, quality control, remediation
VerityPy.processing.analyzequality Namespace Reference

Functions

qualityanalysis.QualityAnalysis do_qualityinspect (list fields=None, list covalues=None, list recs=None, dict settings=None)
 
str detect_parsing_error (str linein, list field_values, qualityanalysis.QualityAnalysis report, int nrec)
 
str qc_fields (str linein, list field_values, qualityanalysis.QualityAnalysis report, int nrec)
 

Variables

str DQ = "\""
 
str LF = "\n"
 

Detailed Description

Analyze Quality

Performs deep inspection of data records supplied as a list of strings, 
each of which must be delimited. Specify the delimiter in the settings dict 
passed in the call to the function. Results are returned in a 
QualityAnalysis object.
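
To illustrate what "delimited" records with optional double-quote enclosure look like in practice, here is a minimal sketch that uses only the Python standard library. It is not VerityPy's parser: the helper name split_record and the mapping from delimiter names to characters are assumptions made for this example.

```python
import csv

# Delimiter names as listed for the 'delim' setting; the character mapping is an assumption.
DELIM_CHARS = {"comma": ",", "pipe": "|", "tab": "\t", "colon": ":"}


def split_record(line, delim="comma", is_quoted=False):
    """Hypothetical helper: split one record line into a list of field values."""
    sep = DELIM_CHARS[delim]
    if not is_quoted:
        # Plain split: a delimiter inside double quotes still breaks the value apart
        return line.split(sep)
    # csv.reader honors double quotes, so an enclosed delimiter stays in the value
    return next(csv.reader([line], delimiter=sep, quotechar='"'))


recs = ['id,name,city', '1,"Smith, Jo",Boston']
parsed = [split_record(r, "comma", is_quoted=True) for r in recs]
```

With is_quoted left at False, the second record splits into four values instead of three, which is the kind of field-count mismatch the functions below report.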

Function Documentation

◆ detect_parsing_error()

str detect_parsing_error ( str linein,
list field_values,
qualityanalysis.QualityAnalysis report,
int nrec )
Assesses the parsed record values in the 'field_values' list relative to the number of fields, 
collects the distribution, and notes errors in the report, which is updated in place.

linein: original record line before parsing
field_values: parsed field values in list
report: QualityAnalysis object passed by reference, so it is changed in this function. On entry, its 
    fields property must be a list of field titles. Results are added to this object's
    rec_parse_dist[] and rec_parse_errs[x], where x is one of ('small1', 'small2', 'big', 'small1_recs', 'small2_recs', 'big_recs'), 
    with the _recs entries holding example lines stored as (nline)linein
nrec: integer current record number

Returns: string empty if no problems, notok:message if error, (rec_has_err=true) if parsing error
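
The bookkeeping described above can be sketched as follows. This is an illustration, not the library's code: a plain dict stands in for the QualityAnalysis report, the function names are hypothetical, and the meaning of the keys is assumed ('small1' = one value too few, 'small2' = two or more too few, 'big' = too many), since this page does not define them.

```python
def classify_field_count(n_values, n_fields):
    """Assumed meaning of the error keys; not confirmed by this reference page."""
    diff = n_values - n_fields
    if diff == 0:
        return ""
    if diff == -1:
        return "small1"
    if diff < -1:
        return "small2"
    return "big"


def note_parse_result(report, linein, field_values, nrec):
    """Hypothetical stand-in: tally the value count and record any mismatch."""
    n_values = len(field_values)
    dist = report["rec_parse_dist"]
    dist[n_values] = dist.get(n_values, 0) + 1  # distribution of parsed value counts
    kind = classify_field_count(n_values, len(report["fields"]))
    if not kind:
        return ""
    errs = report["rec_parse_errs"]
    errs[kind] += 1
    errs[kind + "_recs"].append(f"({nrec}){linein}")  # example stored as (nline)linein
    return "(rec_has_err=true)"


report = {
    "fields": ["id", "name", "city"],
    "rec_parse_dist": {},
    "rec_parse_errs": {"small1": 0, "small2": 0, "big": 0,
                       "small1_recs": [], "small2_recs": [], "big_recs": []},
}
lines = ["1,Ann,Boston", "2,Bob", "3", "4,Cy,Reno,extra"]
results = [note_parse_result(report, ln, ln.split(","), i)
           for i, ln in enumerate(lines, start=1)]
```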

Definition at line 496 of file analyzequality.py.

◆ do_qualityinspect()

qualityanalysis.QualityAnalysis do_qualityinspect ( list fields = None,
list covalues = None,
list recs = None,
dict settings = None )
Do Quality Inspect. Performs deep inspection of data records to discover and assess 
a variety of structure, syntax, and semantic problems and inconsistencies. 
Field information can either be supplied with the 'fields' parameter or extracted from a header line in 
records. If using 'fields', datatypes and formats can also be specified per field; these are 
used to detect errors when values do not meet the rules. 
fields: list of field objects with attributes-
    title: field name
    datatype: int, real, bool, date, string. For date, 
        there should be an entry in field.fmt_date specifying the date format otherwise it is set to ISO yyyyMMdd
    fmt_strlen: integer number of characters (>0) if a fixed size is required. Ignored if < 0
    fmt_strcase: (upper, lower, empty)
    fmt_strcut: (front, back, empty). Used in Refining records. Side to cut characters from if the value is longer than the specified fmt_strlen. Default is back.
    fmt_strpad: (front, back, empty). Used in Refining records. Side to add characters to if the value is shorter than the specified fmt_strlen. Default is back.
    fmt_strpadchar: single character or character alias (-space-, -fslash-, -bslash-, -tab-). Used in Refining records. Character to add when a string is too short to meet the specified fmt_strlen. Default is _
    fmt_decimal: number of decimal digits (0-N). Ignored if < 0
    fmt_date: without time part- yyyymmdd, yymmdd, yyyyddmm, yyddmm, mmddyyyy, mmddyy, ddmmyyyy, ddmmyy
        (mmm= month abbreviation like Jan) yyyymmmdd, yyyyddmmm, ddmmmyyyy, ddmmmyy
        (month= full month name like January) yyyymonthdd, yyyyddmonth, ddmonthyyyy, ddmonthyy
        with time part: suffix to above date formats as (T=letter T, S=space)- Thhmmss, Thhmm, Thh, 
            Shhmmss, Shhmm, Shh like mmddyyyyThhmm for 11282024T1407 or 11/28/2024T14:07 or 11-28-2024 14:07
        with time zone: if time zone is required at end of time part add suffix Z like mmddyyyyThhmmZ 11282024T1407
covalues: optional list of field titles (2-3) for joint value analysis with each list entry as field1,field2 and optionally with ,field3
recs: list of records. The delimiter should be specified in the settings object ('delim'= comma,pipe,tab,colon)
settings: dictionary object with entries for options to use in inspection. Includes:
    delim: record delimiter (comma,pipe,tab,colon). Default is comma.
    is_case_sens: is case sensitive (true,false). Default is false.
    is_quoted: field values may be enclosed (allows delimiter within) by double quotes (true, false). Default is false.
    maxuv: optional. string of integer value that is maximum number of unique values per field 
        to collect. Default is 50 and set to default if supplied value <1 or >1000
    extract_fields: bool whether to read in field titles from header line (first non-comment, non-empty line). 
        Default is False. If True then has_header must also be True, and submitted 'fields' list will only be 
        used to copy its datatype and formatting to the report field object. Thus, you can extract field 
        titles from data set and still define characteristics if desired. If not, ensure 'fields' is empty.
    has_header: bool whether has header line in file. Default is True. Must be True if extract_fields is True
Returns report as a QualityAnalysis class instance.
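
The documented defaults and constraints for the settings entries can be summarized in a short sketch. The function normalize_settings is hypothetical and is not part of VerityPy. Raising ValueError for an unknown delimiter or for extract_fields without has_header is an assumption; this page does not say how the library reacts in those cases.

```python
def normalize_settings(settings=None):
    """Hypothetical helper: apply the documented defaults to a settings dict."""
    s = dict(settings or {})
    out = {}

    delim = str(s.get("delim", "comma")).lower()
    if delim not in ("comma", "pipe", "tab", "colon"):
        raise ValueError(f"unsupported delim: {delim}")  # assumed behavior
    out["delim"] = delim

    # Documented as (true, false); accept either strings or bools here
    out["is_case_sens"] = str(s.get("is_case_sens", "false")).lower() == "true"
    out["is_quoted"] = str(s.get("is_quoted", "false")).lower() == "true"

    # maxuv: default 50, reset to the default if the supplied value is <1 or >1000
    try:
        maxuv = int(s.get("maxuv", 50))
    except (TypeError, ValueError):
        maxuv = 50
    if maxuv < 1 or maxuv > 1000:
        maxuv = 50
    out["maxuv"] = maxuv

    out["extract_fields"] = bool(s.get("extract_fields", False))
    out["has_header"] = bool(s.get("has_header", True))
    if out["extract_fields"] and not out["has_header"]:
        raise ValueError("extract_fields=True requires has_header=True")  # assumed behavior
    return out


defaults = normalize_settings()
custom = normalize_settings({"delim": "pipe", "is_quoted": "true", "maxuv": "200"})
```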

Definition at line 24 of file analyzequality.py.
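
As a rough illustration of the fmt_date names listed above, the sketch below maps a few of them to datetime.strptime patterns and checks a value against one. It is not how VerityPy validates dates: the mapping table and the function matches_fmt_date are assumptions, only purely numeric formats are covered, and stripping '/', '-' and ':' before matching is inferred from the examples given for mmddyyyyThhmm.

```python
from datetime import datetime

# Assumed mapping from a few documented fmt_date names to strptime patterns
STRPTIME_PATTERNS = {
    "yyyymmdd": "%Y%m%d",
    "mmddyyyy": "%m%d%Y",
    "ddmmyyyy": "%d%m%Y",
    "yymmdd": "%y%m%d",
    "mmddyyyyThhmm": "%m%d%YT%H%M",
    "yyyymmddThhmmss": "%Y%m%dT%H%M%S",
}


def matches_fmt_date(value, fmt_date="yyyymmdd"):
    """Hypothetical check: does value fit the named date format?"""
    pattern = STRPTIME_PATTERNS[fmt_date]
    # Drop common separators so 11/28/2024T14:07 is treated like 11282024T1407
    compact = value.replace("/", "").replace("-", "").replace(":", "")
    # For these numeric formats the name and the value have the same length
    if len(compact) != len(fmt_date):
        return False
    try:
        datetime.strptime(compact, pattern)
    except ValueError:
        return False
    return True


ok_compact = matches_fmt_date("11282024T1407", "mmddyyyyThhmm")
ok_separated = matches_fmt_date("11/28/2024T14:07", "mmddyyyyThhmm")
bad_day = matches_fmt_date("20240230", "yyyymmdd")  # February 30 does not exist
```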

◆ qc_fields()

str qc_fields ( str linein,
list field_values,
qualityanalysis.QualityAnalysis report,
int nrec )
Do Quality Control analysis of field values. Report is modified with assessments 
for datatypes, formats, unique values.

linein: original record line before parsing
field_values: parsed field values in list
report: QualityAnalysis object passed by reference, so it is changed in this function. On entry, its 
    fields property must be a list of field titles. Results are added to this object's
    field_uniqvals[], fields[], err_stats{}, field_datatype_dist[], 
    spec_char_dist[], field_quality[], and spec_char_examples[]. Entries in spec_char_examples[] are stored as 
    (nline)[comma delimited field:spchar pairs found in this record]linein, where nline is the number of the line read 
    (excluding empty and comment lines). nline is therefore 1 larger than the line's index 
    in the Python list (i.e. nline is 1-based while lists are 0-based).
nrec: integer current record number

Returns: string empty if no problems, notok:message if error, 
    possibly (rec_has_fmt_err=true) and/or (rec_has_dt_err=true)
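
A simplified sketch of this kind of per-field check is shown below. It is an illustration under stated assumptions rather than the library's logic: a plain dict stands in for the report, fields are dicts with 'title' and 'datatype' keys, the 'datatype' key inside err_stats is made up for the example, only int, real and bool are checked, and empty values are not counted as errors.

```python
import re

INT_RE = re.compile(r"[+-]?\d+")
REAL_RE = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def value_matches_datatype(value, datatype):
    """Assumed rules for this sketch only."""
    if datatype == "int":
        return INT_RE.fullmatch(value) is not None
    if datatype == "real":
        return REAL_RE.fullmatch(value) is not None
    if datatype == "bool":
        return value.lower() in ("true", "false")
    return True  # string (and anything else) is always accepted here


def qc_sketch(report, field_values, maxuv=50):
    """Hypothetical stand-in: collect unique values and flag datatype mismatches."""
    has_dt_err = False
    for field, value in zip(report["fields"], field_values):
        uniq = report["field_uniqvals"].setdefault(field["title"], {})
        # Count known values always; admit new ones only while under the maxuv cap
        if value in uniq or len(uniq) < maxuv:
            uniq[value] = uniq.get(value, 0) + 1
        if value and not value_matches_datatype(value, field["datatype"]):
            has_dt_err = True
            stats = report["err_stats"]
            stats["datatype"] = stats.get("datatype", 0) + 1
    return "(rec_has_dt_err=true)" if has_dt_err else ""


report = {
    "fields": [{"title": "id", "datatype": "int"},
               {"title": "price", "datatype": "real"}],
    "field_uniqvals": {},
    "err_stats": {},
}
first = qc_sketch(report, ["1", "2.50"])
second = qc_sketch(report, ["x", "3.0"])
```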

Definition at line 545 of file analyzequality.py.

Variable Documentation

◆ DQ

str DQ = "\""

Definition at line 21 of file analyzequality.py.

◆ LF

str LF = "\n"

Definition at line 22 of file analyzequality.py.