report is VerityPy type=<class 'VerityPy.processing.qualityanalysis.QualityAnalysis'>
VerityPy version= 2024.1.19

Report Overview¶

The table below lists the errors detected in the As-is, Remediated, and final (i.e. normalized and enriched) data sets. Remediation fixed the parsing errors as intended, and the Normalize and Enrich functions then fixed all remaining errors, generating a high quality data set ready for further analytics and AI/ML modeling.

item                                  as-is   remediated   final
number records                        1003    1000         1000
parsing errors-small1                 0       0            0
parsing errors-small2                 5       0            0
parsing errors-big                    0       0            0
number records with errors            20      16           0
number records with datatype errors   12      7            0
number records with format errors     12      10           0
number fields with datatype errors    3       2            0

Quality Factors are computed for each field using VerityPy's expert algorithms, which take into account errors and inconsistencies in structure and semantics across all records and fields. They should always be >= 99 for a high quality data set, and = 100 for a well managed, quality-controlled process feeding trusted Data Science and AI applications. The chart below shows that Verity processing successfully converted low quality source data into high quality final data.

[Chart: Quality Factor scores by field for the As-is and final Verity data sets]
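VerityPy's Quality Factor algorithm itself is not shown in this report. Purely as an illustration of the idea, a naive per-field score that penalizes detected errors against record count could look like the sketch below; the function name and weighting are assumptions, not VerityPy's method.

# Illustrative only: a naive stand-in for VerityPy's expert Quality Factor,
# penalizing datatype and format errors relative to record count.
def naive_quality_factor(num_records, dtype_errors, fmt_errors):
    """Return a 0-100 score; 100 means no detected errors."""
    if num_records == 0:
        return 0.0
    penalty = 100.0 * (dtype_errors + fmt_errors) / num_records
    return max(0.0, 100.0 - penalty)

# AGI field in the As-is data: 10 datatype errors, 3 format errors, 1003 records
print(round(naive_quality_factor(1003, 10, 3), 1))   # 98.7 -- below the >=99 bar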
Field Title Map: As-Is (source) --> Verity (target). Enrichment fields have no source mapping. Unique Values (UV), excluding '-other-', are listed as value(count).

As-Is           Verity           High UV As-Is   Low UV As-Is   High UV Verity   Low UV Verity
y2_statefips    DestStateCode
y1_statefips    OrigStateCode
y1_state        OrigStateAbbr
y1_state_name   OrigStateName
n1              NumReturn
n2              NumExempt
AGI             GrossIncome
                useAGI
                DestStateAbbr
                DestStateName
                isSubTotal

As-Is Errors¶

Parsing (Number of fields in record)¶

The most serious error is the incorrect parsing of many source file lines. All are of type Small2, meaning the records lack 2 or more fields. The next cell shows examples of these records with the line number as a prefix, like [line# 4], and a sketch of how such records can be detected and rejoined follows the examples. Sample records are then listed for each of the three data sets. It is clear that Remediation processing in VerityPy fixed the parsing problem in the As-is source data. Then, using the remediated data as the input data set, the Verity data set is of much higher quality after processing with Verity's Normalize and Enrich functions.

[line# 4]=01,97,AL,AL Total Migration-Same 
[line# 5]=State,46630,95832,1.871804e+6
[line# 6]=01,01,AL,AL 
[line# 7]=Non-
[line# 8]=migrants,1598458,3579600,96406319
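Below is a minimal sketch, not VerityPy's implementation, of how such Small parsing errors can be detected and the wrapped lines rejoined. It assumes a comma-delimited file whose header defines the expected field count, and it does not handle quoted embedded newlines.

import csv

def remediate_wrapped_lines(text, delimiter=","):
    """Rejoin records split across physical lines. A record is complete once
    it has at least as many fields as the header; shorter fragments are
    merged with the following line. Illustrative only -- quoted embedded
    newlines and over-long records need csv-aware handling."""
    lines = text.splitlines()
    expected = len(next(csv.reader([lines[0]], delimiter=delimiter)))
    fixed, buf = [lines[0]], ""
    for line in lines[1:]:
        buf = buf + line if buf else line
        if len(next(csv.reader([buf], delimiter=delimiter))) >= expected:
            fixed.append(buf)
            buf = ""
    return fixed

sample = (
    "y2_statefips,y1_statefips,y1_state,y1_state_name,n1,n2,AGI\n"
    "01,97,AL,AL Total Migration-Same \n"
    "State,46630,95832,1.871804e+6\n"
)
print(remediate_wrapped_lines(sample)[1])
# 01,97,AL,AL Total Migration-Same State,46630,95832,1.871804e+6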
Example header and records from the As-is, Remediated, and Verity data sets

As-is
y2_statefips,y1_statefips,y1_state,y1_state_name,n1,n2,AGI
01,96,AL,AL Total Migration-US and Foreign,33716,67747,1515297
01,97,AL,AL Total Migration-US,32868,x65647,1467218
01,98,AL,"AL Total Migration,Foreign",848,2100,48079
01,97,AL,AL Total Migration-Same 
State,46630,95832,1.871804e+6
01,01,AL,AL 
Non-
migrants,1598458,3579600,96406319
01,13,GA,Georgia,5972,12269,249108
01,12,FL,Florida,4489,8446,1.85502e+5

Remediated
y2_statefips,y1_statefips,y1_state,y1_state_name,n1,n2,AGI
01,96,AL,AL Total Migration-US and Foreign,33716,67747,1515297
01,97,AL,AL Total Migration-US,32868,x65647,1467218
01,98,AL,"AL Total Migration,Foreign",848,2100,48079
01,97,AL,AL Total Migration-Same State,46630,95832,1.871804e+6
01,01,AL,AL Non-migrants,1598458,3579600,96406319
01,13,GA,Georgia,5972,12269,249108
01,12,FL,Florida,4489,8446,1.85502e+5
01,48,TX,Texas,3435,7041,135165
01,47,TN,Tennessee,2608,5304,124203
01,28,MS,Mississippi,2212,4562,90723

Verity
DestStateCode|OrigStateCode|OrigStateAbbr|OrigStateName|NumReturn|NumExempt|GrossIncome|useAGI|DestStateAbbr|DestStateName|isSubTotal
01|96|AL|AL Total Migration-US and Foreign|33716|67747|1515297000.00|true|AL|Alabama|true
01|97|AL|AL Total Migration-US|32868|65647|1467218000.00|true|AL|Alabama|true
01|98|AL|AL Total Migration,Foreign|848|2100|48079000.00|true|AL|Alabama|true
01|97|AL|AL Total Migration-Same State|46630|95832|1871804000.00|true|AL|Alabama|true
01|01|AL|AL Non-migrants|1598458|3579600|96406319000.00|true|AL|Alabama|true
01|13|GA|Georgia|5972|12269|249108000.00|true|AL|Alabama|false
01|12|FL|Florida|4489|8446|185502000.00|true|AL|Alabama|false
01|48|TX|Texas|3435|7041|135165000.00|true|AL|Alabama|false
01|47|TN|Tennessee|2608|5304|124203000.00|true|AL|Alabama|false
01|28|MS|Mississippi|2212|4562|90723000.00|true|AL|Alabama|false
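The actual Normalize and Enrich rules are configured in VerityPy jobs and are not shown here. The sketch below reproduces the transforms visible in the sample: the x1000 AGI rescale suggests the source AGI is in thousands of dollars, which is an assumption, as are the FIPS lookup table and the subtotal rule.

# Simplified stand-in for the configured Normalize/Enrich rules; output field
# order follows the Field Title Map above. FIPS_TO_STATE and the subtotal and
# AGI-in-thousands rules are assumptions inferred from the sample output.
FIPS_TO_STATE = {"01": ("AL", "Alabama"), "12": ("FL", "Florida"),
                 "13": ("GA", "Georgia")}   # abbreviated lookup table

def normalize_and_enrich(rec):
    dest, orig, abbr, name, n1, n2, agi = rec
    gross = float(agi) * 1000                    # source AGI in $1000s (assumed)
    dest_abbr, dest_name = FIPS_TO_STATE.get(dest, ("", ""))
    is_subtotal = orig in ("96", "97", "98") or orig == dest
    return [dest, orig, abbr, name, n1, n2, f"{gross:.2f}",
            "true",                              # useAGI -- constant in this sample
            dest_abbr, dest_name, str(is_subtotal).lower()]

row = ["01", "12", "FL", "Florida", "4489", "8446", "1.85502e+5"]
print("|".join(normalize_and_enrich(row)))
# 01|12|FL|Florida|4489|8446|185502000.00|true|AL|Alabama|false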

DataType and Format Errors¶

Inconsistent value datatypes and formats are a problem that should be removed using normalizing functions. Different applications read and process values in a manner often unknown to end users and can automatically modify values to fit their expected types; this is increasingly common with the addition of AI functions. One very common example is Microsoft Excel spreadsheet import and export modifying strings of numeric digits into integers and removing leading 0's, thereby corrupting the true value if it was a code used in accounting, ERPs, etc. Similarly, Excel modifies long numbers into exponential format, again corrupting the number for many downstream applications. Other data applications do the same.
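The leading-zero corruption is easy to reproduce in a couple of lines: once a code such as FIPS 01 is coerced to an integer, the zero is unrecoverable, which is why code fields must be read and stored as strings.

# Reproducing the corruption described above: numeric auto-typing of code strings.
fips = "01"                    # a state FIPS code -- a code, not a quantity
as_int = int(fips)             # Excel-style coercion to integer
print(as_int)                  # 1 -- the leading zero is gone for good
print(str(as_int) == fips)     # False: the true code cannot be recovered

big = 96406319000.00           # long numbers may be re-rendered in exponent form
print(f"{big:e}")              # 9.640632e+10 -- fine for math, wrong for a code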

field           #Errs DType   #Errs Fmt   DType Reasons               Fmt Reasons
y2_statefips    0             4                                       string incorrect length(4)
y1_statefips    0             3                                       string incorrect length(3)
y1_state        0             8                                       string incorrect length(3) | string not uppercase(5)
y1_state_name   0             0
n1              5             0           empty(5)
n2              7             0           non-numeric(2) | empty(5)
AGI             10            3           empty(5) | non-numeric(5)   real incorrect decimals(3)
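As a sketch of how such per-field reason tallies can be produced: the reason labels mirror the table above, but the rules and function here are illustrative, not VerityPy's API.

from collections import Counter

def check_field(value, length=None, uppercase=False, numeric=False):
    """Return a list of reason labels for one value, or [] if it passes."""
    reasons = []
    if value == "":
        reasons.append("empty")
    elif numeric and not value.lstrip("-").replace(".", "", 1).isdigit():
        # naive numeric test; a real checker would also accept exponent forms
        reasons.append("non-numeric")
    if length is not None and value != "" and len(value) != length:
        reasons.append("string incorrect length")
    if uppercase and value != "" and not value.isupper():
        reasons.append("string not uppercase")
    return reasons

counts = Counter()
for v in ["AL", "ny", "il", "FLX"]:          # y1_state samples
    counts.update(check_field(v, length=2, uppercase=True))
print(dict(counts))  # {'string not uppercase': 2, 'string incorrect length': 1}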

Datatype Error Examples

source file line# 2
    field=n2, reason=non-numeric, value=x65647
source file line# 4
    field=n1, reason=empty, value=-empty-
    field=n2, reason=empty, value=-empty-
    field=agi, reason=empty, value=-empty-
source file line# 5
    field=n1, reason=empty, value=-empty-
    field=n2, reason=empty, value=-empty-
    field=agi, reason=empty, value=-empty-
source file line# 6
    field=n1, reason=empty, value=-empty-
    field=n2, reason=empty, value=-empty-
    field=agi, reason=empty, value=-empty-
source file line# 7
    field=n1, reason=empty, value=-empty-
    field=n2, reason=empty, value=-empty-
    field=agi, reason=empty, value=-empty-
source file line# 8
    field=n1, reason=empty, value=-empty-
    field=n2, reason=empty, value=-empty-
    field=agi, reason=empty, value=-empty-
source file line# 17
    field=n2, reason=non-numeric, value=$2100
source file line# 18
    field=agi, reason=non-numeric, value=$48796
source file line# 24
    field=agi, reason=non-numeric, value=27082ê
source file line# 25
    field=agi, reason=non-numeric, value=24958‼
source file line# 26
    field=agi, reason=non-numeric, value=20884§
source file line# 27
    field=agi, reason=non-numeric, value=21821►

Format Error Examples

source file line# 5
    field=y2_statefips, reason=string incorrect length, value=State
    field=y1_statefips, reason=string incorrect length, value=46630
    field=y1_state, reason=string incorrect length, value=95832
source file line# 7
    field=y2_statefips, reason=string incorrect length, value=Non-
    field=y1_statefips, reason=string incorrect length, value=-empty-
    field=y1_state, reason=string incorrect length, value=-empty-
source file line# 8
    field=y2_statefips, reason=string incorrect length, value=migrants
    field=y1_statefips, reason=string incorrect length, value=1598458
    field=y1_state, reason=string incorrect length, value=3579600
source file line# 10
    field=agi, reason=real incorrect decimals, value=1.85502e+5
source file line# 18
    field=y2_statefips, reason=string incorrect length, value=1
source file line# 19
    field=y1_state, reason=string not uppercase, value=ny
source file line# 20
    field=y1_state, reason=string not uppercase, value=il
source file line# 21
    field=y1_state, reason=string not uppercase, value=oh
source file line# 22
    field=y1_state, reason=string not uppercase, value=sc
source file line# 23
    field=y1_state, reason=string not uppercase, value=mi
source file line# 36
    field=agi, reason=real incorrect decimals, value=9052.00
source file line# 37
    field=agi, reason=real incorrect decimals, value=8578.00

Field Unique Values¶

The unique value distribution provides a QC examination of each field's conformance to expected semantics and code sets. In addition, checking the lowest occurring values provides a rapid way to detect serious problems such as parsing errors and special characters (e.g. from multi-language encodings). The value '-other-' is an artificial value used to group actual values when there are too many distinct unique values to track; a sketch of this grouping follows the field list. Values are listed below as value(count).

Field: y2_statefips
Field: y1_statefips
Field: y1_state
Field: y1_state_name
Field: n1
Field: n2
Field: AGI
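A minimal sketch of the '-other-' grouping described above, assuming only the top N distinct values are tracked and the remainder is pooled:

from collections import Counter

def unique_value_profile(values, max_tracked=3):
    """Track the most common distinct values; pool the rest under '-other-'."""
    counts = Counter(values)
    top = counts.most_common(max_tracked)
    other = sum(counts.values()) - sum(c for _, c in top)
    profile = [f"{v}({c})" for v, c in top]
    if other:
        profile.append(f"-other-({other})")
    return profile

states = ["AL", "AL", "GA", "FL", "TX", "TN", "MS", "AL"]
print(unique_value_profile(states))   # ['AL(3)', 'GA(1)', 'FL(1)', '-other-(3)']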

CoValues¶

CoValues provide important visibility into semantically linked field values and are key to uncovering systemic errors that often arise from unknown upstream changes in data rules, schemas, and codes. This example defined one CoValue; a sketch of the tally follows.

CoValue Fields: y1_state,y1_state_name
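A sketch of a CoValue tally for this pair: joint occurrences are counted so that mismatched combinations (e.g. an abbreviation paired with a misspelled state name) surface at the low end. The records here are illustrative.

from collections import Counter

# Illustrative records for the defined pair (y1_state, y1_state_name); the
# misspelled pair is synthetic, added to show how outliers surface.
pairs = [("AL", "Alabama"), ("GA", "Georgia"), ("FL", "Florida"),
         ("AL", "Alabama"), ("GA", "Gorgia")]
covalues = Counter(pairs)
for (abbr, name), n in covalues.most_common():
    print(f"{abbr},{name}({n})")
# AL,Alabama(2)  GA,Georgia(1)  FL,Florida(1)  GA,Gorgia(1)  <- low-end outlier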

Special Characters¶

Special Characters are ASCII characters intended for drawing symbols or legacy system codes, plus high Unicode sets commonly used in multi-language applications. They may not indicate errors when used for valid reasons, but they can cause problems when they look acceptable in a User Interface, resembling an ASCII character, yet fail code and database rules that specifically expect only ASCII characters and words. VerityPy detects special characters, with examples listed below and a detection sketch at the end of this section.

Special Characters Detected

SpecChar       Count
ascii_234      1
unicode_8252   1
ascii_167      1
unicode_9658   1

Special Character Examples

Line# Field:SpecChar Record
24 AGI:ascii_234 01,21,KY,Kentucky,574,1213,27082ê
25 AGI:unicode_8252 01,08,CO,Colorado,473,932,24958‼
26 AGI:ascii_167 01,29,MO,Missouri,459,901,20884§
27 AGI:unicode_9658 01,18,IN,Indiana,457,895,21821►
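Finally, a minimal sketch of the kind of scan that produces the labels above, classifying characters outside printable ASCII as ascii_<code> (up to 255) or unicode_<code> (above 255):

def special_chars(record):
    """Return (label, char) for each character outside printable ASCII,
    using the same ascii_<code>/unicode_<code> labels as the report."""
    found = []
    for ch in record:
        code = ord(ch)
        if code < 32 or code > 126:               # outside printable ASCII
            prefix = "ascii" if code <= 255 else "unicode"
            found.append((f"{prefix}_{code}", ch))
    return found

print(special_chars("01,08,CO,Colorado,473,932,24958\u203c"))
# [('unicode_8252', '‼')]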