Report: VerityPy QualityAnalysis (`VerityPy.processing.qualityanalysis.QualityAnalysis`), VerityPy version 2024.1.19
Report Overview¶
The table below lists the errors detected in the As-is, remediated, and final (i.e. normalized and enriched) data sets. Remediation fixed the parsing errors as intended, and the Normalize and Enrich functions then fixed all remaining errors, producing a high-quality data set ready for further analytics and AI/ML modeling.
item | as-is | remediated | final |
---|---|---|---|
number records | 1003 | 1000 | 1000 |
parsing errors-small1 | 0 | 0 | 0 |
parsing errors-small2 | 5 | 0 | 0 |
parsing errors-big | 0 | 0 | 0 |
number records with errors | 20 | 16 | 0 |
number records with datatype errors | 12 | 7 | 0 |
number records with format errors | 12 | 10 | 0 |
number fields with datatype errors | 3 | 2 | 0 |
Quality Factors are computed for each field using VerityPy's expert algorithms, which account for errors and inconsistencies in structure and semantics across all records and fields. A high-quality data set should always score >= 99, and a well-managed, quality-controlled process feeding trusted data science and AI applications should score 100. The chart below shows that Verity processing successfully converted low-quality source data into high-quality final data.
Field Title Map: As-Is (source) --> Verity (target). Enrichment fields have no source mapping. Unique values (UV), excluding '-other-', are listed as value(count).
As-Is | Verity | High UV As-Is | Low UV As-Is | High UV Verity | Low UV Verity |
---|---|---|---|---|---|
y2_statefips | DestStateCode | | | | |
y1_statefips | OrigStateCode | | | | |
y1_state | OrigStateAbbr | | | | |
y1_state_name | OrigStateName | | | | |
n1 | NumReturn | | | | |
n2 | NumExempt | | | | |
AGI | GrossIncome | | | | |
 | useAGI | | | | |
 | DestStateAbbr | | | | |
 | DestStateName | | | | |
 | isSubTotal | | | | |
As-Is Errors¶
Parsing (Number of fields in record)¶
The most serious error is the incorrect parsing of many source file lines. They are all of type Small2, meaning the records lack 2 or more fields. The next cell shows examples of these records with the line number as a prefix, like [line# 4]. Sample records from each of the three data sets are listed after this. It is clear that VerityPy's Remediation processing fixed the parsing problem in the As-is source data. Then, using the remediated data as the input data set, the Verity data set is of much higher quality after processing with Verity's Normalize and Enrich functions.
[line# 4]=01,97,AL,AL Total Migration-Same
[line# 5]=State,46630,95832,1.871804e+6
[line# 6]=01,01,AL,AL
[line# 7]=Non-
[line# 8]=migrants,1598458,3579600,96406319
Example header and records from the As-is, Remediated, and Verity data sets:

As-is
```
y2_statefips,y1_statefips,y1_state,y1_state_name,n1,n2,AGI
01,96,AL,AL Total Migration-US and Foreign,33716,67747,1515297
01,97,AL,AL Total Migration-US,32868,x65647,1467218
01,98,AL,"AL Total Migration,Foreign",848,2100,48079
01,97,AL,AL Total Migration-Same State,46630,95832,1.871804e+6
01,01,AL,AL Non- migrants,1598458,3579600,96406319
01,13,GA,Georgia,5972,12269,249108
01,12,FL,Florida,4489,8446,1.85502e+5
```

Remediated
```
y2_statefips,y1_statefips,y1_state,y1_state_name,n1,n2,AGI
01,96,AL,AL Total Migration-US and Foreign,33716,67747,1515297
01,97,AL,AL Total Migration-US,32868,x65647,1467218
01,98,AL,"AL Total Migration,Foreign",848,2100,48079
01,97,AL,AL Total Migration-Same State,46630,95832,1.871804e+6
01,01,AL,AL Non-migrants,1598458,3579600,96406319
01,13,GA,Georgia,5972,12269,249108
01,12,FL,Florida,4489,8446,1.85502e+5
01,48,TX,Texas,3435,7041,135165
01,47,TN,Tennessee,2608,5304,124203
01,28,MS,Mississippi,2212,4562,90723
```

Verity
```
DestStateCode|OrigStateCode|OrigStateAbbr|OrigStateName|NumReturn|NumExempt|GrossIncome|useAGI|DestStateAbbr|DestStateName|isSubTotal
01|96|AL|AL Total Migration-US and Foreign|33716|67747|1515297000.00|true|AL|Alabama|true
01|97|AL|AL Total Migration-US|32868|65647|1467218000.00|true|AL|Alabama|true
01|98|AL|AL Total Migration,Foreign|848|2100|48079000.00|true|AL|Alabama|true
01|97|AL|AL Total Migration-Same State|46630|95832|1871804000.00|true|AL|Alabama|true
01|01|AL|AL Non-migrants|1598458|3579600|96406319000.00|true|AL|Alabama|true
01|13|GA|Georgia|5972|12269|249108000.00|true|AL|Alabama|false
01|12|FL|Florida|4489|8446|185502000.00|true|AL|Alabama|false
01|48|TX|Texas|3435|7041|135165000.00|true|AL|Alabama|false
01|47|TN|Tennessee|2608|5304|124203000.00|true|AL|Alabama|false
01|28|MS|Mississippi|2212|4562|90723000.00|true|AL|Alabama|false
```
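The field-count check that flags these Small and Big parsing errors can be sketched in plain Python. This is a minimal illustration, not VerityPy's implementation: the `classify_parsing` name, the `RAW` sample (mirroring the split-line records above), and the expected field count of 7 are all assumptions for the example.

```python
import csv
import io

# Hypothetical sample mirroring the As-is records: one logical record was
# split across two physical lines, so both raw lines are short on fields.
RAW = """y2_statefips,y1_statefips,y1_state,y1_state_name,n1,n2,AGI
01,96,AL,AL Total Migration-US and Foreign,33716,67747,1515297
01,97,AL,AL Total Migration-Same
State,46630,95832,1.871804e+6
"""

def classify_parsing(text, expected=7):
    """Tag each data line: ok / small1 (1 field short) / small2 (2+ short) / big."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip header
    results = []
    for lineno, rec in enumerate(reader, start=2):
        n = len(rec)
        if n == expected:
            tag = "ok"
        elif n == expected - 1:
            tag = "small1"
        elif n < expected - 1:
            tag = "small2"
        else:
            tag = "big"
        results.append((lineno, n, tag))
    return results

for lineno, n, tag in classify_parsing(RAW):
    print(f"line {lineno}: {n} fields -> {tag}")
```

Both halves of the split record carry only 4 of the expected 7 fields, so each is tagged `small2`, matching the Small2 count in the report.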
DataType and Format Errors¶
Inconsistent value datatypes and formats are a problem that should be removed using normalizing functions. Different applications read and process values in ways often unknown to end users and can automatically modify values to fit their expected types; this is increasingly common with the addition of AI functions. One very common example is Microsoft Excel's spreadsheet import and export converting strings of numeric digits into integers and removing leading 0's, thereby corrupting the true value if it was a code used in accounting, ERPs, etc. Similarly, Excel converts long numbers into exponential format, again corrupting the number for many downstream applications. Other data applications do the same.
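The leading-zero and exponential-notation hazards described above can be shown directly. This is an illustrative sketch, not VerityPy code; the `is_valid_fips` helper is a hypothetical example of a defensive, string-based validation.

```python
# A state FIPS code is an identifier, not a quantity. Coercing it to a
# number (what a type-guessing importer effectively does) corrupts it:
code = "01"
as_number = str(int(code))
print(as_number)  # "1" -- the leading zero is gone

# Large values can likewise come back in exponential notation:
agi = 1871804.0
print(f"{agi:e}")  # 1.871804e+06

# Defensive approach: keep code-like columns as text and validate instead.
def is_valid_fips(value):
    """Hypothetical check: a two-digit numeric string, leading zeros kept."""
    return len(value) == 2 and value.isdigit()

print(is_valid_fips("01"), is_valid_fips("1"))  # True False
```

This is why the normalized Verity data keeps `DestStateCode` and `OrigStateCode` as fixed-length strings rather than integers.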
field | #Errs DType | #Errs Fmt | DType Reasons | Fmt Reasons |
---|---|---|---|---|
y2_statefips | 0 | 4 | | string incorrect length(4) |
y1_statefips | 0 | 3 | | string incorrect length(3) |
y1_state | 0 | 8 | | string incorrect length(3); string not uppercase(5) |
y1_state_name | 0 | 0 | | |
n1 | 5 | 0 | empty(5) | |
n2 | 7 | 0 | non-numeric(2); empty(5) | |
AGI | 10 | 3 | empty(5); non-numeric(5) | real incorrect decimals(3) |
Datatype Error Examples
- source file line# 2: field=n2, reason=non-numeric, value=x65647
- source file line# 4: field=n1, reason=empty, value=-empty-; field=n2, reason=empty, value=-empty-; field=agi, reason=empty, value=-empty-
- source file line# 5: field=n1, reason=empty, value=-empty-; field=n2, reason=empty, value=-empty-; field=agi, reason=empty, value=-empty-
- source file line# 6: field=n1, reason=empty, value=-empty-; field=n2, reason=empty, value=-empty-; field=agi, reason=empty, value=-empty-
- source file line# 7: field=n1, reason=empty, value=-empty-; field=n2, reason=empty, value=-empty-; field=agi, reason=empty, value=-empty-
- source file line# 8: field=n1, reason=empty, value=-empty-; field=n2, reason=empty, value=-empty-; field=agi, reason=empty, value=-empty-
- source file line# 17: field=n2, reason=non-numeric, value=$2100
- source file line# 18: field=agi, reason=non-numeric, value=$48796
- source file line# 24: field=agi, reason=non-numeric, value=27082ê
- source file line# 25: field=agi, reason=non-numeric, value=24958‼
- source file line# 26: field=agi, reason=non-numeric, value=20884§
- source file line# 27: field=agi, reason=non-numeric, value=21821►
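A datatype check producing the `empty` and `non-numeric` reasons above can be sketched as follows. The `datatype_errors` function is an illustrative assumption, not VerityPy's API.

```python
def datatype_errors(value):
    """Return a reason string for a numeric field's value, or None if clean.
    Mirrors the reasons reported above: 'empty' and 'non-numeric'."""
    v = value.strip()
    if v == "":
        return "empty"
    try:
        float(v)
    except ValueError:
        return "non-numeric"
    return None

# Values drawn from the examples above, plus one clean value:
for s in ["x65647", "", "$2100", "27082ê", "1515297"]:
    print(repr(s), "->", datatype_errors(s))
```

Note that currency symbols and trailing special characters both surface as `non-numeric`, which is why the special-character scan later in this report is a useful companion check.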
Format Error Examples
- source file line# 5: field=y2_statefips, reason=string incorrect length, value=State; field=y1_statefips, reason=string incorrect length, value=46630; field=y1_state, reason=string incorrect length, value=95832
- source file line# 7: field=y2_statefips, reason=string incorrect length, value=Non-; field=y1_statefips, reason=string incorrect length, value=-empty-; field=y1_state, reason=string incorrect length, value=-empty-
- source file line# 8: field=y2_statefips, reason=string incorrect length, value=migrants; field=y1_statefips, reason=string incorrect length, value=1598458; field=y1_state, reason=string incorrect length, value=3579600
- source file line# 10: field=agi, reason=real incorrect decimals, value=1.85502e+5
- source file line# 18: field=y2_statefips, reason=string incorrect length, value=1
- source file line# 19: field=y1_state, reason=string not uppercase, value=ny
- source file line# 20: field=y1_state, reason=string not uppercase, value=il
- source file line# 21: field=y1_state, reason=string not uppercase, value=oh
- source file line# 22: field=y1_state, reason=string not uppercase, value=sc
- source file line# 23: field=y1_state, reason=string not uppercase, value=mi
- source file line# 36: field=agi, reason=real incorrect decimals, value=9052.00
- source file line# 37: field=agi, reason=real incorrect decimals, value=8578.00
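Per-field format rules producing these reasons can be expressed as a simple rule function. This is a hypothetical sketch for a few of the fields above, with rules inferred from the reported reasons; it is not VerityPy's rule engine.

```python
def format_errors(field, value):
    """Illustrative per-field format rules matching the reasons above."""
    errs = []
    if field in ("y2_statefips", "y1_statefips") and len(value) != 2:
        errs.append("string incorrect length")
    if field == "y1_state":
        if len(value) != 2:
            errs.append("string incorrect length")
        elif not value.isupper():
            errs.append("string not uppercase")
    if field == "agi" and ("e" in value.lower() or "." in value):
        # assumed rule: an integer amount, so reals/exponents are flagged
        errs.append("real incorrect decimals")
    return errs

print(format_errors("y1_state", "ny"))         # ['string not uppercase']
print(format_errors("agi", "1.85502e+5"))      # ['real incorrect decimals']
print(format_errors("y2_statefips", "State"))  # ['string incorrect length']
```

Notice how the length rule also catches the parsing fragments (`State`, `migrants`) that landed in the wrong columns, so format checks double as a second line of defense against parsing errors.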
Field Unique Values¶
The unique value distribution provides a QC examination of each field's conformance to expected semantics and code sets. In addition, checking the lowest-occurring values provides a rapid way to detect serious problems such as parsing errors and special characters (e.g. from multi-language encodings). The value '-other-' is an artificial value used to group actual values when there are too many distinct unique values to track. Listed below as unique value(count).
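The grouping into '-other-' can be sketched with a counter that tracks only the most frequent values. The `unique_values` function and the `max_track` cutoff are illustrative assumptions.

```python
from collections import Counter

def unique_values(values, max_track=3):
    """Count unique values; beyond the top `max_track`, lump into '-other-'."""
    counts = Counter(values)
    tracked = dict(counts.most_common(max_track))
    other = sum(c for v, c in counts.items() if v not in tracked)
    if other:
        tracked["-other-"] = other
    return tracked

# Lowercase 'tn' stays distinct from 'TN' here, which is exactly the kind
# of anomaly a low-occurrence scan surfaces.
vals = ["AL", "AL", "AL", "GA", "GA", "FL", "TX", "tn"]
print(unique_values(vals))
```

In a real run, the rare tail (single-count values) is where parsing fragments and mis-cased codes show up first.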
Field: y2_statefips
Field: y1_statefips
Field: y1_state
Field: y1_state_name
Field: n1
Field: n2
Field: AGI
CoValues¶
CoValues provide important visibility into semantically linked field values and are key to uncovering systemic errors that often arise from unknown upstream changes in data rules, schemas, and codes. This example defined one CoValue.
CoValue Fields: y1_state,y1_state_name
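Counting joint occurrences of linked fields, as the CoValue above does for `y1_state` and `y1_state_name`, can be sketched with a tuple counter. The `covalues` function and the sample records (including a deliberate mismatch) are illustrative assumptions, not VerityPy's API.

```python
from collections import Counter

def covalues(records, fields):
    """Count joint occurrences of the named fields across record dicts."""
    return Counter(tuple(rec[f] for f in fields) for rec in records)

recs = [
    {"y1_state": "GA", "y1_state_name": "Georgia"},
    {"y1_state": "GA", "y1_state_name": "Georgia"},
    {"y1_state": "FL", "y1_state_name": "Florida"},
    {"y1_state": "GA", "y1_state_name": "Florida"},  # mismatch worth flagging
]
for pair, n in covalues(recs, ["y1_state", "y1_state_name"]).items():
    print(pair, n)
```

A low-count pair like `('GA', 'Florida')` standing next to a high-count `('GA', 'Georgia')` is the signature of a systemic abbreviation/name mismatch.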
Special Characters¶
Special characters are ASCII characters intended for drawing symbols or legacy system codes, plus high Unicode sets commonly used in multi-language applications. They do not necessarily indicate errors when used for valid reasons, but they can cause problems when they look acceptable in a user interface, visually resembling an ASCII character, yet fail code and database rules that specifically require only ASCII characters and words. VerityPy detects special characters, with examples listed below.
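A scan producing labels in the `ascii_<code>` / `unicode_<code>` style used below can be sketched as follows. The `special_chars` function and its thresholds (printable ASCII as 32..126, `ascii_` for code points up to 255) are illustrative assumptions inferred from the reported labels.

```python
def special_chars(text):
    """Label characters outside printable ASCII as ascii_<code> or unicode_<code>."""
    found = []
    for ch in text:
        cp = ord(ch)
        if 32 <= cp <= 126:
            continue  # ordinary printable ASCII
        prefix = "ascii" if cp <= 255 else "unicode"
        found.append(f"{prefix}_{cp}")
    return found

# Values drawn from the AGI datatype examples earlier in this report:
print(special_chars("27082ê"))  # ['ascii_234']
print(special_chars("24958‼"))  # ['unicode_8252']
```

Characters like ê (extended ASCII 234) or ► (Unicode 9658) render harmlessly on screen but fail ASCII-only numeric parsing, which is why these values also appeared as `non-numeric` datatype errors.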
Special Characters Detected
SpecChar | Count |
---|---|
ascii_234 | 1 |
unicode_8252 | 1 |
ascii_167 | 1 |
unicode_9658 | 1 |
Special Character Examples
Line# | Field:SpecChar | Record |
---|---|---|
24 | AGI:ascii_234 | 01,21,KY,Kentucky,574,1213,27082ê |
25 | AGI:unicode_8252 | 01,08,CO,Colorado,473,932,24958‼ |
26 | AGI:ascii_167 | 01,29,MO,Missouri,459,901,20884§ |
27 | AGI:unicode_9658 | 01,18,IN,Indiana,457,895,21821► |