Technik Interlytics: Tools for Better Data

Python & .Net Libraries to Inspect, Remediate, Enrich

Overview

Better Data = Better Outcomes. VerityPy & VerityDotNet combine curated human expertise with Machine Learning (ML) analysis to extract error patterns from very large data sets. Our libraries significantly lower the level of effort needed to make data ready for high-quality Data Science, Machine Learning, and predictive models.

The libraries contain expert algorithms developed from in-depth investigations, forensic tracing, and specialized remediation on large data systems across many fields, always with successful outcomes. Our human experts discovered where to look, what to look for, and what to do about problems, especially deeply buried ones missed by traditional tools. These algorithms were then tuned and tested on a variety of data sets to refine their performance and to determine the characteristics and statistics most useful for faster and more accurate data processing.

The functions enable coding labor-intensive, complicated tasks to improve data quality, remediate errors, normalize data across sources, and enrich data for AI/ML, Data Science, scenario modeling, and database modernization.

A Growing Challenge

The goal of data analysis and quality control is to characterize and process data so that it is fit-for-purpose and systemically managed. This work goes by many names (data quality, integration, warehousing, wrangling, governance, and more), all sharing the same goal but with differing levels of required accuracy, uniformity, and tracking. We want to know the data's structure, range of values, and anomalies relative to what its architecture says it should be. We want to correct any errors and possibly augment the data with additional information to feed analytics and modeling.

Ideally, documentation would describe how the data was collected and stored, and what it means in the context of its intended use. In that ideal case, unit tests could automatically measure quality metrics both as the data is received and processed, and as it is distributed and used. Unfortunately, this ideal situation rarely exists, and we are forced to manage data of uncertain quality, pedigree, and trustworthiness. When the use case can tolerate imperfect data, this is not much of a problem. However, we now have increasingly stringent needs for better data to feed Artificial Intelligence (AI), Data Science (DataSci), and more sophisticated forecasting models in financial markets, global supply chains, consumer activity, and many other areas.

We need a powerful, transparent, iterative technology to improve and enrich data sets. It is counter-productive to attempt to define every detail of a data set's structure, meaning, and use-case rules in one lengthy requirements-gathering process; that approach is too complicated, laborious, and error-prone.

Features Overview

Inspect

Deep analysis and characterization of data for structural and value consistency, anomalies, and errors, especially deeply buried problems that are not detected by common DataOps tools.

Remediate

Make data accurate in structure (field positions, data types, formats) and values (numeric ranges, codes, lists of allowed values). Detect and fix parsing errors, including challenging multi-line records.

Enrich

Add metadata using realistic logic. Enable downstream analytics to easily filter on multiple criteria, generate Pivot tables, and feed drill-down style dashboards. Facets fuel Machine Learning, producing more accurate models.

Annual License Plans

Upgrade or Buy When You Need
One license needed per library. (Get 75% off a second Developer license; see the FAQ for the coupon.)
Contact us for OEM and Enterprise use, including source code access.

Community

Do It Yourself
$0
  • Personal & Non-commercial
  • Self-support

Support Only

Not for distributing applications
$125
  • Personal & Non-commercial
  • Forums
  • Multi-user application prototyping

Developer

Standard / Pro
$400 / $935
  • Web-based multi-user applications: Standard = 1, Pro = unlimited.
  • Support
  • Redistribute with OEM license

Frequently Asked Questions

Q. Where do I get the libraries?

Start by creating an account at VerityUserMgr. Then log in to your account and select a plan for VerityPy and/or VerityDotNet if you want a paid Support or Developer plan. Once this is done, you can get VerityPy at https://pypi.org/project/VerityPy/ or install it via: pip install VerityPy. VerityDotNet is at https://www.nuget.org/packages/VerityDotNet/ or can be added with NuGet within your Visual Studio project.

Q. What does Inspect do?

Analysis and characterization of data for structural and value consistency, anomalies, and errors, especially those not detected by common DataOps tools. Some examples (with a short illustrative sketch after the list) are:

  • data variations (types, formats, encoding) from import/export in different systems especially spreadsheets and legacy mainframes
  • special characters not visible to users causing downstream problems
  • small number of anomalies buried within very large data sets overwhelming tools
  • mismatched joint field values such as geographic location codes and names
  • long strings of digits used as codes (e.g. accounting, ERP) cast into number format stripping digits thereby corrupting codes
  • records broken into multiple lines causing fields to be misaligned, partial records, and incorrect values
  • open source data with embedded information-only records (e.g. IRS USA Migration demographics, Covid disease census) unknown to users
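
To make these checks concrete, here is a minimal plain-Python sketch of two of them: finding hidden special characters and flagging records whose parsed field count does not match the schema. It does not use VerityPy or VerityDotNet, and the file name and expected field count are assumptions for illustration only.

  import csv

  EXPECTED_FIELDS = 5  # assumed schema width for the hypothetical sample file

  def scan_file(path):
      special_chars = {}   # character -> count of instances
      parse_errors = []    # (line number, parsed field count)
      with open(path, newline="", encoding="utf-8", errors="replace") as f:
          for lineno, row in enumerate(csv.reader(f), start=1):
              if len(row) != EXPECTED_FIELDS:
                  parse_errors.append((lineno, len(row)))
              for value in row:
                  for ch in value:
                      # flag characters users cannot normally see
                      if not ch.isprintable() or ord(ch) > 126:
                          special_chars[ch] = special_chars.get(ch, 0) + 1
      return special_chars, parse_errors

  chars, errors = scan_file("sample.csv")  # hypothetical file
  print(len(errors), "records with an unexpected field count")
  print(len(chars), "distinct special characters found")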

Q. What does Remediate do?

Make data accurate in structure (field positions, data types, formats) and values (numeric ranges, codes, lists of allowed values). Detect and fix parsing errors, including challenging multi-line records. Some examples (with a simplified sketch after the list) are:

  • rebuilding records broken into multiple lines
  • adding enrichment fields to annotate pedigree, trust, privacy, increased granularity with conditional logic and controlled vocabulary
  • several levels of conditional testing on multiple fields within single records to correctly encode/decode and transform field values
  • allowing multiple versions of lookup decoding per field based on other field indicators (e.g. time varying encoding schemes)
  • identifying when long values of numeric digits are strings or numbers and handling accordingly
  • lookup dictionary replacements using 1 or multiple fields as well as wildcard tokens and both boolean AND and NOT conditions
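
As a rough illustration of the first item above, the following plain-Python sketch rejoins records that were broken across lines, using the field count as the completeness signal. It is not the libraries' algorithm; the delimiter and expected field count are assumptions, and real repair must also handle quoted delimiters and other edge cases.

  # Illustration only: naive rejoin of records broken across multiple lines.
  EXPECTED_FIELDS = 5   # assumed number of fields in a complete record
  DELIM = ","           # assumed delimiter

  def rejoin_broken_records(lines):
      fixed, buffer = [], ""
      for line in lines:
          buffer = line if not buffer else buffer + line
          if buffer.count(DELIM) + 1 >= EXPECTED_FIELDS:
              fixed.append(buffer)   # record is now complete
              buffer = ""
      if buffer:
          fixed.append(buffer)       # trailing partial record, kept for review
      return fixed

  raw = ["a,b,c", ",d,e", "1,2,3,4,5"]   # the first record was split across two lines
  print(rejoin_broken_records(raw))      # ['a,b,c,d,e', '1,2,3,4,5']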

Q. What does Enrich do?

Add metadata into records as analytic factors and facets for Machine Learning. Enable downstream analytics to easily filter on multiple criteria, generate Pivot tables, and feed drill-down style dashboards. Higher-granularity facets can be inserted into data records to power Machine Learning, thereby producing more accurate models. Some examples from the IRS Migration data used in the code samples (with a brief sketch after the list) are:

  • add boolean field useAGI to distinguish records that have correct AGI (adjusted gross income) values from those intentionally coded with false values by the data owners (an actual case in the source)
  • add boolean field isSubTotal to flag records that contain partial sums of transaction records (an actual case in the source)
  • add string field DestStateAbbr since source records have abbreviation for origin state but not destination
  • add string field DestStateName since source records have name for origin state but not destination
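
The sketch below shows, in plain Python, how enrichment fields like these could be attached to a single IRS Migration record. The state dictionaries are abridged, the subtotal rule is a placeholder, and none of this reflects the libraries' internal code; it only illustrates the kind of fields being added.

  # Illustration only: adding the enrichment fields described above to one record.
  STATE_ABBR = {"01": "AL", "02": "AK"}           # FIPS code -> abbreviation (abridged)
  STATE_NAME = {"01": "Alabama", "02": "Alaska"}  # FIPS code -> name (abridged)

  def enrich(record):
      code = record["y2_statefips"].rjust(2, "0")   # restore a stripped leading zero
      record["useAGI"] = record["AGI"] != "-1"      # -1 marks intentionally false AGI values
      record["isSubTotal"] = record.get("record_type") == "subtotal"  # placeholder rule; the real indicator depends on the source
      record["DestStateAbbr"] = STATE_ABBR.get(code, "")
      record["DestStateName"] = STATE_NAME.get(code, "")
      return record

  row = {"y1_statefips": "02", "y2_statefips": "1", "AGI": "-1"}
  print(enrich(row))  # adds useAGI=False, isSubTotal=False, DestStateAbbr='AL', DestStateName='Alabama'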

Q. Why were VerityDotNet and VerityPy made when there are other data Quality, Governance, Integration tools?

Verity tools focus on the need to provide high visibility into how and why data is processed so that it can be managed like other key business assets, with oversight and collaborative review. All too often, the actual manipulation of data is handled by proficient engineers, but the people who most need to review, understand, and adjust what is done cannot decipher the complicated code, scripts, and technical documentation. Our human experts witnessed this situation at many clients and had to solve this challenge before the technical results would be accepted. Doing so led us to develop new data processing and reporting approaches that jointly handle complicated data engineering requirements and visible, easy-to-understand business reporting. The libraries were created to bring this capability to a wide community with the following key concepts:

  • easily reuse and adjust processing parameters for multiple iterations of transforms, codings, rules with reviews of results in end-use applications
  • review data processing steps and intermediate results throughout the entire process (i.e. no black boxes)
  • use processing commands that can be reviewed by business and technical people at both staff and manager levels
  • enable drop-in reporting with summary and detailed charts and tables of data actions, discoveries, and results
  • provide multiple views of data before and after to maximize understanding and discovery among all user types

Q. What are some of the Analysis functions?

Verity tools analyze structured source data and generate a thorough assessment of each field's actual value space: data types, formats, ranges, special characters, unique values, and even coValues, which are joint value distributions across two or three fields. This is a quick way to profile source data, extract its schema, and discover anomalies that can be overlooked by other tools or missed by manual Quality Control reviews. A comprehensive report is returned in an object that can be used in further processing to make tables and charts, such as in a Jupyter Notebook.

Results are coordinated in a class 'QualityAnalysis', allowing concise handling of the setup parameters and the breadth and depth of discovered characteristics and known/suspected errors. These results include the following (a small illustrative sketch follows the list):

  • field unique values: per field unique values with count of instances.
  • field datatype distributions: each field has counts for detected datatypes (int, real, bool, date, string, empty).
  • field quality: each field is assigned a quality factor 0-100 based on discovered characteristics and knowledge-based algorithms.
  • record size distribution: record sizes (byte lengths) to count of instances.
  • record parsing errors: parsing errors (number of parsed fields relative to defined fields) categorized as small1 (1 too few fields), small2 (2 or more missing fields), and big (1 or more extra fields). Example records are included.
  • record parsing distribution: number of parsed fields to count of instances.
  • special character distribution: special characters and their count of instances, as well as example records.
  • coValues: field combinations (2 or 3) unique value information.
  • error statistics: values such as the number of records with any kind of error, the number with a datatype error, the number with a format error, and more.
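
To give a feel for one of these statistics, here is a plain-Python sketch that computes a field datatype distribution using the same buckets listed above (int, real, bool, date, string, empty). It is illustrative only and is not how the 'QualityAnalysis' class computes its results.

  # Illustration only: datatype distribution over one field's values.
  from collections import Counter
  from datetime import datetime

  def detect_type(value):
      v = value.strip()
      if v == "":
          return "empty"
      if v.lower() in ("true", "false"):
          return "bool"
      try:
          int(v)
          return "int"
      except ValueError:
          pass
      try:
          float(v)
          return "real"
      except ValueError:
          pass
      try:
          datetime.fromisoformat(v)
          return "date"
      except ValueError:
          return "string"

  values = ["3", "4.5", "", "2024-01-15", "hello", "true"]
  print(Counter(detect_type(v) for v in values))
  # Counter({'int': 1, 'real': 1, 'empty': 1, 'date': 1, 'string': 1, 'bool': 1})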

Q. What are some of the Normalization and Enrichment functions?

Verity transforms allow normalizing and enriching source data with a high level of quality, accuracy, and meaning to support demanding use cases. There are five kinds of transforms (see the transforms page in the User Guide for details):

  1. Assignment: assigns values to field as a fixed value, reference to another field in record, random number, list of categories via frequencies, lookup dictionaries
  2. Conditional: conditional tests of equality and inequality for numeric, string, and date values
  3. Numeric: numeric calculation functions including using other fields in record by reference
  4. Text: manipulate with slicing, adding, padding, replacing
  5. Date: Change date format to ISO 8601 including from special Excel format

This is an example of a transform that populates an enrichment field 'useAGI', which denotes whether the record should be used in analytics based on the value of the numeric source field 'AGI'.

  1. setToRef("AGI")
  2. ifEq("-1")
  3. setToValue("true")
  4. setToValue("false")

To allow chaining of conditional functions, the flow is condition -> [false action] else [true action]. Thus, if step 2 above is False then step 3 runs and the chain stops, whereas if step 2 is True then step 3 is skipped and step 4 runs (along with any steps after it, if there were any). The net result is that this simple transform fills an enrichment field with a boolean value, enabling easy filtering downstream in a spreadsheet, database, or analytics dashboard.
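
The same chaining rule can be written out in ordinary Python to make the flow explicit. This sketch mirrors the four steps above but is purely illustrative; it is not the libraries' implementation of the transform engine.

  # Illustration only: condition -> [false action] else [true action] chaining
  # for the 'useAGI' transform above.
  def use_agi_transform(record):
      value = record["AGI"]       # step 1: setToRef("AGI")
      if value != "-1":           # step 2: ifEq("-1") is False ...
          return "true"           # step 3: ... so the false action runs and the chain stops
      return "false"              # step 4: ifEq("-1") was True, so step 3 is skipped

  print(use_agi_transform({"AGI": "-1"}))     # 'false' -> exclude from analytics
  print(use_agi_transform({"AGI": "52300"}))  # 'true'  -> keep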

A slightly more complicated logic flow is the following transform. It uses the source field 'y2_statefips', which is supposed to contain a 2-character code. The external lookup dictionary requires 2 characters even for values like '01', which makes the field vulnerable to unintentional number-format changes, such as during import/export with spreadsheets and databases (which commonly strip the leading 0). This transform proactively fixes that error to ensure the lookup succeeds and assigns the new value to the enrichment field 'DestStateName'.

  1. setToRef("y2_statefips")
  2. setLength("2","left","0")
  3. lookup("StateName")
Step 1 gets the value of the field 'y2_statefips' from the current record. Step 2 fixes the string length to 2 characters, with changes made to the left side of the string: if it is too long, characters are cut from the left; if it is too short, characters are added to the left using the pad character '0' (zero). This is critical for code lookups because a very common problem when data moves among systems is that leading zeros are removed, changing a code like '01' into '1', which would not be found in the lookup. Step 3 then performs the lookup against a dictionary named 'StateName' (loaded during the setup phase of the job).
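
Written as a plain-Python sketch (illustrative only, with an abridged 'StateName' dictionary and simplified padding logic), the same three steps look like this:

  # Illustration only: fix a stripped leading zero, then do the lookup.
  STATE_NAME = {"01": "Alabama", "02": "Alaska", "04": "Arizona"}  # abridged

  def set_length(value, size, pad_char="0"):
      # pad or cut on the left so the value is exactly `size` characters
      return value.rjust(size, pad_char) if len(value) < size else value[-size:]

  def dest_state_name(record):
      code = record["y2_statefips"]     # step 1: setToRef("y2_statefips")
      code = set_length(code, 2)        # step 2: setLength("2","left","0")
      return STATE_NAME.get(code, "")   # step 3: lookup("StateName")

  print(dest_state_name({"y2_statefips": "1"}))   # 'Alabama' (the '1' was really '01')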

Q. Are VerityX libraries open source?

They are not open source software and cannot be included in an open source project, since their license is incompatible with open source licenses. However, the license allows free use for non-commercial, personal applications. Read the license file for full details on the allowed scope of free use. Paid licenses are required for commercial products, whether distributed or web hosted (e.g. SaaS), as well as for enterprise applications with multiple users. There are licenses for developers, for royalty-based inclusion in other products, and for support.

Q. Is there a discount to get both libraries VerityPy and VerityDotNet?

Yes. Purchase a developer license for one library and get a 75% (subject to change) discount on the second developer license. Use coupon Verity-Dev-2x-2025 in the Plan Payment form.

Q. Who owns the data and application code when using the libraries?

You do. Technik Interlytics maintains the libraries and owns the algorithms and methods within them. We do not view, nor have any claim to, client data or code.

Q. Does the library need a license number to operate?

Only VerityDotNet needs a license number passed to its functions to enable higher-capacity multi-threading and block processing. Purchase a developer license for one library and get a 75% discount on the second (see above).

Get in Touch with us

Still have questions? Send us an email at info@technikinterlytics.com

Get Example Data Files and Results

Go to github.com/TechnikInterlytics/VerityExamples