Data Quality And Integrity In The Age Of AI


Kathleen Hurley is the founder of Sage Inc., a tech company that offers SMB businesses infrastructure solutions and next-gen technology.


As data becomes everything to everybody, it’s crucial to oversee the quality of information that runs through an organization. When systems ingest and analyze information and output it in summary form for human decision-making, data integrity and quality are paramount. Investing in data quality from the moment data first enters an organization must be the new "infrastructure investment." Without it, the organization is going to have difficulty operating well.

How bad is bad data?

Financial regulators worldwide have been talking about the need for high-quality data for years. As Kaizen Reporting, a company of regulatory and data specialists, noted in discussing the AMF’s 2022 supervisory priorities, "Accurate reporting data is of paramount importance to the oversight of trading activity and the detection of market abuse and manipulation."

However, this focus on reliable, high-quality data has not taken hold. Banks and other organizations are sitting on inaccurate data and acting on it in ways that affect ordinary individuals and organizations. Imagine being charged for something because of a bank error. It happens every day, you might say. But imagine being charged thousands of dollars for something that you aren’t responsible for and having no way to dispute it because the bank’s data says you owe the bill.

What can be done to clean the data?

Once bad data is integrated into a system, it can be very difficult to remove. This is partly because of the volume of data moving through organizations today, but largely because of the historical data that companies retain.

Think about the volume of data that companies have stored from, for instance, the year 2012. If a company was operating then, the entire year’s data is on file. To cleanse that information, it may be necessary to look at both structured data (databases) and unstructured data (files such as Word documents and Excel spreadsheets). Until fairly recently, this required a people-first data strategy. New technologies allow the cleansing of structured and unstructured data with less investment, but some organizations still struggle with the volume of information that must be sifted to make a meaningful dent.

One way to approach a data cleansing operation is to segregate the incoming data from the historical data. By implementing 'clean intake' processes for new information and funneling it into a verified, clean database, the company can ensure that all new data is processed accurately and remains clean. The historical information can be cleaned in segments and funneled into the new system, gradually replacing the old system.
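As a rough illustration of this split, the sketch below routes every record through one intake check and moves historical data across in fixed-size segments. It is a minimal sketch under stated assumptions, not a description of any particular product: the `Stores` container, the `is_clean` rule (a non-empty `id` and a numeric `amount`) and the field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Stores:
    """Hypothetical stand-ins for the verified database and a review queue."""
    clean: list = field(default_factory=list)
    quarantine: list = field(default_factory=list)


def is_clean(record: dict) -> bool:
    # Assumed rule for illustration: a non-empty id and a numeric amount.
    return bool(record.get("id")) and isinstance(record.get("amount"), (int, float))


def intake(record: dict, stores: Stores) -> None:
    """Clean intake: only verified records reach the clean store."""
    target = stores.clean if is_clean(record) else stores.quarantine
    target.append(record)


def migrate_segment(legacy: list, stores: Stores, start: int, size: int) -> int:
    """Clean one segment of historical data and return the next start index."""
    segment = legacy[start:start + size]
    for record in segment:
        intake(record, stores)
    return start + len(segment)
```

New records and historical segments pass through the same check, so the clean store holds one standard of data while the legacy system is retired piece by piece.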

This requires firms to operate in two places at once or to put a bridge system in place so that both systems can function together, but it is significantly less painful than attempting to clean all data at the same time. In fact, cleaning everything at once is impossible for some companies because the volume of information is too great.

What does cleaning data mean?

Clean data must be validated for accuracy, completeness, consistency, timeliness and relevance. While it may seem like all data inherently meets these criteria, that’s often not the case. For example, a database entry screen might allow users to skip certain fields, leading to incomplete records. Typing a state name manually instead of selecting from a drop-down menu can introduce errors through typos. Similarly, allowing records to be saved without submission may leave them incomplete and untimely.

These issues were common in older, more permissive data systems, some of which are still in use today. If your system lacks proper data validation rules, now is the time to implement them and align with modern data management standards.
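The failure modes described above (skipped fields, free-typed state names, records saved without submission, stale entries) can each be expressed as a simple rule. The following is a minimal sketch, assuming a record is a plain dictionary; the field names, the short list of states and the one-year freshness window are illustrative assumptions, not a standard.

```python
from datetime import date, timedelta

# Illustrative assumptions: a real system would load these from governance docs.
VALID_STATES = {"CA", "NY", "TX"}
REQUIRED_FIELDS = ("name", "state", "updated_on")
MAX_AGE = timedelta(days=365)


def validate(record: dict, today: date) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    # Completeness: required fields may not be skipped.
    for field_name in REQUIRED_FIELDS:
        if record.get(field_name) in (None, ""):
            problems.append(f"missing {field_name}")

    # Consistency: state must come from a fixed list, like a drop-down menu.
    state = record.get("state")
    if state and state not in VALID_STATES:
        problems.append(f"unknown state {state!r}")

    # Completeness: saved-but-not-submitted records are flagged.
    if record.get("status") != "submitted":
        problems.append("not submitted")

    # Timeliness: records older than the assumed window are flagged as stale.
    updated = record.get("updated_on")
    if isinstance(updated, date) and today - updated > MAX_AGE:
        problems.append("stale")

    return problems
```

Returning a list of named problems, rather than a single pass or fail flag, makes it easier to report which rule a record broke during an audit.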

If changing database programming or implementing structured data validation isn’t an option, it’s still crucial to understand and document your organization’s ongoing data governance processes. For instance, you might rely on manual reviews (eyeball audits) or use automation to identify outliers in your data streams. Some software tools can monitor and flag information that falls outside system standards.
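Automated outlier detection does not have to be elaborate. One common technique, shown here as a sketch rather than a recommendation for any specific tool, is the modified z-score based on the median absolute deviation, which is less distorted by extreme values than a mean-based score. The function name, the sample figures and the 3.5 cutoff (a conventional choice) are assumptions for illustration.

```python
from statistics import median


def flag_outliers(values: list, threshold: float = 3.5) -> list:
    """Return the indexes of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: the typical distance from the median.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half the values equal the median; the score is undefined here.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical daily transaction amounts with one suspicious entry.
amounts = [100, 102, 98, 101, 99, 5000]
suspicious = flag_outliers(amounts)  # indexes of entries to send for manual review
```

Flagged entries are candidates for the manual review described above, not automatic deletions, since an unusual value can still be a correct one.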

If you’re using these validation methods, ensure your processes are well-documented with clear and consistent rules defining what constitutes 'good' and 'clean' data. This not only simplifies internal audits but also helps meet regulatory requirements. Data privacy regulations, such as GDPR, emphasize regular audits and assessments of data integrity, aligning with principles like "privacy by design."

In addition, don’t forget to include user awareness training in your strategic planning and documentation. Users must understand the importance of data cleansing and management and be part of the solution for the stream of new data. They may not take part in cleaning the old data, but they will need to help get clean new data into the system and should understand what is happening, why it is happening and how it is happening. They should also be aware of any monitoring in place to ensure that data integrity is maintained over time.

Conclusion

For most companies, straightening out the data integrity situation they face can feel overwhelming. Years of historical data collection, rather than any single failure, brought us here, and most companies have a lot of cleanup to do before reaching true data integrity. Taking a phased approach and setting realistic goals that the whole team is aware of can make a huge difference in the success of a data integrity project.


Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.

