
Data Profiler is an open source solution from Capital One that uses machine learning to help companies monitor big data and detect private customer information so that it can be protected. It provides a pre-trained deep learning model that identifies sensitive information efficiently, components for statistical analysis of a dataset, and an API for building data labelers. Data Profiler accepts a wide range of data formats, including CSV, Avro, Parquet, JSON, plain text, and pandas DataFrames. Whether the data is structured, semi-structured, or unstructured, the library can identify its schema, statistics, and entities. The data labeler is versatile: models can be modified as needed, and multiple models can be run on the same dataset with just a few lines of code.
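
As a rough illustration of the "few lines of code" workflow, the sketch below loads a file, profiles it, and returns a report. It assumes the `dataprofiler` Python package is installed and follows the quick-start pattern from the project's documentation as we understand it; the helper name `profile_file` and the file name `your_file.csv` are hypothetical and exist only for this example.

```python
def profile_file(path):
    """Profile the dataset at `path` and return the report as a dict.

    `profile_file` is a hypothetical helper for illustration; it is not
    part of the Data Profiler API.
    """
    # Imported here so the helper can be defined even on machines where
    # the (fairly heavy) dataprofiler package is not installed.
    from dataprofiler import Data, Profiler

    # Data() auto-detects the file type (CSV, Avro, Parquet, JSON, text).
    data = Data(path)

    # Profiler() computes statistics and, by default, runs the
    # pre-trained data labeler to flag sensitive entities.
    profile = Profiler(data)

    # "pretty" asks for a human-readable version of the report.
    return profile.report(report_options={"output_format": "pretty"})


# Example usage (assumes a local file named your_file.csv exists):
#   import json
#   report = profile_file("your_file.csv")
#   print(json.dumps(report, indent=4))
```

Exact report contents depend on the installed version and on the data, so no sample output is shown here.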

