Validate, Correct and Quarantine Data Using Pydantic
Here’s a demo of how pydantic can be used to validate, correct, and quarantine data.
Access the code and demo file here.
Given a CSV file users.csv containing the following records:
id,name,email
1001,ana,ana@example.com
1002,isabel,isabel@example.com
1003,kris,kris@example.com
1004,bee,bee@ example.com
100s,kim,jan@example.com
Column id is expected to be an integer, name is expected to be a string, and email is expected to be a valid email address.
The data can be modeled using pydantic.
from pydantic import BaseModel, EmailStr

class User(BaseModel):
    id: int
    name: str
    email: EmailStr
EmailStr is a built-in type in pydantic that can be used to validate email addresses. It depends on the email-validator library, which needs to be installed along with pydantic.
pip install pydantic
pip install email-validator
Running the data through this BaseModel will result in two (2) errors: an invalid email for user bee, and an invalid id for user kim.
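To see what one of these failures looks like in isolation, here is a minimal sketch, assuming pydantic v2. `UserIdName` is a hypothetical cut-down model (the email field is left out so the snippet runs without email-validator); it is not part of the original demo.

```python
from pydantic import BaseModel, ValidationError


# Hypothetical cut-down model: only the fields needed to show the id failure.
class UserIdName(BaseModel):
    id: int
    name: str


# A numeric string is coerced to int, which is why the CSV values pass as-is.
ok = UserIdName(id="1001", name="ana")
print(ok.id)

errors = []
try:
    # "100s" cannot be parsed as an integer, so validation fails.
    UserIdName(id="100s", name="kim")
except ValidationError as e:
    errors = e.errors()
    for err in errors:
        print(err["loc"], err["type"], err["msg"])
```

Each entry in `e.errors()` is a dict that names the failing field (`loc`), the kind of failure (`type`), and a readable message (`msg`).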
The email can be easily corrected by removing the space. To do this, model_validator with mode="before" can be used. This means the function that corrects the email address runs before the inner validators for int, str, and EmailStr.
from pydantic import BaseModel, EmailStr, model_validator

class User(BaseModel):
    id: int
    name: str
    email: EmailStr

    @model_validator(mode="before")
    @classmethod
    def correct_email(cls, data):
        # remove spaces in the email address before field validation runs
        if isinstance(data, dict) and isinstance(data.get("email"), str):
            data["email"] = data["email"].replace(" ", "")
        return data
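If only the email field needs cleaning, a narrower option is a field-level validator in "before" mode. The following is a sketch under assumptions, not the approach used in this demo: `UserSketch` and `strip_spaces` are hypothetical names, pydantic v2 is assumed, and the email field is typed as plain `str` so the snippet runs without email-validator (swap in EmailStr to get real email validation).

```python
from pydantic import BaseModel, field_validator


class UserSketch(BaseModel):
    id: int
    name: str
    # Assumption for this sketch: plain `str` instead of EmailStr.
    email: str

    @field_validator("email", mode="before")
    @classmethod
    def strip_spaces(cls, value):
        # Runs on the raw input, before pydantic validates the field's type.
        if isinstance(value, str):
            return value.replace(" ", "")
        return value


user = UserSketch(id="1004", name="bee", email="bee@ example.com")
print(user.email)
```

A field validator only sees the one value it is attached to, which keeps the correction local; the model-level validator is the better fit when a fix depends on several fields at once.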
Unlike the incorrect email, the erroneous id looks less straightforward to correct, so it’s best to quarantine this row and handle it later. The rest of the passing records can be processed first.
To do this, pydantic's ValidationError exception can be caught to set aside the failing records, along with the error details.
# import ValidationError from `pydantic_core`
from pydantic_core import ValidationError
# ...
# define passed and quarantine lists
users_passed = list()
users_quarantined = list()
clines = 0
with open("users.csv") as inf:
    for i, line in enumerate(inf):
        # skip the header row
        if i != 0:
            clines += 1
            data = line.strip().split(",")
            try:
                user = User(id=data[0], name=data[1], email=data[2])
                # add passing records to this `passed` list
                users_passed.append(user)
            except ValidationError as e:
                q_data = ",".join(data)
                errors = e.errors()
                # add erroneous records to the `quarantine` list
                # along with the records of the error
                # `errors` would contain the error details
                # and can be further processed to be more concise
                users_quarantined.append((q_data, errors))
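The comment about making the error details more concise can be sketched as a small helper. This is one possible reduction, not part of the original demo: `summarize_errors` is a hypothetical name, and the sample entry below is hand-written with the keys that pydantic v2 puts in each `e.errors()` item (`type`, `loc`, `msg`, `input`).

```python
def summarize_errors(errors):
    """Reduce pydantic-style error dicts to short 'field: message' strings."""
    summaries = []
    for err in errors:
        # `loc` is a tuple such as ('id',); join it into a dotted path.
        field = ".".join(str(part) for part in err.get("loc", ()))
        summaries.append(
            f"{field}: {err.get('msg', '')} (input={err.get('input')!r})"
        )
    return summaries


# Hand-written sample shaped like the error entry for the `100s` row.
sample_errors = [
    {
        "type": "int_parsing",
        "loc": ("id",),
        "msg": "Input should be a valid integer, unable to parse string as an integer",
        "input": "100s",
    }
]
print(summarize_errors(sample_errors))
```

In the loop above, `summarize_errors(e.errors())` could be stored in place of the full `errors` list if the shorter form is enough for later review.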
Printing some information about this flow:
print(f"Total rows processed: {clines}")
print(f"Total rows passed: {len(users_passed)}")
print(f"Total rows quarantined: {len(users_quarantined)}")
print()
print("----passed------\n", users_passed)
print()
print("----quarantined------\n", users_quarantined)
print()
will result in:
Total rows processed: 5
Total rows passed: 4
Total rows quarantined: 1
----passed------
[User(id=1001, name='ana', email='ana@example.com'), User(id=1002, name='isabel', email='isabel@example.com'), User(id=1003, name='kris', email='kris@example.com'), User(id=1004, name='bee', email='bee@example.com')]
----quarantined------
[('100s,kim,jan@example.com', [{'type': 'int_parsing', 'loc': ('id',), 'msg': 'Input should be a valid integer, unable to parse string as an integer', 'input': '100s', 'url': 'https://errors.pydantic.dev/2.7/v/int_parsing'}])]
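Since quarantined rows are meant to be handled later, they usually need to be stored somewhere. The sketch below is one way to do that with the standard library only; `write_quarantine`, the `raw_row`/`errors` column names, and the `quarantine.csv` file name are assumptions for illustration, not part of the original demo.

```python
import csv
import io
import json


def write_quarantine(quarantined, fh):
    """Write (raw_row, errors) pairs as CSV, with the errors JSON-encoded."""
    writer = csv.writer(fh)
    writer.writerow(["raw_row", "errors"])
    for raw_row, errors in quarantined:
        # Keep a few JSON-friendly keys; `loc` is a tuple and becomes a list.
        slim = [
            {"loc": list(e["loc"]), "type": e["type"], "msg": e["msg"]}
            for e in errors
        ]
        writer.writerow([raw_row, json.dumps(slim)])


# Same shape as `users_quarantined` in the output above.
quarantined = [
    (
        "100s,kim,jan@example.com",
        [
            {
                "type": "int_parsing",
                "loc": ("id",),
                "msg": "Input should be a valid integer, unable to parse string as an integer",
                "input": "100s",
            }
        ],
    )
]

# StringIO stands in for open("quarantine.csv", "w", newline="").
buffer = io.StringIO()
write_quarantine(quarantined, buffer)
print(buffer.getvalue())
```

Keeping the raw row intact in its own column means the record can be re-run through the model unchanged once the id has been fixed.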
This example can be extended to process more complex data.