Validate, Correct and Quarantine Data Using Pydantic

Here’s a demo of how pydantic can be used to validate, correct, and quarantine data.

Access the code and demo file here.

Consider a CSV file, users.csv, containing the following records:

id,name,email
1001,ana,ana@example.com
1002,isabel,isabel@example.com
1003,kris,kris@example.com
1004,bee,bee@ example.com
100s,kim,jan@example.com

Column id is expected to be an integer, name is expected to be a string, and email is expected to be a valid email address.

The data can be modeled using pydantic.

from pydantic import BaseModel, EmailStr

class User(BaseModel):
    id: int
    name: str
    email: EmailStr

EmailStr is a built-in pydantic type for validating email addresses. It depends on the email-validator library, which must be installed alongside pydantic.

pip install pydantic
pip install email-validator

Running the data through this model results in two errors: an invalid email for user bee, and an invalid id for user kim.

The email can be corrected easily by removing the space. To do this, a model_validator with mode="before" can be used. This means the function that corrects the email address runs before the field validators for int, str, and EmailStr.

from pydantic import BaseModel, EmailStr, model_validator


class User(BaseModel):
    id: int
    name: str
    email: EmailStr

    @model_validator(mode="before")
    @classmethod
    def correct_email(cls, data):
        # runs before field validation: remove spaces in the email address
        if isinstance(data, dict) and isinstance(data.get("email"), str):
            data["email"] = data["email"].replace(" ", "")
        return data

Unlike the incorrect email, the erroneous id is less straightforward to correct, so it’s best to quarantine this row and handle it later. The remaining valid records can be processed first.

To do this, pydantic's ValidationError exception can be caught to set aside the failing records along with their error details.

# import ValidationError (pydantic re-exports it from `pydantic_core`)
from pydantic import ValidationError


# ...

# define passed and quarantine lists
users_passed = []
users_quarantined = []

clines = 0
with open("users.csv") as inf:
    for i, line in enumerate(inf):
        if i == 0:
            # skip the header row
            continue
        clines += 1
        data = line.strip().split(",")
        try:
            user = User(id=data[0], name=data[1], email=data[2])
            # add passing records to the `passed` list
            users_passed.append(user)
        except ValidationError as e:
            q_data = ",".join(data)
            errors = e.errors()
            # add erroneous records to the `quarantine` list
            # together with the error details in `errors`,
            # which can be further processed to be more concise
            users_quarantined.append((q_data, errors))

Print some information about this flow:

print(f"Total rows processed: {clines}")
print(f"Total rows passed: {len(users_passed)}")
print(f"Total rows quarantined: {len(users_quarantined)}")

print()
print("----passed------\n", users_passed)
print()
print("----quarantined------\n", users_quarantined)
print()

This results in:

Total rows processed: 5
Total rows passed: 4
Total rows quarantined: 1

----passed------
 [User(id=1001, name='ana', email='ana@example.com'), User(id=1002, name='isabel', email='isabel@example.com'), User(id=1003, name='kris', email='kris@example.com'), User(id=1004, name='bee', email='bee@example.com')]

----quarantined------
 [('100s,kim,jan@example.com', [{'type': 'int_parsing', 'loc': ('id',), 'msg': 'Input should be a valid integer, unable to parse string as an integer', 'input': '100s', 'url': 'https://errors.pydantic.dev/2.7/v/int_parsing'}])]
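The error details are verbose, so they could be condensed before being stored or logged. The following is a minimal sketch in plain Python; `summarize_errors` is a hypothetical helper, and the sample `errors` list copies the relevant keys from the output above.

```python
def summarize_errors(errors):
    """Collapse pydantic-style error dicts into 'field: message' strings."""
    summary = []
    for err in errors:
        # `loc` is a tuple such as ('id',); join it into a dotted path.
        location = ".".join(str(part) for part in err["loc"])
        summary.append(f"{location}: {err['msg']}")
    return summary


# Error details as returned for the quarantined row (trimmed to the keys used here).
errors = [
    {
        "type": "int_parsing",
        "loc": ("id",),
        "msg": "Input should be a valid integer, unable to parse string as an integer",
        "input": "100s",
    }
]

summary = summarize_errors(errors)
print(summary)
```

The resulting strings can be stored next to the quarantined row in place of the full error dictionaries.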

This example can be extended to process more complex data.

This post is licensed under CC BY 4.0 by the author.