SQLAlchemy

05. Nov. 2024

SQLAlchemy

For an upcoming project, I want to store a large amount of data that is related to each other or dependent on each other. Data itself can be stored in a text or binary file without any problems (e.g. using pickle). However, if certain data records are to be searched for, it is advisable to use a database.

There are already modules integrated into Python for this (e.g. sqlite3). However, these modules lack user-friendliness, since you have to formulate and evaluate each data query yourself as an SQL query.

The module SQLAlchemy, on the other hand, provides a user-friendly interface between data structures and a database. The aim is to manipulate an object or class instance instead of SQL queries and then pass it to the database.

In the following, I summarize my first contact and what I have learned about SQLAlchemy.

Basic principles

SQLAlchemy can be used in two ways. Either you use SQLAlchemy ORM (Object-Relational Mapper), where the ORM provides patterns for data mapping between the database and classes, or you use SQLAlchemy Core, which is closer to the classic way of handling databases. While the ORM represents entries in a table as instances of a class, with Core you have to translate the database structure to objects in Python yourself.

In both cases, relationships between tables are identified and stored using primary keys and foreign keys. But if you want to access the reference object, you don't have to read it out of the database with ORM, but can access it directly via the object returned by SQLAlchemy. This will become clearer in the following example, where a small database for managing entries in a to-do list is created and filled with sample data. At the end of this article, I will also briefly explain how to create a comparable database structure in Core.

Setup

First, SQLAlchemy needs to be installed. This can be done using PIP as follows:

pip install SQLAlchemy

In the code, a connection to a database must then be created first. The database type is arbitrary, but a local database is suitable for example purposes:

from sqlalchemy import create_engine

engine = create_engine("sqlite:///database.db")

Or you can create a database in the cache, which does not create a file, but the data is not stored permanently:

from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")

These steps are the same for both ORM and Core.

ORM

It is worth noting that all classes defined in Python are declarative and thus automatically define the schema of a table. The class parameters and their data types specify the column name and data type (in SQLAlchemy). To define the classes as declarative, they are based on the class DeclarativeBase, and to specify the data types, type hints are used in accordance with PEP 484:

from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    mapped_column,
)
from typing import Optional
from datetime import datetime

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users_table"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(nullable=False, unique=True)

    def __repr__(self) -> str:
        return f"<User(id={self.id}, name={self.name})>"

class Todo(Base):
    __tablename__ = "todos_table"

    id: Mapped[int] = mapped_column(primary_key=True)
    description: Mapped[str] = mapped_column(nullable=False)
    done_date: Mapped[Optional[datetime]] = mapped_column(nullable=True)

    def __repr__(self) -> str:
        return f"<Todo(id={self.id}, desc={self.description}, done_date={self.done_date})>"

In the above code, two classes are defined that are translated into tables. The table name is set with __tablename__ and the mapped type hint (with Mapped[_T]) or the function mapped_column() is used to describe the properties of the table columns. The fact that a column should be a primary key (with primary_key=True), contain non-repeating values (with unique=True), or contain non-empty entries (with nullable=False) is shown here quite clearly. You can also specify in the type hints that entries are not mandatory (with Optional[_T]); for example, if the todo item is not yet completed, it does not yet have a completion date, which is indicated by the value null.

If you do not want to use type hints, you can also define the data types in the function mapped_column() as the first argument.

The __repr__(self) method makes it easier to read an instance of the class when it is output with print() at the command line, for example.

To create a database with its tables, the create_all() method of the Base class must be called, and you can insert the first data:

from sqlalchemy.orm import sessionmaker
from datetime import datetime

def main() -> None:
    Base.metadata.create_all(engine)

    Session = sessionmaker(bind=engine)

    with Session() as session:
        user_max = User(name="Max")
        user_kathi = User(name="Kathi")
        session.add(user_max)
        session.add(user_kathi)

        todo_1 = Todo(description="Write code", done_date=datetime.now())
        todo_2 = Todo(description="Tidy your room")
        session.add(todo_1)
        session.add(todo_2)

        session.commit()

        print(session.query(User).all())
        print(session.query(Todo).all())

if __name__ == "__main__":
    main()

If you now execute the above code, you will get the following result:

[<User(id=1, name=Max)>, <User(id=2, name=Kathi)>]
[<Todo(id=1, desc=Write code, done_date=2024-10-12 17:55:10.716922)>, <Todo(id=2, desc=Tidy your room, done_date=None)>]

As you can see, data can be created and stored quite easily. A session must always be created for this, which is also closed by the context manager with (otherwise you would have to do this yourself with session.close(), which is often forgotten). With session.add(), the new records are automatically inserted into the correct table, and with session.commit(), the changes are written to the database. The data is then queried using session.query(), where in this example all data should simply be displayed (with .all()). If you want to change an entry, you just have to change the parameter of the corresponding object, apply the change with session.commit() and update the object with session.refresh(). For example, if you want to change a user's name, you do this as follows:

...
    user_max.name = "Maximilian"
    session.commit()
    session.refresh(user_max)
...

And if you first have to read the user from the database, you can do this with a query that uses the .where() or .filter() addition to restrict the values to be returned:

...
    user_max = session.query(User).where(User.id == 1).first()
    # or
    user_max = session.query(User).where(User.name == "Max").first()
...

The user's entry would now be changed from <User(id=1, name=Max)> to <User(id=1, name=Maximilian)>.

Deleting an entry is done with session.delete(), where the entry to be deleted is passed as an object, and the deletion is applied with session.commit().

However, this data is not yet related to each other. This is changed by adding the foreign keys (with the argument ForeignKey()) and relationships (with the function relationship()) to the class parameters. The final code then looks like this:

from sqlalchemy import create_engine, ForeignKey
from sqlalchemy.orm import (
    sessionmaker,
    DeclarativeBase,
    Mapped,
    mapped_column,
    relationship,
)
from typing import List, Optional
from datetime import datetime

engine = create_engine("sqlite:///:memory:")

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users_table"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(nullable=False, unique=True)
    todos: Mapped[List["Todo"]] = relationship(back_populates="user")

    def __repr__(self) -> str:
        return f"<User(id={self.id}, name={self.name}, todos={len(self.open_todos)}/{len(self.todos)})>"

    @property
    def open_todos(self) -> List["Todo"]:
        return [t for t in self.todos if t.done_date is None]

class Todo(Base):
    __tablename__ = "todos_table"

    id: Mapped[int] = mapped_column(primary_key=True)
    description: Mapped[str] = mapped_column(nullable=False)
    done_date: Mapped[Optional[datetime]] = mapped_column(nullable=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("users_table.id"), nullable=False)
    user: Mapped["User"] = relationship(back_populates="todos")

    def __repr__(self) -> str:
        return f"<Todo(id={self.id}, desc={self.description}, user={self.user.name}, done_date={self.done_date})>"

def main() -> None:
    Base.metadata.create_all(engine)

    Session = sessionmaker(bind=engine)

    with Session() as session:
        user_max = User(name="Max")
        user_kathi = User(name="Kathi")
        session.add(user_max)
        session.add(user_kathi)

        todo_1 = Todo(description="Write code", user=user_max, done_date=datetime.now())
        todo_2 = Todo(description="Tidy your room", user=user_max)
        session.add(todo_1)
        session.add(todo_2)

        session.commit()

        print(session.query(User).all())
        print(session.query(Todo).all())

if __name__ == "__main__":
    main()

In this case, the class User has been extended by the parameter todos, which should contain a list of instances of the class Todo (with Mapped[List[“Todo”]]). Since this list does not contain any values itself, but is only intended to provide a simplified access option for todo items, no column is defined here either, but a relationship (with relationship(back_populates=“user”)). This may seem a bit cryptic at first glance. However, the function's argument only means that the parameter (or column) 'user' of the 'Todo' class should refer back to the User class.

Accordingly, the parameter user has also been added to the class Todo, which is of the type User (with Mapped[“User”]) and specifies the relationship corresponding to the class User (namely relationship(back_populates=“todos”)).

To save the reference between the classes User and Todo, the foreign key user_id is included as an additional parameter in the class Todo. Since data is actually stored here, the function mapped_column() is used again, and the fact that this is a foreign key is specified with ForeignKey(“users_table.id”). The foreign key here is the column id (or the parameter id of the class User) in the table users_table. Since each todo item must also be assigned to a user, nullable=False is set.

Now you can access the parameters of the other class in both the 'User' and 'Todo' classes without having to worry about how the corresponding data can be read from the database. This is clearly visible in the above example: e.g. in the property open_todos of the class User, where the parameter done_date of Todo is accessed; or in the method __repr__(self) of the class Todo, where the parameter name of User is accessed.

The output on the command line is therefore much more informative, but the code required to achieve this remains easy to understand:

[<User(id=1, name=Max, todos=1/2)>, <User(id=2, name=Kathi, todos=0/0)>]
[<Todo(id=1, desc=Write code, user=Max, done_date=2024-10-12 18:03:16.312209)>, <Todo(id=2, desc=Tidy your room, user=Max, done_date=None)>]

Core

Instead of defining classes that serve as models for tables in a database, you can also create the tables in SQLAlchemy in the traditional way, or directly.

Instead of the class Base, you need an object that manages the metadata and the schema:

from sqlalchemy import MetaData()

metadata = MetaData()

Now you can define the tables as variables and pass the metadata object when you create it, so that the metadata object can create the tables using its method create_all():

from sqlalchemy import (
    create_engine,
    ForeignKey,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    DateTime,
)

users_table = Table(
    "users_table",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String, nullable=False, unique=True),
)

todos_table = Table(
    "todos_table",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("description", String, nullable=False),
    Column("done_date", DateTime, nullable=True),
    Column("user_id", Integer, ForeignKey("users_table.id"), nullable=False),
)

def main() -> None:
    connection = engine.connect()
    metadata.create_all(engine)

if __name__ == "__main__":
    main()

As you can see, a connection to the database engine must be created here so that the metadata object can pass the instructions for creating the tables to the database. Now that all the tables have been created, there are no functions with which data can be generated, stored and read. You now have to define these yourself:

from datetime import datetime

def insert_user(name: str) -> None:
    query = users_table.insert().values(name=name)
    connection.execute(query)

def select_user(name: str) -> dict | None:
    query = users_table.select().where(users_table.c.name == name)
    result = connection.execute(query)
    d = result.fetchone()
    if d is None:
        return None
    return dict(zip((c.name for c in users_table.c), d))

def list_users() -> list:
    query = users_table.select()
    result = connection.execute(query)
    return [dict(zip((c.name for c in users_table.c), d)) for d in result.fetchall()]

def insert_todo(description: str, user_id: int, done_date: datetime = None) -> None:
    query = todos_table.insert().values(
        description=description, user_id=user_id, done_date=done_date
    )
    connection.execute(query)

def select_todo(id: int) -> dict | None:
    query = todos_table.select().where(todos_table.c.id == id)
    result = connection.execute(query)
    d = result.fetchone()
    if d is None:
        return None
    return dict(zip((c.name for c in todos_table.c), d))

def list_todos() -> list:
    query = todos_table.select()
    result = connection.execute(query)
    return [dict(zip((c.name for c in todos_table.c), d)) for d in result.fetchall()]

This is significantly more complex than the ORM approach. However, it does give you more control over the process.

Note that the code above uses (c.name for c in _TABLE_NAME_.c) to create a list of the column names, which is then passed to a dictionary as keywords using dict(zip(a, b)). This is a small syntactic hack that can be used to return the results of a query in a manageable way.

With the code above, the foreign key users_table.id is also stored as the table column user_id in todos_table. However, to access the corresponding user, you have to make a new query based on the user_id stored in todos_table to read the user's entry from users_table.

In addition, each query always accesses the connection, which means that it must be available to each query function. For the sake of simplicity, the connection is defined globally in the following code, but this approach is not recommended for security reasons:

from sqlalchemy import (
    create_engine,
    ForeignKey,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    DateTime,
)

from datetime import datetime

engine = create_engine("sqlite:///:memory:")
connection = engine.connect()

metadata = MetaData()

users_table = Table(
    "users_table",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String, nullable=False, unique=True),
)

def insert_user(name: str) -> None:
    query = users_table.insert().values(name=name)
    connection.execute(query)

def select_user(name: str) -> dict | None:
    query = users_table.select().where(users_table.c.name == name)
    result = connection.execute(query)
    d = result.fetchone()
    if d is None:
        return None
    return dict(zip((c.name for c in users_table.c), d))

def list_users() -> list:
    query = users_table.select()
    result = connection.execute(query)
    return [dict(zip((c.name for c in users_table.c), d)) for d in result.fetchall()]

todos_table = Table(
    "todos_table",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("description", String, nullable=False),
    Column("done_date", DateTime, nullable=True),
    Column("user_id", Integer, ForeignKey("users_table.id"), nullable=False),
)

def insert_todo(description: str, user_id: int, done_date: datetime = None) -> None:
    query = todos_table.insert().values(
        description=description, user_id=user_id, done_date=done_date
    )
    connection.execute(query)

def select_todo(id: int) -> dict | None:
    query = todos_table.select().where(todos_table.c.id == id)
    result = connection.execute(query)
    d = result.fetchone()
    if d is None:
        return None
    return dict(zip((c.name for c in todos_table.c), d))

def list_todos() -> list:
    query = todos_table.select()
    result = connection.execute(query)
    return [dict(zip((c.name for c in todos_table.c), d)) for d in result.fetchall()]

def main() -> None:
    metadata.create_all(engine)

    insert_user("Max")
    insert_user("Kathi")

    max = select_user("Max")

    insert_todo("Write code", max["id"], datetime.now())
    insert_todo("Tidy your room", max["id"])

    print(list_users())
    print(list_todos())

    connection.close()

if __name__ == "__main__":
    main()

The output on the command line shows the desired information, but compared to ORM it is incomplete and (in my opinion) generated with more complex code:

[{'id': 1, 'name': 'Max'}, {'id': 2, 'name': 'Kathi'}]
[{'id': 1, 'description': 'Write code', 'done_date': datetime.datetime(2024, 10, 12, 10, 4, 23, 249914), 'user_id': 1}, {'id': 2, 'description': 'Tidy your room', 'done_date': None, 'user_id': 1}]