Guide to LLM Model Formats, Conversion and Quantization
15. Dec. 2025
I have been running and experimenting with different types of AI models, including Stable Diffusion (SD) and large language models (LLMs). Due to my limited compute capabilities (a total of 96 GB of GPU memory), I cannot run all available models natively. In particular, the latest LLMs (as of December 2025) are too large for my hardware, so I have had to run quantized versions instead. Because the landscape of available model formats and quantization options has grown significantly, in this post I summarize what I have learned (maybe it is of use to someone).
Model File Formats
The five model-storage formats I have most commonly encountered are:
- bin: Old-School Catch-All
- pt: PyTorch’s Native Format
- safetensors: Secure, Fast, and Framework-Agnostic
- ggml: Lightweight CPU-Friendly Format (legacy llama.cpp format)
- gguf: The Modern llama.cpp Format (successor to ggml)
In this section, I summarize the background of these formats.
bin: Old-School Catch-All
Although bin is not technically a format, it is still a way of representing a model and is typical of early releases of older transformer models. Because the extension is generic, frameworks such as PyTorch or TensorFlow (depending on the origin) could store their model weights as a bin file.
After all, the bin extension simply means "binary file", and historically it was used as a generic container for model weights.
Although this format is still common and widely supported, since it is easy to save and load model weights this way, it requires a significant amount of memory to store. It also has no built-in safety guarantees and no standard structure (meaning two bin files might be organized completely differently inside).
Therefore it is mostly used for legacy models or quick manual checkpoints. Modern workflows prefer more structured formats.
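For a bin file of PyTorch origin (e.g., the pytorch_model.bin shipped with older Hugging Face releases), the contents can usually be inspected as below; this is a minimal sketch, assuming the file is indeed a pickled PyTorch state dict:
import torch

# A .bin of PyTorch origin is typically just a pickled state dict; this loads it
# and lists the stored tensors (weights_only=True avoids running pickled code).
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)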
pt: PyTorch’s Native Format
pt is typical of PyTorch training pipelines. It is the official PyTorch model save format, used for storing:
- model weights,
- optimizer states, and
- full training checkpoints.
It uses Python's pickle under the hood, and such files are sometimes also stored with the pth extension. This makes pt the official format, fully compatible with PyTorch, and able to represent the full training state (if saved that way). However, because it relies on pickle, it also allows arbitrary code execution, which means it is not safe to load from untrusted sources. Similar to bin, pt files can be large and slow to load.
Therefore, pt is mostly used for training models, building research prototypes, or sharing checkpoints across PyTorch teams.
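A minimal sketch of saving and restoring such a training checkpoint (the tiny model here is just a stand-in for a real network):
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                          # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters())

# Save a full training checkpoint: weights, optimizer state, and metadata.
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": 3,
}, "checkpoint.pt")

# Restore it later; weights_only=True rejects arbitrary pickled objects and
# mitigates (but does not remove) the pickle-related risk.
checkpoint = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])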
safetensors: Secure, Fast, and Framework-Agnostic
The safetensors format was created by the Hugging Face team and has become the format typically used for diffusion models, LLMs, and any model shared publicly. Unlike pt, the safetensors format is supported by a wider variety of frameworks, such as PyTorch, TensorFlow, JAX, etc.
Therefore, one can say that safetensors is a modern alternative to pt, and is designed around two goals:
- Safety (no pickle, no executable content, 100% safe to load from untrusted sources)
- Speed (zero-copy memory mapping, faster loading than pt)
However, safetensors is a slightly more rigid format than pt, meaning it is not as convenient for exotic checkpoint structures.
Therefore, it is mostly used when downloading models from the internet, distributing models safely, or running large models with high performance.
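A minimal sketch using the safetensors Python package (pip install safetensors); the tensor names and file name are arbitrary:
import torch
from safetensors.torch import save_file, load_file

# safetensors stores a flat mapping of tensor names to tensors (plus JSON
# metadata); there is no pickled code, so loading untrusted files is safe.
tensors = {
    "embedding.weight": torch.randn(10, 4),
    "lm_head.weight": torch.randn(4, 10),
}
save_file(tensors, "model.safetensors")

loaded = load_file("model.safetensors", device="cpu")
print(loaded["embedding.weight"].shape)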
ggml: Lightweight CPU-Friendly Format
The ggml format was created by Georgi Gerganov (for llama.cpp) and is optimized for CPU and small-device inference; the abbreviation stands for Georgi Gerganov Machine Learning. It supports quantized models (4-bit, 5-bit, 8-bit), meaning file sizes can be reduced considerably.
I would say that ggml was the breakthrough format since it allowed LLaMA-style models to run locally on laptops and even phones. Nowadays, LLMs are often quantized to dramatically reduce size and memory usage.
However, the ggml format is becoming outdated and is not standardized. Although it can be loaded faster than bin, pt or safetensors, it still loads slower than successor designs (such as gguf).
Therefore, ggml is mostly used for older llama.cpp versions or legacy models.
gguf: The Modern llama.cpp Format
Created by the maintainers of llama.cpp, gguf stands for GPT-Generated Unified Format (sometimes explained as GGML Unified Format). It is optimized for speed, metadata, quantization, and portability.
As a result gguf is the new recommended format for the llama.cpp ecosystem. It fixes many limitations of ggml and adds rich metadata, improved loading speed, and broad tooling support.
However, even gguf has its drawbacks. For example, it is not readable by older frameworks and requires model conversion from original weights.
Nonetheless, gguf has become widely accepted as the new standard: it allows running quantized LLMs on home PCs (and even on a Raspberry Pi) and is supported by the common local inference engines (e.g., Ollama, llama.cpp, LM Studio, KoboldCpp, etc.).
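To get an impression of the embedded metadata, the gguf Python package that ships with llama.cpp (pip install gguf) can be used to inspect a file; this is a sketch, and the file path is a placeholder:
from gguf import GGUFReader

# Open an existing gguf file and list its metadata keys and tensor layout.
reader = GGUFReader("/path/to/model-Q4_K_M.gguf")

for key, field in reader.fields.items():   # metadata key/value pairs
    print(key)

for tensor in reader.tensors:              # per-tensor name, shape, quant type
    print(tensor.name, tensor.shape, tensor.tensor_type)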
Honorable Mentions
Although I haven't used them, there are also other formats that may be worth mentioning.
ONNX (.onnx)
This is a cross-framework portable format supported by PyTorch, TensorFlow, C++, JavaScript, etc. It is popular for production deployments and optimizing models for inference (e.g., ONNX Runtime).
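For illustration, a hedged sketch of exporting a small PyTorch model to ONNX (the model here is a stand-in, not a real LLM):
import torch
import torch.nn as nn

# Stand-in model; a real deployment would export the actual network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
example_input = torch.randn(1, 16)

# Trace the model with an example input and write the ONNX graph to disk.
torch.onnx.export(model, example_input, "model.onnx",
                  input_names=["input"], output_names=["logits"])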
H5 (.h5 / .hdf5)
This is the legacy Keras/TensorFlow format. It is declining in popularity but still found in older projects.
.ckpt
This is a generic checkpoint file used across frameworks, but it is not standardized. It is common for older Stable Diffusion models, although that ecosystem has since adopted safetensors for its benefits.
Summary of Model Formats
Each of these file formats reflects a moment in the evolution of machine-learning practice:
- For training-heavy research, use pt.
- For sharing models publicly, use safetensors.
- For efficient local inference, use ggml or (preferably) gguf.
Quantization
Both ggml and gguf support quantization of models. Quantization stores the model weights in less precise data types to reduce the memory required to hold the model. It is therefore a tradeoff between model speed, size, and accuracy.
While full-sized models typically use floating-point representations (e.g., F16 for 16-bit or F32 for 32-bit floating-point values), quantization reduces the precision by using smaller representations. The suffixes on model files (Q4_K_M, Q6_K, or Q8_0) indicate how demanding or performant the model is.
The suffix can be split into up to four parts:
- Improved: an I prefix marks the newer "improved" quantizer schemes (the IQ quants).
- The quantization level: Q followed by a digit (e.g., Q4, Q6, Q8) indicates the number of bits used to store each quantized weight.
- The quantization format: K indicates grouped quantization (using a per-group scale and zero point), 0 indicates the older, ungrouped style, and 1 indicates another legacy quantization format.
- The quantization precision: S for small (fastest, lower accuracy), M for medium, and L for large (slowest, higher accuracy).
This means that Q4_K_M, for example, denotes a Quantized model with 4 bits per weight, using the Medium-precision/quality variant of the K scheme.
The quantization process does not simply round floating-point values to integers (that would map a weight of 0.327 to 0), since model weights are typically close to zero and would not make use of the full range of the integer type (e.g., 0 to 255 for an unsigned 8-bit integer). Therefore, a scale \(d\) and a zero-point offset \(m\) are used to make the best use of the quantized data type's range:
\[ w = d \cdot q + m\]
When restoring the weight \(w\) from the quantized value \(q\), the result is close to the original value, but not exactly the original value. For most applications this quantization loss is however acceptable.
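The following sketch illustrates this affine scheme on a single block of weights (the 8-bit width and the block size of 32 are illustrative choices, not the exact llama.cpp implementation):
import numpy as np

def quantize_block(w, bits=8):
    # Map the block onto integers 0 .. 2**bits - 1 using scale d and offset m.
    qmax = 2**bits - 1
    m = w.min()                               # zero-point offset
    d = (w.max() - m) / qmax                  # scale
    q = np.round((w - m) / d).astype(np.uint8)
    return q, d, m

def dequantize_block(q, d, m):
    # Restore approximate weights: w ≈ d * q + m (the formula above).
    return d * q.astype(np.float32) + m

block = 0.05 * np.random.randn(32).astype(np.float32)   # weights close to zero
q, d, m = quantize_block(block)
restored = dequantize_block(q, d, m)
print("max abs error:", np.abs(block - restored).max())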
Model Perplexity
It is possible to measure how much quantization degrades model accuracy; the metric used for this is called perplexity. Perplexity measures how well a language model predicts text: the lower the perplexity, the less "confused" the model is, and the closer it behaves to the original, unquantized version. When comparing quantization schemes, what matters most is the relative perplexity, i.e., the perplexity of the quantized model compared to its full-precision FP16 or FP32 counterpart. Running ./llama-quantize --help in the llama.cpp project prints a table listing all supported quantization variants along with their perplexity impact (the numbers below are reported for Llama-3-8B and Mistral-7B):
Allowed quantization types:
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
38 or MXFP4_MOE : MXFP4 MoE
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
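For reference, perplexity over a tokenized test text \(x_1, \dots, x_N\) is the exponential of the average negative log-likelihood; this is the standard definition, not something specific to llama.cpp:
\[ \mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right) \]
The "+… ppl" entries in the table above thus report how much this value increases relative to the unquantized baseline.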
As discussed here on Github, perplexity changes based on the size of the model or choice of quantization:
[Figure: perplexity vs. model size (GiB, log scale) for quantized LLaMA 7B/13B/30B/65B models]
Note that the x-axis (model size in GiB) is logarithmic. The circles on the graph show the perplexity of different quantization mixes, and the solid squares represent the original FP16 model (i.e., each line follows Q2_K → Q3_K → Q4_K → Q5_K → Q6_K → FP16). The different colors indicate the LLaMA variant used (7B in black, 13B in red, 30B in blue, 65B in magenta). The dashed lines are added for convenience, to allow a better judgement of how closely the quantized models approach the FP16 perplexity. As can be seen from this graph, generation performance as measured by perplexity is a fairly smooth function of quantized model size, and the quantization types allow the user to pick the best-performing quantized model given the limits of their compute resources (in terms of being able to fully load the model into memory, but also in terms of inference speed, which tends to depend on the model size).
Perhaps worth noting is that the perplexity of the 6-bit quantization (Q6_K) is within 0.1% of the original FP16 model.
Converting Models
Sometimes, models are made available on Hugging Face in a raw format that is too large for local execution. Therefore, it becomes necessary to convert the model from one format into another format. The conversion process involves:
- obtaining the original weights,
- loading them into Python, and
- exporting them using a converter script.
Here, llama.cpp has become my tool of choice, and in the following section, I explain how to convert models using llama.cpp.
Obtaining a Model
Typically, a Hugging Face model folder looks something like this:
model/
config.json
tokenizer.json / tokenizer.model / vocab files
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
Manually
The easiest way is to download the model via a browser: go to https://huggingface.co/models, select the desired model, click on "Files and versions", and download each file of the model manually. For larger models, this can be quite painful.
Git LFS
Alternatively, Git LFS (Large File Storage) can be used. First, LFS needs to be installed:
git lfs install
And then, the model repository can be cloned:
git clone https://huggingface.co/<user>/<model-name>
This downloads all files of the model automatically.
Hugging Face CLI
When logging in is required to download a model, the Hugging Face CLI can be used to download the model repository automatically. It can be installed like so:
pip install huggingface_hub
Then, you need to log in using:
huggingface-cli login
And finally the repository can be downloaded like so:
huggingface-cli download <user>/<model-name> --local-dir ./model
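The same download can also be scripted from Python via huggingface_hub (the repository ID below is a placeholder):
from huggingface_hub import snapshot_download

# Download every file of the model repository into ./model (repo id is a placeholder).
snapshot_download(repo_id="<user>/<model-name>", local_dir="./model")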
Conversion
Using the Downloaded Repository
First, the converter needs to be obtained. For gguf, the most widely used toolchain is llama.cpp, which can be downloaded and set up like so:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# if you want to use the Python converter
pip install -r requirements.txt
# if you want to quantize and/or run models
cmake -B build
cmake --build build --config Release
The conversion script can be run like so:
python3 convert_hf_to_gguf.py \
/path/to/model \
--outfile /path/to/output/model.gguf \
--outtype f16
Herein, /path/to/model is the folder that contains:
- config.json,
- the tokenizer files, and
- multiple safetensors files.
The script then automatically loads the config, reads all safetensor shards and merges them, and writes a floating-point gguf file.
At this point a "base" gguf file has been created (often f16), which may (depending on the model) be of huge size.
To get smaller files (e.g. Q4_K_M, Q5_0 etc.), one needs to quantize:
./llama-quantize \
/path/to/output/model.gguf \
/path/to/output/model-Q4_K_M.gguf \
Q4_K_M
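To sanity-check the result, the quantized file can be loaded directly with llama-cli; with the cmake commands shown earlier, the binaries end up in build/bin (prompt and token count here are arbitrary):
./build/bin/llama-cli \
-m /path/to/output/model-Q4_K_M.gguf \
-p "Explain quantization in one sentence." \
-n 128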
Docker
Alternatively, you can pull the official Docker image, which removes the need to install dependencies manually:
docker pull ghcr.io/ggml-org/llama.cpp:full
You can start a Docker container interactively and then use the same commands as above:
docker run --rm -it \
-v $HOME/llama-models:/models \
--entrypoint /bin/bash \
ghcr.io/ggml-org/llama.cpp:full
Alternatively, you can run just the conversion / quantization commands using docker:
docker run --rm -it \
-v $HOME/llama-models:/models \
ghcr.io/ggml-org/llama.cpp:full \
--convert /models \
--outfile OUTFILE \
--outtype OUTTYPE
Here, the /models parameter is the directory containing the model files (it may be replaced by a Hugging Face repository ID if the model has not yet been downloaded). The optional OUTFILE parameter is the path the converted/quantized model will be written to (by default it is derived from the input directory). The optional OUTTYPE parameter indicates the output format of the model and may be:
- f32 for float32,
- f16 for float16,
- bf16 for bfloat16,
- q8_0 for Q8_0,
- tq1_0 or tq2_0 for ternary quantization, and
- auto for the highest-fidelity 16-bit float type, depending on the first loaded tensor type.
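As a concrete (hypothetical) example, assuming the model folder from above is mounted at /models, conversion and quantization could be chained like this; the full image also exposes a --quantize entrypoint that forwards its arguments to llama-quantize, and the file names are placeholders:
# convert the mounted Hugging Face folder to an f16 gguf
docker run --rm -v $HOME/llama-models:/models \
ghcr.io/ggml-org/llama.cpp:full \
--convert /models --outfile /models/model-f16.gguf --outtype f16

# quantize the resulting file to Q4_K_M
docker run --rm -v $HOME/llama-models:/models \
ghcr.io/ggml-org/llama.cpp:full \
--quantize /models/model-f16.gguf /models/model-Q4_K_M.gguf Q4_K_M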
More information regarding llama.cpp and its Docker image may be found here.