.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/programmatic/data_preparation/data_preparation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end
        <sphx_glr_download_examples_programmatic_data_preparation_data_preparation.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_programmatic_data_preparation_data_preparation.py:

How to prepare data for training
================================

.. attention::

    This tutorial is only relevant for users who need to prepare their data
    from scratch, from several files, or for big datasets. If you already have
    your data in a common file format (like XYZ or an `ASE database`_), you can
    skip this tutorial and directly start training.

    .. _ASE database: https://ase-lib.org/ase/db/db.html

``metatrain`` can read training data from XYZ files, ASE databases, and also
from metatrain's :class:`metatrain.utils.data.dataset.DiskDataset` file format.
For small datasets (<10k structures), you can simply provide an XYZ file or an
ASE database to ``metatrain``, and it will handle the data loading for you.
Large datasets (>10k structures) may not fit into GPU memory. In such cases, it
is useful to pre-process the dataset, save it to disk, and load it on the fly
during training.

In this tutorial, we will show how to prepare data for training using three
different formats. You can choose the one that best fits your needs. We start
by importing the necessary packages.

.. GENERATED FROM PYTHON SOURCE LINES 30-43

.. code-block:: Python

    import subprocess
    from pathlib import Path

    import ase.io
    import numpy as np
    import torch
    from metatensor.torch import Labels, TensorBlock, TensorMap
    from metatomic.torch import NeighborListOptions, systems_to_torch

    from metatrain.utils.data.writers import DiskDatasetWriter
    from metatrain.utils.neighbor_lists import get_system_with_neighbor_lists

.. GENERATED FROM PYTHON SOURCE LINES 44-55

Create an XYZ training file (small datasets)
--------------------------------------------

First, we will show how to create an XYZ file with fields corresponding to the
target properties. On modern HPC systems, this format is suitable for datasets
up to around 1M structures. As an example, we will use 100 structures from a
file read by ASE_. Since files from reference calculations may be located in
different directories, we first create a list of all paths that we want to read
from. Here, for simplicity, we assume that all files are located in the same
directory.

.. _ASE: https://ase-lib.org/

.. GENERATED FROM PYTHON SOURCE LINES 55-58

.. code-block:: Python

    filelist = 100 * ["qm9_reduced_100.xyz"]

.. GENERATED FROM PYTHON SOURCE LINES 59-71

We will now read the structures using the ASE package. Check the ASE
documentation for more details on how to read different file formats. Instead
of creating the ``atoms`` object by reading from disk, you can also create an
:class:`ase.Atoms` object containing the chemical ``symbols``, ``positions``,
the ``cell`` and the periodic boundary conditions (``pbc``) by hand using its
constructor.

.. hint::

    If a property is not read by the :func:`ase.io.read` function, you can add
    custom scalar properties to the ``info`` dictionary. Vector properties
    (e.g. forces) can be added to the ``arrays`` dictionary. Tensor properties
    (e.g. stress) must be flattened before adding them to the ``arrays``
    dictionary.

.. GENERATED FROM PYTHON SOURCE LINES 72-89

.. code-block:: Python

    frames = []
    for i, fname in enumerate(filelist):
        atoms = ase.io.read(fname, index=i)
        n_atoms = len(atoms)

        # scalar
        atoms.info["U0"] = -100.0

        # vector
        atoms.arrays["forces"] = np.zeros((n_atoms, 3))

        # tensor
        atoms.arrays["my_tensor"] = np.zeros((n_atoms, 3, 3)).reshape(n_atoms, 9)

        frames.append(atoms)

    ase.io.write("data.xyz", frames)

.. GENERATED FROM PYTHON SOURCE LINES 90-101

.. note::

    The names of the added properties (``U0``, ``forces``, etc.) must be
    referenced correctly in the ``options.yaml`` file.
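For example, a minimal ``training_set`` fragment of ``options.yaml`` could
point the ``energy`` target at the ``U0`` field written above. This is only a
sketch; adapt the file name, key, and units to your data, and see the metatrain
documentation for the full options schema:

.. code-block:: yaml

    # Sketch of an options.yaml fragment (check the metatrain documentation
    # for the full schema). The target key must match the name used in
    # ``atoms.info`` when writing the XYZ file.
    training_set:
      systems:
        read_from: data.xyz
        length_unit: angstrom
      targets:
        energy:
          key: U0  # the scalar property added to atoms.info above
          unit: eV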
Create a ``DiskDataset`` (large datasets)
-----------------------------------------

In addition to the systems and targets (as above), we also save the neighbor
lists that the model will use during training. We first create the writer
object that will write the data to a zip file.

.. GENERATED FROM PYTHON SOURCE LINES 102-105

.. code-block:: Python

    disk_dataset_writer = DiskDatasetWriter("qm9_reduced_100.zip")

.. GENERATED FROM PYTHON SOURCE LINES 106-110

Then we loop over all structures, convert them to the internal torch format
using :func:`metatomic.torch.systems_to_torch`, compute the neighbor lists
using :func:`metatrain.utils.neighbor_lists.get_system_with_neighbor_lists`,
and write everything to disk using the writer's ``write()`` method.

.. GENERATED FROM PYTHON SOURCE LINES 111-138

.. code-block:: Python

    for i, fname in enumerate(filelist):
        atoms = ase.io.read(fname, index=i)
        system = systems_to_torch(atoms, dtype=torch.float64)
        system = get_system_with_neighbor_lists(
            system,
            [NeighborListOptions(cutoff=5.0, full_list=True, strict=True)],
        )
        energy = TensorMap(
            keys=Labels.single(),
            blocks=[
                TensorBlock(
                    values=torch.tensor([[atoms.info["U0"]]], dtype=torch.float64),
                    samples=Labels(
                        names=["system"],
                        values=torch.tensor([[i]]),
                    ),
                    components=[],
                    properties=Labels("energy", torch.tensor([[0]])),
                )
            ],
        )
        disk_dataset_writer.write([system], {"energy": energy})

    disk_dataset_writer.finish()

.. GENERATED FROM PYTHON SOURCE LINES 139-142

Alternatively, you can write the whole dataset at once, which might be more
efficient (but can also run into memory issues for very large datasets). We use
the same ``frames`` that we created above.

.. GENERATED FROM PYTHON SOURCE LINES 143-171

.. code-block:: Python

    disk_dataset_writer = DiskDatasetWriter("qm9_reduced_100_all_at_once.zip")

    systems = systems_to_torch(frames, dtype=torch.float64)
    systems = [
        get_system_with_neighbor_lists(
            system,
            [NeighborListOptions(cutoff=5.0, full_list=True, strict=True)],
        )
        for system in systems
    ]

    energy = TensorMap(
        keys=Labels.single(),
        blocks=[
            TensorBlock(
                values=torch.tensor(
                    [frame.info["U0"] for frame in frames], dtype=torch.float64
                ).reshape(-1, 1),
                samples=Labels.range("system", len(frames)),
                components=[],
                properties=Labels("energy", torch.tensor([[0]])),
            )
        ],
    )
    disk_dataset_writer.write(systems, {"energy": energy})
    disk_dataset_writer.finish()

.. GENERATED FROM PYTHON SOURCE LINES 172-187

The dataset is saved to disk. You can now provide it to ``metatrain`` as a
dataset to train from, simply by replacing your ``.xyz`` file with the newly
created zip file (e.g. ``read_from: qm9_reduced_100.zip``).

Create a ``MemmapDataset`` (large datasets, parallel filesystems)
-----------------------------------------------------------------

If your dataset is large and you are using a parallel filesystem (e.g. on an
HPC cluster), it is recommended to use a ``MemmapDataset`` instead of a
``DiskDataset``. The ``MemmapDataset`` stores the data inside memory-mapped
numpy arrays instead of a zip file. Reading from this format avoids I/O
bottlenecks, but it does not support spherical targets or storing neighbor
lists. As an example, we will use 100 structures from a dataset of carbon
structures. The numpy arrays must be saved inside a directory, using the
following format.

.. GENERATED FROM PYTHON SOURCE LINES 188-230

.. code-block:: Python

    structures = ase.io.read("carbon_reduced_100.xyz", index=":")

    root = Path("carbon_reduced_100_memmap/")
    root.mkdir()
    ns_path = root / "ns.npy"
    na_path = root / "na.npy"
    a_path = root / "a.bin"
    x_path = root / "x.bin"
    c_path = root / "c.bin"
    e_path = root / "e.bin"
    f_path = root / "f.bin"
    s_path = root / "s.bin"

    ns = len(structures)
    na = np.cumsum(np.array([0] + [len(s) for s in structures], dtype=np.int64))
    np.save(ns_path, ns)
    np.save(na_path, na)

    a_mm = np.memmap(a_path, dtype="int32", mode="w+", shape=(na[-1],))
    x_mm = np.memmap(x_path, dtype="float32", mode="w+", shape=(na[-1], 3))
    c_mm = np.memmap(c_path, dtype="float32", mode="w+", shape=(ns, 3, 3))
    e_mm = np.memmap(e_path, dtype="float32", mode="w+", shape=(ns, 1))
    f_mm = np.memmap(f_path, dtype="float32", mode="w+", shape=(na[-1], 3))
    s_mm = np.memmap(s_path, dtype="float32", mode="w+", shape=(ns, 3, 3))

    for i, s in enumerate(structures):
        a_mm[na[i] : na[i + 1]] = s.numbers
        x_mm[na[i] : na[i + 1]] = s.get_positions()
        c_mm[i] = s.get_cell()[:]
        e_mm[i] = s.get_potential_energy()
        f_mm[na[i] : na[i + 1]] = s.arrays["force"]
        s_mm[i] = -s.info["virial"] / s.get_volume()

    a_mm.flush()
    x_mm.flush()
    c_mm.flush()
    e_mm.flush()
    f_mm.flush()
    s_mm.flush()
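As a quick sanity check (not required for training), you can memory-map the
arrays back in read-only mode and slice out a single structure. This is plain
numpy: the cumulative atom counts stored in ``na.npy`` give the slice
boundaries of the per-atom arrays.

.. code-block:: Python

    # Re-open the memory-mapped arrays in read-only mode and reconstruct one
    # structure, to verify the layout written above.
    ns = int(np.load(ns_path))
    na = np.load(na_path)

    a_read = np.memmap(a_path, dtype="int32", mode="r", shape=(na[-1],))
    x_read = np.memmap(x_path, dtype="float32", mode="r", shape=(na[-1], 3))
    e_read = np.memmap(e_path, dtype="float32", mode="r", shape=(ns, 1))

    i = 42  # an arbitrary structure index
    numbers = a_read[na[i] : na[i + 1]]    # atomic numbers of structure i
    positions = x_read[na[i] : na[i + 1]]  # positions of structure i
    print(numbers.shape, positions.shape, float(e_read[i, 0]))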
.. GENERATED FROM PYTHON SOURCE LINES 231-240

The dataset is saved to disk. You can now provide it to ``metatrain`` as a
dataset to train from, simply by specifying the newly created directory as the
path from which to read the systems
(e.g. ``read_from: carbon_reduced_100_memmap/``). For example, you can use the
following options file:

.. literalinclude:: options.yaml
   :language: yaml

.. GENERATED FROM PYTHON SOURCE LINES 241-243

.. code-block:: Python

    subprocess.run(["mtt", "train", "options.yaml"])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    CompletedProcess(args=['mtt', 'train', 'options.yaml'], returncode=0)

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 15.096 seconds)

.. _sphx_glr_download_examples_programmatic_data_preparation_data_preparation.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: data_preparation.ipynb <data_preparation.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: data_preparation.py <data_preparation.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: data_preparation.zip <data_preparation.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_