Training a QM/MM+ΔMLP model with DeePMD-kit from AMBER simulation data
======================================================================

| Erika McCarthy\ :sup:`1`, Şölen Ekesan\ :sup:`1`, and Darrin M. York\ :sup:`1`
| :sup:`1`\ Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, NJ 08854, USA

Learning objectives
-------------------

- Perform QM/MM umbrella sampling simulations to obtain a reaction path
- Perform single point calculations to evaluate reference and target level energies and forces
- Package the data for training using dpamber
- Prepare input files for DeePMD-kit
- Train an initial model using DP-GEN simplify
- Refine a pretrained model
- Run QM/MM+ΔMLP simulations

Relevant literature
-------------------

- `Development of Range-Corrected Deep Learning Potentials for Fast, Accurate Quantum Mechanical/Molecular Mechanical Simulations for Chemical Reactions in Solution `__
- `DeePMD-kit v2: A software package for deep potential models `__
- `DeePMD-kit v3: A Multiple-Backend Framework for Machine Learning Potentials `__
- `DeePMD-GNN: A DeePMD-kit Plugin for External Graph Neural Network Potentials `__
- `DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models `__
- `MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields `__
- `Transferability of MACE Graph Neural Network for Range Corrected Δ-Machine Learning Potential QM/MM Applications `__

Introduction
------------

DeePMD-kit supports training for several model architectures, including Deep Potentials (DP) and graph neural networks (GNN). There are two general approaches for training a QM/MM+ΔMLP model, depending on your end goal. The first is a specific reaction potential (SRP) approach, in which you have one system of interest and simply want a model that allows you to perform long production sampling for that system and perhaps minor variations of it. The second is to train a more foundational model, in which case you may have many systems of interest to experiment with, and it may be worth spending more time upfront to train for enhanced transferability. A DP model may be desirable for the first approach, while a GNN could be better for the latter. The training data preparation procedure is agnostic to the type of model you wish to train, but the contents may vary. This tutorial uses a single system as an example, but one could apply the procedure to multiple systems and combine the training data for a foundational model approach. Differences in procedure for a DP and a MACE model will be highlighted.

Outline
-------

In this tutorial you will learn how to prepare QM/MM training data, train a QM/MM+ΔMLP model with DeePMD-kit, and then apply it to simulations. The preparation is agnostic to the type of model you wish to train, so the data could be used to train a Deep Potential (DP) or graph neural network (GNN) model. We will assume the low level, semi-empirical model is DFTB3 and the high level, ab initio target model is PBE0/6-31G*. Example input files have been taken from simulations of the HDV ribozyme phosphoryl transfer reaction. The associated files can be downloaded from `Files.tar `__. The tutorial is structured as follows:
1. Preparing training data from AMBER QM/MM simulations

   - `1.1 String simulations <#string>`__
   - `1.2 Production simulations <#prod>`__
   - `1.3 Single point calculations <#snglpt>`__
   - `1.4 Packaging the data <#dpamber>`__

2. Training QM/MM+ΔMLP models

   - `2.1 Generating input files for a Deep Potential <#input1>`__
   - `2.2 Generating input files for MACE <#input2>`__
   - `2.3 Machine files <#machine>`__
   - `2.4 Running DP-GEN simplify <#simplify>`__
   - `2.5 Starting from a pretrained model <#pretrain>`__
   - `2.6 Evaluating a model <#eval>`__

3. Running QM/MM+ΔMLP simulations

   - `3.1 QM/MM+ΔMLP simulations <#runsim>`__

1. Preparing training data from AMBER QM/MM simulations
-------------------------------------------------------

.. _string:

1.1 String simulations
~~~~~~~~~~~~~~~~~~~~~~

You will first need to perform umbrella sampling simulations to obtain the minimum free energy path using the low level potential. If your reaction is 1-dimensional, this is simply a linear interpolation from reactants to products; however, if your reaction contains multiple reaction coordinates, you must optimize the path using a string method. Here we will use the rapidly converging Surface Accelerated String Method (SASM) approach and assume the directory structure created by the N-dimensional free energy surface (NDFES) program available in AmberTools. This could technically be done with the high level potential (or any potential, for that matter), but most likely that sampling would be prohibitively expensive. Below is an example slurm script to run SASM simulations on the Expanse supercomputer. It is run from the base directory. Your directory structure should start as follows:

.. code-block:: bash

   user@cluster:~/YOUR/WORKING/DIR $ ls
   TEMPLATE_DFTB3.mdin  equil  nextSASM.sh  runString.slurm  TEMPLATE_PBE0.mdin  hdf5.slurm  reanalysis.slurm  template
   user@cluster:~/YOUR/WORKING/DIR $ ls template
   1ps.mdin  TEMPLATE.disang  qmmm.parm7  template.groupfile

You should already have equilibrated windows in the equil directory. Here are the contents of runString.slurm:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name="runString_HDV_DFTB3"
   #SBATCH -o slurm/slurm-%j.%a.out
   #SBATCH -e slurm/slurm-%j.%a.err
   #SBATCH --export=ALL
   #SBATCH --partition=compute
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=128
   #SBATCH --mem=200G
   #SBATCH --cpus-per-task=1
   #SBATCH --account=rut149
   #SBATCH --time=2-00:00:00
   #SBATCH --constraint=lustre

   module load amber/xtb/13may24

   export LAUNCH="srun --mpi=pmi2 -K1 -N1 -n128 -c1 --exclusive"
   export EXE="sander.MPI"

   if [ ! -d slurm ]; then
       mkdir slurm
   fi

   for iter in init {01..50}; do

       if [[ "${iter}" == "init" ]]; then
           itdir=init
           echo "Running init directory"
           for win in {01..32}; do
               cp equil/img${win}.rst7 init/init${win}.rst7
           done
       else
           itdir=it${iter}
       fi

       cd $itdir
       $LAUNCH $EXE -ng 32 -groupfile ../template/template.groupfile
       wait
       sleep 1

       ###################################
       #### Check for convergence failure
       ###################################
       FAILED=0
       for s in $(seq 32); do
           base=$(printf "img%02i" ${s})
           if grep -q "Convergence could not be achieved" ${base}.mdout; then
               FAILED=1
               echo "${base} failed SCF convergence!!!"
           else
               echo "SCF converged"
               continue
           fi
       done
       if [ "${FAILED}" == 1 ]; then
           echo "Detected a failed simulation"
           exit 1
       fi

       ###################################
       #### Check for any other failures
       ###################################
       echo ${PWD}
       if grep -q " 5. TIMINGS" *.mdout; then
           sleep 1
           cd ../.
           ./nextSASM.sh $iter 128
       else
           echo "run incomplete"
           exit 1
       fi
   done

The above slurm script calls a bash script that uses NDFES to analyze the simulations and generate the next directory of simulations. Here are the contents of nextSASM.sh:

.. code-block:: bash

   #!/bin/bash

   cit=$(printf %02d ${1#0})
   NCPU=$2
   export OMP_NUM_THREADS=${NCPU}
   export OPENBLAS_NUM_THREADS=${NCPU}

   ###################
   # General controls
   ###################
   template="./template"
   dirstr="it"
   disang=TEMPLATE.disang
   mdin="1ps.mdin"
   TEMP="--temp=298"
   pad="--pad=2"
   NSIM=32
   fil_analysis="analysis.${cit}.log"
   truncate -s0 $fil_analysis

   ####################################
   # MakeAna -> ndfes-path-analyzesims
   ####################################
   MAXEQ="--maxeq=0.25"
   SKIPG="--skipg"
   NPREV="--nprev=1000 --neqit=0"
   MAKEANA="ndfes-path-analyzesims.py --curit=${cit} -d ${template}/${disang} ${TEMP} ${MAXEQ} ${SKIPG} ${NPREV} ${pad}"

   ####################################
   # EXE -> ndfes_omp
   ####################################
   binwid="-w 0.15"
   bootstrap="--nboot 50"
   # Check to see if we have the OpenMP version of ndfes
   if test -x "$(command -v ndfes_omp)"; then
       export NDFESEXE="ndfes_omp"
   elif test -x "$(command -v ndfes.OMP)"; then
       export NDFESEXE="ndfes.OMP"
   elif test -x "$(command -v ndfes)"; then
       export NDFESEXE="ndfes"
   else
       echo "Could not find ndfes"
       exit 1
   fi
   MAKECHK="${NDFESEXE} --mbar ${binwid} ${bootstrap} -c metafile.all.chk metafile.all"

   #################################
   # MakeSims -> ndfes_path options
   #################################
   MAXIT="--maxit=100 --buffer"
   TOL="--distol=0.05 --angtol=0.5 --mlen=5"
   FIX0=""
   FIX1=""
   INTERP="--wavg=4 --wavg-niter=1 --minsize=10"
   SMOOTH=""
   PREDFC=""
   CPNEAR="--cp-nearest"
   FESCONV="--sasm"
   PATHPTS="--npathpts=50 --nsplpts=400"
   MAKESIMS="ndfes-path.OMP --curit=${cit} -d ${template}/${disang} -i ${template}/${mdin} ${MAXIT} ${TOL} ${FIX0} ${FIX1} ${INTERP} ${SMOOTH} ${PREDFC} ${CPNEAR} ${FESCONV} ${pad} ${PATHPTS}"

   ###################
   # Implementation
   ##################

   # Calculate previous and next iteration numbers
   next=$(printf %02d $((${cit#0} + 1)))
   old=$(printf %02d $((${cit#0} - 1)))

   # Establish directory names
   if [ ${cit} -lt 1 ]; then
       curdir=init
   else
       curdir=${dirstr}$cit
   fi
   newdir=${dirstr}$next
   if [ ${old} -lt 1 ]; then
       olddir=init
   else
       olddir=${dirstr}${old}
   fi

   #
   # Do not do analysis if the simulations did not run or did not complete
   #
   statfiles=($(for f in ${curdir}/*.stat; do if [ -e "${f}" ]; then echo "${f}"; fi; done))
   nstat=${#statfiles[@]}
   if [ "${nstat}" -gt 0 ]; then
       echo "Analysis skipped because the following files exist: ${statfiles[@]}"
       exit 1
   fi

   #
   # Prepare the dumpaves and metafile by running MAKEANA
   #
   echo "ANALYSIS START AT $(date)" > ${fil_analysis} 2>&1
   echo "RUNNING: ${MAKEANA}"
   time ${MAKEANA} >> $fil_analysis 2>&1
   echo "FINISHED MAKEANA AT $(date)" >> $fil_analysis 2>&1

   #
   # Run ndfes to get the FES
   #
   topdir=${PWD}
   cd ${curdir}/analysis
   echo "Running ndfes"
   ${MAKECHK} >> ${topdir}/$fil_analysis 2>&1
   echo "FINISHED NDFES AT $(date)" >> ${topdir}/$fil_analysis 2>&1
   cd ${topdir}

   #
   # Prepare a new directory for the next iteration by running MAKESIMS
   #
   if [ ! -d ${newdir} ]; then
       mkdir ${newdir}
   fi
   echo "RUNNING: ${MAKESIMS}"
   ${MAKESIMS} >> ${fil_analysis} 2>&1
   echo "FINISHED MAKESIMS AT $(date)" >> $fil_analysis 2>&1

   export OMP_NUM_THREADS=1

The runString.slurm script above is set to run 50 string iterations, but one may wish to run more if the path has not converged. Submit the job:

.. code-block:: bash

   sbatch runString.slurm
.. _prod:

1.2 Production simulations
~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you are satisfied with the convergence of the string, you will need to run production simulations on the final path to generate structures that will be used for training. We have observed that 500 samples per window is ideal. To generate 500 samples, run 25 ps simulations with ntwx=50 for a 1 fs time step or ntwx=25 for a 2 fs time step. Modify the runString.slurm script from "for iter in init {01..50}; do" to "for iter in 51; do" if iteration 51 is in fact your most recently generated string. Also, modify the mdin files from 1 ps to 25 ps. Here is an example:

.. container:: box

   .. code:: bash

      for i in {01..32}; do sed -i "s/nstlim = 1000/nstlim = 25000/g" it51/img${i}.mdin; done
      sed -i "s/for iter in init {01..50}; do/for iter in 51; do/" runString.slurm
      sbatch runString.slurm

.. _snglpt:

1.3 Single point calculations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the production simulations are done, you will need to reanalyze each frame of the trajectories at the high and low levels using single point calculations. These calculations will output the energies and forces as mden and mdfrc files. The energies, forces, and coordinates constitute the training data. Here is an example reanalysis.slurm script. The script should be in the base directory.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name="reanalysis_4Training_HDV"
   #SBATCH --output="slurm/reanalysis%a.slurmout"
   #SBATCH --error="slurm/reanalysis%a.slurmerr"
   #SBATCH --partition=shared
   #SBATCH --nodes=1
   #SBATCH --ntasks=16
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=48G
   #SBATCH --export=ALL
   #SBATCH -t 2-00:00:00
   #SBATCH --account=rut149
   #SBATCH --constraint=lustre
   #SBATCH --requeue
   #SBATCH --array=0-31

   set -u

   qmreg=':81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993'
   top=`pwd`

   for i in 51; do
       RC=($(seq -w 1 1 32))
       R=${RC[${SLURM_ARRAY_TASK_ID}]}
       echo R is ${R}

       cd it${i}
       if [ ! -d prune_traj ]; then mkdir prune_traj; fi
       cd prune_traj

       BASE=img${R}
       PBE0=img${R}_PBE0
       DFTB=img${R}_DFTB3
       parm=qmmm.parm7

       sed -e "s/XXXX/${R}/g" ${top}/TEMPLATE_DFTB3.mdin > img${R}_DFTB3.mdin
       sed -e "s/XXXX/${R}/g" ${top}/TEMPLATE_PBE0.mdin > img${R}_PBE0.mdin

       module load amber/20mlipi/05jun23
       mpirun -n 16 sander.MPI -O -p ../../template/${parm} -i ${DFTB}.mdin -c ../${BASE}.rst7 -o ene_${DFTB}.mdout -y ../${BASE}.nc -frc ene_${DFTB}.mdfrc -e ene_${DFTB}.mden -inf ${DFTB}.mdinfo
       module unload amber/20mlipi/05jun23

       module load amber/hfdf/05jun23
       mpirun -n 16 sander.MPI -O -p ../../template/${parm} -i ${PBE0}.mdin -c ../${BASE}.rst7 -o ene_${PBE0}.mdout -y ../${BASE}.nc -frc ene_${PBE0}.mdfrc -e ene_${PBE0}.mden -inf ${PBE0}.mdinfo

       if [[ $(grep -r "TIMINGS" ene_${DFTB}.mdout) ]] && [[ $(grep -r "TIMINGS" ene_${PBE0}.mdout) ]]; then
           echo "${BASE} finished correctly"
           module purge
           module load dpgen_Aug092023
           export OMP_NUM_THREADS=1
           export TF_INTER_OP_PARALLELISM_THREADS=1
           export TF_INTRA_OP_PARALLELISM_THREADS=1
           hl=ene_img${R}_PBE0
           ll=ene_img${R}_DFTB3
           out=img${R}_DFTB3.hdf5
           nc=../img${R}.nc
           dpamber corr --cutoff 6. --qm_region ${qmreg} --parm7_file ${top}/template/qmmm.parm7 --nc ${nc} --hl ${hl} --ll ${ll} --out ${out}
       else
           exit 1
       fi
       cd ${top}
   done

This will create a directory called prune_traj inside your production iteration directory. In this directory it will output forces and energies at the high and low level. The script is run as an array, and can be set up with a dependency on the production simulations (sbatch --dependency=afterok:JOBID reanalysis.slurm).
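Once the array job has finished, it is worth a quick check that every window completed both single point passes and produced an hdf5 file before moving on. Below is a minimal sketch of such a check; it assumes the it51 directory and the ene_img*_DFTB3/PBE0.mdout and img*_DFTB3.hdf5 names used in the script above, so adjust the paths to your own setup.

.. code-block:: python

   # Quick completeness check for the reanalysis array job.
   # Assumes the it51/prune_traj layout and file names used in reanalysis.slurm above.
   import os

   pdir = os.path.join("it51", "prune_traj")
   for w in [f"{i:02d}" for i in range(1, 33)]:
       ok = os.path.isfile(os.path.join(pdir, f"img{w}_DFTB3.hdf5"))
       for mdout in (f"ene_img{w}_DFTB3.mdout", f"ene_img{w}_PBE0.mdout"):
           path = os.path.join(pdir, mdout)
           # a single point run is complete once the TIMINGS section has been written
           ok = ok and os.path.isfile(path) and "TIMINGS" in open(path).read()
       print(w, "ok" if ok else "INCOMPLETE")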
The reanalysis script switches between two AMBER installations because DFTB3 and the PBE0 (HFDF) implementation are not available in the same build. Once the single point calculations are complete, it uses the dpamber program to read the low and high level energies and forces along with the coordinates and package them in hdf5 file format. It will carve out the coordinates of atoms within a 6 Å cutoff (--cutoff) of the QM region to save to the hdf5 file. The script uses template mdin files called TEMPLATE_DFTB3.mdin and TEMPLATE_PBE0.mdin, examples of which are given below. These files should be at the same level as the reanalysis script. Be sure to modify these and the QM region in the reanalysis script to reflect your system.

.. code-block:: bash

   DFTB3snglpt
   &cntrl
     ! IO =======================================
     irest       = 0     ! 0 = start, 1 = restart
     ntx         = 1     ! 1 = start, 5 = restart
     ntxo        = 1     ! read/write rst as formatted file
     iwrap       = 1     ! wrap crds to unit cell
     ioutfm      = 1     ! write mdcrd as netcdf
     imin        = 6
     ntmin       = 1
     ntpr        = 1
     ntwr        = 0
     ntwx        = 0
     ntwf        = 1     ! print mdfrc file
     ntwe        = 1     ! print mdene file
     ! DYNAMICS =================================
     nstlim      = 0     ! number of time steps
     dt          = 0.001 ! ps/step
     ntb         = 1     ! 1=NVT periodic, 2=NPT periodic, 0=no box
     ! TEMPERATURE ==============================
     temp0       = 298   ! target temp
     gamma_ln    = 5.0   ! Langevin collision freq
     ntt         = 3     ! thermostat (3=Langevin)
     ! PRESSURE ================================
     ntp         = 0     ! 0=no scaling, 1=isotropic, 2=anisotropic
     ! SHAKE ====================================
     ntc         = 2     ! 1=no shake, 2=HX constrained, 3=all constrained
     noshakemask = ":81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993" ! do not shake these
     ntf         = 1     ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
     ! MISC =====================================
     cut         = 10.0
     ifqnt       = 1
     ig          = -1
     nmropt      = 0
     lj1264      = 0
   /
   &ewald
     dsum_tol = 1.e-6
   /
   &qmmm
     qm_theory    = 'DFTB3'
     qmmask       = ':81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993'
     qmcharge     = 0
     spin         = 1
     qmshake      = 0
     qm_ewald     = 1
     qmmm_switch  = 1
     scfconv      = 1.e-10
     verbosity    = 0
     tight_p_conv = 1
     diag_routine = 0
     pseudo_diag  = 1
     dftb_maxiter = 100
   /
   &wt type = 'DUMPFREQ', istep1 = 8, /
   &wt type='END' /
   DISANG=imgXXXX.disang
   DUMPAVE=imgXXXX.dumpave
   LISTOUT=POUT
   LISTIN=POUT

.. code-block:: bash

   PBE0snglpt
   &cntrl
     ! IO =======================================
     irest       = 0     ! 0 = start, 1 = restart
     ntx         = 1     ! 1 = start, 5 = restart
     ntxo        = 1     ! read/write rst as formatted file
     iwrap       = 1     ! wrap crds to unit cell
     ioutfm      = 1     ! write mdcrd as netcdf
     imin        = 6
     ntmin       = 1
     ntpr        = 1
     ntwr        = 0
     ntwx        = 0
     ntwf        = 1
     ntwe        = 1
     ! DYNAMICS =================================
     nstlim      = 0     ! number of time steps
     dt          = 0.001 ! ps/step
     ntb         = 1     ! 1=NVT periodic, 2=NPT periodic, 0=no box
     ! TEMPERATURE ==============================
     temp0       = 298   ! target temp
     gamma_ln    = 5.0   ! Langevin collision freq
     ntt         = 3     ! thermostat (3=Langevin)
     ! PRESSURE ================================
     ntp         = 0     ! 0=no scaling, 1=isotropic, 2=anisotropic
     ! SHAKE ====================================
     ntc         = 2     ! 1=no shake, 2=HX constrained, 3=all constrained
     noshakemask = ":81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993" ! do not shake these
     ntf         = 1     ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
     ! MISC =====================================
     cut         = 10.0
     ifqnt       = 1
     ig          = -1
     nmropt      = 0
   /
   &ewald
     dsum_tol = 1.e-6
   /
   &qmmm
     qm_theory       = 'HFDF'
     hfdf_theory     = 'PBE0'
     hfdf_basis      = '6-31G*'
     qmmask          = ':81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993'
     qmcharge        = 0
     spin            = 1
     itrmax          = 50
     qmshake         = 0
     scfconv         = 1e-07
     verbosity       = 0
     hfdf_mempercore = 2000
     hfdf_restricted = T
     !hfdf_guess_mix = 0.5
     !-------------------------------------------------------
     ! Ambient-Potential Composite Ewald
     hfdf_ewald=T, qm_ewald=1
     hfdf_mm_percent=0.0, hfdf_qmmm_wswitch=0.0, qmmm_switch=0
     !-------------------------------------------------------
     !
     ! Cutoff QM/MM electrostatics; set use_pme=0 in &ewald
     !hfdf_ewald=F,qm_ewald=0
     !hfdf_mm_percent=1.0,hfdf_qmmm_wswitch=0.0,qmmm_switch=0
     !
     ! MM-charge Ewald w/ short-range cutoff QM/MM correction
     !hfdf_ewald=F,qm_ewald=1
     !hfdf_mm_percent=1.0,hfdf_qmmm_wswitch=0.0,qmmm_switch=1
     !
     ! MM-charge Ewald w/ short-range switched-cutoff QM/MM correction
     !hfdf_ewald=F,qm_ewald=1
     !hfdf_mm_percent=1.0,hfdf_qmmm_wswitch=2.0,qmmm_switch=1
     !
     ! Nam-style Mulliken-charge PME w/ short-range cutoff QM/MM correction
     !hfdf_ewald=F,qm_ewald=1
     !hfdf_mm_percent=0.0,hfdf_qmmm_wswitch=0.0,qmmm_switch=1
     !
     ! Nam-style Mulliken-charge PME w/ switched QM/MM correction
     !hfdf_ewald=F,qm_ewald=1
     !hfdf_mm_percent=0.0,hfdf_qmmm_wswitch=2.0,qmmm_switch=1
     !
   /
   &wt type = 'DUMPFREQ', istep1 = 25, /
   &wt type='END' /
   DISANG=imgXXXX.disang
   DUMPAVE=imgXXXX.dumpave
   LISTOUT=POUT
   LISTIN=POUT

.. _dpamber:

1.4 Packaging the data
~~~~~~~~~~~~~~~~~~~~~~

Now you should have a separate hdf5 file for each window in the prune_traj directory. You can get a sense of what's inside using Python:

.. code-block:: python

   ipython
   In [1]: import h5py
   In [2]: file = h5py.File('img01_DFTB3.hdf5')
   In [3]: keynames = list(file.keys())
   In [4]: print(keynames[0])
   C23H41HW179Mg1N3O23OW89P2mC106mCl0mH110mMg3mN51mNa3mO65mP8
   In [5]: print(file[keynames[0]].keys())
   KeysViewHDF5 ['nopbc', 'set.000', 'type.raw', 'type_map.raw']
   In [6]: print(file[keynames[0]]['set.000'].keys())
   KeysViewHDF5 ['aparam.npy', 'coord.npy', 'energy.npy', 'force.npy']

Each system key is a string listing the elements and the number of atoms of each element in the cut-out region. For each system string, there is an indicator of whether or not periodic boundary conditions are used (nopbc), the atom types present (type/type_map), and most importantly, the set of energies, forces, and coordinates contained in set.000. Additionally, set.000 contains aparam (atomic parameters), which is only needed by pairwise DPRc models to assign atoms to residues. The following hdf5.slurm script can be run from the base directory:
.. code-block:: bash

   #!/bin/bash
   #SBATCH --output="slurm/dpdata-%a.slurmout"
   #SBATCH --error="slurm/dpdata-%a.slurmerr"
   #SBATCH --nodes 1
   #SBATCH --ntasks-per-node 1
   #SBATCH --partition=debug
   #SBATCH -c 1
   #SBATCH --mem=8G
   #SBATCH --time=00:30:00
   #SBATCH --requeue
   #SBATCH --account=rut149
   #SBATCH --constraint=lustre

   module load dpgen_Aug092023
   export PYTHONPATH=/expanse/lustre/projects/rut149/emccart3/DPGEN_installs/Aug092023/local/lib/python3.9/site-packages
   echo "${PYTHONPATH}"

   it='it51'
   sysname='HDV'

   cat << EOF > hdf5.py
   from glob import glob
   import os
   import dpdata

   m = dpdata.MultiSystems()
   for dd in glob("${it}/prune_traj/*.hdf5"):
       if os.path.isfile(dd):
           m.from_deepmd_hdf5(dd)
   m.to_deepmd_hdf5("initial_${sysname}.hdf5")
   EOF

   python3.9 hdf5.py

Modify the iteration directory and system name in this script for your own system. This will produce an hdf5 file for each system you process, which can later be specified as the training data for DP-GEN.

2. Training a QM/MM+ΔMLP model with DeePMD-kit
----------------------------------------------

The following sections will detail the training procedure for a DP and a MACE model. The general procedure is the same, but the input files will differ. Choose the input file instructions for the model type you wish to train. Please note that when performing training with DeePMD-kit, it will create many symbolic links, so you should ensure your file system supports this. You will also need to have ssh keys set up on any remote cluster you wish to run training on so that passwordless login is enabled.

.. _input1:

2.1 Generating input files for a Deep Potential
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the sake of demonstration, let's imagine you have generated data for two separate systems and want to combine them for training; otherwise, just proceed with your single hdf5 file. You should have a directory containing the hdf5 files you want to combine:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ ls
   combine_hdf5s.py  writejson_simplify.py  initial_HDV_assoc.hdf5  initial_HDV_dissoc.hdf5

Here are the contents of combine_hdf5s.py:

.. code-block:: python

   from glob import glob
   import dpdata

   m = dpdata.MultiSystems()
   for dd in glob("*.hdf5"):
       m.from_deepmd_hdf5(dd)
   m.to_deepmd_hdf5("trainingset.hdf5")

After running combine_hdf5s.py, you will have a single hdf5 file called trainingset.hdf5. We will need information from this file to write the input file for DeePMD-kit, which requires json or yaml format. You can think of these file formats like a Python dictionary with keys and corresponding values. You have been provided a script called writejson_simplify.py to assist in generating the json file. This script requires functions from writejson.py and sysdef.py, also provided. Be sure to store those two scripts in a directory that is in your $PYTHONPATH, or in the working directory. Here are the arguments for writejson_simplify.py:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ python writejson_simplify.py -h
   usage: writejson_simplify.py [-h] -h5 HDF5 [-m MODEL] [-r RADIUS] [-os OUTSIMP] [-x EXCLUDE]

   Write training input json file for dpgen simplify.

   optional arguments:
     -h, --help            show this help message and exit
     -h5 HDF5, --hdf5 HDF5
                           hdf5 file containing the training data. All data should be combined already
     -m MODEL, --model MODEL
                           Type of model to train. Options are DP or MACE.
     -r RADIUS, --radius RADIUS
                           Radius around QM region for training.
                           Default: 6 angstrom
     -os OUTSIMP, --outsimp OUTSIMP
                           Output file name for simplify (passive learning) json file. Default: simplify.json
     -x EXCLUDE, --exclude EXCLUDE
                           json file with sel and exclusion lists from previous run of this script if you have it.
                           This will avoid calculating exclude types and sel again to save time.

The radius should be the same as what you used in dpamber to carve out the training region around the QM region. Let's generate an input file for a DP model. Run the script with the trainingset.hdf5 file as input:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ python writejson_simplify.py -h5 trainingset.hdf5 -m DP -r 6 -os simplify_DP.json

Let's get a sense of what parameters the script set based on your training data, and what parameters are defaults. Earlier, we saw that the hdf5 file contains many system keys consisting of a string of atom names and their corresponding counts. An "m" in front of an atom indicates that it is an MM atom. In simplify_DP.json, you will see a key called "type_map", which is simply the QM and MM atom types that occur in the training data. You will see sections for QM/QM interactions and for QM/MM interactions, each containing a "sel" and "exclude_types" key. The sel (short for selection) is the number of atoms of each type a particular atom type should expect to interact with during a simulation. This initializes the dimensions of tensors involved in training, so setting it too high could impede performance, but setting it too low could mean that if a higher number of atoms is encountered in a future application, the model might not conserve energy properly. This is more straightforward for the QM/QM interactions because the QM region is generally fixed, so the sel is a list of the total number of each QM atom type, with 0 for the MM types. You will notice that we are using a hybrid descriptor, se_a_ebd_v2 for QM/QM and se_atten_v2 for QM/MM. This allows us to set the QM/MM sel to an integer rather than a list, which is better for performance. The script will compute the maximum number of each MM atom type that can be seen by a QM atom and take the total. This number is padded such that it is divisible by 16 for performance. The exclude_types lists indicate which interactions should be excluded from the calculation, so in the QM/QM section those indices will correspond to MM-MM and QM-MM pairs, and in the QM/MM section they are QM-QM and MM-MM pairs. These system-specific keys are written to a file called exclude.json that the script can read if you don't want to recompute them.

One may wish to tune numb_steps (the number of training steps), init_pick_number, and iter_pick_number based on the amount of training data you have. Typically, init_pick_number will be larger than iter_pick_number because you want the first iteration of training to see a representative sample of the data. The iter_pick_number controls how many new samples are added to training to replace the initial samples, so by setting this lower than init_pick_number you avoid completely replacing the dataset and getting sharp fluctuations in the model. You may wish to make these larger for larger training sets to reduce the number of training iterations necessary to get through all of your data. The rest of the input corresponds to the training procedure and hyperparameters that are defaults. Please refer to the DeePMD-kit manual and literature for detailed descriptions of these parameters. Here is a summary of how key parameters appear in the input file (shown in bold).
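For orientation, the sketch below shows roughly how these pieces fit together in the generated file. It is an abbreviated, illustrative excerpt only: the type_map and the QM/QM sel values are read off the example system key shown in section 1.4, the QM/MM sel and cutoff are placeholders, and the exclude_types lists are left empty here; your writejson_simplify.py output is the authoritative version.

.. code-block:: python

   # Illustrative, abbreviated sketch of the model section of simplify_DP.json
   "model": {
       "type_map": ["C", "H", "HW", "Mg", "N", "O", "OW", "P",
                    "mC", "mCl", "mH", "mMg", "mN", "mNa", "mO", "mP"],
       "type_embedding": {"neuron": [12]},
       "descriptor": {
           "type": "hybrid",
           "list": [
               {   # QM/QM part
                   "type": "se_a_ebd_v2",
                   "sel": [23, 41, 179, 1, 3, 23, 89, 2, 0, 0, 0, 0, 0, 0, 0, 0],
                   "neuron": [25, 50, 100],
                   "axis_neuron": 12,
                   "exclude_types": []   # filled with the QM-MM and MM-MM type-index pairs
               },
               {   # QM/MM part
                   "type": "se_atten_v2",
                   "sel": 144,           # max MM neighbors of a QM atom, padded to a multiple of 16
                   "rcut": 6.0,          # matches the dpamber --cutoff / -r radius
                   "neuron": [25, 50, 100],
                   "axis_neuron": 12,
                   "exclude_types": []   # filled with the QM-QM and MM-MM type-index pairs
               }
           ]
       },
       "fitting_net": {"neuron": [240, 240, 240]}
   }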
.. math::

   E_{i} = \begin{cases} N_{2}(D_{i}) & i \in \text{QM} \\ N_{2}(D_{i}) - N_{2}(0) & i \in \text{MM} \end{cases}

**Fitting network** \(N_{2}\): 3 hidden layers with **240** neurons per layer. See **fitting_net** → **neuron** = **[240, 240, 240]** in `simplify_DP.json`.

.. math::

   D_{i} = (G_{i1})^{T} R_{i} R_{i}^{T} G_{i2}

Atomic descriptor (\(D_{i}\)): the input layer of the fitting network (\(N_{2}\)).

.. math::

   (G_{i1})_{j} = N_{e,2}(\{s(r_{ij}), A^{j}\})

**Embedding matrix** (\(G_{i1}\)): output of the filtering network. **Filtering/embedding network** (\(N_{e,2}\)): 3 hidden layers with **25**, **50**, and **100** neurons. See **descriptor** → **neuron** = **[25, 50, 100]**.

.. math::

   A^{j} = N_{t}(\text{onehot}(\alpha_{j}))

**Type embedding matrix** (\(A^{j}\)): output of the type embedding network. **Type embedding network** (\(N_{t}\)): 1 layer with **12** neurons. See **type_embedding** → **neuron** = **[12]**.

.. math::

   \begin{align} G_{i1} &\in \mathbb{R}^{N \times M_1},\quad M_1 = 100 \\ G_{i2} &\in \mathbb{R}^{N \times M_2},\quad M_2 = 12 \end{align}

**Truncated embedding matrix** (\(G_{i2}\)): the first **12** columns of the embedding matrix (\(G_{i1}\)). See **descriptor** → **axis_neuron** = **12**.

Finally, the last input will be train_backend, which is set to tensorflow for a DP model.

.. _input2:

2.2 Generating input files for MACE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we will generate an input file for a MACE model. If you only wish to train a DP model, you may skip this section. Using trainingset.hdf5 again as input, rerun writejson_simplify.py with -m set to MACE:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ python writejson_simplify.py -h5 trainingset.hdf5 -m MACE -r 6 -os simplify_MACE.json

You will notice that simplify_MACE.json contains some new parameters in the "model" section:

.. code-block:: python

   "r_max": 6.0,
   "sel": 128,
   "num_radial_basis": 8,
   "num_cutoff_basis": 5,
   "max_ell": 3,
   "interaction": "RealAgnosticResidualInteractionBlock",
   "num_interactions": 2,
   "hidden_irreps": "128x0e + 128x1o",
   "pair_repulsion": false,
   "distance_transform": "None",
   "correlation": 3,
   "gate": "silu",
   "MLP_irreps": "16x0e",
   "radial_type": "bessel",
   "radial_MLP": [64, 64, 64]

These are the default model parameters suggested by the MACE developers, which have been shown to perform well. Here is a summary of how key parameters relate to the input file. An atomic center is represented as follows:

.. math::

   \begin{align} \sigma_{i}^{(t+1)} &= (r_{i}, \theta_{i}, h_{i}^{(t+1)}) \\ h_{i}^{(t+1)} &= U_{t}(\sigma_{i}^{(t)}, m_{i}^{(t)}) \end{align}

\(\sigma\) is a node representing the atomic center \(i\) with coordinates \(r\), attributes \(\theta\) (such as atomic number), and features \(h\). The features are updated from the message \(m\) of the previous message passing iteration \(t\). The update function \(U\) is a linear function of the message. The message is constructed by pooling over single particle basis functions \(\phi\) constructed between neighbors \(i\) and \(j\):

.. math::

   \begin{align} \phi_{i,kl_{3}m_{3}}^{(t)} &= R_{k,l_{1}l_{2}l_{3}}^{(t)}(r_{ji}) Y_{l_1}^{m_1}(\hat{r}_{ji}) T_{kcL}^{(t)}(h_{j}^{(t)}, \theta_{i}^{(t)}, \theta_{j}^{(t)}) \\ R_{k,l_{1}l_{2}l_{3}}^{(t)}(r_{ji}) &= \text{MLP}(R_{n}(r_{ji}) f_{\text{cut}}(r_{ji})) \end{align}

The single particle basis functions contain **radial basis functions**, \(R_n\), which are set to be **8 Bessel functions** multiplied by a smooth polynomial cutoff, \(f_{\text{cut}}\), of order **5**.
The **MLP** contains 3 layers, each with **64** neurons. \(Y\) are spherical harmonics, and \(T\) is a node embedding function.

- See **model** → **radial_type** = **bessel**
- **model** → **num_radial_basis** = **8**
- **model** → **num_cutoff_basis** = **5**
- **model** → **radial_MLP** = **[64, 64, 64]**

The index \(t\) is the message passing iteration, which in our input has a maximum of **2**.

- See **model** → **num_interactions** = **2**

The message is constructed as:

.. math::

   \begin{align} m_{i,kLM}^{(t)} &= \bigoplus_{j \in N(i)} \phi_{k\nu L}^{(t)}(\sigma_{j}^{(t)}, \sigma_{i}^{(t)}) \\ &= \sum_{\eta} w_{i,k\eta L}^{(t)} \sum_{\upsilon} C^{LM}_{\eta_{\upsilon}} \prod_{\xi=1}^{\nu} \sum_{j \in N(i)} \phi_{k\nu_{\xi}L}^{(t)}(\sigma_{i}^{(t)}, \sigma_{j}^{(t)}) \end{align}

- The **correlation order**, \(\nu\), is set to **3**
- The number of uncoupled **channels**, \(k\), is set to **128**
- The index \(\eta\) is for a given symmetry
- The maximum order of symmetry in the message, \(L_{\text{max}}\), is set to **3**
- See **model** → **correlation** = **3**

Using these defaults, we will consider up to 3-body interactions. The 2-body interactions are computed explicitly, and the 3-body interactions are approximated using tensor products of lower body-order interactions from previous messages. This is a hallmark of the MACE model, which allows it to consider higher body-order interactions efficiently, unlike other more expensive GNNs, which may compute them explicitly. This also distinguishes it from the Deep Potential, which only considers 2-body interactions. Finally, the site energy is

.. math::

   E_{i}^{(t)} = \begin{cases} \sum W_{\text{readout}} h_{i}^{(t)} & \text{if } t < T \\ \text{MLP}_{\text{readout}}^{(t)}(\{h_{i}^{(t)}\}_{k}) & \text{if } t = T \end{cases}

| The readout function \(W\) is a **sigmoid linear unit function (silu)** except for the last message construction, in which case the readout function is an **MLP** with a single layer of **16 neurons**.
| See **model** → **gate** = **silu** and
| **model** → **MLP_irreps** = **16x0e**.

Lastly, the training backend will now be set to pytorch instead of tensorflow. Other parameters such as the form of the loss function, learning rate, training steps, output frequencies, and model deviation bounds are common between the DP and MACE inputs.

.. _machine:

2.3 Machine files
~~~~~~~~~~~~~~~~~

In addition to the training parameter file, you will need a "machine file." This is a yaml or json file that describes how the job will be run. DP-GEN relies on a package called dpdispatcher, which can automatically transfer your files to a remote cluster, launch the jobs, and transfer the data back to your local working directory. You can also initiate the training directly from the cluster if you prefer to transfer the data after the training is complete. The job submission information, such as slurm headers, environment modules, system architecture, and parallelization strategies, is contained in the machine file to be used by dpdispatcher. If you want to automatically sync data from your local machine to a cluster, ensure that you have ssh keys set up, as dpdispatcher will require passwordless login for automated file transfers. The user must customize the machine file for the desired cluster. Below are two examples for running DP-GEN on the AMAREL cluster at Rutgers.
The first uses a local context, where all training data is already on the cluster and will remain there; the second uses an SSH context to automatically transfer data between your local machine and a remote cluster.

.. code-block:: python

   {
       "train": [
           {
               "command": "dp",
               "machine": {
                   "context_type": "LocalContext",
                   "batch_type": "Slurm",
                   "local_root": "./",
                   "remote_root": "./"
               },
               "resources": {
                   "number_node": 1,
                   "cpu_per_node": 1,
                   "gpu_per_node": 1,
                   "queue_name": "gpu",
                   "custom_flags": ["#SBATCH --mem=64G", "#SBATCH --time=0-020:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"],
                   "group_size": 1,
                   "module_list": ["pydeepmdkit/deepmdkit/default"],
                   "envs": {
                       "OMP_NUM_THREADS": 1,
                       "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                       "TF_INTER_OP_PARALLELISM_THREADS": 1
                   }
               }
           }
       ],
       "model_devi": [
           {
               "command": "dp",
               "nchunks": 4,
               "true_error": "false",
               "machine": {
                   "context_type": "LocalContext",
                   "batch_type": "Slurm",
                   "local_root": "./",
                   "remote_root": "./"
               },
               "resources": {
                   "number_node": 1,
                   "cpu_per_node": 1,
                   "gpu_per_node": 1,
                   "queue_name": "p_lbsr,gpu",
                   "custom_flags": ["#SBATCH --mem=32G", "#SBATCH --time=0-10:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"],
                   "group_size": 1,
                   "module_list": ["pydeepmdkit/deepmdkit/default"],
                   "envs": {
                       "OMP_NUM_THREADS": 1,
                       "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                       "TF_INTER_OP_PARALLELISM_THREADS": 1
                   }
               }
           }
       ],
       "fp": [
           {
               "command": "dp",
               "machine": {
                   "context_type": "LocalContext",
                   "batch_type": "Slurm",
                   "local_root": "./",
                   "remote_root": "./"
               },
               "resources": {
                   "number_node": 1,
                   "cpu_per_node": 1,
                   "gpu_per_node": 1,
                   "queue_name": "p_lbsr,gpu",
                   "custom_flags": ["#SBATCH --mem=10G", "#SBATCH --time=0-10:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"],
                   "group_size": 1,
                   "module_list": ["pydeepmdkit/deepmdkit/default"],
                   "envs": {
                       "OMP_NUM_THREADS": 1,
                       "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                       "TF_INTER_OP_PARALLELISM_THREADS": 1
                   }
               }
           }
       ]
   }
code-block:: python { "train": [ { "command": "dp", "machine": { "context_type": "SSHContext", "batch_type": "Slurm", "local_root": "./", "remote_root": "/scratch/em989/mltrain/dpgen_workdir", "remote_profile": { "hostname": "amarel.rutgers.edu", "username": "em989" } }, "resources": { "number_node": 1, "cpu_per_node": 1, "gpu_per_node": 1, "queue_name": "gpu", "custom_flags" : ["#SBATCH --mem=16G", "#SBATCH --time=1-23:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"], "group_size": 1, "module_list": ["pydeepmdkit/deepmdkit/default"], "envs": { "OMP_NUM_THREADS": 1, "TF_INTRA_OP_PARALLELISM_THREADS": 1, "TF_INTER_OP_PARALLELISM_THREADS": 1 } } } ], "model_devi": [ { "command": "dp", "machine": { "context_type": "SSHContext", "batch_type": "Slurm", "local_root": "./", "remote_root": "/scratch/em989/mltrain/dpgen_workdir", "remote_profile": { "hostname": "amarel.rutgers.edu", "username": "em989" } }, "resources": { "number_node": 1, "cpu_per_node": 4, "gpu_per_node": 1, "queue_name": "p_lbsr,gpu", "custom_flags" : ["#SBATCH --mem=16G", "#SBATCH --time=1-23:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"], "group_size": 1, "module_list": ["pydeepmdkit/deepmdkit/default"], "envs": { "OMP_NUM_THREADS": 4, "TF_INTRA_OP_PARALLELISM_THREADS": 1, "TF_INTER_OP_PARALLELISM_THREADS": 1 } } } ], "fp": [ { "command": "dp", "machine": { "context_type": "SSHContext", "batch_type": "Slurm", "local_root": "./", "remote_root": "/scratch/em989/mltrain/dpgen_workdir", "remote_profile": { "hostname": "amarel.rutgers.edu", "username": "em989" } }, "resources": { "number_node": 1, "cpu_per_node": 1, "gpu_per_node": 1, "queue_name": "p_lbsr,gpu", "custom_flags" : ["#SBATCH --mem=16G", "#SBATCH --time=1-23:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"], "group_size": 1, "module_list": ["pydeepmdkit/deepmdkit/default"], "envs": { "OMP_NUM_THREADS": 1, "TF_INTRA_OP_PARALLELISM_THREADS": 1, "TF_INTER_OP_PARALLELISM_THREADS": 1 } } } ] } One should modify the module highlighed in green. The block highlighted in cyan means the data will be transfered to the cluster for training, and results tranferred back at every training iteration. If instead you are starting with your data already on the cluster, change "SSHContext" to "LocalContext", "remote_root" to "./", and remove the remote profile information. This should be done for the train, model_devi, and fp sections. The first section, train, controls the training step, which should be run on a GPU, expecially if training a MACE potential. We will train 4 models in parallel, so each will be alotted 1 GPU. The initial weights and biases are assigned stochastically, so the 4 models will deviate. The model deviation (model_devi) is computed to estimate the uncertainty in the forces for a given sample, and if it falls between our preset values of 0.08 and 0.25 eV/A, the sample will still be availble to be selected for future training. If it falls below 0.08 eV/A, it is deemed accurate and the sample is removed from the training set. Similarly, if it falls above 0.25 eV/A, it is deemed an outlier and removed. The first principles (fp) step labels structures for which the model is uncertain with their high level energies and forces for the next round of training. In simplify, that data is already available to us, so it is only a matter of grabbing it from our training set database rather than computing it. 
If you have a large amount of data, especially for training a MACE potential, it could be advisable to use distributed data parallelism. This means the initial model parameters are broadcast across multiple GPUs, each GPU trains on a mini-batch of the data, and the gradients are synchronized so that the model replicas remain identical across the mini-batches. Here is an example splitting the data for a given model between 2 GPUs (8 total for 4 stochastic models):

.. code-block:: python

   {
       "train": [
           {
               "command": "torchrun --nproc_per_node=2 --no-python --rdzv-backend=c10d --rdzv_endpoint=localhost:0 dp",
               "machine": {
                   "context_type": "SSHContext",
                   "batch_type": "Slurm",
                   "local_root": "./",
                   "remote_root": "/scratch/em989/mltrain/dpgen_workdir",
                   "remote_profile": {
                       "hostname": "amarel.rutgers.edu",
                       "username": "em989"
                   }
               },
               "resources": {
                   "number_node": 2,
                   "cpu_per_node": 2,
                   "gpu_per_node": 2,
                   "queue_name": "gpu",
                   "custom_flags": ["#SBATCH --mem=16G", "#SBATCH --time=1-23:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"],
                   "group_size": 1,
                   "module_list": ["pydeepmdkit/deepmdkit/default"],
                   "envs": {
                       "OMP_NUM_THREADS": 2,
                       "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                       "TF_INTER_OP_PARALLELISM_THREADS": 1
                   }
               }
           }
       ],
       ...

The model_devi and fp sections are not affected. For MACE models, which are based on the pytorch backend, we must use the command torchrun. The --no-python flag means that we are running the dp command as opposed to directly running a Python program. The flags --rdzv-backend=c10d and --rdzv_endpoint=localhost:0 were required in this case because the default port (29500) was already in use.

In the training base directory, create a directory called "computerprofiles" and put machine.json in it.

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ mkdir computerprofiles
   user@local:~/YOUR/WORKING/DIR $ mv PATH/2/machine.json computerprofiles

.. _simplify:

2.4 Running DP-GEN simplify
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now you can run DP-GEN simplify. Job progress will be logged to the console as well as to files called dpgen.log and dpdispatcher.log. It may be convenient to run the job from a tmux session so that you can detach and reattach for easier progress monitoring. One could also background the process and periodically check the log file. Run the following command to initiate simplify:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ dpgen simplify simplify_MACE.json computerprofiles/machine.json
   Reference
   ------------
   Please cite:
   Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E,
   DP-GEN: A concurrent learning platform for the generation of reliable deep learning based
   potential energy models, Computer Physics Communications, 2020, 107206.
   ------------
   Description
   ------------
   INFO:dpgen:start simplifying
   INFO:dpgen:=============================iter.000000==============================
   INFO:dpgen:-------------------------iter.000000 task 00--------------------------
   INFO:dpgen:-------------------------iter.000000 task 01--------------------------
   INFO:dpgen:first iter, skip step 1-5
   INFO:dpgen:-------------------------iter.000000 task 02--------------------------
   INFO:dpgen:first iter, skip step 1-5
   INFO:dpgen:-------------------------iter.000000 task 03--------------------------

Training iterations will run until either all of the samples are deemed accurate, or the data is exhausted. When the job is complete, you will see the output "INFO : finished".
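Before moving on, it can be useful to glance at the learning curves of the four trained models. Below is a minimal sketch, assuming the usual DP-GEN layout of iter.*/00.train/00?/lcurve.out and a commented header line naming the columns (the column names can vary between DeePMD-kit versions, so adjust to what your lcurve.out files actually contain):

.. code-block:: python

   # Minimal sketch: plot the training force RMSE from each model's lcurve.out.
   # Assumes iter.*/00.train/00?/lcurve.out files with a "#" header line naming the columns.
   from glob import glob
   import numpy as np
   import matplotlib.pyplot as plt

   for fname in sorted(glob("iter.*/00.train/00?/lcurve.out")):
       with open(fname) as f:
           header = f.readline().lstrip("#").split()     # column names from the header line
       cols = {name: i for i, name in enumerate(header)}
       if "step" not in cols or "rmse_f_trn" not in cols:
           continue                                      # column names differ; skip this file
       data = np.loadtxt(fname, ndmin=2)
       plt.plot(data[:, cols["step"]], data[:, cols["rmse_f_trn"]], label=fname)

   plt.xlabel("training step")
   plt.ylabel("force RMSE (eV/Å)")
   plt.yscale("log")
   plt.legend(fontsize=6)
   plt.savefig("lcurve_summary.png", dpi=300)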
.. _pretrain:

2.5 Starting from a pretrained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's consider the case where you have a pretrained model that you want to refine. In the case of simplify, this means you would have 4 existing model files from some previous training. They should be stored in the directory structure produced by dpgen. We will assume the following directory structure.

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ ls
   Foundation_model  Refined_model
   user@local:~/YOUR/WORKING/DIR $ ls Foundation_model
   iter.000000  iter.000001  iter.000002  iter.000003  iter.000004  iter.000005  iter.000006  iter.000007  iter.000008
   user@local:~/YOUR/WORKING/DIR $ ls Foundation_model/iter.000008/00.train
   000  001  002  003  data.init  data.iters  graph.000.pth  graph.001.pth  graph.002.pth  graph.003.pth

Let's say your final model files are in Foundation_model/iter.000008/00.train, and you want to train the refined model in the directory called Refined_model. Within Refined_model, you will again have a common directory, but the training dataset should correspond to the specific system with which you want to refine the training. Copy the input json file from the Foundation_model directory, modify the pick_data key to match your new hdf5 file name, and add the "training_iter0_model_path" key:

.. code-block:: python

   "pick_data": "common/MySystem.hdf5",
   "training_iter0_model_path": ["../Foundation_model/iter.000008/00.train/000",
                                 "../Foundation_model/iter.000008/00.train/001",
                                 "../Foundation_model/iter.000008/00.train/002",
                                 "../Foundation_model/iter.000008/00.train/003"],

You can now initiate simplify as you did in the previous section:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ dpgen simplify simplify_MACE.json computerprofiles/machine.json

.. _eval:

2.6 Evaluating a model
~~~~~~~~~~~~~~~~~~~~~~

Once you have completed simplify, let's check on the progress of our model. You have been provided multiple Python scripts to help you plot output files from DeePMD-kit. Let's say your training concluded after iteration 4. In this example we will compare the model deviation and parity plots for the training sets from the first and last iteration. Go to the training directory for iteration 1 and run dp model-devi and dp test, then do the same for iteration 4:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ cd iter.000001/00.train
   user@local:~/YOUR/WORKING/DIR/iter.000001/00.train $ dp model-devi -m graph.*.pth -s data.iters/iter.000000/02.fp/data.000
   user@local:~/YOUR/WORKING/DIR/iter.000001/00.train $ dp test -m graph.000.pth -s data.iters/iter.000000/02.fp/data.000 -n 100 -d results
   user@local:~/YOUR/WORKING/DIR/iter.000001/00.train $ cd ../../iter.000004/00.train
   user@local:~/YOUR/WORKING/DIR/iter.000004/00.train $ dp model-devi -m graph.*.pth -s data.iters/iter.000003/02.fp/data.000
   user@local:~/YOUR/WORKING/DIR/iter.000004/00.train $ dp test -m graph.000.pth -s data.iters/iter.000003/02.fp/data.000 -n 100 -d results
   user@local:~/YOUR/WORKING/DIR/iter.000004/00.train $ cd ../../

Move the plot_modeldevi.py and plot_parity.py scripts to your working directory. First let's plot the model deviations:
.. code-block:: bash

   plot_modeldevi.py -f iter.000001/00.train/model_devi.out -f iter.000004/00.train/model_devi.out -l iter1 -l iter4
   iter.000001/00.train/model_devi.out
   1520
   Percent accurate for ftrust=0.1 is 63.6%
   Percent accurate for ftrust=0.08 is 16.6%
   Percent accurate for ftrust=0.05 is 0.0%
   iter.000004/00.train/model_devi.out
   500
   Percent accurate for ftrust=0.1 is 90.8%
   Percent accurate for ftrust=0.08 is 53.2%
   Percent accurate for ftrust=0.05 is 0.0%

.. figure:: imgs/model_deviations.png
   :alt: Figure 1

   **Figure 1.** Deviation among the 4 models after the first simplify training iteration (blue) and the last (orange).

We have gone from the 4 models agreeing for ~17% of the training data to ~53%. The size of the training set has also decreased because samples that were previously deemed accurate have been discarded. We may wish to add more data to this training to achieve higher accuracy, but for the sake of this tutorial we will continue with the current model.

3. Running QM/MM+ΔMLP simulations
---------------------------------

.. _runsim:

3.1 QM/MM+ΔMLP simulations
~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we will apply the model to run simulations with the ΔMLP correction. Both DP and GNN models perform better on a GPU compared to a CPU, but PyTorch-based GNNs have a larger performance drop-off on CPUs than the TensorFlow-based DP. Assuming the potential will be used to run new string simulations, we will modify the runString.slurm script to run on a GPU node. Here are the contents of runString_ML.slurm written for the Expanse cluster:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name="runString_ML"
   #SBATCH -o slurm/slurm-%j.%a.out
   #SBATCH -e slurm/slurm-%j.%a.err
   #SBATCH --export=ALL
   #SBATCH --partition=gpu
   #SBATCH --gpus=8
   #SBATCH --nodes=2
   #SBATCH --cpus-per-task=1
   #SBATCH --ntasks-per-gpu=4
   #SBATCH --mem=224G
   #SBATCH --account=rut149
   #SBATCH --time=2-00:00:00
   #SBATCH --constraint=lustre

   module purge
   module load dpmace/amber24/default

   export OMP_NUM_THREADS=1
   export DP_INTRA_OP_PARALLELISM_THREADS=1
   export DP_INTER_OP_PARALLELISM_THREADS=1
   export GFORTRAN_UNBUFFERED_ALL=Y
   #PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

   export LAUNCH="srun --mpi=pmi2 -K1 -N2 -n32 -c1"
   export EXE="sander.MPI"

   if [ ! -d "slurm" ]; then
       mkdir slurm
   fi

   for iter in init {01..50}; do

       if [[ "${iter}" == "init" ]]; then
           itdir=init
           echo "Running init directory"
       else
           itdir=it${iter}
       fi

       cd $itdir
       if [[ "${itdir}" == "it10" ]]; then
           echo ${itdir}
       else
           $LAUNCH $EXE -ng 32 -groupfile ../template/template.groupfile
           wait
           sleep 1
       fi

       ###################################
       #### Check for convergence failure
       ###################################
       FAILED=0
       for s in $(seq 32); do
           base=$(printf "img%02i" ${s})
           if grep -q "Convergence could not be achieved" ${base}.mdout; then
               FAILED=1
               echo "${base} failed SCF convergence!!!"
           else
               echo "SCF converged"
               continue
           fi
       done
       if [ "${FAILED}" == 1 ]; then
           echo "Detected a failed simulation"
           exit 1
       fi

       ###################################
       #### Check for any other failures
       ###################################
       echo ${PWD}
       if grep -q " 5. TIMINGS" *.mdout; then
           sleep 1
           cd ../.
           ./nextSASM.sh $iter 128
       else
           echo "run incomplete"
           exit 1
       fi
   done

The nextSASM.sh script is unchanged. The template directory should now contain graph.000.pth if you trained a GNN, or graph.000.pb if you trained a DP. The 1ps.mdin template file should now contain a dprc namelist pointing to the model file:

.. code-block:: bash

   1ps_ml.mdin
   &cntrl
     ! IO =======================================
     irest       = 1     ! 0 = start, 1 = restart
     ntx         = 5     ! 1 = start, 5 = restart
     ntxo        = 1     ! read/write rst as formatted file
     iwrap       = 1     ! wrap crds to unit cell
     ioutfm      = 1     ! write mdcrd as netcdf
     imin        = 0
     ntmin       = 1
     ntpr        = 50
     ntwr        = 50
     ntwx        = 50
     ntwf        = 0
     ! DYNAMICS =================================
     nstlim      = 1000  ! number of time steps
     dt          = 0.001 ! ps/step
     ntb         = 1     ! 1=NVT periodic, 2=NPT periodic, 0=no box
     ! TEMPERATURE ==============================
     temp0       = 298   ! target temp
     gamma_ln    = 5.0   ! Langevin collision freq
     ntt         = 3     ! thermostat (3=Langevin)
     ! PRESSURE ================================
     ntp         = 0     ! 0=no scaling, 1=isotropic, 2=anisotropic
     ! SHAKE ====================================
     ntc         = 2     ! 1=no shake, 2=HX constrained, 3=all constrained
     noshakemask = ":81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993" ! do not shake these
     ntf         = 1     ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
     ! MISC =====================================
     cut         = 10.0
     ifqnt       = 1
     ig          = -1
     nmropt      = 1
   /
   &ewald
     dsum_tol = 1.e-6
   /
   &qmmm
     qm_theory    = 'DFTB3'
     qmmask       = ':81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993'
     qmcharge     = 0
     spin         = 1
     qmshake      = 0
     qm_ewald     = 1
     qmmm_switch  = 1
     scfconv      = 1.e-10
     verbosity    = 0
     tight_p_conv = 1
     diag_routine = 0
     pseudo_diag  = 1
     dftb_maxiter = 100
   /
   &wt type = 'DUMPFREQ', istep1 = 20, /
   &wt type='END' /
   DISANG=img.disang
   DUMPAVE=img.dumpave
   LISTOUT=POUT
   LISTIN=POUT
   &dprc
     idprc=1
     mask=":81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993"
     rcut = 6.0
     intrafile=""
     interfile(1)="../template/graph.000.pth"
   /

Modify the QM region, noshake mask, and dprc mask for your system. The mask in the dprc section should be your QM region. Change the file extension of the graph model file based on the type of model you trained.