Hands-On Session 11: Transfer learning: active learning to fine-tune a foundational machine learning potential to a specific target system for improved accuracy
================================================================================================================================================================

| Erika McCarthy\ :sup:`1`, Timothy Giese\ :sup:`1`, and Darrin M. York\ :sup:`1`
| :sup:`1`\ Laboratory for Biomolecular Simulation Research, Institute  
  for Quantitative Biomedicine and Department of Chemistry and Chemical
  Biology, Rutgers University, Piscataway, NJ 08854, USA

Learning Objectives 
------------------- 

- Generate training data for a ΔMLP from AMBER simulation data
- Write json style input files for training models with DeePMD-kit
- Verify the setup by briefly training a model a producing the loss output
- Use dptest to assess the accuracy of your model compared to a model after longer training
- Demonstrate that the free energy profile obtained from ΔMLP simulatins of MTR1 is more accurate than DFTB3  

Relevant literature
-------------------

  
Tutorial
--------

In this tutorial you will learn how to prepare QM/MM training data,
train a QM/MM+ΔMLP model with DeePMD-Kit, then validate the accuracy of the model. The preparation is agnostic to the
type of model you wish to train, so the data could be used to train a
Deep Potential (DP) or Graph Neural Network (GNN) model. We will assume
the low level, semi-empirical model is DFTB3, and the high level, ab
initio target model is PBE0/6-31G*. In this tutorial we will train a MACE graph neural network potential. Specificaly, we will refine a foundational ΔMACE model pre-trained for nucleic acid enzyme reactions using transfer learning. The first step will be to prepare the training data which includes the positions of atoms within 6 angstroms of the QM region, otherwise known as the environment, and their associated forces and energies at the DFTB3 and PBE0 level. The model will be trained to correct the difference between the models.

Generate training data from AMBER QM/MM simulaitons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You will first need to peform umbrella sampling simulations to obtain the minimum
free energy path using the low level potential. If your reaction is
1-dimensional, this is simply a linear interpolation from reactants to
products; however, if your reaction contains multiple reaction
coordinates, you must optimize the path using a string method. For this tutorial, we will use the 2-dimensional MTR1 MFEP obtained in Hands on session 5. For the sake of this tutorial, we will only perform training for a single umbrella window with a minimal number of training steps in order to obtain a model within the workshop time constraints. However, one could follow this procedure for all windows and increase the number of training steps for a real-world application.

25 ps of production sampling on the minimum free energy path have been performed for you. We will use the first umbrella window as an example for this exercise. In order to generate training data, we will perform single point calculations using the DFTB3 and PBE0 methods to output the forces and energies to .mdfrc and .mden file, respectively. The trajectory is recommended to contain 100 frames, but you will perform single point calculations for only the first two frames for time purposes. Download the inputs and navigate to the HandsOn11 directory:

.. code-block:: bash

  cd HandsOn11/GenData
  ls
  TEMPLATE_DFTB3.mdin  TEMPLATE_PBE0.mdin  it021  reanalyze_array.slurm  slurm  template

  ls it021
  img001.disang  img001.nc  img001.rst7

The necessary outputs from production are contained in it021. You have been provided two input files for single point calculations, one with DFTB3 and one with PBE0. 

.. tab-set::

    .. tab-item:: TEMPLATE_DFTB3.mdin

      .. code-block:: bash

        DFTB3
        &cntrl
        ! IO =======================================
              irest = 0       ! 0 = start, 1 = restart
              ntx = 1       ! 1 = start, 5 = restart
              ntxo = 1       ! read/write rst as formatted file
              iwrap = 1       ! wrap crds to unit cell
              ioutfm = 1       ! write mdcrd as netcdf
              imin = 6
              ntmin = 1
              ntpr = 1
              ntwr = 0
              ntwx = 0
              ntwf = 1       ! print mdfrc file
              ntwe = 1       ! print mdene file
        ! DYNAMICS =================================
            nstlim = 0    ! number of time steps
                dt = 0.001   ! ps/step
                ntb = 1       ! 1=NVT periodic, 2=NPT periodic, 0=no box
        ! TEMPERATURE ==============================
              temp0 = 298     ! target temp
          gamma_ln = 5.0     ! Langevin collision freq
                ntt = 3       ! thermostat (3=Langevin)
        ! PRESSURE  ================================
                ntp = 0       ! 0=no scaling, 1=isotropic, 2=anisotropic
        ! SHAKE ====================================
                ntc = 2       ! 1=no shake, 2=HX constrained, 3=all constrained
        noshakemask = ":69|@272-282,290-291,1975-1988"    ! do not shake these
                ntf = 1       ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
        ! MISC =====================================
                cut = 10.0
              ifqnt = 1
                ig = -1
            nmropt = 0
            lj1264 = 0
        /


        &ewald
          dsum_tol = 1.e-6
        /

        &qmmm
            qm_theory   = 'DFTB3'
                qmmask  = ':69|@272-282,290-291,1975-1988'
              qmcharge  = 1
                  spin  = 1
              qmshake  = 0
              qm_ewald  = 1
          qmmm_switch  = 1
              scfconv  = 1.e-10
            verbosity  = 0
          tight_p_conv  = 1
          diag_routine  = 0
          pseudo_diag  = 1
          dftb_maxiter  = 100
        /

        &wt type = 'DUMPFREQ', istep1 = 8, /
        &wt type='END' /
        DISANG=imgXXXX.disang
        DUMPAVE=imgXXXX.dumpave
        LISTOUT=POUT
        LISTIN=POUT

    .. tab-item:: TEMPLATE_PBE0.mdin

          .. code-block:: bash

            PBE0
            &cntrl
            ! IO =======================================
                  irest = 0       ! 0 = start, 1 = restart
                  ntx = 1       ! 1 = start, 5 = restart
                  ntxo = 1       ! read/write rst as formatted file
                  iwrap = 1       ! wrap crds to unit cell
                  ioutfm = 1       ! write mdcrd as netcdf
                  imin = 6
                  ntmin = 1
                  ntpr = 1
                  ntwr = 0
                  ntwx = 0
                  ntwf = 1
                  ntwe = 1
            ! DYNAMICS =================================
                nstlim = 0    ! number of time steps
                    dt = 0.001   ! ps/step
                    ntb = 1       ! 1=NVT periodic, 2=NPT periodic, 0=no box
            ! TEMPERATURE ==============================
                  temp0 = 298     ! target temp
              gamma_ln = 5.0     ! Langevin collision freq
                    ntt = 3       ! thermostat (3=Langevin)
            ! PRESSURE  ================================
                    ntp = 0       ! 0=no scaling, 1=isotropic, 2=anisotropic
            ! SHAKE ====================================
                    ntc = 2       ! 1=no shake, 2=HX constrained, 3=all constrained
            noshakemask = ":69|@272-282,290-291,1975-1988"    ! do not shake these
                    ntf = 1       ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
            ! MISC =====================================
                    cut = 10.0
                  ifqnt = 1
                    ig = -1
                nmropt = 0
            /


            &ewald
              dsum_tol = 1.e-6
            /

            &qmmm
                qm_theory   = 'quick'
                    qmmask  = ':69|@272-282,290-291,1975-1988'
                  qmcharge  = 1
                      spin  = 1
                  qmmm_int  = 1
                  qm_ewald  = 1
                  qmshake  = 0
                    itrmax  = 50
                  scfconv  = 1e-07
                verbosity  = 0
            /

            &quick
              method   = 'PBE0'
              basis    = '6-31G*'
            /

            &wt type = 'DUMPFREQ', istep1 = 25, /
            &wt type='END' /
            DISANG=imgXXXX.disang
            DUMPAVE=imgXXXX.dumpave
            LISTOUT=POUT
            LISTIN=POUT


Note that imin is set to 6 and nstlim is set to 0 for reanalysis. We will use QUICK to perform the high level calculations. Now take a look at reanalyze_array.slurm:

.. code-block:: bash

  #!/bin/bash
  #SBATCH --job-name="reanalysis_4Training"
  #SBATCH --output="slurm/reanalysis_%a.slurmout"
  #SBATCH --error="slurm/reanalysis_%a.slurmerr"
  #SBATCH --partition={{CPUpartition}}
  #SBATCH --nodes=1
  #SBATCH --ntasks=16
  #SBATCH --cpus-per-task=1
  #SBATCH --mem=60G
  #SBATCH --export=ALL
  #SBATCH -t 0-00:30:00
  #SBATCH --array=0       ##### Use for running 1st window
  ##SBATCH --array=0-31   ##### Use for running all windows


  {{amberload}}

  export QUICK_BASIS=${AMBERHOME}/AmberTools/src/quick/basis

  qmreg=':69|@272-282,290-291,1975-1988'

  top=`pwd`

  i=021

  RC=($(seq -w 1 1 32))            ##### If running all windows, img number is taken from the SLURM_ARRAY_TASK_ID
  R=0${RC[${SLURM_ARRAY_TASK_ID}]}
  echo R is ${R}

  cd it${i}
  if [ ! -d reanalysis ]; then mkdir reanalysis; fi
  cd reanalysis

  BASE=img${R}
  PBE0=img${R}_PBE0
  DFTB=img${R}_DFTB3
  parm=qmmm.parm7

  sed -e "s/XXXX/${R}/g" ${top}/TEMPLATE_DFTB3.mdin > img${R}_DFTB3.mdin
  sed -e "s/XXXX/${R}/g" ${top}/TEMPLATE_PBE0.mdin > img${R}_PBE0.mdin

  time mpirun -n 16 sander.MPI -O -p ../../template/${parm} -i ${DFTB}.mdin -c ../${BASE}.rst7 -o ene_${DFTB}.mdout -y ../${BASE}.nc -frc ene_${DFTB}.mdfrc -e ene_${DFTB}.mden -inf ${DFTB}.mdinfo

  time mpirun -n 16 sander.MPI -O -p ../../template/${parm} -i ${PBE0}.mdin -c ../${BASE}.rst7 -o ene_${PBE0}.mdout -y ../${BASE}.nc -frc ene_${PBE0}.mdfrc -e ene_${PBE0}.mden -inf ${PBE0}.mdinfo


  if [[ $(grep -r "TIMINGS" ene_${DFTB}.mdout) ]] && [[ $(grep -r "TIMINGS" ene_${PBE0}.mdout) ]]; then
    echo "${BASE} finished correctly"

    module purge
    {{dpgenload}}
    export OMP_NUM_THREADS=1
    export TF_INTER_OP_PARALLELISM_THREADS=1
    export TF_INTRA_OP_PARALLELISM_THREADS=1

    hl=ene_img${R}_PBE0
    ll=ene_img${R}_DFTB3
    out=img${R}_DFTB3.hdf5
    nc=../img${R}.nc
    dpamber corr --cutoff 6. --qm_region ${qmreg} --parm7_file ${top}/template/qmmm.parm7 --nc ${nc} --hl ${hl} --ll ${ll} --out ${out}

  else
    exit 1
  fi


Note that this slurm script is structured as a slurm array, so it can be easily be scaled for all umbrella windows by setting #SBATCH --array=0-31. For a sinlge window we just set it to 0. First, the script will write mdin files from the TEMPLATE files above. Then it will read in it021/img001.nc and perform single points at the DFTB3 and PBE0 levels. You must provide -frc and -mden flags here, along with setting  ntwf and ntwe equal to 1 in the mdin file to output forces and energies. Finally, we will use the dpamber corr functionality from DeePMD-kit to extract necessary data from our outputs and package it in hdf5 file format. This is a pickle file that can be easily read by python. Specifically, the environment, ie coordinates of atoms within rcut and their associated elements, will be extracted from the trajectory. The associated differences in forces and energies will also be extracted.

The job should take approximately 10 minutes. Submit the job:

.. code-block:: bash

  sbatch reanalyze_array.slurm

There should be a reanalysis directory created within it021. Navigate to the directory:

.. code-block:: bash

  cd it021/reanalysis

The DFTB3 outputs should appear quickly once the jobs begins. When the job is complete, the img001_DFTB3.hdf5 should be produced:

.. code-block:: bash

  ls
  ene_img001_DFTB3.mden   ene_img001_DFTB3.mdout  ene_img001_PBE0.mdfrc  img001_DFTB3.hdf5  img001_DFTB3.mdinfo  img001_PBE0.mdinfo
  ene_img001_DFTB3.mdfrc  ene_img001_PBE0.mden    ene_img001_PBE0.mdout  img001_DFTB3.mdin  img001_PBE0.mdin     quick.out

Let's inspect the contents of img001_DFTB3.hdf5:

.. code-block:: bash

  h5ls img001_DFTB3.hdf5
  C15H16HW94N13O2OW49mC83mCl0mH94mN36mNa6mO48mP7 Group
  C15H16HW95N13O2OW50mC83mCl0mH92mN37mNa6mO48mP6 Group

  h5ls img001_DFTB3.hdf5/C15H16HW94N13O2OW49mC83mCl0mH94mN36mNa6mO48mP7
  nopbc                    Dataset {SCALAR}
  set.000                  Group
  type.raw                 Dataset {463}
  type_map.raw             Dataset {13}

  h5ls img001_DFTB3.hdf5/C15H16HW94N13O2OW49mC83mCl0mH94mN36mNa6mO48mP7/set.000
  aparam.npy               Dataset {1, 463}
  coord.npy                Dataset {1, 1389}
  energy.npy               Dataset {1}
  force.npy                Dataset {1, 1389}

Each unique environment, or combination of atoms within rcut, is assigned to a group labeled by the element and number of occurances of the element in that structure. MM atoms have an "m" preceeding the element. Here we only have two structures. Looking further into the first entry, it contains information related to whether the system has periodic boundary conditions (nopbc), the data set (set.000), the elements types for every atom (type.raw), and the unique element types (type_map.raw). The number in {} indicates the length of the array. Looking further into set.000, we see that this is where the forces, energies, and coordinates are stored. There is also an entry called aparam indicating which residue the atoms belong to.

You have been provided the full data file for all of the frames. You can verify that the data you obtained matches is contained within that full data set:

.. attention::

  Insert download link


Generating input files for a DeePMD-Kit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


The input files for DeePMD-kit training are in .json or .yaml format. These are structured like a dictionary in python. There is one input file providing the training and hyperparameters (simplfy_MACE.json), and another specifying the computational resources that will ultimately be used to write a slurm script (machine_simplify.json). You have been provided all of the necessary inputs for training. 

.. code-block:: bash

  cd HandsOn11/Train/SRP
  ls 
  common  computerprofiles simplify_MACE.json

  ls computerprofiles
  machine_simplify.json

  ls common
  img001_DFTB3.hdf5

The img001_DFTB3.hdf5 file produced in the previous step has been placed in the common directory. Let's take a look at simplify_MACE.json:

.. code-block:: python

  {
    "default_training_param": {
      "model": {
        "type": "mace",
        "type_map": ["C","H","N","O","P","Mg","Ca","Na","Zn","S","HW","OW","mC","mCl","mH","mMg","mN","mNa","mO","mP","mS"],
        "r_max": 6.0,
        "sel": 128,
        "num_radial_basis": 8,
        "num_cutoff_basis": 5,
        "max_ell": 3,
        "interaction": "RealAgnosticResidualInteractionBlock",
        "num_interactions": 2,
        "hidden_irreps": "128x0e + 128x1o",
        "pair_repulsion": false,
        "distance_transform": "None",
        "correlation": 3,
        "gate": "silu",
        "MLP_irreps": "16x0e",
        "radial_type": "bessel",
        "radial_MLP": [
          64,
          64,
          64
        ]
      },
      "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 2000,
        "stop_lr": 1e-05
      },
      "loss": {
        "type": "ener",
        "start_pref_e": 1,
        "limit_pref_e": 100,
        "start_pref_f": 100,
        "limit_pref_f": 100,
        "start_pref_v": 0,
        "limit_pref_v": 0
      },
      "training": {
        "numb_steps": 2000,
        "disp_file": "lcurve.out",
        "disp_freq": 100,
        "save_freq": 1000,
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json"
      }
    },
    "type_map": ["C","H","N","O","P","Mg","Ca","Na","Zn","S","HW","OW","mC","mCl","mH","mMg","mN","mNa","mO","mP","mS"],
    "pick_data": "common/img001_DFTB3.hdf5",
    "sys_configs": [],
    "sys_batch_size": ["auto"],
    "init_pick_number": 20000,
    "iter_pick_number": 10000,
    "numb_models": 1,
    "training_reuse_iter": 2,
    "training_reuse_start_lr": 0.001,
    "training_reuse_old_ratio": "auto",
    "training_reuse_numb_steps": 100000,
    "training_reuse_start_pref_e": 1,
    "training_reuse_start_pref_f": 100,
    "dp_compress": false,
    "model_devi_f_trust_lo": 0.08,
    "model_devi_f_trust_hi": 2.0,
    "labeled": true,
    "init_data_sys": [],
    "init_data_prefix": "",
    "fp_task_min": 0,
    "mlp_engine": "dp",
    "one_h5": true,
    "train_backend": "pytorch"
  }

The DPGEN program from DeePMD-Kit, which we ultimately use for active learning, has three phases: training, model deviation, and first principles (fp) calculations (ie single point calculations at the high level). The default training parameters are for the training step, which we will focus on in these section. They define the model including hyperparameters, training procedure, elements, and rcut. The hyperparameters are the default parameters suggested by the MACE developers that have been shown to perform well. The remaining parameters correspond to the model deviation and fp steps. For a pool-based active learning approach, these parameters will be relatively simple because the simulations and single point calculations have already been performed. Notably, we set model_devi_f_trust_lo to 0.08 ev/angstrom and model_devi_f_trust_hi to 2.0 eV/angstrom. This means any samples in the pool with model deviation below model_devi_f_trust_lo or above model_devi_f_trust_hi will be removed from the pool. init_pick_number is the initial number of samples taken from the pool to initiate training. Since we only have a small data set of 100 samples for proof of concept, dpgen simplify will conclude after a single iteration. We will discuss these more in a later section. 

For the sake of the workshop, we will only train one replica of the model ("numb_models": 1) rather than a committee of four models. In addition, we reduce the number of training steps from 300,000 to 2,000 ("numb_steps": 2000) so that a rough model can be obtained in less than 10 minutes.  

Now let's take a look at machine_simplify.json:

.. code-block:: bash

  {

      "train": [
    {
        "command": "dp",
                  "machine": {
                  "context_type": "LocalContext",
                  "batch_type": "Slurm",
                  "local_root": "./",
                  "remote_root": "./"
              },
              "resources": {
                  "number_node": 1,
                  "cpu_per_node": 1,
                  "gpu_per_node": 1,
                  "queue_name": "{{GPUparition}}",
                  "custom_flags" : ["#SBATCH --mem=32G",
            "#SBATCH --time=0-00:30:00",
            "group_size": 1,
                  "module_list": ["pydeepmdkit/deepmdkit/default"],
                  "envs": {
                      "OMP_NUM_THREADS": 1,
                      "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                      "TF_INTER_OP_PARALLELISM_THREADS": 1
                  }

              }

    }
      ],
      

      "model_devi": [
    {
        "command": "dp",
        "machine": {
      "context_type": "LocalContext",
      "batch_type": "Slurm",
      "local_root": "./",
      "remote_root": "./"
        },
              "resources": {
                  "number_node": 1,
                  "cpu_per_node": 1,
                  "gpu_per_node": 1,
                  "queue_name": "{{GPUparition}}",
                  "custom_flags" : ["#SBATCH --mem=32G",
                                    "#SBATCH --time=0-00:30:00",
                  "group_size": 1,
                  "module_list": ["pydeepmdkit/deepmdkit/default"],
                  "envs": {
                      "OMP_NUM_THREADS": 1,
                      "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                      "TF_INTER_OP_PARALLELISM_THREADS": 1
                  }
              }
    }
      ],
      

      "fp": [
    {
        "command": "dp",
        "machine": {
      "context_type": "LocalContext",
      "batch_type": "Slurm",
      "local_root": "./",
      "remote_root": "./"
        },
        "resources": {
                  "number_node": 1,
                  "cpu_per_node": 1,
                  "gpu_per_node": 1,
                  "queue_name": "{{GPUparition}}",
                  "custom_flags" : ["#SBATCH --mem=10G",
                                    "#SBATCH --time=0-00:30:00",
                  "group_size": 1,
                  "module_list": ["pydeepmdkit/deepmdkit/default"],
                  "envs": {
                      "OMP_NUM_THREADS": 1,
                      "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                      "TF_INTER_OP_PARALLELISM_THREADS": 1
                  }
              }
    }
      ]
      
  }


DeePMD-Kit uses a functionality called DPDisbatcher to automatically write job submission scripts. It checks if the previous step has completed, and automatically launches the next one. The machine file is broken up into training, model deviation, and fp sections, and it contains all of the parameters that DPDisbatcher will use to write the slurm submission scripts. It is important that we select the GPU partition, as training is prohibitively slow on CPUs. 


Training a model
~~~~~~~~~~~~~~~~

Run the training:

.. code-block:: console

  {{deepmdload}}
  dpgen simplify simplify_MACE.json computerprofiles/machine_simplify.json 2>&1 | tee simplify.out
  INFO:dpgen:start simplifying
  INFO:dpgen:=============================iter.000000==============================
  INFO:dpgen:-------------------------iter.000000 task 00--------------------------
  INFO:dpgen:-------------------------iter.000000 task 01--------------------------
  INFO:dpgen:first iter, skip step 1-5
  INFO:dpgen:-------------------------iter.000000 task 02--------------------------
  INFO:dpgen:first iter, skip step 1-5
  INFO:dpgen:-------------------------iter.000000 task 03--------------------------

While training is running, in a seperate terminal, list the contents of the training directory. This directory will be a random string of letters and numbers, and will ultimately be deleted when the run is over. For example:

.. code-block:: bash

  ls c094a1e30c827c6dec19d3fb8fb4f84e4a82a30a
  000                                            f2ac8f58366c2ebde30048a701a8dc42df9ad071.sub
  c094a1e30c827c6dec19d3fb8fb4f84e4a82a30a.json  f2ac8f58366c2ebde30048a701a8dc42df9ad071.sub.run
  data.hdf5                                      f2ac8f58366c2ebde30048a701a8dc42df9ad071_flag_if_job_task_fail
  f2ac8f58366c2ebde30048a701a8dc42df9ad071.out   f2ac8f58366c2ebde30048a701a8dc42df9ad071_job_id

  ls c094a1e30c827c6dec19d3fb8fb4f84e4a82a30a/000
  input.json  input_v2_compat.json  lcurve.out  old  out.json  train.log

The file with the extension .sub is a bash script that sets the appropriate environmental variables and run the command contained in the file with extension .sub.run. These contentst are  

f2ac8f58366c2ebde30048a701a8dc42df9ad071.sub:

.. toggle:: f2ac8f58366c2ebde30048a701a8dc42df9ad071.sub

  .. code-block:: bash

    #!/bin/bash -l


    REMOTE_ROOT=$(readlink -f /data2/erika/Foundation_16wins/Refine_MTR2d_forTutorial/c094a1e30c827c6dec19d3fb8fb4f84e4a82a30a)
    echo 0 > $REMOTE_ROOT/f2ac8f58366c2ebde30048a701a8dc42df9ad071_flag_if_job_task_fail
    test $? -ne 0 && exit 1


    module load pydeepmdkit/deepmdkit/default


    export DPDISPATCHER_NUMBER_NODE=1
    export DPDISPATCHER_CPU_PER_NODE=48
    export DPDISPATCHER_GPU_PER_NODE=8
    export DPDISPATCHER_QUEUE_NAME=
    export DPDISPATCHER_GROUP_SIZE=4
    export OMP_NUM_THREADS=4
    export TF_INTRA_OP_PARALLELISM_THREADS=4
    export TF_INTER_OP_PARALLELISM_THREADS=4


    source $REMOTE_ROOT/f2ac8f58366c2ebde30048a701a8dc42df9ad071.sub.run

    cd $REMOTE_ROOT
    test $? -ne 0 && exit 1

    wait
    FLAG_IF_JOB_TASK_FAIL=$(cat f2ac8f58366c2ebde30048a701a8dc42df9ad071_flag_if_job_task_fail)
    if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch f2ac8f58366c2ebde30048a701a8dc42df9ad071_job_tag_finished; else exit 1;fi


f2ac8f58366c2ebde30048a701a8dc42df9ad071.sub.run:

.. toggle::

  .. code-block:: bash

    cd $REMOTE_ROOT
    cd 000
    test $? -ne 0 && exit 1
    if [ ! -f cd2ec51e915d347212bcdcc7fa87318ba4fcff7e_task_tag_finished ] ;then
      export CUDA_VISIBLE_DEVICES=0; ( /bin/sh -c '{ if [ ! -f model.ckpt.pt ]; then dp --pt train input.json; else dp --pt train input.json --restart model.ckpt; fi }'&&dp --pt freeze ) 1>>train.log 2>>train.log 
      if test $? -eq 0; then touch cd2ec51e915d347212bcdcc7fa87318ba4fcff7e_task_tag_finished; else echo 1 > $REMOTE_ROOT/f2ac8f58366c2ebde30048a701a8dc42df9ad071_flag_if_job_task_fail;tail -v -c 1000 $REMOTE_ROOT/000/train.log > $REMOTE_ROOT/f2ac8f58366c2ebde30048a701a8dc42df9ad071_last_err_file;fi
    fi &


The trainin should take less than 10 minutes. In the process the train.log and lcurve.out files should populate. When it is available, take a look at lcurve.out, which tracks the loss:

.. code-block:: bash 

   head c094a1e30c827c6dec19d3fb8fb4f84e4a82a30a/000/lcurve.out
   # step       rmse_trn    rmse_e_trn    rmse_f_trn   lr      
  # If there is no available reference data, rmse_*_{val,trn} will print nan
        1      1.86e+00      1.35e-04      1.86e-01    1.0e-03
      100      1.39e+00      2.88e-03      1.39e-01    1.0e-03
      200      9.28e-01      5.43e-04      9.26e-02    7.9e-04
      300      8.12e-01      2.91e-04      8.11e-02    6.3e-04
      400      7.78e-01      1.46e-04      7.78e-02    5.0e-04
      500      7.04e-01      1.85e-04      7.04e-02    4.0e-04
      600      6.11e-01      9.11e-05      6.10e-02    3.2e-04
      700      6.19e-01      1.01e-05      6.19e-02    2.5e-04

The columns correspond to the training step (here we will run 2000 steps), the overall trainin error, the error in the energy term, the error in the force term, and the learning rate, which scales over the training. 

When the training has completed, intermediate files will be removed, and the outputs will be copied to iter.000001/00.train. The model deviation step will eventually fail because we are only training a single model, so these errors can be ignored once the training has completed. Navigate to the trainining directory and verify that your training has concluded sucessfully:

.. code-block:: bash

  cd iter.000001/00.train
  ls
  000  data.hdf5  data.iters  graph.000.pth

  cd 000
  ls 
  checkpoint  frozen_model.pth  input.json  lcurve.out  model.ckpt.pt  old  train.log

  grep -r "average training time:" train.log
  [2026-02-04 10:38:17,404] DEEPMD INFO    average training time: 0.1061 s/batch

  tail lcurve.out
  1100      6.61e-01      4.08e-04      6.56e-02    1.0e-04
  1200      6.02e-01      1.18e-04      6.01e-02    7.9e-05
  1300      6.61e-01      2.55e-04      6.59e-02    6.3e-05
  1400      4.82e-01      1.12e-04      4.81e-02    5.0e-05
  1500      5.53e-01      2.48e-04      5.50e-02    4.0e-05
  1600      5.72e-01      7.40e-05      5.72e-02    3.2e-05
  1700      6.64e-01      2.98e-04      6.61e-02    2.5e-05
  1800      5.62e-01      1.56e-04      5.61e-02    2.0e-05
  1900      5.70e-01      1.95e-05      5.70e-02    1.6e-05
  2000      5.97e-01      1.26e-04      5.97e-02    1.3e-05


We see that our model has been stored in graph.000.pth, the average timing has been reported in train.log, and see have reached 2000 steps in lcurve.out. 

Even with a short training, we should see that the loss has decreased. Plot the total loss with xmgrace:

.. code-block:: bash

  {{xmgraceload}}
  xmgrace -block lcurve.out -bxy 1:2


.. image:: /_static/files/WorkshopTutorials/2026_Amber_Workshop_SSB/Day5/lcurve_SRP.png
    :width: 600px


Model validation
~~~~~~~~~~~~~~~~

Now we will use dp test to validate the training set as well as a test set. For the sake of computational expense, we will use data from the second umbrella as the test set. DP test will output the predicted force and energy corrections compared to the actual values stored in the hdf5 file. Download and navigate to the dptest directory where you have been provided a script called test_models.sh:

.. code-block:: bash

  pwd 
  HandsOn11/Train/SRP

  cd dptest

Here are the contents of test_models.slurm:

.. code-block:: console

  #!/bin/bash
  #SBATCH --partition={{GPUpartition}}
  #SBATCH --job-name="model_eval"
  #SBATCH --output=dptest.out
  #SBATCH --error=dptest.err
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=1
  #SBATCH --gpus-per-node=1
  #SBATCH --time=0-00:30:00
  #SBATCH --mem=8G


  {{deepmdload}}

  # Training set
  dp test -m ../iter.000001/00.train/graph.000.pth -s ../common/img001_DFTB3.hdf5 -d parity_TrainingSet  2>&1 | tee dptest_TrainingSet.out
  # Test set
  dp test -m ../iter.000001/00.train/graph.000.pth -s ../common/img002_DFTB3.hdf5 -d parity_TestSet  2>&1 | tee dptest_TestSet.out


Run dp test, This should take less than 5 minutes.

.. code-block:: bash

  sbatch test_models.slurm

The errors for every sample will be written to standard output. When the job is complete, take a look at the outputs:

.. code-block:: bash

  head parity_TrainingSet.f.out
  # ../common/img001_DFTB3.hdf5#/C15H16HW100N13O2OW51mC82mCl0mH97mN38mNa6mO46mP6: data_fx data_fy data_fz pred_fx pred_fy pred_fz
  -2.466142177581787109e-05 -1.112073659896850586e-04 1.159161329269409180e-04 3.054524768231203780e-06 3.561235189408762380e-06 -2.306260057594045065e-06
  -1.084312796592712402e-03 -1.249641180038452148e-03 1.153349876403808594e-03 3.313105116831138730e-05 -1.401095214532688260e-04 4.553010512609034777e-05
  6.721727550029754639e-04 -2.709150314331054688e-03 -1.417696475982666016e-04 -7.833027048036456108e-04 6.616767495870590210e-04 -9.980606846511363983e-04
  -4.771947860717773438e-04 2.542465925216674805e-03 -1.394093036651611328e-03 1.910269027575850487e-03 -2.701113931834697723e-03 2.913397504016757011e-03
  -5.411714315414428711e-03 1.503229141235351562e-03 3.334641456604003906e-03 4.188372986391186714e-05 -3.068845020607113838e-03 9.139326866716146469e-04
  3.090775012969970703e-02 5.400029942393302917e-03 -1.879560947418212891e-02 1.384735107421875000e-02 -3.181982785463333130e-02 1.924343779683113098e-02
  -1.811271905899047852e-02 8.609175682067871094e-03 4.196822643280029297e-03 -2.041579410433769226e-02 -7.022037636488676071e-03 -9.910331107676029205e-03
  -3.618022799491882324e-02 -3.521728515625000000e-02 -2.010071277618408203e-02 4.838545341044664383e-03 -2.629999816417694092e-02 3.696813806891441345e-02
  4.276752471923828125e-03 -9.536743164062500000e-06 -3.337264060974121094e-03 1.505715423263609409e-03 -1.232840149896219373e-04 -3.571693378034979105e-05


  head parity_TrainingSet.e.out
  # ../common/img001_DFTB3.hdf5#/C15H16HW100N13O2OW51mC82mCl0mH97mN38mNa6mO46mP6: data_e pred_e
  -3.903901417748034146e+04 -3.903902587962966936e+04
  # ../common/img001_DFTB3.hdf5#/C15H16HW100N13O2OW52mC82mCl0mH94mN37mNa6mO46mP7: data_e pred_e
  -3.903900805880523694e+04 -3.903890305559826811e+04
  # ../common/img001_DFTB3.hdf5#/C15H16HW102N13O2OW47mC85mCl0mH94mN36mNa5mO46mP7: data_e pred_e
  -3.903941908546729246e+04 -3.903946912129033444e+04
  # ../common/img001_DFTB3.hdf5#/C15H16HW102N13O2OW55mC88mCl0mH96mN37mNa6mO48mP7: data_e pred_e
  -3.903935605574178044e+04 -3.903914135217208241e+04
  # ../common/img001_DFTB3.hdf5#/C15H16HW104N13O2OW53mC85mCl0mH90mN36mNa6mO50mP7: data_e pred_e
  -3.903910261423454358e+04 -3.903933464041115803e+04


The forces are provided with x, y, z components, units are eV for energies and eV/angstrom for forces. The virial (v) is also provided, but we will focus on forces and energies because these were the components of our loss function. You have been provided the script plot_parity.py to plot the errors in the energies and magnitude of forces. 

.. code-block:: bash

  python plot_parity.py -f parity_TrainingSet.f.out -e parity_TrainingSet.e.out -l 'Train' -f parity_TestSet.f.out -e parity_TestSet.e.out -l 'Test' -s MTR1
  Forces:
  R2: 0.945
  RMSE: 1.636
  MAE: 0.703
  Energies:
  R2: 0.404
  RMSE: 2.837
  MAE: 2.316
  ----------
  Forces:
  R2: 0.945
  RMSE: 1.639
  MAE: 0.7
  Energies:
  R2: 0.079
  RMSE: 3.491
  MAE: 2.877
  ----------


The plot should look something like this:

.. image:: /_static/files/WorkshopTutorials/2026_Amber_Workshop_SSB/Day5/parity_SRP_MTR1.png
    :width: 600px

The errors obtained might be slightly different from independent training runs. The errors reported are given as Train,Test. As we can see, the errors for the test set are slightly higher than the training set, but both are quite low. 

Apply transfer learning
~~~~~~~~~~~~~~~~~~~~~~~

Now we implement transfer learning by initiating our model from a pre-trained foundational model. 

Navigate to the Refine directory:

.. code-block:: bash

  cd HandsOn11/Train/Refine
  ls 
  Foundation_model  common  computerprofiles  dptest  simplify_MACE.json

The pre-trained model file is stored in Foundation_model/00.train/000, and "training_init_model": true directs the program to read in this model rather than initiating the model with random weights and biases. simplfy_MACE.json has been modifed to reflect this:

.. code-block:: python

  {
    "default_training_param": {
      "model": {
        "type": "mace",
        "type_map": ["C","H","N","O","P","Mg","Ca","Na","Zn","S","HW","OW","mC","mCl","mH","mMg","mN","mNa","mO","mP","mS"],
        "r_max": 6.0,
        "sel": 128,
        "num_radial_basis": 8,
        "num_cutoff_basis": 5,
        "max_ell": 3,
        "interaction": "RealAgnosticResidualInteractionBlock",
        "num_interactions": 2,
        "hidden_irreps": "128x0e + 128x1o",
        "pair_repulsion": false,
        "distance_transform": "None",
        "correlation": 3,
        "gate": "silu",
        "MLP_irreps": "16x0e",
        "radial_type": "bessel",
        "radial_MLP": [
          64,
          64,
          64
        ]
      },
      "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 2000,
        "stop_lr": 1e-05
      },
      "loss": {
        "type": "ener",
        "start_pref_e": 1,
        "limit_pref_e": 100,
        "start_pref_f": 100,
        "limit_pref_f": 100,
        "start_pref_v": 0,
        "limit_pref_v": 0
      },
      "training": {
        "numb_steps": 2000,
        "disp_file": "lcurve.out",
        "disp_freq": 100,
        "save_freq": 1000,
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json"
      }
    },
    "type_map": ["C","H","N","O","P","Mg","Ca","Na","Zn","S","HW","OW","mC","mCl","mH","mMg","mN","mNa","mO","mP","mS"],
    "pick_data": "common/img001_DFTB3.hdf5",
    "training_init_model": true,
    "training_iter0_model_path": ["Foundation_model/00.train/000"],
    "sys_configs": [],
    "sys_batch_size": ["auto"],
    "init_pick_number": 20000,
    "iter_pick_number": 10000,
    "numb_models": 1,
    "training_reuse_iter": 2,
    "training_reuse_start_lr": 0.001,
    "training_reuse_old_ratio": "auto",
    "training_reuse_numb_steps": 100000,
    "training_reuse_start_pref_e": 1,
    "training_reuse_start_pref_f": 100,
    "dp_compress": false,
    "model_devi_f_trust_lo": 0.08,
    "model_devi_f_trust_hi": 2.0,
    "labeled": true,
    "init_data_sys": [],
    "init_data_prefix": "",
    "fp_task_min": 0,
    "mlp_engine": "dp",
    "one_h5": true,
    "train_backend": "pytorch"
  }

Train the model using the same procedure in the previous section:

.. code-block:: bash

    dpgen simplify simplify_MACE.json computerprofiles/machine_simplify.json 2>&1 | tee simplify.out
    INFO:dpgen:start simplifying
    INFO:dpgen:=============================iter.000000==============================
    INFO:dpgen:-------------------------iter.000000 task 00--------------------------
    INFO:dpgen:-------------------------iter.000000 task 01--------------------------
    INFO:dpgen:first iter, skip step 1-5
    INFO:dpgen:-------------------------iter.000000 task 02--------------------------
    INFO:dpgen:first iter, skip step 1-5
    INFO:dpgen:-------------------------iter.000000 task 03--------------------------

When it is complete, run dp test and plot the results:

.. code-block:: bash

  cd dptest
  sbatch test_models.slurm

  ...

  python plot_parity.py -f parity_TrainingSet.f.out -e parity_TrainingSet.e.out -l 'Train' -f parity_TestSet.f.out -e parity_TestSet.e.out -l 'Test' -s MTR1
  Forces:
  R2: 0.998
  RMSE: 0.303
  MAE: 0.193
  Energies:
  R2: 0.919
  RMSE: 1.046
  MAE: 0.852
  ----------
  Forces:
  R2: 0.998
  RMSE: 0.344
  MAE: 0.209
  Energies:
  R2: 0.852
  RMSE: 1.398
  MAE: 1.116
  ----------

The results should look something like this:

.. image:: /_static/files/WorkshopTutorials/2026_Amber_Workshop_SSB/Day5/parity_Refined_MTR1.png
    :width: 600px

The errors in forces and energies for the training and test sets, even with very few training steps, have been significantly reduced by implementing transfer learning. 

Model deviation for data-efficient active learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are two potential active learning approaches one could take using the DPGEN program in DeePMD-kit. These are "dpgen simplify" and "dpgen run". Dpgen simplify is a pool-based active learning approach where an existing pool of data has already been labeld at the high level with single point calculations, and data is iteratively drawn from the pool for training. This is an active learning procedure because at each iteration, we compute a metric referred to as model deviation. Model deviation is a measure of uncertainty in the force predictions by a committee of stochastically trained models. The remaining pool is reduced before drawning new samples by removing samples for which the models already agree well, indicating high certainty in the estimate. In this way, training is focused on areas of greater uncertainty, and removing redundancy from the data set reduces the amount of training necessary. 

The model deviation process is automated in dpgen simplify, but in this section we will do it by hand to demonstrate how the uncertainty estimate promotes efficient training. In pool-based avtive learning, model deviation would be computed for the entire pool to prune the number of samples availble for training in the next iteration. For the sake of this tutorial, we will take the img002 data as the pool. For the specific reaction potential and the refined potential, you have been provied the 3 additional replicas of the model that have been trained with the same procedure, but initiated from independent, stochastic initial states. Download the model-devi directory for the SRP model and navigate to it:

.. attention:: 

  Insert download link

.. code-block:: bash

  cd HandsOn11/Train/SRP/model-devi
  ls
  graph.000.pth  graph.001.pth  graph.002.pth  graph.003.pth  img002_DFTB3.hdf5  model-devi.slurm   plot_modeldevi.py

You have been provided model-devi.slurm to compute the deviation in forces predicted by the four models. Here are the contents of model-devi.slurm:

.. code-block:: console

  #!/bin/bash
  #SBATCH --partition={{GPUpartition}}
  #SBATCH --job-name="model_devi"
  #SBATCH --output=dpmodeldevi.out
  #SBATCH --error=dpmodeldevi.err
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=1
  #SBATCH --gpus-per-node=1
  #SBATCH --time=0-00:30:00
  #SBATCH --mem=8G


  {{deepmdload}}


  time dp model-devi -m graph* -s img002_DFTB3.hdf5 -o model-devi_SRP.out


Submit the job, which should complete in under 5 minutes:

.. code-block:: bash

  sbatch model-devi.slurm

In the meantime, download the model deviation inputs for the refined model, and compute the model deviation:

.. attention::

  Insert download link


.. code-block:: bash

  cd ../../Refine/model-devi
  sbatch model-devi.slurm

Take a look at the output in model-devi_Refine.out:

.. code-block:: bash

  head model-devi_Refine.out
  # img002_DFTB3.hdf5#/C15H16HW100N13O2OW47mC87mCl0mH98mN37mNa6mO55mP7
  # step      max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f         devi_e             
            0       3.340329e-04       1.328199e-04       2.359955e-04       4.042720e-02       9.702322e-09       4.536546e-03       2.945244e-05
  # img002_DFTB3.hdf5#/C15H16HW100N13O2OW49mC82mCl0mH97mN35mNa6mO54mP7
  # step      max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f         devi_e             
            0       3.974781e-04       8.601793e-05       2.444289e-04       3.997536e-02       8.229504e-11       4.046020e-03       3.907553e-05
  # img002_DFTB3.hdf5#/C15H16HW100N13O2OW49mC84mCl0mH94mN35mNa6mO51mP8
  # step      max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f         devi_e             
            0       5.629255e-04       7.516243e-05       2.851215e-04       6.032073e-02       5.259309e-10       5.036927e-03       5.452317e-05
  # img002_DFTB3.hdf5#/C15H16HW101N13O2OW50mC82mCl0mH95mN36mNa6mO51mP7


Remember that in the training input file we set model_devi_f_trust_lo to 0.08 ev/angstrom and model_devi_f_trust_hi to 2.0 eV/angstrom. This means any samples in the pool with maximum deviation in forces below model_devi_f_trust_lo or above model_devi_f_trust_hi will be removed from the pool. 

We will use this metric to calculate how many samples would remain in the pool after your short training. In the HandsOn11/Train/SRP/model-devi directory you have been provided a script called plot_modeldevi.py. This will extract the max_devi_f column from the output. Run the script to compare the results of the SRP and the refined potential:

.. code-block:: bash

  python plot_modeldevi.py -f model-devi_SRP.out -l 'SRP' -f ../../Refine/model-devi/model-devi_Refine.out -l 'Refine'

The result should look something like this:

.. image:: /_static/files/WorkshopTutorials/2026_Amber_Workshop_SSB/Day5/model_deviations.png
    :width: 600px


The black vertial line indicates model_devi_f_trust_lo. We can see that with the SRP, the models agree poorly, thus there is high uncertainty in the estimate. These samples would remain in the pool for further training in a pool-based active learning procedure. For the refined model, all samples fall below model_devi_f_trust_lo, indicating high certainty for this model and accurate results. These samples would be removed from the pool, thus saving time and resources training on redundant data. Taken together, the transfer learning process helps us achieve better accuracy quicker, and the model deviation based active learning approach improves data efficicency.  


Addtional actiivity: on-the-fly active learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Between hands on session 4 and the preceeding sections of this tutorial, you now have hands on experience running every component of an on-the-fly active learning procedure. This differs from pool-based active learning where the high level data has been precomputed. In this approach we will generate data "on-the-fly" by simulating the system with the version of the model trained in the last iteration. Therefore, the structures will be generated at a level of theory that increasingly mimics the target. Using the concept of model deviation, samples will be selected for further labeling with single point calculations. The prgram in DeePMD-Kit for this procedure is dpgen run.  

This is a more computationally intensive approach than pool-based active learning, therefore you will learn to set up the calculations, and the resulting model will be provided for you to analyze. 

Download the files for this section to the AL directory:

.. attention:: 

  Insert download link

.. code-block:: bash

  ls HandsOn11/Train/AL
  common  computerprofiles  generator.json  iter.000000  iter.000001  restarts Foundation_model record.dpgen

You now have been provided an updated version of iter.000001 containing the 4 model replicas. This is necessary for active learning, as model deviation cannot be computed with only one model.

You also have a new input file called generator.json. This has new inputs related to the model deviation and fp steps:

.. code-block:: python

    ...

    "training_init_model": true,
    "training_iter0_model_path": ["Foundation_model/00.train/000", "Foundation_model/00.train/001", "Foundation_model/00.train/002", "Foundation_model/00.train/003"],
    "sys_configs": [
      [
        "restarts/img001.rst7"
      ]
    ],
    "sys_batch_size": [
      "auto"
    ],
    "numb_models": 4,
    "training_reuse_iter": 2,
    "training_reuse_start_lr": 0.001,
    "training_reuse_old_ratio": "auto",
    "training_reuse_numb_steps": 100000,
    "training_reuse_start_pref_e": 0.02,
    "training_reuse_start_pref_f": 1000,
    "dp_compress": false,
    "model_devi_f_trust_lo": 0.08,
    "model_devi_f_trust_hi": 2.0,
    "labeled": true,
    "init_data_sys": [
      "MTR.hdf5"
    ],
    "init_data_prefix": "common",
    "fp_task_min": 10,
    "mlp_engine": "dp",
    "one_h5": true,
    "train_backend": "pytorch",
    "mass_map": [12,1,14,16,31,24,40,23,65,32,1,16,12,35,1,24,14,23,16,31,32],
    "qm_region": [
      ":69|@272-282,290-291,1975-1988"
    ],
    "qm_charge": [
      1
    ],
    "parm7": [
      "/PATH2/common/MTR.parm7"
    ],
    "mdin": [
      "/PATH2/common/ml.mdin"
    ],
    "disang": [
      "/PATH2/common/MTR.disang"
    ],
    "r": [
      [
        [
          -0.85970438,
          -1.7447896
        ]
      ]
    ],
    "nsteps": [
      2000
    ],
    "fp_task_max": 500,
    "fp_params": {
      "low_level_mdin": "/PATH2/common/lowlevel.mdin",
      "high_level_mdin": "/PATH2/common/highlevel.mdin"
    },
    "low_level": "DFTB3",
    "high_level": "PBE0",
    "cutoff": 6.0,
    "sys_format": "amber/rst7",
    "init_multi_systems": true,
    "model_devi_clean_traj": false,
    "model_devi_engine": "amber",
    "model_devi_skip": 0,
    "shuffle_poscar": false,
    "fp_style": "amber/diff",
    "detailed_report_make_fp": true,
    "use_clusters": true,
    "model_devi_jobs": [
      {
        "_comment": 0,
        "sys_idx": [
          0
        ],
        "trj_freq": 20
      },
      {
        "_comment": 1,
        "sys_idx": [
          0
        ],
        "trj_freq": 20
      },
      {
        "_comment": 2,
        "sys_idx": [
          0
        ],
        "trj_freq": 20
      }
    ]
  }


The default training parameters are the same. Now we must provide inputs for how QM/MM simulations will be run, including restart file, reaction coordinate values, output frequency, the qm region, the qm charge, and the level of theory. In this example, the input is set up to run 2 additional active learning iterations based on the length of "model_devi_jobs". We must also provide template file where the values will be inserted, which are located in common:

.. code-block:: bash

  ls common
  MTR.disang  MTR.parm7  highlevel.mdin  img001_DFTB3.hdf5  lowlevel.mdin  ml.mdin

lowlevel.mdin and highlevel.mdin will look familar from the first section of this tutorial, and ml.mdin will look familar from Hands on session 4. The mdin files contain place holders for inputs given in the json file, as this procedure could be used for multiple systems.

You have also been provided a new machine file called computerfile/machine_generator.json, which must allocate resources for QM/MM simulations and single point calculations:

.. code-block:: python

  {
    "train": [
      {
        "command": "dp",
        "machine": {
          "batch_type": "SlurmJobArray",
          "context_type": "LocalContext",
          "local_root": "./",
          "remote_root": "./"
        },
        "resources": {
          "number_node": 1,
          "cpu_per_node": 1,
          "gpu_per_node": 1,
          "custom_flags": [
            "#SBATCH --mem=32G",
            "#SBATCH --time=24:00:00",
            "#SBATCH --requeue",
          ],
          "module_list": [
            "pydeepmdkit/deepmdkit/default"
          ],
          "envs": {
            "OMP_NUM_THREADS": 1,
            "TF_INTER_OP_PARALLELISM_THREADS": 1,
            "TF_INTRA_OP_PARALLELISM_THREADS": 1
          },
          "group_size": 1,
          "queue_name": "gpu"
        }
      }
    ],
    "model_devi": [
      {
        "command": "mpirun -n 1 sander.MPI",
        "group_size": 1,
        "machine": {
          "batch_type": "SlurmJobArray",
          "clean_asynchronously": true,
          "context_type": "LocalContext",
          "local_root": "./",
          "remote_root": "./"
        },
        "resources": {
          "number_node": 1,
          "gpu_per_node": 1,
          "cpu_per_node": 1,
          "custom_flags": [
            "#SBATCH --ntasks-per-node=1",
            "#SBATCH --mem=24G",
            "#SBATCH --time=24:00:00",
            "#SBATCH --requeue"
          ],
          "envs": {
            "OMP_NUM_THREADS": 1,
            "TF_INTER_OP_PARALLELISM_THREADS": 1,
            "TF_INTRA_OP_PARALLELISM_THREADS": 1
          },
          "group_size": 128,
          "queue_name": "gpu",
          "module_list": [
            "dpmace/amber24/default"
          ]
        }
      }
    ],
    "prepare": [
      {
        "command": "dp",
        "machine": {
          "batch_type": "Slurm",
          "context_type": "LocalContext",
          "local_root": "./",
          "remote_root": "./",
          "clean_asynchronously": true
        },
        "resources": {
          "number_node": 1,
          "cpu_per_node": 1,
          "gpu_per_node": 0,
          "custom_flags": [
            "#SBATCH -c 1",
            "#SBATCH --mem=8G",
            "#SBATCH --time=6:00:00",
            "#SBATCH --requeue"
          ],
          "module_list": [
            "pydeepmdkit/deepmdkit/default"
          ],
          "envs": {
            "OMP_NUM_THREADS": 1,
            "TF_INTER_OP_PARALLELISM_THREADS": 1,
            "TF_INTRA_OP_PARALLELISM_THREADS": 1
          },
          "group_size": 1,
          "queue_name": "main"
        }
      }
    ],
    "fp": [
      {
        "command": "mpirun -n 4",
        "group_size": 1,
        "machine": {
          "batch_type": "Slurm",
          "clean_asynchronously": true,
          "context_type": "LocalContext",
          "local_root": "./",
          "remote_root": "./"
        },
        "resources": {
          "number_node": 1,
          "gpu_per_node": 1,
          "cpu_per_node": 4,
          "custom_flags": [
            "#SBATCH --ntasks-per-node=4",
            "#SBATCH --mem=128G",
            "#SBATCH --time=24:00:00",
            "#SBATCH --requeue"
          ],
          "gpu_per_node": 0,
          "group_size": 50,
          "number_node": 1,
          "queue_name": "gpu",
          "module_list": [
            "pydeepmdkit/deepmdkit/default"
          ]
        }
      }
    ]
  }

This approach would require multiple GPUs, so it will be run for you, but you have previously seen each component. 

The job would be launched as follows:

.. code-block:: bash

  dpgen run -d generator.json computerprofiles/machine_generator.json