Training a QM/MM+ΔMLP model with DeePMD-kit from AMBER simulation data
======================================================================

| Erika McCarthy\ :sup:`1`, Şölen Ekesan\ :sup:`1`, and Darrin M. York\ :sup:`1`
| :sup:`1`\ Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, NJ 08854, USA

Learning objectives
-------------------

- Perform QM/MM umbrella sampling simulations to obtain a reaction path
- Perform single point calculations to evaluate reference and target level energies and forces
- Package the data for training using dpamber
- Prepare input files for DeePMD-kit
- Train an initial model using DP-GEN simplify
- Refine a pretrained model
- Run QM/MM+ΔMLP simulations

Relevant literature
-------------------

- `Development of Range-Corrected Deep Learning Potentials for Fast, Accurate Quantum Mechanical/Molecular Mechanical Simulations for Chemical Reactions in Solution `__
- `DeePMD-kit v2: A software package for deep potential models `__
- `DeePMD-kit v3: A Multiple-Backend Framework for Machine Learning Potentials `__
- `DeePMD-GNN: A DeePMD-kit Plugin for External Graph Neural Network Potentials `__
- `DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models `__
- `MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields `__
- `Transferability of MACE Graph Neural Network for Range Corrected Δ-Machine Learning Potential QM/MM Applications `__

Introduction
------------

DeePMD-kit supports training for several model architectures, including Deep Potentials (DP) and graph neural networks (GNN). There are two general approaches for training a QM/MM+ΔMLP model, depending on your end goal. The first is a specific reaction potential (SRP) approach, in which you have one system of interest and simply want a model that allows you to perform long production sampling for that system and perhaps minor variations of it. The second is to train a more foundational model, in which case you may have many systems of interest to experiment with, and it may be worth spending more time upfront to train for enhanced transferability. A DP model may be desirable for the first approach, while a GNN could be better for the latter. The training data preparation procedure is agnostic to the type of model you wish to train, but the contents may vary. This tutorial uses a single system as an example, but one could apply the procedure to multiple systems and combine the training data for a foundational model approach. Differences in procedure for a DP and a MACE model will be highlighted.

Outline
-------

In this tutorial you will learn how to prepare QM/MM training data, train a QM/MM+ΔMLP model with DeePMD-kit, and then apply it to simulations. The preparation is agnostic to the type of model you wish to train, so the data could be used to train a Deep Potential (DP) or graph neural network (GNN) model. We will assume the low level, semi-empirical model is DFTB3 and the high level, ab initio target model is PBE0/6-31G*. Example input files have been taken from simulations of the HDV ribozyme phosphoryl transfer reaction. The associated files can be downloaded from `Files.tar `__. The tutorial is structured as follows:
1. Preparing training data from AMBER QM/MM simulations

   - `1.1 String simulations <#string>`__
   - `1.2 Production simulations <#prod>`__
   - `1.3 Single point calculations <#snglpt>`__
   - `1.4 Packaging the data <#dpamber>`__

2. Training QM/MM+ΔMLP models

   - `2.1 Generating input files for a Deep Potential <#input1>`__
   - `2.2 Generating input files for MACE <#input2>`__
   - `2.3 Machine files <#machine>`__
   - `2.4 Running DP-GEN simplify <#simplify>`__
   - `2.5 Starting from a pretrained model <#pretrain>`__
   - `2.6 Evaluating a model <#eval>`__

3. Running QM/MM+ΔMLP simulations

   - `3.1 QM/MM+ΔMLP simulations <#runsim>`__

1. Preparing training data from AMBER QM/MM simulations
-------------------------------------------------------

.. _string:

1.1 String simulations
~~~~~~~~~~~~~~~~~~~~~~

You will first need to perform umbrella sampling simulations to obtain the minimum free energy path using the low level potential. If your reaction is 1-dimensional, this is simply a linear interpolation from reactants to products; however, if your reaction contains multiple reaction coordinates, you must optimize the path using a string method. Here we will use the rapidly converging Surface Accelerated String Method (SASM) approach and assume the directory structure created by the N-dimensional free energy surface (NDFES) program available in AmberTools. This could technically be done with the high level potential (or any potential, for that matter), but most likely that sampling would be prohibitively expensive. Below is an example slurm script to run SASM simulations on the Expanse supercomputer. It is run from the base directory. Your directory structure should start as follows:

.. code-block:: bash

   user@cluster:~/YOUR/WORKING/DIR $ ls
   TEMPLATE_DFTB3.mdin  equil  nextSASM.sh  runString.slurm  TEMPLATE_PBE0.mdin  hdf5.slurm  reanalysis.slurm  template
   user@cluster:~/YOUR/WORKING/DIR $ ls template
   1ps.mdin  TEMPLATE.disang  qmmm.parm7  template.groupfile

You should already have equilibrated windows in the equil directory. Here are the contents of runString.slurm:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name="runString_HDV_DFTB3"
   #SBATCH -o slurm/slurm-%j.%a.out
   #SBATCH -e slurm/slurm-%j.%a.err
   #SBATCH --export=ALL
   #SBATCH --partition=compute
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=128
   #SBATCH --mem=200G
   #SBATCH --cpus-per-task=1
   #SBATCH --account=rut149
   #SBATCH --time=2-00:00:00
   #SBATCH --constraint=lustre

   module load amber/xtb/13may24

   export LAUNCH="srun --mpi=pmi2 -K1 -N1 -n128 -c1 --exclusive"
   export EXE="sander.MPI"

   if [ ! -d slurm ]; then
       mkdir slurm
   fi

   for iter in init {01..50}; do

       if [[ "${iter}" == "init" ]]; then
           itdir=init
           echo "Running init directory"
           for win in {01..32}; do
               cp equil/img${win}.rst7 init/init${win}.rst7
           done
       else
           itdir=it${iter}
       fi

       cd $itdir
       $LAUNCH $EXE -ng 32 -groupfile ../template/template.groupfile
       wait
       sleep 1

       ###################################
       #### Check for convergence failure
       ###################################
       FAILED=0
       for s in $(seq 32); do
           base=$(printf "img%02i" ${s})
           if grep -q "Convergence could not be achieved" ${base}.mdout; then
               FAILED=1
               echo "${base} failed SCF convergence!!!"
           else
               echo "SCF converged"
               continue
           fi
       done
       if [ "${FAILED}" == 1 ]; then
           echo "Detected a failed simulation"
           exit 1
       fi

       ###################################
       #### Check for any other failures
       ###################################
       echo ${PWD}
       if grep -q " 5. TIMINGS" *.mdout; then
           sleep 1
           cd ../.
           ./nextSASM.sh $iter 128
       else
           echo "run incomplete"
           exit 1
       fi
   done

The above slurm script calls a bash script that uses NDFES to analyze the simulations and generate the next directory of simulations. Here are the contents of nextSASM.sh:

.. code-block:: bash

   #!/bin/bash

   cit=$(printf %02d ${1#0})
   NCPU=$2
   export OMP_NUM_THREADS=${NCPU}
   export OPENBLAS_NUM_THREADS=${NCPU}

   ###################
   # General controls
   ###################
   template="./template"
   dirstr="it"
   disang=TEMPLATE.disang
   mdin="1ps.mdin"
   TEMP="--temp=298"
   pad="--pad=2"
   NSIM=32
   fil_analysis="analysis.${cit}.log"
   truncate -s0 $fil_analysis

   ####################################
   # MakeAna -> ndfes-path-analyzesims
   ####################################
   MAXEQ="--maxeq=0.25"
   SKIPG="--skipg"
   NPREV="--nprev=1000 --neqit=0"
   MAKEANA="ndfes-path-analyzesims.py --curit=${cit} -d ${template}/${disang} ${TEMP} ${MAXEQ} ${SKIPG} ${NPREV} ${pad}"

   ####################################
   # EXE -> ndfes_omp
   ####################################
   binwid="-w 0.15"
   bootstrap="--nboot 50"
   # Check to see if we have the OpenMP version of ndfes
   if test -x "$(command -v ndfes_omp)"; then
       export NDFESEXE="ndfes_omp"
   elif test -x "$(command -v ndfes.OMP)"; then
       export NDFESEXE="ndfes.OMP"
   elif test -x "$(command -v ndfes)"; then
       export NDFESEXE="ndfes"
   else
       echo "Could not find ndfes"
       exit 1
   fi
   MAKECHK="${NDFESEXE} --mbar ${binwid} ${bootstrap} -c metafile.all.chk metafile.all"

   #################################
   # MakeSims -> ndfes_path options
   #################################
   MAXIT="--maxit=100 --buffer"
   TOL="--distol=0.05 --angtol=0.5 --mlen=5"
   FIX0=""
   FIX1=""
   INTERP="--wavg=4 --wavg-niter=1 --minsize=10"
   SMOOTH=""
   PREDFC=""
   CPNEAR="--cp-nearest"
   FESCONV="--sasm"
   PATHPTS="--npathpts=50 --nsplpts=400"
   MAKESIMS="ndfes-path.OMP --curit=${cit} -d ${template}/${disang} -i ${template}/${mdin} ${MAXIT} ${TOL} ${FIX0} ${FIX1} ${INTERP} ${SMOOTH} ${PREDFC} ${CPNEAR} ${FESCONV} ${pad} ${PATHPTS}"

   ###################
   # Implementation
   ##################

   # Calculate previous and next iteration numbers
   next=$(printf %02d $((${cit#0} + 1)))
   old=$(printf %02d $((${cit#0} - 1)))

   # Establish directory names
   if [ ${cit} -lt 1 ]; then
       curdir=init
   else
       curdir=${dirstr}$cit
   fi
   newdir=${dirstr}$next
   if [ ${old} -lt 1 ]; then
       olddir=init
   else
       olddir=${dirstr}${old}
   fi

   #
   # Do not do analysis if the simulations did not run or did not complete
   #
   statfiles=($(for f in ${curdir}/*.stat; do if [ -e "${f}" ]; then echo "${f}"; fi; done))
   nstat=${#statfiles[@]}
   if [ "${nstat}" -gt 0 ]; then
       echo "Analysis skipped because the following files exist: ${statfiles[@]}"
       exit 1
   fi

   #
   # Prepare the dumpaves and metafile by running MAKEANA
   #
   echo "ANALYSIS START AT $(date)" > ${fil_analysis} 2>&1
   echo "RUNNING: ${MAKEANA}"
   time ${MAKEANA} >> $fil_analysis 2>&1
   echo "FINISHED MAKEANA AT $(date)" >> $fil_analysis 2>&1

   #
   # Run ndfes to get the FES
   #
   topdir=${PWD}
   cd ${curdir}/analysis
   echo "Running ndfes"
   ${MAKECHK} >> ${topdir}/$fil_analysis 2>&1
   echo "FINISHED NDFES AT $(date)" >> ${topdir}/$fil_analysis 2>&1
   cd ${topdir}

   #
   # Prepare a new directory for the next iteration by running MAKESIMS
   #
   if [ ! -d ${newdir} ]; then
       mkdir ${newdir}
   fi
   echo "RUNNING: ${MAKESIMS}"
   ${MAKESIMS} >> ${fil_analysis} 2>&1
   echo "FINISHED MAKESIMS AT $(date)" >> $fil_analysis 2>&1

   export OMP_NUM_THREADS=1

The runString.slurm script above is set to run 50 string iterations, but one may wish to run more if the path has not converged. Submit the job:

.. code-block:: bash

   sbatch runString.slurm
.. _prod:

1.2 Production simulations
~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you are satisfied with the convergence of the string, you will need to run production simulations on the final path to generate structures that will be used for training. We have observed that 500 samples per window is ideal. To generate 500 samples, run 25 ps simulations with ntwx=50 for a 1 fs time step or ntwx=25 for a 2 fs time step. Modify the runString.slurm script from "for iter in init {01..50}; do" to "for iter in 51; do" if iteration 51 is in fact your most recently generated string. Also, modify the mdin files from 1 ps to 25 ps. Here is an example:

.. container:: box

   .. code:: bash

      for i in {01..32}; do sed -i "s/nstlim = 1000/nstlim = 25000/g" it51/img${i}.mdin; done
      sed -i "s/for iter in init {01..50}; do/for iter in 51; do/" runString.slurm
      sbatch runString.slurm

.. _snglpt:

1.3 Single point calculations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the production simulations are done, you will need to reanalyze each frame of the trajectories at the high and low levels using single point calculations. These calculations will output the energies and forces as mden and mdfrc files. The energies, forces, and coordinates constitute the training data. Here is an example reanalysis.slurm script. The script should be in the base directory.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name="reanalysis_4Training_HDV"
   #SBATCH --output="slurm/reanalysis%a.slurmout"
   #SBATCH --error="slurm/reanalysis%a.slurmerr"
   #SBATCH --partition=shared
   #SBATCH --nodes=1
   #SBATCH --ntasks=16
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=48G
   #SBATCH --export=ALL
   #SBATCH -t 2-00:00:00
   #SBATCH --account=rut149
   #SBATCH --constraint=lustre
   #SBATCH --requeue
   #SBATCH --array=0-31

   set -u

   qmreg=':81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993'
   top=`pwd`

   for i in 51; do
       RC=($(seq -w 1 1 32))
       R=${RC[${SLURM_ARRAY_TASK_ID}]}
       echo R is ${R}

       cd it${i}
       if [ ! -d prune_traj ]; then mkdir prune_traj; fi
       cd prune_traj

       BASE=img${R}
       PBE0=img${R}_PBE0
       DFTB=img${R}_DFTB3
       parm=qmmm.parm7

       sed -e "s/XXXX/${R}/g" ${top}/TEMPLATE_DFTB3.mdin > img${R}_DFTB3.mdin
       sed -e "s/XXXX/${R}/g" ${top}/TEMPLATE_PBE0.mdin > img${R}_PBE0.mdin

       module load amber/20mlipi/05jun23
       mpirun -n 16 sander.MPI -O -p ../../template/${parm} -i ${DFTB}.mdin -c ../${BASE}.rst7 -o ene_${DFTB}.mdout -y ../${BASE}.nc -frc ene_${DFTB}.mdfrc -e ene_${DFTB}.mden -inf ${DFTB}.mdinfo
       module unload amber/20mlipi/05jun23

       module load amber/hfdf/05jun23
       mpirun -n 16 sander.MPI -O -p ../../template/${parm} -i ${PBE0}.mdin -c ../${BASE}.rst7 -o ene_${PBE0}.mdout -y ../${BASE}.nc -frc ene_${PBE0}.mdfrc -e ene_${PBE0}.mden -inf ${PBE0}.mdinfo

       if [[ $(grep -r "TIMINGS" ene_${DFTB}.mdout) ]] && [[ $(grep -r "TIMINGS" ene_${PBE0}.mdout) ]]; then
           echo "${BASE} finished correctly"
           module purge
           module load dpgen_Aug092023
           export OMP_NUM_THREADS=1
           export TF_INTER_OP_PARALLELISM_THREADS=1
           export TF_INTRA_OP_PARALLELISM_THREADS=1
           hl=ene_img${R}_PBE0
           ll=ene_img${R}_DFTB3
           out=img${R}_DFTB3.hdf5
           nc=../img${R}.nc
           dpamber corr --cutoff 6. --qm_region ${qmreg} --parm7_file ${top}/template/qmmm.parm7 --nc ${nc} --hl ${hl} --ll ${ll} --out ${out}
       else
           exit 1
       fi
       cd ${top}
   done

This will create a directory called prune_traj inside your production iteration directory. In this directory it will output forces and energies at the high and low level. The script is run as an array, and can be set up with a dependency on the production simulations (sbatch --dependency=afterok:JOBID reanalysis.slurm).
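Once the array job has finished, it is worth a quick check that every window completed both single point passes and produced an hdf5 file before moving on. Below is a minimal sketch of such a check; it assumes the it51 directory and the ene_img*_DFTB3/PBE0.mdout and img*_DFTB3.hdf5 names used in the script above, so adjust the paths to your own setup.

.. code-block:: python

   # Quick completeness check for the reanalysis array job.
   # Assumes the it51/prune_traj layout and file names used in reanalysis.slurm above.
   import os

   pdir = os.path.join("it51", "prune_traj")
   for w in [f"{i:02d}" for i in range(1, 33)]:
       ok = os.path.isfile(os.path.join(pdir, f"img{w}_DFTB3.hdf5"))
       for mdout in (f"ene_img{w}_DFTB3.mdout", f"ene_img{w}_PBE0.mdout"):
           path = os.path.join(pdir, mdout)
           # a single point run is complete once the TIMINGS section has been written
           ok = ok and os.path.isfile(path) and "TIMINGS" in open(path).read()
       print(w, "ok" if ok else "INCOMPLETE")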
The reanalysis script switches between two AMBER installations because DFTB3 and the PBE0 (HFDF) implementation are not available in the same build. Once the single point calculations are complete, it uses the dpamber program to read the low and high level energies and forces along with the coordinates and package them in hdf5 file format. It will carve out the coordinates of atoms within a 6 Å cutoff (--cutoff) of the QM region to save to the hdf5 file. The script uses template mdin files called TEMPLATE_DFTB3.mdin and TEMPLATE_PBE0.mdin, examples of which are given below. These files should be at the same level as the reanalysis script. Be sure to modify these and the QM region in the reanalysis script to reflect your system.

.. code-block:: bash

   DFTB3snglpt
   &cntrl
     ! IO =======================================
     irest       = 0     ! 0 = start, 1 = restart
     ntx         = 1     ! 1 = start, 5 = restart
     ntxo        = 1     ! read/write rst as formatted file
     iwrap       = 1     ! wrap crds to unit cell
     ioutfm      = 1     ! write mdcrd as netcdf
     imin        = 6
     ntmin       = 1
     ntpr        = 1
     ntwr        = 0
     ntwx        = 0
     ntwf        = 1     ! print mdfrc file
     ntwe        = 1     ! print mdene file
     ! DYNAMICS =================================
     nstlim      = 0     ! number of time steps
     dt          = 0.001 ! ps/step
     ntb         = 1     ! 1=NVT periodic, 2=NPT periodic, 0=no box
     ! TEMPERATURE ==============================
     temp0       = 298   ! target temp
     gamma_ln    = 5.0   ! Langevin collision freq
     ntt         = 3     ! thermostat (3=Langevin)
     ! PRESSURE ================================
     ntp         = 0     ! 0=no scaling, 1=isotropic, 2=anisotropic
     ! SHAKE ====================================
     ntc         = 2     ! 1=no shake, 2=HX constrained, 3=all constrained
     noshakemask = ":81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993" ! do not shake these
     ntf         = 1     ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
     ! MISC =====================================
     cut         = 10.0
     ifqnt       = 1
     ig          = -1
     nmropt      = 0
     lj1264      = 0
   /
   &ewald
     dsum_tol = 1.e-6
   /
   &qmmm
     qm_theory    = 'DFTB3'
     qmmask       = ':81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993'
     qmcharge     = 0
     spin         = 1
     qmshake      = 0
     qm_ewald     = 1
     qmmm_switch  = 1
     scfconv      = 1.e-10
     verbosity    = 0
     tight_p_conv = 1
     diag_routine = 0
     pseudo_diag  = 1
     dftb_maxiter = 100
   /
   &wt type = 'DUMPFREQ', istep1 = 8, /
   &wt type='END' /
   DISANG=imgXXXX.disang
   DUMPAVE=imgXXXX.dumpave
   LISTOUT=POUT
   LISTIN=POUT

.. code-block:: bash

   PBE0snglpt
   &cntrl
     ! IO =======================================
     irest       = 0     ! 0 = start, 1 = restart
     ntx         = 1     ! 1 = start, 5 = restart
     ntxo        = 1     ! read/write rst as formatted file
     iwrap       = 1     ! wrap crds to unit cell
     ioutfm      = 1     ! write mdcrd as netcdf
     imin        = 6
     ntmin       = 1
     ntpr        = 1
     ntwr        = 0
     ntwx        = 0
     ntwf        = 1
     ntwe        = 1
     ! DYNAMICS =================================
     nstlim      = 0     ! number of time steps
     dt          = 0.001 ! ps/step
     ntb         = 1     ! 1=NVT periodic, 2=NPT periodic, 0=no box
     ! TEMPERATURE ==============================
     temp0       = 298   ! target temp
     gamma_ln    = 5.0   ! Langevin collision freq
     ntt         = 3     ! thermostat (3=Langevin)
     ! PRESSURE ================================
     ntp         = 0     ! 0=no scaling, 1=isotropic, 2=anisotropic
     ! SHAKE ====================================
     ntc         = 2     ! 1=no shake, 2=HX constrained, 3=all constrained
     noshakemask = ":81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993" ! do not shake these
     ntf         = 1     ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
     ! MISC =====================================
     cut         = 10.0
     ifqnt       = 1
     ig          = -1
     nmropt      = 0
   /
   &ewald
     dsum_tol = 1.e-6
   /
   &qmmm
     qm_theory       = 'HFDF'
     hfdf_theory     = 'PBE0'
     hfdf_basis      = '6-31G*'
     qmmask          = ':81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993'
     qmcharge        = 0
     spin            = 1
     itrmax          = 50
     qmshake         = 0
     scfconv         = 1e-07
     verbosity       = 0
     hfdf_mempercore = 2000
     hfdf_restricted = T
     !hfdf_guess_mix = 0.5
     !-------------------------------------------------------
     ! Ambient-Potential Composite Ewald
     hfdf_ewald=T, qm_ewald=1
     hfdf_mm_percent=0.0, hfdf_qmmm_wswitch=0.0, qmmm_switch=0
     !-------------------------------------------------------
     !
     ! Cutoff QM/MM electrostatics; set use_pme=0 in &ewald
     !hfdf_ewald=F,qm_ewald=0
     !hfdf_mm_percent=1.0,hfdf_qmmm_wswitch=0.0,qmmm_switch=0
     !
     ! MM-charge Ewald w/ short-range cutoff QM/MM correction
     !hfdf_ewald=F,qm_ewald=1
     !hfdf_mm_percent=1.0,hfdf_qmmm_wswitch=0.0,qmmm_switch=1
     !
     ! MM-charge Ewald w/ short-range switched-cutoff QM/MM correction
     !hfdf_ewald=F,qm_ewald=1
     !hfdf_mm_percent=1.0,hfdf_qmmm_wswitch=2.0,qmmm_switch=1
     !
     ! Nam-style Mulliken-charge PME w/ short-range cutoff QM/MM correction
     !hfdf_ewald=F,qm_ewald=1
     !hfdf_mm_percent=0.0,hfdf_qmmm_wswitch=0.0,qmmm_switch=1
     !
     ! Nam-style Mulliken-charge PME w/ switched QM/MM correction
     !hfdf_ewald=F,qm_ewald=1
     !hfdf_mm_percent=0.0,hfdf_qmmm_wswitch=2.0,qmmm_switch=1
     !
   /
   &wt type = 'DUMPFREQ', istep1 = 25, /
   &wt type='END' /
   DISANG=imgXXXX.disang
   DUMPAVE=imgXXXX.dumpave
   LISTOUT=POUT
   LISTIN=POUT

.. _dpamber:

1.4 Packaging the data
~~~~~~~~~~~~~~~~~~~~~~

Now you should have a separate hdf5 file for each window in the prune_traj directory. You can get a sense of what's inside using Python:

.. code-block:: python

   ipython
   In [1]: import h5py
   In [2]: file = h5py.File('img01_DFTB3.hdf5')
   In [3]: keynames = list(file.keys())
   In [4]: print(keynames[0])
   C23H41HW179Mg1N3O23OW89P2mC106mCl0mH110mMg3mN51mNa3mO65mP8
   In [5]: print(file[keynames[0]].keys())
   KeysViewHDF5 ['nopbc', 'set.000', 'type.raw', 'type_map.raw']
   In [6]: print(file[keynames[0]]['set.000'].keys())
   KeysViewHDF5 ['aparam.npy', 'coord.npy', 'energy.npy', 'force.npy']

Each system key is a string listing the elements and the number of atoms of each element in the cut-out region. For each system string, there is an indicator of whether or not periodic boundary conditions are used (nopbc), the atom types present (type/type_map), and most importantly, the set of energies, forces, and coordinates contained in set.000. Additionally, set.000 contains aparam (atomic parameters), which is only needed by pairwise DPRc models to assign atoms to residues. The following hdf5.slurm script can be run from the base directory:
.. code-block:: bash

   #!/bin/bash
   #SBATCH --output="slurm/dpdata-%a.slurmout"
   #SBATCH --error="slurm/dpdata-%a.slurmerr"
   #SBATCH --nodes 1
   #SBATCH --ntasks-per-node 1
   #SBATCH --partition=debug
   #SBATCH -c 1
   #SBATCH --mem=8G
   #SBATCH --time=00:30:00
   #SBATCH --requeue
   #SBATCH --account=rut149
   #SBATCH --constraint=lustre

   module load dpgen_Aug092023
   export PYTHONPATH=/expanse/lustre/projects/rut149/emccart3/DPGEN_installs/Aug092023/local/lib/python3.9/site-packages
   echo "${PYTHONPATH}"

   it='it51'
   sysname='HDV'

   cat << EOF > hdf5.py
   from glob import glob
   import os
   import dpdata

   m = dpdata.MultiSystems()
   for dd in glob("${it}/prune_traj/*.hdf5"):
       if os.path.isfile(dd):
           m.from_deepmd_hdf5(dd)
   m.to_deepmd_hdf5("initial_${sysname}.hdf5")
   EOF

   python3.9 hdf5.py

Modify the iteration directory and system name in this script for your own system. This will produce an hdf5 file for each system you process, which can later be specified as the training data for DP-GEN.

2. Training a QM/MM+ΔMLP model with DeePMD-kit
----------------------------------------------

The following sections will detail the training procedure for a DP and a MACE model. The general procedure is the same, but the input files will differ. Choose the input file instructions for the model type you wish to train. Please note that when performing training with DeePMD-kit, it will create many symbolic links, so you should ensure your file system supports this. You will also need to have ssh keys set up on any remote cluster you wish to run training on so that passwordless login is enabled.

.. _input1:

2.1 Generating input files for a Deep Potential
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the sake of demonstration, let's imagine you have generated data for two separate systems and want to combine them for training; otherwise, just proceed with your single hdf5 file. You should have a directory containing the hdf5 files you want to combine:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ ls
   combine_hdf5s.py  writejson_simplify.py  initial_HDV_assoc.hdf5  initial_HDV_dissoc.hdf5

Here are the contents of combine_hdf5s.py:

.. code-block:: python

   from glob import glob
   import dpdata

   m = dpdata.MultiSystems()
   for dd in glob("*.hdf5"):
       m.from_deepmd_hdf5(dd)
   m.to_deepmd_hdf5("trainingset.hdf5")

After running combine_hdf5s.py, you will have a single hdf5 file called trainingset.hdf5. We will need information from this file to write the input file for DeePMD-kit, which requires json or yaml format. You can think of these file formats like a Python dictionary with keys and corresponding values. You have been provided a script called writejson_simplify.py to assist in generating the json file. This script requires functions from writejson.py and sysdef.py, also provided. Be sure to store those two scripts in a directory that is in your $PYTHONPATH, or in the working directory. Here are the arguments for writejson_simplify.py:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ python writejson_simplify.py -h
   usage: writejson_simplify.py [-h] -h5 HDF5 [-m MODEL] [-r RADIUS] [-os OUTSIMP] [-x EXCLUDE]

   Write training input json file for dpgen simplify.

   optional arguments:
     -h, --help            show this help message and exit
     -h5 HDF5, --hdf5 HDF5
                           hdf5 file containing the training data. All data should be combined already
     -m MODEL, --model MODEL
                           Type of model to train. Options are DP or MACE.
     -r RADIUS, --radius RADIUS
                           Radius around QM region for training.
                           Default: 6 angstrom
     -os OUTSIMP, --outsimp OUTSIMP
                           Output file name for simplify (passive learning) json file. Default: simplify.json
     -x EXCLUDE, --exclude EXCLUDE
                           json file with sel and exclusion lists from previous run of this script if you have it.
                           This will avoid calculating exclude types and sel again to save time.

The radius should be the same as what you used in dpamber to carve out the training region around the QM region. Let's generate an input file for a DP model. Run the script with the trainingset.hdf5 file as input:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ python writejson_simplify.py -h5 trainingset.hdf5 -m DP -r 6 -os simplify_DP.json

Let's get a sense of what parameters the script set based on your training data, and what parameters are defaults. Earlier, we saw that the hdf5 file contains many system keys consisting of a string of atom names and their corresponding counts. An "m" in front of an atom indicates that it is an MM atom. In simplify_DP.json, you will see a key called "type_map", which is simply the QM and MM atom types that occur in the training data. You will see sections for QM/QM interactions and for QM/MM interactions, each containing a "sel" and "exclude_types" key. The sel (short for selection) is the number of atoms of each type a particular atom type should expect to interact with during a simulation. This initializes the dimensions of tensors involved in training, so setting it too high could impede performance, but setting it too low could mean that if a higher number of atoms is encountered in a future application, the model might not conserve energy properly. This is more straightforward for the QM/QM interactions because the QM region is generally fixed, so the sel is a list of the total number of each QM atom type, with 0 for the MM types. You will notice that we are using a hybrid descriptor, se_a_ebd_v2 for QM/QM and se_atten_v2 for QM/MM. This allows us to set the QM/MM sel to an integer rather than a list, which is better for performance. The script will compute the maximum number of each MM atom type that can be seen by a QM atom and take the total. This number is padded such that it is divisible by 16 for performance. The exclude_types lists indicate which interactions should be excluded from the calculation, so in the QM/QM section those indices will correspond to MM-MM and QM-MM pairs, and in the QM/MM section they are QM-QM and MM-MM pairs. These system-specific keys are written to a file called exclude.json that the script can read if you don't want to recompute them.

One may wish to tune numb_steps (the number of training steps), init_pick_number, and iter_pick_number based on the amount of training data you have. Typically, init_pick_number will be larger than iter_pick_number because you want the first iteration of training to see a representative sample of the data. The iter_pick_number controls how many new samples are added to training to replace the initial samples, so by setting this lower than init_pick_number you avoid completely replacing the dataset and getting sharp fluctuations in the model. You may wish to make these larger for larger training sets to reduce the number of training iterations necessary to get through all of your data. The rest of the input corresponds to the training procedure and hyperparameters that are defaults. Please refer to the DeePMD-kit manual and literature for detailed descriptions of these parameters. Here is a summary of how key parameters appear in the input file (shown in bold).
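For orientation, the sketch below shows roughly how these pieces fit together in the generated file. It is an abbreviated, illustrative excerpt only: the type_map and the QM/QM sel values are read off the example system key shown in section 1.4, the QM/MM sel and cutoff are placeholders, and the exclude_types lists are left empty here; your writejson_simplify.py output is the authoritative version.

.. code-block:: python

   # Illustrative, abbreviated sketch of the model section of simplify_DP.json
   "model": {
       "type_map": ["C", "H", "HW", "Mg", "N", "O", "OW", "P",
                    "mC", "mCl", "mH", "mMg", "mN", "mNa", "mO", "mP"],
       "type_embedding": {"neuron": [12]},
       "descriptor": {
           "type": "hybrid",
           "list": [
               {   # QM/QM part
                   "type": "se_a_ebd_v2",
                   "sel": [23, 41, 179, 1, 3, 23, 89, 2, 0, 0, 0, 0, 0, 0, 0, 0],
                   "neuron": [25, 50, 100],
                   "axis_neuron": 12,
                   "exclude_types": []   # filled with the QM-MM and MM-MM type-index pairs
               },
               {   # QM/MM part
                   "type": "se_atten_v2",
                   "sel": 144,           # max MM neighbors of a QM atom, padded to a multiple of 16
                   "rcut": 6.0,          # matches the dpamber --cutoff / -r radius
                   "neuron": [25, 50, 100],
                   "axis_neuron": 12,
                   "exclude_types": []   # filled with the QM-QM and MM-MM type-index pairs
               }
           ]
       },
       "fitting_net": {"neuron": [240, 240, 240]}
   }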
.. math::

   E_{i} = \begin{cases} N_{2}(D_{i}) & i \in \text{QM} \\ N_{2}(D_{i}) - N_{2}(0) & i \in \text{MM} \end{cases}

**Fitting network** \(N_{2}\): 3 hidden layers with **240** neurons per layer. See **fitting_net** → **neuron** = **[240, 240, 240]** in `simplify_DP.json`.

.. math::

   D_{i} = (G_{i1})^{T} R_{i} R_{i}^{T} G_{i2}

Atomic descriptor (\(D_{i}\)): the input layer of the fitting network (\(N_{2}\)).

.. math::

   (G_{i1})_{j} = N_{e,2}(\{s(r_{ij}), A^{j}\})

**Embedding matrix** (\(G_{i1}\)): output of the filtering network. **Filtering/embedding network** (\(N_{e,2}\)): 3 hidden layers with **25**, **50**, and **100** neurons. See **descriptor** → **neuron** = **[25, 50, 100]**.

.. math::

   A^{j} = N_{t}(\text{onehot}(\alpha_{j}))

**Type embedding matrix** (\(A^{j}\)): output of the type embedding network. **Type embedding network** (\(N_{t}\)): 1 layer with **12** neurons. See **type_embedding** → **neuron** = **[12]**.

.. math::

   \begin{align} G_{i1} &\in \mathbb{R}^{N \times M_1},\quad M_1 = 100 \\ G_{i2} &\in \mathbb{R}^{N \times M_2},\quad M_2 = 12 \end{align}

**Truncated embedding matrix** (\(G_{i2}\)): the first **12** columns of the embedding matrix (\(G_{i1}\)). See **descriptor** → **axis_neuron** = **12**.

Finally, the last input will be train_backend, which is set to tensorflow for a DP model.

.. _input2:

2.2 Generating input files for MACE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we will generate an input file for a MACE model. If you only wish to train a DP model, you may skip this section. Using trainingset.hdf5 again as input, rerun writejson_simplify.py with -m set to MACE:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ python writejson_simplify.py -h5 trainingset.hdf5 -m MACE -r 6 -os simplify_MACE.json

You will notice that simplify_MACE.json contains some new parameters in the "model" section:

.. code-block:: python

   "r_max": 6.0,
   "sel": 128,
   "num_radial_basis": 8,
   "num_cutoff_basis": 5,
   "max_ell": 3,
   "interaction": "RealAgnosticResidualInteractionBlock",
   "num_interactions": 2,
   "hidden_irreps": "128x0e + 128x1o",
   "pair_repulsion": false,
   "distance_transform": "None",
   "correlation": 3,
   "gate": "silu",
   "MLP_irreps": "16x0e",
   "radial_type": "bessel",
   "radial_MLP": [64, 64, 64]

These are the default model parameters suggested by the MACE developers, which have been shown to perform well. Here is a summary of how key parameters relate to the input file. An atomic center is represented as follows:

.. math::

   \begin{align} \sigma_{i}^{(t+1)} &= (r_{i}, \theta_{i}, h_{i}^{(t+1)}) \\ h_{i}^{(t+1)} &= U_{t}(\sigma_{i}^{(t)}, m_{i}^{(t)}) \end{align}

\(\sigma\) is a node representing the atomic center \(i\) with coordinates \(r\), attributes \(\theta\) (such as atomic number), and features \(h\). The features are updated from the message \(m\) of the previous message passing iteration \(t\). The update function \(U\) is a linear function of the message. The message is constructed by pooling over single particle basis functions \(\phi\) constructed between neighbors \(i\) and \(j\):

.. math::

   \begin{align} \phi_{i,kl_{3}m_{3}}^{(t)} &= R_{k,l_{1}l_{2}l_{3}}^{(t)}(r_{ji}) Y_{l_1}^{m_1}(\hat{r}_{ji}) T_{kcL}^{(t)}(h_{j}^{(t)}, \theta_{i}^{(t)}, \theta_{j}^{(t)}) \\ R_{k,l_{1}l_{2}l_{3}}^{(t)}(r_{ji}) &= \text{MLP}(R_{n}(r_{ji}) f_{\text{cut}}(r_{ji})) \end{align}

The single particle basis functions contain **radial basis functions**, \(R_n\), which are set to be **8 Bessel functions** multiplied by a smooth polynomial cutoff, \(f_{\text{cut}}\), of order **5**.
The **MLP** contains 3 layers, each with **64** neurons. \(Y\) are spherical harmonics, and \(T\) is a node embedding function.

- See **model** → **radial_type** = **bessel**
- **model** → **num_radial_basis** = **8**
- **model** → **num_cutoff_basis** = **5**
- **model** → **radial_MLP** = **[64, 64, 64]**

The index \(t\) is the message passing iteration, which in our input has a maximum of **2**.

- See **model** → **num_interactions** = **2**

The message is constructed as:

.. math::

   \begin{align} m_{i,kLM}^{(t)} &= \bigoplus_{j \in N(i)} \phi_{k\nu L}^{(t)}(\sigma_{j}^{(t)}, \sigma_{i}^{(t)}) \\ &= \sum_{\eta} w_{i,k\eta L}^{(t)} \sum_{\upsilon} C^{LM}_{\eta_{\upsilon}} \prod_{\xi=1}^{\nu} \sum_{j \in N(i)} \phi_{k\nu_{\xi}L}^{(t)}(\sigma_{i}^{(t)}, \sigma_{j}^{(t)}) \end{align}

- The **correlation order**, \(\nu\), is set to **3**
- The number of uncoupled **channels**, \(k\), is set to **128**
- The index \(\eta\) is for a given symmetry
- The maximum order of symmetry in the message, \(L_{\text{max}}\), is set to **3**
- See **model** → **correlation** = **3**

Using these defaults, we will consider up to 3-body interactions. The 2-body interactions are computed explicitly, and the 3-body interactions are approximated using tensor products of lower body-order interactions from previous messages. This is a hallmark of the MACE model, which allows it to consider higher body-order interactions efficiently, unlike other more expensive GNNs, which may compute them explicitly. This also distinguishes it from the Deep Potential, which only considers 2-body interactions. Finally, the site energy is

.. math::

   E_{i}^{(t)} = \begin{cases} \sum W_{\text{readout}} h_{i}^{(t)} & \text{if } t < T \\ \text{MLP}_{\text{readout}}^{(t)}(\{h_{i}^{(t)}\}_{k}) & \text{if } t = T \end{cases}

| The readout function \(W\) is a **sigmoid linear unit function (silu)** except for the last message construction, in which case the readout function is an **MLP** with a single layer of **16 neurons**.
| See **model** → **gate** = **silu** and
| **model** → **MLP_irreps** = **16x0e**.

Lastly, the training backend will now be set to pytorch instead of tensorflow. Other parameters such as the form of the loss function, learning rate, training steps, output frequencies, and model deviation bounds are common between the DP and MACE inputs.

.. _machine:

2.3 Machine files
~~~~~~~~~~~~~~~~~

In addition to the training parameter file, you will need a "machine file." This is a yaml or json file that describes how the job will be run. DP-GEN relies on a package called dpdispatcher, which can automatically transfer your files to a remote cluster, launch the jobs, and transfer the data back to your local working directory. You can also initiate the training directly from the cluster if you prefer to transfer the data after the training is complete. The job submission information, such as slurm headers, environment modules, system architecture, and parallelization strategies, is contained in the machine file to be used by dpdispatcher. If you want to automatically sync data from your local machine to a cluster, ensure that you have ssh keys set up, as dpdispatcher will require passwordless login for automated file transfers. The user must customize the machine file for the desired cluster. Below are two examples for running DP-GEN on the AMAREL cluster at Rutgers.
The first uses a local context, where all training data is already on the cluster and will remain there; the second uses an SSH context to automatically transfer data between your local machine and a remote cluster.

.. code-block:: python

   {
       "train": [
           {
               "command": "dp",
               "machine": {
                   "context_type": "LocalContext",
                   "batch_type": "Slurm",
                   "local_root": "./",
                   "remote_root": "./"
               },
               "resources": {
                   "number_node": 1,
                   "cpu_per_node": 1,
                   "gpu_per_node": 1,
                   "queue_name": "gpu",
                   "custom_flags": ["#SBATCH --mem=64G", "#SBATCH --time=0-020:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"],
                   "group_size": 1,
                   "module_list": ["pydeepmdkit/deepmdkit/default"],
                   "envs": {
                       "OMP_NUM_THREADS": 1,
                       "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                       "TF_INTER_OP_PARALLELISM_THREADS": 1
                   }
               }
           }
       ],
       "model_devi": [
           {
               "command": "dp",
               "nchunks": 4,
               "true_error": "false",
               "machine": {
                   "context_type": "LocalContext",
                   "batch_type": "Slurm",
                   "local_root": "./",
                   "remote_root": "./"
               },
               "resources": {
                   "number_node": 1,
                   "cpu_per_node": 1,
                   "gpu_per_node": 1,
                   "queue_name": "p_lbsr,gpu",
                   "custom_flags": ["#SBATCH --mem=32G", "#SBATCH --time=0-10:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"],
                   "group_size": 1,
                   "module_list": ["pydeepmdkit/deepmdkit/default"],
                   "envs": {
                       "OMP_NUM_THREADS": 1,
                       "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                       "TF_INTER_OP_PARALLELISM_THREADS": 1
                   }
               }
           }
       ],
       "fp": [
           {
               "command": "dp",
               "machine": {
                   "context_type": "LocalContext",
                   "batch_type": "Slurm",
                   "local_root": "./",
                   "remote_root": "./"
               },
               "resources": {
                   "number_node": 1,
                   "cpu_per_node": 1,
                   "gpu_per_node": 1,
                   "queue_name": "p_lbsr,gpu",
                   "custom_flags": ["#SBATCH --mem=10G", "#SBATCH --time=0-10:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"],
                   "group_size": 1,
                   "module_list": ["pydeepmdkit/deepmdkit/default"],
                   "envs": {
                       "OMP_NUM_THREADS": 1,
                       "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                       "TF_INTER_OP_PARALLELISM_THREADS": 1
                   }
               }
           }
       ]
   }
code-block:: python { "train": [ { "command": "dp", "machine": { "context_type": "SSHContext", "batch_type": "Slurm", "local_root": "./", "remote_root": "/scratch/em989/mltrain/dpgen_workdir", "remote_profile": { "hostname": "amarel.rutgers.edu", "username": "em989" } }, "resources": { "number_node": 1, "cpu_per_node": 1, "gpu_per_node": 1, "queue_name": "gpu", "custom_flags" : ["#SBATCH --mem=16G", "#SBATCH --time=1-23:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"], "group_size": 1, "module_list": ["pydeepmdkit/deepmdkit/default"], "envs": { "OMP_NUM_THREADS": 1, "TF_INTRA_OP_PARALLELISM_THREADS": 1, "TF_INTER_OP_PARALLELISM_THREADS": 1 } } } ], "model_devi": [ { "command": "dp", "machine": { "context_type": "SSHContext", "batch_type": "Slurm", "local_root": "./", "remote_root": "/scratch/em989/mltrain/dpgen_workdir", "remote_profile": { "hostname": "amarel.rutgers.edu", "username": "em989" } }, "resources": { "number_node": 1, "cpu_per_node": 4, "gpu_per_node": 1, "queue_name": "p_lbsr,gpu", "custom_flags" : ["#SBATCH --mem=16G", "#SBATCH --time=1-23:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"], "group_size": 1, "module_list": ["pydeepmdkit/deepmdkit/default"], "envs": { "OMP_NUM_THREADS": 4, "TF_INTRA_OP_PARALLELISM_THREADS": 1, "TF_INTER_OP_PARALLELISM_THREADS": 1 } } } ], "fp": [ { "command": "dp", "machine": { "context_type": "SSHContext", "batch_type": "Slurm", "local_root": "./", "remote_root": "/scratch/em989/mltrain/dpgen_workdir", "remote_profile": { "hostname": "amarel.rutgers.edu", "username": "em989" } }, "resources": { "number_node": 1, "cpu_per_node": 1, "gpu_per_node": 1, "queue_name": "p_lbsr,gpu", "custom_flags" : ["#SBATCH --mem=16G", "#SBATCH --time=1-23:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"], "group_size": 1, "module_list": ["pydeepmdkit/deepmdkit/default"], "envs": { "OMP_NUM_THREADS": 1, "TF_INTRA_OP_PARALLELISM_THREADS": 1, "TF_INTER_OP_PARALLELISM_THREADS": 1 } } } ] } One should modify the module highlighed in green. The block highlighted in cyan means the data will be transfered to the cluster for training, and results tranferred back at every training iteration. If instead you are starting with your data already on the cluster, change "SSHContext" to "LocalContext", "remote_root" to "./", and remove the remote profile information. This should be done for the train, model_devi, and fp sections. The first section, train, controls the training step, which should be run on a GPU, expecially if training a MACE potential. We will train 4 models in parallel, so each will be alotted 1 GPU. The initial weights and biases are assigned stochastically, so the 4 models will deviate. The model deviation (model_devi) is computed to estimate the uncertainty in the forces for a given sample, and if it falls between our preset values of 0.08 and 0.25 eV/A, the sample will still be availble to be selected for future training. If it falls below 0.08 eV/A, it is deemed accurate and the sample is removed from the training set. Similarly, if it falls above 0.25 eV/A, it is deemed an outlier and removed. The first principles (fp) step labels structures for which the model is uncertain with their high level energies and forces for the next round of training. In simplify, that data is already available to us, so it is only a matter of grabbing it from our training set database rather than computing it. 
If you have a large amount of data, especially for training a MACE potential, it could be advisable to use distributed data parallelism. This means the initial model parameters are broadcast across multiple GPUs, each GPU trains on a mini-batch of the data, and the gradients are synchronized so that the model replicas remain identical across the mini-batches. Here is an example splitting the data for a given model between 2 GPUs (8 total for 4 stochastic models):

.. code-block:: python

   {
       "train": [
           {
               "command": "torchrun --nproc_per_node=2 --no-python --rdzv-backend=c10d --rdzv_endpoint=localhost:0 dp",
               "machine": {
                   "context_type": "SSHContext",
                   "batch_type": "Slurm",
                   "local_root": "./",
                   "remote_root": "/scratch/em989/mltrain/dpgen_workdir",
                   "remote_profile": {
                       "hostname": "amarel.rutgers.edu",
                       "username": "em989"
                   }
               },
               "resources": {
                   "number_node": 2,
                   "cpu_per_node": 2,
                   "gpu_per_node": 2,
                   "queue_name": "gpu",
                   "custom_flags": ["#SBATCH --mem=16G", "#SBATCH --time=1-23:00:00", "#SBATCH --constraint=\"ampere|titan\"", "#SBATCH --exclude=gpu[005-006,012],cuda[001-008]"],
                   "group_size": 1,
                   "module_list": ["pydeepmdkit/deepmdkit/default"],
                   "envs": {
                       "OMP_NUM_THREADS": 2,
                       "TF_INTRA_OP_PARALLELISM_THREADS": 1,
                       "TF_INTER_OP_PARALLELISM_THREADS": 1
                   }
               }
           }
       ],
       ...

The model_devi and fp sections are not affected. For MACE models, which are based on the pytorch backend, we must use the command torchrun. The --no-python flag means that we are running the dp command as opposed to directly running a Python program. The flags --rdzv-backend=c10d and --rdzv_endpoint=localhost:0 were required in this case because the default port (29500) was already in use.

In the training base directory, create a directory called "computerprofiles" and put machine.json in it.

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ mkdir computerprofiles
   user@local:~/YOUR/WORKING/DIR $ mv PATH/2/machine.json computerprofiles

.. _simplify:

2.4 Running DP-GEN simplify
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now you can run DP-GEN simplify. Job progress will be logged to the console as well as to files called dpgen.log and dpdispatcher.log. It may be convenient to run the job from a tmux session so that you can detach and reattach for easier progress monitoring. One could also background the process and periodically check the log file. Run the following command to initiate simplify:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ dpgen simplify simplify_MACE.json computerprofiles/machine.json
   Reference
   ------------
   Please cite:
   Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E,
   DP-GEN: A concurrent learning platform for the generation of reliable deep learning based
   potential energy models, Computer Physics Communications, 2020, 107206.
   ------------
   Description
   ------------
   INFO:dpgen:start simplifying
   INFO:dpgen:=============================iter.000000==============================
   INFO:dpgen:-------------------------iter.000000 task 00--------------------------
   INFO:dpgen:-------------------------iter.000000 task 01--------------------------
   INFO:dpgen:first iter, skip step 1-5
   INFO:dpgen:-------------------------iter.000000 task 02--------------------------
   INFO:dpgen:first iter, skip step 1-5
   INFO:dpgen:-------------------------iter.000000 task 03--------------------------

Training iterations will run until either all of the samples are deemed accurate, or the data is exhausted. When the job is complete, you will see the output "INFO : finished".
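Before moving on, it can be useful to glance at the learning curves of the four trained models. Below is a minimal sketch, assuming the usual DP-GEN layout of iter.*/00.train/00?/lcurve.out and a commented header line naming the columns (the column names can vary between DeePMD-kit versions, so adjust to what your lcurve.out files actually contain):

.. code-block:: python

   # Minimal sketch: plot the training force RMSE from each model's lcurve.out.
   # Assumes iter.*/00.train/00?/lcurve.out files with a "#" header line naming the columns.
   from glob import glob
   import numpy as np
   import matplotlib.pyplot as plt

   for fname in sorted(glob("iter.*/00.train/00?/lcurve.out")):
       with open(fname) as f:
           header = f.readline().lstrip("#").split()     # column names from the header line
       cols = {name: i for i, name in enumerate(header)}
       if "step" not in cols or "rmse_f_trn" not in cols:
           continue                                      # column names differ; skip this file
       data = np.loadtxt(fname, ndmin=2)
       plt.plot(data[:, cols["step"]], data[:, cols["rmse_f_trn"]], label=fname)

   plt.xlabel("training step")
   plt.ylabel("force RMSE (eV/Å)")
   plt.yscale("log")
   plt.legend(fontsize=6)
   plt.savefig("lcurve_summary.png", dpi=300)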
.. _pretrain:

2.5 Starting from a pretrained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's consider the case where you have a pretrained model that you want to refine. In the case of simplify, this means you would have 4 existing model files from some previous training. They should be stored in the directory structure produced by dpgen. We will assume the following directory structure.

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ ls
   Foundation_model  Refined_model
   user@local:~/YOUR/WORKING/DIR $ ls Foundation_model
   iter.000000  iter.000001  iter.000002  iter.000003  iter.000004  iter.000005  iter.000006  iter.000007  iter.000008
   user@local:~/YOUR/WORKING/DIR $ ls Foundation_model/iter.000008/00.train
   000  001  002  003  data.init  data.iters  graph.000.pth  graph.001.pth  graph.002.pth  graph.003.pth

Let's say your final model files are in Foundation_model/iter.000008/00.train, and you want to train the refined model in the directory called Refined_model. Within Refined_model, you will again have a common directory, but the training dataset should correspond to the specific system with which you want to refine the training. Copy the input json file from the Foundation_model directory, modify the pick_data key to match your new hdf5 file name, and add the "training_iter0_model_path" key:

.. code-block:: python

   "pick_data": "common/MySystem.hdf5",
   "training_iter0_model_path": ["../Foundation_model/iter.000008/00.train/000",
                                 "../Foundation_model/iter.000008/00.train/001",
                                 "../Foundation_model/iter.000008/00.train/002",
                                 "../Foundation_model/iter.000008/00.train/003"],

You can now initiate simplify as you did in the previous section:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ dpgen simplify simplify_MACE.json computerprofiles/machine.json

.. _eval:

2.6 Evaluating a model
~~~~~~~~~~~~~~~~~~~~~~

Once you have completed simplify, let's check on the progress of our model. You have been provided multiple Python scripts to help you plot output files from DeePMD-kit. Let's say your training concluded after iteration 4. In this example we will compare the model deviation and parity plots for the training sets from the first and last iteration. Go to the training directory for iteration 1 and run dp model-devi and dp test, then do the same for iteration 4:

.. code-block:: bash

   user@local:~/YOUR/WORKING/DIR $ cd iter.000001/00.train
   user@local:~/YOUR/WORKING/DIR/iter.000001/00.train $ dp model-devi -m graph.*.pth -s data.iters/iter.000000/02.fp/data.000
   user@local:~/YOUR/WORKING/DIR/iter.000001/00.train $ dp test -m graph.000.pth -s data.iters/iter.000000/02.fp/data.000 -n 100 -d results
   user@local:~/YOUR/WORKING/DIR/iter.000001/00.train $ cd ../../iter.000004/00.train
   user@local:~/YOUR/WORKING/DIR/iter.000004/00.train $ dp model-devi -m graph.*.pth -s data.iters/iter.000003/02.fp/data.000
   user@local:~/YOUR/WORKING/DIR/iter.000004/00.train $ dp test -m graph.000.pth -s data.iters/iter.000003/02.fp/data.000 -n 100 -d results
   user@local:~/YOUR/WORKING/DIR/iter.000004/00.train $ cd ../../

Move the plot_modeldevi.py and plot_parity.py scripts to your working directory. First let's plot the model deviations:
.. code-block:: bash

   plot_modeldevi.py -f iter.000001/00.train/model_devi.out -f iter.000004/00.train/model_devi.out -l iter1 -l iter4
   iter.000001/00.train/model_devi.out
   1520
   Percent accurate for ftrust=0.1 is 63.6%
   Percent accurate for ftrust=0.08 is 16.6%
   Percent accurate for ftrust=0.05 is 0.0%
   iter.000004/00.train/model_devi.out
   500
   Percent accurate for ftrust=0.1 is 90.8%
   Percent accurate for ftrust=0.08 is 53.2%
   Percent accurate for ftrust=0.05 is 0.0%

.. figure:: imgs/model_deviations.png
   :alt: Figure 1

   **Figure 1.** Deviation among the 4 models after the first simplify training iteration (blue) and the last (orange).

We have gone from the 4 models agreeing for ~17% of the training data to ~53%. The size of the training set has also decreased because samples that were previously deemed accurate have been discarded. We may wish to add more data to this training to achieve higher accuracy, but for the sake of this tutorial we will continue with the current model.

3. Running QM/MM+ΔMLP simulations
---------------------------------

.. _runsim:

3.1 QM/MM+ΔMLP simulations
~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we will apply the model to run simulations with the ΔMLP correction. Both DP and GNN models perform better on a GPU compared to a CPU, but PyTorch-based GNNs have a larger performance drop-off on CPUs than the TensorFlow-based DP. Assuming the potential will be used to run new string simulations, we will modify the runString.slurm script to run on a GPU node. Here are the contents of runString_ML.slurm written for the Expanse cluster:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name="runString_ML"
   #SBATCH -o slurm/slurm-%j.%a.out
   #SBATCH -e slurm/slurm-%j.%a.err
   #SBATCH --export=ALL
   #SBATCH --partition=gpu
   #SBATCH --gpus=8
   #SBATCH --nodes=2
   #SBATCH --cpus-per-task=1
   #SBATCH --ntasks-per-gpu=4
   #SBATCH --mem=224G
   #SBATCH --account=rut149
   #SBATCH --time=2-00:00:00
   #SBATCH --constraint=lustre

   module purge
   module load dpmace/amber24/default

   export OMP_NUM_THREADS=1
   export DP_INTRA_OP_PARALLELISM_THREADS=1
   export DP_INTER_OP_PARALLELISM_THREADS=1
   export GFORTRAN_UNBUFFERED_ALL=Y
   #PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

   export LAUNCH="srun --mpi=pmi2 -K1 -N2 -n32 -c1"
   export EXE="sander.MPI"

   if [ ! -d "slurm" ]; then
       mkdir slurm
   fi

   for iter in init {01..50}; do

       if [[ "${iter}" == "init" ]]; then
           itdir=init
           echo "Running init directory"
       else
           itdir=it${iter}
       fi

       cd $itdir
       if [[ "${itdir}" == "it10" ]]; then
           echo ${itdir}
       else
           $LAUNCH $EXE -ng 32 -groupfile ../template/template.groupfile
           wait
           sleep 1
       fi

       ###################################
       #### Check for convergence failure
       ###################################
       FAILED=0
       for s in $(seq 32); do
           base=$(printf "img%02i" ${s})
           if grep -q "Convergence could not be achieved" ${base}.mdout; then
               FAILED=1
               echo "${base} failed SCF convergence!!!"
           else
               echo "SCF converged"
               continue
           fi
       done
       if [ "${FAILED}" == 1 ]; then
           echo "Detected a failed simulation"
           exit 1
       fi

       ###################################
       #### Check for any other failures
       ###################################
       echo ${PWD}
       if grep -q " 5. TIMINGS" *.mdout; then
           sleep 1
           cd ../.
           ./nextSASM.sh $iter 128
       else
           echo "run incomplete"
           exit 1
       fi
   done

The nextSASM.sh script is unchanged. The template directory should now contain graph.000.pth if you trained a GNN, or graph.000.pb if you trained a DP. The 1ps.mdin template file should now contain a dprc namelist pointing to the model file:

.. code-block:: bash

   1ps_ml.mdin
   &cntrl
     ! IO =======================================
     irest       = 1     ! 0 = start, 1 = restart
     ntx         = 5     ! 1 = start, 5 = restart
     ntxo        = 1     ! read/write rst as formatted file
     iwrap       = 1     ! wrap crds to unit cell
     ioutfm      = 1     ! write mdcrd as netcdf
     imin        = 0
     ntmin       = 1
     ntpr        = 50
     ntwr        = 50
     ntwx        = 50
     ntwf        = 0
     ! DYNAMICS =================================
     nstlim      = 1000  ! number of time steps
     dt          = 0.001 ! ps/step
     ntb         = 1     ! 1=NVT periodic, 2=NPT periodic, 0=no box
     ! TEMPERATURE ==============================
     temp0       = 298   ! target temp
     gamma_ln    = 5.0   ! Langevin collision freq
     ntt         = 3     ! thermostat (3=Langevin)
     ! PRESSURE ================================
     ntp         = 0     ! 0=no scaling, 1=isotropic, 2=anisotropic
     ! SHAKE ====================================
     ntc         = 2     ! 1=no shake, 2=HX constrained, 3=all constrained
     noshakemask = ":81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993" ! do not shake these
     ntf         = 1     ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
     ! MISC =====================================
     cut         = 10.0
     ifqnt       = 1
     ig          = -1
     nmropt      = 1
   /
   &ewald
     dsum_tol = 1.e-6
   /
   &qmmm
     qm_theory    = 'DFTB3'
     qmmask       = ':81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993'
     qmcharge     = 0
     spin         = 1
     qmshake      = 0
     qm_ewald     = 1
     qmmm_switch  = 1
     scfconv      = 1.e-10
     verbosity    = 0
     tight_p_conv = 1
     diag_routine = 0
     pseudo_diag  = 1
     dftb_maxiter = 100
   /
   &wt type = 'DUMPFREQ', istep1 = 20, /
   &wt type='END' /
   DISANG=img.disang
   DUMPAVE=img.dumpave
   LISTOUT=POUT
   LISTIN=POUT
   &dprc
     idprc=1
     mask=":81,220-224|@1-10,22-39,55-60,706-710,723-736,1969-1990,1992-1993"
     rcut = 6.0
     intrafile=""
     interfile(1)="../template/graph.000.pth"
   /

Modify the QM region, noshake mask, and dprc mask for your system. The mask in the dprc section should be your QM region. Change the file extension of the graph model file based on the type of model you trained.