User Guide

From FarmShare

Contact SRCC staff for support at: srcc-support@stanford.edu, or post questions and concerns to the community discussion list at: farmshare-discuss@lists.stanford.edu.

Connecting

Log into rice.stanford.edu. Authentication is by SUNet ID and password (or GSSAPI), and two-step authentication is required. A suggested configuration for OpenSSH and recommendations for two popular SSH clients for Windows can be found in Advanced Connection Options.
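
For example, with OpenSSH on OS X or Linux you would connect with something like the following (replace sunetid with your own SUNet ID); you'll then be prompted for your password and for two-step authentication:

 ssh sunetid@rice.stanford.edu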

Storage

Home directories are served (via NFS 4) from a dedicated file server, and per-user quota is currently 48 GB. Users may exceed this soft limit for up to 7 days, up to a hard limit of 64 GB.

AFS is accessible from rice systems only. A link to each user's AFS home directory, ~/afs-home, is provided as a convenience, but should only be used to access files in the legacy environment and for transferring data. It should not be used as a working directory when submitting batch jobs, as AFS is not accessible from compute nodes. Please note that Kerberos authentication is required to access locations in AFS; run kinit && aklog to re-authenticate if you have trouble accessing any AFS directory.

Scratch storage is available in /farmshare/user_data, and each user is provided with a personal scratch directory, /farmshare/user_data/sunetid. The total volume size is currently 126 TB; no quotas are enforced, but old files may be purged without warning. The scratch volume is not backed up, and is not suitable for long-term storage, but can be used as working storage for batch jobs, and as a short-term staging area for data to be archived to permanent storage.

Local /tmp storage is available on most nodes, but its size varies from node to node. On rice systems, /tmp is 512 GB, with a per-user quota of 128 GB. Users may exceed this soft limit for up to 7 days, up to a hard limit of 192 GB, and space is regularly reclaimed from files older than 7 days.
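
To get a rough picture of how much space you are using in each of these locations, standard tools such as du and df work fine (the paths below are the ones described above; replace sunetid with your own SUNet ID):

 du -sh ~                              # home directory usage (48 GB soft / 64 GB hard limit)
 du -sh /farmshare/user_data/sunetid   # your scratch directory usage
 df -h /tmp                            # space available in local /tmp on the current node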

Transferring Files

Since you can connect via SSH, you can use the rsync, sftp, or scp commands to upload/download files to FarmShare. On Windows, you can use software like FileZilla or WinSCP. On OS X, you can use Cyberduck or Fetch if you want a GUI. On Linux and other Unix-like OSes, just use the included rsync, sftp, or scp commands. FarmShare cannot be used for restricted or prohibited data (PHI, PII, etc.).
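
For example, from a Linux or OS X machine you could push data into your scratch directory with scp or rsync (the file and directory names here are just placeholders):

 scp mydata.tar.gz sunetid@rice.stanford.edu:/farmshare/user_data/sunetid/
 rsync -av results/ sunetid@rice.stanford.edu:/farmshare/user_data/sunetid/results/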

GitHub Repo Cloning

To avoid timeout errors when cloning a repo from GitHub, please clone the repo into your /farmshare/user_data/SUNetID directory, not your AFS space.
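
For example (the repository URL below is just a placeholder):

 cd /farmshare/user_data/SUNetID
 git clone https://github.com/example/example-repo.git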

Mounting your files elsewhere

Your AFS files can be accessed globally (literally); you just need the OpenAFS software installed. More info here: https://itservices.stanford.edu/service/afs/intro/mounting

You can make your /farmshare/user_data/ directory accessible directly from your workstation. Again, access is only allowed over SSH, so you can use something like SSHFS (via FUSE). One caveat is that SSHFS doesn't handle concurrent/parallel access, so this solution is only appropriate if you're not accessing files from several places at once. For example, don't have cluster jobs write files while you access the same files via SSHFS.

Windows: you can try ExpanDrive (formerly SFTPDrive, $39), WebDrive ($60), or Docan (free software).

OS X: try OSXFUSE or ExpanDrive (above).

Linux: you can use sshfs, e.g. on Debian (and derivatives):

  • Install: apt-get install sshfs
  • Mount: sshfs host:/mount/point /mount/point
  • Unmount: fusermount -u /mount/point

We also installed sshfs on the FarmShare machines, so you can mount files from your own machines (as long as they are accessible via SSH).
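
As a concrete example of the workstation-side setup described above, from a Linux machine you could mount your FarmShare scratch directory locally like this (the local mount point ~/farmshare is arbitrary and must exist first; replace sunetid with your own SUNet ID):

 mkdir -p ~/farmshare
 sshfs sunetid@rice.stanford.edu:/farmshare/user_data/sunetid ~/farmshare
 # ... work with the files locally ...
 fusermount -u ~/farmshare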

Data Limits

Your AFS homedir is limited to a 5 GB quota. You can request more space (temporarily) here: https://itservices.stanford.edu/service/storage/getmore

Use 'fs quota' to see your utilization, or run the command 'check-stanford-afs-quota'. You may want to do something like the following command, which will generate a timestamped file in your homedir with the sizes of your directories. This command may take a while to run, as it will stat every single file in your homedir.

 du --exclude .backup -sh * | sort -h | tee ~/`pwd | tr '/' '_'`.du.`date +%Y-%m-%d`

Your /farmshare/user_data directory is intended for data currently in use by code you are actively running on FarmShare. It is not a data repository, nor a location to use as a backup for other systems, and /farmshare/user_data is NOT backed up. As space is quite limited, and there are some thousands of active FarmShare users, we will have to implement quotas in the near future. Your usage should always be less than 1 TB.

Installed Software

Most software is installed on these systems via the package manager (run dpkg -l to list installed packages). Older licensed software is installed in AFS (typically /usr/sweet/bin). Newer software is managed by the module command. If there's any software you'd like, just let us know, and we can probably install it.
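
For example, a typical session with the module command looks something like this (MODULENAME is a placeholder; what's available depends on what has been installed):

 module avail            # list the software modules that are installed
 module load MODULENAME  # add a module's programs and libraries to your environment
 module list             # show the modules currently loaded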

Running jobs on the cluster

We use Grid Engine (formerly Sun Grid Engine, SGE). There are three types of jobs: interactive, batch, and parallel. Start by reading the man page for 'sge_intro', then the man page for 'qsub'. We currently have a limit of 3000 jobs (running and/or queued) per user. We don't currently allow interactive jobs on the barleys, because you can run interactive tasks on the corns. Job scheduling uses simple fair-share (modified by resource requirements).

Make sure you have your Kerberos credentials before submitting jobs, or else they will not be able to access your files in AFS.
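
A quick way to make sure your credentials are in place before submitting (as mentioned in the Storage section above):

 kinit && aklog   # obtain fresh Kerberos tickets and an AFS token
 klist            # verify that you now hold valid tickets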

Running batch jobs

Use 'qsub'. This will allocate one slot on the cluster. See the bottom of the qsub man page for an example. Google 'SGE qsub' for more help.
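
A minimal sketch of a batch submission, assuming a job script named myjob.sh (the script name and the command inside are placeholders for your own workload):

 #!/bin/bash
 #$ -cwd                  # run the job from the directory it was submitted from
 #$ -o myjob.out          # file for the job's stdout
 #$ -e myjob.err          # file for the job's stderr
 ./my_program input.dat   # replace with the command you actually want to run

Submit it with:

 qsub myjob.sh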

Check how much memory your job uses. You can run just one job and check its peak memory usage after it's done. The standard barley node has 24 cores and 96 GB of RAM, so you shouldn't use more than 4 GB per core. Make sure your submitted job doesn't use too much memory, or it can crash the node.

Running array jobs

For jobs that vary only by one parameter, it is easier to submit an "array" job to reduce the amount of output in qstat. If you want to be a good citizen and you're submitting an array job with thousands of tasks, you may want to limit how many tasks you run simultaneously, using the -tc parameter to qsub.
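
For example, to run 1000 tasks while keeping at most 50 running at once (myjob.sh is a placeholder script; inside it, the environment variable SGE_TASK_ID tells each task which index it was given):

 qsub -t 1-1000 -tc 50 myjob.sh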

Running parallel jobs

Use 'qsub' with the '-pe' parameter, which allows you to request more than one slot per job. We have several different "parallel environments" defined; they differ in how the slots are allocated. If you want your slots on the same node, use '-pe fah'. If you want your slots spread across nodes, use '-pe orte'. Use 'qconf -sp orte' to see the settings, and 'man sge_pe' for more info.
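
For example (myjob.sh is a placeholder; the slot counts are arbitrary):

 qsub -pe fah 8 myjob.sh     # 8 slots, all on the same node
 qsub -pe orte 16 myjob.sh   # 16 slots, possibly spread across several nodes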

Running OpenMPI jobs

See OpenMPI, and contact farmshare-discuss with any questions.

Job duration

There's a 48-hour limit on jobs in the regular queue, 15 minutes in the test queue, and 7 days in the long queue. Use '-l h_rt=xx:xx:xx' to tell the scheduler how long your job will run; your job will be killed (sent SIGKILL) when it reaches the h_rt limit you set for yourself. Your job will make it into the long queue if and only if you request "-l longq=1".

So the longest job that you can submit currently is 7 days; use "-l h_rt=168:00:00". But you should submit jobs less than 48 hours long, because there are many more regular job slots than long job slots.
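
For example (myjob.sh is a placeholder):

 qsub -l h_rt=02:00:00 myjob.sh              # regular queue, tell the scheduler you need 2 hours
 qsub -l longq=1 -l h_rt=168:00:00 myjob.sh  # long queue, maximum 7 days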

When jobs fail, you typically have to re-run them, so try to split your work into many small chunks (but not too many).

Checking on your jobs

Use the qstat command to check on your currently pending/running jobs. Use the '-M' flag to qsub if you want the system to e-mail you about your job. Look through your output files for the job's stdout and stderr streams. Use the qacct command on the machine senpai2 (because that's where the accounting file lives) to see information about jobs that have already finished, e.g. qacct -j JOBID. If there is no record of the job in qacct, it didn't get written to the accounting file, which means it failed in an unusual way. Look at your output files to see what the error was.

I usually look at the unfriendly output of this command:

 qstat -f -u '*'

You can also look at some slightly more friendly job status output. Try this script to see current memory usage per job:

 /farmshare/user_data/chekh/qmem/qmem -u 

Or look at this pie chart: http://www.stanford.edu/~bishopj/farmsharemem/ (give it a minute or two to self-update).
