Tips and Tricks
Improving IO performance on /scratch and /projects
Peregrine uses a parallel file system called Lustre for the /scratch and /projects directories. Lustre presents data from multiple storage servers (OSSs) and disks (called targets, or OSTs) as a single 'disk'. The Lustre file systems on Peregrine place each file on one OST by default. For larger files, it can be advantageous to have Lustre stripe the file across multiple storage servers. The "lfs" command can be used at the file or directory level to tell Lustre to stripe across more than one storage server.
Example:
lfs setstripe --count 10 <directory>
This tells Lustre to stripe any new file created in <directory> across 10 storage servers.
Files smaller than 1 gigabyte are probably best left with a stripe count of one. Files larger than 1 gigabyte could benefit from striping. As a starting point, try setting the stripe count to the number of gigabytes you anticipate the file will be; the example above is a good place to start for a 10 gigabyte file. There is a limit, though: /projects has 54 OSTs and /scratch has 108 OSTs.
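To check the striping currently applied to a directory or file, the companion lfs getstripe command can be used. A brief illustration (the path is just an example):
lfs setstripe --count 10 /scratch/$USER/large_output
lfs getstripe /scratch/$USER/large_output
The getstripe output reports the stripe count and the OSTs used, which is useful for confirming that new files are being striped as intended.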
Here is a link to a page at NERSC which explains this in more detail: https://www.nersc.gov/users/storage-and-file-systems/optimizing-io-performance-for-lustre/
A particular item to note from the NERSC page is:
File striping will primarily improve performance for codes doing serial IO from a single node or parallel IO from multiple nodes writing to a single shared file as with MPI-IO, parallel HDF5 or parallel NetCDF.
Peregrine: Resetting Passwords
NREL HPC Passwords expire every 180 days, so please be cognizant and update your password as required. If you use SSH keys for login, please note that you still must change your password regularly or your account may be disabled.
If you forget your password, you may request a reset by emailing us at hpc-help@nrel.gov. For verification purposes, please make sure to email us from the address we have on file and include your username and a phone number in case we need to contact you. When your password reset is issued you will receive an email from us with a randomized temporary password. Use this password to log into Peregrine (or your authorized HPC system) and follow the prompts to create your new password.
We recommend storing your password in a secure password safe, such as KeePass.
For more detailed information regarding password requirements, update, and reset procedures please visit this page: https://hpc.nrel.gov/users/passwords
Using alloc_tracker to manage your allocation on Peregrine
To manage your project allocation, we provide a tool called alloc_tracker, which reports several things, including the number of node hours allocated and the number used. The alloc_tracker command will show you how many node hours are left in your allocation, how many node hours will be forfeited at the end of the current allocation quarter if they are not used, and how many node hours have been forfeited over the allocation year. (Note that our allocation year runs from November 1 to October 31.) alloc_tracker can also be used to find out how much space you've used in both user-level and project-level directories. There are several options available to get more detailed information, including how many node hours each member of the project used. Use the --help option on the alloc_tracker command to see what information is available.
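As a quick illustration, running the command with no arguments should print a summary for the projects you belong to, and --help lists the additional options (exact output and option names may vary by version):
alloc_tracker
alloc_tracker --help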
28-day Automatic File Purge on /scratch starting November 1, 2016
HPC Operations currently purges files that have not been accessed for more than 90 days on the Peregrine /scratch file system. Files that have not been read in more than 90 days are eligible for automatic deletion. However, starting November 1st, files that have not been accessed in more than 28 days will be eligible for deletion. Plan your HPC workflow to make copies of important data to either the /projects file system or to /mss storage for long-term archiving.
Remember that the /scratch file system is NOT backed up; once files are deleted, they cannot be recovered. If you would like assistance, please send an email to hpc-help@nrel.gov to create a service request ticket.
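To see which of your files are at risk under this policy, a simple check is (the path is just an example):
find /scratch/$USER -type f -atime +28
This lists files under your scratch directory whose last access time is more than 28 days ago, so you can copy them to /projects or /mss before they become eligible for purging.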
Scratch is Scratch!
The /scratch file system on Peregrine is named scratch because it is intended for data that is only needed temporarily, while your jobs are running. It’s like scratch paper! It is not backed up and your files may be deleted (purged) if the /scratch file system gets too full. Which files are deleted is based on last access time — we may delete any files that haven’t been used in 28 days. While auto-purging is NOT being done now, depending on how fast new data is created in scratch, we may need to delete files with little warning to ensure that the file system remains usable. Delete files you don’t need and move the files you’re not actively using to /mss so that you don’t lose important data!
2016 Peregrine User Satisfaction Survey Results
The results of the 2016 Peregrine User Satisfaction Survey are now available. To maintain anonymity, we have removed all demographic information and anything that might identify respondents. Thank you to everyone who responded.
As a result of the survey, we will address several of the concerns in our FY17 planning activities. Some of the major themes that came up and the plans we have to address those issues include the following:
Running more jobs that use more cores:
We received feedback that people would like to be able to run more jobs that use more than 10,000 cores. Unfortunately, the topology of the network that connects nodes on Peregrine limits the number of nodes per job. The system consists of collections of nodes that are connected to each other in a non-blocking fat-tree topology. Each of these groups of nodes is called a "scalable unit" or SU. Because the network that connects the SUs to each other and to the file systems has fewer connections, communicating between nodes in different scalable units is relatively slow. As a result, we restrict jobs to run using nodes that are all in the same scalable unit.
The initial Peregrine system was built from scalable units that contain 288 nodes. To allow jobs that use more than 288 nodes, when we expanded Peregrine in 2015 we acquired two scalable units that each have 576 nodes. The "batch-h" queue contains one full SU, which allows jobs to use up to 576 nodes (which corresponds to 13,824 cores). The nodes in the other SU are split between the batch-h queue (306 nodes), the large queue (262 nodes), and the Haswell queue (8 nodes). We could change how we distribute these nodes among queues to allow two simultaneous very large jobs, but on Peregrine the maximum number of cores that can be used by a single job is 13,824. We expect that the system that replaces Peregrine will have the capability to run single jobs at larger scale.
New OS Image:
We plan to invest resources in creating an updated OS image for Peregrine during FY17. This represents a fairly large investment in resources to build, configure, and test the OS, interconnect software, toolchains, and all compiled software. We expect this OS image to last the remaining life of Peregrine, which will be phased out starting in 2019 (original hardware) and 2020 (expansion hardware).
SW Environment:
The Peregrine software environment is necessarily complex due to the need to support a wide, changing, and growing diversity of disciplines, packages, workflows, and users. In addition, the connections between system-level software and runtime libraries create a tension between the dual, and dueling, requirements of keeping the system stable and making ever-advancing application software easy to use. The process we have followed is to ensure system stability on the one hand by keeping system-level software and default toolchains constant, while on the other hand making updated toolchains, libraries, and applications available as users need them.
As individual and project needs arise, we are able to work with you to enable your workflows, either through monthly office hours or by special request. An e-mail to hpc-help@nrel.gov is often all that’s needed to discover or acquire the tools you need. We will also be working to update the system software and associated default login environment as user needs require and our capacity permits.
To view the full report please go here: http://cs.hpc.nrel.gov/info/hpc-oversight/hpc-survey-results/2016-nrel-historical-detailed-summary-report-2016.pdf/view
The environment modules lifecycle
On Peregrine, “modules” are used to customize your runtime environment by loading and unloading modules that are needed to accomplish the task at hand. To keep the application environment manageable, we have several different collections of modules, each serving a different purpose. Each collection is in a different location in the /nopt file system:
/nopt/Modules/3.2.10/modulefiles : This collection is maintained in parallel to the application modules, and provides longer-term access to toolchains, key development tools (e.g., low-level libraries like MKL, debuggers, and profilers), and system-level tools like Nitro or Robinhood.
/nopt/nrel/apps/modules/candidate/modulefiles : Modules undergoing testing. While these usually enable the latest versions of toolchains, middleware, and applications, they are subject to modification or deletion at any time.
/nopt/nrel/apps/modules/default/modulefiles : These are the modules “in production,” and are visible to all users by default. They should work correctly, but if you find one that doesn’t, e-mail hpc-help@nrel.gov to let us know.
/nopt/nrel/apps/modules/deprecated/modulefiles : As versions advance, we periodically clean up the production collection to keep software selection relatively up-to-date. Older versions are moved to this collection to provide users a period of time to migrate their workflows to newer versions if needed.
/projects/$PROJECT/modules/default/modulefiles : It is not widely appreciated, but anyone can create a modules collection themselves. For a project where multiple members need to have access to a single environment, their /projects/$PROJECT directory can host a modules collection that can be freely administered by one or more project members.
/home/$USER/modules/default/modulefiles : Analogously to modules for a project, an individual $USER can keep their own collection of modules and administer them freely.
To see modulefiles in any location, you need only add that location to your MODULEPATH environment variable. That can be done easily via
module use /path/to/collection
which puts /path/to/collection at the head of MODULEPATH, so modules in this collection take loading priority over those with the same name in other collections, or
module use -a /path/to/collection
which appends /path/to/collection to the end of MODULEPATH, so modules with the same name in other collections take loading priority over modules in this collection.
For module collections that you want persistent access to, the above two commands can be put into your shell startup script (e.g., .bash_profile for bash).
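As a sketch of how a personal collection might be set up (the mytool name and paths are purely illustrative):
mkdir -p /home/$USER/modules/default/modulefiles/mytool
cat > /home/$USER/modules/default/modulefiles/mytool/1.0 << EOF
#%Module1.0
prepend-path PATH /home/$USER/apps/mytool/1.0/bin
EOF
module use /home/$USER/modules/default/modulefiles
module load mytool/1.0
After this, mytool/1.0 appears in your module avail output, and the same pattern works in /projects/$PROJECT/modules for project-wide collections.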
Peregrine Cooling Systems - The Racks
[Image of a Peregrine rack; copyright Hewlett Packard Enterprise]
Each Peregrine rack contains 72 dual-node trays and uses water to keep the components cool. Some racks, called Cooling Distribution Units (CDUs), are dedicated to providing a clean and consistent water supply to the compute racks. Running vertically in the center of each rack is a device called the water wall, which directly contacts the node heat pipes to transfer heat to the isolated server cooling loop. Water constantly flows through the water wall to remove the heat generated by the nodes.
Each CDU contains a water pump, vacuum pump, and liquid-to-liquid heat exchanger, with connections to the isolated server cooling loop and facility water loop. The CDU regulates the supply water to the racks to maintain a consistent temperature into the racks and out to the facility water loop through the heat exchanger. The CDU also holds a vacuum on the isolated server cooling loop so that if a leak develops, air is drawn into the loop and detected rather than releasing water.
In the next installment of this series, we will elaborate on the facility water loop and energy recovery systems.
Troubleshooting Peregrine job failures: Part 1
If you are having trouble with Peregrine jobs failing and would like assistance troubleshooting them, gather as much information as you can about the job, particularly the job ID number. A common reason for job failures is nodes running out of memory, at which point the Linux "out-of-memory" killer starts to terminate processes. Processes are not terminated in a predictable manner, and the nodes are configured to reboot on kernel panic or out-of-memory conditions. The nodes may leave clues in the system logs about why they rebooted. The job exit code recorded in the output from the job may also provide information; however, the codes are not always an accurate reflection of the cause of failure.
Here are links to standard exit codes we have seen on Peregrine:
The Linux Documentation Project
Adaptive Computing (Moab Exit Codes)
McGill University
For further assistance, send the job ID number, if possible, when you open a request at hpc-help@nrel.gov. If you would like assistance configuring your jobs to gather information, such as placing debug statements in the job script or recording the name of the job and the corresponding job ID number, send your job scripts in with the request.
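As a rough sketch of the kind of debug statements that can help (the resource requests and application name below are placeholders), a job script might record the job ID, host, available memory, and exit status:
#!/bin/bash
#PBS -l nodes=1:ppn=24
#PBS -j oe
echo "Job $PBS_JOBID starting on $(hostname) at $(date)"
free -g
./my_application
echo "my_application exited with code $? at $(date)"
Including the reported job ID and exit code in your ticket makes it much easier to match your job against the system logs.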
Peregrine Cooling Systems - The Nodes
The High Performance Computing environment at NREL represents a departure from traditional air-cooled datacenter practices in an effort to maximize efficiency even with high power-density environments, such as the Peregrine HPC cluster.
Each Peregrine tray contains two nodes that use heat pipes to transfer heat energy from the CPU and memory (which generate the majority of the heat) to the edge of the case. The heat is then transferred through thermal bus bars (specially designed interconnects) and into the water wall, where the cooling water circulates. We'll have more on the thermal bus bar and water wall assembly in part 2 of this series. Pulling the heat out of the tray eliminates the need to introduce cooling water into the tray with flexible tubing and interconnects, which can make service easier.
Air still flows through the trays inside the sealed racks to cool the other components, but even that air is cooled and its heat captured by an air-to-water heat exchanger within the racks.
There are a number of advantages we gain by cooling with water—specifically warm water up to 80°F/27°C. Water is able to absorb and remove much more heat energy than air per unit of volume. This allows us to still provide adequate cooling while using less energy running pumps than fans would require. It also affords the ability to use more economical cooling systems, such as evaporative cooling. Finally, the waste heat we capture is reused for other needs, like heating our building.
In the next segment we’ll take a closer look at the rack-level cooling systems for Peregrine.
Tips for Using Peregrine Job Queues
Use the ‘shownodes’ command to find what node types are in what queues.
If you have a parallel job that needs a certain number of cores but not a set number of nodes, consider submitting it with -l procs=x instead of -l nodes=y:ppn=z.
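For example, a request for 64 cores that lets the scheduler choose the nodes might look like the first command below, rather than pinning the layout as in the second (the walltime is only illustrative):
qsub -l procs=64 -l walltime=4:00:00 myjob.sh
qsub -l nodes=4:ppn=16 -l walltime=4:00:00 myjob.sh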
NREL HPC VPN
You can now connect to the NREL HPC resources using the NREL HPC VPN implementation. The NREL HPC VPN lets you access Peregrine and Gyrfalcon (mss) as well as a few other systems in NREL’s High Performance Computing datacenter. Key features provided by the VPN are file transfer support and the ability to run applications on Peregrine that use graphical user interfaces.
All current Peregrine users should have received an email message with instructions for setting up your OTP token generator, which is needed to connect to the VPN. If you have not already set up your OTP token, please follow the instructions here:
https://hpc.nrel.gov/users/connect/otp/otp-setup
Instructions for connecting to the NREL HPC VPN can be found at
https://hpc.nrel.gov/users/connect/vpn
If you are having difficulty with access, please send a message to hpc-help@nrel.gov.
COBRA Toolbox on Peregrine
The COBRA software is now available on Peregrine. The COBRA toolbox for MATLAB and the COBRApy Python modules enable constraint-based modeling and analysis of biochemical networks. More information can be found using this link:
http://hpc.nrel.gov/users/software/applications/cobra
Will my job run twice as fast if I use the Haswell nodes?
Haswell is the code name for the latest generation of Intel Xeon chips. Peregrine has 1152 new nodes, each with two 12-core Haswell processors. The peak performance of this processor is twice that of the Ivy Bridge Xeon processors. Will your code run twice as fast?
Haswell has a new instruction set called AVX2. This provides FMA (fused multiply-add) support, which computes a*b + c in one step rather than first computing a*b and then adding c. This doubles the peak number of floating-point operations that can be done on each clock cycle relative to the older Sandy Bridge and Ivy Bridge Xeon chips. The integer vector instructions have been extended from 128 bits to 256 bits, so twice as much work (for example, calculating absolute values of integers in an array) can be done per cycle. Also, "gather" support has been enhanced so that vector elements can be loaded from non-contiguous memory locations.
-> If your code is well vectorized or spends most of its time in well-vectorized math library functions that can use the new AVX2 instructions, you may see substantially increased performance on Haswell relative to Ivy Bridge. Linear algebra, such as dot products or matrix-matrix multiplication, can often use FMA instructions. Taking advantage of the new instructions requires an updated application binary or math library that has been built to use them (see the rebuild example below).
-> If your code is not vectorized, you won’t see any performance increase due to the new vector instruction set.
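As a sketch of what a rebuild might look like with the Intel compilers (the module name, flags, and file names shown are examples; check module avail for the compiler modules actually installed):
module load comp-intel
icc -O2 -xCORE-AVX2 -o myapp myapp.c
To produce a single binary with code paths for both Ivy Bridge and Haswell, the -ax form can be used instead:
icc -O2 -axCORE-AVX2 -xAVX -o myapp myapp.c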
You are likely to see a performance increase from other improvements Intel made in the chip. Haswell delivers about 10% higher instructions per clock cycle due to improvements in branch prediction, larger and deeper buffers, higher bandwidth L1 and L2 caches and higher memory bandwidth. No changes to the application binary are needed to access increased performance from these improvements.
A new feature called "uncore frequency scaling" allows chip components that are outside the processor cores to scale their frequency up or down depending on the nature of the code being executed. This includes the L3 cache and the on-chip network that connects the cores to each other and to the memory controllers. As a result, applications that are bound by memory and cache latency rather than by available arithmetic units can drive the "uncore" faster (without increasing the core frequency), which leads to better performance.
An NREL technical report with additional information about Haswell is available at http://www.nrel.gov/docs/fy16osti/64268.pdf.
What does "use or lose allocation" mean?
Allocations of node hours are segmented by allocation cycle quarters. A summary of these quarters is included below:
Quarter 1: Nov 1, 2015 - Jan 31, 2016
Quarter 2: Feb 1, 2016 - Apr 30, 2016
Quarter 3: May 1, 2016 - Jul 31, 2016
Quarter 4: Aug 1, 2016 - Oct 31, 2016
Node hours designated for a specific allocation cycle quarter must be used during that quarter. Any node hours that go unused by the close of an allocation cycle quarter will be forfeited.
The quarterly node hour plan is not a limit. Projects may use node hours ahead of the quarterly plan until the annual allocation is fully used.
My job crashed - where to look for clues about what happened
There are a few places you can look for information about a job that didn't run properly.
By default, Peregrine creates two log files in the directory from which the job was submitted (qsub'ed).
<job_name or submit_script_name>.o<job_id> contains standard output information
<job_name or submit_script_name>.e<job_id> contains standard error information
These two logs are the first place to look for information about why a job failed to run properly. Next, many applications have their own logs that may hold additional clues.
If you need help understanding information in the log files, or if there's just nothing in the log files that helps you understand why it didn't run properly, you can open a ticket by sending email to hpc-help@nrel.gov. When you open a ticket, please provide the following information:
- directory where the job was started (often where the submit script is located)
- directory where job logs appear
- job_id number(s)
With this information, we can investigate whether a system problem likely contributed to the job failure.
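If the scheduler still has a record of the job, the following standard Torque/Moab commands may also be informative, though what they report for completed jobs depends on site configuration:
qstat -f <job_id>
checkjob -v <job_id>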
Using Nitro to efficiently run lots of short jobs or tasks
Nitro facilitates the execution of large numbers of short compute tasks without the overhead of individual scheduler jobs. If you have a workload of 10,000 High Throughput Computing (HTC) tasks, each of which has a very short runtime, this would traditionally mean submitting 10,000 separate jobs. Using Nitro, you can instead submit a single "Nitro job".
You combine all of the compute tasks in a single file. When the job starts, this file is then sent to Nitro and Nitro distributes the compute tasks across the processors on the nodes allocated to the job. This provides low scheduling overhead and improved response time for executing many small compute tasks.
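As a loose illustration of the idea (the exact task file format is described in the documentation linked below), each line of the task file is one short command to run:
./analyze input_0001.dat
./analyze input_0002.dat
./analyze input_0003.dat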
You can find more information on using Nitro at http://cs.hpc.nrel.gov/info/how-tos/nitro-2.0.
Windows HPC – How to Request Access
Used by some NREL internal projects, WinHPC accommodates clustered computing needs within a Windows environment using applications such as CoolCalc, EnergyPlus, and MATLAB. While vastly smaller in scale than Peregrine, it is specifically available to departments that have funded portions of the environment. If you need to use the WinHPC system, please email hpc-help@nrel.gov to begin the process.
More detailed information is available at https://hpc.nrel.gov/users/systems/winhpc/getting-started.
Using Globus to Move Large Data Sets
If you need to copy large datasets (>100 GB) over the WAN, then Globus is the right tool for the job. Globus provides services for transferring large datasets and enables you to quickly, securely, and reliably move your data to and from locations/endpoints you have access to. There is a Globus endpoint in the ESIF datacenter (nrel#globus) that allows you to copy files using Globus to /scratch on Peregrine. If you need to copy files to your laptop/desktop using Globus, you can use Globus Personal to make your desktop/laptop a Globus endpoint. To share files from your Globus Personal endpoint you will need a Globus Plus account; to request a Globus Plus account, email hpc-help@nrel.gov.
You can find more information in the using Globus and file transfer best practices documentation.
ESIF HPC Energy Graphic
The ESIF HPC Data Center is designed to be the world's most energy efficient. You can see the current energy use by visiting: http://hpc.nrel.gov/COOL/
The ESIF data center regularly exceeds the facility design objective of achieving a Power Usage Effectiveness (PUE) of less than 1.06. Data center waste heat is also used to heat ESIF and, at times, the campus district heating loop.
How do we pick system time?
HPC Operations staff try to schedule downtimes no more than quarterly. Whenever possible we coordinate building outages (cooling and electrical), hardware outages (major repairs and upgrades), and software changes so all can be accomplished during a single downtime.
You will see the planned dates for the next scheduled downtime included each Monday with the weekly announcement. If you have a deadline coming up that could conflict with the scheduled system time, please contact us as early as you can. If we have flexibility we'll gladly shift dates.
We received feedback that the timing of the week-long downtime in September was problematic for projects needing to finish deliverables. We will avoid September system times in the future if at all possible.
Projects That Exceed Storage Allocation
If a project has exceeded its allocation of storage on the /projects filesystem, we will notify the users associated with the project and request that they move excess data to other filesystems. We reserve the right to take action by moving or deleting the most recent data if users associated with the project are unresponsive and the filesystem is in danger of becoming unusable.
New Storage System Coming Soon
Obsidian is a new storage system that is currently in the testing phase and is planned to be available in late November. Obsidian will be usable in a similar fashion to Gyrfalcon (Mass Storage) and will be accessible from the Peregrine login nodes using the paths /obs/users/${username} and /obs/projects/${projectname}. The storage system is better optimized than our current filesystems to handle a large number of files (millions) and is able to scale out as our storage needs change over time. Obsidian is configured with data protection mechanisms to prevent data loss; however, it is not being backed up. This means there are mechanisms in place to prevent data loss from hardware failures or disk corruption, but they do not protect against accidental file deletions or provide the ability to restore files from a previous point in time.
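Once Obsidian is available, copying data to it should look much like any other copy to a mounted path; for example (paths illustrative):
cp -r /scratch/$USER/results /obs/projects/${projectname}/results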
GLOBUS is now available for file transfer
Globus allows you to quickly, securely, and reliably move large amounts of data over the Wide Area Network. If you are transferring large datasets (>100 GB) to or from outside the datacenter, Globus will provide better performance than rsync or scp. For more information on using Globus, go to http://hpc.nrel.gov/users/systems/globus-services.
Permissions Policy Update
There is an update to the policy for permissions on the /scratch filesystem. World-writable permissions for user scratch directories will no longer be allowed. If you want to allow other people to look at files and directories inside /scratch/$USER, the permissions r-x for "other" are sufficient. You may also allow other users to write, alter, or delete files in sub-directories beneath /scratch/$USER, but keep the consequences in mind. This applies to /home/$USER as well. The $USER reference in the examples above represents your user name.
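For example, to let others browse your scratch directory without being able to modify it, and to clear any world-write bit that may have been set:
chmod o+rx /scratch/$USER
chmod o-w /scratch/$USER
The same commands apply to /home/$USER.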
Cyber Tip: Save your work!
Part of cyber security is a concept called "risk mitigation". Part of mitigating risk is having a backup of your laptop, so you don't lose weeks' or years' worth of work when it crashes. I quote from NREL's Cyber Security Standards doc: "Always have a local backup here at the lab."
Most of the Scientific Computing group have MacBooks. Since there's no good OS X backup solution deployed at NREL yet, here are a couple of options:
* Get a backup drive
Drives are cheap; you can get a 2 TB external for less than $150. Time Machine comes with OS X, is simple to set up, and is easy to use. Leave it at your desk and plug it in while you're working. Lock it up at night if you're extra paranoid. (If you're geeky, get a copy of Carbon Copy Cloner.)
* Use git to store your code. If your laptop dies, your code is somewhere safe.
* Use the IS personal network drive. You can drag-n-drop files into the network share. It's crude, but it'll work in a pinch.
To use your personal drive space:
Go to the Finder, command-k, smb://xhomea/home/<YOURID>
To get to the Scientific Computing Center's shared drive, to share files with others:
Go to the Finder, command-k, smb://xshareb/2C00/
If you're into self-flagellation-like activities, you may want to read more about NREL's cyber security standards. This document would be a good starting point: http://thesource.nrel.gov/is/pdfs/cyber_security_standards.pdf