Coming soon to Fifemon: job resource monitoring

Wondering why your jobs have been put on hold? Want to set your resource requests more accurately, so that you use the grid more efficiently and your jobs start faster? This information, and more, is coming soon to Fifemon, the FIFE monitoring application, and is already available for testing in pre-production (https://fifemon.fnal.gov/monitor-pp/).

Starting with your User Batch Details dashboard, you can see which jobs have been put on hold and why, as well as a complete listing of job clusters currently in the system. The table shows the maximum resources used alongside how much was requested. A cell highlighted in red means the job has exceeded its request and has been put on hold; it requires intervention to either decrease the amount the job uses or increase the request. Please contact FIFE Support through ServiceNow if you need assistance.

fifemon_job_table.png
fifemon_held_jobs.png
https://fifemon.fnal.gov/monitor-pp/dashboard/db/user-batch-details?var-user=mu2epro

Click on a cluster number, and you’ll be taken to the Job Cluster Summary dashboard showing a variety of information about that job cluster: request parameters and resources, number of processes in each state, resources used by running and completed processes, and a timeline of Condor events.

fifemon_cluster_summary.png
https://fifemon.fnal.gov/monitor-pp/dashboard/db/job-cluster-summary?var-cluster=7186322&from=1454391283934&to=1454434483934
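If you would rather jump straight to a dashboard than click through, the two example links above can be rebuilt with a short script. The following is only a minimal sketch based on the shape of those example URLs: the function names are hypothetical, and it assumes that the `from` and `to` parameters are timestamps in milliseconds since the Unix epoch, which is what the example values appear to be. The dashboard paths may change before the production release.

```python
from urllib.parse import urlencode

# Base path taken from the example pre-production links in this article.
BASE = "https://fifemon.fnal.gov/monitor-pp/dashboard/db"


def user_batch_details_url(user):
    """Build a User Batch Details link for one user (hypothetical helper)."""
    return f"{BASE}/user-batch-details?{urlencode({'var-user': user})}"


def job_cluster_summary_url(cluster, start_ms=None, end_ms=None):
    """Build a Job Cluster Summary link for one cluster (hypothetical helper).

    start_ms and end_ms are assumed to be epoch timestamps in milliseconds;
    the time range is only added when both are given.
    """
    params = {"var-cluster": cluster}
    if start_ms is not None and end_ms is not None:
        params["from"] = start_ms
        params["to"] = end_ms
    return f"{BASE}/job-cluster-summary?{urlencode(params)}"


if __name__ == "__main__":
    print(user_batch_details_url("mu2epro"))
    print(job_cluster_summary_url(7186322, 1454391283934, 1454434483934))
```

Run as a script, this prints links of the same form as the two examples shown in this article; substitute your own user name and cluster number.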

We invite you to test these new features in pre-production and to provide feedback. We hope you find this information useful for tracking your job progress and for better understanding the resources your jobs are using. As a general rule of thumb, the higher your resource requests, the longer it will take your jobs to start running.

- Kevin Retzke