Service Updates - Monitoring: Why are my jobs not running?
In the course of using a batch system, it is inevitable that users will ask this question. Possible responses include opening a SNOW or GOC ticket, sending an email, pinging someone on IM, or dispatching a carrier pigeon to ask the admins to look into it. This quickly becomes untenable in a system such as Fifebatch, with hundreds of users and tens of thousands of jobs running simultaneously. Enter Fifemon, the near-real-time monitoring system for Fifebatch, where users can see at a glance the status of their jobs, their experiment’s jobs, and the Fifebatch system as a whole, including local and offsite grids. From this assemblage of data, users can quickly work out why their jobs aren’t running and what action might get them running sooner. At least that’s the vision, one we are quickly developing into reality.
Photo from Fifemon of the Mu2e experiment on 9/30/15: https://fifemon.fnal.gov:3000/dashboard/db/experiment-batch-details?var-experiment=mu2e
As this newsletter goes to press, the Fife team is in the process of releasing the next evolution of Fifemon, which runs on the open-source Graphite and Grafana packages. Adopting this framework allows us to focus our effort on what data is collected and how it is presented, and to iterate rapidly on both. We are tracking over fifty thousand metrics related to the Fifebatch system, a number that increases every week, and we are constantly exploring new and better ways to present this data meaningfully. We plan eventually to open the system up so that you can develop your own views and focus on the data that you care about! In the meantime, please explore the dashboards and graphs, and let us know what else you want to see (please don’t use carrier pigeon; SNOW tickets are less messy).
- Kevin Retzke