BlueArc unmounting from GPGrid nodes - So long, and thanks for all the files
A long time ago, in a cluster far, far away, it was a period of rebellion against the limitations of local batch clusters.
In 2009, the 3,000 cores of the GP Grid Farm were a vast improvement over the 50-core FNALU batch system. GPGrid was connected to the then-new BlueArc data system with a 2-gigabit network link. A simple lock system deployed in late 2009, and still in use today, avoided head contention on the underlying BlueArc data system, improving uptime in 2010 from 97 percent to 99.9997 percent. Deployment of the IFDHC tools as new projects came online kept uptime high, at 99.95 percent in 2014.
But there are new issues that locks cannot fix. Our BlueArc servers have a service capacity of about 1 gigabyte per second. A single GPGrid worker node now has that much capacity. We are now running as many as 30,000 user processes on FermiGrid, sustaining over 3 gigabytes per second locally. The dCache storage elements deployed in 2015 can handle this load. BlueArc cannot.
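The size of the gap can be worked out from the figures quoted above. The following is only a back-of-the-envelope sketch in Python: it assumes decimal units (1 gigabyte = 10^9 bytes), treats "over 3 gigabytes per second" as exactly 3, and supposes the whole sustained load were pointed at BlueArc. The function and constant names are illustrative and do not come from any FermiGrid tool.

```python
# Figures quoted in the text (decimal units assumed: 1 GB = 1e9 bytes).
BLUEARC_CAPACITY_BPS = 1e9   # about 1 gigabyte per second
SUSTAINED_LOAD_BPS = 3e9     # "over 3 gigabytes per second", taken as a lower bound
USER_PROCESSES = 30_000      # peak number of concurrent user processes


def per_process_rate(total_bps: float, processes: int) -> float:
    """Average bytes per second available to each process."""
    return total_bps / processes


def oversubscription(load_bps: float, capacity_bps: float) -> float:
    """Ratio of offered load to service capacity (above 1 means overloaded)."""
    return load_bps / capacity_bps


if __name__ == "__main__":
    observed = per_process_rate(SUSTAINED_LOAD_BPS, USER_PROCESSES)
    bluearc_share = per_process_rate(BLUEARC_CAPACITY_BPS, USER_PROCESSES)
    ratio = oversubscription(SUSTAINED_LOAD_BPS, BLUEARC_CAPACITY_BPS)
    print(f"Observed average per process: {observed / 1e3:.0f} kB/s")
    print(f"BlueArc share per process:    {bluearc_share / 1e3:.1f} kB/s")
    print(f"Load / BlueArc capacity:      {ratio:.1f}x")
```

Under these assumptions each process averages about 100 kilobytes per second, while BlueArc could offer each of them only about a third of that, which is a shortfall of bandwidth and not something locking can recover.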
Figure: GP Grid networking throughput
We need to proceed this year with the BlueArc Unmount process described in http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5522 and https://cdcvs.fnal.gov/redmine/projects/fife/wiki/FermiGridBlue. We need to go further, removing even GridFTP access to BlueArc data. See https://cdcvs.fnal.gov/redmine/projects/fife/wiki/FGB-DATASCHED. We will be contacting Liaisons to schedule the data area unmounts.
The existing BlueArc data areas remain a valuable resource for interactive work, where full POSIX file access may be needed.
- Arthur E. Kreymer