In the early days of working for SHARCNET, my colleague and I decided to standardize how cluster metrics were computed across our internal data frames. As mentioned in a previous post, part of the solution was pandas.
The second part was figuring out how to distribute the package so that others could contribute to it, as well as install it on their own HPC clusters. Some quick searching revealed, of course, that PyPI and pip were the way to go.
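For the curious, a minimal setuptools configuration is roughly all it takes. The name and metadata below are hypothetical placeholders rather than what we actually shipped, but the shape is the same:

```python
# setup.py -- a minimal sketch of a pip-installable package.
# The project name and version here are hypothetical.
from setuptools import setup, find_packages

setup(
    name="cluster-metrics",        # hypothetical project name
    version="0.1.0",
    packages=find_packages(),      # pick up every package under the repo root
    install_requires=["pandas"],   # the dependency discussed above
    python_requires=">=3.6",
)
```

With that in place, `pip install .` gets collaborators a local copy, and the usual build-and-twine workflow handles publishing to PyPI.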
Coming from a slightly different angle this time, I found that researchers were often limiting themselves to less (strictly speaking, fewer!) resources on HPC systems by not investigating what the mixture of node features looked like.
As such, this talk was created to help direct otherwise abstract development efforts towards the feature sets that are most widely available on an HPC cluster.
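As a taste of what the talk covers, here is a rough sketch of how you might tally that feature mixture yourself. It assumes a Slurm cluster (where `sinfo` is available); the rest is just counting:

```python
import subprocess
from collections import Counter

# Rough sketch, assuming a Slurm cluster: tally how many nodes
# advertise each feature. In sinfo's format string, "%D" is the
# node count and "%f" the feature list; "-h" suppresses the header.
out = subprocess.run(
    ["sinfo", "-h", "-o", "%D %f"],
    capture_output=True, text=True, check=True,
).stdout

features = Counter()
for line in out.splitlines():
    count, _, feats = line.partition(" ")
    for feat in feats.split(","):
        features[feat.strip()] += int(count)

for feat, nodes in features.most_common():
    print(f"{feat}: {nodes} nodes")
```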
Below is my abstract for the talk as well as the recording:
Eventually I got to the point in data analytics where keeping things in lists, or lists of lists, was no longer quite cutting it. My processing was slowly grinding to a halt, and things were getting way too abstract.
I decided to call up a friend who had worked in the business longer than me, and they suggested “pandas”. I was vaguely familiar with it, as users/clients had used it in the past.
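If you have never made that jump yourself, the difference is easy to show. A minimal sketch, with made-up job-record data standing in for whatever you happen to be tracking:

```python
import pandas as pd

# What I had: rows of records as lists of lists (values are made up).
rows = [
    ["job-001", 32, 0.87],
    ["job-002", 44, 0.91],
    ["job-003", 16, 0.42],
]

# What pandas gives you: named columns, and one-liners where there
# used to be hand-rolled loops.
df = pd.DataFrame(rows, columns=["job", "cores", "efficiency"])
print(df["efficiency"].mean())    # aggregate in one line
print(df[df["cores"] >= 32])      # filter without index juggling
```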
Back when I first got hired at SHARCNET, I used a lot of Python. I mean a lot. What this meant was that I quickly became the lightning rod for all Python-related questions (and commentary).
During a fun Friday chat, a colleague remarked that Python was on average 40x slower than C++. I defended my current language of choice, saying it was surely better than that. To make a long story short, I was wrong.
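You can get a feel for the gap without writing any C++ at all, by racing interpreted bytecode against a C-backed builtin. A minimal sketch (the numbers it prints are illustrative, not from the original conversation):

```python
import timeit

# Naive pure-Python loop: every iteration goes through the interpreter.
def py_sum(n=1_000_000):
    total = 0
    for i in range(n):
        total += i
    return total

# The same work done by a builtin implemented in C.
loop_time = timeit.timeit(py_sum, number=10)
builtin_time = timeit.timeit(lambda: sum(range(1_000_000)), number=10)

print(f"pure-Python loop: {loop_time:.3f}s")
print(f"C-backed sum():   {builtin_time:.3f}s")
```

The exact multiple depends on the workload, but compiled code wins by a wide margin every time.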