PyCon is the largest annual gathering of the Python user community, a group to which Pandata enthusiastically belongs. Several members of Team Panda were able to participate in this year’s PyCon, conveniently held in Cleveland. While there were too many interesting talks to recap, we wanted to highlight a few that sparked the most thought and discussion.
Ethics in data science is often at the forefront of our minds at Pandata, and two talks on model fairness, one by J. Henry Hinnefeld and one by Manojit Nandi, stood out. Hinnefeld stressed that adding extra constraints to a model is not enough to settle the fairness question; the subtleties of fairness need a human touch. Different groups experience different realities, or ground truths. The question then becomes: is fairness equality or equity? Equality is treating everyone the same. Equity is giving everyone what they need to be successful. A model can treat everyone equally, but if people do not start from the same place, the model is not equitable, and on that view it is unfair. Beyond differing ground truths, data can also be unintentionally skewed by human bias. Taking these issues into account when designing a model is an important ethical responsibility.
The talk by Nandi focused on how data and algorithms can be implemented ethically. Math is not racist, but the way we use algorithms to decide, for example, who gets a bank loan can be. How can we take steps to mitigate this problem? Ethics training for data scientists is essential: awareness that there are biases in data, and in how we interpret outcomes, is the first step to combatting harm. Diversity on data science teams brings key insights from the perspective of underrepresented groups. There are also software packages that have been developed to address bias mathematically, including AI Fairness 360 (AIF360) and FairTest.
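To make the idea of measuring bias a little more concrete, here is a minimal sketch in plain Python of two common group-fairness measures: statistical parity difference and the disparate impact ratio. This is not the API of either package mentioned above, and it is not taken from the talks. The function names, the group labels, and the toy loan decisions are all made up for illustration.

```python
from collections import defaultdict


def selection_rates(groups, decisions):
    """Return the fraction of positive decisions (1 = approved) per group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, decision in zip(groups, decisions):
        totals[group] += 1
        approved[group] += decision
    return {group: approved[group] / totals[group] for group in totals}


def statistical_parity_difference(rates, unprivileged, privileged):
    """Difference in approval rates; 0 means both groups are approved equally often."""
    return rates[unprivileged] - rates[privileged]


def disparate_impact(rates, unprivileged, privileged):
    """Ratio of approval rates; 1 means both groups are approved equally often."""
    return rates[unprivileged] / rates[privileged]


# Hypothetical toy data: eight loan decisions across two made-up groups.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
decisions = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, decisions)
spd = statistical_parity_difference(rates, unprivileged="b", privileged="a")
di = disparate_impact(rates, unprivileged="b", privileged="a")

print(rates)
print(f"statistical parity difference: {spd:.2f}")
print(f"disparate impact ratio: {di:.2f}")
```

A number like this only flags a gap in outcomes. Deciding whether that gap is unfair, and what to do about it, is still the human judgment call both talks emphasized.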
There was a myriad of technical talks, ranging from code testing to building solvers to data visualization and everything in between. An interesting and relevant talk by William Horton focused on using GPUs in Python code.
Moore’s law refers to the observation that the number of transistors on an integrated circuit doubles roughly every two years, which for decades translated into steadily faster processors. However, as transistors on computer chips continue to get smaller and smaller, the ability to keep pace with Moore’s law has begun to slow due to fundamental physical limits. Yet the scale of data continues to expand rapidly, creating bottlenecks in the processing speed of workloads. These bottlenecks come from the limit on how many processes can run in parallel, given the limited number of cores and threads in current CPU offerings. In comes a new contender, one previously dedicated to giving gamers smoother experiences in video games: the GPU.
GPUs can contain thousands of smaller cores, compared to the double-digit core count of current CPU offerings. They are also built to specialize in certain types of computations common in data science. Surely this must require large changes in code, you might wonder. Software like CUDA, a parallel computing platform developed by NVIDIA, can be used in multiple ways as a “drop-in” within existing code. CUDA implementations offer a rich environment that gives you a wealth of control over how GPU resources are used, which can cut the processing time of code dramatically. Libraries like PyCUDA and Numba take this one step further and allow developers to write code in Python that is “translated” to run efficiently on a GPU with minimal refactoring. With GPU-enabled code giving results in the range of 30x faster (or more!) for suitable workloads, we truly are entering a new age of processing that only seems to be getting better as time goes on.
Numerous Python hobbyist talks were fun to attend for the craftier Pandas. Talks on using Python to generate cross-stitch patterns by Katie McLaughlin, making music in Python by Jessica Garson, and using signal processing in Python to generate music for and control a player piano by JP Bader gave us many ideas for spare-time projects.
One of the most fascinating lessons of PyCon was that Python is more than just a coding language or professional tool; it is almost a way of life. The undercurrent of inclusiveness, intellectual curiosity and passion that permeated the conference was inspiring. As a team, we walked away with an array of new tools and techniques for addressing model fairness and optimizing code, which we look forward to applying to client work. There also might be a Python-generated Panda cross-stitch in our future.
Co-written by Pandata team members Hannah Arnson, Lead Data Scientist, Julie Novic, former Data Analyst, and Joseph Homrocky, former Data Engineer.