By Hannah Arnson & Joseph Homrocky
Data science is often thought of as a black box – in goes data, out comes actionable information. Inside that black box lives a complex mix of programming, mathematics, statistics, hardware, and software, each requiring significant expertise. Because so many different components go into data science, it takes a team of individuals to turn data into insight. The key players on the team are the data engineer, data analyst, and data scientist. Here, we highlight the roles of data engineer and data scientist.
Enter the Data Engineer…
Architects of the data science world, data engineers create the framework that allows data scientists to work as efficiently as possible. With so many production environments available – both cloud and local – there is always something new to learn and plenty of questions and concerns to consider when designing a solution.
Such considerations depend heavily on the customer’s needs. These needs vary greatly and include budgetary concerns, privacy requirements (for example, HIPAA), and ease of use. Any one project generally has only a handful of considerations to account for, but the sheer range of possible needs is why data engineers must keep learning and researching to be effective at their job. New technologies appear every day, and in such a rapidly changing field an engineer without a passion for producing a quality outcome can quickly become ineffective.
As stated, data engineers are responsible for the framework and architecture that will support the data scientist in their work. This includes, but is not limited to, working with databases (both relational, typically queried with SQL, and non-relational, often grouped under the label NoSQL), working with data in many commonly used formats (Excel/CSV, JSON), cleaning and eliminating unusable data, and figuring out what software and hardware best suit the customer’s needs. Understanding the customer’s current environment, their future environment, data quality (or access) issues, and other road bumps requires extensive research and repeated adjustment of the design. This process makes certain that the solution created by the engineer and the scientist is usable, understandable (for maintenance, regular users, or both), and scalable for the customer’s data volume.
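To make "cleaning and eliminating unusable data" concrete, here is a minimal sketch using only the Python standard library. The sample CSV and JSON text, the field names (`order_id`, `quantity`), and the helper functions are all hypothetical and chosen for illustration; a real pipeline would read from the customer's actual files or databases and would likely log rejected rows rather than silently drop them.

```python
import csv
import io
import json

# Hypothetical sample data standing in for a CSV export and a JSON feed.
CSV_TEXT = """order_id,quantity
1,3
2,
3,abc
4,7
"""
JSON_TEXT = '[{"order_id": 5, "quantity": 2}, {"order_id": 6, "quantity": null}]'


def parse_row(row):
    """Return a cleaned record, or None if the row is unusable."""
    try:
        return {
            "order_id": int(row["order_id"]),
            "quantity": int(row["quantity"]),
        }
    except (KeyError, TypeError, ValueError):
        # Missing field, null value, or non-numeric text: treat as unusable.
        return None


def load_clean_records(csv_text, json_text):
    """Combine rows from both formats and keep only the usable ones."""
    rows = list(csv.DictReader(io.StringIO(csv_text))) + json.loads(json_text)
    cleaned = (parse_row(row) for row in rows)
    return [record for record in cleaned if record is not None]


records = load_clean_records(CSV_TEXT, JSON_TEXT)
print(records)
```

Rows with an empty, null, or non-numeric quantity are dropped, so only the records that can be used downstream reach the data scientist.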
Enter the Data Scientist…
A defining feature of the data scientist role is the use of the scientific method. A data scientist takes a business question, frames it analytically, uses a mix of programming, mathematics, and statistics to address the question, and then reports back findings in a useful way. More specifically, step one is to formalize the question and determine the underlying assumptions. Then, he or she explores whether this question is answerable given the existing data and infrastructure (made accessible by the data engineer) and, if so, determines the best analytical approach. Analysis often involves machine learning, a family of techniques in which a subset of the data is used to “train” the computer to detect patterns, which are then used to make predictions or classifications on the remaining data and on any future data. This whole process is often iterative, as different hypotheses are tested and approaches are found to be incorrect or impossible to implement because of data limitations. It is also up to the data scientist to treat all results with a healthy dose of skepticism: is a result truly a result, or is it (more likely) a statistical fluke or the product of an underlying bias or pattern in the data?
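The train-then-predict idea described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the method used on any particular project: the data is synthetic (two well-separated clusters of points), the 70/30 split is arbitrary, and the "model" is a simple nearest-centroid classifier chosen because it fits in a short example. The names `make_data`, `train`, and `predict` are hypothetical.

```python
import random


def make_data(n, seed=0):
    """Generate synthetic labeled 2-D points around two cluster centers."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        center = 0.0 if label == 0 else 5.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def train(examples):
    """'Train' by computing the average point (centroid) for each label."""
    sums = {}
    for (x, y), label in examples:
        sx, sy, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, count + 1)
    return {label: (sx / count, sy / count) for label, (sx, sy, count) in sums.items()}


def predict(centroids, point):
    """Classify a point by the label of its nearest centroid."""
    x, y = point
    return min(
        centroids,
        key=lambda label: (x - centroids[label][0]) ** 2 + (y - centroids[label][1]) ** 2,
    )


data = make_data(200)  # already in random order, so no shuffle is needed
split = int(0.7 * len(data))
train_set, test_set = data[:split], data[split:]

model = train(train_set)  # patterns are learned from the training subset only
correct = sum(predict(model, point) == label for point, label in test_set)
accuracy = correct / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

Measuring accuracy only on the held-out points is one small guard against the "statistical fluke" concern: a model that merely memorized its training data would not necessarily do well on data it has never seen.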
Once the inner critic has been satisfied, it is time to prototype the analysis, which involves building out a functional version of the model. The finished prototype is passed off to the data engineer, who can build it out at scale and on the appropriate computing infrastructure. However, the data scientist is not yet done. In what is arguably the most important part of the role, he or she must then deliver the results in a way that both captures all of the assumptions and nuances of the analyses and is understandable to the end user, typically a business-minded person. This communication often takes the form of an interactive dashboard built with business intelligence tools such as Tableau or Power BI.
As they say, there is no “I” in team, which holds true for data science. At Pandata, each member of our team is a key component in the data science process, working together to derive value from data for our clients. To learn more, contact us at firstname.lastname@example.org.
Hannah Arnson is a Data Scientist & Joseph Homrocky is a Data Engineer, both at Pandata.