Data Science Glossary
Fed up with buzzword bingo? We demystify some common industry terms.
Algorithm
A series of instructions, or a recipe, for manipulating data to achieve an end goal. We use programming languages such as Python or R to implement algorithms, which range from simple addition to extremely complex neural networks.
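As a small illustration of the "recipe" idea, the sketch below spells out one very simple algorithm in Python: averaging a list of numbers. The function name and the sales figures are hypothetical and chosen only for this example.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("mean() needs at least one value")
    total = 0.0
    for v in values:              # step 1: add up every value
        total += v
    return total / len(values)    # step 2: divide by how many there are


monthly_sales = [120, 95, 143, 110]   # hypothetical data
print(mean(monthly_sales))            # 117.0
```

Every algorithm, however complex, is built from explicit steps like these.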
Artificial Intelligence (AI)
Business Intelligence
Data should be used to derive actionable business intelligence. Our primary goal is to use data to contribute to business value, which we do through statistical analysis, data visualization and reporting, and machine learning.
Dashboard
An interactive data visualization, or series of visualizations, that lets stakeholders explore various dimensions of data. We develop dashboards using tools such as Tableau or Power BI, aiming for clarity and ease of use so that the end user can independently drill into data details or explore high-level summary information.
Data Engineering
Data engineering involves planning, designing, and implementing information systems. This includes data storage as well as the pipelines that data scientists use to access and transform data.
Data Enrichment
An organization’s data can be augmented in ways that improve business insight and empower predictive analytics. We use extensive knowledge of open source data to supplement and enrich your proprietary data sources.
Data Lake vs Data Warehouse
Where you store your data depends on what type of data you have. A “Data Lake” holds raw, unprocessed data whose structure often varies from source to source, with no defined relationships between data sets. A “Data Warehouse” stores structured or relational data consolidated from many sources.
Data Visualization and Reporting
Data can be used to provide business intelligence, but if a stakeholder cannot understand it, it is difficult to convert that intelligence into business value. Visualization and reporting bridge that gap. They are also necessary when presenting results from statistical analysis or machine learning.
Deep Learning
Deep learning uses architectures such as deep neural networks to perform machine learning. When the situation calls for it, deep learning can outperform classical methods and provide state-of-the-art performance. We find that deep learning is most useful with sequential data, image data, or learning from simulated environments.
Exploratory Data Analysis (EDA)
A critical early stage in any data-related project, EDA involves exploring the available data and summarizing its main characteristics, often using visualizations. It can provide additional insight into the data set and lead to ideas and hypotheses to explore with more formal statistical modeling.
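A first pass at EDA often amounts to computing a handful of summary statistics. The sketch below uses only Python's standard `statistics` module; the `order_values` data and the `summarize` helper are hypothetical and stand in for whatever data set is being explored.

```python
import statistics

# Hypothetical order values; one entry looks suspiciously large.
order_values = [12.5, 14.0, 13.2, 15.1, 12.9, 98.0, 13.7]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered),
    }


summary = summarize(order_values)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Here the mean sits well above the median, which hints at an outlier worth investigating before any formal modeling.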
Extract, Transform and Load (ETL)
ETL refers to the extraction of data from one or more sources, the transformation of that data into a proper format or structure, and the loading of the result into a target database. The goal is a cleaned data set that is ready for querying and further use.
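The three stages can be sketched in a few lines of Python using only the standard library. Everything specific here is an assumption made for illustration: the CSV contents, the `orders` table, and the cleaning rules (trim names, drop rows with no amount, upper-case currency codes). A production pipeline would typically use dedicated tooling.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with messy whitespace, casing, and a missing amount.
RAW_CSV = """customer,amount,currency
 Alice ,10.50,usd
Bob,,usd
carol,7.25,USD
"""


def extract(text):
    """Extract: read rows from the source (here, CSV text) as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows and normalize names, amounts, currency."""
    cleaned = []
    for row in rows:
        if not row["amount"]:          # skip rows with a missing amount
            continue
        cleaned.append((
            row["customer"].strip().title(),
            float(row["amount"]),
            row["currency"].upper(),
        ))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")     # in-memory database for the example
load(transform(extract(RAW_CSV)), conn)
print(conn.execute(
    "SELECT customer, amount, currency FROM orders ORDER BY customer"
).fetchall())
# [('Alice', 10.5, 'USD'), ('Carol', 7.25, 'USD')]
```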
Feature Selection
Feature selection is a machine learning technique that keeps only the relevant features in the data and removes those that are redundant or irrelevant. This simplifies models and reduces training time.
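One simple way to approach this is a filter that drops columns that never vary and columns that nearly duplicate one already kept. The sketch below is only one of many possible approaches; the feature names, the data, and the 0.95 correlation threshold are all hypothetical.

```python
import statistics

# Hypothetical feature columns for five customers.
features = {
    "age":        [23, 35, 41, 29, 52],
    "age_months": [276, 420, 492, 348, 624],  # redundant: age * 12
    "country":    [1, 1, 1, 1, 1],            # never varies in this sample
    "visits":     [3, 9, 4, 7, 5],
}


def pearson(xs, ys):
    """Pearson correlation between two equally long, non-constant lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def select_features(columns, max_corr=0.95):
    """Keep columns that vary and are not near-duplicates of a kept column."""
    kept = []
    for name, values in columns.items():
        if statistics.pvariance(values) == 0:
            continue  # constant column carries no information
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # almost perfectly correlated with a kept column
        kept.append(name)
    return kept


print(select_features(features))  # ['age', 'visits']
```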
Hadoop
Apache Hadoop is an open-source software framework for distributed storage and processing of data. Hadoop distributes files across the nodes of a cluster and processes the data in parallel across those nodes. It can be deployed on local computer clusters, in the cloud (using services like Amazon Web Services or Microsoft Azure), or as a hybrid of the two.
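Hadoop's classic processing model is MapReduce: each node processes its own block of data, and the partial results are then combined. The sketch below imitates that pattern in plain Python on a single machine. It does not use any Hadoop API, and the `blocks` list is a hypothetical stand-in for file blocks stored on three nodes.

```python
from collections import Counter
from functools import reduce

# Hypothetical file blocks, as if each were stored on a different node.
blocks = [
    "big data big insights",
    "data pipelines move data",
    "big pipelines",
]


def map_block(text):
    """Map step: count the words in one block, independently of the others."""
    return Counter(text.split())


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two blocks."""
    return left + right


# On a real cluster the map step would run in parallel, one task per block.
partial_counts = [map_block(block) for block in blocks]
totals = reduce(merge_counts, partial_counts, Counter())
print(dict(totals))
```

Because each block is counted independently, adding more nodes lets the map step handle more data in the same amount of time.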
Machine Learning
Machine learning algorithms allow computers to learn from data in order to perform specific tasks. Most often, this is some form of prediction or optimization, although it can also be useful for general pattern mining.
Natural Language Processing (NLP)
Much of the world’s data comes in the form of natural language, which is often unstructured. We combine classical methods and modern deep learning to gain actionable insights and predictive analytics with text data in all forms.
Pattern Mining
Although pattern mining is useful for all forms of machine learning, it is most useful in “unsupervised” settings, when the data does not lend itself naturally to predictive analytics. It often provides business intelligence on its own, and can be used as a stepping stone to performing predictive analytics.
Recommender Systems
Recommender systems are used to predict a user’s preferences based on inputs such as a user’s historical preferences or the preferences of similar users. Common uses of recommender systems include suggestions generated by streaming content services like Spotify and YouTube and product recommendations generated by Amazon and many other e-commerce sites.
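The "preferences of similar users" idea can be sketched as a tiny user-based collaborative filter. This is a minimal illustration, not how any particular service works: the users, genres, and ratings are hypothetical, and unseen items are scored with a simple similarity-weighted sum.

```python
import math

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana":   {"jazz": 5, "rock": 1, "folk": 4},
    "ben":   {"jazz": 4, "rock": 2, "folk": 5, "blues": 5},
    "chloe": {"jazz": 1, "rock": 5, "metal": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity of two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                # Ratings from more similar users count for more.
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


print(recommend("ana", ratings))
```

In this example "blues" ranks above "metal" for ana, because it comes from ben, whose tastes are much closer to hers than chloe's are.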
Supervised vs. Unsupervised Machine Learning
Supervised machine learning uses training data that includes both an input and the expected output. Once trained, the model accepts a previously unseen input and predicts the output based on the function learned during training. Common examples of supervised learning algorithms include decision trees, linear and logistic regression, and k-nearest neighbors. Common applications include forecasting future values and classifying inputs into categories.
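To make the input/expected-output idea concrete, here is a minimal k-nearest neighbors classifier written with the Python standard library. The training points, the two labels, and the choice of `k=3` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled training data:
# (hours_active, purchases) -> customer segment
training = [
    ((1.0, 0.0), "casual"),
    ((1.5, 1.0), "casual"),
    ((2.0, 0.0), "casual"),
    ((8.0, 5.0), "loyal"),
    ((9.0, 7.0), "loyal"),
    ((7.5, 6.0), "loyal"),
]


def predict(point, training, k=3):
    """Label a new point by majority vote among its k closest training points."""
    nearest = sorted(training, key=lambda pair: math.dist(point, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(predict((8.5, 6.0), training))  # loyal
print(predict((1.2, 0.5), training))  # casual
```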
Unsupervised learning is more exploratory in nature. Output categories are not included in the training set, and a common goal is to find previously undetected patterns. A common example of an unsupervised learning algorithm is k-means, and common applications include clustering and anomaly detection.
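For comparison, the sketch below runs a bare-bones k-means on unlabeled points. It is a simplified illustration: the points are hypothetical, and the initial centroids are naively taken to be the first `k` points, whereas real implementations usually choose them more carefully.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly updating cluster centroids."""
    centroids = list(points[:k])  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: no output categories are supplied.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

The algorithm is never told which group a point belongs to; the two clusters emerge from the distances alone.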
Statistical Analysis
Statistical analysis is most often used to gather high-level knowledge of data, which in turn motivates further business intelligence efforts. It can create actionable business intelligence on its own or in combination with a reporting solution. It is also often considered a necessary ingredient for machine learning.