The future of data science lies in automation
Data science is a wide-ranging field that has found successful applications in both scientific and business domains. Companies, in particular, have been heavily investing in all things data in their quest to become data-driven.
Naturally, with every business-minded investment comes the idea of optimisation. Data science is no different in that regard. While companies are pouring in money, they are also thinking of ways to make the most out of those resources. Automation is an inevitable part of optimisation and, often, the first course of action.
Data science may seem like a field that's nearly impossible to automate due to its inherent complexity. There are so many steps, from data extraction to modelling, all of which seem to require human input. We've thought that way, however, about many things and still found ways to automate processes.
Data science in parts
Generally, a lot of current data science is done through the use of machine learning. Proper employment of ML can ease all of the predictive work that is most often the end goal for data science projects, at least in the business world.
Before continuing, I should note that most of my experience is in supervised learning and, as such, so is most of this article. While there are other learning approaches, supervised learning is likely the most popular and most frequently used in data science.
In any case, all of data science can be separated into several distinct parts, which together make up the entire field: data exploration, data engineering, model building, and interpretation.
Data exploration largely revolves around discovering the needs, goals, and requirements of a particular task. Each dataset has to come from some source (or a multitude of them). However, it's not always clear where to find that data or how to go about acquiring it. For example, an eCommerce business might need, for some reason, all pricing data for a specific category from a variety of regions.
Additionally, exploration will often work with some datasets to discover the goal-driven questions, the potential for visualisation, etc. All of these aspects require quite extensive human judgment and are domain and goal specific. As a result, automation for data exploration is likely somewhat far away.
Data engineering, which is the process of actually acquiring, labelling, wrangling, and transforming data, is often the most time-consuming aspect. Unfortunately, the area has so far had little success in automating the tasks.
The other two parts, however, have much more potential. Data interpretation, to some surprise, has been shown to have the potential for automation. In 2014, a group of researchers created a natural language model that could interpret basic regression models and even draft a full report with explanations with an impressive degree of veracity.
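To make the idea of automated interpretation concrete, here is a minimal sketch that fits a one-variable regression and turns the result into a plain-English sentence. It is a template-based toy of my own, not the researchers' method; the variable names, the sample numbers, and the cut-offs for "strong" and "moderate" are all assumptions chosen for illustration.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Assumes at least two points and that neither xs nor ys is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def describe_fit(x_name, y_name, slope, r_squared):
    """Fill a fixed sentence template from the fitted numbers."""
    if slope == 0:
        trend = f"{y_name} does not change with {x_name}."
    else:
        direction = "increases" if slope > 0 else "decreases"
        trend = (f"For each additional unit of {x_name}, "
                 f"{y_name} {direction} by about {abs(slope):.2f}.")
    # Illustrative thresholds only; they are not a statistical standard.
    if r_squared >= 0.7:
        strength = "strong"
    elif r_squared >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    return (f"{trend} The relationship is {strength}: {x_name} explains "
            f"{r_squared:.0%} of the variation in {y_name}.")


# Made-up sample data for illustration.
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r_squared = fit_simple_regression(ad_spend, sales)
report = describe_fit("ad spend", "sales", slope, r_squared)
print(report)
```

A real system would need to handle many model types and decide which findings are worth reporting, which is where most of the difficulty lies.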
Since then, there have been various business implementations that aim to do the same thing, only with the goal of producing more actionable and less academic insights. Numerous products, such as Microsoft's Power BI, have integrated automated insight generation, albeit in a somewhat limited capacity. Soon enough, however, I believe we'll get complete overviews from business intelligence systems.
Model building, the practice of selecting algorithms, tuning parameters, evaluating performance, and creating machine learning models, has already seen a decent degree of success through the usage of AI. Primarily, this has been achieved through AutoML.
AutoML
AutoML has been making the rounds as the next step in data science. In short, a significant part of machine learning, outside of getting all the data ready for modelling, is picking the correct algorithm and fine-tuning its (hyper)parameters.
After data accuracy and veracity, the algorithm and parameters have the highest influence on predictive power. While, in many cases, there is no perfect solution, there's plenty of wiggle room to make optimisations. Additionally, there's always some theoretical near-optimal solution that can be arrived at mostly through calculation and decision-making.
Yet, arriving at these theoretical optimisations is exceedingly difficult. In most cases, the decisions will be heuristic, and any errors will be removed after experimentation. Even with extensive industry experience and professionalism, there is just too much room for error.
AutoML systems, such as Python libraries (e.g., Auto-sklearn), use advancements in mathematics and computer science to automatically select algorithms and fine-tune their parameters. Research and experimentation have shown that various AutoML systems can often optimise pipelines and produce accurate results at uncanny rates.
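The core loop of searching over both algorithms and hyperparameters can be sketched in a few lines. What follows is a deliberately dependency-free toy of my own, not how Auto-sklearn works internally: it uses plain exhaustive search with cross-validation over two simple classifiers, whereas real AutoML systems rely on far more sophisticated search strategies. The synthetic dataset, the candidate list, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter


def make_dataset(n_per_class=40, seed=0):
    """Two noisy 2-D clusters; synthetic data for illustration only."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, (0.0, 0.0)), (1, (2.0, 2.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def knn_predict(train, point, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda row: squared_distance(row[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def centroid_predict(train, point):
    """Assign the label of the closest class mean."""
    sums, counts = {}, Counter()
    for features, label in train:
        totals = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value
        counts[label] += 1
    centroids = {
        label: [t / counts[label] for t in totals]
        for label, totals in sums.items()
    }
    return min(centroids, key=lambda label: squared_distance(centroids[label], point))


def cross_val_accuracy(predict, data, folds=5):
    """Mean accuracy over `folds` held-out slices of the data."""
    scores = []
    for i in range(folds):
        test = data[i::folds]
        train = [row for j, row in enumerate(data) if j % folds != i]
        hits = sum(predict(train, point) == label for point, label in test)
        scores.append(hits / len(test))
    return sum(scores) / folds


# The search space: each entry is one (algorithm, hyperparameter) candidate.
candidates = {"nearest_centroid": centroid_predict}
for k in (1, 3, 5, 9):
    candidates[f"knn(k={k})"] = lambda train, point, k=k: knn_predict(train, point, k)

data = make_dataset()
results = {name: cross_val_accuracy(fn, data) for name, fn in candidates.items()}
best_name = max(results, key=results.get)
print(best_name, round(results[best_name], 3))
```

Even in this toy, the human no longer picks the algorithm or the value of `k`; they only define the search space and the scoring rule, which is the shift AutoML brings to model building.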
While AutoML does not and will not completely automate data science, it has the potential to take a significant portion of manual work off of the shoulders of humans. Its potential lies in simplifying a usually fairly difficult part of machine learning.
Making machine learning easier
Automation is not only about optimising resource costs, as I alluded to at the beginning of the article; it also lowers the barrier to entry for some activities. Machine learning has two major hurdles to its accessibility.
The first is data acquisition and engineering. Most issues there, however, have been waning. Data acquisition has been made easier through the emergence of web scraping, public datasets, and other phenomena. Labelling and wrangling remain largely unchanged, but finding the necessary data has often been the primary challenge in data science.
The second hurdle is model building. Here, AutoML makes machine learning a little more accessible by reducing the requirements for creating an optimised model. Currently, the technology can still run into issues when high-quality data is not available, so it's definitely not a cure-all, and general machine learning knowledge is still required.
Within the near future, however, AutoML has the most potential to completely automate a part of data science and provide easier access to the field for less experienced practitioners. Additionally, large language models or natural language processing will aid data scientists in producing easy-to-read interpretations.
Finally, while currently only theoretical, I expect that data engineering will be the next in line for automation. Data universalisation, normalisation, and extraction can already be automated, and all that is needed is to find solutions that can be scaled.
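As a small illustration of what automated normalisation can look like, the sketch below maps records from differently structured sources onto one common schema, in the spirit of the regional pricing example earlier. The field aliases, the sample record, and the assumption that prices use a dot as the decimal separator are all hypothetical.

```python
# Hypothetical aliases: each source may name the same attribute differently.
FIELD_ALIASES = {
    "product": ("product", "name", "title"),
    "price": ("price", "cost", "amount"),
    "currency": ("currency", "curr"),
}


def normalise_record(raw):
    """Map one raw record onto the common schema defined above.

    Fields that no alias matches are set to None rather than guessed.
    """
    lowered = {key.strip().lower(): value for key, value in raw.items()}
    record = {}
    for field, aliases in FIELD_ALIASES.items():
        record[field] = next((lowered[a] for a in aliases if a in lowered), None)
    if record["price"] is not None:
        # Assumes a dot decimal separator and no thousands separators.
        record["price"] = float(str(record["price"]).strip())
    if record["currency"] is not None:
        record["currency"] = str(record["currency"]).strip().upper()
    return record


# Made-up record, as it might arrive from one regional source.
raw = {"Title": "Desk lamp", "Cost": "19.90", "curr": "eur"}
clean = normalise_record(raw)
print(clean)
```

The hard part, and the reason scale is the open problem, is that a hand-written alias table like this one has to be rebuilt for every new source.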