🧱

Required AI stack in business

Introduction

The purpose of this short text is to explain the default stack that a hypothetical user, e.g. a Data Scientist (🧑🏻‍💻 from now on), should learn in order to run a successful AI project in the business world.

A second goal is for the reader to understand the consequences of this learning process, both for 🧑🏻‍💻 and for the market.

I want to start by dividing up the DS (Data Science) stack, i.e. the pile of tools that 🧑🏻‍💻 has to master to complete an AI project successfully in a business environment.

Without aiming to be exhaustive, we can divide the AI stack into the following layers:

  1. Basic technologies: Python, SQL, Bash, ...
  2. Data Loading: pandas, airflow, ...
  3. Data Cleaning: pandas, nltk, OpenCV, ...
  4. Data Visualization: plotly, seaborn, D3.js, ...
  5. Models, selection, training and scoring: sklearn, tensorflow, pytorch, ...
  6. MLOps: AWS S3, docker, kubernetes, ...
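
To give a feel for how these layers chain together, here is a minimal sketch of items 2, 3 and 5 in one script. It is only an illustration under loud assumptions: the toy data, the function names, and the threshold "model" are all hypothetical, and it uses only the Python standard library to stay self-contained, whereas a real project would likely reach for pandas and sklearn instead.

```python
import csv
import io
import statistics

# Hypothetical toy data standing in for a real source (database, API, files).
RAW_CSV = """hours,passed
1.0,0
2.0,0
,1
3.0,0
6.0,1
7.0,1
8.0,1
"""


def load(text):
    """Data Loading: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Data Cleaning: drop rows with missing values and cast the types."""
    return [
        (float(row["hours"]), int(row["passed"]))
        for row in rows
        if row["hours"] and row["passed"]
    ]


def train(samples):
    """Training: a toy model, the threshold halfway between the class means."""
    mean_fail = statistics.mean(x for x, y in samples if y == 0)
    mean_pass = statistics.mean(x for x, y in samples if y == 1)
    return (mean_fail + mean_pass) / 2


def score(threshold, samples):
    """Scoring: accuracy of the rule 'predict passed when hours > threshold'."""
    hits = sum((x > threshold) == bool(y) for x, y in samples)
    return hits / len(samples)


rows = load(RAW_CSV)
samples = clean(rows)
threshold = train(samples)
accuracy = score(threshold, samples)
print(f"rows={len(rows)} clean={len(samples)} "
      f"threshold={threshold} accuracy={accuracy}")
```

Note that the sketch scores the model on the same data it was trained on, which is not a real evaluation, and it leaves out visualization and MLOps entirely. Even so, every function hides a layer of the stack that 🧑🏻‍💻 is expected to know.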

The Problem

A few days ago, I received an image from a LinkedIn post (the main reason for writing this blog), and the caption above it said: "Basic MLOps stack":

[Image: the "Basic MLOps stack" diagram]

Can you see where the End User is? Check it twice!

There are over 14 different technologies in this little diagram that 🧑🏻‍💻 must learn and master before deploying a model into production! And MLOps is just the last item on the list above.

According to the more than 20k Data Scientists who answered the Kaggle survey 2020, deploying models takes 11% of their time. Here are the results for this particular survey question:

image

Image extracted from this awesome notebook!

This is just the particular case of MLOps. There are plenty of blogs and other resources about the best tools for Data Loading, Cleaning, Visualization, Modelling, etc. I'm not going to dive deep into each of those topics, as I would not like to abuse the reader's patience.

If you are a Data Scientist, you are probably aware of this problem. At the end of the day, 95% of your time is dedicated to learning programming languages, fighting bugs, building ETLs, etc. This is a very inefficient way of spending your time as a Data Scientist, because none of it is related to the problem that you are trying to solve.

Moreover, substantial accuracy improvements come from changing your perspective on the problem, adding more data, improving data quality, etc., not from fixing compilation errors or syntax errors, or from learning the intricate internals of a cloud provider's instances. Those tasks take up the majority of your time and distract you from your real job. The same happens to 🧑🏻‍💻.

Consequences (the stack overflow effect)

  1. 🧑🏻‍💻 will be frustrated because they cannot learn a good portion of this gigantic stack in the time their job allows (the stack overflow effect). According to the Kaggle survey, most Data Scientists work alone or in a team of two. Thus, it is impossible to split the stack appropriately.
  2. The company managers will be frustrated as well, because their high expectations and lack of knowledge of the field lead them to think that a DS can do an AI project alone or in a team of two. But this is like saying that a single worker can build a skyscraper. In that context, it is obvious that people with many different skill sets are required, such as architects, workers, property developers, etc.
  3. Finally, this effect has consequences in the market: programmers and DS do not fit correctly into the roles, and people who know a good portion of the stack (Super Users or 🧙🏻‍♂️ from now on) are very uncommon. Therefore, the prices for hiring them are incredibly high.

The Solution 🔮

Obviously, I have to say that PurpleDye is the solution to that problem. That has been our intention (Xavi's and mine) since we started building it. There are other solutions, but other companies usually try to sell AI to other businesses or to solve a very specific AI problem (which is the way to go according to many reputed investors).

In contrast, we are trying to reinvent the tools that Data Scientists use in their day-to-day work and give them an IDE 2.0: what we call PurpleDye, a no-code development platform for AI.

But why is the no-code approach a valid solution to the stack overflow effect?

  1. 🧑🏻‍💻 only needs to learn some basics and can focus on the problem, not on the technologies involved in solving it.
  2. 🧑🏻‍💻 has more time for simulating and testing, as the system is free of the errors that a programmer introduces while coding, e.g. syntax errors or type errors.
  3. Also, 🧑🏻‍💻 has a more general view of all the steps of the process and can thus explain it better to other teams.

Consequences

  1. The first consequence is that there is no stack overflow effect, and thus 🧑🏻‍💻 does not need to be a 🧙🏻‍♂️. Therefore, many more valid candidates for doing AI projects will emerge in the market, and the shortage of 🧙🏻‍♂️ will ease.
  2. Super users remain essential, as there are some tasks that only they can perform, but those tasks become the exception, not the rule.

🔥

Share your comments on social networks and tag me to start a discussion!

⚠️

The opinions expressed in this publication are those of the author. They do not purport to reflect the opinions or views of any company related to the author or its members.

🚧

Nothing in this publication constitutes professional and/or financial advice, nor does any information on the site constitute a comprehensive or complete statement of the matters discussed or the law relating thereto.