Data mirrors reality

by Natalie Beyer

What can data really tell us?

As a data science consultant, I go to clients and companies who have collected a lot of data but don’t know what to do with it. One of the main things we do is consolidate that information into a decision-support mechanism. We do not automate decisions completely; instead, we give employees the help and the ability to make the decisions themselves.

Consolidating the information makes it easier to reach a decision.

One project that is reaching its end right now involves helping a company with 40,000 customers analyze which of those customers are prone to cancel their contracts. The collected data comes from different systems. One of them belongs to the sales team and holds the information they collect when interacting with a customer, including answers to questions such as: “How old are you? What are you going to do with the product we provide?” After the customer is in the system, more information can be retrieved with time, such as how long they have had their account. Data grows like an organism over time.

Now that the data has been collected, we can use it. For example, if a customer has been with the company for 10 years, they are likely not prone to switch to a different, competing service. Companies are looking at the data they have and thinking about how to build strategies on top of it.

Of course, customers can ask what sort of data companies collect from them. Our clients only use the data of a customer that is directly connected with and produced by their interaction with this customer.

With this data, we found that customers who had only been with the company for one year were prone to cancel their contracts. It’s just a statistical likelihood that helps the model, not a 100% certainty.


Machine learning & predictive analysis

In the realm of predicting things via machine learning, a lot is already possible across many sectors and use cases. For instance, you can do churn prediction, where you can see whether a certain customer is going to cancel, or is more prone to cancel, their contract.

What the whole data science community is working on and heading towards right now is the interpretability of these machine learning algorithms. I think we are on a good track. In the media you sometimes read references to the black box problem, but there are methods that let you see which parts of the data the model was looking at. You can see the basis on which the algorithm made its decision and which factors or columns had the most weight in making that decision.

For example, if a customer has a 90% probability of churning, the model tells me the factors that led to this conclusion. You can see exactly why the algorithm arrived at this high probability.
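
As a small sketch of this kind of interpretability (the scikit-learn library, feature names, and data below are invented for illustration; the article names no specific tool or dataset), a single decision tree can print the exact rules it learned:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: [tenure in years, days since last login]
X = [[10, 2], [1, 45], [7, 5], [1, 30], [4, 12], [1, 60]]
y = [0, 1, 0, 1, 0, 1]  # 1 = the customer cancelled

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text prints the learned decision rules, so you can read off
# exactly which factors drove a given prediction.
print(export_text(tree, feature_names=["tenure_years", "days_since_login"]))
```

Reading the printed rules shows the thresholds the tree split on, which is one simple way of opening the "black box" mentioned above.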

The basis of all these algorithms and machine learning is a table or CSV file. There are several columns per customer, which act as the influence factors for the one target column you are trying to predict. This table can grow with as many columns as you like, such as clicks on the page in a certain time window, or when the account last logged in. These are all set by human agents at the beginning; we set the parameters for the algorithm to work with.
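
Such a table might look like the following hypothetical sketch (column names and values are invented, and pandas is an assumption; any tabular tool works), with one row per customer and one target column to predict:

```python
import pandas as pd

# Hypothetical customer table: one row per customer, several influence-factor
# columns, and one target column ("churned") that we want to predict.
customers = pd.DataFrame({
    "tenure_years":     [10, 1, 7, 1, 4],
    "clicks_last_30d":  [52, 3, 20, 8, 15],
    "days_since_login": [2, 45, 5, 30, 12],
    "churned":          [0, 1, 0, 1, 0],  # 1 = cancelled their contract
})

# Split the table into influence factors (features) and the target column.
features = customers.drop(columns="churned")
target = customers["churned"]
print(features.shape)  # (5, 3): five customers, three influence factors
```

Adding a new influence factor is just adding another column; the target column stays the same.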

Then, the algorithm comes into play. The computer scientists of the last decades have provided us with many different algorithms in open-source tools. The good thing is that everyone in the industry mainly uses these highly developed, state-of-the-art algorithms, and we all work together to develop them further. Even though the actual training of the algorithm is in many cases only one line of code, the whole data preparation and feature generation described above is a complex software project.

One of the commonly used families of algorithms is tree-based. It is similar to a decision tree that splits the data into ever smaller parts, just with more complexity and larger trees. In the end, they are not just decision trees but entire forests.
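
A minimal sketch of such a forest, assuming scikit-learn and invented toy data (the article names neither): once the feature table is prepared, the training itself really is a single line, and the trained forest can report which columns carried the most weight.

```python
from sklearn.ensemble import RandomForestClassifier

# Invented toy data: [tenure in years, clicks in the last 30 days]
X = [[10, 52], [1, 3], [7, 20], [1, 8], [4, 15], [12, 60], [1, 2], [8, 30]]
y = [0, 1, 0, 1, 0, 0, 1, 0]  # 1 = the customer cancelled

# Once the table is prepared, the training itself is one line of code.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The forest also reports how much weight each column carried.
for name, weight in zip(["tenure_years", "clicks_last_30d"],
                        model.feature_importances_):
    print(name, round(weight, 2))
```

In this toy data every one-year customer cancelled, so the forest predicts churn for a new short-tenure customer and weights tenure heavily.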

Data mirrors reality

It’s not that algorithms learn by themselves; they learn from history. You already have a table where you know what the target and the real outcomes were. The algorithm goes through the decision trees and arrives at those same ends. It does not get new information. Ultimately, it surfaces hidden information that is not visible to the eye but is connected via the trees.


The main thing we try to do is communicate and find out what counts as an influencing factor for our target. We throw the craziest ideas at it, because you never know what might turn out to be an influence factor. These are just relationships that we are not able to see with our own eyes.

We get together as many employees of the company as we can, from all different departments, and have them share their knowledge to get an idea of what an influencing factor might be. So in the first iteration, it could be just ten columns of possible influence factors, but with communication, that table grows. Right now, at the end of a project, we have 107 influence columns: that's 107 potential influence factors.

AI within borders

For me, doomsday scenarios about AI are completely far away from what I experience every day. It’s not comparable to what I see in movies.

You just have the data that you have and see what you can make out of it. I don’t know how much data you would need to go from that to a general artificial intelligence, but it’s so far beyond what companies are collecting right now. It’s not even the current end goal.

Companies that I work with collect data to make their employees’ lives easier. Collecting data is often very inward-looking, not outward-looking. The result of those inward-looking projects is specific AIs. A specific AI cannot decide for itself which data to add, and I do not even know how it could in the near future. This is the big human factor in our specific AIs: we define which data points the AI gets for training.

Even in the media-prominent reinforcement learning successes, like the Go or Jeopardy AIs, these systems can only solve very well the problems that were defined for them by a human. The human beaten by the AI, meanwhile, can do indefinitely more things and has their own initiative to learn more.

Natalie Beyer

Natalie Beyer is a Co-Founder of the data science consultancy LAVRIO.solutions. The company builds customized and precise solutions for its clients' needs. With her background in psychology and statistics, she is responsible for model building, data exploration and visualization.
