Introduction to the Machine Learning Workflow
Training data lets a model learn so that it can make predictions. Reaching a reliable result requires a series of steps that ensure the outcome is accurate.
For instance, suppose we want to predict the price of a set of apartments and how those prices will change next year. By scraping historical data or connecting to public databases, we obtain a series of markers that give us the raw data we require (see the loading sketch after this list):
- Square feet
- Area
- Year built
- Price sold
- Others
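As a minimal sketch, loading such raw data into a pandas DataFrame could look like the following. The file name `apartments_raw.csv` and its columns are hypothetical; the real source would be the output of a scraper or a public-database export.

```python
import pandas as pd

# Load the collected markers into a DataFrame (file name is illustrative).
raw = pd.read_csv("apartments_raw.csv")

# Quick inspection of what we actually collected.
print(raw.columns.tolist())
print(raw.head())
```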
Now that we have this dataset, the next step is to extract features from the raw data and reformat it.
In this case, we might keep square feet, distance to the metro or to a historic location, and so on.
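A sketch of that feature-extraction step, assuming the `raw` DataFrame loaded above; all column names (`square_feet`, `distance_to_metro_km`, `year_built`, `price_sold`) are hypothetical placeholders.

```python
# Keep only the markers we care about and drop incomplete rows for simplicity.
columns = ["square_feet", "distance_to_metro_km", "year_built", "price_sold"]
data = raw[columns].dropna()

features = data.drop(columns=["price_sold"])  # model inputs
target = data["price_sold"]                   # value we want to predict
```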
After defining our markers, we need to split the dataset into training and test sets before we start training our model.
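With scikit-learn, the split could look like the sketch below; the 80/20 ratio and the fixed random seed are conventional defaults, not requirements.

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the apartments as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
```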
Training the model is one of the key steps. We can choose among different model types (neural networks, logistic regression, and many more), each of which is more valuable in one context than another depending on the dataset we have.
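For this price-prediction example, one possible choice is a plain linear regression; any other regressor (a gradient-boosted tree, a small neural network, and so on) could be swapped in here.

```python
from sklearn.linear_model import LinearRegression

# Fit the chosen model on the training split only.
model = LinearRegression()
model.fit(X_train, y_train)
```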
We can't simply assume that our model is usable. To check, we could look at apartments that have already been sold and see how accurately it predicts their sale prices.
There is an issue here: our model has already seen those outcomes, so it knows that data. That is why we need the test dataset. We feed the test dataset (unseen data) to the model, collect its predictions, and calculate the average prediction error, or the percentage of apartments whose price the model predicted within a given margin. We also need to define the threshold that we will consider successful performance.
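A sketch of that evaluation, assuming the model and split from above; the 10% margin is just an example threshold, not a recommended value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Predict on data the model has never seen.
predictions = model.predict(X_test)

# Average error of the predictions.
avg_error = mean_absolute_error(y_test, predictions)

# Share of apartments predicted within a 10% margin of the true price.
within_margin = np.mean(np.abs(predictions - y_test) / y_test <= 0.10)

print(f"Average error: {avg_error:,.0f}")
print(f"Within a 10% margin: {within_margin:.1%}")
```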
If the performance isn't good enough, we will need to go back, tune the model, and re-train it.
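One common way to tune and re-train is a grid search over hyperparameters; the `RandomForestRegressor` and the parameter grid below are illustrative assumptions, not the only option.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Try a few hyperparameter combinations, scored by cross-validated error.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# Re-evaluate the best candidate on the test set, as before.
best_model = search.best_estimator_
```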