General ML Pipeline
A Machine Learning (ML) / Artificial Intelligence (AI) pipeline consists of multiple steps for training and using a model, including data collection, data preprocessing, model training, model prediction, and model evaluation.
Beta version: The platform is still under development and may be less stable than the final release. Access and performance may be limited; for example, the platform might crash, some features might not work properly, or some data might be lost.
These are the basic steps for collecting a dataset:
Data preprocessing transforms and cleans raw data into a format suitable for a model. The preprocessing required depends on the type of data and varies with the machine learning algorithm. Please follow the guidelines below:
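As a simple illustration of this step, here is a minimal sketch of two common preprocessing operations on tabular data, scaling numeric features and encoding categorical ones, assuming pandas and scikit-learn are available (the column names and values are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical raw table with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Bangkok", "Chiang Mai", "Bangkok", "Phuket"],
})

# Scale numeric features and one-hot encode categorical features.
preprocess = ColumnTransformer([
    ("scale_numeric", StandardScaler(), ["age"]),
    ("encode_categorical", OneHotEncoder(), ["city"]),
])

X = preprocess.fit_transform(df)  # feature matrix ready for model training
print(X)
```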
The data split procedure is used to evaluate a predictive or machine learning model. Generally, the data is separated into two sets: training data and testing data.
The purpose of each set is as follows:
The simplest method for splitting data is to shuffle the dataset and split it at a chosen ratio, for example 80% training and 20% testing. The shuffling ensures that each split has similar characteristics. Although this works well for most datasets, some datasets require more careful splitting. Both the training and test sets should be large and diverse enough to be representative of the original dataset.
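A minimal sketch of a shuffled 80/20 split, assuming scikit-learn and its bundled Iris dataset as stand-in data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Shuffle, then hold out 20% of the samples for testing.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# For imbalanced classification data, stratify=y keeps the class
# proportions similar in both sets:
# train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))  # 120 training samples, 30 test samples
```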
After splitting the data into training and testing sets, a model is trained on the training data by the chosen algorithm. There are various algorithms depending on the expected purpose and outcome, and each algorithm has its own mandatory parameters to configure. Our ACP provides automated hyperparameter tuning in the training process, which searches for the parameter values that give the model the best results.
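The platform's tuning is automated, but as an illustration of what a hyperparameter search does, here is a sketch using scikit-learn's GridSearchCV with a random forest (the parameter grid is an arbitrary example, not the platform's search space):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try every parameter combination with 5-fold cross-validation and
# keep the combination that scores best on the held-out folds.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
model = search.best_estimator_  # final model, refit with the best parameters
```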
After training, you obtain a final model that can make predictions on new data. You then give input data to the model, and it predicts the expected output.
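A minimal sketch of the prediction step, again assuming scikit-learn; a quick stand-in model is trained here so the example runs on its own, and the input sample is a made-up example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)  # stand-in trained model

# New input must have the same number and order of features as the
# training data (here: the 4 iris measurements).
new_sample = [[5.1, 3.5, 1.4, 0.2]]
print("predicted class:", model.predict(new_sample)[0])
print("class probabilities:", model.predict_proba(new_sample)[0])
```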
Model performance can be evaluated by making predictions on the test data, which the model has not seen. To calculate a performance measure, compare the predicted results with the actual results using evaluation metrics or statistical techniques suited to each algorithm.
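A minimal evaluation sketch for a classifier, assuming scikit-learn metrics (other algorithm types use different metrics, for example mean squared error for regression):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)  # predictions on unseen test data

# Compare predictions against the actual test labels.
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```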