The AI categorizes text content, for example by classifying its topic or identifying the responsible staff mentioned in the text.
Beta version: The platform is still under development toward its final version and may be less stable than usual. Access to and use of the platform might be limited. For example, the platform might crash, some features might not work properly, or some data might be lost.
For the “Customized AI” mode: you only need to prepare a training dataset if you want to train your own model.
The input data should be a ‘comma separated value’ file (.csv format). In other words, the input data is a table of values separated by “,” (the comma sign). After uploading the dataset, the user is required to select a data type for each column. The details of the possible column types are presented below.
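For illustration, a hypothetical news-topic training file in this format might begin like this (the column names and labels are made up for the example, not prescribed by the platform):

text,label
"นายกรัฐมนตรีแถลงนโยบายต่อรัฐสภา",Political News
"ทีมชาติไทยชนะการแข่งขันฟุตบอล",Sport News
"นักร้องชื่อดังประกาศอัลบั้มใหม่",Entertainment News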
Since text classification is a statistics-based machine learning model, a sufficient amount of training data is required to ensure model performance. Moreover, the quality of the data, as well as the number of samples per class, is important. If the model is trained on a dataset that is too small, it may overfit, which can degrade performance. Likewise, when the number of samples per class is unbalanced, the model will tend to predict the minority classes poorly. For example, if a news topic dataset contains 90% political news and only 1% science news, it is unlikely that the model can capture the essence of science news.
Prior to extracting features and feeding the vectors to a model, the text is preprocessed according to the following rules:
All of these preprocessing steps are performed with the PyThaiNLP library. Additionally, we use a Thai word tokenization model called newmm, which is also available in PyThaiNLP.
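As a minimal sketch, tokenizing a Thai sentence with the newmm engine looks like this (assuming PyThaiNLP is installed; the example sentence and output are illustrative):

from pythainlp.tokenize import word_tokenize

# Tokenize a Thai sentence with the newmm dictionary-based tokenizer.
tokens = word_tokenize("การเมืองไทยวันนี้", engine="newmm")
print(tokens)  # e.g. ['การเมือง', 'ไทย', 'วันนี้']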
In this model pipeline, TF-IDF is used as the feature extractor. The goal of feature extraction is to convert a text sentence into a set of numbers that we call “features”. To do so, the feature extractor counts the word frequency in each sentence and weights each count by how often the word appears across the whole document collection. This down-weights stop words when extracting features. Stop words are words that appear frequently but convey no important information about the sentence, e.g., ‘มี’ or ‘การ’.
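A minimal sketch of this step, using scikit-learn’s TfidfVectorizer together with the newmm tokenizer (scikit-learn is our assumption here; the platform does not state which TF-IDF implementation it uses):

from pythainlp.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Use the Thai tokenizer instead of the default whitespace-based one.
vectorizer = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)

docs = ["ข่าวการเมืองวันนี้", "ผลฟุตบอลเมื่อคืน"]
features = vectorizer.fit_transform(docs)  # sparse matrix: one TF-IDF vector per sentence
print(features.shape)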
The text classification model uses a machine learning model, computed from TF-IDF features of Thai text, to predict the probability of each provided label. The ACP ML pipeline for text classification automatically chooses a suitable method for processing the TF-IDF data, including finding the cutoffs for words that appear too frequently or too rarely. Finally, the model is built with the perceptron algorithm on the preprocessed data, one of the lightest and most robust machine learning algorithms; thus, the model is simple and fast yet yields good performance.
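The description above maps onto a pipeline like the following sketch. We assume scikit-learn here: max_df/min_df play the role of the frequency cutoffs, and because a plain Perceptron does not expose class probabilities, the sketch wraps it in CalibratedClassifierCV to obtain them. The actual ACP pipeline may differ.

from pythainlp.tokenize import word_tokenize
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # Frequency cutoffs: ignore words in >90% of documents or in fewer than 2.
    ("tfidf", TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None,
                              max_df=0.9, min_df=2)),
    # Perceptron has no predict_proba, so calibrate it to get probabilities.
    ("clf", CalibratedClassifierCV(Perceptron(), cv=3)),
])

# train_texts: list of Thai strings; train_labels: list of class names.
# Both are assumed prepared beforehand (at least 3 samples per class for cv=3).
pipeline.fit(train_texts, train_labels)
probs = pipeline.predict_proba(test_texts)  # one probability per class per text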
For text classification, we use accuracy, F1 score, precision, recall, and the confusion matrix to evaluate a model. Generally, higher accuracy, F1 score, precision, and recall mean better performance.
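These metrics can be computed, for instance, with scikit-learn (again an assumption about tooling, not a statement about ACP internals; the labels below are toy data):

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = ["Political", "Sport", "Sport", "Political"]
y_pred = ["Political", "Sport", "Political", "Political"]

print(accuracy_score(y_true, y_pred))                    # 0.75
print(precision_score(y_true, y_pred, average="macro"))  # averaged over classes
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))                  # rows: true, cols: predicted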
The text classification model receives a list of sentences as input. The API JSON input format is shown below.
{
  "inputs": [
    "sentence-1",
    "sentence-2",
    "sentence-3",
    ...
  ]
}
The response of the text classification model API is a list of JSON objects, where each element contains the input sentence and the predicted probability of each class. The API response is in the following JSON format:
[
  {
    "text": <sentence-1>,
    "results": {
      "Label A": <prob-labelA-of-sentence-1>,
      "Label B": <prob-labelB-of-sentence-1>,
      "Label C": <prob-labelC-of-sentence-1>,
      ...
    }
  },
  ...
]
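Putting the request and response formats together, a call could look like the following sketch, using the requests library (the endpoint URL is a placeholder, not the real ACP address, and any required authentication is omitted):

import requests

API_URL = "https://example.com/text-classification"  # placeholder endpoint

payload = {"inputs": ["sentence-1", "sentence-2"]}
response = requests.post(API_URL, json=payload)
response.raise_for_status()

for item in response.json():
    # Pick the class with the highest predicted probability.
    best = max(item["results"], key=item["results"].get)
    print(item["text"], "->", best, item["results"][best])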
Using the news categorization model as an example, the response of the classification model API would be in the following JSON format:
[
  {
    "text": <an unlabeled news article>,
    "results": {
      "Political News": <prob-Political-of-sentence-1>,
      "Sport News": <prob-Sport-of-sentence-1>,
      "Entertainment News": <prob-Entertainment-of-sentence-1>,
      ...
    }
  },
  ...
]