Technology can bring many improvements, but natural language still presents a challenge. A tool that isn’t available in the user’s language brings that user little value. This is an open problem in Natural Language Processing (NLP), a subfield of linguistics, artificial intelligence, and computer science that studies the interactions between computers and human language, today largely with neural models. It aims to create computer systems capable of understanding human language. NLP tasks include speech recognition, language understanding, and language generation.
The vast majority of NLP progress has happened within the English language due to the relative simplicity of linguistic rules and data availability. Unfortunately, the NLP landscape for the Polish language is vastly different.
Our Machine Learning team set out on a mission to contribute to this area of science. The goal was to create a machine learning-driven chatbot that could act as an online customer support service across different sectors, from eCommerce to legal and energy companies.
Due to the project’s exploratory nature, our developers trained the base model from scratch. First, they gathered data from a variety of sources: Polish Wikipedia, the Polish version of CommonCrawl, the National Corpus of the Polish language, the Corpus of Parliamentary Discourse (parliamentary debates and interpellations), https://czytac.com/, and the www.wolnelektury.pl literature database.
We wrote a dedicated text processor for each dataset using regular expressions. We also checked every dataset for items that aren’t part of the language itself and can be removed, such as links, structural elements of an article, or duplicate titles. These items were located with regular expressions and then handled by the processor (for example, deleted).
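The cleaning step described above might look roughly like the sketch below. It is only an illustration: the patterns, the `clean_document` name, and the rule of dropping a line that repeats the previous one are assumptions, not the rules the team actually used for any particular dataset.

```python
import re

# Hypothetical patterns; real rules would be tuned separately for each dataset.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HEADING_RE = re.compile(r"^=+\s*(.*?)\s*=+$")  # wiki-style heading, e.g. "== Historia =="
SPACES_RE = re.compile(r"[ \t]+")


def clean_document(text: str) -> str:
    """Strip links, wiki-style headings, and immediately repeated lines."""
    kept = []
    for line in text.splitlines():
        line = URL_RE.sub("", line)                # delete links
        line = SPACES_RE.sub(" ", line).strip()    # normalise spacing
        if not line:
            continue                               # skip lines left empty
        if HEADING_RE.match(line):
            continue                               # structural markup, not language
        if kept and kept[-1].lower() == line.lower():
            continue                               # duplicate title repeated on the next line
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    raw = (
        "Mieszko I\n"
        "Mieszko I\n"
        "== Życiorys ==\n"
        "Mieszko I był księciem Polan. Zobacz https://pl.wikipedia.org/wiki/Mieszko_I\n"
    )
    print(clean_document(raw))
```

Keeping each rule as a separate named pattern makes it easy to give every dataset its own list of rules while sharing the same loop.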
At the time of project planning, Transformer-based models had not yet established themselves as state of the art for NLP tasks. LSTM was the most common approach. Our team used it together with GPT-2, training the models on texts extracted from various sources.
Once the model could understand and generate texts on general topics, the next step was to train it on sector-specific data and enable it to carry out niche tasks such as contract generation (for the legal industry) or ticket recognition and routing (for the energy sector). Here, we adapted niche open-access Polish databases, English databases translated with the Google Translate API, and data acquired through web scraping.
Throughout the whole process, the biggest issue was acquiring quality data in sufficient amounts. The next obstacle we had to overcome was the complexity of the Polish language, which is strongly inflectional. This means that a single morpheme (the smallest meaningful unit of language) can denote several grammatical, syntactic, or semantic features at once. For example, the ending -ów in domów ("of houses") marks both the genitive case and the plural number. This caused various problems with the correctness of the generated text and called for repeated model retraining. Once the NLP models generated satisfactory results, the chatbot was equipped with additional features.
Throughout this project, our Machine Learning team worked in the Scrum framework.
We developed a chatbot solution that handled a variety of tasks using Polish as the natural language.
Our team developed a training pipeline for a machine comprehension model using the Hugging Face, PyTorch, and PyTorch Lightning libraries. After studying the literature on the subject and running preliminary experiments, the team decided to use AdamW as the optimizer for training the model. The team also carried out initial research on the super-convergence phenomenon and on implementing techniques that enable it during training.
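The source does not say which technique was explored to enable super-convergence. In the literature the phenomenon is usually associated with the one-cycle learning rate policy, so the sketch below shows that schedule in plain Python as an assumption. The function name and default values are illustrative; the parameter names are borrowed from PyTorch's `torch.optim.lr_scheduler.OneCycleLR`, which offers a comparable schedule and can be paired with `torch.optim.AdamW`.

```python
import math


def one_cycle_lr(
    step: int,
    total_steps: int,
    max_lr: float,
    pct_start: float = 0.3,
    div_factor: float = 25.0,
    final_div_factor: float = 1e4,
) -> float:
    """Learning rate at `step` for a cosine-annealed one-cycle schedule.

    The rate warms up from max_lr / div_factor to max_lr, then decays to
    (max_lr / div_factor) / final_div_factor on the last step.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")

    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = max(1, int(pct_start * total_steps))

    if step < warmup_steps:
        # Phase 1: rise from initial_lr to max_lr.
        progress = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        # Phase 2: decay from max_lr to min_lr.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
        start, end = max_lr, min_lr

    # Cosine interpolation: progress 0 gives `start`, progress 1 gives `end`.
    return end + (start - end) * (1 + math.cos(math.pi * progress)) / 2


if __name__ == "__main__":
    for s in (0, 15, 30, 60, 99):
        print(s, one_cycle_lr(s, total_steps=100, max_lr=1e-3))
```

The short period at a high learning rate in the middle of training is what the one-cycle policy relies on; the low rates at both ends keep the start stable and let the model settle at the end.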
Training was carried out in computing environments equipped with graphics cards. For large datasets, the team ran training on multiple graphics cards using the data-parallel approach. Training first took place in the AWS cloud on NVIDIA Tesla K80 cards, but because these were frequently unavailable, the computations were moved to a local environment with NVIDIA GeForce RTX 3090 Ti cards.
The team also carried out manual model validation using texts that weren’t part of the dataset. The model was evaluated by asking it questions about a previously supplied text fragment. Its task was to point to the part of the text that answered the question or, if the text contained no answer, to indicate that.
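The source does not describe how the model signals "no answer". One common convention in extractive question answering, assumed here, is that the model produces a start score and an end score for every token, and that the scores at position 0 (a special token) stand for "no answer". The function name, the threshold, and the example scores below are all hypothetical.

```python
from typing import List, Optional, Tuple


def best_answer_span(
    start_scores: List[float],
    end_scores: List[float],
    max_answer_len: int = 30,
    null_threshold: float = 0.0,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of the best span, or None for "no answer".

    Assumption: index 0 is a special token whose scores represent "no answer".
    """
    null_score = start_scores[0] + end_scores[0]

    best_score = float("-inf")
    best_span = None
    for start in range(1, len(start_scores)):
        # A span may not end before it starts or exceed max_answer_len tokens.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_scores[start] + end_scores[end]
            if score > best_score:
                best_score = score
                best_span = (start, end)

    # Prefer "no answer" when it beats the best span by more than the threshold.
    if best_span is None or null_score - best_score > null_threshold:
        return None
    return best_span


if __name__ == "__main__":
    tokens = ["[CLS]", "Mieszko", "I", "przyjął", "chrzest", "w", "966", "roku"]
    start = [0.1, 0.2, 0.0, 0.1, 0.3, 0.5, 4.0, 0.2]
    end = [0.1, 0.0, 0.2, 0.1, 0.2, 0.1, 3.5, 1.0]
    span = best_answer_span(start, end)
    print(tokens[span[0]: span[1] + 1] if span else "no answer")
```

Raising `null_threshold` makes the function more willing to return a span; lowering it makes "no answer" more likely.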
To support this evaluation, we built a simple application consisting of two modules: Reader and Retriever. The Reader was the previously trained model, while the Retriever was a simple algorithm that searched Polish Wikipedia for text fragments that might contain an answer to a given question. This removed the need to supply the model with context by hand each time, which made evaluation much easier. Because of the quality of the training data, the model often failed to return the right answer, most often when the question was worded imprecisely. After the query was rephrased (for example, When did the baptism of Poland happen? → Which year was Mieszko I baptized?), the model gave correct or meaningful answers significantly more often.
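A minimal sketch of the Retriever and Reader split is shown below. The source only calls the Retriever "a simple algorithm", so the TF-IDF-style scoring is an assumption, and `toy_reader` is a hypothetical stand-in for the trained model that merely looks for a year in the context.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional

TOKEN_RE = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text.lower())


class Retriever:
    """Ranks passages by a simple TF-IDF overlap with the question (assumed scoring)."""

    def __init__(self, passages: List[str]) -> None:
        self.passages = list(passages)
        self.term_counts = [Counter(tokenize(p)) for p in self.passages]
        doc_freq: Counter = Counter()
        for counts in self.term_counts:
            doc_freq.update(counts.keys())
        n = len(self.passages)
        # Smoothed inverse document frequency: rarer terms weigh more.
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    def top_k(self, question: str, k: int = 3) -> List[str]:
        terms = set(tokenize(question))
        scored = []
        for passage, counts in zip(self.passages, self.term_counts):
            score = sum(counts[t] * self.idf[t] for t in terms if t in counts)
            scored.append((score, passage))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Passages sharing no term with the question are not returned.
        return [passage for score, passage in scored[:k] if score > 0]


# A Reader takes (question, context) and returns an answer or None.
Reader = Callable[[str, str], Optional[str]]


def toy_reader(question: str, context: str) -> Optional[str]:
    """Hypothetical stand-in for the trained model: returns the first year found."""
    match = re.search(r"\b\d{3,4}\b", context)
    return match.group(0) if match else None


def answer(question: str, retriever: Retriever, reader: Reader, k: int = 3) -> Optional[str]:
    """Try the retrieved passages in rank order until the reader finds an answer."""
    for context in retriever.top_k(question, k):
        result = reader(question, context)
        if result is not None:
            return result
    return None


if __name__ == "__main__":
    passages = [
        "Mieszko I przyjął chrzest w 966 roku.",
        "Warszawa jest stolicą Polski.",
        "Wisła to najdłuższa rzeka w Polsce.",
    ]
    retriever = Retriever(passages)
    print(answer("W którym roku Mieszko I przyjął chrzest?", retriever, toy_reader))
```

Exact token matching of this kind misses inflected forms (for example chrzest and chrztu count as different words), so a retriever for Polish text would likely need lemmatization or a similar normalization step on top of this sketch.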
If we were looking to design such a chatbot in English, developing the NLP model from scratch would not be necessary. Many such models are already available as open-source projects. As we described above, that’s not the case with Polish.
Currently, the chatbot integrates only one model, trained on question answering. It serves to demonstrate, in the form of a finished product, what the developed solutions (models trained on individual tasks) can do.
The 4soft team approached this NLP problem in a creative way, taking advantage of the available solutions to make real progress in this challenging area. Thanks to the skills and expertise of the team, the project was a success that opened the door to further work that promises many exciting applications.