Presentation of the speaker
This workshop is presented by Maya Sahraoui.
Introduction
Maya Sahraoui has overcome the constraint of scarce annotated data by applying distant supervision to Named Entity Recognition (NER), and has adapted a promising self-training method to confidently propagate relevant annotations to new unlabeled data. She also devised a comprehensive test protocol to evaluate the approach.
This approach can be applied in many fields (marketing, maintenance, ...) where a knowledge base has to be created from only a small amount of annotated data.
Main notions covered
The final goal is to use precise botanical descriptions of plants to create graphs that represent their structure, in order to perform measurements and comparisons. Creating these graphs requires that every element describing a species be labeled according to certain classes (organs: flower, fruit, leaf, plant; descriptors: form, color, surface, disposition), which is very tedious to do by hand. NLP is therefore used to classify the words describing plants into these classes, but since very little annotated data is available, Frugal AI techniques need to be implemented.
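As a rough illustration, the target labeling could be represented like this in Python (the word-label pairs below are invented for the example):

```python
# Classes from the annotation scheme described above.
ORGANS = {"flower", "fruit", "leaf", "plant"}
DESCRIPTORS = {"form", "color", "surface", "disposition"}

# A hypothetical labeled fragment of a species description:
# each descriptive word is mapped to one of the classes.
labeled_fragment = [
    ("petals", "flower"),     # organ (part of the flower)
    ("campanulate", "form"),  # descriptor: shape
    ("yellow", "color"),      # descriptor
    ("glabrous", "surface"),  # descriptor
]
```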
As we said, for any given species, each organ or descriptor must be classified before it can be used to create a graph. This is why distant supervision is relevant: it's a technique used in NLP to automatically generate labeled training data by leveraging existing external sources of information. The model is supervised by an external (or distant) source of knowledge.
In this case, we will be using an annotated external botanical glossary to classify the matching words in our species descriptions.
In order to perform this distant supervision, we will take advantage of an NER model. This model identifies and classifies named entities in text into predefined categories. In other words, it scans text data to find names and assigns a label to each one.
In our case, it will leverage the knowledge provided by the annotated external glossary to recognize key plant-related words and classify them.
For instance, in the sentence “Calyx aestivation valvate, campanulate, (2-)2.4-3.3 mm long”, the NER model will detect Calyx, valvate and campanulate, and classify them according to the labeled glossary.
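A minimal sketch of this glossary-based labeling might look as follows; the glossary entries and their classes are invented for illustration, and the real pipeline uses the full annotated botanical glossary:

```python
import re

# Hypothetical excerpt of an annotated botanical glossary (term -> class).
glossary = {
    "calyx": "organ",
    "valvate": "disposition",
    "campanulate": "form",
}

def distant_label(sentence, glossary):
    """Label each token found in the glossary; leave the rest untagged ('O')."""
    tokens = re.findall(r"[A-Za-z]+|\S+", sentence)
    return [(tok, glossary.get(tok.lower(), "O")) for tok in tokens]

sentence = "Calyx aestivation valvate, campanulate, (2-)2.4-3.3 mm long"
print(distant_label(sentence, glossary))
# [('Calyx', 'organ'), ('aestivation', 'O'), ('valvate', 'disposition'), ...]
```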
Combining NER (a BERT base model in this case) with distant supervision allows an external source of information to guide the NER model in classifying the entities it finds.
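As a minimal sketch of what such a setup could look like with the HuggingFace transformers library (the model name and label set below are assumptions for illustration, not necessarily the exact configuration of the presented work):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Label set taken from the classes discussed above; "O" marks untagged tokens.
labels = ["O", "organ", "form", "color", "surface", "disposition"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The distantly labeled sentences would then be tokenized and fed to a
# standard fine-tuning loop, e.g. with transformers.Trainer.
```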
Using this method can introduce two types of noise into the training data:
- Missing annotations: some entities were not annotated because the external glossary was missing some words
- Wrong class associations: since each word is associated with a class without any context, errors can happen
To overcome the induced noise, the model is self-trained in a teacher-student architecture. The idea is to make a first set of predictions with the model (the teacher), then add the most reliable predictions to the training dataset and re-train the model (the student) on this enlarged dataset.
This approach avoids overfitting on the initial training set, which is potentially noisy since it was annotated distantly.
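A schematic version of this loop might look as follows; `train`, `predict_with_confidence`, the input datasets, and the 0.9 threshold are all placeholders for illustration, not the authors' exact procedure:

```python
def train(dataset):
    """Fine-tune the NER model on `dataset` and return it (placeholder)."""
    ...

def predict_with_confidence(model, sentence):
    """Return (predicted_labels, confidence) for `sentence` (placeholder)."""
    ...

labeled = list(distantly_labeled_data)   # noisy initial training set
unlabeled = list(unlabeled_sentences)
THRESHOLD = 0.9                          # arbitrary illustrative value

for _ in range(3):                       # a few teacher-student rounds
    teacher = train(labeled)             # 1. teacher: trained on current data
    still_unlabeled = []
    for sentence in unlabeled:
        labels, confidence = predict_with_confidence(teacher, sentence)
        if confidence >= THRESHOLD:      # 2. keep only the most reliable predictions
            labeled.append((sentence, labels))
        else:
            still_unlabeled.append(sentence)
    unlabeled = still_unlabeled
student = train(labeled)                 # 3. student: re-trained on the enlarged set
```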
These techniques illustrate how one can overcome the challenge of scarce annotated data in the context of Frugal AI: devising innovative machine learning solutions that cope with small training datasets and difficult data annotation.
If you are interested in Frugal AI, make sure you check our website so you don’t miss the upcoming events related to this topic.
The full video presentation is available on our YouTube channel, and you can download the slides of the presentation here. To go further, we encourage you to read the research paper related to this presentation.
Thanks for reading and see you soon!