2406-Bunka-Bunka
Description
Retrieval-augmented generation (RAG) has not yet replaced LLM fine-tuning, which remains particularly relevant for adapting a model to a specific domain or task, such as legal work or customer support.
One of the critical factors in successful fine-tuning is the quality of the textual dataset used for training: a poor dataset can affect the model’s neutrality or lead to hallucinations.
In this workshop, we invite you to explore and filter a textual dataset using the Bunka library in order to fine-tune an LLM for a new specialized domain.
At the end, we will compare the performance of the fine-tuned model with and without filtering the dataset with Bunka.
A GPU-enabled environment will be provided.
More details about Bunka:
Bunka is a library that allows for rapid and intuitive visualization of semantic spaces. It enables the study of a textual dataset by spatially representing the positions of embeddings, which can, for example, help identify the topics covered by the dataset.
Bunka also enables dataset filtering based on various criteria to retain only the data of interest.
These operations are particularly useful in the context of fine-tuning LLMs.
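To make the idea of filtering in a semantic space concrete, here is a minimal sketch in plain Python. It is not Bunka's API: the `embed`, `cosine`, and `filter_by_topic` helpers, the sample documents, and the 0.3 threshold are all hypothetical, and the toy bag-of-words vector stands in for the sentence embeddings a real pipeline would use. It only illustrates the principle of keeping the documents that lie close to a topic of interest.

```python
import math
from collections import Counter

def embed(text, vocabulary):
    """Toy embedding (assumption): word counts over a fixed vocabulary, L2-normalised."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(x * y for x, y in zip(a, b))

def filter_by_topic(docs, seed, vocabulary, threshold=0.3):
    """Keep the documents whose embedding is close enough to the seed topic."""
    seed_vec = embed(seed, vocabulary)
    return [d for d in docs if cosine(embed(d, vocabulary), seed_vec) >= threshold]

# Hypothetical mini-dataset mixing legal and unrelated texts.
docs = [
    "the court ruled on the contract dispute",
    "the tenant signed a lease contract",
    "our team won the football match",
    "the recipe needs two eggs and flour",
]
vocabulary = sorted({word for d in docs for word in d.lower().split()})

# Retain only the documents near a hypothetical "legal" topic.
legal_docs = filter_by_topic(docs, "contract court lease", vocabulary)
print(legal_docs)
```

In this toy run only the two legal sentences pass the threshold. With real embeddings the same pattern applies, but the threshold and the definition of the topic would need to be tuned on the actual dataset.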
The data:
The dataset will be either an open-source dataset or one provided by one of our members. (If you are a member and have an unstructured dataset you would like to explore during this workshop, feel free to contact marc.devaugiraud@datacraft.paris to discuss its relevance and use.)