28 Jun 2024 13:00 - 17:00
datacraft –
3 Rue Rossini
75009 Paris, France

Workshop – Exploring datasets for RAG and fine-tuning: how to refine textual data?


  • Charles de Dampierre, CEO, Bunka.ai
  • Louis-Marie Lorin, COO, Bunka.ai



Level in Machine Learning: Good knowledge of Machine Learning/DA/AI
Level in Python: Good knowledge of Python

Talk description

Bunka provides an intuitive and innovative way of exploring textual datasets, using visual maps to give an overview of your data. Bunka enhances traditional visualizations by adding a fully customizable dimensional layer. Whether ML engineers want to analyze perplexity, toxicity (which can be defined by the user), or various other text quality metrics, Bunka has them covered.

Bunka’s technology is fully adaptive and connects to various frameworks, whether it is the latest LLM, a new state-of-the-art embedder, or another clustering algorithm.

Since RAG and fine-tuning require clean datasets to perform optimally, being able to visually analyze the quality of a dataset and evaluate the impact of poor-quality data on these models is crucial.

In this workshop, we will use an open-source dataset consisting of conversations between users and ChatGPT, which could represent generic customer service chats. The interest lies in its generality, as it contains standard chats applicable to any use case.

First, we will embed and visualize the chat titles from ChatGPT, then repeat the process for the chat content.
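The embedding step can be sketched as follows. This is a minimal, self-contained stand-in: the hashed bag-of-words embedder below is a toy substitute for a real sentence embedder (in the workshop, a state-of-the-art model would be used), but it shows the pipeline of embedding titles and comparing them by cosine similarity.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding -- a placeholder for a real
    sentence embedder such as a transformer-based model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        token = token.strip("?.,!")
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already L2-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

titles = [
    "How do I reset my password?",
    "Password reset not working",
    "Best pasta recipes",
]
vecs = [embed(t) for t in titles]
# Related titles should score higher than unrelated ones.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

With real embeddings, the same vectors would then be projected to 2D (e.g. with UMAP) to produce the visual map described above.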

Next, we’ll analyze the toxicity of the responses using a RoBERTa classifier and compare its efficiency with Bunka’s. Toxicity will be defined as the extent to which a response contains offensive, discriminatory, or otherwise harmful content, according to a metric we will establish during the workshop.
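Once each response has a toxicity score, filtering is a simple threshold pass. In this sketch the scores are hard-coded placeholders standing in for the RoBERTa classifier's output, and the 0.5 cut-off is an assumption that would be tuned during the workshop.

```python
# Placeholder data: in the workshop, "toxicity" would be the score
# produced by a RoBERTa toxicity classifier for each response.
chats = [
    {"response": "Sure, here is how to reset your password.", "toxicity": 0.02},
    {"response": "That is a stupid question.",                "toxicity": 0.81},
    {"response": "Happy to help with your order.",            "toxicity": 0.05},
]

TOXICITY_THRESHOLD = 0.5  # assumed cut-off, established during the workshop

# Keep only responses scored below the toxicity threshold.
clean = [c for c in chats if c["toxicity"] < TOXICITY_THRESHOLD]
print(len(clean))  # 2 responses survive the filter
```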

After that, we will evaluate the quality of responses using Bunka, a tool designed to improve AI model performance by enhancing data quality through advanced visual topic modeling and cluster analysis.
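The cluster-analysis idea can be illustrated with a toy greedy clustering over token overlap. This is not Bunka's algorithm (which works on embeddings and visual topic maps); it is only a minimal sketch of grouping similar responses so that each cluster's quality can be inspected as a whole.

```python
def tokens(text: str) -> set:
    return set(text.lower().split())

def overlap(a: set, b: set) -> float:
    """Overlap coefficient between two token sets."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def cluster(texts: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy token-overlap clustering -- a toy stand-in for the
    embedding-based topic modeling a tool like Bunka performs."""
    clusters = []  # each cluster: (representative token set, member texts)
    for t in texts:
        ts = tokens(t)
        for rep, members in clusters:
            if overlap(ts, rep) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((ts, [t]))
    return [members for _, members in clusters]

docs = [
    "reset my password please",
    "password reset help",
    "track my recent order",
]
groups = cluster(docs)
print(len(groups))  # the two password chats group together
```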

Finally, we’ll identify and eliminate redundant data, resulting in a filtered and optimized dataset, ready for LLM fine-tuning or RAG applications.
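A common way to remove redundant data is near-duplicate detection over word shingles with Jaccard similarity. The sketch below is a minimal greedy version; the 0.8 similarity threshold and 3-word shingles are assumptions, not values prescribed by the workshop.

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles for a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def dedupe(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy near-duplicate removal: keep a text only if it is not
    too similar to any already-kept text (threshold is an assumed cut-off)."""
    kept, kept_shingles = [], []
    for t in texts:
        s = shingles(t)
        if all(jaccard(s, ks) < threshold for ks in kept_shingles):
            kept.append(t)
            kept_shingles.append(s)
    return kept

docs = [
    "please reset my password for the account",
    "please reset my password for the account",  # exact duplicate
    "what is your refund policy",
]
print(dedupe(docs))  # the duplicate is dropped, two texts remain
```

At scale, this pairwise comparison would be replaced by an approximate scheme such as MinHash/LSH, but the filtering logic is the same.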

