REX – Data Provenance and Responsible AI for Social Media Privacy and Workforce Intelligence
Guilherme Medeiros Machado, Associate Professor – ECE
Subhankar Maity, post-doc researcher – ECE
— Presentation in English —
As large language models (LLMs) increasingly power intelligent systems, from conversational interfaces to code generation, the need to understand the provenance of their training data has become critical.
Our research tackles the question of whether, and how, specific data samples contribute to LLM behaviour, focusing on code repositories and social media content, two sources that differ sharply from traditional text datasets.
Through data provenance analysis and membership inference attacks, we deliver measurable transparency into how models learn, memorise, and potentially expose sensitive information. The project tests authentic code and social media samples across state-of-the-art LLMs (GPT-5, Gemini 2.5 Flash, Mistral 24B, LLaMA 3.3 70B, DeepSeek, etc.).
We develop statistical and visual frameworks to detect the presence of training data reliably.
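To give a flavour of the kind of detection involved, here is a minimal sketch of a loss-threshold membership inference test, the simplest family of such attacks: a sample whose per-token loss under the model is markedly lower than that of comparable unseen text is flagged as a likely training member. The loss values, threshold margin, and function names below are illustrative assumptions, not the project's actual framework.

```python
import statistics

def membership_score(sample_losses: list[float]) -> float:
    """Average per-token loss; lower loss suggests the model has seen the text."""
    return statistics.mean(sample_losses)

def infer_membership(sample_losses: list[float],
                     reference_losses: list[float],
                     margin: float = 0.5) -> bool:
    """Loss-threshold membership inference (hypothetical parameters):
    flag a sample as a likely training member when its average loss falls
    a fixed margin below the mean loss of known non-member references."""
    threshold = statistics.mean(reference_losses) - margin
    return membership_score(sample_losses) < threshold

# Illustrative numbers only: per-token losses for a candidate code snippet,
# and for reference snippets the model is assumed not to have seen.
candidate = [0.4, 0.3, 0.5, 0.2]          # unusually low loss: possibly memorised
references = [2.1, 1.8, 2.4, 2.0, 1.9]   # typical losses on unseen text

print(infer_membership(candidate, references))  # → True
```

In practice the per-token losses would come from querying the target LLM, and the decision rule would be calibrated statistically rather than with a fixed margin, but the underlying signal, that memorised data is "too easy" for the model, is the same.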
In professional environments, employee-contributed code often feeds LLMs for code assistance, creating blind spots in workforce capabilities. Our detection reveals when proprietary code influences outputs beyond documented team knowledge, highlighting specific learning gaps.
This drives targeted upskilling programs, ensuring organisations maintain competitive expertise while protecting intellectual property from unauthorised model memorisation.
Social media content creates acute privacy risks when absorbed into LLMs, as user-generated posts can reveal personal details through model outputs. Our methods allow individuals to flag whether their personal content was used in training, enabling proactive privacy protection. This user-centric approach demands accountability from AI systems trained on public data, aligning with GDPR requirements for transparency and data rights.
This research converts regulatory challenges into strategic opportunities:
– Privacy-first AI respecting user consent and personal data rights
– IP protection for proprietary code and internal knowledge assets
– Workforce analytics linking AI outputs to actionable training needs
– Market differentiation through transparent, auditable intelligence systems
Lu.ma: Registration
datacraft* is the club for Data Scientists, Researchers, and AI Engineers. Join us!
