Michael Anthony PRO
MikeDoes
AI & ML interests
Privacy, Large Language Models, Explainability
Recent Activity
posted an update
about 3 hours ago
Traditional data leak prevention is failing. A new paper takes a solution-oriented approach inspired by evolution.
The paper introduces a genetic-algorithm-driven method for detecting data leaks. To prove its effectiveness, researchers Anatoliy Sachenko, Petro V., Oleg Savenko, Viktor Ostroverkhov, and Bogdan Maslyyak of Casimir Pulaski Radom University, among others, needed a real-world, complex PII dataset. We're proud that the AI4Privacy PII 300k dataset was used as a key benchmark for their experiments.
This is the power of open-source collaboration. We provide complex, real-world data challenges, and brilliant researchers develop and share better solutions to solve them. It's a win for every organization when this research helps pave the way for more adaptive and intelligent Data Loss Prevention systems.
🔗 Read the full paper to see the data and learn how genetic algorithms are making a difference in cybersecurity: https://ceur-ws.org/Vol-4005/paper19.pdf
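For readers curious what "inspired by evolution" means in practice, here is a toy sketch of a genetic algorithm evolving a leak detector. This is a hypothetical illustration, not the paper's actual method: candidate detectors are keyword-weight vectors, fitness is accuracy on a handful of made-up samples (the paper benchmarks against the AI4Privacy PII 300k dataset instead), and selection, crossover, and mutation run for a few generations.

```python
import random

# Toy labeled samples: (text, contains_pii). Purely illustrative.
SAMPLES = [
    ("call me at 555-0199", True),
    ("my email is jo@example.com", True),
    ("the meeting is at noon", False),
    ("ship it to 12 Oak Street", True),
    ("the build passed", False),
]

KEYWORDS = ["@", "street", "call", "-0", "noon", "build"]

def predict(weights, text):
    # A candidate "detector" is one weight per keyword; flag a leak
    # when the summed weights of matching keywords reach 1.0.
    score = sum(w for w, kw in zip(weights, KEYWORDS) if kw in text)
    return score >= 1.0

def fitness(weights):
    # Fraction of toy samples classified correctly.
    return sum(predict(weights, t) == y for t, y in SAMPLES) / len(SAMPLES)

def evolve(generations=40, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 2) for _ in KEYWORDS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # selection: keep fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(len(KEYWORDS))  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(KEYWORDS))    # point mutation
            child[i] = rng.uniform(0, 2)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(f"best fitness: {fitness(best):.2f}")
```

The real method operates on far richer detection rules and real PII, but the loop is the same idea: keep what detects leaks well, recombine it, and mutate toward better detectors.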
#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset
reacted to their post with ❤️
5 days ago
posted an update
6 days ago
Anonymizing a prompt is half the battle. Reliably de-anonymizing the response is the other.
To build a truly reliable privacy pipeline, you have to test it. A new Master's thesis does just that, and our data was there for every step.
We're excited to showcase this work on handling confidential data in LLM prompts from Nedim Karavdic at Mälardalen University. To build their PII anonymization pipeline, they first trained a custom NER model. We're proud that the Ai4Privacy pii-masking-200k dataset was used as the foundational training data for this critical first step.
But it didn't stop there. The research also used our dataset to create the parallel data needed to train and test the generative "Seek" models for de-anonymization. It's a win-win: our open-source data not only helps build the proposed solution, it also helps prove the solution is better by enabling a rigorous, data-driven comparison.
🔗 Check out the full thesis for a deep dive into building a practical, end-to-end privacy solution: https://www.diva-portal.org/smash/get/diva2:1980696/FULLTEXT01.pdf
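The anonymize-then-de-anonymize round trip can be sketched with placeholders. This is a minimal, hypothetical illustration: a regex stands in for the thesis's trained NER model, and the names `anonymize`/`deanonymize` are ours, not the thesis's.

```python
import re

# Regex stand-ins for an NER model; real pipelines detect many more
# PII classes than these two.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.]+@[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}

def anonymize(prompt):
    """Replace detected PII with indexed placeholders; return the mapping."""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        for i, match in enumerate(pattern.findall(prompt)):
            placeholder = f"[{label}_{i}]"
            mapping[placeholder] = match
            prompt = prompt.replace(match, placeholder)
    return prompt, mapping

def deanonymize(response, mapping):
    """Restore original values wherever the LLM echoed a placeholder."""
    for placeholder, original in mapping.items():
        response = response.replace(placeholder, original)
    return response

masked, mapping = anonymize("Email jane@corp.com or call +1 555 010 9999")
print(masked)  # PII replaced by [EMAIL_0] and [PHONE_0]
print(deanonymize(masked, mapping))
```

In a real pipeline the masked prompt goes to the LLM and the mapping is applied to its response; the generative "Seek" models in the thesis handle restorations that naive string replacement like this cannot, such as paraphrased or partially echoed placeholders.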
#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset