Training Vision AI Models with Public Data: Privacy and Availability Concerns

Research output: Contribution to journal › Article › peer-review

Abstract

This paper contributes to research on the ethics of using publicly available images and videos to train AI models by analyzing five prominent open research datasets built from web user-generated content. The study investigates the current unavailability of these images and videos to understand the extent to which users remove or limit the visibility of their content. Such removal may indicate opposition to the perpetual use of their images or videos in open datasets, in current AI models, or in the training of future models. The findings reveal that all five datasets contain a substantial number of items that are no longer accessible via their original URLs. Further, a longitudinal analysis spanning two and a half years shows a statistically significant increase in this unavailability. The study identifies and categorizes the factors driving it, including account termination, content being made private by users, and items removed by platforms for policy violations. The results show that a significant portion of users may eventually choose to remove their content from the web, adding valuable insights to AI ethics research and highlighting privacy and users' right to be forgotten in the context of publicly shared images and videos.
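The availability measurement the abstract describes (probing items at their original URLs and categorizing why they fail) could, in principle, be sketched as below. This is a minimal illustration only, not the authors' actual code or categories; the mapping from HTTP status codes to removal reasons is an assumption for demonstration.

```python
# Hypothetical sketch: probe whether a dataset item is still reachable
# at its original URL and assign a coarse unavailability category.
# The status-to-category mapping is illustrative, not the paper's taxonomy.
import urllib.request
import urllib.error


def categorize_status(status: int) -> str:
    """Map an HTTP status code to a coarse availability category."""
    if 200 <= status < 300:
        return "available"
    if status in (401, 403):
        return "private or restricted"  # e.g. content made private by the user
    if status in (404, 410):
        return "removed"  # deleted item, terminated account, or policy takedown
    return "other failure"


def probe(url: str, timeout: float = 10.0) -> str:
    """Issue a HEAD request for one item URL and categorize the outcome."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return categorize_status(response.status)
    except urllib.error.HTTPError as err:
        return categorize_status(err.code)
    except urllib.error.URLError:
        return "unreachable"  # DNS failure, connection refused, etc.
```

At dataset scale, one would run `probe` over every item URL at repeated intervals and compare the category counts over time, which is the shape of the longitudinal analysis the abstract reports.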

Original language: English
Journal: Communications of the Association for Information Systems
Volume: 56
State: Published - 2025

Keywords

  • Artificial Intelligence
  • Data Privacy
  • Ethics
  • Open Data
  • Right to be Forgotten
