TY - JOUR
T1 - Training Vision AI Models with Public Data
T2 - Privacy and Availability Concerns
AU - Alsudais, Abdulkareem
N1 - Publisher Copyright: © 2025, Association for Information Systems. All rights reserved.
PY - 2025
Y1 - 2025
AB - This paper contributes to research on the ethics of utilizing publicly available images and videos in training AI models by analyzing five prominent open research datasets containing images and videos collected from web user-generated content. This study investigates the current unavailability of these images and videos to understand the extent to which users remove or limit the visibility of their content. This could indicate their opposition to the perpetual use of their images or videos in open datasets, current AI models, or the training of future models. The findings reveal that all five datasets have a substantial number of items that are no longer accessible via their original URLs. Further, a longitudinal analysis over two and a half years reveals a statistically significant increase in this unavailability. The study identifies and categorizes the factors driving this unavailability, including account termination, content being made private by users, and items removed by platforms due to policy violations. This study shows that a significant portion of users may eventually choose to remove their content from the web. This adds valuable insights to AI ethics research, highlighting privacy and the users' right to be forgotten in the context of publicly shared images and videos.
KW - Artificial Intelligence
KW - Data Privacy
KW - Ethics
KW - Open Data
KW - Right to be Forgotten
UR - http://www.scopus.com/inward/record.url?scp=85215820276&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:85215820276
SN - 1529-3181
VL - 56
JO - Communications of the Association for Information Systems
JF - Communications of the Association for Information Systems
ER -