TY - JOUR
T1 - Privacy preserving large language models: ChatGPT case study based vision and framework
AU - Ullah, Imdad
AU - Hassan, Najm
AU - Gill, Sukhpal Singh
AU - Suleiman, Basem
AU - Ahanger, Tariq Ahamed
AU - Shah, Zawar
AU - Qadir, Junaid
AU - Kanhere, Salil S.
N1 - Publisher Copyright:
© 2024 The Author(s). IET Blockchain published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
PY - 2024/12
Y1 - 2024/12
N2 - Generative Artificial Intelligence (AI) tools based on Large Language Models (LLMs) use billions of parameters to extensively analyse large datasets, extract critical information such as context, specific details, and identifying information, use this information in the training process, and generate responses to the requested queries. The extracted data also contain sensitive information, seriously threatening user privacy and causing reluctance to use such tools. This article proposes a conceptual model called PrivChatGPT, a privacy-preserving model for LLMs consisting of two main components: preserving user privacy during data curation/pre-processing, and preserving both private context and the private training process for large-scale data. To demonstrate the applicability of PrivChatGPT, it is shown how a private mechanism could be integrated into the existing model for training LLMs to protect user privacy; specifically, differential privacy and private training using Reinforcement Learning (RL) were employed. Privacy level probabilities are associated with the document contents, including the private contextual information, and with metadata; these are used to evaluate the disclosure probability loss for an individual's private information. Once differential privacy is applied, the privacy loss is measured and the resulting uncertainty or randomness is evaluated using entropy. The model recursively evaluates the level of privacy guarantees and the uncertainty of public databases and resources at each update, when new information is added for training purposes. To critically evaluate the use of differential privacy for private LLMs, it was hypothetically compared with other mechanisms, such as Blockchain, Private Information Retrieval (PIR), randomisation, obfuscation, anonymisation, and the use of Tor, across various performance measures, including model performance and accuracy, computational complexity, privacy vs. utility, training latency, vulnerability to attacks, and resource consumption. It is concluded that differential privacy, randomisation, and obfuscation can impact the utility and performance of the trained models, whereas Tor, Blockchain, and PIR may introduce additional computational complexity and high training latency. It is believed that the proposed model could serve as a benchmark for privacy-preserving LLMs for generative AI tools.
AB - Generative Artificial Intelligence (AI) tools based on Large Language Models (LLMs) use billions of parameters to extensively analyse large datasets, extract critical information such as context, specific details, and identifying information, use this information in the training process, and generate responses to the requested queries. The extracted data also contain sensitive information, seriously threatening user privacy and causing reluctance to use such tools. This article proposes a conceptual model called PrivChatGPT, a privacy-preserving model for LLMs consisting of two main components: preserving user privacy during data curation/pre-processing, and preserving both private context and the private training process for large-scale data. To demonstrate the applicability of PrivChatGPT, it is shown how a private mechanism could be integrated into the existing model for training LLMs to protect user privacy; specifically, differential privacy and private training using Reinforcement Learning (RL) were employed. Privacy level probabilities are associated with the document contents, including the private contextual information, and with metadata; these are used to evaluate the disclosure probability loss for an individual's private information. Once differential privacy is applied, the privacy loss is measured and the resulting uncertainty or randomness is evaluated using entropy. The model recursively evaluates the level of privacy guarantees and the uncertainty of public databases and resources at each update, when new information is added for training purposes. To critically evaluate the use of differential privacy for private LLMs, it was hypothetically compared with other mechanisms, such as Blockchain, Private Information Retrieval (PIR), randomisation, obfuscation, anonymisation, and the use of Tor, across various performance measures, including model performance and accuracy, computational complexity, privacy vs. utility, training latency, vulnerability to attacks, and resource consumption. It is concluded that differential privacy, randomisation, and obfuscation can impact the utility and performance of the trained models, whereas Tor, Blockchain, and PIR may introduce additional computational complexity and high training latency. It is believed that the proposed model could serve as a benchmark for privacy-preserving LLMs for generative AI tools.
KW - artificial intelligence
KW - blockchain applications and digital technology
KW - blockchain platforms
KW - blockchain standards
KW - data protection
KW - information security
KW - models and analysis
KW - security of data
UR - https://www.scopus.com/pages/publications/85209119067
U2 - 10.1049/blc2.12091
DO - 10.1049/blc2.12091
M3 - Article
AN - SCOPUS:85209119067
SN - 2634-1573
VL - 4
SP - 706
EP - 724
JO - IET Blockchain
JF - IET Blockchain
IS - S1
ER -