🔬 Research Summary by Arjun Arunasalam, a 4th-year Computer Science Ph.D. student at Purdue University researching security, privacy, and trust on online platforms through a human-centered lens.
[Original paper by Yufan Chen, Arjun Arunasalam, and Z. Berkay Celik]
Overview: Users often seek Security and Privacy (S&P) advice from online resources, which help them understand S&P technologies and tools and suggest actionable strategies. Large Language Models (LLMs) have recently emerged as trusted information sources, so understanding their performance as an S&P advice resource is critical. This paper presents an exploratory analysis of state-of-the-art LLMs’ ability to refute publicly held S&P misconceptions.
Introduction
In recent years, LLMs have emerged as the most prominent NLP technology, with end-users leveraging ChatGPT and Bard interfaces to engage with these powerful AI tools. LLMs have also become prominent information sources; users interact with LLMs to seek health information, stock market advice, and even job interview guidance. This provides prime conditions for LLMs’ adoption as an S&P advice tool.
In this paper, we empirically aim to answer:
Are LLMs reliable in providing S&P advice by correctly refuting user-held S&P-related misconceptions?
To answer this research question, we query Google Scholar with S&P seed keywords to carefully curate 122 unique S&P misconceptions spanning various categories (e.g., IoT/CPS, Web Security). We then query two popular LLMs (Bard and ChatGPT) to assess their ability to refute these misconceptions and annotate the resulting data.
Bard and ChatGPT demonstrate a non-negligible error rate, incorrectly supporting popular S&P misconceptions. Error rates increase when LLMs are repeatedly queried or provided paraphrased misconceptions. Our exploration of information sources for responses revealed that LLMs are susceptible to providing invalid URLs or pointing to unrelated sources. Our efforts motivate future work in understanding how users can better interact with this technology.
Key Insights
Evaluating LLMs with a Four-Experiment Approach
We designed four experiments (E1–E4) to extensively evaluate Bard and ChatGPT’s capability to provide S&P advice.
Measuring Ability to Respond to Misconceptions
In E1, we query each misconception once (a single trial) for both ChatGPT and Bard. In E2, we evaluate the consistency of LLMs in responding to S&P misconceptions. The models that users interact with (via web interfaces) are non-deterministic: asking the same question twice may not produce identical responses. To simulate real-world scenarios where multiple individuals ask the same question and receive different responses, we conducted four additional trials per misconception, generating a total of 488 additional responses from each model. In E3, we evaluate the effectiveness of LLMs in handling paraphrased queries, since users may phrase their questions in various ways. To do so, we use paraphrasing tools to produce an augmented dataset consisting of paraphrases of our original misconceptions. Together, these experiments evaluate the LLMs’ general effectiveness, their consistency, and their susceptibility to paraphrasing.
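Below is a minimal sketch of how such a querying procedure could be organized. The helpers `query_llm` and `paraphrase` are hypothetical stand-ins for the chatbot web interfaces and paraphrasing tools used in the study, not the authors’ actual tooling.

```python
# Minimal sketch of the E1-E3 querying procedure. `query_llm(model, prompt)` and
# `paraphrase(text)` are hypothetical helpers standing in for the chatbot web
# interfaces and paraphrasing tools; this is not the authors' actual pipeline.

NUM_EXTRA_TRIALS = 4  # E2: four additional trials per misconception


def collect_responses(misconceptions, models, query_llm, paraphrase):
    responses = []  # one record per (model, misconception, experiment, trial)
    for model in models:              # e.g., ["ChatGPT", "Bard"]
        for m in misconceptions:      # the 122 curated misconceptions
            # E1: a single trial per misconception
            responses.append((model, m, "E1", 0, query_llm(model, m)))
            # E2: repeated trials probe consistency under non-determinism
            for trial in range(1, NUM_EXTRA_TRIALS + 1):
                responses.append((model, m, "E2", trial, query_llm(model, m)))
            # E3: paraphrased variants probe robustness to rewording
            for i, variant in enumerate(paraphrase(m)):
                responses.append((model, m, "E3", i, query_llm(model, variant)))
    return responses
```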
Responses produced by these LLMs are labeled as one of Support, Negate (the correct answer), Partially Support (the LLM is less committal in its support), or Noncommittal (the LLM refuses to take a stance). We take a conservative approach when defining error in these experiments: we consider a misconception’s result incorrect if any of its trials results in a Support label. Conversely, we consider a Negate response correct, and for repeated-trial experiments, a misconception produces a correct result only if it receives Negate responses across all trials. We define error rate as the ratio of misconceptions producing an incorrect result to the total of 122 misconceptions.
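This scoring can be summarized with a short sketch. The label names follow the paper’s scheme; the dictionary of trial labels and the helper function below are illustrative, not the authors’ code.

```python
# A sketch of the conservative scoring described above: a misconception counts
# as an error if *any* trial is labeled Support, and as correct only if *all*
# trials are labeled Negate. "Partially Support" and "Noncommittal" count
# toward neither bucket here.

SUPPORT, NEGATE = "Support", "Negate"


def error_and_correct_rates(trial_labels: dict) -> tuple:
    total = len(trial_labels)  # 122 misconceptions in the study
    incorrect = sum(1 for labels in trial_labels.values() if SUPPORT in labels)
    correct = sum(1 for labels in trial_labels.values()
                  if all(label == NEGATE for label in labels))
    return incorrect / total, correct / total  # (error rate, correctness rate)


# Example: one misconception refuted consistently, another flipping its stance.
rates = error_and_correct_rates({
    "M1": [NEGATE] * 5,                                        # correct across all trials
    "M2": [NEGATE, SUPPORT, NEGATE, NEGATE, "Noncommittal"],   # counts as an error
})
print(rates)  # (0.5, 0.5)
```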
Understanding Informational Resources Provided
We additionally evaluate the LLMs’ ability to provide reliable sources. In E4, we prompted the models for the URLs of the resources informing their responses to each misconception. Specifically, we followed up on queries in E2 with the prompt, “Can you provide the URLs of your source?”. We then examined the resulting URLs to verify whether they point to existing websites and inspected the information they contain.
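A simple way to automate the validity portion of this check could look like the following sketch, assuming the URLs have already been extracted from the follow-up responses. It only tests whether a URL resolves to an existing page; judging whether that page is related to the misconception and factually accurate still requires manual review.

```python
# A minimal sketch of an E4-style URL validity check. It verifies only that a
# URL points to an existing, reachable webpage; content relevance and accuracy
# must still be assessed by hand.

import requests


def url_is_valid(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL resolves to an existing, reachable webpage."""
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False
```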
Response To Misconceptions
Unsurprisingly, our findings reveal that state-of-the-art LLMs may not always be effective at providing security and privacy advice. Our empirical results from E1 highlight that although both models correctly Negate misconceptions ~70% of the time, they also demonstrate a non-negligible error rate of 21.3%.
The Impact of Repeated or Paraphrased Queries
Through our analysis in E2, we discovered that models are vulnerable to repeated queries. When queried with a misconception, there is no guarantee that an LLM will consistently refute it. Our experiments reveal that when repeatedly queried with the same misconception, both models show a non-negligible tendency to be inconsistent in their stance. Moreover, repeated querying yields an increased error rate of 28.7%.
Responses from both models also contain confusing patterns that may mislead unassuming users. For instance, when asked about the misconception “Under GDPR, individuals have an absolute right to be forgotten,” ChatGPT responds with:
“Yes, it is true that under GDPR… individuals have a “right to be forgotten” … However, this right is not absolute … “
Since ChatGPT and Bard’s responses tend to be elaborate, users who do not pay close attention may be misled by these responses.
The error rate increases to 30.95% when the queries are paraphrased in E3. Paraphrasing queries reduces LLM consistency and causes a significant increase in error rate and a reduction in correctness compared to repeatedly querying the same misconception (E2).
Resources and their Reliability
In E4, we discovered that Bard divulges the URL sources informing its responses less frequently than ChatGPT does. However, when comparing URL validity (whether the URL directs to an existing webpage), Bard’s URL sources are more likely to be valid. Interestingly, even when a URL is valid, it may point to a website that is completely unrelated to the misconception, or even to a source containing false information (factual errors on security and privacy).
To illustrate, when responding to “Under GDPR, parental consent is always required when collecting personal data from children.”, ChatGPT directs users to a website that ignores the cases where parental consent is not necessary for processing children’s personal data, directly misinforming the user. These findings highlight how LLMs may direct users to sources that do not exist or to problematic/false resources.
Between the lines
Our work underscores the importance of taking a measured approach to leveraging powerful LLM tools despite their impressive capabilities. As LLMs gain further footing in end-users’ lives, it is essential that the research community ensure that the content produced by these powerful tools is trustworthy. Consuming inaccurate or false security and privacy advice can harm end-users, who may be misled into implementing insecure and privacy-compromising suggestions.
We hope our work prompts further research into LLM use as an information source. First, we encourage future research to better understand how end-users interact with this emerging technology. Second, we hope that the research community implements education efforts, such as guidelines on using LLMs responsibly. Third, future efforts should also focus on demystifying the causes behind inaccurate responses and on how models can deprioritize the suspicious sources that inform their responses. Finally, in designing LLM tools for specialized needs, the AI research community should foster collaboration with domain experts to produce tools that appropriately cater to end-users.