AI Bots and the Challenges of Real-Time Search Results

Generative AI is fast being integrated with real-time search engine data, but real challenges remain over the accuracy and veracity of the information it surfaces. Are the chatbots feeling lucky? And who’s culpable when they dish out incorrect information?

AI chat tools slip up from time to time, producing errors described as “hallucinations”. Liability for damages caused by AI systems is still a complex and largely unresolved area of law.

Just a couple of weeks ago, the BBC reported on a story involving a certain New York lawyer who found himself in hot water because of a ChatGPT error. The bot supplied citations to court cases that later turned out not to exist. The lawyer said he hadn’t realised that content from ChatGPT could be false.

It is a challenge to attribute responsibility because AI systems, while designed by humans, operate independently and make decisions based on patterns in data rather than explicit programming.

Bing and Bard both come with strong disclaimers regarding their limitations, and the providers are for the most part exempt from responsibility for any damages caused by the AI bots. Of course, that doesn’t do much to help the end user.

Google emphasised the need for “rigorous testing” when its Bard chatbot made a notorious error during an early demonstration — a faux pas which cost the company something in the region of $100bn (£82.7bn) in market value.

The problem is that AI chatbots tend to present results as if they were definitive answers. In most cases they are correct, but they can be just as confidently wrong.

New and improved iterations of the AI models aim to iron out the errors. ChatGPT-4 was released in mid-March with the promise of better reasoning and more accurate responses.

Contrary to OpenAI’s assertion that GPT-4 is 40% more accurate than version 3.5, recent research conducted by NewsGuard found that the updated version might actually be more susceptible to spreading inaccurate information.

NewsGuard is a service that evaluates the credibility of news and information websites. Their test involved feeding the current and older versions of ChatGPT with 100 fabricated news stories. The expectation was that the chatbot would challenge the false claims and deliver accurate responses, preventing the dissemination of misinformation.

“ChatGPT-3.5 generated misinformation and hoaxes 80% of the time when prompted to do so in a NewsGuard exercise using 100 false narratives from its catalog of significant falsehoods in the news. NewsGuard found that its successor, ChatGPT-4, spread even more misinformation, advancing all 100 false narratives.”

Extracting accurate answers from search engines in the era of ‘false news’ is bound to be a struggle for chatbots. After all, the challenge is significant enough for humans. OpenAI has warned against the use of ChatGPT without proper fact checking. While the bot’s logical reasoning seems to have improved, some are concerned the AI might now be more accepting of false statements from users.

LLMs have the potential to mirror and sustain the biases present in their training data, resulting in the inadvertent production of politically biased, prejudiced, or offensive text. And since the AI lacks the ability to verify facts or stay up to date in real time, responses may become outdated or incorrect as the data they were trained on becomes obsolete. But what about AI models using live search engine data?

The Washington Post conducted its own test just this week, with the aim of assessing the reliability of Microsoft’s Bing AI. Specifically, they wanted to test the credibility of Bing’s references. Like any good scholar, is it able to cite its sources properly? 

“We wanted to understand whether the AI was actually good at researching complex questions. So we set up an experiment with Microsoft’s Bing chat, which includes citations for the answers its AI provides. The sources are linked in the text of its response and footnoted along the bottom with a shortened version of their addresses. We asked Bing 47 tough questions, then graded its more than 700 citations by tapping the expertise of 10 fellow Washington Post journalists.”

Geoffrey A. Fowler and Jeremy B. Merrill, Washington Post

Because Bing’s answers and citations often varied, each question was asked three times. Across the 47 questions and more than 700 citations, the Washington Post found that almost one in ten of the sources were questionable.

Chatbots are only as good as the data they’re trained on. If a chatbot doesn’t have enough data to draw from, it may provide users with incorrect information. This is especially true for chatbots that use machine learning algorithms to learn and adapt to user requests. 

Typically, chatbots come with features like ‘thumbs up’ and ‘thumbs down’ buttons, or other means for users to rate whether they’re happy or unhappy with a given response. When a user indicates they’re unsatisfied, it represents an opportunity for the chatbot to enhance its performance. For businesses that make use of LLMs, it’s also a moment to apologise for potential mistakes.
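
As a rough illustration of that feedback loop, here is a minimal Python sketch. Everything in it (the log file name, the record_feedback and flag_for_review helpers, the 50% review threshold) is hypothetical rather than taken from any particular chatbot product.

```python
import json
import time

FEEDBACK_LOG = "feedback.jsonl"
REVIEW_THRESHOLD = 0.5  # flag responses whose approval rate falls below 50%

def record_feedback(response_id: str, question: str,
                    answer: str, thumbs_up: bool) -> None:
    """Append one thumbs-up/thumbs-down event to a local log file."""
    event = {
        "response_id": response_id,
        "question": question,
        "answer": answer,
        "thumbs_up": thumbs_up,
        "timestamp": time.time(),
    }
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def flag_for_review(events: list[dict]) -> set[str]:
    """Return the IDs of responses whose approval rate is below the threshold."""
    totals: dict[str, list[int]] = {}  # response_id -> [ups, downs]
    for e in events:
        counts = totals.setdefault(e["response_id"], [0, 0])
        counts[0 if e["thumbs_up"] else 1] += 1
    return {
        rid for rid, (ups, downs) in totals.items()
        if ups / (ups + downs) < REVIEW_THRESHOLD
    }
```

Flagged responses can then be reviewed by a human and fed back into prompt or training improvements.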

Certain companies and organisations are taking their own measures. Stack Overflow, a Q&A platform for developers, recently imposed a temporary restriction on users posting answers generated by ChatGPT. This move came in response to the chatbot’s tendency to produce responses that seemed believable but were, in fact, false.

Of course, one major reason why chatbots give wrong information is that they have not correctly understood the question. More often than not that’s an issue of ambiguous wording on the part of the user. To mitigate this risk, chatbots should be set up to ask follow-up questions that confirm understanding before answering.
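
One lightweight way to implement that pattern is sketched below, under stated assumptions: the llm parameter stands for any function that maps a prompt string to a completion string, and the “CLARIFY:” marker convention is invented here for illustration, not part of any vendor’s API.

```python
from typing import Callable

CLARIFY_PROMPT = (
    "You are a support chatbot. If the user's question below is ambiguous or "
    "missing details you need, reply with exactly one clarifying question "
    "prefixed with 'CLARIFY:'. Otherwise, answer the question directly.\n\n"
    "User question: {question}"
)

def answer_or_clarify(question: str,
                      llm: Callable[[str], str]) -> tuple[str, bool]:
    """Return (text, needs_clarification). When the flag is True, the caller
    should relay the clarifying question to the user rather than treating
    the text as a final answer."""
    reply = llm(CLARIFY_PROMPT.format(question=question)).strip()
    if reply.startswith("CLARIFY:"):
        return reply.removeprefix("CLARIFY:").strip(), True
    return reply, False
```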

For businesses, a key feature to look for is a chatbot’s ability to accurately detect its limitations and seamlessly escalate to a human-centric support channel, such as live chat. This capability not only enhances the efficiency of the customer service process but also significantly boosts overall customer satisfaction.
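
As an illustration of what such an escalation gate might look like, here is a short Python sketch. The confidence score and the hand_off_to_agent hook are placeholders; real systems might derive confidence from model log-probabilities, retrieval scores, or a separate classifier.

```python
from dataclasses import dataclass
from typing import Callable

ESCALATION_THRESHOLD = 0.6  # below this, route the conversation to a human

@dataclass
class BotReply:
    text: str
    confidence: float  # 0.0 to 1.0, however the pipeline estimates it

def respond(reply: BotReply,
            hand_off_to_agent: Callable[[str], None]) -> str:
    """Send the bot's answer, or escalate when confidence is too low."""
    if reply.confidence < ESCALATION_THRESHOLD:
        hand_off_to_agent(reply.text)  # queue the conversation for live chat
        return ("I'm not confident I can answer that correctly, so I'm "
                "connecting you with a member of our support team.")
    return reply.text
```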

The incorporation of live web search results expands the horizons of AI chatbots, enabling them to offer more comprehensive and relevant information. With access to real-time data, chatbots can keep pace with current events, provide the latest statistics, and offer insights from diverse sources. Users can now engage in conversations that reflect the most recent developments, enhancing the overall experience and usefulness of AI chatbots.

The availability of live web search results brings challenges that cannot be ignored. A critical concern lies in determining the accuracy and veracity of the information retrieved. Unlike traditional search engines, chatbots need to evaluate and assess the reliability of search results in real time. Relying solely on search algorithms may lead to the propagation of misinformation or biased content, potentially compromising the quality and trustworthiness of the responses generated. Measures must be taken to enhance the capabilities of AI chatbots in evaluating the accuracy and veracity of search result data.
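
One plausible shape for such a measure, sketched here in Python with made-up data, is to filter live search results through a table of per-domain credibility scores (of the kind a ratings or fact-checking service might license) before the model ever sees them. The domains, score values, and threshold below are illustrative only.

```python
from urllib.parse import urlparse

CREDIBILITY = {              # domain -> trust score in [0, 1], illustrative
    "bbc.co.uk": 0.95,
    "example-tabloid.com": 0.30,
}
DEFAULT_SCORE = 0.50         # unknown domains get a neutral score
MIN_SCORE = 0.60             # drop anything below this before prompting

def filter_results(results: list[dict]) -> list[dict]:
    """Keep only search results from sufficiently credible domains,
    ranked most-credible first. Each result dict needs a 'url' key."""
    def score(result: dict) -> float:
        domain = urlparse(result["url"]).netloc.removeprefix("www.")
        return CREDIBILITY.get(domain, DEFAULT_SCORE)
    kept = [r for r in results if score(r) >= MIN_SCORE]
    return sorted(kept, key=score, reverse=True)
```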

Collaborations with fact-checking organisations may be the key. It is essential to prioritise transparency, allowing users to understand the basis of the information provided and facilitating their own evaluation of its reliability.

Striving for accuracy, veracity, and transparency should be at the forefront of efforts to ensure that AI chatbots provide reliable and trustworthy information. By implementing robust verification mechanisms and fostering collaborations with authoritative sources, we can embrace these exciting times while upholding the necessary precautions to safeguard the integrity of AI chatbot responses. 

Advanced algorithms will no doubt be developed to better check the credibility of sources, consider multiple perspectives, and detect potential biases. Nevertheless, while chatbots excel at personalized interactions, search engines remain the preferred choice for discovering new information and exploring the internet’s vast resources — at least for the time being.
