ChatGPT and Large Language Models: Their Risks and Limitations
For more on artificial intelligence (AI) in investment management, check out The Handbook of Artificial Intelligence and Big Data Applications in Investments, by Larry Cao, CFA, from the CFA Institute Research Foundation.
Performance and Data
Despite its seemingly “magical” qualities, ChatGPT, like other large language models (LLMs), is just a giant artificial neural network. Its complex architecture consists of about 400 core layers and 175 billion parameters (weights) all trained on human-written texts scraped from the web and other sources. All told, these textual sources total about 45 terabytes of initial data. Without the training and tuning, ChatGPT would produce just gibberish.
We might imagine that LLMs’ astounding capabilities are limited only by the size of their networks and the amount of data they train on. That is true to an extent. But LLM inputs cost money, and even small improvements in performance require significantly more computing power. According to estimates, training ChatGPT-3 consumed about 1.3 gigawatt hours of electricity and cost OpenAI about $4.6 million in total. The larger ChatGPT-4 model, by contrast, is estimated to have cost $100 million or more to train.
OpenAI researchers may have already reached an inflection point, and some have admitted that further performance improvements will have to come from something other than increased computing power.
Still, data availability may be the most critical impediment to the progress of LLMs. ChatGPT-4 has been trained on all the high-quality text that is available from the internet. Yet far more high-quality text is stored away in individual and corporate databases and is inaccessible to OpenAI or other firms at reasonable cost or scale. Such curated training data, layered with additional training techniques, could fine-tune pre-trained LLMs to better anticipate and respond to domain-specific tasks and queries. Such specialized LLMs could not only outperform larger LLMs on those tasks but also be cheaper, more accessible, and safer.
But inaccessible data and the limits of computing power are only two of the obstacles holding LLMs back.
Hallucination, Inaccuracy, and Misuse
The most pertinent use case for foundational AI applications like ChatGPT is gathering, contextualizing, and summarizing information. ChatGPT and LLMs have helped write dissertations and extensive computer code and have even taken and passed complicated exams. Firms have commercialized LLMs to provide professional support services. The company Casetext, for example, has deployed ChatGPT in its CoCounsel application to help lawyers draft legal research memos, review and create legal documents, and prepare for trials.
Yet whatever their writing ability, ChatGPT and LLMs are statistical machines. They provide “plausible” or “probable” responses based on what they “saw” during their training. They cannot always verify or describe the reasoning and motivation behind their answers. While ChatGPT-4 may have passed multi-state bar exams, an experienced lawyer should no more trust its legal memos than they would those written by a first-year associate.
The statistical nature of ChatGPT is most obvious when it is asked to solve a mathematical problem. Prompt it to integrate some multiple-term trigonometric function and ChatGPT may provide a plausible-looking but incorrect response. Ask it to describe the steps it took to arrive at the answer, and it may again give a plausible-looking response. Ask again and it may offer an entirely different answer. There should be only one right answer and a consistent sequence of analytical steps to arrive at it. This underscores the fact that ChatGPT does not “understand” math problems and does not apply the computational algorithmic reasoning that mathematical solutions require.
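The run-to-run variation described above follows from how a language model produces text: it samples from a probability distribution over possible continuations. The minimal Python sketch below illustrates the idea with a toy distribution. The candidate answers and their scores are made up for illustration and do not come from any real model; only the first candidate is a correct antiderivative of sin(x)cos(x).

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities; higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_answer(candidates, scores, temperature=1.0, rng=random):
    """Draw one candidate at random, weighted by its softmax probability."""
    probs = softmax(scores, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical completions for "integrate sin(x)cos(x)" with made-up scores.
# Only the first is correct; the other two merely look plausible.
CANDIDATES = ["sin(x)**2 / 2 + C", "cos(x)**2 / 2 + C", "sin(2*x) / 2 + C"]
SCORES = [2.0, 1.5, 1.2]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    for attempt in range(5):
        print(attempt, sample_answer(CANDIDATES, SCORES, rng=rng))
```

Because each answer is a weighted random draw, asking the same question repeatedly can return different answers, including incorrect ones. Lowering the hypothetical `temperature` parameter concentrates probability on the top-scoring candidate but does not make that candidate correct.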
The random statistical nature of LLMs also makes them susceptible to what data scientists call “hallucinations,” flights of fancy that they pass off as reality. If they can provide wrong yet convincing text, LLMs can also spread misinformation and be used for illegal or unethical purposes. Bad actors could prompt an LLM to write articles in the style of a reputable publication and then disseminate them as fake news, for example. Or they could use it to defraud clients by obtaining sensitive personal information. For these reasons, firms like JPMorgan Chase and Deutsche Bank have banned the use of ChatGPT.
How can we address LLM-related inaccuracies, accidents, and misuse? Fine-tuning pre-trained LLMs on curated, domain-specific data can help improve the accuracy and appropriateness of their responses. The company Casetext, for example, relies on pre-trained ChatGPT-4 but supplements its CoCounsel application with additional training data (legal texts, cases, statutes, and regulations from all US federal and state jurisdictions) to improve its responses. CoCounsel recommends more precise prompts based on the specific legal task the user wants to accomplish, and it always cites the sources from which it draws its responses.
Certain additional training techniques, such as reinforcement learning from human feedback (RLHF), applied on top of the initial training can reduce an LLM’s potential for misuse or misinformation as well. RLHF “grades” LLM responses based on human judgment. This data is then fed back into the neural network as part of its training to reduce the possibility that the LLM will provide inaccurate or harmful responses to similar prompts in the future. Of course, what is an “appropriate” response is subject to perspective, so RLHF is hardly a panacea.
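As a rough illustration of how human grades could become a training signal, the Python sketch below converts graded responses into (preferred, rejected) pairs, the kind of comparison data commonly used to train a reward model in RLHF. The `GradedResponse` structure, its field names, and the grading scale are hypothetical, and the sketch stops short of any actual model training.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class GradedResponse:
    prompt: str
    response: str
    grade: float  # human rating; higher means more appropriate (hypothetical scale)

def preference_pairs(graded):
    """Build (prompt, preferred, rejected) triples from responses to the same prompt."""
    by_prompt = {}
    for item in graded:
        by_prompt.setdefault(item.prompt, []).append(item)
    pairs = []
    for prompt, items in by_prompt.items():
        for a, b in combinations(items, 2):
            if a.grade == b.grade:
                continue  # a tie carries no preference signal
            preferred, rejected = (a, b) if a.grade > b.grade else (b, a)
            pairs.append((prompt, preferred.response, rejected.response))
    return pairs
```

The pairs inherit whatever perspective the graders bring, which is one concrete way to see why the notion of an “appropriate” response limits the technique.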
“Red teaming” is another improvement technique through which users “attack” the LLM to find its weaknesses and fix them. Red teamers write prompts to persuade the LLM to do what it is not supposed to do in anticipation of similar attempts by malicious actors in the real world. By identifying potentially bad prompts, LLM developers can then set guardrails around the LLM’s responses. While such efforts do help, they are not foolproof. Despite extensive red teaming on ChatGPT-4, users can still engineer prompts to circumvent its guardrails.
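A red-teaming exercise can be pictured as a small test harness: run a list of adversarial prompts against the model and record which ones get past the guardrails. The Python sketch below assumes a hypothetical `toy_model` stand-in with a deliberately naive keyword guardrail and a simple `violates_policy` check. Real evaluations are far more involved.

```python
def toy_model(prompt):
    """Hypothetical stand-in for an LLM call; it refuses only on an exact keyword."""
    if "password" in prompt.lower():
        return "I can't help with that."
    return f"Sure, here is how to {prompt}"

def violates_policy(response, banned_terms):
    """Flag a response that did not refuse and contains any banned term."""
    text = response.lower()
    refused = text.startswith("i can't")
    return (not refused) and any(term in text for term in banned_terms)

def red_team(model, attack_prompts, banned_terms):
    """Return the attack prompts whose responses slipped past the guardrail."""
    return [p for p in attack_prompts if violates_policy(model(p), banned_terms)]

if __name__ == "__main__":
    attacks = ["steal a password", "steal a passw0rd", "steal credentials"]
    print(red_team(toy_model, attacks, banned_terms=["steal"]))
```

In this toy setup the misspelled and reworded prompts evade the keyword check, mirroring how users can engineer prompts around guardrails that red teamers did not anticipate.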
Another potential solution is deploying additional AI to police the LLM by creating a secondary neural network in parallel with the LLM. This second AI is trained to judge the LLM’s responses based on certain ethical principles or policies. The “distance” of the LLM’s response to the “right” response according to the judge AI is fed back into the LLM as part of its training process. This way, when the LLM considers its choice of response to a prompt, it prioritizes the one that is the most ethical.
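The judge idea can be sketched in simplified form. The paragraph above describes feeding the judge's signal back into training; the Python sketch below instead shows a simpler variant, in which a judge function scores candidate responses and the highest-scoring one is selected at response time. The `judge_score` function and the two rule-based principles are hypothetical placeholders for a trained judge model.

```python
def judge_score(response, principles):
    """Hypothetical judge: fraction of principle checks the response passes."""
    checks = [check(response) for check in principles]
    return sum(checks) / len(checks)

def pick_response(candidates, principles):
    """Return the candidate the judge scores highest (the first one wins ties)."""
    return max(candidates, key=lambda r: judge_score(r, principles))

# Illustrative principles expressed as simple text checks (assumptions, not real policy).
PRINCIPLES = [
    lambda r: "guaranteed" not in r.lower(),         # no promises of returns
    lambda r: "not investment advice" in r.lower(),  # includes a disclaimer
]

if __name__ == "__main__":
    candidates = [
        "Guaranteed 20% returns!",
        "Returns vary. This is not investment advice.",
        "Returns vary.",
    ]
    print(pick_response(candidates, PRINCIPLES))
```

In a training setting, the same score would serve as the feedback signal rather than as a selection rule.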
Transparency
ChatGPT and LLMs share a shortcoming common to AI and machine learning (ML) applications: They are essentially black boxes. Not even the programmers at OpenAI know exactly how ChatGPT configures itself to produce its text. Model developers traditionally design their models before committing them to program code, but LLMs use data to configure themselves. LLM network architecture itself lacks a firm theoretical or engineering basis: Programmers chose many network features simply because they work, without necessarily knowing why they work.
This inherent transparency problem has led to a whole new framework for validating AI/ML algorithms, so-called explainable or interpretable AI. The model management community has explored various methods to build intuition and explanations around AI/ML predictions and decisions. Many techniques seek to understand which features of the input data generated the outputs and how important they were to certain outputs. Others reverse engineer the AI models to build a simpler, more interpretable model in a localized realm where only certain features and outputs apply. Unfortunately, interpretable AI/ML methods become exponentially more complicated as models grow larger, so progress has been slow. To my knowledge, no interpretable AI/ML method has been applied successfully to a neural network of ChatGPT’s size and complexity.
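One widely used technique of the first kind is permutation feature importance: shuffle one input feature across the data set and measure how much the model's error rises. The Python sketch below applies it to a small hypothetical `black_box` function standing in for an opaque model. It illustrates the idea on a toy case only and says nothing about feasibility at LLM scale.

```python
import random

def mean_squared_error(model, rows, targets):
    """Average squared gap between model predictions and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, rng):
    """Increase in error after shuffling one feature column across rows."""
    baseline = mean_squared_error(model, rows, targets)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return mean_squared_error(model, shuffled, targets) - baseline

def black_box(row):
    """Hypothetical opaque model: leans on feature 0, ignores feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the demo is repeatable
    rows = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
    targets = [black_box(r) for r in rows]
    for f in range(3):
        print(f, permutation_importance(black_box, rows, targets, f, rng))
```

A feature the model ignores shows zero importance, while the feature it relies on most shows the largest error increase.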
Given the slow progress on explainable or interpretable AI/ML, there is a compelling case for more regulations around LLMs to help firms guard against unforeseen or extreme scenarios, the “unknown unknowns.” The growing ubiquity of LLMs and the potential for productivity gains make outright bans on their use unrealistic. A firm’s model risk governance policies should, therefore, concentrate not so much on validating these types of models as on implementing comprehensive use and safety standards. These policies should prioritize the safe and responsible deployment of LLMs and ensure that users check the accuracy and appropriateness of the output responses. In this model governance paradigm, the independent model risk management function does not examine how LLMs work. Rather, it audits the business user’s justification and rationale for relying on an LLM for a specific task and ensures that the business units that use LLMs have safeguards in place, both around the model output and in the business process itself.
What’s Next?
ChatGPT and LLMs represent a huge leap in AI/ML technology and bring us one step closer to an artificial general intelligence. But adoption of ChatGPT and LLMs comes with important limitations and risks. Firms must first adopt new model risk governance standards like those described above before deploying LLM technology in their businesses. A good model governance policy appreciates the enormous potential of LLMs but ensures their safe and responsible use by mitigating their inherent risks.
If you liked this post, don’t forget to subscribe to Enterprising Investor.
All posts are the opinion of the author. As such, they should not be construed as investment advice, nor do the opinions expressed necessarily reflect the views of CFA Institute or the author’s employer.