With law firms reporting that generative AI is slashing the time involved in certain tasks, calls are growing for benchmarking and quality standards

Firms and corporate legal departments that apply generative AI to their day-to-day work are reporting substantial time savings, but a recent report by Stanford University researchers that tested mainstream vendors’ legal research tools has raised questions about their accuracy. Does the legal sector need quality benchmarks for generative AI, or a better understanding of the capabilities of different generative AI resources? And which large language models (LLMs) are the best fit for different use cases?

Joanna Goodman

Much online discussion of the accuracy or otherwise of generative AI tools for legal research followed the publication of two versions of a controversial study by researchers at Stanford University in California. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools questioned the accuracy of Thomson Reuters’ products Ask Practical Law and AI-Assisted Research. The report also looked at LexisNexis’ Lexis+ AI, which recently launched in the UK, and OpenAI’s GPT-4. Mike Dahn, head of Westlaw Product Management, challenged the report’s findings: ‘The results from this paper differ dramatically from our own testing and the feedback of our customers.’ He also invited the Stanford research team ‘to work together to develop and maintain state-of-the-art benchmarks across a range of legal use cases’.

In the UK, larger firms have been testing gen AI products and publishing their findings. Ashurst produced a detailed report, Vox PopulAI: Lessons from a global law firm’s exploration of Generative AI, which recorded time savings of 80% in drafting corporate filings, 59% in drafting industry/sector-specific reports, and 45% in creating first draft legal briefings. ‘Across the tools used and attempts to draft multiple types of documents, an average of 77% of post-trial survey respondents agreed or strongly agreed that usage of GenAI helped them get to a first draft quicker.’

Ashurst’s research also raised issues around accuracy and quality. It does not suggest setting quality standards because ‘quality in a legal context is multidimensional, with subjective and objective elements’. Addleshaw Goddard, which has also shared its gen AI learnings in a series of webinars, is fine-tuning its gen AI tools to improve accuracy and reduce hallucinations.

There are new activities around gen AI quality standards and benchmarking. The Legal IT Innovators Group – a membership organisation of around 90 IT leaders in UK law firms – has announced a Legal Industry AI Benchmarking Collaboration initiative, led by John Craske, chief innovation and knowledge officer at CMS. This is designed to establish benchmarks and standards for law firms to use when assessing gen AI tools in a rapidly evolving market. It will help member firms that are not in the vanguard of gen AI adoption understand which tools and strategies are producing the best results. It may then guide their decisions around investing in gen AI. It will also help consultancies involved in the project by highlighting law firms’ common priorities and concerns.

Model management

The ‘quality’ required of any technology will depend on how it is being applied. During the pandemic, Teams and Zoom became the go-to platforms for video communication. These replaced sophisticated video-conferencing systems in some offices because they are affordable, accessible and good enough. Currently, OpenAI is the go-to option for gen AI. This is partly because people have become familiar with ChatGPT, which is free. With a few exceptions, notably Robin AI’s contract review and analysis, which is built on Anthropic’s Claude 3.5 Sonnet, most legal tech gen AI products and integrations are built on OpenAI’s GPT models, as are many firms’ gen AI chatbots. Some products offer a choice of models, so that you can use GPT-3.5 Turbo or GPT-4 Turbo depending on the use case. This reflects the significant price difference between the two (GPT-3.5 Turbo is roughly 20 times cheaper than the bigger model).

The reaction of newer, more agile vendors to gen AI accuracy concerns is to address the challenge head-on. Last week, conveyancing scale-up Orbital Witness, which provides rapid AI-powered title checking for property transactions, became the first lawtech to offer a gen AI accuracy guarantee. This is underwritten by First Title Insurance and covers Orbital Witness’s residential property product, so that in the event of an error which results in a compensation claim, the law firm doing the conveyancing would not have to claim on its professional indemnity insurance.

OpenAI-backed Harvey has been adopted by many large law firms, including Ashurst, but until recently it was reticent about publicly demonstrating its legal AI platform. It has now posted a detailed product demo video on its website, accompanied only by Chopin’s Waltz in C-sharp minor, Op 64 No 2 – that is, there is no audio explanation.

The accuracy and quality of gen AI output can be improved significantly by prompt engineering (asking an LLM the right questions in the right order will generate higher-quality output). Many vendor products and law firm models create standardised prompts for specific purposes. Prompt engineering is an increasingly important and desirable skill. The Financial Times reported that chip-making equipment manufacturer ASML advertised what may be the first prompt engineering position for an in-house legal department where gen AI is already delivering substantial time savings.

AI overreach

A Relativity Fest panel on global AI regulation raised concerns about overreach and prompt engineering being used maliciously to make large language models (LLMs) produce harmful or dangerous content as part of a conversation. Retired judge Dr Victoria McCloud, having explained how skilful prompting can be used to manipulate LLMs to operate in ways that are prohibited by the EU AI Act, drew an interesting analogy with the Human Fertilisation and Embryology Act as a model for AI regulation: ‘It created a regulatory authority, but it didn’t regulate; it had broad parameters [and the scope] to resolve issues ethically as the anticipated future unravelled; and it drew a line between not stifling research while controlling Frankenstein-style experiments.’ McCloud called for light-touch regulation rather than attempting to legislate for problems as they arise.

Gen AI pick ‘n’ mix

Travers Smith, a pioneer of gen AI in legal, recently spun out its AI function into an independent AI software company, headed by its former director of legal technology, Shawn Curran. Jylo is a unique brand name created by prompting ChatGPT. It combines Travers Smith’s gen AI products Analyse – which uses LLMs to interrogate large volumes of unstructured data – and the open source YCNBot (which enables organisations to build their own ChatGPT chatbots) with a marketplace where companies can create custom gen AI products and prompts. A main advantage of Jylo is that it allows users to control the cost of gen AI deployment by offering integration with a selection of LLMs. ‘If you want to use GPT-4o for discovery, it will cost more,’ Curran explains, ‘but if you’re creating a standard confidentiality agreement, you could use Llama 3, which is 133 times cheaper.’

'Prompts should be part of the product, as they are the element that adds the most value to the process'

Shawn Curran, Travers Smith

This raises the question, who should select the model? ‘We concluded that that would be the person who is writing the prompts. Prompts should be part of the product, as they are the element that adds the most value to the process,’ Curran says.
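As a rough illustration of why the model choice might sit with the prompt author, the sketch below bundles prompts and a model into a single 'product' and compares costs. It is not a description of how Jylo works. The class and function names are hypothetical, and the cost figures are relative units derived only from the 133-times ratio Curran quotes, not real prices.

```python
from dataclasses import dataclass

# Relative cost units only, taken from the 133x ratio quoted in the article.
# These are NOT real prices; actual pricing varies by provider and over time.
RELATIVE_COST = {"llama-3": 1.0, "gpt-4o": 133.0}


@dataclass(frozen=True)
class PromptProduct:
    """Hypothetical bundle: the prompt author also picks the model."""
    name: str
    prompts: tuple[str, ...]
    model: str


def relative_cost(product: PromptProduct, n_documents: int) -> float:
    """Estimate cost in relative units, assuming one model call
    per prompt per document (a simplifying assumption)."""
    return RELATIVE_COST[product.model] * len(product.prompts) * n_documents


nda = PromptProduct(
    name="Standard confidentiality agreement",
    prompts=("Draft a standard confidentiality agreement.",),
    model="llama-3",
)
discovery = PromptProduct(
    name="Discovery review",
    prompts=("Identify documents relevant to the dispute.",),
    model="gpt-4o",
)

for product in (nda, discovery):
    print(product.name, relative_cost(product, n_documents=10))
```

Because the model is stored alongside the prompts, anyone reusing the product inherits the author's cost and quality trade-off without having to make that decision again.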

A lawyer who needed to interrogate a portfolio of leases could select or create a product which had three key prompts, then upload the documents and apply the prompts. They could then use another series of prompts to verify the output, and potentially automate the entire process and monetise the prompts. Consequently, prompt engineering becomes about creating products, rather than applying a skill. Jylo is providing a unique gen AI ‘pick ‘n’ mix’ of LLMs, prompts and chatbots, and potentially a glimpse into the future.
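The lease example can be pictured as a small pipeline: apply each extraction prompt to each document, then run a second prompt to check the answer. The following is a minimal sketch under stated assumptions, not any vendor's implementation. The function names and prompt wording are hypothetical, and the model is represented by a stub so that the example runs without access to a real LLM.

```python
from typing import Callable

# Any function that maps a prompt string to a text answer can stand in
# for the model; a real deployment would call an LLM provider here.
LLM = Callable[[str], str]


def review_leases(
    documents: dict[str, str],
    extract_prompts: list[str],
    verify_prompt: str,
    llm: LLM,
) -> dict[str, list[dict]]:
    """Apply every extraction prompt to every document, then ask the
    model to verify each answer against the same document."""
    results: dict[str, list[dict]] = {}
    for name, text in documents.items():
        findings = []
        for prompt in extract_prompts:
            answer = llm(f"{prompt}\n\nDocument:\n{text}")
            check = llm(
                f"{verify_prompt}\n\nDocument:\n{text}\n\nAnswer:\n{answer}"
            )
            findings.append({
                "prompt": prompt,
                "answer": answer,
                "verified": check.strip().upper().startswith("YES"),
            })
        results[name] = findings
    return results


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "YES" if prompt.startswith("Reply YES") else "stub answer"


# Three hypothetical 'key prompts' for a lease portfolio.
key_prompts = [
    "What is the lease term?",
    "What is the annual rent?",
    "Is there a break clause?",
]
verify = "Reply YES or NO: is the answer supported by the document?"

report = review_leases({"lease_a": "sample lease text"}, key_prompts, verify, stub_llm)
print(report["lease_a"][0])
```

Answers that fail the verification step would be the natural candidates for review by a lawyer before the output is relied on.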

In-house leads the way

Corporate legal is leading the way in adopting gen AI, which boosts productivity, enabling teams to handle more work in-house and reduce reliance on external counsel.

Relativity’s recent survey, discussed at Relativity Fest in London, revealed that some general counsel not only use AI, but require external law firms to use it too. These GCs include specific questions on their RFP (request for proposal) form. And while only 27% of respondents were currently using generative AI, 75% planned to introduce it or increase its use over the next few years. However, in-house counsel are focused on risk, particularly in respect of sensitive data. So the panel stressed the importance of knowing what data you have and understanding its associated risks, building the right data governance structure, and knowing which gen AI use cases – and LLMs – are the best fit.

Elephant in the room

Every legal and lawtech conference includes panels on multiple aspects of gen AI in legal services, but few touch on the consequences for the business of law. The popular refrain that gen AI will not replace lawyers but will free them up to concentrate on more fulfilling, higher-value work may be reassuring, but its longer-term impact is unclear. The elephant in the room is what this really means for law firms. If full-service firms have to pivot to AI-augmented services to remain competitive, will they have enough higher-value work for the same number of lawyers and business support professionals as AI eats into the bread-and-butter work that represents most of their practices? And if not, what will their lawyers do with the extra hours saved by using gen AI?