Time to turn on the legal data taps

A bizarre dividend of the Enron scandal is a collection of emails. Twenty years after the fraudulent stock-market darling collapsed, dragging a top accountancy firm down with it, half a million Enron emails posted on the web by US regulators have become an important academic resource.

This interest is not just ghoulish. The so-called Enron Corpus is one of the largest collections of real emails freely available for study. As such it is hugely valuable for training artificial intelligence programs in how human beings communicate in real life. (Obviously the vast bulk of the archive deals with more mundane matters than corporate fraud.) The database has helped teach algorithms to spot email spam and was even, reputedly, used to train the first release of Apple's home AI system Siri.

Artificial intelligence needs data just as much as natural intelligence needs oxygen. Today's AI programs do not work like their failed pre-programmed predecessors, but by learning from the real world. The way to teach a self-driving car to recognise a traffic light is not to load up data about traffic lights in a decision-tree, but to program it to detect common features from millions of examples of humans recognising traffic lights. (How to assemble a database of millions of humans recognising a traffic light? What do you think those 'I am not a robot' routines on web services are doing as a sideline?)

The same principle applies to designing systems to make legal decisions: the algorithm roams over thousands, or preferably millions, of real-world examples in search of patterns. But where can we find these detailed real-world examples? It goes without saying that lawyers are highly wary about professional privilege - not to mention the Data Protection Act.

As a result, a lack of relevant 'big data' has emerged as a major roadblock to the development of legal AI.

At the end of last year, however, we saw two significant announcements in this area.

One was the long overdue agreement by the British and Irish Legal Information Institute (BAILII) to allow bulk access to its dataset for research - albeit to just one team, to support the University of Oxford's Unlocking the Potential of AI for English Law project. Up to now BAILII's terms and conditions have forbidden the bulk downloading or 'scraping' of its diligently curated database of 400,000 cases. (It's an open secret that this goes on anyway, but the perpetrators, even when based outside the UK, don't like to shout about it.)

The breakthrough seems to have been the governance arrangements proposed by Oxford researchers in the report Building a Justice Data Infrastructure published last year with support from UK Research and Innovation.

But, excellent as it is, BAILII's database can only take you so far. By definition, it deals only with cases that end up in court - and a sub-set of mainly superior courts, at that. Obviously only a tiny fraction of legal matters culminate in a beautifully crafted judicial conclusion following a guided tour of carefully chosen authorities. Training a legal AI algorithm on this data would be roughly akin to training a self-driving car with data about F1 grands prix. Data scientists warn of the 'N = All fallacy' - the temptation to believe that the data you happen to have to hand represents the real world.

This is where the second announcement scores. Improving access to legal data is one of the explicit goals of the government-funded Lawtech Sandbox. Among the first cohort of innovators piloting the scheme is a tool to help businesses and other organisations detect emerging disputes with customers, suppliers or others before they go legal. Early warning signs might include a change in tone in email correspondence, or one party suddenly going quiet. The first prototype was tested on publicly available data, including the Enron Corpus. In the next phase, the developers will use real communications data from at least two multinational businesses. A data transfer/non-disclosure agreement has already been reached with one; the second is 'very close', the project's leader, Dr Mimi Zou, tells me.

Whether this will mark the start of a trend remains to be seen, however. The corporates in question are opening up their databases only under assurances of a strict data protection regime. Moreover, they have a clear business case for participation: an accurate dispute avoidance tool would be a boon to businesses looking to cut their legal spend. Persuading data custodians in other sectors to put in the effort may be more tricky. But if research into legal AI is to be driven by improving access to justice rather than merely corporate efficiency, a way needs to be found to turn on the data taps.