Bias and Ethics in AI-Enabled Legal Technology: Examining the Role and Impact of Human Inputs on AI-Rendered Results in Legal Matters


Peter Gronvall

Nathaniel Huber-Fliflet

Jianping Zhang

Authored on: Tuesday, March 14, 2023


Advancements in the application of data analytics and artificial intelligence technologies are increasingly influencing the way legal service providers, including law firms, technology partners, and even in-house corporate legal teams, operate in the types of large-scale legal matters that prevail today.

At unprecedented rates, corporations, law firms, and state and federal enforcement agencies are accepting and adopting the use of advanced technology solutions to help understand and engage with essentially infinite data volumes – and vastly disparate data types – that fall within the scope of today’s legal matters.  Current innovative approaches, such as workflow automation, artificial intelligence, machine learning, and algorithm-driven data analytics are enabling the discovery of the most relevant facts in these high data volume legal matters.

As the legal field embraces rapid technology innovation, a vigorous debate has emerged around what boundaries those technologies should abide by, especially as they relate to how humans and machines should interact in the ‘fact discovery’ process.

This debate’s prevailing themes include affirming the efficacy of machine training methodologies, employing defensible technology validation processes, and understanding where human intervention in an AI-enabled workflow should end and where AI-rendered results can be trusted.  These tenets generally frame the ‘machines versus humans’ debate that is now alive and well – and, some would suggest, far from being resolved – with interested and well-vested parties on both sides.  This debate is driving a vigorous academic search for two principles, which we seek to address in our research: (i) understanding the ‘right’ balance between machines and humans; and (ii) understanding how to evaluate the role of human inputs in the quality of AI-rendered results.

The first phase of our research explored what a fair and ‘right’ balance between machines and humans could be, in machine-learning scenarios, and we highlight that research first within this article.  Next, we describe our new study, which seeks to expand upon this debate to understand further the nuanced role of humans in guiding, testing, and applying results generated by machines, to outcomes, with particular focus on how those results impact legal case strategies.

I.              Technology Assisted Review in the Legal Services Domain

The wide embrace of a suite of machine learning tools, collectively called Technology Assisted Review or ‘TAR’, has been met with acclaim – but also enduring challenges – from judges, enforcement agency officials, as well as corporations and their legal counsel.

On one hand, it is essentially undisputed by now that TAR has proven to help produce substantial value for its users.  This is especially true in the ‘fact discovery’ phases of legal matters, primarily because at minimum, TAR has proven to be very good at searching massive troves of data and identifying those documents that are most important and relevant to the legal matters at hand.  TAR has also proven to be quantifiably cheaper than the brute-force workflows of yesterday, especially when compared to the costs and timelines associated with the conventional, linear approaches that lawyers have historically used to find relevant documents in discovery.

Corporate clients and their counsel worked to shift the inertia beyond traditional approaches to data review, towards solutions that deployed technology.  The result of these efforts was a successful, disruptive challenge: the increasing use of technology-assisted discovery that delivered legally reliable outcomes.  Because TAR has worked so well, it has essentially become the new ‘default setting’ in large-scale legal matters (as considered by adverse parties, enforcement agencies, and judges), and many believe that TAR is as good as or better than human attorney-based review efforts.  The widespread and growing adoption of TAR has fueled speculation that attorneys will ultimately be largely supplanted by machines in making judgment calls on what facts are ‘in’ or ‘out’ of discovery, and on how those facts are treated as dispositive to the legal matter at hand.

The controversy here, in simple terms, is that technology-rendered decisions will eventually overtake human decisions as having finality.  Some worry passionately that in AI-enabled legal technology scenarios, humans will have initial input in shaping how documents are classified but that, at the logical extreme, human judgment as to fact relevancy will be supplanted by the lone decisions that AI processes make in identifying relevant documents.  This worry is justified, as some US regulators are requiring that machine-rendered decisions dictate final decisions as to what any particular case’s relevant documents are, in preference to what lawyers would ultimately decide.

This exciting but disruptive technology creates a compelling challenge, which forces us to discuss the boundaries of how far TAR should go.  Further within this debate, some industry observers have hypothesized that an ‘artificial intelligence invasion’ could ultimately erode and maybe even bring about the ‘extinction of the legal profession’ in the expansive, multi-billion-dollar electronic discovery realm.

As far-fetched as that may sound, some voices in this space have suggested that in the data discovery process, AI, machine learning and predictive coding (essential components of TAR) would come to make final relevancy decisions in legal matters.  But here is the issue: if machine-rendered decisions are given final sway in making relevancy calls, attorneys and their clients are concerned that this could, in many scenarios, obviate the need for attorney review and judgment in finding the documents most relevant to large-scale legal matters.

As published scholars and case-experienced practitioners, we set out to advance the dialogue that seeks the right balance between TAR assisting attorneys, on one hand, and attorneys retaining their role in influencing the way AI is deployed as well as their ability to test and overturn AI-rendered results, on the other.  To reconcile this challenge, the authors of this article conducted an academic study, based on real-time client matter data, to identify and understand the boundaries of the proper ‘swing back’ of the presumption of nearly-sole reliance on TAR, to an accepted balanced regimen that factors the judgment and decision making of attorneys, working with TAR results, to render their final judgment as to what facts should and should not be relevant to legal matters, in any fact-finding exercise.

A.  Predictive Coding in TAR

Predictive coding, a form of TAR, is a data review methodology designed by humans that ultimately results in a predictive model that helps classify documents as to their relevancy to the legal matter at hand.  Predictive models classify documents based on ‘relevancy’ factors pertaining to the subject matters that attorneys want to find: classifications that are important to their advocacy of the case.  Those relevancy factors are derived by attorneys and used to ‘train’ the technology to predict what documents qualify as relating to those concepts. Those same methodologies are also deployed to screen for other important considerations, including finding privileged attorney-client communications, usually for the purpose of keeping those documents from being disclosed in compulsory-process scenarios.

Because predictive coding has proven to be effective and cost efficient, especially when fine-tuned and shaped by human intervention, it is now widely adopted and accepted by both sides of the legal ledger.  Predictive coding is rightfully presumed, at times, to be more accurate than manual (i.e., human, document-by-document) review and for the most part, studies have shown that the acceptance of its results is well founded.  In fact, many case studies have proven that predictive coding adds speed, quality and cost efficiency for parties in legal matters.  But in no instance have attorneys been willing to accept predictive coding decisions as final, beyond the final input of attorney judgment.

Over the past few years, this article’s authors found themselves at the center of inspired counter-reactions to the use of this technology, resulting in a study comprised of experiments, examining what the proper balance might be, in terms of using predictive coding calibrated and informed by human inputs.  Our remit in those studies was to explore the outer limits of ‘how far’ parties in legal matters should rely upon technology-rendered solutions, and to examine what the right combination of ‘computers and humans’ should be, in making ‘relevant evidence’ determinations.

B.   Phase One: Finding the Best Combination of Human and Machine Inputs

In a recent study, the authors conducted research across real-life legal matters that employed predictive coding techniques, in order to evaluate the effectiveness and necessity of human involvement to reach final, reliable and ‘true’ results.  The study, titled Humans Against The Machines: Reaffirming The Superiority Of Human Attorneys In Legal Document Review And Examining The Limitations Of Algorithmic Approaches To Discovery, examined the popular view that machines are more consistent and trustworthy in their document-relevancy assessments than humans, and thus that AI-rendered decisions should not be subject to final override by attorneys.  In this study, our opening hypothesis was that this ‘machines over humans’ view was untenable, given that predictive models and machines are not perfect and that attorneys rightfully should be considered qualified to make final relevancy decisions to correct machine-rendered results, when necessary.

The study evaluated the impact that manual review, overseen by subject-matter experts, has on the results of a document review powered by predictive coding.  The results demonstrated that human attorneys, when included as decision makers in the final relevancy decision making process, improved the overall precision of the document review when compared to predictive coding alone.  This finding squarely tilts against the ‘machines over humans’ narrative; it revealed that the best results in legal scenarios happen when a blend of technology-enabled solutions derived with human (i.e., attorney) intervention is paired with ultimate decision making by attorneys.
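To make the precision comparison concrete, the following is a minimal sketch of how precision might be computed for a model-only review and for a review that adds an attorney pass. The document IDs, the labels, and the attorney decision are all invented for illustration; they are not data from the study, and the function name is our own.

```python
def precision(produced, truly_relevant):
    """Fraction of produced documents that are truly relevant."""
    if not produced:
        return 0.0
    return len(produced & truly_relevant) / len(produced)

# Hypothetical ground truth: documents that are actually relevant.
truly_relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical model output: documents the predictive model flagged.
model_flagged = {"d1", "d2", "d3", "d5", "d6"}

# Hypothetical attorney pass: reviewers reject d5 as not relevant.
attorney_rejected = {"d5"}
after_attorney_review = model_flagged - attorney_rejected

model_precision = precision(model_flagged, truly_relevant)
reviewed_precision = precision(after_attorney_review, truly_relevant)
print(f"model only: {model_precision:.2f}, with attorney review: {reviewed_precision:.2f}")
```

In this toy case, removing a single false positive lifts precision from 0.60 to 0.75. Real matters would measure this on a validated sample rather than a known ground truth.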

We concluded that study by considering the significant risks inherent in relying on predictive coding alone to drive high-quality, legally defensible document reviews, purposefully to seek the right balance of humans and technology solutions in data discovery exercises.  Along the way, we believe that we refined the approach to finding a solution equilibrium: one that exploits technology to get through data quickly, but one that also protects against unwanted data disclosures, including preventing the disclosure of documents containing attorney-client privilege and work-product communications.

Through our evaluation of technology-powered approaches to legal matters, implicating predictive coding’s capabilities, limitations, and drawbacks, we arrived at the conclusion that the much talked-about ‘rise of computer over humans’ is really a false alarm.  We determined that the use of predictive coding in document review, while itself a powerful tool in legal matters, ultimately presents an important potential boundary that must be identified of just how far technology should carry us.   And of equal importance, our study has convincingly impressed upon our clients – and upon us – that human intervention has a proper place in setting and governing the data review process to mitigate inadvertent disclosures or compelled disclosures, when machines are at work.

This study’s conclusion now leads us, importantly, to what we believe is the natural evolution of our research: understanding machine bias.  With humans creating classification-focused algorithms, guiding how machines should interrogate data, and interpreting the results of this iterative process, we now ask: Is there a new need to evaluate, scrutinize and establish the efficacy and merit of machine-rendered results to protect against inherent bias, skewed results, and vulnerability to legal challenges?

II.           Phase Two: Understanding the Bias Humans Impose on TAR-Rendered Results

Arriving at an answer to this new fundamental question – Do humans impose bias on TAR-rendered results? – is not easy, nor is it linear.  There is no current playbook or body of precedential authority to provide guidance as to how far technologies should prevail upon the legal process and, it follows, where the proper role of attorneys lies in handling machine-rendered results.

But so far, fundamental elements of the practice of law and advocacy remain widely accepted: humans build algorithms based upon their views as to which types of documents are important to their cases.  And thus, humans must remain instrumentally involved in influencing the inputs that algorithms use to render results.  And they must also remain critical to evaluating those results, and have a role in deciding which documents are important to any legal matter.

This human intervention most often plays out in data review scenarios, where attorneys are tasked with evaluating how TAR-rendered ‘responsiveness’ decisions are correlated with the ultimate determinations as to which documents are produced to requesting parties.  Finally, and maybe most importantly, this evaluation of the proper role of technology in legal matters comes down to the preservation of the fundamental principle of attorney advocacy: lawyers must preserve their critical role in engaging with results produced by technology tools, to test for accuracy, truth and the threat of bias.

A.  Training AI Models with Human Guidance

The role of humans in helping to guide a supervised machine-learning process is arguably more impactful than the plain outputs of a machine-rendered process.  Machine learning algorithms are designed by humans, trained using labeled training data, and grounded in attorneys’ inputs on what types of documents are important to the case at hand.  In these scenarios, the algorithm trained by humans has an important impact on the successful application of machine learning to the matter at hand.  The potentially larger impact of this technology is derived from the selection of training data in the development of the classification algorithm.  From a supervised learning point of view, the algorithm assesses the human-labeled training data to derive the responsiveness features that are present in the dataset.
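As a rough illustration of how a supervised process derives responsiveness features from human-labeled data, consider the sketch below. It is a deliberately simplified stand-in (smoothed per-term log-odds over a handful of invented documents), not the algorithm used in any particular TAR product or in our research. The function names and training texts are hypothetical.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Derive per-term weights (smoothed log-odds) from attorney-labeled text."""
    counts = {True: Counter(), False: Counter()}
    for text, responsive in labeled_docs:
        counts[responsive].update(text.lower().split())
    vocab = set(counts[True]) | set(counts[False])
    total_pos = sum(counts[True].values())
    total_neg = sum(counts[False].values())
    weights = {}
    for term in vocab:
        # Add-one smoothing keeps unseen terms from producing log(0).
        p_pos = (counts[True][term] + 1) / (total_pos + len(vocab))
        p_neg = (counts[False][term] + 1) / (total_neg + len(vocab))
        weights[term] = math.log(p_pos / p_neg)
    return weights

def score(weights, text):
    """Sum the learned weights of the terms in a new document."""
    return sum(weights.get(term, 0.0) for term in text.lower().split())

# Hypothetical attorney-labeled training examples (text, is_responsive).
training = [
    ("pricing agreement with competitor", True),
    ("discuss pricing strategy call", True),
    ("lunch menu for friday", False),
    ("office party friday", False),
]
weights = train(training)
print(score(weights, "pricing call with competitor") > 0)  # leans responsive
print(score(weights, "friday lunch") < 0)                  # leans non-responsive
```

Every weight here is a direct function of which documents were chosen and how they were labeled, which is why the selection of training data can matter more than the algorithm itself.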

The inspiration driving the next phase of our research rests upon the following truth:  It is proven that iteration and human intervention in building predictive models, in some instances, could result in biased and error-prone results.  In current technology-assisted scenarios, lawyers and their technology partners must be present to confront – and protect against – this human-imposed computer error factor. They must endeavor to design well-constructed and unbiased processes – to achieve results as close to the defined classifications that the machine learning process was intended to achieve.

This leads to the important risk question which our next study undertakes: While the impact of the human-imposed computer error factor might, in most instances, be inconsequential in how an algorithm identifies responsive documents in a legal matter, does an algorithm, however well designed, still stand to be challenged for its results?

With this in mind, the authors observed that a critical new aversion to risk in this realm is emerging.  In practical terms, and as an alternative to more automated, ‘black box’ applications of machine learning, lawyers and their clients are increasingly comfortable embracing the notion that they would rather have an understandable and transparent technology-based process with an accepted error coefficient.  This observation has led us to believe that a full understanding of a bias coefficient is preferable to operating with a set of technology-rendered results that appear to meet the performance benchmark but are difficult or impossible to explain.

Thus, the new imperative when adopting an AI-based approach that minimizes legal risk is to achieve an end result that is transparent and explainable. And that is our critical remit, as technologists and practitioners: to meet the underlying legal goals of case matters while also exploring and leveraging the emerging fields of machine learning within the domains of Explainable AI and Responsible AI.

B.   AI Solutions: Explainable and Responsible

Explainable AI systems are inherently designed to produce results, actions and decisions that are understandable from a human/end-user perspective.  For example, during typical AI legal document review, documents can be identified as ‘responsive’ or otherwise important to the legal matter at hand, if one or more of the text snippets in any document are deemed responsive.  In those scenarios, an ‘explainable’ predictive model would be deployed to tag ‘responsive’ snippets within the documents as explanations of why the document was flagged as responsive, so attorneys can evaluate – and opine one way or the other – on the text that the model used for its document classification decision.
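The snippet-tagging idea described above can be sketched roughly as follows. The term weights stand in for a trained model and are invented, as are the threshold, the sample document, and the function names; a production system would use a real model and a more careful snippet splitter.

```python
import re

# Hypothetical term weights standing in for a trained predictive model.
TERM_WEIGHTS = {"pricing": 1.0, "competitor": 1.2, "agreement": 0.8, "lunch": -0.9}

def snippet_score(snippet):
    """Score one snippet by summing the weights of its known terms."""
    words = re.findall(r"[a-z]+", snippet.lower())
    return sum(TERM_WEIGHTS.get(word, 0.0) for word in words)

def explain(document, threshold=1.5):
    """Return (is_responsive, snippets that triggered the decision)."""
    # Split on sentence-ending punctuation followed by whitespace.
    snippets = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    flagged = [s for s in snippets if snippet_score(s) >= threshold]
    # A document is responsive if at least one snippet is responsive.
    return bool(flagged), flagged

doc = "Lunch is at noon. We reached a pricing agreement with our competitor. See you then."
is_responsive, reasons = explain(doc)
print(is_responsive, reasons)
```

Returning the flagged snippets alongside the decision gives the reviewing attorney something specific to accept or overturn, rather than a bare score.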

Responsible AI is an additional discipline. The field of Responsible AI studies and evaluates relevant topics related to building ethical, accountable, and transparent AI systems, with the prevailing notion that human intervention around ‘machine results’ always retains a fundamental role in the derivation of said results.  The Responsible AI field assesses the impact of biased data and algorithms in deriving explainable decisions. Responsible AI has an emerging presence in the broader field of AI, and it has promise to help inform the establishment of best practices and socially-conscious applications of AI technology in the legal technology realm, as well as others where AI is proliferating.

The complexity of AI systems and the emerging focus on their ethical impacts are drawing attention from academics, lawyers and legal technology experts engaged in the discussion on how to apply AI to legal matters as well as day-to-day scenarios, within corporate and organizational environments.

As an example, the robust debate around autonomous ‘devices’ (i.e., autonomous vehicles and other connected devices in the Internet of Things schema), sets out to understand and manage certain risks inherent in autonomous devices generating data that is later evaluated by AI.  The underlying questions, in both academic and legal circles, are deeply inquisitive on how critical decisions by devices are made, which calls into question how those devices are programmed to operate.

In this realm, we see the need to address ‘AI output’ bias to ensure that parties are indeed granted accurate results, ones that correctly identify documents that fall within the purview of their compulsory-process mandates.  This challenge has given the legal realm itself an important decision to make: whether it can safely rely on AI-derived solutions to identify the most important documents in their legal matters.

To further our goal of bringing transparency into what we could consider a properly balanced human role of applying AI to the legal domain, this article’s authors are pursuing a new research component: to add an academic-based intelligence to how AI-generated results should be interpreted and trusted.

C.  Deepening Our Understanding of Algorithms that May Produce Biased Results

This research has given us an exciting new remit: to achieve a clear understanding of how human-designed algorithms impact machine learning decisions.  As such, we are evaluating the performance of various machine learning algorithms, and we know that it is not necessarily easy to evaluate how different algorithms produce biases.  Through our research, we have found that biases can be impacted by (i) selected training data, (ii) learning algorithms using certain parameter settings, and (iii) the variety of features used when building the underlying models.

We believe training data has the greatest impact on the results of a supervised machine learning process and are encouraged that evaluating this hypothesis is relatively straightforward.  Additionally, and probably most importantly, we believe that lawyers may have the biggest impact on the process through how they prepare training and validation data.

We will examine the ways in which human judgment and inputs can ensure that AI outputs produce ethical, responsible, and accurate results.  Human judgment paired with machine-rendered inputs is critical to legal advocacy.  No legal client wants to subject itself – or its legal position – to decisions made about its documents purely by machines.  While we appreciate the power and intelligence of algorithmic assistance, we are focused on finding the right balance, with those results ultimately subject to attorney decisions.

As a part of this ongoing practical and academic examination, we remain committed to studying the efficacy and attributes of technology-enhanced results, based on training data, and evaluating how those results might be relied upon – or in some instances questioned for bias – depending upon what that data holds.  

An important element of this research focuses on the underlying technology techniques as well as the data set itself: this becomes an exercise in scrutinizing the quality of the inputs that go into AI workflows, to search for any attributes that would require us to question even unintentional, inherent quality concerns with the training process.

For example, we would explore whether using training data from a particular classification of data ‘custodian’ (male or female, for example) produces biased results that detract from overall truth-seeking objectives.  We will also filter for certain types of ‘issue focuses’ within training data sets, to see whether the resulting model is biased toward a given subset of issues.
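One simple way such a custodian-level comparison could be framed is to measure recall separately for each custodian group and look at the gap. The sketch below assumes this framing; it is not the study's actual methodology, and the group names, records, and function name are all hypothetical.

```python
from collections import defaultdict

def recall_by_group(records):
    """records: iterable of (group, truly_responsive, predicted_responsive)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, truth, predicted in records:
        if truth:
            totals[group] += 1
            if predicted:
                hits[group] += 1
    # Recall per group: responsive documents found / responsive documents present.
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical validation records for two custodian groups.
records = [
    ("group_a", True, True), ("group_a", True, True),
    ("group_a", True, True), ("group_a", True, False),
    ("group_b", True, True), ("group_b", True, False),
    ("group_b", True, False), ("group_b", True, False),
    ("group_a", False, False), ("group_b", False, True),
]
recalls = recall_by_group(records)
gap = max(recalls.values()) - min(recalls.values())
print(recalls, f"recall gap: {gap:.2f}")
```

A large gap would suggest that the model finds responsive documents far more reliably for one group than another, which is the kind of signal that would prompt a closer look at how the training set was assembled.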

For the final component of our study, we will evaluate the impact on bias that might be borne from the document labels themselves.  We will set out to examine the labeling of responsive documents from key selected custodians to test the model’s bias.

Underlying all of this is a need to test and define the impact of human involvement in building search algorithms.  It is important to establish a very basic principle: we are only embarking on the deployment – and scrutiny – of technology assisted review with the ultimate goal of discovering truth and accuracy in the facts that underpin legal matters.

As such, we focus on the proper techniques in (i) preparing training and validation data, (ii) selecting algorithms and methodologies, including active learning and other available approaches, (iii) choosing search features, including the selection of parameters of an algorithm, (iv) determining cut-off scores and their implications, and (v) manually reviewing documents to substantiate our research.
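The implications of a cut-off score, item (iv) above, can be illustrated with a small sketch that computes precision and recall at several cut-offs. The scores and labels are invented for illustration, and the function name is our own.

```python
def precision_recall_at(scored_docs, cutoff):
    """scored_docs: list of (model_score, truly_responsive) pairs."""
    tp = sum(1 for s, truth in scored_docs if s >= cutoff and truth)
    fp = sum(1 for s, truth in scored_docs if s >= cutoff and not truth)
    fn = sum(1 for s, truth in scored_docs if s < cutoff and truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical validation sample: model score and attorney-confirmed label.
scored = [
    (0.95, True), (0.85, True), (0.70, False), (0.60, True),
    (0.40, False), (0.20, True), (0.10, False),
]
for cutoff in (0.8, 0.5, 0.15):
    p, r = precision_recall_at(scored, cutoff)
    print(f"cutoff={cutoff:.2f} precision={p:.2f} recall={r:.2f}")
```

Lowering the cut-off in this toy sample captures more responsive documents (recall rises from 0.50 to 1.00) at the cost of producing more non-responsive ones (precision falls from 1.00 to about 0.67). Where to set the cut-off is therefore a human judgment about acceptable risk, not a purely technical setting.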

III.        Impact of the Proposed Study

The thrust of this research is highly consequential for the practice of law and the administration and achievement of truth and justice.  Corporate clients and their outside counsel are deeply invested in the inquiries, examinations, and results of this study.  It is essential for the legal community to know how best to harness technology to achieve speed, cost efficiency and the finding of ‘truth’ in legal scenarios.  And we hope that this study will be an invaluable contribution to the fascinating integration of AI into legal services, and an important check on its outer limits.

Throughout the next phase of our study we will explore and describe the ways in which bias, mechanics, and vigilance on AI outputs can be realistically evaluated.  We will focus on keeping the technology accessible and reliable, within what we see as important new boundaries, while striving for the discovery of truthful, equitable and defensible results.