Finding the needle in the digital Multilingual haystack
If you are a Compliance Officer, Litigator or Company Executive who works with large populations of foreign language documents, or you are contemplating a merger with a multi-national company that deals with these types of documents, this article is for you. This is not a marketing publication. It delves into the details of how your organization can benefit from a technology that until now has been elusive, and that often fails when organizations try to use it to maintain control of, or reduce, the population of documents within their infrastructures. At the end of this article you will also learn how you can utilize our services to establish such a solution within your organization.
Whether you manage your organization’s Ediscovery needs, are a litigator working with multi-national corporations or are a Compliance Officer, you commonly work with multilingual document collections. If you are an executive who needs to know everything about your organization, you need a triage strategy that helps you get to the right information as soon as possible. If the document count is over 50,000 to 100,000, you typically employ native speaking reviewers to perform a linear, one by one review of documents, utilize various search mechanisms to help you in this endeavor, or both. What you may find is that most of the documents reviewed by these expensive reviewers are irrelevant or require an expert to review. If the population includes documents in 3 or more languages, then the task becomes even more difficult!
There is a better solution: one that, if used wisely, can benefit your organization and save time, money and a great deal of headache. I am proposing that in these document populations the first thing you need to do is eliminate non-relevant documents, and if they are in a foreign language you need to see an accurate translation of each document. In this article you will learn in detail how to improve the quality of these translations using machines, at a cost hundreds of times lower than human translation and naturally much faster.
With the advent of new machine translation (MT) technologies comes the challenge of proving their efficacy in various industries. Historically, MT has been looked at not only as inferior but as something to avoid. Unfortunately, the stigma that comes with this technology is not necessarily far from the truth. Adding to that, the incorrect methods used by various vendors in presenting its capabilities have led to its decline in active use across most industries. The general feeling is “if we can human translate them, why use an inferior method?” and that is true for the most part, except that human translation is very expensive, especially when the subject matter is more than a few hundred documents. So is there really a compromise? Is there a point where we can rely on MT to complement existing human translations?
The goal of this article is to look under the hood of these technologies and provide a defensible argument for how MT can be supercharged with human translations. Human beings’ innate ability to analyze content provides an opportunity to aid some of these machine learning technologies. Transferring that human analytical information into a training model for these technologies can provide dramatically improved translation results.
Machine Translation technologies are based on dictionaries, translation memories and some rules based grammar that differs from one software solution to another. Newer technologies that utilize statistical analysis and mathematical algorithms to construct these rules have been available for the past several years, but unfortunately individuals who have the core competencies to utilize these technologies are few and far between. On top of that, these software solutions are not by themselves the whole solution. They are just one part of a larger process that entails understanding language translation and how to utilize various aspects of each language and the features of each software solution.
I have personally witnessed most if not all of the various technologies utilized in MT and, about 5 years ago, developed a methodology that has proven itself in real life situations as well. Below is a link to a case study on a regulatory matter that I worked on.
Regulatory Authority Case Study
If followed correctly, these instructions can turn machine translated documents into documents with minimal post editing requirements, at a cost hundreds of times lower than human translation. They will also more closely resemble their human translated counterparts, with proper sentence flow and grammatical accuracy far beyond that of raw machine translated documents. I have referred to this methodology as “Enhanced Machine Translation”: still not human translation, but much improved from where we have been until now.
To understand the nuances of language translation we first must standardize our understanding of the simplest components within most if not all languages. I have provided a summary of what this may look like below.
- Standard/Expanded Dictionaries
- Dimensions of a word’s definition in Context
- Stereotypical description of characteristics
- Between concepts, attributes and definitions
- Part of Speech / Grammar Rules
- Common understanding based on existing document examples
Simply accepting that this base of understanding is common amongst most if not all languages is important, since the model we will build on makes assumptions that these building blocks will provide a solid foundation for any solution that we propose.
Furthermore, familiarity with various classes of technologies available is also important, with a clear understanding of each technology solution’s pros and cons. I have included a basic summary below.
- Basic (Linear) rule based Tools
- Online Tools (Google, Microsoft, etc.)
- Statistical Tools
- Tools combining the best of both worlds of rules based and statistical analysis
Linear Dictionaries & Translation Memories
- Ability to understand the form of word (noun, verb, etc.) in a dictionary
- One to one relationship between words/phrases in translation memories
- Fully customizable based on language
- Inability to formulate correct sentence structure
- Ambiguous results, often not understandable
- Usually a waste of resources in most use cases if relied on exclusively
Statistical Machine Translation
- Ability to understand co-occurrence of words and building an algorithm to use as reference
- Capable of comparing sentence structures based on examples given and further building on the algorithm
- Can be designed to be case-centric
- Words are not numbers
- No understanding of form of words
- Results could be similar to some concept searching tools that often fall off the cliff if relied on too much
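As a rough illustration of the co-occurrence idea listed above, the following Python sketch counts how often a source word and an English word appear in the same sentence pair. It is a minimal sketch under stated assumptions, not how any particular translation engine works: the function name `cooccurrence_counts` and the German/English toy sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentence_pairs):
    """Count how often each (source word, target word) pair shares a sentence pair."""
    counts = Counter()
    for source, target in sentence_pairs:
        # Use sets so a word repeated inside one sentence is counted once.
        source_words = set(source.lower().split())
        target_words = set(target.lower().split())
        for word_pair in product(source_words, target_words):
            counts[word_pair] += 1
    return counts


# Hypothetical toy data: (source sentence, human English translation).
pairs = [
    ("das haus ist klein", "the house is small"),
    ("das haus ist alt", "the house is old"),
    ("das buch ist alt", "the book is old"),
]

counts = cooccurrence_counts(pairs)
print(counts[("haus", "house")])  # seen together in two sentence pairs
print(counts[("haus", "book")])   # never seen together
print(counts[("haus", "the")])    # also seen together in two sentence pairs
```

Note that "haus" co-occurs with "the" exactly as often as with "house" in this toy data. Raw counts carry no notion of word form, which is one way to read the "words are not numbers" limitation above.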
Now that we understand what is available, it is crucial to build a model and process that takes advantage of the benefits of the various technologies while minimizing their disadvantages. In order to enhance the capabilities of any and all of these solutions, it is important to understand that machines and machine learning by themselves cannot be the only mechanism we build our processes on. This is where human translations come into the picture. If there was some way to utilize the natural ability of human translators to analyze content and build out a foundation for our solutions, would we be able to improve on the resulting translations? The answer is a resounding yes!
BabelQwest: A combination of tools designed to assist in Enhancing Quality of MT
To understand how we would accomplish this, we need to review some of the machine based concept analysis terminologies first. In a nutshell these definitions and solutions are what we have actually based our solutions on. I have made reference to some of the most important of these definitions below. I have also enhanced these definitions with how as linguists and technologists we will utilize them in building out the “Enhanced Machine Translation” (EMT for short) solutions.
- Classification: Gather a select representative set of the documents from the existing document corpus that represent the majority of subject matters to be analyzed
- Clustering: Build out documents selected in the classification stage to find similar documents that match the cluster definitions and algorithms of the representative documents
- Summarization: Select key sections of these documents as keywords, phrases and summaries
- N-Grams: N-Grams are the basic co-occurrences of multiple words within any context. We will build these N-Grams from the summarization stage earlier and create a spreadsheet with one line for each N-Gram and its raw machine translated counterpart. The spreadsheet is built into a voting worksheet that allows human translators to analyze each line and provide feedback as to the correct translations, and even whether certain captured N-Grams should be part of the final training seed data or not. This seed data will fine tune the algorithms built out in the next stage down to the context level, with human input. A basic depiction of this spreadsheet is shown below.
- Simultaneously human translate the source documents that generated these N-Grams. The human translation stage will build out a number of document pairs with the original content in the original language in one document and the human translated English version in another document. These will be imported into a statistical and analytical model to build the basic algorithms. By incorporating these human translated documents into the statistical translation engine training, the engine will discover word co-occurrences and their relations to the sentences they appear in as well as discovering variations of terms as they appear in different sentences. They will be further fine-tuned with the results of the N-Gram extraction and translation performed by human translators.
- Define and/or extract key names and titles of key individuals. This stage is crucial and usually the simplest information to gather, since most if not all parties involved already have references in email addresses, company org charts, etc. that can be gathered easily.
- Start training process of translation engines from the results of the steps above (multilevel and conditioned on volume and type of documents)
- Once a basic training model has been built, we would test machine translate the original representative documents and compare the results with their human translated counterparts. This stage can be accomplished with fewer than one hundred documents to prove the efficacy of this process. This is why we refer to this stage as the “Pilot” stage.
- Repeat the same steps with a larger subset of documents to build a larger training model and to prove the overall process is fruitful and can be utilized to machine translate the entire document corpus. We refer to this stage as the “Proof of Concept” stage and it is the final stage. We would then start staging the entirety of the documents subject to this process in a “Batch Process” stage.
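The N-Gram voting worksheet described in the steps above can be sketched roughly as follows. This is only an illustration under assumptions: the column names, the `min_count` threshold, the whitespace tokenization and the sample German phrases are all hypothetical, and the `raw_mt` dictionary stands in for whatever a real engine would return.

```python
import csv
import io
from collections import Counter

# Hypothetical worksheet columns; the last two are left blank for translators.
FIELDNAMES = ["ngram", "count", "raw_mt", "human_translation", "include_in_seed"]


def extract_ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_voting_rows(summaries, raw_mt, n=2, min_count=2):
    """Build one worksheet row per N-Gram that occurs at least min_count times."""
    counts = Counter()
    for summary in summaries:
        counts.update(extract_ngrams(summary, n))
    rows = []
    for ngram, count in counts.most_common():
        if count < min_count:
            continue
        rows.append({
            "ngram": ngram,
            "count": count,
            "raw_mt": raw_mt.get(ngram, ""),  # raw machine translation, if any
            "human_translation": "",          # filled in by the translator
            "include_in_seed": "",            # translator's yes/no vote
        })
    return rows


def write_worksheet(rows, handle):
    """Write the rows as CSV so they can be opened as a spreadsheet."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical summaries and raw MT output, for illustration only.
summaries = ["der vertrag wurde beendet", "der vertrag wurde erneuert"]
raw_mt = {"der vertrag": "the contract", "vertrag wurde": "contract became"}

rows = build_voting_rows(summaries, raw_mt)
buffer = io.StringIO()
write_worksheet(rows, buffer)
print(buffer.getvalue())
```

The two empty columns are where human judgment enters: translators correct the raw output and vote on whether each N-Gram belongs in the training seed data.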
In summary, we are building a foundation based on human intellect and analytical abilities to perform the final translations. To use the analogy of a large building, the representative documents and their human translated counterparts (pairs) serve as the concrete foundation and steel beams, the N-Grams serve as the building blocks in between the steel beams, and the key names and titles of individuals serve as the facade of the building.
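The Pilot stage comparison between machine translated documents and their human translated counterparts could be scored in many ways. The sketch below uses word-level similarity from Python's standard `difflib` module purely as a crude stand-in. The article does not name a metric, so the choice of `SequenceMatcher`, the function names and the sample sentences are all assumptions.

```python
from difflib import SequenceMatcher


def token_similarity(machine_text, human_text):
    """Return a 0.0 to 1.0 similarity between two translations, compared word by word."""
    machine_words = machine_text.lower().split()
    human_words = human_text.lower().split()
    matcher = SequenceMatcher(None, machine_words, human_words, autojunk=False)
    return matcher.ratio()


def pilot_average(pairs):
    """Average the similarity over (machine translation, human translation) pairs."""
    scores = [token_similarity(machine, human) for machine, human in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical pilot pairs, for illustration only.
pilot_pairs = [
    ("the contract was terminated", "the contract was terminated"),
    ("contract became ended the", "the contract was terminated"),
]

for machine, human in pilot_pairs:
    print(round(token_similarity(machine, human), 2), "|", machine)
print("average:", round(pilot_average(pilot_pairs), 2))
```

Running the same scoring before and after training gives a simple, repeatable way to see whether the trained engine moves closer to the human translations.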
Naturally we are not looking to replace human translation completely, and in cases where certified human translations are necessary (Regulatory compliance, court submitted documents, etc.) we will still rely heavily on this aspect of the solution. Even so, the overall time and expense to complete a large-scale translation project is reduced by hundreds of times. The following chart depicts the ROI of a case on a time scale to help understand the impact such a process can have.
This process has additional benefits as well. Imagine for a moment a document production with over 2 million Korean language documents that were produced over a long time scale and from various locations across the world. Your organization has a choice of either reviewing every single document and classifying them into various categories utilizing native Korean reviewers, or utilizing an Enhanced Machine Translation process to allow a larger contingent of English speaking reviewers to search, eliminate non-relevant documents and classify the remainder.
One industry in which this solution offers immense benefits is the Electronic Discovery & litigation support industry, where the majority of attorneys who are experts in various fields are English speaking. By utilizing these resources along with elaborate searching mechanisms (Boolean, Stemming, Concept Search, etc.) in English, they can quickly reduce the population of documents. On the other hand, if the law firm relied only on native speaking human reviewers, a crew of 10 expert attorney reviewers, each reviewing 50 documents per hour (4,000 documents per day for the crew on an 8-hour shift), would take 500 working days to complete the review of the 2 million documents above, with each charging hourly rates that can add up very quickly.
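The review-time arithmetic above can be checked with a few lines of Python. The 2 million document figure is taken from the Korean production example, and the function names are hypothetical. No hourly rate is assumed, since the article does not give one.

```python
def linear_review_days(total_documents, reviewers, docs_per_hour, hours_per_day):
    """Working days needed for a crew to review every document one by one."""
    docs_per_day = reviewers * docs_per_hour * hours_per_day
    return total_documents / docs_per_day


def reviewer_hours(days, reviewers, hours_per_day):
    """Total billable reviewer hours over the whole review."""
    return days * reviewers * hours_per_day


days = linear_review_days(
    total_documents=2_000_000, reviewers=10, docs_per_hour=50, hours_per_day=8
)
hours = reviewer_hours(days, reviewers=10, hours_per_day=8)

print(days)   # 500.0 working days
print(hours)  # 40000.0 reviewer hours
```

Multiplying the 40,000 reviewer hours by whatever hourly rate applies gives a quick estimate of the cost of a purely linear review.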
We have constructed a chart from data gathered over the past 15 years of performing this type of work for some of the largest law firms around the world. It shows the impact that a proper document reduction or classification strategy may have at every stage of their litigation. Please note the bars start from the bottom and run to the top, with MT being the brown shaded area.
The difference is stark and if proper care is not given to implementation it often prevents organizations from knowing the content of documents within their control or supervision. This becomes a real issue with Compliance Officers that must rely on knowing every communication that occurs or has occurred within their organization at any given time.
We have been providing electronic discovery, managed services and consulting to AMLAW100 firms and large organizations throughout the technology, pharmaceutical, manufacturing and other industries for almost 2 decades. We can provide your organization with the proper know-how of establishing this solution internally as well as provide EMT services on a contractual basis. Since we work with many law firms and deal with many documents that are confidential and mostly court protected, we can provide assurances as to the non-disclosure of any information through contractual agreements. We are also looking to expand our know-how in this core competency and are exploring partnering up with organizations that can help us scale up our capabilities into newer industries.
In summary, we are a smaller organization with a niche that can benefit the larger organizations.
- iQwest is looking to scale up our capabilities with the right partners
- Company vision is to expand our know-how into new industries/verticals
- Training of internal resources can be accomplished in 6 months with minimal capital investment required
- We currently have the capability to provide close to 1 million translations per week
- Expanding processing capabilities to Zurich, Frankfurt and Hong Kong
Mr. Pete Afrasiabi, the President of iQwest, has spent almost 3 decades integrating technology assisted business processes into organizations, including 18 years in the litigation support industry. He has been involved with projects involving MT (over 100 million documents processed), Managed Services and Ediscovery since the inception of the company, as well as the deployment of technology solutions (CRM, Email, Infrastructure, etc.) across large enterprises prior to that. He has a deep knowledge of business processes and project management, and extensive experience working with C-Level executives.
iQwest Information Technologies, Inc.