Calais installed base: 7,000 developers, 2,000,000 pieces of content processed per day
The Calais Initiative (Calais) comprises several tools for processing text, but the core product is a Natural Language Processing (NLP) engine. When presented with a body of text, the Calais Web service returns the "named entities" it discovers within the document (people, companies, places, and other key terms, each assigned to a category), along with facts and events. The relationships between these items are also identified and embedded in the results. Essentially, the results are the semantic metadata of the document and can be thought of as the document's "knowledge content," which can be published and made available for searching and navigation.
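A minimal sketch of how a client might consume such results. The JSON shape below is illustrative only, not the actual Calais response schema; it shows the general structure of this kind of output (entities, events, and the references that tie them together) and how an application could resolve those references into usable knowledge content:

```python
import json

# Illustrative response shape (NOT the real Calais schema): entities
# with ids, and an event that refers to entities by id.
sample_response = json.dumps({
    "entities": [
        {"id": "e1", "type": "Company", "name": "Thomson Reuters"},
        {"id": "e2", "type": "Person", "name": "Jane Smith"},
    ],
    "events": [
        {"type": "PersonCareer", "person": "e2", "company": "e1"},
    ],
})

def knowledge_content(raw: str) -> dict:
    """Index entities by id and resolve event references to names."""
    doc = json.loads(raw)
    entities = {e["id"]: e for e in doc["entities"]}
    resolved = [
        {
            "type": ev["type"],
            "person": entities[ev["person"]]["name"],
            "company": entities[ev["company"]]["name"],
        }
        for ev in doc["events"]
    ]
    return {"entities": entities, "events": resolved}

result = knowledge_content(sample_response)
print(result["events"][0])
# {'type': 'PersonCareer', 'person': 'Jane Smith', 'company': 'Thomson Reuters'}
```

The point is that the output is structured data about the document, not the document itself, so it can be indexed, combined with other extractions, and queried.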
On its own, and applied to one or two small, short documents, this might not seem terribly valuable. But deployed on the Web and made available as a free service, Calais is in a position to process massive amounts of data (text, quantitative, graphic, etc.) and extract their knowledge content. Once this task is complete, this content can be searched individually or combined with other similar content and searched in a larger context. This larger context can be based on other Web content, proprietary Thomson Reuters content, a combination of the two or the context of select data sources that may address a specific area of interest.
Ultimately, Calais's goal is to be the world's best tool for extracting the structure of any kind of content: recognizing its type, the concepts it contains, and the relationships among them, and doing so not just within a single file, but across a span of files that could be as large as the Web itself.
Demand from large organizations, including well-established publishers, has grown at an unexpectedly high rate. This has led Thomson Reuters to introduce three contract-based versions of Calais in addition to the original free service:
Calais Professional - same as the free service but now backed by an SLA and with higher transaction limits.
Calais Professional for Publishers - Calais Professional tailored to meet the needs of large scale publishers and tied to an annual contract.
ClearForest On-Premise Solutions - ClearForest is the original name of the technology that makes Calais work. Now that it's available as a stand-alone application, enterprises will be able to closely tailor the service to their needs, ensure the privacy of their proprietary content, and also have access to what's under the hood for even further customization.
Thomson Reuters itself is another key differentiator: the fact that Calais is sponsored by a global information giant suggests that this entrant will be with us for a long time. Furthermore, at this time Calais is in the final stages of testing its "infinite scalability" initiative (based on cloud computing), designed to address growth in demand and spikes in utilization.
Another distinguishing characteristic is the rate at which the service has been adopted (the fact that it’s free is worth repeating). The net effect has been to discard the original projections for usage because demand has so vastly exceeded expectations. Note that until very recently, demand for Calais has existed almost entirely outside of any Thomson Reuters media property. This state of affairs is changing rapidly, with internal inquiries arriving with greater frequency.
Deploying Calais against the vast, professionally developed and controlled content in the Thomson Reuters empire would be a remarkable step in the company’s evolution. After 150 years as a traditional news wire service and publisher, Thomson Reuters’ content could quickly become something not yet fully defined, but possibly far more powerful and useful than what traditional publishers have offered before.
Six/Twelve Month Plans:
In January ’09, Calais is scheduled to launch Release 4, which will open the door to the world of “Linked Data,” a critical step toward fulfilling the promise of the Semantic Web. Essentially, URIs (Uniform Resource Identifiers) allow for the linking of individual data elements, a concept that goes much further than linking containers like files, pages, documents, or databases as we’re accustomed to on the WWW. The Semantic Web term for each pointer that leads to a datum is “dereferenceable URI”.
Wikipedia does a nice job of explaining references and their consequent dereferencing by using house addresses and houses. In this case, a house address is the reference, or pointer. Using this pointer and finding the actual house is the same as dereferencing the address.
In Calais’s case, after extracting the entities (e.g., people, places, companies, etc.) from your content you could then link to (or retrieve for processing by an application) relevant data on DBpedia, The CIA World Fact Book, Freebase, or a rapidly growing number of other compatible data sources. If you’re a talented content producer, the additional leverage that comes from linking to these “external” data could make your offering substantially more useful and in turn, much more valuable.
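The linking step can be sketched in a few lines. The DBpedia URI pattern below (`http://dbpedia.org/resource/...`) is real, but the one-line name-to-URI function is a stand-in for the disambiguation a production entity linker would actually perform:

```python
# Toy linking step: map extracted entity names to Linked Data URIs.
# A real linker must disambiguate (e.g., "Calais" the initiative vs.
# "Calais" the French city); this sketch assumes names map cleanly.
def to_dbpedia_uri(name: str) -> str:
    return "http://dbpedia.org/resource/" + name.replace(" ", "_")

extracted = ["Thomson Reuters", "Calais"]
links = {name: to_dbpedia_uri(name) for name in extracted}
print(links["Thomson Reuters"])
# http://dbpedia.org/resource/Thomson_Reuters
```

Once an entity carries such a URI, any application that understands Linked Data conventions can follow it to the external data source.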
Let’s build on the example above, where the entities in an original document have been linked to data residing on DBpedia and The CIA World Fact Book. The idea is that the entities extracted from each source can be linked manually, through search results, or as a result of processing by an application. Simply knowing that these entities have an association can be valuable, but the key is that the URI provides a pointer to the specific data – not the file, not the document, and not the database, but to the actual datum, value, or record that’s stored in one of these containers. There’s no longer a need to call an entire file or database, read it to find what you’re looking for and then put it to use. Instead, you call just what you need – the specific data that matter to you.
This process is faster (read: cheaper in computer processing terms) and those URIs you’ve amassed can be reused by other people and applications because these pointers are durable and they persist – if the data remain in place, then each datum will keep the same individual URI (again: cheaper, highly reliable, and standardized to ensure universal access and use). It’s simply easier to exchange pointers to specific data (dereferenceable URIs) than it is to exchange potentially huge data files or documents.
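The efficiency argument can be made concrete with a toy in-memory store. The URIs and values below are entirely fictitious (`example.org`); the point is that a dereferenceable URI names one specific datum, so resolving it is a single keyed lookup rather than a scan of a whole file or database:

```python
# Toy "dereference": each URI points at one datum, not a container.
store = {
    "http://example.org/country/France/capital": "Paris",
    "http://example.org/country/France/population": 64_000_000,
}

def dereference(uri: str):
    # One keyed lookup; no need to fetch and scan the whole dataset.
    return store[uri]

print(dereference("http://example.org/country/France/capital"))  # Paris
```

Because the URI is stable as long as the data stay in place, the pointer itself can be exchanged, cached, and reused by other people and applications.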
Once documents and information assets are connected to the Linked Data cloud, deep connections can be made between the entities, facts, and events therein. This can, for instance, enable the resolution of complex queries, such as: "Which company boards of directors include CEOs that have been involved in the sub-prime mortgage meltdown?"
Let's start with the premise that Thomson Reuters has 150 years of experience creating, managing, and presenting content that people want. Over this period, the company has amassed a body of high quality content that's possibly the largest in the world. This content will continue to grow, but the advent of the Web has unleashed a torrent of content on a genuinely planetary scale. Since this content is outside Thomson Reuters' editorial and/or production controls, the company considers it to be "wild" content. This doesn't mean it's bad – some of it's exceedingly good.
Based on the environmental factors below, Calais puts Thomson Reuters in a position to extend its core competencies to include content it controls as well as wild content because:
The fundamental nature of publishing and using content is changing.
“World Wild Content” will dwarf the content Thomson Reuters controls.
Professionally produced content will continue to merit a premium.
The Open Access movement and similar efforts by academics, researchers, and other content authors seeking to retain control of their work will continue and grow.
Thomson Reuters has extensive experience in every aspect of the content industry.
Flexible integration/interoperation of different types of content may provide powerful added value.