Natural language processing is one way to support searching (and related tasks) across large collections of documents and other text. The alternative is to use a language independent approach.
The advantage of the latter approach is obvious: you only need one product and, provided it can recognize the characters used in a particular alphabet, it will run on text in any language, using statistical analysis to assess the relative importance of the various words used.
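As a rough illustration of what such statistical analysis might look like, here is a minimal sketch using TF-IDF weighting. This is an assumption made for the example: the article does not say which statistics any particular product uses, and the function names and the Portuguese sample phrases are hypothetical. The point is only that nothing in the code knows which language it is reading.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # \w+ matches word characters in any alphabet Python's Unicode tables know,
    # so no language-specific rules are involved.
    return re.findall(r"\w+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document (term frequency x inverse document frequency)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample phrases (Portuguese), purely for illustration.
SAMPLE_DOCS = [
    "o seguro de vida",
    "o seguro de carro",
    "o contrato de trabalho",
]

if __name__ == "__main__":
    for doc, weights in zip(SAMPLE_DOCS, tf_idf(SAMPLE_DOCS)):
        ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
        print(doc, "->", ranked)
```

Words that appear in every document ("o", "de") receive a weight of zero, while a word unique to one document ("vida") outranks one shared by two ("seguro"). The software has ranked the words without knowing what any of them mean.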
The big disadvantage of natural language processing is that you need a separate version of the product for each language and, where appropriate, for each dialect of a language. For example, Portuguese is rather different in Brazil and Portugal, not to mention, of course, the way the UK and the United States are separated by our common language.
This means that if you want to deploy a search engine (for example) for corporate-wide use with 50 different languages, then you will almost certainly choose a language independent product (such as Autonomy's), since neither of the leading natural language products (Inxight SmartDiscovery and SPSS' LexiQuest) is likely to support all the necessary languages. However, this position is changing as these vendors add new languages. For example, Inxight currently supports 26 languages and counting.
The advantage of natural language processing can be seen when you consider the following two statements: "San Francisco Giants take over National League West" and "Giant SUVs take over western San Francisco".
Now, using a statistical approach, you get two instances each of San, Francisco, Giant, west (the last two once they are reduced to their root forms), take and over. This is not entirely useful. If you use natural language processing, on the other hand, then the software will recognize that San Francisco is an entity in its own right, as is the National League West and the San Francisco Giants. This is illustrated further below. It is not difficult to see how natural language processing preserves context while language independent processing loses it.
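The contrast between the two headlines can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any of the products mentioned actually work: the `ENTITIES` list is a hand-written, hypothetical gazetteer standing in for real entity recognition, and `crude_root` is a deliberately simplistic stand-in for a proper stemmer.

```python
import re

HEADLINES = [
    "San Francisco Giants take over National League West",
    "Giant SUVs take over western San Francisco",
]

# Hypothetical gazetteer of known names; a real product would recognise
# entities with far richer linguistic resources.
ENTITIES = ["San Francisco Giants", "National League West", "San Francisco"]


def crude_root(word):
    """Toy root finder: lowercase, then strip an 'ern' suffix or a plural 's'."""
    word = word.lower()
    if word.endswith("ern"):
        return word[:-3]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def statistical_tokens(text):
    """Language independent view: every word stands alone."""
    return [crude_root(w) for w in re.findall(r"[A-Za-z]+", text)]


# Longest names first, so "San Francisco Giants" wins over "San Francisco".
_ENTITY_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(ENTITIES, key=len, reverse=True))
    + r"|[A-Za-z]+"
)


def entity_aware_tokens(text):
    """Entity-aware view: known names are kept whole, other words are rooted."""
    return [
        token if token in ENTITIES else crude_root(token)
        for token in _ENTITY_PATTERN.findall(text)
    ]


def shared_terms(tokenizer):
    """Terms that the two headlines have in common under a given tokenizer."""
    first, second = (set(tokenizer(h)) for h in HEADLINES)
    return first & second


if __name__ == "__main__":
    print(sorted(shared_terms(statistical_tokens)))
    print(sorted(shared_terms(entity_aware_tokens)))
```

With word-by-word counting the two headlines share francisco, giant, over, san, take and west, so they look closely related. Once the names are kept whole, the only overlap left is take and over, and the baseball team, the league division and the city are no longer confused with one another.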
What this means is that language independent processing doesn't understand the text that it is processing. So, when you do a search based on such a system, all it can do is bring back the first few lines of any relevant document (à la Google). Using natural language processing, on the other hand, you can summarize documents so that you get a precis of each one, which is considerably more useful.
Another problem with language independent processing is that it can have problems getting back to the roots or stems of words. This is easy enough in English, where flight, flying, flew and so forth can all be stemmed back to fly. However, as a counter-example, stemming a German word such as Lebensversicherungsgesellschaftsangestellter, which means "life insurance company employee", is just a trifle more difficult.
In practice, you really need different stemming algorithms for different languages and, in the case of German, quite complex stemming algorithms.
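To make the difference concrete, here is a minimal sketch of the two cases. Everything in it is an assumption for illustration: the suffix list, the table of irregular forms and the four-word German lexicon are hypothetical and tiny, and the compound splitter is a simple dictionary-based approach rather than any vendor's algorithm.

```python
# English: a lookup table for irregular forms plus simple suffix stripping.
IRREGULAR = {"flew": "fly", "flown": "fly", "flight": "fly"}
SUFFIXES = ("ing", "s")


def stem_english(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so that short words are left alone.
        if word.endswith(suffix) and len(word) >= len(suffix) + 3:
            return word[: -len(suffix)]
    return word


# German: compounds must first be split into their parts, which needs a
# lexicon and knowledge of the linking "s" that joins many of those parts.
GERMAN_LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter"}
LINKERS = ("", "s")


def split_compound(word, lexicon=GERMAN_LEXICON):
    """Return the list of lexicon words making up `word`, or None if it cannot be split."""
    word = word.lower()
    for end in range(len(word), 0, -1):  # try the longest prefix first
        head, rest = word[:end], word[end:]
        if head not in lexicon:
            continue
        if not rest:
            return [head]
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = split_compound(rest[len(linker):], lexicon)
                if tail:
                    return [head] + tail
    return None


if __name__ == "__main__":
    print([stem_english(w) for w in ("flight", "flying", "flew")])
    print(split_compound("Lebensversicherungsgesellschaftsangestellter"))
```

The English side needs nothing more than a few string operations and a short exception table. The German side only works because the lexicon happens to contain all four component words; with a realistic vocabulary the splitter would also have to choose between competing ways of dividing the same word, which is where the complexity comes in.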
The bottom line is that natural language processing makes considerably more sense in any document-intensive environment, provided always that it supports the languages you require. No doubt there are other considerations (associated products and facilities, performance, platform compatibility and so forth) but on a pure like-for-like basis, natural language processing would seem to have the edge over language independent processing, no matter how attractive the latter might be as a concept.
Copyright 2003 IT-Director.com