The Semantic Web aims to improve your chances of locating relevant information on the Internet by exploiting the underlying meaning (semantics) of content. As generations of Artificial Intelligence researchers have found, the human brain is much better at pulling semantic intent out of text than computer programs are. Creating the Semantic Web is not an easy task, but, along with HTML and Web services standardization, it has been a major preoccupation of the W3C.
Thinking about how to represent semantics started right at the beginning of the World Wide Web, but the first use of the term "Semantic Web" appears to have been in 1994 at the first WWW conference. There have been many attempts at standards for labeling the Web with metadata, in the hope that computers can eventually perform semantics-sensitive location of data. The W3C's formal Semantic Web Activity started in 2001; this quote from its introduction gives the initial intent:
The Semantic Web is about two things. It is about common formats for integration and combination of data drawn from diverse sources, where the original Web mainly concentrated on the interchange of documents. It is also about language for recording how the data relates to real world objects. That allows a person, or a machine, to start off in one database, and then move through an unending set of databases which are connected not by wires but by being about the same thing.
A number of evolving technologies contribute to the goal of the Semantic Web, so the W3C does not attempt to cram them all into a single standard. Some of these parallel developments are highly developed and widely used, while others remain experimental and attract little notice.
- XML and Unicode: The use of XML syntax together with XML Schema, plus the Unicode standard for representing almost all written languages, is well established and creates a firm foundation for future development. We are fortunate that XML has achieved near-universal acceptance, because other Semantic Web technology has not been so lucky.
- RDF (Resource Description Framework): The W3C has released a set of RDF-related standards intended for common use, but developers have taken the idea in a variety of directions, such as RSS with its many alternate versions and resulting industry confusion. The confusion can be seen in the many different interpretations of what RSS really stands for.
- OWL (Web Ontology Language): OWL is a W3C recommendation for markup intended to improve the possibility of machine interpretation of content. Deficiencies in the 2004 specification led to an OWL 2 specification (2009). OWL is closely related to RDF and builds on XML Schema datatypes. If your eyes glaze over at contemplating a concept as abstract as ontology, don't worry about it. The most important point of OWL is the requirement for common vocabularies with shared meanings if the Semantic Web is to be realized.
- SPARQL: A query language intended to make use of RDF data. The W3C is currently soliciting suggestions for improvements to the 2008 version.
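To make these technologies concrete, here is a hedged sketch of how they fit together. The ex: resource URIs and book records below are invented for illustration, but the dc: terms come from the real Dublin Core vocabulary - exactly the kind of shared vocabulary with agreed meanings that the Semantic Web depends on. A few RDF statements in Turtle syntax:

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/books/> .

# Two resources described with the shared Dublin Core vocabulary
ex:book1  dc:title    "Weaving the Web" ;
          dc:creator  "Tim Berners-Lee" .
ex:book2  dc:title    "Information Management: A Proposal" ;
          dc:creator  "Tim Berners-Lee" .
```

Because the data uses a common vocabulary, a SPARQL query can pull matching records from any store that shares it, without knowing how each store is organized internally:

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# Find the title of everything this creator wrote
SELECT ?title
WHERE {
  ?book dc:creator "Tim Berners-Lee" ;
        dc:title   ?title .
}
```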
An Example of the Power of a Well-Developed Ontology
Long ago, biologists recognized that they had a serious semantic problem in trying to describe nature - namely, that the same plant or animal might have dozens of different names in different parts of the world. The solution was a system of taxonomy providing an agreed-upon naming system for all living things. In addition, there is widespread agreement on the meaning of terms describing the relationships between organisms. Thus biology already comes with semantics applicable to locating data on the Web. Biological taxonomy is a subset of the more general topic of ontology, which addresses the classification of things. Ontology therefore occupies a central place in conceptual diagrams of the Semantic Web.
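The agreed-upon relationships of biological taxonomy map naturally onto Semantic Web structures. As a sketch (the ex: namespace below is invented for illustration; rdfs:subClassOf is the standard RDF Schema subclass relation), part of the cat family hierarchy might be published as:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/taxonomy/> .

# The shared hierarchy lets software infer that a lion is a carnivore
ex:PantheraLeo  rdfs:subClassOf  ex:Panthera .   # species -> genus
ex:Panthera     rdfs:subClassOf  ex:Felidae .    # genus -> family
ex:Felidae      rdfs:subClassOf  ex:Carnivora .  # family -> order
```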
Impediments to the Semantic Web Goal
I divide the impediments into three classes: accidental, by design, and deliberate.
In the accidental class we have incorrect vocabulary usage. It drives me nuts - I have seen confusion between "rein" and "reign" even in reports on high-traffic news sites. I suspect this problem is partially caused by spell checkers: "it's a real word that sounds like what I want, so it must be right!" A functioning Semantic Web will require consistent vocabularies.
An impediment by design is the trend toward Rich Internet Applications (RIA), which build a presentation by assembling multiple content sources from a wide variety of technologies. Major parts of an RIA may not be HTML at all when "plug-ins" such as Flash and Silverlight, or streaming video, carry the real content. Since the presented content is assembled from user inputs, it could be just about anything - so how can you possibly put the relevant semantic tags where an indexer can see them?
By deliberate impediments to the Semantic Web I mean such things as "Search Engine Optimization" and the creation of fake Web sites and links to promote malware and identity theft. Every natural calamity, such as the Haitian earthquake, has been followed by a proliferation of Web sites attempting to exploit public interest, hijacking Web searches away from valid sites and spreading malware. This is why we find "Trust" as the top layer of diagrams summarizing the Semantic Web. The earliest academic version of the Web assumed that people and places were who they said they were, so creating trust was not an initial design requirement.
I think the concepts of the Semantic Web are improving the Web experience, but only in restricted areas such as a single academic field or industry. The wild surge of invention that the Internet has sustained shows no sign of stopping, and most developers are not paying much attention to the goals of the Semantic Web.
Additional Semantic Web Resources
Review of RSS-related projects and specifications. - XML Cover Pages
FBI news release on Haitian Earthquake Scams. - ic3.gov