SIREn: Efficient semi-structured Information Retrieval for Lucene
Efficient, large-scale handling of semi-structured data (including RDF) is an increasingly important issue for many web and enterprise information reuse scenarios.
Graph-structured data (RDF) is commonly queried using dedicated solutions called triplestores, typically built on DBMS backends. At Sindice, however, we needed something much more scalable than a DBMS, with the features expected of a typical Web search engine: top-k query processing, real-time updates, full-text search, distributed indexes over shards, etc.
While Lucene has long offered these capabilities, it was not designed for large semi-structured document collections (or documents with very different schemas). For this reason we developed SIREn - Semantic Information Retrieval Engine - a Lucene plugin that overcomes these shortcomings and efficiently indexes and queries RDF, as well as any textual document with an arbitrary number of metadata fields.
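To make the schema problem concrete, here is a minimal Python sketch of a field-per-attribute inverted index, the kind of layout a field-based index naturally suggests. It does not use the Lucene or SIREn APIs, and it does not show how SIREn works internally; the entities, attribute names, and function names are all hypothetical. It only illustrates that when documents have very different schemas, the number of fields grows with the number of distinct attributes, and a query that does not name an attribute has to visit every field.

```python
from collections import defaultdict


def field_per_attribute_index(entities):
    """Build one inverted index per attribute name (one 'field' per attribute)."""
    index = defaultdict(lambda: defaultdict(set))
    for entity_id, attributes in entities.items():
        for attribute, values in attributes.items():
            for value in values:
                for term in value.lower().split():
                    index[attribute][term].add(entity_id)
    return index


def search_any_field(index, term):
    """Find entities containing the term in any attribute: visits every field."""
    hits = set()
    for postings in index.values():
        hits |= postings.get(term, set())
    return hits


# Hypothetical toy entities, each with its own schema.
entities = {
    "e1": {"name": ["Alice Smith"], "worksFor": ["Example Corp"]},
    "e2": {"title": ["Lucene in Practice"], "author": ["Alice Smith"]},
    "e3": {"label": ["Galway"], "population": ["80000"]},
}

index = field_per_attribute_index(entities)
field_count = len(index)                       # one field per distinct attribute
author_hits = sorted(index["author"]["alice"])  # attribute-specific lookup
any_hits = sorted(search_any_field(index, "alice"))
```

With only three entities the index already holds six fields, one per distinct attribute. On a web-scale RDF collection, where attribute names are unbounded, this growth is the kind of shortcoming the paragraph above refers to.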
Given its general applicability, we are delighted to release SIREn as open source under the GNU Affero General Public License, version 3. We hope businesses will find SIREn useful for building solutions on the Web of Data.
Read our SIREn case study in Lucene in Action, 2nd Edition