I stumbled across MongoDB the other day. It is one of the non-relational (NoSQL, http://en.wikipedia.org/wiki/Nosql) data stores, along the lines of CouchDB.
If you're interested, here are some links to MongoDB resources:
http://www.mongodb.org/
http://try.mongodb.org/
http://www.mongodb.org/display/DOCS/PHP+Language+Center
The try.mongodb.org link is especially useful. It provides a JavaScript shell and a tutorial that let you try MongoDB easily.
One of my first encounters with the new breed of NoSQL data stores was CouchDB. I think it was one of the early NoSQL stores, but MongoDB seems to have very good support among popular PHP frameworks -- see the PHP Language Center link above.
I also ran across a CouchDB book at http://books.couchdb.org/relax/. I plan to read the first few chapters to become more familiar with this genre of data store.
Of course, there have been NoSQL data stores around for quite a while -- think Berkeley DB from Sleepycat Software. But the new twist with some of the newer NoSQL stores is storage in JSON.
One problem, I would think, is getting legacy documents into a JSON representation. Simply dumping a legacy doc structure into one JSON 'doc-blob' won't make it searchable or filterable. I guess one could, however, parse what we have and create a reasonable JSON representation out of that.
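To make the difference concrete, here is a minimal Python sketch of the two options. The legacy format is purely an assumption for illustration ("key: value" header lines, a blank line, then free text), and the function names as_blob and as_structured are hypothetical; real legacy files would need their own parser.

```python
import json

# Hypothetical legacy format: "key: value" header lines, a blank line, then free text.
LEGACY = """doctype: weekly-essay
createtime: 2010-03-23 15:00
studentid: 1000053

Once upon a time...
"""

def as_blob(raw):
    """Naive approach: the whole document becomes one opaque string."""
    return {"blob": raw}

def as_structured(raw):
    """Split header fields into their own JSON keys; keep the rest as the body."""
    header, _, body = raw.partition("\n\n")
    doc = {}
    for line in header.splitlines():
        # Split on the first colon only, so values like "15:00" stay intact.
        key, _, value = line.partition(":")
        doc[key.strip()] = value.strip()
    doc["body"] = body.strip()
    return doc

print(json.dumps(as_blob(LEGACY), indent=2))
print(json.dumps(as_structured(LEGACY), indent=2))
```

The blob version has a single key, so a store has nothing to filter on. The structured version exposes doctype, createtime and studentid as separate fields, while the body is still one string.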
I haven't used NoSQL data stores much, but they seem superior to RDBMSes for the "loosely defined, hierarchical" data that turns up in many cases.
It seems like CouchDB has pretty good integration with Lucene. MongoDB just seems to discuss Lucene as a "future" project (http://www.mongodb.org/display/DOCS/Project+Ideas).
I've only used Lucene a bit as a user, but I've heard a lot about it. It seems especially strong in terms of faceted search. Sphinx seems to be faster than Lucene, but I think Lucene is more widespread, and it is probably easier to work with and has more features. Zend has a PHP port of Lucene, but we'd probably want to use the Java implementation, as it's probably faster and better supported.
If we just stored the legacy files as a single JSON 'clob' with a bit of easily obtainable metadata, I suspect we could filter/facet on the metadata, but it'd be hard to do a fine-grained search on data in the clob.
For example, if one had a 'student' JSON doc like this:
{ "studentdoc": { "doctype": { "value": "weekly-essay", "createtime": "2010-03-23 15:00", "studentid": "1000053", "body": "\n Once upon a time...\n" } } }
One could filter on doctype and createtime, but I think that without fine-grained JSON in the clob (body) it'd be impossible, for example, to find all stories that had a 6-foot cowboy in the body. I suspect Lucene is good at free-form text searches, but can't handle this type of quantitative search. This is perhaps obvious, but it may be worth doing some parsing of important documents to get something more searchable.
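Here is a rough Python sketch of that idea, run against plain in-memory dicts rather than a real data store. The sample documents, the flattened doc shape, the regular expression, and every function name (filter_meta, extract_characters, enrich, find_tall) are assumptions made up for illustration; a regex like this would miss most real prose.

```python
import re

# Hypothetical sample documents, flattened from the 'student' doc shape above.
DOCS = [
    {"doctype": "weekly-essay", "studentid": "1000053",
     "body": "Once upon a time a 6-foot cowboy rode into town."},
    {"doctype": "weekly-essay", "studentid": "1000054",
     "body": "A 4-foot cowboy met a 7-foot giant."},
    {"doctype": "book-report", "studentid": "1000053",
     "body": "The cowboy in this book was tall."},
]

def filter_meta(docs, **criteria):
    """Keep documents whose metadata fields equal every given value."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

# Assumed pattern for phrases like "6-foot cowboy"; real text needs far more than this.
HEIGHT_PATTERN = re.compile(r"(\d+)-foot (\w+)")

def extract_characters(body):
    """Pull (height, kind) pairs out of free text into structured fields."""
    return [{"height_ft": int(h), "kind": k} for h, k in HEIGHT_PATTERN.findall(body)]

def enrich(doc):
    """Return a copy of the document with a parsed 'characters' list added."""
    return {**doc, "characters": extract_characters(doc["body"])}

def find_tall(docs, kind, min_height_ft):
    """Quantitative query that only works once the body has been parsed."""
    return [d for d in docs
            if any(c["kind"] == kind and c["height_ft"] >= min_height_ft
                   for c in d["characters"])]

# Metadata alone narrows things to essays; the parsed fields answer the height question.
essays = filter_meta(DOCS, doctype="weekly-essay")
tall_cowboys = find_tall([enrich(d) for d in essays], "cowboy", 6)
print([d["studentid"] for d in tall_cowboys])
```

Filtering on doctype needs nothing beyond the metadata. The cowboy query only becomes possible after the parsing step, and the second essay is correctly skipped because its cowboy is 4 feet tall even though a 7-foot character also appears in it.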
I believe a NoSQL db should be more scalable than an RDBMS, but to me the real attraction seems to be the innate hierarchical storage structure which JSON (or even XML) allows. Much of our data come in a hierarchical format, so simply converting that to JSON may be easier than the gymnastics required by conversion to a relational format.
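As a small illustration of those gymnastics, the Python sketch below takes one nested record and stores it both ways. The course/student/essay data and the to_rows helper are hypothetical, and the three "tables" are just lists of tuples standing in for a relational schema.

```python
import json

# Hypothetical hierarchical record: a course with students, each with essays.
course = {
    "name": "Composition 101",
    "students": [
        {"id": "1000053", "essays": [{"title": "Week 1"}, {"title": "Week 2"}]},
        {"id": "1000054", "essays": [{"title": "Week 1"}]},
    ],
}

# Document-style storage: the hierarchy serializes as-is.
as_json = json.dumps(course)

def to_rows(course, course_id=1):
    """Relational-style storage: one list of rows per level, linked by keys."""
    course_rows = [(course_id, course["name"])]
    student_rows = []
    essay_rows = []
    for student in course["students"]:
        student_rows.append((student["id"], course_id))
        for essay in student["essays"]:
            # Each essay needs its own id plus a foreign key back to the student.
            essay_rows.append((len(essay_rows) + 1, student["id"], essay["title"]))
    return course_rows, student_rows, essay_rows

course_rows, student_rows, essay_rows = to_rows(course)
print(as_json)
print(course_rows, student_rows, essay_rows)
```

The JSON form is a single call and round-trips back to the same nested structure. The relational form needs a key scheme and a traversal for every level of nesting, and reading the record back would take joins across all three tables.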
Here's another good article on the "new wave" of NoSQL data stores:
http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-databas...