Our data set consists of gigabytes upon gigabytes of pickled Python dictionaries, CSV files, and plain text, with the odd bit of Excel or Word. I have three goals:
- Maintain this monstrosity
- Create a searchable index
- Build a new version for the future.
The entire app is a single monolithic Python app: there is no such thing as a “front end” or a “back end” or “middleware”. It’s a web app but there are no templates; it generates HTML via print statements. The same Python file may include standalone logic, or shared logic to be used by other components. It’s a bit of a mess. Lastly, the framework it uses is basically abandonware; I haven’t tried to see if it runs under any Python after 2.4, and you can be sure it won’t work under 3.
My first task was the search problem. I started with Whoosh, but after about a year it started to run into performance problems, and I’d also learned enough about information retrieval that I wanted some more features. The Whoosh guy is awesome and he’s done a hell of a thing, though; I cannot recommend it enough for smaller projects, but I needed more. I’d attended a talk at PyCon about Elasticsearch, so I switched to that, and it’s been awesome.
My strategy was pretty simple: a cron job to regenerate the world. Since Elasticsearch is really, really fast, it took perhaps 30 minutes to reindex the entire data set, and since it’s not a 24/7 use case, running it at night is no big deal. (I’d like to provide real-time search but my users rarely need it; they’re content to have today’s new data appear tomorrow.)
This worked so well for two reasons. First, I’d learned enough about the “common data set” to make the custom indexer pretty easy to work with: I knew enough about my users’ search needs that I could ignore 99.9% of the data. And second, Python dictionaries map really well to JSON, which Elasticsearch uses as its input and output.
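To make the dictionary-to-JSON point concrete, here is a rough sketch of turning unpickled dictionaries into a newline-delimited body of the kind the Elasticsearch bulk API accepts (an action line followed by a document line). The index name, the field names, and the `to_bulk_body` helper are all made up for illustration, and it is written for modern Python rather than the 2.4-era app; it is not the actual indexer.

```python
import json
import pickle


def to_bulk_body(records, index_name, id_field="projectid",
                 keep=("projectid", "title")):
    """Build a newline-delimited JSON body: one action line, then one document line."""
    lines = []
    for record in records:
        # Keep only the handful of fields users actually search on
        # (the field names here are hypothetical).
        doc = {key: record[key] for key in keep if key in record}
        action = {"index": {"_index": index_name, "_id": str(record[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # Bulk bodies are newline-terminated.
    return "\n".join(lines) + "\n"


# A pickled dictionary standing in for one of the flat files.
blob = pickle.dumps({"projectid": 42, "title": "Bridge survey",
                     "internal_notes": "not searchable"})
record = pickle.loads(blob)

body = to_bulk_body([record], index_name="projects")
print(body)
```

The resulting string would then be sent to the cluster by whatever HTTP client or Elasticsearch client library is at hand; that part is left out so the sketch runs without a server.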
In building the regenerate-the-world scripts, I had written a huge amount of code to 1) walk the entire flat-file “database” and 2) make lots and lots of sense of it all. I did stuff like, “ensure that every disparate part of the app always refers to a Project by the faux-primary-key ‘projectid’ instead of ‘pj’ and ‘projid’ and whatever else”. My indexer did a pretty decent job of cleaning up this semi-schemaless data; so now what?
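The key cleanup is the kind of thing that fits in a few lines. A minimal sketch, assuming a hand-maintained alias table; `KEY_ALIASES` and `normalize_keys` are hypothetical names, and only the ‘pj’ and ‘projid’ spellings come from the real data:

```python
# Hypothetical alias table: every spelling seen in the wild -> canonical key.
KEY_ALIASES = {"pj": "projectid", "projid": "projectid"}


def normalize_keys(record, aliases=KEY_ALIASES):
    """Return a copy of record with aliased keys renamed to their canonical form."""
    clean = {}
    for key, value in record.items():
        canonical = aliases.get(key, key)
        # Two spellings of the same key with different values is a data
        # problem worth surfacing rather than silently overwriting.
        if canonical in clean and clean[canonical] != value:
            raise ValueError(f"conflicting values for {canonical!r}")
        clean[canonical] = value
    return clean


print(normalize_keys({"pj": 7, "name": "Bridge survey"}))
```

Raising on conflicts is a design choice for the sketch; a real indexer might prefer to log the record and move on.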
Since our app uses CouchDB, it was my first choice, and very quickly abandoned. I loathe CouchDB. It makes a lot of sense in our app, but not for a general-purpose data store.
Up next was “any ol’ RDBMS”, which means MySQL. Attempts to hammer the semi-schemaless data into relational format resulted in a data model so complex and byzantine, it was practically recursive. Instead of 3rd normal form I made a wormhole into a hell-dimension. So, no.
Despondent and generally upset, I tried MongoDB. And it worked! Experiments worked really well!
- As I said, Python dictionaries map very well to JSON/BSON so the amount of friction in import/export was minimal.
- Ad-hoc queries
- Easy blob storage for stuff like Word documents
- It’s fast (importing the world took perhaps 20 minutes)
- It’s easy to set up (compile and go, basically)
- Support for every language and platform I could think of
- Has some replication capability in case I ever need it
I wasn’t really sure about a couple things, mainly backup-and-restore, but that was really my only concern, and the Mongo docs on the topic seemed straightforward enough; my users can tolerate an hour of downtime.
And now, the point of my little story: I think MongoDB is picked on more than just about any platform save PHP. There is so much fear, uncertainty, and doubt spread about it, it’s started to leak into my world and freak me out.
Consider the most recent thing, the “randomly log stuff” bit in the Java driver. Places like /r/shittyprogramming were all over it with digital brickbats. Every thread was then a free-for-all of “here’s how MongoDB screwed me over/here’s why MongoDB sucks” stories from all over the internets.
Panic set in. This data is mission-critical; while my users can tolerate small amounts of downtime and don’t need OLTP-type features, it’s still mission-critical data. Have I fucked up royally here? Have I set myself up for epic fail? Or am I just giving in to the sort of FUD that pervades every goddamn internet discussion about any sort of technology? Let’s face it: people pile on, and rarely are they anywhere near as awesome as they think they are.
At this point I’m not entirely sure what to do. My thought was to return to the cold comfort of MySQL, using a FriendFeed-style schemaless system. It’s a huge orthogonal step but I’ve recovered horribly fucked MySQL databases after three-too-many bottles of tequila, so it’s safe and well-understood. It puts the onus on me to write the entire friggin’ access layer, but whatever. I know about Postgres and JSON, but I don’t know Postgres at all.
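For anyone who hasn’t seen the FriendFeed-style idea: as I understand it, each entity is stored as an opaque serialized blob in one table, and every field you want to query gets its own small index table that the application keeps in sync. Here is a rough sketch of that shape, nothing more. It uses `sqlite3` as a stand-in for MySQL so it runs anywhere, JSON as the blob format, and table and function names (`entities`, `index_projectid`, `save`, `find_by_projectid`) that are invented for the example.

```python
import json
import sqlite3
import uuid

# In-memory SQLite as a stand-in for MySQL; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (
        id   TEXT PRIMARY KEY,
        body TEXT NOT NULL          -- the whole dictionary, serialized as JSON
    );
    CREATE TABLE index_projectid (
        projectid TEXT NOT NULL,
        entity_id TEXT NOT NULL,
        PRIMARY KEY (projectid, entity_id)
    );
""")


def save(entity):
    """Store the dictionary as a blob and refresh its index row."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    with conn:  # one transaction: blob and index change together
        # INSERT OR REPLACE is SQLite syntax; MySQL would need its own upsert.
        conn.execute("INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
                     (entity_id, json.dumps(entity)))
        conn.execute("DELETE FROM index_projectid WHERE entity_id = ?",
                     (entity_id,))
        if "projectid" in entity:
            conn.execute(
                "INSERT INTO index_projectid (projectid, entity_id) VALUES (?, ?)",
                (str(entity["projectid"]), entity_id))
    return entity_id


def find_by_projectid(projectid):
    """Look up entities through the index table, then decode the blobs."""
    rows = conn.execute(
        "SELECT e.body FROM entities e "
        "JOIN index_projectid i ON i.entity_id = e.id "
        "WHERE i.projectid = ?",
        (str(projectid),))
    return [json.loads(body) for (body,) in rows]


first_id = save({"projectid": 42, "title": "Bridge survey"})
print(find_by_projectid(42))
```

The access layer I’d be on the hook for is basically `save` and the `find_by_*` functions, multiplied by every field anyone wants to query.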
Am I giving in to FUD? Do I stay the course, trusting that my proven, real-world positives outweigh potential negatives?