How to set up Lucene/Solr for a B2B web app?

You summoned me from the FogBugz StackExchange. My name is Jude, and I'm the current search architect for FogBugz. Here's a rough outline of how the FogBugz On Demand search architecture is set up: For reasons related to data portability, security, etc., we keep all of our On Demand databases and indices separate.

While we do use Lucene (Lucene.NET, actually), we've modded its backend fairly substantially so that it can store its index entirely in the database. Additionally, a local cache is maintained on each webhost so that unnecessary database hits can be avoided whenever possible.

Our filters are almost entirely database-side (since they're used by aspects of FogBugz outside of search), so our search parser separates queries into full-text and non-full-text components, executes the lookups, and combines the results. This is a little unfortunate, as it voids many useful optimizations that Lucene is capable of making.
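
To make that split concrete, here is a minimal, hypothetical sketch of the idea (in Java, not the actual FogBugz parser, which is Lucene.NET): tokens with a field prefix are routed to the database side, everything else goes to the full-text index, and the two sets of record IDs are intersected. The helper methods are stand-ins for the real lookups.

```java
// Hypothetical illustration only -- not FogBugz's actual parser.
// Assumed convention: "field:value" tokens become database-side filters,
// everything else becomes full-text terms; the two ID sets are intersected.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SplitSearch {
    static Set<Long> search(String rawQuery) {
        List<String> dbFilters = new ArrayList<>(); // e.g. "assignedto:bill"
        List<String> fullText = new ArrayList<>();  // free-text terms for the index
        for (String token : rawQuery.trim().split("\\s+")) {
            if (token.contains(":")) {
                dbFilters.add(token);   // handled by SQL, outside Lucene
            } else {
                fullText.add(token);    // handled by the full-text index
            }
        }
        Set<Long> textHits = searchFullText(fullText); // hypothetical Lucene lookup
        Set<Long> dbHits = queryDatabase(dbFilters);   // hypothetical SQL lookup
        textHits.retainAll(dbHits);                    // combine: keep IDs present in both
        return textHits;
    }

    // Stand-ins for the real lookups; each returns matching record IDs.
    static Set<Long> searchFullText(List<String> terms) { return new HashSet<>(); }
    static Set<Long> queryDatabase(List<String> filters) { return new HashSet<>(); }
}
```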

There are a few benefits to what we've done. Managing the accounts is quite simple, since client data and their index are stored in the same place. There are some negatives too, though, such as a set of really pesky edge-case searches which underperform our minimum standards. Retrospectively, our search was cool and well done for its time.

If I were to do it again, however, I would discourage this approach. Simply put, unless your search domain is very special or you're willing to dedicate a developer to blazingly fast search, you're probably going to be outperformed by an excellent product like Solr or Xapian. If I were doing this today, unless my search domain was extremely specific, I would probably use Solr or Xapian for my database-backed full-text search solution.

As for which, that depends on your auxiliary needs (platform, type of queries, extensibility, tolerance for one set of quirks over another, etc.). On the topic of one large index versus many(!) scattered indices: both can work. I think the decision really lies with what kind of architecture you're looking to build, and what kind of performance you need. You can be pretty flexible if you decide that a 2-second search response is reasonable, but once you start saying that anything over 200ms is unacceptable, your options start disappearing pretty quickly.

While maintaining a single large search index for all of your clients can be vastly more efficient than handling lots of small indices, it's not necessarily faster (as you pointed out). I personally feel that, in a secure environment, the benefit of keeping your client data separated is not to be underestimated. When your index gets corrupted, it won't bring all search to a halt; silly little bugs won't expose sensitive data; user accounts stay modular, so it's easier to extract a set of accounts and plop them onto a new server; etc. I'm not sure if that answered your question, but I hope that I at least satisfied your curiosity :-).
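
For contrast with the per-client setup described above, the single-shared-index route usually isolates tenants with a filter on every query. Below is a minimal SolrJ sketch of that alternative; the host, core name, and client_id field are assumptions made for illustration, not anything from the FogBugz implementation.

```java
// Sketch of the single-shared-index alternative: every document carries a
// tenant field (assumed here to be "client_id"), and every query must add a
// filter on it. Forgetting that filter is exactly the kind of "silly little
// bug" that can expose another client's data. Host and core name are made up.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SharedIndexQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://search01:8983/solr/all_clients").build();
        SolrQuery query = new SolrQuery("body:login");
        query.addFilterQuery("client_id:1234"); // mandatory tenant filter
        System.out.println(solr.query(query).getResults().getNumFound());
        solr.close();
    }
}
```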

Jude, I appreciate your answer, your effort, and simply that you took time out of your busy schedule for this. I will keep your advice in mind, along with Shalin's and @Mikos's. Thank you so much.

– Bill Paetzke May 25 '10 at 2:54 To all: I accepted @Blinky's answer because he has been there, done that, with almost exactly the same scenario I face. @Mikos and Shalin offered great suggestions, too, and I will consider all their advice when implementing search on my web app.

– Bill Paetzke May 26 '10 at 0:29.

I am still unclear on what exactly users are searching for across the 5K databases, why you need Lucene, and the data sizes in each database. But I will take a whack anyway: you should be looking at multicore Solr (each core = one index), which gives each client a unique URL to query. Authentication will still be a problem, and one (hackish) way to approach it would be to make the URL hard to guess.

Your webservers can query the Solr instance/core depending on what they have access to. I'd suggest staying away from the filter approach and from creating one huge index combining all databases. HTH.
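
A rough SolrJ sketch of that per-client-core setup might look like this; the host, core name, and field names are illustrative assumptions rather than anything from the answer above.

```java
// Hedged SolrJ sketch of the per-client-core idea: each client's core has its
// own URL, and the web tier only queries the core it is authorized to reach.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PerClientCoreSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://search01:8983/solr/client_1234").build(); // one core per client
        SolrQuery query = new SolrQuery("body:\"overdue invoice\"");
        query.setRows(20);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}
```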

Thanks @Mikos, I will look into multi-core Solr. Yes, I am vague on the type of data stored. But I can say that clients have 100k to 10 million records.

Right now my "search engine" consists of dynamic SQL queries, which are slow and limiting. I read Lucene is better than full-text catalogs--faster and more scalable. – Bill Paetzke Apr 25 '10 at 4:57 Happy to help.

I have done a similar effort recently, and if your database fields contain plenty of text, then Lucene/Solr will blow your socks off (compared to dynamic SQL), plus you also get faceting as a bonus to better filter the results.

Just a couple of lessons learned: 1. Do not store the entire record in the index (it is tempting to do so); only store what you absolutely need to, such as the record identifier (a DB record => a document in Lucene).

2. Once your search is performed, use the record IDs to retrieve the records from the individual databases. I found this approach worked best in my case. HTH – Mikos Apr 25 '10 at 5:02.
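
A minimal sketch of those two lessons, assuming made-up core, field, and table names (client_1234, id, body, cases) and using SolrJ plus plain JDBC:

```java
// Minimal sketch: index only the record id plus the text to search, then
// hydrate full records from the client's own database by id after the search.
// All names and the JDBC URL are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class IdOnlyIndexing {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://search01:8983/solr/client_1234").build();

        // Lesson 1: one DB record => one Solr/Lucene document, holding only
        // what search needs (the id and the text to match against).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        doc.addField("body", "Customer reports login failure after upgrade");
        solr.add(doc);
        solr.commit();

        // Lesson 2: the search returns ids; the full records come from the DB.
        List<String> ids = new ArrayList<>();
        for (SolrDocument hit : solr.query(new SolrQuery("body:login")).getResults()) {
            ids.add((String) hit.getFieldValue("id"));
        }
        solr.close();
        if (ids.isEmpty()) return;

        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://db01/client_1234");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT * FROM cases WHERE id IN (" + placeholders + ")")) {
            for (int i = 0; i < ids.size(); i++) {
                stmt.setString(i + 1, ids.get(i));
            }
            ResultSet rs = stmt.executeQuery();
            // ... map rows back to domain objects, preserving the search order
        }
    }
}
```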

Shalin Shekhar Mangar answered me on the Solr-user mailing list and by private email. Shalin is a contributor to Solr and an author of the upcoming book Solr in Action. His reply on the mailing list: How would you set up the index(es)?

I'd look at setting up multiple cores, one for each client. You may need to set up slaves as well, depending on search traffic. Where do you store the index(es)?

Setting up 5K cores on one box will not work, so you will need to partition the clients across multiple boxes, each having a subset of cores. Would you need to add a filter to all search queries?

Nope, but you will need to send the query to the correct host (perhaps a mapping DB will help). If a client cancelled, how would you delete their (part of the) index? (This may be trivial--not sure yet.) With different cores for each client, this would be pretty easy. His reply by email: I've worked on a similar use-case in the past and we used the multi-core approach with some heavy optimizations on the Solr side.

See http://wiki.apache.org/solr/LotsOfCores - I haven't been able to push these changes into Solr yet.
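
As an illustration of the mapping-DB routing mentioned above, here is a hedged sketch; the table name, columns, and URL layout are assumptions made for the example, not part of Shalin's reply.

```java
// Hedged sketch of the "mapping DB" routing idea: a small lookup table tells
// the web tier which Solr host and core serve a given client, and the query
// is sent there. Table name, columns, and URL layout are assumptions.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchRouter {
    // Mapping rows look like (client_id, solr_host, core_name),
    // e.g. (1234, 'search07', 'client_1234').
    static String coreUrlFor(Connection mappingDb, long clientId) throws SQLException {
        try (PreparedStatement stmt = mappingDb.prepareStatement(
                "SELECT solr_host, core_name FROM client_search_hosts WHERE client_id = ?")) {
            stmt.setLong(1, clientId);
            try (ResultSet rs = stmt.executeQuery()) {
                if (!rs.next()) {
                    throw new IllegalStateException("no search host mapped for client " + clientId);
                }
                return "http://" + rs.getString("solr_host") + ":8983/solr/"
                        + rs.getString("core_name");
            }
        }
    }

    static QueryResponse search(Connection mappingDb, long clientId, String q) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                coreUrlFor(mappingDb, clientId)).build()) {
            return solr.query(new SolrQuery(q));
        }
    }
}
```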

I'll try out his approach with a small subset of clients. If Solr doesn't work well, I will wait for his "LotsOfCores" change to be pushed. His change might go in the next Solr release (within the next few months?).

– Bill Paetzke Apr 25 '10 at 18:42.
