Pages

Thursday, December 20, 2012

Crafting an Apache Solr Schema.xml


Using Apache Solr with Drupal is fairly simple thanks to the apachesolr module, but recently we were tasked with making Solr a vital component of a custom Django project. The Drupal module comes with a Solr schema.xml that is already set up specifically to play nice with Drupal, but we had to craft our own. Setting up Solr, filling it with data, and getting it back out again is relatively easy. However, much like taking a north-Philly street brawler and turning him into Rocky, it takes a bit of work to do it well.
Possibly the single most important factor in successfully creating a custom Solr schema is how well you know the data. This is where a slight bit of artisanship comes into play. To be as efficient as possible the schema has to reflect not just the data types, but where the data needs to go and how it is going to be used on a field by field basis.
Anytime I move data from one point to another, I take some time to plot out all of the data I start with and where exactly it needs to go. In this case, that means not just where it needs to go in Solr, but where and how it will be used by viewers of the final site. Only after I have answered those questions do I start thinking about the mapping and how to get it there. For simple one-to-one migrations of data, that process might get you most of the way there, but with Solr you need to be mindful of many other factors.

What is the schema.xml file?

The Solr schema.xml (typically found in the solr/conf/ directory) is where you tell Solr what types of fields you plan to support, how those types will be analyzed, and what fields you are going to make available for import and queries. Solr will then base its Lucene underbelly on what you define. After installation you will have a generic example schema.xml that I highly recommend you read through and even use as the foundation for your own schema. It can be tempting to glaze over the large block comments in an example file, but you will find great examples, explanations, and even some solid performance and tuning tips all through the file. It is definitely worth the time to read it over before you set up camp and make it your own.
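As a rough illustration of its shape (the field and type names here are hypothetical, and the example schema that ships with Solr is far more complete), a stripped-down schema.xml looks something like this:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
  <!-- Field types: how data of each type gets analyzed -->
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <!-- Fields: what you can actually import and query -->
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
  </fields>

  <uniqueKey>id</uniqueKey>
</schema>
```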

Field Types, Fields, and Copy Fields

The big picture of the file really breaks down to three major areas. First you define your field types to dictate which Solr java class the field type utilizes and how fields of this type will be analyzed. This is where most of the "magic" is outlined in your schema.
Then you define your actual fields. A field has a name attribute which is what you will use when importing and querying your data, and it points back to a designated field type. This is where you tell Solr what data is indexed, and what data is stored.
Optionally, you may choose to have some copy fields, which can be used to intercept incoming data going to one field and fork a clone of it off to another field that is free to be a different field type.

You do not have to index and store it all

A concept in Solr that you do not want to miss out on is that not everything has to be indexed and not everything has to be stored. Solr takes a dual path with data, keeping what is indexed completely separate from what is stored. When you think about the complex operations that Solr does on your data to dissect, convert, and if you say so - even turn it backwards in order to make it indexable, it makes sense that you would not be able to turn around and reuse that same data to display back to the viewer. So now you have two fundamental questions to ask yourself on each field:
  1. Will this field need to be searched against?
  2. Will this field need to be presented to the viewer?
The answers to these questions are directly reflected in the indexed and stored attributes of your field elements. These days even terabytes are relatively cheap, but performance is still priceless. If Solr doesn't need it, don't make Solr wade through it to get what it does need.
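In schema terms (field names hypothetical), a field that is searched but never displayed can skip storage, and one that is only displayed can skip indexing:

```xml
<!-- Searched against, but never shown to the viewer -->
<field name="body_search" type="text_general" indexed="true" stored="false"/>

<!-- Shown to the viewer, but never searched against -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>

<!-- Both searched and displayed -->
<field name="title" type="text_general" indexed="true" stored="true"/>
```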

Analyzers, Tokenizers, and Token Filters: know 'em - love 'em

Much of the power of Solr comes from the field types that you define and what combination of char filters, tokenizers, and token filters you use against your data. When you set up a field type you can define one shared analyzer, or one analyzer for indexing and a completely different one for queries.
Within your analyzer(s) you can start out with char filters. A char filter will modify your data in some way before it is actually analyzed. A good example would be the PatternReplaceCharFilterFactory which can apply a regular expression find/replace on your data before Solr even starts to break it apart.
Each analyzer can have one tokenizer. Which one you use will depend heavily on the data you intend to pass through it. Think of this as the fundamental way that you want the analysis of the data to be approached. From there you can also apply token filters to further manipulate your data. Examples would be referencing a stop words list, applying a list of synonyms, or forcing all characters to lowercase.
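A sketch of a field type pulling these pieces together (the class names are standard Solr factories; the stopwords.txt and synonyms.txt files are assumed to exist in your conf/ directory, and the pattern is just an illustration):

```xml
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filter: regex find/replace before the data is tokenized -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&amp;amp;" replacement="and"/>
    <!-- Exactly one tokenizer per analyzer -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Token filters: applied in order after tokenization -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```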

Because index and storage are separate, they can be analyzed separately

A great example of this would be if you were going to index and store a document that had plenty of content but was also thick with HTML markup. When you index that data you want to make sure the markup is not getting indexed, otherwise things like class names might give you false positives down the road. However, when you go to serve up the content to the viewer, you are going to want that HTML intact. Solr can handle this by allowing you to define a field type where the index analyzer uses the HTMLStripCharFilterFactory char filter to purge out all HTML markup prior to tokenization, while the query analyzer skips the char filter; the stored copy is untouched, so the original content is presented in search results.
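That scenario can be sketched as a field type with separate index and query analyzers (the type name is hypothetical; the factories are standard Solr classes):

```xml
<fieldType name="text_html" class="solr.TextField">
  <!-- Index side: strip HTML before tokenizing -->
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query side: no HTML to strip, so no char filter -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```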

Dynamic fields are rad, but specifically defined fields are rad2

The most common usage of dynamic fields is to allow your scripts to pass in data with a suffix like "_s" (example: address_s), and Solr will then process whatever you pass as a string. This comes in handy when you have data structures that may change over time and you want your scripts to handle this gracefully without coder intervention on either the scripts or the schema. The Drupal module makes heavy and clever use of this feature. The two downsides are that your dynamic fields are typically going to have more generalized tokenizers and filters applied, and they are most likely going to index and store everything in order to play friendly with any data you throw at them. Whenever possible I would recommend defining specific fields for any data that is going to carry significant weight in your indexing or queries. This will allow you to get much more granular with how that data is processed.
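For comparison (field and type names hypothetical), a catch-all dynamic field next to a specifically defined one:

```xml
<!-- Anything sent in with an _s suffix (e.g. address_s) is treated as a string -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>

<!-- A specifically defined field can get a type tailored to its data -->
<field name="address" type="text_general" indexed="true" stored="true"/>
```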

Why send twice when you can copyField

If you need the same field to be indexed or stored in more than one way, there is no need to modify your scripts to send the same field twice. Just utilize the copy field feature of Solr. Another useful trick is to use copy fields to combine multiple fields into one indexable field. This can greatly simplify running queries.
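A minimal sketch (the source and destination fields are hypothetical, and each dest field must itself be defined in the schema):

```xml
<!-- Clone incoming title data into a second field with a different type -->
<copyField source="title" dest="title_sort"/>

<!-- Combine several fields into one catch-all field for simpler queries -->
<copyField source="title" dest="text_all"/>
<copyField source="body" dest="text_all"/>
```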

Where to start and where to find more

If you have not installed Solr yet, you can find the download and install documentation here: http://lucene.apache.org/solr/downloads.html.
If you are just experimenting, you can get by running from the Jetty Java servlet container that comes with the download. But for production and serious development you will want to upgrade to something with a little more muscle, like Apache Tomcat, to serve your Solr needs.
If this is your first time working with Solr, I would recommend spending some time just experimenting with the solr/admin/ interface. A lot can be learned from pushing in some of the example XML files and testing out some queries. This is also a great place to test out your latest field type concoctions.
The apache.org wiki provides great documentation on the most common tokenizers and filters. If those do not meet all of your needs, you can also find plenty of others out in the wild. For example, we needed a geohash that could handle multiple values on a field, so we tracked down and used an alternative to the single-value one that comes with Solr.
You will find a lot of material and concepts at play and the best way to find your bearings is to just dive in. When in doubt, push some data, query it, refine the schema as needed.

Deleting / Purging Cache, Users, Nodes, Comments (Housekeeping)


If you have developed a Drupal site and are maintaining it too, you will soon find that your database keeps growing with a lot of unwanted posts, comments, and user registrations. You need to purge this data from time to time to keep storage at optimum levels and keep your website updated with only relevant content.
I was looking at the different ways in which you can clear or purge data in Drupal. I have logically divided them into 4 categories:
  1. Purging Cache
  2. Purging Users
  3. Purging Comments
  4. Purging posts
Let's look at each one of them.
  1. The Cache Actions module provides Rules actions for clearing Drupal caches. It currently provides actions for:
  • Clearing Drupal cache bins
  • Clearing the CSS/JS cache
  • Clearing the cache of specific views
  • Clearing the cache of specific panel pages
  • Clearing the cache of specific mini panels

  2. The Cleaner module allows the admin to set a schedule for clearing caches, watchdog entries, and old sessions. It's available only for Drupal 5 and Drupal 6.
There are functions in Drupal that will cause "expired" entries in some cache tables to be deleted; this is vastly improved in Drupal 6. "minimum_cache_lifetime" is a partial solution, but still not totally complete. There are still times and/or cache tables that don't get cleared in any of those scenarios. Many sites will not be impacted by this, but a few will (just search on drupal.org and you will see many posts from people having problems).
  3. Also refer to this documentation on clearing the cache.
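Programmatically, the same clearing can be done with Drupal's cache API. A minimal sketch for Drupal 6/7 (the cache bin names are the standard core ones; "my_module:expensive_data" is a hypothetical cache ID):

```php
<?php
// Flush the entire page cache (wildcard clear on the cache_page bin).
cache_clear_all('*', 'cache_page', TRUE);

// Flush a single entry from the default cache bin.
cache_clear_all('my_module:expensive_data', 'cache');

// Rebuild the aggregated CSS and JS caches.
drupal_clear_css_cache();
drupal_clear_js_cache();
```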
  4. The LoginToboggan module offers several modifications of the Drupal login system, including one relevant feature: it can optionally purge unvalidated users from the system at a pre-defined interval (please read the CAVEATS section of INSTALL.txt for important information on configuring this feature!).
  5. The inactive_user module provides Drupal administrators with a way to automatically manage inactive user accounts. This module has two goals: to help keep users coming back to your site by reminding them when they've been away for a configurable period of time, and to clean up unused accounts.
  6. User Prune lets you mass delete inactive users based on criteria you specify.
The module classifies inactive users into two categories: users who have never logged in before, and users who have logged in at least once. For users who have never logged in, you can choose to prune based on how long they've been registered. For users who have logged in before, you can choose to prune based on both how long they've been registered and how long it's been since they last logged in. The pruning specification you select can be saved as a cron job, or executed a single time.
  7. Deleting nodes at the database level is a bad idea because there are so many tables in the database that contain node-related data. If deleting nodes in bulk at admin/content/node is not adequate for your needs, then you need to make use of the node_delete() API function. Example usage to delete page-type nodes (Drupal 6 API):
$node_type = 'page';
$deleted_count = 0;

// Fetch the nodes we want to delete.
$result = db_query("SELECT nid FROM {node} WHERE type = '%s'", $node_type);
while ($row = db_fetch_object($result)) {
  node_delete($row->nid);
  $deleted_count++;
}
// Simple debug message so we can see what was deleted.
drupal_set_message("$deleted_count nodes have been deleted.");

  8. To delete duplicate nodes, refer to this code snippet.
  9. Deleting comments – if you have to delete specific comments or spam comments, you can use Views Bulk Operations to mass delete them. Install the Views Bulk Operations module, create a page view of comments, and choose "Bulk Operations" as the view style. Make sure you select "Delete comment" as one of the selected operations. When you view the page you created, you'll be able to bulk delete comments.
  10. Deleting nodes can also be done using the Views Bulk Operations module in the same way as above.

Which is the best web hosting for Drupal?


I know web hosting is so unreliable these days. It's full of made-up promises with fine print. Unlimited space is hype, a marketing gimmick. Don't fall for it! Sometimes it's difficult to find good web hosts with great support.
So how do you find a web hosting company that can best support your Drupal website, such that whenever there is a problem, someone can solve it quickly?
Drupal needs PHP and MySQL support. This common setup, better known as LAMP (Linux, Apache, MySQL, PHP), is the most economical and stable solution. In addition, you will need a good control panel interface, like cPanel. First of all, let's check what additional features you would need in a Drupal host compared to normal websites; these will be useful if you are planning a high-end website:
  1. Memcached is a high-performance memory object caching system. It speeds up database-driven websites by caching data and objects in RAM. This is very effective in managing the load on your database, which for most web applications, including Drupal, is the biggest performance bottleneck and risk to scalability.
  2. APC, which stands for Alternative PHP Cache, is a free PHP extension that will optimize the performance of PHP applications by caching PHP code in a compiled state. Check if the host has this.
  3. When a proxy, like Squid, is configured as a reverse proxy it can act as a caching mechanism for web pages. Because the reverse proxy sits between the internet and the webserver, it can intercept all requests and respond to them by serving cached content. This reduces the load on the webserver (and the database).
  4. FastCGI and mod_php are two of the most commonly used approaches for running PHP. Most people use mod_php because that is the default on nearly all Linux distributions. FastCGI is often used on shared hosts to provide additional security, but takes a performance hit. So choose between better performance (mod_php) and more security (FastCGI).
  5. Varnish is an HTTP accelerator that caches web pages for future use. It mainly benefits anonymous users.
  6. Apache Solr is a search platform that uses a variety of features to speed up complicated database queries. While Solr can drastically reduce the amount of time required for queries on large pools of data, it should only be applied to sites where the increased memory usage doesn't outweigh the benefits.
  7. Security – A generic requirement is that it should support SFTP file transfers and SSL for secure data transfers.
In conclusion, if you are looking for shared hosting and something that is good for your pocket, then you may not get good performance, security, or customer support. Having said that, there are good shared hosting providers for Drupal. If you are looking to host professional, high-end sites, then it's better to select from the above list of features, in addition to the standard features provided by the web host. Drupal is a powerful CMS, and if the hardware or hosting supports it well, it can really do wonders for you. Fast websites are still in fashion and will continue to outsmart the slower alternatives.

Use CDN jQuery for your Drupal theme


Simple code to use jQuery from Google CDN. Change the version "1.8.0" to your preferred version.
You can also switch to use Microsoft CDN, or other sources by changing the jQuery path below.
/**
* Implements hook_js_alter().
*/
function YOUR_THEME_js_alter(&$js) {
  if (isset($js['misc/jquery.js'])) {
    $jquery_path = 'http://ajax.googleapis.com/ajax/libs/jquery/1.8.0/jquery.min.js';
    $js['misc/jquery.js']['data'] = $jquery_path;
    $js['misc/jquery.js']['version'] = '1.8.0';
    $js['misc/jquery.js']['type'] = 'external';
  }
}
Note that this code may conflict with the jQuery Update module.