Thursday, December 20, 2012

5 Drupal modules to Purge users, Clean-up Spammers and Spam posts


Let's look at some of the options Drupal modules offer for controlling user spam, cleaning up posts after users have spammed, and purging unwanted users from a Drupal website.

1. Ban Unpublish Module

This is probably the best option, since it also integrates with the Advanced User module and the Views Bulk Operations (VBO) module.
The Ban and Unpublish module makes it easier to clean up after registered spammers and other problem users by implementing a bulk operation that appears at Home > Administer > User management > Users.
This module adds a single drop-down option to the list page that performs the following in a single pass:
  • Ban the user's email address.
  • Ban the username.
  • Block the account.
  • Kick the user off the server, if active.
  • Unpublish any nodes that were published by the user.
  • Unpublish any comments that were published by the user.
As mentioned above, these operations are also added to VBO and the Advanced User module, so they can be combined with other bulk operations.

2. Country wise banning of spammers

Country Ban can be used to set entire countries to "read only" or to ban their access completely. Setting a country to "read only" disables all account access for that region, automatically logging out any user who resides there and preventing new accounts from being created. The website can still be viewed anonymously from a country set to "read only". The website admin may also set a "complete ban", which blocks the website entirely for all users in the configured region.
Country Ban depends on the IP-based Determination of a Visitor's Country module as well as the Country Codes API module. These modules provide the country of origin that Country Ban needs to filter properly.

3. Advanced user Module

The Advanced User module's main job is to allow additional filtering that is not possible with the built-in Drupal user module. Once you have filtered the users, you can perform operations such as:
  1. Mass emailing
  2. Blocking
  3. Unblocking
  4. Deleting
It allows filtering of users based on the user.module fields and optionally the profile.module fields. The fields available for filtering can be configured in the module settings. For example, a site admin can search through thousands of users to display all users who have not accessed their account.
  • Filtering of users based on:
  1. Permissions
  2. Status
  3. Created
  4. Accessed
  5. Email
  6. User Id
  7. Admin Selected Profile Fields
  • Filters on fields can use the following operators:
  1. Is Equal To
  2. Is Not Equal To
  3. Is Less Than
  4. Is Greater Than
  5. Is Less Than or Equal To
  6. Is Greater Than or Equal To
  7. Contains
  8. Does Not Contain
  9. Begins With
  10. Ends With
  • Multiple filters can be combined with AND or OR operations, giving you fine control over which users are selected.
  • Administrative options include notification of user data changes. The notification emails include:
    • User's email address
    • Links to Google and Yahoo searches for the user's email address, which are great for a quick spammer check.
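The filtering model described above (named operators applied to user fields, combined with AND or OR) can be sketched in a few lines. This is only an illustration of the idea, not the module's code: the field names, the sample user records, and the convention that an `accessed` value of 0 means "never logged in" are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Tuple

# Operator names mirror the list above; the implementations are illustrative.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "is equal to": lambda value, target: value == target,
    "is not equal to": lambda value, target: value != target,
    "is less than": lambda value, target: value < target,
    "is greater than": lambda value, target: value > target,
    "is less than or equal to": lambda value, target: value <= target,
    "is greater than or equal to": lambda value, target: value >= target,
    "contains": lambda value, target: target in value,
    "does not contain": lambda value, target: target not in value,
    "begins with": lambda value, target: value.startswith(target),
    "ends with": lambda value, target: value.endswith(target),
}

Condition = Tuple[str, str, Any]  # (field name, operator name, target value)


def matches(user: Dict[str, Any], conditions: List[Condition], mode: str = "AND") -> bool:
    """Return True if the user record satisfies the conditions under AND or OR."""
    results = [OPERATORS[op](user[field], target) for field, op, target in conditions]
    return all(results) if mode == "AND" else any(results)


# Hypothetical user records; accessed == 0 stands for "never logged in".
users = [
    {"uid": 1, "mail": "admin@example.com", "status": 1, "accessed": 1355900000},
    {"uid": 2, "mail": "cheap-pills@example.ru", "status": 1, "accessed": 0},
    {"uid": 3, "mail": "jane@example.org", "status": 0, "accessed": 0},
]

# Active accounts that have never been accessed.
never_accessed_active = [
    u["uid"]
    for u in users
    if matches(u, [("accessed", "is equal to", 0), ("status", "is equal to", 1)], "AND")
]
print(never_accessed_active)
```

Switching `mode` to `"OR"` widens the selection to users who match any one of the conditions.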

4. Prevent spammers from registering - Spambot

Spambot protects the user registration form from spammers and spambots by verifying registration attempts against the Stop Forum Spam (www.stopforumspam.com) online database. It also adds some useful features to help deal with spam accounts.
This module works well for sites which require user registration before posting is allowed (which is most forums).
  • Checks username, email and IP address against the www.stopforumspam.com blacklist. Blacklisting can be based on any of email, username or IP address (with configurable thresholds).
  • Bulk reporting of users as spammers.
  • Scanning of existing user accounts (via cron).
  • New 'spam' tab for user accounts with some useful information and actions.
  • Uses IP addresses from core statistics and User Stats if they are enabled.
  • Optional auto-reporting of blacklisted registration attempts to www.stopforumspam.com.
If you need more IP address statistics about your users, consider the User Stats module.
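To make the "configurable thresholds" idea concrete, here is a minimal sketch of a threshold check. It assumes the lookup has already returned, for each field, how many times the value appears in the blacklist. The function name, the dictionary shape, and the rule that a threshold of 0 disables a check are assumptions for illustration, not Spambot's actual implementation.

```python
from typing import Dict


def is_blacklisted(appearances: Dict[str, int], thresholds: Dict[str, int]) -> bool:
    """Flag a registration if any checked field meets or exceeds its threshold.

    A threshold of 0 disables the check for that field (an assumption of this sketch).
    """
    for field, limit in thresholds.items():
        if limit > 0 and appearances.get(field, 0) >= limit:
            return True
    return False


# Hypothetical settings: strict on email, lenient on shared IP addresses.
thresholds = {"email": 1, "username": 5, "ip": 20}

print(is_blacklisted({"email": 0, "username": 2, "ip": 3}, thresholds))
print(is_blacklisted({"email": 1, "username": 0, "ip": 0}, thresholds))
```

A higher threshold for IP addresses than for email is one way to reduce the risk of blocking legitimate users who share an address with a past spammer.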

Conclusion

Install the Ban Unpublish, VBO and Advanced User modules and you should be able to handle most spam-control operations, including banning users, removing spam posts and purging spammers. If you still face heavy spam, install the Country Ban module to restrict users from a particular geography. Also install Spambot if your website has heavy traffic, but be careful: it can sometimes block legitimate users too.

Stopping spam in Drupal – Users, Posts, Comments


One of the most common problems faced by Drupal sites is spam: spam users, spam content and spam comments. Spammers do all sorts of nasty things, and they are a huge risk to a website's credibility and growth. Read on for a detailed analysis and recommendations on preventing spam.
I personally would not visit a forum again if its content is unmoderated and full of spam. So how do you avoid spam? There are three methods.
Automatically
Install modules or write code that blocks spam users, comments and content. For example, there is spam.module, which uses Bayesian logic to filter content, and Akismet, which sends the content to the Akismet service to be checked with several tests. Akismet is known to be the best method so far for preventing comment spam. Another option is Bad Behavior, which looks for spammer-like activity and blocks those users. The issue with this method is that genuine users sometimes get blocked too, creating a negative spiral.
Setting up these types of solutions takes some work, and they can slow your site down considerably depending on the amount of spam you get and the power of your server.
Challenging Users
These include Turing tests such as captcha.module, which presents a math problem or numbers embedded in an image, or KittenAuth, the cute alternative, where you are presented with 5 images of animals and you pick the kitty. There are also things you can do in the comment.module setup, like requiring contact information or requiring previews.
The Captcha Riddler module lets you ask users custom questions during registration; if they answer correctly, they are allowed in. This is very useful for niche sites, since spammers will not know the answer to a simple question related to your website's domain.
Administrators prevent spam
This requires work, but if your site has little traffic it is the most effective way. You can disallow commenting without approval in admin/access, and/or use something like Comment Mail, which sends the admin an email with approve/deny links every time a comment is submitted. Or you could have an army of content moderators delete spam when they find it.
Likewise, you can disallow any registration without administrator approval.
The main advantage of this method is that the results are 100%; however, the delay in getting approvals may turn people away.
I generally check the user's details during registration and only allow them to register if they have filled in all profile fields. This method will cut down 95% of spammers: by looking at profile fields during registration you will deny a whole lot of people. Then I moderate content every 12 to 24 hours, delete spam comments and posts, and block the offending users. This method works very well for my niche sites.

So what is the best way to prevent drupal spam? 
Well, there is no single answer. You will have to select the best option based on your problems. It depends on your spam volume, whether you want to block bots or unwanted human users, and how much time you can devote to moderating the website.
  1. Mollom, which is developed by Drupal's founder, is still not mature enough to be used; downtime and sluggish responses plague it for now.
  2. Spamicide – Adds a hidden field to forms that only spambots will see and fill in. This will not work against human spammers.
  3. Spambot – Checks member details against the Stop Forum Spam system. An effective method, but it will not work for anonymous posts. So to use this you will have to stop anonymous posting and also be prepared
To me it seems like the Bad Behavior module and Akismet should work for most users and would be the best option to prevent spam.
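The hidden-field technique mentioned in the list above is easy to picture with a small sketch. This is not the module's code: the field name `homepage_url` and the function name are hypothetical, and the sketch assumes the field is hidden from human visitors with CSS, so only an automated form filler would ever give it a value.

```python
from typing import Mapping

# Hypothetical name for the hidden "honeypot" field in the form.
HONEYPOT_FIELD = "homepage_url"


def looks_like_bot(form_values: Mapping[str, str]) -> bool:
    """A human never sees the hidden field, so any non-blank value suggests a bot."""
    return bool(form_values.get(HONEYPOT_FIELD, "").strip())


human_submission = {"name": "jane", "comment": "Nice post", "homepage_url": ""}
bot_submission = {"name": "x", "comment": "buy now", "homepage_url": "http://spam.example"}

print(looks_like_bot(human_submission), looks_like_bot(bot_submission))
```

As the list notes, a check like this does nothing against a person who fills in the form by hand.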
Note:
If you enable the CAPTCHA module, it disables page caching. So if you have a comment form at the bottom of each page, the page will not be cached: every page is treated as if the user were logged in, and it incurs a database query. To avoid this, do the following to open the comment form on a separate page.
1) go to admin/content/types
2) Click the edit link for the page, blog entry, or other type you want to modify
3) Scroll down to Comment Settings
4) Find "Location of Comment Submission Form"
5) Select the "Display on separate page" option
6) Click Save Content Type
At times the CAPTCHA module does not work effectively; if that is the case, use the reCAPTCHA module. It is very effective and simple to install and use. The reCAPTCHA module uses the reCAPTCHA web service to improve the CAPTCHA system and protect email addresses. Just get your public and private keys from the reCAPTCHA site and plug them into your site.

Crafting an Apache Solr Schema.xml


Using Apache Solr with Drupal is fairly simple thanks to the apachesolr module, but recently we were tasked with making Solr a vital component of a custom Django project. The Drupal module comes with a Solr schema.xml that is already set up specifically to play nice with Drupal, but we had to craft our own. Setting up Solr, filling it with data, and getting it back out again is relatively easy. However, much like taking a north-Philly street brawler and turning him into Rocky, it takes a bit of work to do it well.
Possibly the single most important factor in successfully creating a custom Solr schema is how well you know the data. This is where a slight bit of artisanship comes into play. To be as efficient as possible the schema has to reflect not just the data types, but where the data needs to go and how it is going to be used on a field by field basis.
Anytime that I move data from one point to another, I always take some time to plot out all of the data that I start out with, and where exactly it needs to go. In this case, not just where it needs to go in Solr, but where and how it will be used by viewers of the final site. Only after I have answered those questions do I start thinking about the mapping and how to get it there. For simple one-to-one migrations of data, that process might get you most of the way there, but with Solr you need to be mindful of many other factors.

What is the schema.xml file?

The Solr schema.xml (typically found in the solr/conf/ directory) is where you tell Solr what types of fields you plan to support, how those types will be analyzed, and what fields you are going to make available for import and queries. Solr will then base its Lucene underbelly on what you define. After installation you will have a generic example schema.xml that I highly recommend you read through and even use as a foundation of your own schema. It can be tempting to gloss over large block comments in an example file, but you will find great examples, explanations, and even some solid performance and tuning tips all through the file. It is definitely worth the time to read it over before you set up camp and make it your own.

Field Types, Fields, and Copy Fields

The big picture of the file really breaks down into three major areas. First you define your field types to dictate which Solr Java class each field type uses and how fields of this type will be analyzed. This is where most of the "magic" is outlined in your schema.
Then you define your actual fields. A field has a name attribute, which is what you will use when importing and querying your data, and it points back to a designated field type. This is where you tell Solr what data is indexed and what data is stored.
Optionally you may choose to have some copy fields, which can be used to intercept incoming data going to one field and fork a clone of it off to another field that is free to be a different field type.

You do not have to index and store it all

A concept in Solr that you do not want to miss out on is that not everything has to be indexed and not everything has to be stored. Solr takes a dual path with data, keeping what is indexed completely separate from what is stored. When you think about the complex operations that Solr does on your data to dissect, convert, and if you say so - even turn it backwards in order to make it indexable, it makes sense that you would not be able to turn around and reuse that same data to display back to the viewer. So now you have two fundamental questions to ask yourself on each field:
  1. Will this field need to be searched against?
  2. Will this field need to be presented to the viewer?
The answers to these questions are directly reflected in the indexed and stored attributes to your field elements. These days even terabytes are relatively cheap, but performance is still priceless. If Solr doesn't need it, don't make it wade through it to get what it does need.

Analyzers, Tokenizers, and Token Filters: know 'em - love 'em

Much of the power of Solr comes from the field types that you define and the combination of char filters, tokenizers, and token filters you use against your data. When you set up a field type you can define one shared analyzer, or one for indexing and a completely different one for queries.
Within your analyzer(s) you can start out with char filters. A char filter will modify your data in some way before it is actually analyzed. A good example would be the PatternReplaceCharFilterFactory which can apply a regular expression find/replace on your data before Solr even starts to break it apart.
Each analyzer can have one tokenizer. Which one you use will depend heavily on the data you intend to pass through it. Think of this as the fundamental way that you want the analysis of the data to be approached. From there you can also apply token filters to further manipulate your data. Examples would be referencing a stop words list, applying a list of synonyms, or forcing all characters to lowercase.
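The order of the three stages is easier to see in code. The sketch below is a conceptual model in plain Python, not Solr itself: a regular-expression replacement stands in for a char filter, whitespace splitting stands in for the tokenizer, and lowercasing plus a tiny stop-word list stand in for token filters. All function names and the stop-word list are made up for the illustration.

```python
import re
from typing import Callable, List

STOP_WORDS = {"a", "an", "and", "of", "the"}  # tiny illustrative list


def char_filter(text: str) -> str:
    """Char filter stage: rewrite the raw text before it is tokenized."""
    return re.sub(r"\s*&\s*", " and ", text)


def tokenize(text: str) -> List[str]:
    """Tokenizer stage: exactly one per analyzer; here a whitespace split."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


def stop_word_filter(tokens: List[str]) -> List[str]:
    return [token for token in tokens if token not in STOP_WORDS]


# Token filters run in the order listed, each seeing the previous one's output.
TOKEN_FILTERS: List[Callable[[List[str]], List[str]]] = [lowercase_filter, stop_word_filter]


def analyze(text: str) -> List[str]:
    """Run the chain: char filter, then tokenizer, then each token filter."""
    tokens = tokenize(char_filter(text))
    for token_filter in TOKEN_FILTERS:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Rocky & Bullwinkle Show"))
```

Order matters: if the stop-word filter ran before lowercasing in this sketch, "The" would slip through because the list only contains lowercase words.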

Because index and storage are separate, they can be analyzed separately

A great example of this would be if you were going to index and store a document that had plenty of content but was also thick with HTML markup. When you index that data you want to make sure the markup is not getting indexed, otherwise things like class names might give you false positives down the road. However, when you go to serve up the content to the viewer you are going to want that HTML intact. Solr can handle this by allowing you to define a field type where the index analyzer uses the HTMLStripCharFilterFactory char filter to strip out all HTML markup prior to tokenization. The stored value is not run through the analyzer at all, so search results present the content exactly as it was originally sent.
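A toy model can show why the indexed form and the stored form of a field diverge. The class below is a plain-Python illustration and says nothing about how Solr is implemented: a naive regular expression stands in for HTML stripping (it does not handle entities or malformed markup), and the class and method names are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher, good enough for the sketch


def index_terms(raw_html: str) -> List[str]:
    """Index-time analysis: drop the markup, then split into lowercase terms."""
    text = TAG_RE.sub(" ", raw_html)
    return [term.lower() for term in re.findall(r"\w+", text)]


class TinyIndex:
    """Keeps searchable terms and the stored original in separate structures."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        self.stored: Dict[str, str] = {}

    def add(self, doc_id: str, body: str) -> None:
        self.stored[doc_id] = body  # stored verbatim, markup included
        for term in index_terms(body):
            self.postings[term].add(doc_id)

    def search(self, term: str) -> List[str]:
        doc_ids = sorted(self.postings.get(term.lower(), set()))
        return [self.stored[doc_id] for doc_id in doc_ids]


index = TinyIndex()
index.add("doc1", '<p class="intro">Rocky trains hard</p>')

print(index.search("intro"))  # the class name was never indexed
print(index.search("rocky"))  # the hit comes back with its HTML intact
```

Searching for the class name finds nothing, while a search for a real word returns the original markup untouched.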

Dynamic fields are rad, but specifically defined fields are rad²

The most common usage of dynamic fields is to allow your scripts to pass in data with a suffix like "_s" (example: address_s), and Solr will then process whatever you pass as a string. This comes in handy when you have data structures that may change over time and you want your scripts to handle this gracefully without a coder having to touch either the scripts or the schema. The Drupal module makes heavy and clever use of this feature. The two downsides are that your dynamic fields will typically have more generalized tokenizers and filters applied, and they will most likely index and store everything in order to play friendly with any data you throw at them. Whenever possible I would recommend defining specific fields for any data that is going to carry significant weight in your indexing or queries. This will allow you to get much more granular with how that data is processed.
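The lookup that makes the suffix convention work can be modelled in a few lines. This is a simplified sketch rather than Solr's resolution logic: explicitly defined fields are checked first, then the dynamic patterns in the order listed. The field names, type names and patterns are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional, Tuple

# Hypothetical schema: two explicit fields plus three suffix patterns.
EXPLICIT_FIELDS: Dict[str, str] = {"id": "string", "title": "text_en"}
DYNAMIC_FIELDS: List[Tuple[str, str]] = [
    ("*_s", "string"),
    ("*_i", "int"),
    ("*_t", "text_general"),
]


def resolve_field_type(name: str) -> Optional[str]:
    """Return the field type for a name, preferring explicit definitions."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None  # no explicit field and no pattern matched


for field_name in ("title", "address_s", "page_count_i", "mystery"):
    print(field_name, "->", resolve_field_type(field_name))
```

The explicit `title` field gets its own specialised type, while `address_s` falls through to the generic string pattern, which is the trade-off described above.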

Why send twice when you can copyField

If you need the same field to be indexed or stored in more than one way, there is no need to modify your scripts to send the same field twice. Just use the copy field feature of Solr. Another useful trick is to use copy fields to combine multiple fields into one indexable field. This can greatly simplify running queries.
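Both uses of copy fields, cloning one field into a differently typed field and merging several fields into a single catch-all, can be pictured with a small model. The sketch is plain Python and only mimics the behaviour described here; the rule list and the field names (`text_all`, `title_sort`) are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical rules as (source field, destination field) pairs.
COPY_RULES: List[Tuple[str, str]] = [
    ("title", "text_all"),    # merge title and body into one searchable field
    ("body", "text_all"),
    ("title", "title_sort"),  # clone title into a field of a different type
]


def apply_copy_fields(doc: Dict[str, str], rules: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return the document as field -> list of values, with copies added.

    Copies are taken from the incoming values, so each destination field
    is free to analyze its copy in its own way.
    """
    expanded: Dict[str, List[str]] = {name: [value] for name, value in doc.items()}
    for source, destination in rules:
        if source in doc:
            expanded.setdefault(destination, []).append(doc[source])
    return expanded


incoming = {"title": "Rocky", "body": "A boxer from Philadelphia"}
print(apply_copy_fields(incoming, COPY_RULES))
```

The sending script still posts only `title` and `body`; the extra fields appear on the receiving side.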

Where to start and where to find more

If you have not installed Solr yet, you can find the download and install documentation here: http://lucene.apache.org/solr/downloads.html.
If you are just experimenting you can get by running from the Jetty Java servlet container that comes with the download. But for production and serious development you will want to upgrade to something with a little more muscle, like Apache Tomcat, to dish out your Solr needs.
If this is your first time working with Solr I would recommend spending some time just experimenting with the solr/admin/ interface. A lot can be learned from pushing in some of the example XML files and testing out some queries. This is also a great place to test out your latest field type concoctions.
Some great documentation is provided on some of the most common tokenizers and filters in the apache.org wiki. If those do not meet all of your needs you can also find plenty of others out in the wild. For example, we needed a geo hash that could handle multiple values on a field, so we tracked down and used an alternative for the single value one that comes with Solr.
You will find a lot of material and concepts at play and the best way to find your bearings is to just dive in. When in doubt, push some data, query it, refine the schema as needed.

Deleting / Purging Cache, Users, Nodes, Comments (Housekeeping)


If you have developed a Drupal site and are maintaining it too, you will soon find that your database keeps growing with a lot of unwanted posts, comments and user registrations. This data needs to be purged from time to time to keep storage at a reasonable level and to keep your website limited to relevant content.
I was looking at the different ways in which you can clear or purge data in Drupal. I have logically divided them into 4 categories:
  1. Purging Cache
  2. Purging Users
  3. Purging Comments
  4. Purging posts
Let's look at each one of them.
  1. Cache Actions provides Rules actions for clearing Drupal caches. It currently provides actions for:
  • Clearing Drupal cache bins
  • Clearing CSS/JS cache
  • Clearing The cache of specific views
  • Clearing The cache of specific panel pages
  • Clearing The cache of specific mini panels

  1. The Cleaner module allows the admin to set a schedule for clearing caches, watchdog logs, and old sessions. It is available only for Drupal 5 and Drupal 6.
There are functions in Drupal that cause "expired" entries in some cache tables to be deleted. This is vastly improved in Drupal 6. "Minimum_cache_lifetime" is a partial solution, but still not complete: there are still times and/or cache tables that don't get cleared in any of those scenarios. Many sites will not be affected by this, but a few will (just search on drupal.org and you will see many posts from people having problems).
  1. Also refer to this documentation on clearing cache
  2. The LoginToboggan module offers several modifications of the Drupal login system. Among its features, it can optionally purge unvalidated users from the system at a pre-defined interval (please read the CAVEATS section of INSTALL.txt for important information on configuring this feature!).
  3. The inactive_user module provides Drupal administrators with a way to automatically manage inactive user accounts. This module has two goals: to help keep users coming back to your site by reminding them when they've been away for a configurable period of time, and to clean up unused accounts.
  4. User Prune lets you mass delete inactive users based on criteria you specify.
The module classifies inactive users into two categories: users who have never logged in, and users who have logged in at least once. For users who have never logged in, you can choose to prune based on how long they've been registered. For users who have logged in before, you can choose to prune based on both how long they've been registered and how long it's been since they last logged in. The pruning specification you select can be saved as a cron job or executed a single time.
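The two-category rule reads naturally as a small decision function. The sketch below is an illustration, not User Prune's code. The parameter names are invented, and requiring both limits to be exceeded for users who have logged in before is an assumption about how the two criteria combine.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_prune(
    registered: datetime,
    last_login: Optional[datetime],
    now: datetime,
    max_age_never_logged_in: timedelta,
    max_age_registered: timedelta,
    max_idle: timedelta,
) -> bool:
    """Decide whether an account qualifies for pruning.

    last_login is None for users who have never logged in.
    """
    if last_login is None:
        # Category 1: never logged in, so judge on registration age alone.
        return now - registered > max_age_never_logged_in
    # Category 2: logged in before, so both limits must be exceeded (assumed).
    return (now - registered > max_age_registered) and (now - last_login > max_idle)


now = datetime(2012, 12, 20)
limits = dict(
    max_age_never_logged_in=timedelta(days=30),
    max_age_registered=timedelta(days=365),
    max_idle=timedelta(days=180),
)

# Registered two months ago and never logged in.
print(should_prune(datetime(2012, 10, 20), None, now, **limits))
# Old account whose owner logged in last week.
print(should_prune(datetime(2010, 1, 1), datetime(2012, 12, 13), now, **limits))
```

Passing `now` in explicitly keeps the function easy to test and lets a scheduled run use one consistent timestamp for every account.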
  1. Deleting nodes at the database level is a bad idea because so many tables in the database contain node-related data. If deleting nodes in bulk at admin/content/node is not adequate for your needs, use the node_delete() API function instead. Example usage to delete all nodes of the page type:
$node_type = 'page';
$deleted_count = 0;

// Fetch the IDs of the nodes we want to delete.
$result = db_query("SELECT nid FROM {node} WHERE type = '%s'", $node_type);
while ($row = db_fetch_object($result)) {
  // node_delete() goes through the node API instead of touching tables directly.
  node_delete($row->nid);
  $deleted_count++;
}
// Simple debug message so we can see how many nodes were processed.
drupal_set_message("$deleted_count nodes have been deleted");

  1. To delete duplicate nodes, refer to this code snippet.
  2. Deleting comments – If you have to delete specific comments or spam comments, you can use Views Bulk Operations to mass delete them. Install the Views Bulk Operations module, create a page view of comments, and choose "Bulk Operations" as the view style. Make sure you select "Delete comment" as one of the selected operations. When you visit the view page you created, you'll be able to bulk delete comments.
  3. Deleting nodes can be done with the Views Bulk Operations module in the same way as above.