
Monday, August 20, 2012

A simple HTTP PHP class to crawl a URL for internal and external URLs


Here's a simple PHP class I wrote to crawl a URL and return a list of internal and external URLs. I've used it in the past, for development purposes only, to find 404s and repetition in URL structure. Note that it does not read robots.txt files or obey any similar rules. Just thought I'd pull it out of the archives and share it on the web.

#!/usr/bin/php

<?php
class Crawl {

  protected $regex_link;
  protected $website_url;
  protected $website_url_base;
  protected $urls_processed;
  protected $urls_external;
  protected $urls_not_processed;
  protected $urls_ignored;

  public function __construct($website_url = NULL) {
  
    // enable error tracking, grr.
    ini_set('track_errors', true);
    
    // setup variables
    $this->regex_link = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU";
    $this->urls_processed = array();
    $this->urls_external = array();
    $this->urls_not_processed = array();
    $this->urls_ignored = array(
      '/search/apachesolr_search/',
      '/comment/reply/',
    );
    
    // validate argument(s)
    $result = $this->validate_arg_website_url($website_url);
        
    // error check
    if (!$result) {
      return FALSE;
    }
    
    // set website argument
    $this->website_url = $website_url;
    
    // get url base
    $url_base = $this->get_url_base($this->website_url);
    
    // error check
    if (!$url_base) {
      return;
    }
    
    // set website url base
    $this->website_url_base = $url_base;
    
    // add url to list of urls to process
    $this->urls_not_processed[] = $this->website_url;
    
    while(count($this->urls_not_processed)) {
      $this->process_urls_not_processed();
    }
    
    // sort data
    sort($this->urls_processed);
    sort($this->urls_external);
    
  }
  
  protected function validate_arg_website_url($website_url = NULL) {
  
    // validate argument
    if (!(is_string($website_url) && (substr($website_url,0,7)=='http://' || substr($website_url,0,8)=='https://'))) {
      return FALSE;
    }

    return TRUE;    
      
  }
  
  protected function get_url_base($url = NULL) {
  
    // validate url
    if (!$url || !strlen($url)) {
      return FALSE;
    }
    
    $url_parts = parse_url($url);
    
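    // validate: parse_url() must succeed and yield a host
    if (!is_array($url_parts) || empty($url_parts['host'])) {
      return FALSE;
    }
    
    // explode host on '.'
    $exploded = explode('.', $url_parts['host']);
    
    // hosts without a dot (e.g. 'localhost') are their own base
    if (count($exploded) < 2) {
      return $url_parts['host'];
    }
    
    // return host and domain extension
    $url_base = $exploded[count($exploded)-2] . '.' . $exploded[count($exploded)-1];
    
    return $url_base;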

  }

  protected function scan_url($url) {

    // validate url
    if (!is_string($url) || !$url || !strlen($url)) {
      return FALSE;
    }

    // ensure url has not already been processed
    if (in_array($url, $this->urls_processed)) {
      return FALSE;
    }
    
    // add url to processed list
    $this->urls_processed[] = $url;

    // load page contents; suppress the warning and rely on the return value,
    // since track_errors and $php_errormsg no longer exist in PHP 8
    $page_contents = @file_get_contents($url);

    // check for a failed or empty response
    if ($page_contents === false || !strlen($page_contents)) {
      return FALSE;
    }

    // execute regex
    preg_match_all($this->regex_link, $page_contents, $matches);
   
    if (is_array($matches) && isset($matches[1])) {
      return array_unique($matches[1]);
    }
   
    return FALSE;

  }
  
  protected function process_matches($matches = NULL) {
  
    // validate
    if (!$matches || !is_array($matches) || empty($matches)) {
      return FALSE;
    }
    
    foreach ($matches as $match) {
      
      // ensure match exists
      if (empty($match)) {
        continue;
      }
      // ignore anchors
      elseif (substr($match,0,1)=='#') {
        continue;
      }
      // ignore javascript
      elseif (substr($match,0,11)=='javascript:') {
        continue;
      }
      // ignore mailto
      elseif (substr($match,0,7)=='mailto:') {
        continue;
      }

      // check for internal urls that begin with '/'
      if (substr($match,0,1)=='/') {
        $match = 'http://' . $this->website_url_base . $match;
      }
      
      // remove trailing slash
      if (substr($match, -1)=='/') {
        $match = substr($match, 0, -1);
      }
      
      // ensure href starts with http or https
      // NOTE: this needs work, URL could begin with relative paths like '../', ftp://, etc.
      if (!(substr($match,0,7)=='http://' || substr($match,0,8)=='https://')) {
        $match = 'http://' . $this->website_url_base . '/' . $match;
      }

      // check if url is to be ignored
      foreach ($this->urls_ignored as $ignored) {
        if (stripos($match, $ignored) !== FALSE) {
          continue 2;
        }
      }

      // get url base
      $url_base = $this->get_url_base($match);
      
      // check for external url
      if ($url_base != $this->website_url_base) {
      
        if (!in_array($match, $this->urls_external)) {
          $this->urls_external[] = $match;
        }
        continue;
      
      }
      
      // check if url has already been processed
      if (in_array($match, $this->urls_processed)) {
        continue;
      }

      // add url to list of urls to process
      if (!in_array($match, $this->urls_not_processed)) {
        $this->urls_not_processed[] = $match;
      }      
    
    // end: foreach  
    }
    
    return TRUE;
  
  }
  
  protected function process_urls_not_processed() {
  
    if (empty($this->urls_not_processed)) {
      return FALSE;
    }
  
    // get unprocessed url
    $url = array_shift($this->urls_not_processed);
    
    // scan url
    $matches = $this->scan_url($url);

    // error check
    if (!$matches || !is_array($matches) || empty($matches)) {
      return FALSE;
    }
  
    $this->process_matches($matches);
  
  }
  
  public function output_all_urls() {
  
    echo "===== INTERNAL URLS =====\n";
    foreach ($this->urls_processed as $url) {
      print $url . "\n";
    }
  
    echo "===== EXTERNAL URLS =====\n";
    foreach ($this->urls_external as $url) {
      print $url . "\n";
    }
  
  }

}
?>


It can be used like this:
<?php
$website_url = 'http://www.example.com';
$crawl = new Crawl($website_url);
$crawl->output_all_urls();
?>
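
The NOTE inside process_matches() admits that relative paths like '../' and schemes like ftp:// are not handled. Here is a minimal sketch of one way to resolve them. The helper resolve_href() is hypothetical and not part of the class above, and it ignores query strings that contain slashes.

```php
<?php
// Hypothetical helper, not part of the Crawl class above.
// Resolves an href found on $page_url into an absolute http(s) URL,
// or returns NULL for schemes the crawler cannot fetch.
function resolve_href($href, $page_url) {
  $base = parse_url($page_url);
  if (!is_array($base) || !isset($base['scheme'], $base['host'])) {
    return NULL;
  }
  $root = $base['scheme'] . '://' . $base['host'];
  if (isset($base['port'])) {
    $root .= ':' . $base['port'];
  }

  // already has a scheme: keep http(s), reject ftp://, mailto:, etc.
  if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $href)) {
    return preg_match('/^https?:\/\//i', $href) ? $href : NULL;
  }
  // protocol-relative, e.g. //cdn.example.com/x.js
  if (substr($href, 0, 2) === '//') {
    return $base['scheme'] . ':' . $href;
  }
  // root-relative path, or path relative to the current page's directory
  if (substr($href, 0, 1) === '/') {
    $path = $href;
  }
  else {
    $dir = isset($base['path']) ? $base['path'] : '/';
    // drop the file name, keep the directory
    $dir = substr($dir, 0, strrpos($dir, '/') + 1);
    $path = $dir . $href;
  }

  // collapse '.' and '..' segments
  $segments = array();
  foreach (explode('/', $path) as $segment) {
    if ($segment === '..') {
      array_pop($segments);
    }
    elseif ($segment !== '.' && $segment !== '') {
      $segments[] = $segment;
    }
  }
  return $root . '/' . implode('/', $segments);
}
```

Like the class, this drops trailing slashes. Unlike the class, it keeps the scheme and full host of the page the link was found on, rather than prepending 'http://' and the two-part base.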

Building Custom Panel Panes (CTools Content Types) in Drupal 7


Content types (a.k.a. panel panes) are one of CTools' most visible, versatile, and easy to learn APIs. In this tutorial, we're going to build a custom content type (or panel pane, depending on what you call it) that displays a twitter user's latest tweets. As with any new CTools content type, you need to have at minimum three files in the following structure:
  1. gtwitpane/gtwitpane.module
  2. gtwitpane/gtwitpane.info
  3. gtwitpane/plugins/content_types/gtwitpane_pane.inc
Our .info file is pretty standard. Note we don't need to define gtwitpane_pane.inc in the files array.
gtwitpane.info:
name = Ghetto Twitter Pane
description = Provides a ghetto twitter pane to panels
core = 7.x
dependencies[] = ctools
files[] = gtwitpane.module
The hook in our .module file alerts CTools to load our content types (panel panes) from the module's plugins directory. You may have noticed your module doesn't need to explicitly declare individual content types. CTools is dumb (in a good way) and assumes any .inc file in /plugins/content_types needs to be loaded as a plugin. So, unlike normal, you don't even have to clear your cache for a new plugin to get discovered. However, if you are editing a panel and add a pane, you'll want to update and save that pane before you will see your content type.
gtwitpane.module:
<?php
/**
 * Implements hook_ctools_plugin_directory().
 *
 * This lets ctools know to scan my module for a content_type plugin file.
 * Detailed docs in ctools/ctools.api.php
 */
function gtwitpane_ctools_plugin_directory($owner, $plugin_type) {
  // we'll be nice and limit scandir() calls
  if ($owner == 'ctools' && $plugin_type == 'content_types') {
    return 'plugins/content_types';
  }
}
?>
As you can see, CTools content types are much cleaner than hook block. A config form is not required, but included in this example.
gtwitpane_pane.inc:
<?php
/**
 * This plugin array is more or less self documenting
 */
$plugin = array(
  // the title in the admin
  'title' => t('Ghetto Twitter pane'),
  // no one knows if "single" defaults to FALSE...
  'single' => TRUE,
  // oh joy, I get my own section of panel panes
  'category' => array(t('Ghetto Twitter'), -9),
  'edit form' => 'gtwitpane_pane_content_type_edit_form',
  'render callback' => 'gtwitpane_pane_content_type_render',
);

/**
 * Run-time rendering of the body of the block (content type)
 * See ctools_plugin_example for more advanced info
 */
function gtwitpane_pane_content_type_render($subtype, $conf, $context = NULL) {
  // our output is generated by js. Any markup or theme functions
  // could go here.
  // that private js function is so bad that fixing it will be the
  // basis of the next tutorial
  $block = new stdClass();
  $block->content = _gtwit_ghetto_js_that_is_bad($conf['twitter_username']);
  return $block;
}

/**
 * 'Edit form' callback for the content type.
 */
function gtwitpane_pane_content_type_edit_form(&$form, &$form_state) {
  $conf = $form_state['conf'];
  $form['twitter_username'] = array(
    '#type' => 'textfield',
    '#title' => t('twitter username'),
    '#size' => 50,
    '#description' => t('A valid twitter username.'),
    '#default_value' => !empty($conf['twitter_username']) ? $conf['twitter_username'] : 'nicklewisatx',
  );
  // no submit
  return $form;
}

/**
 * Submit function, note anything in the formstate[conf] automatically gets saved
 */
function gtwitpane_pane_content_type_edit_form_submit(&$form, &$form_state) {
  $form_state['conf']['twitter_username'] = $form_state['values']['twitter_username'];
}

/**
 * This js handling kills kittens.
 */
function _gtwit_ghetto_js_that_is_bad($twitter_username) {
  $output = '<script src="http://widgets.twimg.com/j/2/widget.js"></script>';
  $output .= "<script>new TWTR.Widget({
            version:2,type:'profile',rpp:3,interval:6000,width:'auto',height:'auto',theme:{shell:{background:'#ffffff',color:'#000000'},
            tweets:{background:'#ffffff',color:'#000000',links:'#227eb3'}},
            features:{scrollbar:false,loop:false,live:false,hashtags:true,timestamp:true,avatars:false,behavior:'all'}}).render().setUser('$twitter_username').start();
  </script>";
  return $output;
}
?>
That's all there is to it. Stay tuned, next we're going to cover the CTools object cache, as well as javascript technique in Drupal 7.
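
One reason that js helper kills kittens is that the username is dropped into the script unescaped. Here is a minimal sketch of a safer way to build the final call, in plain PHP. The function name example_twitter_user_js() is made up for this illustration. In a real Drupal 7 module, drupal_json_encode() would likely be the more natural choice.

```php
<?php
// Hypothetical variant of the widget snippet's final call; the function name
// is made up for this sketch.
function example_twitter_user_js($twitter_username) {
  // json_encode() yields a quoted, escaped JavaScript string literal;
  // the JSON_HEX_* flags also escape <, >, & and quotes so the value
  // cannot break out of the surrounding <script> element
  $flags = JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT;
  return '.setUser(' . json_encode($twitter_username, $flags) . ').start();';
}

echo example_twitter_user_js('nicklewisatx'), "\n";
// prints: .setUser("nicklewisatx").start();
```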
Further Information and Notes:
  1. For more advanced examples of ctools content types, take a look at ctools/ctools_plugin_example/plugins/content_types. I assume ctools_plugin_example works in 7 :-D.
  2. The correct term for "panel pane" in the CTools world is "content type", and "ctools content types" have nothing to do with "node content types". Confused yet?
  3. Also check out advanced_help for ctools. Believe it or not, there is *tons* of ctools documentation "hidden" there.

40+ Essential Drupal Modules


If you are new to drupal, then this list is for you. These are some of the best of the best drupal modules. Everything from standard framework modules to location and mapping is covered. Note that if you've been immersed in drupal for some time, then this will be "old news".

The Big Three

"The big three" are important enough that they deserve a category of their own. Most drupal modules worth using have integrated with one of these three. Their importance simply can't be stressed enough.
  • Content Construction Kit (CCK) - Part of drupal 7; still a contrib in drupal 6. Allows you to define new content types (e.g. blog entry, event, or employee record...) and add "fields" to them. A field could be plain text, an image, a flash video, or whatever. You can also adjust how these fields display in the live view. No drupal install should be without this module.
  • Views - Broadly speaking, this module empowers non-programmers to build dynamic streams of content displaying any number of fields. The content may come from nodes (a.k.a. content types and fields), users, system log entries, etc. You can display this stream in any number of formats including RSS feeds, tables, or just the vanilla view for a content type. You can also create pages or blocks -- it's very tightly interwoven with drupal. Nearly every drupal module worth using is integrated with this module. Extremely powerful when used in combination with CCK.
  • Panels - I believe Panels + CCK & Views is a hint at what drupal will look like 3 years into the future. I had to change my pants after the first time I witnessed it. At a very simple level, you could think of it as a layout manager. Create a 1, 2, or 3 column layout. Or a 3 column layout with a full width footer and header, and plop pieces of content in them -- say a view, a block, or a node. That description, however, does not do it justice. Since version 3, it's positioned itself as a replacement for drupal core's clunky block system. It can now override a node page, and can be used to place content all over the place. It also introduced the concepts of contexts, selection rules, and relationships. These are concepts that deserve a series of blog posts, but let's just say it's solving some of the weirdest, mind-numbing, bug-creating problems found in advanced websites. Ironically, I used to hate this module, but after version 3 I will defend its awesomeness to the death!

For Administration Sanity

  • Admin Menu - Quick Dropdown menu to all admin areas. Makes any setting only a click away, instead of 3 to 6 clicks away.
  • RootCandy - A theme specially designed for administration. Drupal 7 comes with an admin theme included, but this is still highly recommended in drupal 6.

Content and SEO

  • Pathauto - Automatically create human readable URLs from tokens. A token is a piece of data from content, say the author's username, or the content's title. So if you set up a blog entry to use tokens like [author-name]/[title], then a blog entry by "Phil Witherspoon" titled "my great day" will be rewritten to example.com/phil-witherspoon/my-great-day.
  • Printer, email, and PDF Versions - There are still people out there who prefer to print out content to read later. This module does just that, and also lets them send your content via email.
  • NodeWords - A very poorly named module that's great at letting you edit meta tags.
  • Page Title - Lets you set an alternative title for the <title></title> tags and for the <h1></h1> tags on a node.
  • Global Redirect - Enforces numerous well thought out SEO rules. For example, since I don't use this module, you could access my content at "http://www.nicklewis.org/node/1062". This module, however, will search for the alias and 301 to the proper URL http://www.nicklewis.org/40-essential-drupal-6-modules. (thanks Jeff!)
  • Path Redirect - Simple idea: make it easy to redirect from one path to another. Does a good job at it.
  • Taxonomy manager - Makes large additions, or changes to taxonomy really really easy and painless.
  • Node Import - Made it shockingly easy to import 2000 CSV rows, tie rows to CCK fields (or locations), and it will even file them under the right taxonomy terms in a hierarchy, so long as you plan ahead.
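
To make the Pathauto token idea above concrete, here is a minimal sketch in plain PHP of how a pattern like [author-name]/[title] could turn into a path. The function names example_slug() and example_alias() are made up for this illustration; Pathauto's real implementation differs and handles far more cases (transliteration, duplicate aliases, and so on).

```php
<?php
// Hypothetical illustration only; Pathauto's real implementation differs.

// Turn arbitrary text into a lowercase, hyphen-separated URL segment.
function example_slug($text) {
  $text = strtolower(trim($text));
  // replace every run of non-alphanumeric characters with one hyphen
  $text = preg_replace('/[^a-z0-9]+/', '-', $text);
  return trim($text, '-');
}

// Replace each [token] in $pattern with the slug of its value;
// unknown tokens become an empty string.
function example_alias($pattern, array $values) {
  return preg_replace_callback('/\[([a-z-]+)\]/', function ($m) use ($values) {
    return isset($values[$m[1]]) ? example_slug($values[$m[1]]) : '';
  }, $pattern);
}

echo example_alias('[author-name]/[title]', array(
  'author-name' => 'Phil Witherspoon',
  'title' => 'my great day',
)), "\n";
// prints: phil-witherspoon/my-great-day
```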

Navigation

  • Menu Block - Lets you split menus into separate blocks based on depth. Say you have a top level menu link "Articles" with sub menu links "Politics", "Technology", "Lifestyle". This block would let you show the sub menus in the right sidebar, and the top level "Articles" as tabs in the header.
  • Taxonomy Menu - Automatically generate menu items for categories. Handles syncing between taxonomy and menus, and is ready to be used in conjunction with views or panels.
  • Custom Breadcrumbs - Set up custom breadcrumb paths for content so that every page doesn't just have a breadcrumb back to "home". (note: I've used menu_trails a lot too.)
  • Nice Menus - Drop down menus (for people who are into that kind of thing).

WYSIWYG Editors + Image Uploading

  • WYSIWYG API - The standard integration module.
  • CKEditor - Currently my favorite WYSIWYG editor. WYSIWYG API only supports CKEditor on its dev version (at the time of this writing). For the time being, I use this module instead of WYSIWYG API. Regardless, the rest of the world probably uses WYSIWYG API.
  • IMCE - File browser / image inclusion for WYSIWYG editors. CKEditor is integrated out of the box; WYSIWYG API implementations require a bridge module.

Video and Image Handling

  • Filefield - Base CCK file upload field. Useful on its own, but also required by other essential modules.
  • ImageAPI, ImageCache, ImageField - These three work together. ImageAPI handles low level integration with server side image processing (e.g. ImageMagick). ImageCache allows you to set up presets for automatic resizing, cropping, and a host of other operations you'll probably never need to use. ImageField then provides an upload field to a piece of content, which you can use ImageCache presets to resize in the display. ImageField is very well integrated with Views and CCK. The paintings on the right show a bunch of images automatically resized using this technique.
  • Lightbox2 - If you've set up your imagefields, lightbox2 lets you add another layer of options. For example, display image resized at 300px wide on the page, but blow it up to full size when clicked. Like Imagefield, lightbox 2 is well integrated with Views and CCK. Very powerful combination.
  • Embedded Media Field - Embed video and audio files from dozens of third party providers ranging from youtube, to services you've probably never heard of.

User Profile, Ratings & Notifications

  • Content Profile - The core profile module sort of sucks. This turns profiles into nodes allowing you all the options of views and CCK.
  • Voting API + Fivestar - The standard voting widget of Drupal.
  • Notifications - Provides the ability to send emails when someone comments, or replies to a comment. Has a host of other features.
  • Captcha + Recaptcha - Standard Antispam system. In use on this very site.

Stuff Marketers Will Love

  • Webform - We all know visitors love filling out forms. This module lets your marketing team create custom forms, and collect whatever info they want.
  • Google Analytics - Simple integration of drupal with google Analytics.
  • Service Links - Easy "share this" links for content. Supports digg, facebook, delicious and a bunch of other social web 2.0 services.

Events and Calendars

  • Date - CCK field for handling dates, and date ranges.
  • Calendar - Integrated and controlled by views.

Location and Mapping

  • Location - Standard API for collecting addresses and lat/long. Integrated with Views and CCK. Somewhat difficult to use, but it's a somewhat difficult problem it solves.
  • Gmap - Display locations in GMap.

Ecommerce

For Developers

  • Devel - Offers an enormous amount of information for developers, including: theme template variables, and overrides, browsable data structures, and datasets for performance-tuning. Just the debug function dsm(); makes it worth the download.
  • Backup & Migrate - Greatly eases the pain of moving changes from your local development environment to the live server and vice versa.
  • Drush - It's actually not really a module, but a "Swiss army knife" set of tools that are run from a command line. One example command is "drush dl views": running it will automatically download the latest version of views and place it in the right drupal folder. A 1 second command instead of a 1 minute process of downloading from drupal and uploading via FTP. There are many other commands that are just as useful.

Web services in Drupal 8


The future is a world where content management systems need to output data to many more devices and integrate with more and more systems and services. Today, Drupal is optimized for outputting HTML and core ships with an old XML-RPC backend. If we want Drupal to be the go-to platform in a world with many different devices and integrated services, we need to fundamentally change that.
The goal of Drupal 8's Web Service Initiative is to make Drupal equally good at outputting data as XML, JSON and other non-HTML formats, to expose Drupal's functionality through a RESTful interface, but also for Drupal to better support different page layouts when delivering HTML pages to different devices.
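
As a rough illustration of "equally good at outputting data as XML, JSON and other non-HTML formats", here is a minimal sketch in plain PHP that renders the same data in whichever format is requested. The function name example_render() and the data shape are assumptions made for this example, not anything from Drupal core.

```php
<?php
// Hypothetical sketch; the function name and data shape are assumptions.
// Keys are assumed to be valid XML element names.
function example_render(array $item, $format) {
  switch ($format) {
    case 'json':
      return json_encode($item);

    case 'xml':
      $xml = '<item>';
      foreach ($item as $key => $value) {
        $xml .= "<$key>" . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . "</$key>";
      }
      return $xml . '</item>';

    default:
      // fall back to an HTML fragment
      $html = '<dl>';
      foreach ($item as $key => $value) {
        $html .= '<dt>' . htmlspecialchars($key, ENT_QUOTES, 'UTF-8') . '</dt>'
          . '<dd>' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '</dd>';
      }
      return $html . '</dl>';
  }
}

$node = array('title' => 'Hello & welcome', 'author' => 'crell');
echo example_render($node, 'json'), "\n";
echo example_render($node, 'xml'), "\n";
echo example_render($node, 'html'), "\n";
```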
HTML5 is obviously a big part of such a future as well; however, for that I'll set up a dedicated initiative. I believe both can be worked on (mostly) in parallel and by different people. I'm still talking to different people about the HTML5 initiative. Both initiatives combined should get us in a great position with Drupal 8.

Larry Garfield

I've decided to make Larry Garfield (aka Crell) the Initiative Owner for the Web Services Initiative in Drupal 8. As the Initiative Owner, Larry will act as the project manager and/or technical lead. In this role, Larry and I will work closely together on defining the architecture and approach. This means I don't need to be involved in every small conversation myself, but that I still need to sign off on the approach and implementation.
Larry was the obvious choice for two reasons. First, he is a great architect able to tackle complex problems, as demonstrated by his work on Drupal 7's database abstraction layer. Second, Larry has already put a lot of thought and effort into this.
As many of you know, this initiative is near and dear to my heart. I trust that Larry will do a great job. For more details, please read Larry's announcement blog post or join the technical conversation.