This presentation is by Christian Heinrich, the project leader for the OWASP “Google Hacking” project.  Presentation published on  Dual licensed under OWASP License and AU Creative Commons 2.5.

OWASP Testing Guide v3 – Spiders/Robots/Crawlers

1. Automatically traverses hyperlinks

2. Recursively retrieves content referenced

Behavior is governed by the Robots Exclusion Protocol, which can be expressed in two ways. The newer method is a META tag such as <META NAME="Googlebot" CONTENT="nofollow">, which is not supported by all robots/spiders/crawlers. The traditional method is a robots.txt file located in the web root directory. Regular expressions in robots.txt are supported by only a minority of crawlers. “User-agent: *” applies to all spiders/robots/crawlers, or you can name a specific robot. The protocol can be intentionally ignored, so it is not a substitute for httpd access control or digital rights management.
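As a minimal sketch of how “User-agent: *” and a named robot interact, the snippet below feeds a made-up robots.txt to Python's standard urllib.robotparser module. The file contents, paths, and the name “SomeOtherBot” are assumptions for illustration only, and other parsers may resolve overlapping groups differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as it might be served from a web root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# In this parser, a robot with its own group follows only that group.
print(parser.can_fetch("Googlebot", "/private/index.html"))
print(parser.can_fetch("Googlebot", "/admin/index.html"))

# Any robot without a named group falls back to the "*" group.
print(parser.can_fetch("SomeOtherBot", "/admin/index.html"))
```

The parser only reports what the file asks for. Nothing stops a crawler from skipping this check, which is why robots.txt cannot serve as access control.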

Testing – Robots Exclusion Protocol

  1. Sign into Google Webmaster Tools
  2. On the dashboard, click the URL
  3. Click “Tools”
  4. Click “Analyze robots.txt”

Search Engine Discovery

Microsoft Remote Desktop Web Connection: intitle:Remote.Desktop.Web.Connection inurl:tsweb

VNC: "VNC Desktop" inurl:5800

Outlook Web Access: inurl:"exchange/logon.asp"

Outlook Web Access: intitle:"Microsoft Outlook Web Access – Logon"

Adobe Acrobat PDF: filetype:pdf

Google caught on to this and now displays a “We’re sorry” message for certain searches.  To get around this, use different search queries that return overlapping results.

Google Advanced Search Operators: “site:” and “cache:”.  There are two ways of using “site:”: either as “” where you get that specific subdomain’s results, or as “” where you get all hostnames and subdomains.  Use “” to display an indexed web page from the Google cache.  Search results also carry a link labeled “Cached”, which does the same thing.
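The two forms of “site:” and the “cache:” operator can be sketched as plain query strings. The hostnames below (example.com, www.example.com), the search endpoint, and the helper names are all assumptions chosen for illustration; they are not taken from the presentation.

```python
from urllib.parse import urlencode

# Assumed search endpoint, used only to show how the query is encoded.
SEARCH_URL = "https://www.google.com/search"


def site_query(target: str, terms: str = "") -> str:
    """Build a "site:" query for a hostname or a whole domain."""
    return f"site:{target} {terms}".strip()


def cache_query(page: str) -> str:
    """Build a "cache:" query for a single indexed page."""
    return f"cache:{page}"


def search_url(query: str) -> str:
    """URL-encode a query string into a search URL."""
    return f"{SEARCH_URL}?{urlencode({'q': query})}"


# One specific hostname only.
print(search_url(site_query("www.example.com")))
# The bare domain, covering all hostnames and subdomains.
print(search_url(site_query("example.com", "login")))
# The cached copy of one page.
print(search_url(cache_query("www.example.com/index.html")))
```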

You can get updates of the latest relevant Google results (web, news, etc) using Google Alerts.

Download Indexed Cache

Google SOAP Search API.  Each query is limited to either 10 words or 2048 bytes.  The API allows one thousand search queries per day, and each search can only reach results 0-999.  That gives up to 10K possible results per day, spread over 10 different search queries.

$Google_SOA_Search_API -> doGoogleSearch( $key, $q, $start, $maxResults, $filter, $restricts, $safeSearch, $lr, $ie, $oe );

See presentation for response.
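The quota figures above can be checked with a little arithmetic. This sketch assumes the API returns at most 10 results per doGoogleSearch call, which is not stated in these notes; the constant and function names are made up for illustration.

```python
# Assumption: each doGoogleSearch call returns at most 10 results.
MAX_RESULTS_PER_CALL = 10
# From the notes: only results 0-999 are reachable for one search.
RESULT_WINDOW = 1000
# From the notes: one thousand calls are allowed per day.
DAILY_CALLS = 1000


def start_offsets(window: int = RESULT_WINDOW,
                  page: int = MAX_RESULTS_PER_CALL) -> list[int]:
    """Return the $start values needed to page through one search."""
    return list(range(0, window, page))


# Calls needed to exhaust the result window of a single search.
calls_per_search = len(start_offsets())
# Number of different searches that fit in the daily quota.
searches_per_day = DAILY_CALLS // calls_per_search
# Total results reachable in one day.
results_per_day = searches_per_day * RESULT_WINDOW

print(calls_per_search, searches_per_day, results_per_day)
```

Under that assumption, paging one search to its limit takes 100 calls, so the daily quota covers 10 different searches and 10,000 results in total.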

Proof of concept tool is “” or “Download Indexed Cache” that downloads the search results.  Licensed under the Apache License 2.0.  Tool produces a URL and cachedSize response.

OWASP Google Hacking Project

Tools are built in Perl using the CPAN modules SOAP::Lite, Net::Google, and Perl::Critic.  The development environment is based on Eclipse with the EPIC plug-in.  Subversion repository is at


Upcoming presentations at ToorCon X in San Diego, SecTor 2008 in Toronto, Canada, and RUXCON 2K8 in Sydney, Australia.

“TCP Input Text” Proof of Concept

“Speak English” Google Translate Workaround

Refactor and 3rd Project review of PoC Perl Code with public release at RUXCON 2K8 in November 2008.

Check in at after RUXCON 2K8

4 hr “half day” training course Q1 2009