XML Search Feed

  • Purchase access to Gigablast's index on a cost per usage basis.

  • The cost is $2.50 for every 1000 (one thousand) queries performed on the precise search engine.

  • The cost is $1.00 for every 1000 (one thousand) queries performed on the fast search engine.

  • The XML search feed searches over 1 billion of the top pages on the web.

  • Cached web pages (archived copies) are provided as part of the feed service and the retrieval of one archived page counts as a single query.

  • Sign up now to start accessing the feed.

  • You can use the search results however you want. You can rearrange them, embed ads, etc.
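
The pricing above is simple per-query arithmetic. The following is a minimal sketch for estimating a bill. The helper name `estimate_cost` is hypothetical, and the sketch assumes partial blocks of 1000 queries are billed pro rata, which the pricing text does not state.

```python
from decimal import Decimal

# Rates quoted above, in US dollars per 1000 queries.
PRECISE_RATE_PER_1000 = Decimal("2.50")
FAST_RATE_PER_1000 = Decimal("1.00")


def estimate_cost(precise_queries: int, fast_queries: int) -> Decimal:
    """Estimate the feed cost in dollars (hypothetical helper).

    Assumption: partial blocks of 1000 queries are billed pro rata.
    """
    if precise_queries < 0 or fast_queries < 0:
        raise ValueError("query counts must be non-negative")
    total = (PRECISE_RATE_PER_1000 * precise_queries
             + FAST_RATE_PER_1000 * fast_queries) / 1000
    # Round to whole cents for display.
    return total.quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(estimate_cost(precise_queries=20000, fast_queries=50000))
```

Each cached page retrieval counts as one query. The text does not say at which rate it is billed, so add those retrievals to whichever count applies to your account.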

Search Feed Input

To get search results from Gigablast, use a URL like:
http://www.gigablast.com/search?q=test&xml=1&userid=123456&code=abcd123 where:

userid=X X is the secret User ID you were issued when making a successful deposit into your account. This is required.
code=X X is the secret XML Feed Code you were issued when making a successful deposit into your account. This is required.
q=X X is the query in UTF-8. See some examples of queries and special operators.
precise=1 Specify precise=1 to use the more accurate, but slower, index. If you do not specify precise=1 as a CGI parameter, the faster, but less accurate, index is used by default.
xml=1 Use this to request the XML feed, otherwise you will get HTML.
n=X returns X search results. Default is 10. Max is 50.
s=X returns results starting at result #X. The first result is result #0. Default is 0. Max is 499.
ns=X returns X summary excerpts in the summary of each search result.
site=X returned results will have URLs from the site, X.
sites=X returned results will have URLs from the space-separated list of sites, X. X can be up to 500 sites. A site can include sub folders. This allows you to build a Custom Topic Search Engine.
plus=X returned results will have all words in X. Like a default AND.
minus=X returned results will not have any words in X.
sc=X X can be 0 or 1 to respectively disable or enable site clustering. Default is 1.
dr=X X can be 0 or 1 to respectively disable or enable duplicate result removal. Default is 1.
psc=X X ranges from 0 to 100 and is the 'percent similar cutoff' such that a search result that is X% similar to a search result above it will be hidden from view. psc is only valid when dr is set to 1 (see above). If psc is 100 then only documents that are exactly alike are deduped. Default is 80, but 0 if the raw parameter is used.
qh=X X can be 0 or 1 to respectively disable or enable highlighting of query terms in the titles and summaries. Default is 1, but 0 if the raw parameter is used.
dt=X X is a space-separated string of meta tag names. Do not forget to URL-encode the spaces as + or %20. Gigablast will extract the contents of the specified meta tags from the pages listed in the search results and display that content after each summary. For example, &dt=description will display the meta description of each search result, and &dt=description:32+keywords:64 will display the meta description and meta keywords of each search result, limiting the fields to 32 and 64 characters respectively. When used in an XML feed, the <display name="meta_tag_name">meta_tag_content</display> XML tag will be used to convey each requested meta tag's content.
spell=X X can be 0 or 1 to respectively disable or enable spell checking. If enabled while using the XML feed, when Gigablast finds a spelling recommendation it will be included in the <spell> XML tag. Default is 0 if using an XML feed, 1 otherwise.
nrt=X X is the maximum number of related topics, also known as GigaBits, to be displayed.
qlang=X X is a language identifier, typically two letters, like en for English, de for German, or fr for French. Documents known to be in a different language than the one specified are heavily penalized. The default is English.
ff=X X is 1 to enable family filter, 0 otherwise.
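
The parameters above can be assembled into a request URL with the standard library. This is a minimal sketch, not an official client: the function name `build_search_url` is hypothetical, and the lower bounds it checks for n and s are assumptions (only the maximums of 50 and 499 are documented above).

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://www.gigablast.com/search"
MAX_RESULTS = 50   # documented maximum for n
MAX_START = 499    # documented maximum for s


def build_search_url(query, userid, code, n=10, s=0, precise=False, **extra):
    """Build an XML feed search URL (hypothetical helper)."""
    if not 0 <= n <= MAX_RESULTS:
        raise ValueError("n must be between 0 and 50")
    if not 0 <= s <= MAX_START:
        raise ValueError("s must be between 0 and 499")
    params = {"q": query, "xml": 1, "userid": userid, "code": code,
              "n": n, "s": s}
    if precise:
        params["precise"] = 1
    # Any other documented parameter, e.g. dt, sites, sc, dr, psc, ff.
    params.update(extra)
    # urlencode encodes the values as UTF-8 and turns spaces into '+'.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("test", "123456", "abcd123"))
    print(build_search_url("red wine", "123456", "abcd123",
                           dt="description:32 keywords:64"))
```

Note that `urlencode` also percent-encodes the colon in a dt value as %3A, which is a standard encoding that a server decodes back to a colon.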

Site Clustering

It is often undesirable to have many results listed from the same site. Site Clustering limits the number of results returned from any given site to two, but it provides a link which says "more results from this site" in case the searcher wants them.

Duplicate Results Removal

When duplicate results removal is enabled, Gigablast will remove results that have exactly the same content as other results. The psc parameter can be used to also dedup documents with similar content.
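
To make the interaction between dr and psc concrete, here is a toy client-side illustration. It is not Gigablast's algorithm: the similarity measure (Python's `difflib.SequenceMatcher`) is a stand-in chosen for this sketch, and the function name `hide_similar` is hypothetical. It only shows the documented semantics, where a result that is at least psc percent similar to a result above it is hidden, and psc=100 hides only exact copies.

```python
from difflib import SequenceMatcher


def hide_similar(texts, dr=1, psc=80):
    """Toy illustration of the dr/psc semantics (not Gigablast's method)."""
    if dr == 0:
        # Duplicate removal disabled: psc has no effect.
        return list(texts)
    kept = []
    for text in texts:
        # Compare against every result already kept above this one.
        is_dup = any(
            SequenceMatcher(None, text, earlier).ratio() * 100 >= psc
            for earlier in kept
        )
        if not is_dup:
            kept.append(text)
    return kept


if __name__ == "__main__":
    docs = ["a b c", "a b c", "a b d"]
    print(hide_similar(docs, psc=100))  # only the exact copy is hidden
    print(hide_similar(docs, psc=50))   # the near copy is hidden as well
```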


Cached Web Page Input

To get a cached web page from Gigablast, use a URL like:
http://www.gigablast.com/get?d=266571445106&ih=1&q=test&c=main where:


d=X X is the docId of the page you want returned. DocIds are 64-bit, so you'll need 8 bytes to hold one. DocIds can be harvested from the XML search feed output.
c=X X is the collection that contains the document. Usually this is main.
ih=X X is 1 to include the Gigablast header in the returned page, and 0 to exclude it.
ibh=X X is 1 to include the Gigablast BASE HREF tag in the cached page. The default is 1.
q=X X is the query that, when present, will cause Gigablast to highlight the query terms on the returned page.
cas=X X can be 0 or 1 to respectively disable or enable click and scroll. Default is 1.
strip=X X can be 0, 1, 2 or 3. If X is 0 then no stripping is performed. If X is 1 then image and other tags are removed. An X of 2 is another form of removing tags. If X is 3 then all tags are removed. Default is 0.
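
A cached page request can be built the same way as a search request. This is a minimal sketch with a hypothetical helper name, `build_cached_page_url`. It treats docIds as unsigned 64-bit integers, which is an assumption: the text above says only that they are 64-bit.

```python
from urllib.parse import urlencode

CACHE_ENDPOINT = "http://www.gigablast.com/get"


def build_cached_page_url(doc_id, collection="main", query=None,
                          include_header=True, strip=0):
    """Build a cached page URL (hypothetical helper)."""
    # Assumption: docIds are unsigned and fit in 64 bits.
    if not 0 <= doc_id < 2 ** 64:
        raise ValueError("doc_id must fit in 64 bits")
    if strip not in (0, 1, 2, 3):
        raise ValueError("strip must be 0, 1, 2 or 3")
    params = {"d": doc_id, "ih": int(include_header),
              "c": collection, "strip": strip}
    if query is not None:
        # When present, the query terms are highlighted on the page.
        params["q"] = query
    return CACHE_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_cached_page_url(266571445106, query="test"))
```

Python integers are arbitrary precision, so the 8-byte requirement matters mainly when the docId is stored in a fixed-width field such as a database column or a C struct.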


The Output

Gigablast allows you to receive the search results in a number of formats useful for interfacing to your program. Here is an example of the XML feed.

We plan for the output of the precise search engine to have all the same output fields as the fast search engine, but for now it is missing some because the code was significantly overhauled for the new search algorithm and we are still putting pieces back together.

output only from precise search engine (&precise=1) is in green
output only from fast search engine (&precise=0) is in red

The XML reply has the following format (but without the comments):



<?xml version="1.0" encoding="utf-8" ?>

# It consists of one, and only one, response.
<response>

  # the current time on the search engine
  <currentTimeUTC>1373944554</currentTimeUTC>

  # How long in milliseconds to compute these results?
  <responseTimeMS>2373</responseTimeMS>

  # Total number of documents in the collection being searched.
  <docsInCollection>2060245584</docsInCollection>

  # If any error was received in processing the request, it will be here.
  <error><![CDATA[Out of memory]]></error>
  # The numeric code of the error, if any, goes here. If no error, this is 0.
  <errno>32790</errno>

  # Total number of search results for the query. This is an exact count
  # for the precise search engine, but an approximation for the fast search
  # engine.
  <hits>4838158</hits>

  # This is "1" if more results are available after these, "0" if not.
  <moreResultsFollow>1</moreResultsFollow>

  # If present and the value is 1, some words in the query were censored for 
  # adult content. Only used if &ff=1 is specified. (Family Filter)
  <queryCensored>1</queryCensored>

  # If present, the value is the number of results that were censored for 
  # adult content. Only used if &ff=1 is specified. (Family Filter)
  <resultsCensored>3</resultsCensored>

  # If this tag is present, it will hold an alternate spelling recommendation 
  # for the query. The &spell=1 parameter must be present in the query url,
  # however, for you to get a spelling recommendation back.
  <spell><![CDATA[nose]]></spell>

  # If this tag is present, it contains the list of query words that were 
  # ignored as individual words, but not necessarily as part of a phrase
  <ignoredWords><![CDATA[the in of]]></ignoredWords>


  # The list of search results, each one enclosed in a <result> tag.
  <result>

    # Each result has a title. This may be empty if none was found on the page.
    <title><![CDATA[My Homepage]]></title>

    # Each result has a summary. This may be empty. The summary is generated 
    # so as to contain the query terms if possible.
    <sum><![CDATA[All about my interests and hobbies]]></sum>

    # If this result is categorized under the DMOZ Directory, data about each
    # category it is in will be enclosed in a <dmoz> tag.
    <dmoz>
      # The category ID number of this category.
      <dmozCatId>172</dmozCatId>
      # The path of this category in the directory.
      <dmozCat><![CDATA[Health: Dentistry]]></dmozCat>
      # Title of this result as listed in the directory.
      <dmozTitle><![CDATA[My Homepage]]></dmozTitle>
      # Description of this page as listed in the directory.
      <dmozDesc><![CDATA[A Dentist's Home Page]]></dmozDesc>
    </dmoz>
    # If the directory is being given along with the results, this is the number of
    # stars given to this page based on its quality.
    <stars>3</stars>

    # Each result may have a sequence of <display> tags if the feed input
    # contained a dt parameter. This allows you to extract
    # information contained in meta tags in the content of each search result.
    # To obtain the contents of the author meta tag, you would need to pass in
    # dt=author.
    <display name="author"><![CDATA[Contents of the meta author tag]]></display>

    # Each result has a URL. This should never be empty.
    <url><![CDATA[http://www.mydomain.com/mypage.html]]></url>

    # The size of the page in kilobytes. Accurate to the tenth of a kilobyte.
    <size>5.6</size>

    # The time the page was LAST indexed. It may not have been indexed in a 
    # long time if the page's content has not changed. The time is expressed 
    # in seconds since the epoch. (Jan 1, 1970)
    <indexed>1064367311</indexed>

    # The time the page was FIRST indexed. Expressed in UTC 
    # in seconds since the epoch. (Jan 1, 1970)
    <firstIndexedDateUTC>1064367311</firstIndexedDateUTC>

    # The time the page was published.
    <pubDate>1058477041</pubDate>

    # If the pubDate above is really the last modified date, then this is 1. 
    # This is taken from the HTTP reply of the web server when downloading 
    # the page. The time is expressed in seconds since the epoch (Jan 1, 1970)
    # and is in UTC.
    <isModDate>1</isModDate>

    # The assigned docid for this page. This number is unique and used 
    # internally by Gigablast to identify this page. It is used to retrieve the
    # "cached copy" of the page.
    <docId>65990704587</docId>

    # The site the result is from. A site is a measure of control.
    <site><![CDATA[mydomain.com/]]></site>

    # When it was last spidered, a UTC timestamp
    <spidered>1064367311</spidered>

    # When doing site clustering, this tag will be present if the result is 
    # from the same hostname as a previous result for the same query. It 
    # indicates that you might want to indent the result. Any further results 
    # from this same hostname will be stripped from the feed.
    <clustered>1</clustered>

    # This is a standard HTTP MIME content classification of the result. It is 
    # not present if the page is text/html. Otherwise, it will be one of the
    # following: text/plain
    #            text/xml
    #            application/pdf
    #            application/msword
    #            application/vnd.ms-excel
    #            application/mspowerpoint
    #            application/postscript
    <contentType><![CDATA[text/plain]]></contentType>

    # This is the language the page was detected as.
    <language><![CDATA[English]]></language>

    # The character set this page was originally encoded in. 
    <charset><![CDATA[utf-8]]></charset>

  </result>

  <result>

  ...
  </result>

  ...

</response>
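
A reply in the format above can be read with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`. The sample reply is a shortened version of the example above with the explanatory comments removed, and the function name `parse_reply` and the dictionary keys are hypothetical. The sketch assumes a missing errno tag means no error.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SAMPLE_REPLY = """<?xml version="1.0" encoding="utf-8" ?>
<response>
  <errno>0</errno>
  <hits>4838158</hits>
  <moreResultsFollow>1</moreResultsFollow>
  <result>
    <title><![CDATA[My Homepage]]></title>
    <sum><![CDATA[All about my interests and hobbies]]></sum>
    <url><![CDATA[http://www.mydomain.com/mypage.html]]></url>
    <docId>65990704587</docId>
    <spidered>1064367311</spidered>
  </result>
</response>""".encode("utf-8")


def parse_reply(xml_bytes):
    """Return (results, more_results_follow) from a feed reply."""
    root = ET.fromstring(xml_bytes)
    # Assumption: an absent <errno> tag is treated as 0 (no error).
    errno = int(root.findtext("errno", default="0"))
    if errno != 0:
        message = root.findtext("error", default="")
        raise RuntimeError(f"feed error {errno}: {message}")
    results = []
    for node in root.findall("result"):
        doc_id = node.findtext("docId")
        spidered = node.findtext("spidered")
        results.append({
            "title": node.findtext("title", default=""),
            "summary": node.findtext("sum", default=""),
            "url": node.findtext("url", default=""),
            "doc_id": int(doc_id) if doc_id else None,
            # Timestamps are seconds since the Unix epoch, in UTC.
            "spidered": (datetime.fromtimestamp(int(spidered), tz=timezone.utc)
                         if spidered else None),
        })
    more = root.findtext("moreResultsFollow", default="0") == "1"
    return results, more


if __name__ == "__main__":
    results, more = parse_reply(SAMPLE_REPLY)
    for r in results:
        print(r["title"], r["url"], r["spidered"])
    print("more results follow:", more)
```

When moreResultsFollow is 1, the next page can be requested by increasing s by the number of results received, up to the documented maximum of 499.
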
Error Codes



Key
aError used by an add or delete collection operation.
iError used by an inject (or delete) operation.
sError used by a search operation.


C error codes
1  Operation not permitted (a) - Did not have permission in the working dir to create/delete the collection subdir.
2  No such file or directory (a) - When creating the subdir for the collection in the working dir, a directory component in the pathname does not exist or is a dangling symbolic link.
5  Input/output error (a,i,s) - There was an error writing or reading data to or from the disk, most likely due to a hardware failure.
9  Bad file descriptor (a,i,s) - Read or write on a bad file descriptor. This should not happen.
12  Cannot allocate memory (a,i,s) - Out of memory.
13  Permission denied (a,i) - The working directory, or its parent, does not allow write permission.
17  File exists (a) - The collection subdir already exists in the working dir.
28  No space left on device (a,i) - There is no room on the drive to write data because the drive is full, or the user's disk quota is exhausted.
105  No buffer space available (a) - Collection name limit of 16 is exceeded.

Gigablast error codes
32769  Try doing it again (a,i,s) - Resources temporarily unavailable.
32770  Add denied, db is closing (i) - Gigablast is shutting down, so the inject failed.
32771  Record not found (i) - When looking up old document for injected URL it was not found when it should have been. This is due to data corruption.
32775  Could not get the default tagdb record (i) - The default tagdb*.xml (ruleset) file was not found. Make sure that the ruleset used by tagdb or by the Url Filters page for this url is present in the working dir.
32777  Something is wrong with reply (i) - Received bad internal reply. You should never see this error.
32784  Bad engineer (a) - Collection name being added contains an illegal character, or an empty name was provided, or the name is more than 64 characters.
32784  Bad engineer (i) - No URL was provided, or URL has no hostname. Or provided URL is currently being injected. Or 500 injects are currently in progress.
32785  Can not add because db is closing (i) - Gigablast is shutting down, so the inject failed.
32789  Buf too small (i) - Injected URL was longer than 1024 characters. Or the injected document was too big to fit in memory, so consider increasing in gb.conf.
32792  Bad cached document (i,s) - A cached document was corrupt on disk.
32793  Document is missing query terms (s) - A document in the search results did not contain all the query terms.
32795  No docid (i) - No docids were available to inject the URL. The database has reached its limit.
32797  No udp slots available (a,i,s) - There was a shortage of sockets. Please try again.
32811  Doc bad content type (i) - The URL's file extension is not recognized as an indexable file type.
32842  Query too big (s) - Query was too long.
32843  Query was truncated (s) - Query was truncated.
32844  Boolean query has too many operands (s) - Query has too many operands.
32849  Bad mime (i) - The provided HTTP mime (if the hasmime flag was set) was not present or illegal.
32855  DNS sent an unknown response code (i) - DNS error
32856  DNS refused to talk (i) - DNS error
32858  DNS timed out (i) - DNS error
32863  No collection record (a,i,s) - Referenced collection does not exist.
32864  Shutting down the server (i) - Gigablast is shutting down, so the inject failed.
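
Two of the codes above, 32769 ("Try doing it again") and 32797 ("No udp slots available"), are described as temporary conditions, so a client may reasonably retry on them. The sketch below shows one way to do that. The function name `call_with_retry`, the shape of the `request` callable, and the linear backoff are all assumptions for illustration. Treating only these two codes as retryable is a reading of their descriptions, not a documented rule.

```python
import time

# Codes whose descriptions above suggest the request may succeed if retried.
RETRYABLE_ERRNOS = {32769, 32797}


def call_with_retry(request, max_attempts=3, delay_seconds=1.0,
                    sleep=time.sleep):
    """Call request() until errno is 0 or the error is not retryable.

    request is a hypothetical callable returning (errno, payload).
    """
    for attempt in range(1, max_attempts + 1):
        errno, payload = request()
        if errno == 0:
            return payload
        if errno not in RETRYABLE_ERRNOS or attempt == max_attempts:
            raise RuntimeError(f"feed request failed with errno {errno}")
        # Simple linear backoff before the next attempt.
        sleep(delay_seconds * attempt)


if __name__ == "__main__":
    replies = iter([(32769, None), (0, "<response>...</response>")])
    print(call_with_retry(lambda: next(replies), delay_seconds=0.1))
```
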
Copyright © 2013. All rights reserved.