SearchFeed
  • The XML search feed is available for the following: blog search, wikipedia search, dmoz search and web search in general.
  • Gigablast has many powerful features.
  • Combine the XML Search Feed with Site Search or Custom Topic Search for maximum customization.
  • Cached web pages (archived copies) are provided as part of the feed service and the retrieval of one archived page counts as a single query.
  • Here are the Input Parameters
  • To parse the XML output, consider using PHP.
  • All pages in the web directory can be returned in XML by appending a ?raw=8 or ?raw=9 to the url. (Use http://dir.gigablast.com/Top?raw=9 for the top page)
  • Help advance science. By using Gigablast you fund important research in information retrieval. Please contact sales for sales related information.

The Input
To get search results from Gigablast use a url like:
http://www.gigablast.com/search?q=test&sc=0&dr=0&raw=8&nrt=11
where:
n=X returns X search results. Default is 10. Max is 50.
s=X returns results starting at result #X. The first result is result #0. Default is 0. Max is 499.
ns=X returns X summary excerpts in the summary of each search result.
site=X returned results will have URLs from the site, X.
sites=X returned results will have URLs from the space-separated list of sites, X. X can be up to 500 sites. A site can include sub folders. This allows you to build a Custom Topic Search Engine.
plus=X returned results will have all words in X. Like a default AND.
minus=X returned results will not have any words in X.
rat=1 returned results will have ALL query terms. This is also known as a default AND search. rat means Require All Terms.
sc=X X can be 0 or 1 to respectively disable or enable site clustering. Default is 1, but 0 if the raw parameter is used.
dr=X X can be 0 or 1 to respectively disable or enable duplicate result removal. Default is 1, but 0 if the raw parameter is used.
psc=X X ranges from 0 to 100 and is the 'percent similar cutoff' such that a search result that is X% similar to a search result above it will be hidden from view. psc is only valid when dr is set to 1 (see above). If psc is 100 then only documents that are exactly alike are deduped. Default is 80, but 0 if the raw parameter is used.
raw=X X ranges from 0 to 9 to specify the format of the search results. raw=8 requests the XML feed in Latin-1 (ISO 8859-1). raw=9 requests the XML feed in UTF-8.
qh=X X can be 0 or 1 to respectively disable or enable highlighting of query terms in the titles and summaries. Default is 1, but 0 if the raw parameter is used.
bq=X X can be 0 or 1 or 2. 0 means the query is NOT boolean, 1 means the query is boolean and 2 means to auto-detect. Default is 2.
dt=X X is a space-separated string of meta tag names. Do not forget to url-encode the spaces as + or %20. Gigablast will extract the contents of the specified meta tags from the pages listed in the search results and display that content after each summary. For example, &dt=description will display the meta description of each search result, and &dt=description:32+keywords:64 will display the meta description and meta keywords of each search result, limited to 32 and 64 characters respectively. When used in an XML feed, the <display name="meta_tag_name">meta_tag_content</display> XML tag will be used to convey each requested meta tag's content.
spell=X X can be 0 or 1 to respectively disable or enable spell checking. If enabled while using the XML feed, any spelling recommendation Gigablast finds will be included in the <spell> XML tag. Default is 0 if using an XML feed, 1 otherwise.
nrt=X X is the maximum number of related topics, also known as Giga Bits, to be displayed.
sdate=1 Sort results by date.
date1=X X is the minimum publish date to be returned in the search results. Documents with publish dates before X will be removed from the search results.
date2=X X is the maximum publish date to be returned in the search results. Documents with publish dates after X will be removed from the search results.
iu=X X is the url of an image to co-brand on the search results page.
iw=X X is the width of the image in pixels on the search results page.
ih=X X is the height of the image in pixels on the search results page.
lang=X X is a language identifier, typically two letters, like en for English, de for German, or fr for French.
qcs=X Content encoding of the provided query (the q parm). The default is "utf-8". You can also use "iso-8859-1" or any other official character set name.
ff=X X is 1 to enable family filter, 0 otherwise.
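
The parameters above can be assembled into a feed URL programmatically. The following is a minimal Python sketch, not an official client: the helper names build_search_url and page_offsets are hypothetical, and the endpoint is simply the one from the example URL above. It relies on urllib.parse.urlencode to turn spaces into + as the dt description requires (it also percent-encodes the colon as %3A, which decodes to the same value).

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this section.
SEARCH_ENDPOINT = "http://www.gigablast.com/search"


def build_search_url(query, **params):
    """Return a feed URL for `query`; keyword arguments become URL parameters."""
    # raw=9 requests the XML feed in UTF-8; callers may override it (e.g. raw=8).
    merged = {"q": query, "raw": 9}
    merged.update(params)
    # urlencode() percent-encodes each value and turns spaces into '+'.
    return SEARCH_ENDPOINT + "?" + urlencode(merged)


def page_offsets(page_size=10, max_start=499):
    """Yield successive values for the s parameter, up to the documented maximum."""
    start = 0
    while start <= max_start:
        yield start
        start += page_size


# Ask for 20 results per page with the meta description and keywords attached.
url = build_search_url("race cars", n=20, sc=1, dr=1,
                       dt="description:32 keywords:64")
print(url)
```

Iterating over page_offsets(50) and passing each value as s (with n=50) would walk the result list in the largest pages the n and s limits allow.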

Site Clustering

It is often undesirable to have many results listed from the same site. Site Clustering limits the number of results returned from any given site to two, and provides a link that says "more results from this site" in case the searcher wants them.
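
If the feed is requested with sc=0, a similar effect can be approximated on the client. The sketch below only illustrates the idea of capping results per site. It groups by URL hostname, which is an assumption, and it is not Gigablast's server-side implementation. The function name cluster_by_site is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def cluster_by_site(urls, per_site=2):
    """Keep at most `per_site` URLs per hostname, preserving the original order."""
    seen = Counter()
    kept = []
    for url in urls:
        # Treat the hostname as the "site"; URLs without one share the key "".
        host = urlsplit(url).hostname or ""
        seen[host] += 1
        if seen[host] <= per_site:
            kept.append(url)
    return kept


results = [
    "http://a.com/1",
    "http://a.com/2",
    "http://a.com/3",
    "http://b.com/1",
]
print(cluster_by_site(results))
```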

Duplicate Results Removal

When duplicate results removal is enabled, Gigablast will remove results that have exactly the same content as other results. The psc parameter can be used to also remove documents with similar content.

Cached Web Page Parameters

To get a cached web page from Gigablast use a url like:
http://www.gigablast.com/get?d=94555390410&ih=1&q=my+query&c=main
where:

d=X X is the docId of the page you want returned. DocIds are 64-bit, so you'll need 8 bytes to hold one. DocIds can be harvested from the XML search feed output.
c=X X is the collection that contains the document. Usually this is main.
ih=X X is 1 to include the Gigablast header in the returned page, and 0 to exclude it.
ibh=X X is 1 to include the Gigablast BASE HREF tag in the cached page. The default is 1.
q=X X is the query that, when present, will cause Gigablast to highlight the query terms on the returned page.
cas=X X can be 0 or 1 to respectively disable or enable click and scroll. Default is 1.
strip=X X can be 0, 1, 2 or 3. If X is 0 then no stripping is performed. If X is 1 then image and other tags are removed. An X of 2 is another form of removing tags. If X is 3 then all tags are removed. Default is 0.
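
A cached-page URL can be built the same way as a search URL. This is a minimal Python sketch under the assumptions that the endpoint is the one in the example above and that a docId taken from the feed is passed through unchanged. The helper name build_cached_url is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this section.
CACHE_ENDPOINT = "http://www.gigablast.com/get"


def build_cached_url(doc_id, collection="main", query=None,
                     include_header=True, strip=0):
    """Return a URL that retrieves the cached copy of the page `doc_id`."""
    if strip not in (0, 1, 2, 3):
        raise ValueError("strip must be 0, 1, 2 or 3")
    params = {
        "d": int(doc_id),               # Python ints hold 64-bit docIds without loss
        "c": collection,
        "ih": 1 if include_header else 0,
        "strip": strip,
    }
    if query:
        params["q"] = query             # highlights the query terms on the page
    return CACHE_ENDPOINT + "?" + urlencode(params)


print(build_cached_url(94555390410, query="my query"))
```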


Special Query Terms
url:X Matches result that has the url X.
ext:X Matches results that have the file extension X, such as html, exe, or gif.
link:X Obsolete. Replaced by links:X.
ilink:X Obsolete. Replaced by links:X.
links:X Matches results that link to the url, X. External linkers are sorted above internal, and linkers with link text are sorted above those without.
sitelink:X Matches results that link to a page somewhere from the site, X. External linkers are sorted above internal, and linkers with link text are sorted above those without.
site:X Matches results from site, X.
coll:X Obsolete.
ip:X Matches results from the IP address, X. X can be the top three IP numbers, like 1.2.3, as well.
suburl:X Matches results that have X as a part of their url.
isclean:X Obsolete. Replaced by family filter, use &ff=1.
type:X Matches results of this content type. X can be html, doc, exe, etc.
filetype:X Same as type:X above.
tag:X Obsolete.
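
Special query terms are part of the query itself, so they travel in the q parameter and must be url-encoded with it. The sketch below assumes that several terms can be combined with ordinary words by separating them with spaces, which this page does not state explicitly. The helper name add_terms is hypothetical.

```python
from urllib.parse import urlencode


def add_terms(words, **terms):
    """Append special query terms such as site:X or ext:X to plain query words."""
    parts = [words] if words else []
    # Each keyword argument becomes one "name:value" term.
    parts += [f"{name}:{value}" for name, value in terms.items()]
    return " ".join(parts)


query = add_terms("dentistry", site="www.mydomain.com", ext="html")
feed_url = "http://www.gigablast.com/search?" + urlencode({"q": query, "raw": 9})
print(query)
print(feed_url)
```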


The Output

Gigablast allows you to receive the search results in a number of formats useful for interfacing to your program. Here is an example of the XML feed.

The XML reply has the following format (but without the comments):

# The XML reply uses the Latin-1 Character Set (ISO 8859-1) when using raw=8
<?xml version="1.0" encoding="ISO-8859-1" ?>

# OR when using raw=9
<?xml version="1.0" encoding="utf-8" ?>

# It consists of one, and only one, response.
<response>

  # If any error was received in processing the request, it will be here.
  <error>Out of memory</error>
  # The numeric code of the error, if any, goes here.
  # See all the Error Codes, but the
  # following errors are most likely:
  # 32771 - A cached page was not found when it should have been.
  #    12 - There was a shortage of memory to properly process the request.
  # 32863 - Queried collection does not exist.
  <errno>32790</errno>

  # Total number of documents in the collection being searched.
  <docsInCollection>2060245584</docsInCollection>
  # An APPROXIMATION of the total number of search results for the query.
  <hits>4838158</hits>
  # This is "1" if more results are available after these, "0" if not.
  <moreResultsFollow>1</moreResultsFollow>

  # If present and value is 1, some words in the query were censored for content.
  <queryCensored>1</queryCensored>
  # If present, the value is the number of results that were censored for content.
  <resultsCensored>3</resultsCensored>
  # If this tag is present, it will hold an alternate spelling recommendation 
  # for the query. The &spell=1 parameter must be present in the query url,
  # however, for you to get a spelling recommendation back.
  <spell>nose</spell>

  # If this tag is present, it contains the list of query words that were 
  # ignored as individual words, but not necessarily as part of a phrase
  <ignoredWords>the in of</ignoredWords>
  # This is how many of the search results contain ALL of the query terms.
  # It is only used for printing the "blue bar" for doing SuperRecall
  <minNumExactMatches>300</minNumExactMatches>

  # The list of related topics, each enclosed by <topic> tags. 
  # You must provide a topics parameter to the query url to get topics.
  <topic>
    # Each topic has a score. A score of 50% or more is considered pretty good.
    <score>63</score>
    # Out of the documents scanned, how many contain this topic.
    <docCount>4</docCount>
    # The docIds of the documents scanned that contain this topic.
    <docId>9030668134</docId>
    <docId>265962215563</docId>
    <docId>43940265200</docId>
    <docId>264861015824</docId>
    # The topic name.
    <name><![CDATA[Race Cars]]></name>
    # And OPTIONALLY the name of the meta tag it was derived from.
    <from>keywords</from>
  </topic>

  # The list of search results, each enclosed in <result> tags.
  <result>

    # Each result has a title. This may be empty if none was found on the page.
    <title><![CDATA[My Homepage]]></title>

    # Each result has a summary. This may be empty. The summary is generated 
    # so as to contain the query terms if possible.
    <sum><![CDATA[All about my interests and hobbies]]></sum>

    # If this result is categorized under the DMOZ Directory, data about each
    # category it is in will be enclosed in a <dmoz> tag.
    <dmoz>
      # The category ID number of this category.
      <dmozCatId>172</dmozCatId>
      # The path of this category in the directory.
      <dmozCat><![CDATA[Health: Dentistry]]></dmozCat>
      # Title of this result as listed in the directory.
      <dmozTitle><![CDATA[My Homepage]]></dmozTitle>
      # Description of this page as listed in the directory.
      <dmozDesc><![CDATA[A Dentist's Home Page]]></dmozDesc>
    </dmoz>
    # If the directory is being given along with the results, this is the number of
    # stars given to this page based on its quality.
    <stars>3</stars>

    # Each result may have a sequence of <display> tags if the feed input
    # contained a dt parameter. This allows you to extract
    # information contained in meta tags in the content of each search result.
    # To obtain the contents of the author meta tag, you would need to pass in
    # dt=author.
    <display name="author"><![CDATA[Contents of the meta author tag]]></display>

    # Each result has a URL. This should never be empty.
    <url><![CDATA[http://www.mydomain.com/mypage.html]]></url>
    # The size of the page in kilobytes. Accurate to the tenth of a kilobyte.
    <size>5.6</size>
    # The time the page was last indexed. It may not have been indexed in a 
    # long time if the page's content has not changed. The time is expressed 
    # in seconds since the epoch. (Jan 1, 1970)
    <indexed>1064367311</indexed> 
    # The time the page was last modified. This is taken from the HTTP reply 
    # of the web server when downloading the page. It is 0 if unknown. The time
    # is expressed in seconds since the epoch. (Jan 1, 1970)
    <lastMod>1058477041</lastMod>

    # The assigned docid for this page. This number is unique and used 
    # internally by Gigablast to identify this page. It is used to retrieve the
    # "cached copy" of the page.
    <docId>65990704587</docId>

    # When doing site clustering, this tag will be present if the result is 
    # from the same hostname as a previous result for the same query. It 
    # indicates that you might want to indent the result. Any further results 
    # from this same hostname will be stripped from the feed.
    <clustered>1</clustered>

    # When Topic Clustering is being used, these will display results which 
    # are considered similar to this result and have been clustered under it. 
    # Each similar result is enclosed in a <similar> tag. 
    <similar>
      # The url for the similar result.
      <url><![CDATA[http://www.similar.com/]]></url>
      # The title of the similar result.
      <title><![CDATA[A similar topic]]></title>
    </similar>
    # If this is present and set to 1, there are more similar results beyond 
    # those given here. 
    <moreSimilar>1</moreSimilar>

    # This is a standard HTTP MIME content classification of the result. It is 
    # not present if the page is text/html. Otherwise, it will be one of the
    # following: text/plain
    #            text/xml
    #            application/pdf
    #            application/msword
    #            application/vnd.ms-excel
    #            application/mspowerpoint
    #            application/postscript
    <contentType>text/plain</contentType>

    # This is the language the page was detected as.
    <language><![CDATA[English]]></language>
    # The quality of the document as determined by Gigablast. Ranges from 0 to 100.
    <quality>80</quality>
    # The character set this page was originally encoded in. 
    <charset><![CDATA[utf-8]]></charset>
    
  </result>

  <result>

  ...
  </result>

  ...

  # If the directory has been requested, this node will include the directory
  # structure for the requested category.  Typically this is above the results.
  <directory>
    # Category ID for the displayed directory structure.
    <dirId>172</dirId>
    # Directory path of this category listing.
    <dirName>Health: Dentistry</dirName>

    # Specifies if the directory listing is displayed in a Right-To-Left format.
    <dirIsRTL>1</dirIsRTL>
    # Sub-Categories listed as letters meant to be displayed as a letter bar.
    # Each sub-category will be enclosed in a <letterbar> tag.
    <letterbar><![CDATA[Health/Dentistry/A]]>    
    # Every sub category will include a count of how many urls are listed under it.
      <urlcount>5</urlcount>

    </letterbar>
    # Normal sub-categories listed in groups.  These are listed in order of group
    # and alphabetically within each group. Each sub-category is enclosed in a
    # <narrow2>, <narrow1>, or <narrow> tag.
    <narrow2><![CDATA[Health/Dentistry/Regional]]>
      <urlcount>0</urlcount>

    </narrow2>
    <narrow1><![CDATA[Health/Dentistry/Association]]>
      <urlcount>122</urlcount>
    </narrow1>
    <narrow><![CDATA[Health/Dentistry/Children]]>

      <urlcount>24</urlcount>
    </narrow>
    # Symbolically linked sub-categories physically under a different category.
    # These will be interwoven alphabetically within the respective narrow groups.
    # The name listed before the path is the symbolic name.  
    # Each symbolically linked
    # sub-category is enclosed in a <symbolic2>, <symbolic1>, or 
    # <symbolic> tag.
    <symbolic2><![CDATA[Dentophobia:Health/Mental_Health/Disorders/Anxiety/Phobias/Dentophobia]]>

      <urlcount>2</urlcount>
    </symbolic2>
    <symbolic1><![CDATA[Dental_Laboratories:Business/Healthcare/Products_and_Services/Dentistry/]]>
      <urlcount>71</urlcount>

    </symbolic1>
    <symbolic><![CDATA[Products:Shopping/Health/Dental]]>
      <urlcount>71</urlcount>
    </symbolic>
    # Separate categories in the directory which are related to this one.
    <related><![CDATA[Society/Issues/Health/Dentistry]]>

      <urlcount>4</urlcount>
    </related>
    # This category in other languages in the directory.
    <altlang><![CDATA[Basque:World/Euskara/Osasuna/Odontologia]]>
      <urlcount>7</urlcount>

    </altlang>
  </directory>

</response>
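
The reply can be read with any XML parser. As one possible approach, the Python sketch below uses the standard library's xml.etree.ElementTree on a trimmed-down sample built from the tags documented above. The function name parse_feed and the shape of the returned dictionary are hypothetical, and treating a missing <errno> or a value of 0 as "no error" is an assumption.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# A shortened reply using only tags documented above (sample values, not live data).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" ?>
<response>
  <hits>4838158</hits>
  <moreResultsFollow>1</moreResultsFollow>
  <result>
    <title><![CDATA[My Homepage]]></title>
    <sum><![CDATA[All about my interests and hobbies]]></sum>
    <url><![CDATA[http://www.mydomain.com/mypage.html]]></url>
    <indexed>1064367311</indexed>
    <docId>65990704587</docId>
  </result>
</response>"""


def parse_feed(xml_bytes):
    """Parse an XML feed reply into a plain dictionary."""
    root = ET.fromstring(xml_bytes)

    # Assumption: <errno> is absent (or 0) when the request succeeded.
    errno = root.findtext("errno")
    if errno not in (None, "0"):
        raise RuntimeError(f"feed error {errno}: {root.findtext('error', default='')}")

    results = []
    for node in root.findall("result"):
        indexed = node.findtext("indexed")
        doc_id = node.findtext("docId")
        results.append({
            "title": node.findtext("title", default=""),
            "summary": node.findtext("sum", default=""),
            "url": node.findtext("url", default=""),
            "doc_id": int(doc_id) if doc_id else None,
            # <indexed> is in seconds since the Unix epoch.
            "indexed": (datetime.fromtimestamp(int(indexed), tz=timezone.utc)
                        if indexed else None),
        })

    return {
        "hits": int(root.findtext("hits", default="0")),
        "more": root.findtext("moreResultsFollow") == "1",
        "results": results,
    }


feed = parse_feed(SAMPLE)
for item in feed["results"]:
    print(item["title"], item["url"], item["indexed"])
```

The doc_id value of each result is what the d parameter of the cached web page URL expects.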
 
Copyright © 2010-2020 Gigablast, Inc. All rights reserved.