Custom Solr Function Query that returns the Alexa site rank of an input URL, host, or domain. This can be useful for boosting web documents based on their Alexa rank.
In other words, it's a poor man's PageRank.
This module also performs some light caching, in order to show good etiquette to Alexa, who probably don't want folks hammering their servers.
First off, put the following JARs into your Solr's lib/ directory:
(Yes, I'm requiring Google Guava for this. It helps protect my sanity when I code in Java these days.)
Next, add the following definition to your solrconfig.xml:
<valueSourceParser name="siterank" class="org.healthonnet.lucene.siterank.SiteRankSourceParser">
<bool name="doCache">true</bool>
<str name="cacheSpec">concurrencyLevel=16,maximumSize=8192,softValues</str>
<bool name="extractDomainFromUrl">true</bool>
</valueSourceParser>
Shown above are all the configuration parameters with their default values. You can leave them out if you're okay with the defaults.
doCache
: True if caching should be enabled.
cacheSpec
: Configuration for the cache, in Guava CacheBuilderSpec format.
extractDomainFromUrl
: Are you inputting full URLs, like http://www.google.com/mail? Then set this to true. Otherwise, if you're inputting stripped-down domain or host names, such as google.com, then set it to false. This is used as a performance improvement at the caching level, so we don't have to look up the same domain over and over again just because the URL is different.
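To illustrate why extractDomainFromUrl helps the cache, here is a minimal sketch of reducing a URL to a single cache key. It is not the module's actual code: the class HostKeySketch and the method cacheKey are hypothetical names, and the sketch simply uses java.net.URI to pull out the host. Whether the real module also strips prefixes such as "www." is not covered here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the module: reduces a full URL to its host
// so that different URLs on the same site share one cache key.
final class HostKeySketch {

    static String cacheKey(String input, boolean extractDomainFromUrl) {
        if (!extractDomainFromUrl) {
            return input; // input is assumed to be a bare host or domain already
        }
        try {
            String host = new URI(input).getHost();
            // getHost() is null when the input has no authority part, e.g. "google.com"
            return host != null ? host : input;
        } catch (URISyntaxException e) {
            return input; // fall back to the raw input on malformed URLs
        }
    }

    public static void main(String[] args) {
        // Both URLs collapse to the same key, so only one lookup would be needed.
        System.out.println(cacheKey("http://www.google.com/mail", true));
        System.out.println(cacheKey("http://www.google.com/maps", true));
        // With the flag off, the input is used as-is.
        System.out.println(cacheKey("google.com", false));
    }
}
```

With the flag set to false, a full URL would be used verbatim as the key, so every distinct URL would trigger its own lookup.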
This module defines a new function called siterank().
The function takes in a string (either a full URL or a domain/host - see above) and outputs the reciprocal rank of the site, which is a double between 0.0 and 1.0. 0.0 is returned if the site is not found in the ranking.
The reciprocal rank is simply:
1.0 / rank
...so e.g. Google will probably have a reciprocal rank of 1.0 (1.0 / 1.0), WebMD might have 0.00239234 (1.0 / 418) and MyCoolHipsterSiteNobodyKnowsAbout.com might have 0.0000000198867735 (1.0 / 50284678).
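The arithmetic above can be sketched in a few lines of Java. This is only an illustration of the formula, not the module's source: the class ReciprocalRankSketch is a hypothetical name, and treating a rank of zero or less as "not found" is an assumption made for this sketch.

```java
// Hypothetical illustration of the reciprocal-rank formula described above.
final class ReciprocalRankSketch {

    // Assumption of this sketch: a rank <= 0 stands in for "site not found".
    static double reciprocalRank(long rank) {
        return rank > 0 ? 1.0 / rank : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(reciprocalRank(1));        // top-ranked site
        System.out.println(reciprocalRank(418));      // the WebMD example
        System.out.println(reciprocalRank(50284678)); // the hipster-site example
        System.out.println(reciprocalRank(0));        // not found
    }
}
```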
Most likely you will want to wrap this function in something like exp() to smooth the values, and to deal with cases where the function returns 0.0. So the recommended usage is:
exp(siterank(myUrlOrHostField))
...which you can use as a boost function, e.g.
http://mySite:8983/solr/select?q={!boost b=exp(siterank(myUrlOrHostField))}*:*
So for instance, in the above examples, Google would have a score of 2.71828, WebMD would get 1.00239520844, and the hipster site would get 1.00000001989. Tweak the formula as you see fit.
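To see why exp() is a convenient wrapper, the following sketch reproduces the boost values quoted above. The class BoostSketch is a hypothetical name used only for illustration; Solr evaluates exp() itself, so nothing like this needs to be written in practice. The point is that a site missing from the ranking (reciprocal rank 0.0) gets a multiplicative boost of exactly 1.0, which leaves its score unchanged instead of zeroing it out.

```java
// Hypothetical illustration of the exp(siterank(...)) boost values.
final class BoostSketch {

    // Multiplicative boost for a given reciprocal rank in [0.0, 1.0].
    static double boost(double reciprocalRank) {
        return Math.exp(reciprocalRank);
    }

    public static void main(String[] args) {
        System.out.println(boost(1.0));            // top-ranked site
        System.out.println(boost(1.0 / 418));      // the WebMD example
        System.out.println(boost(1.0 / 50284678)); // the hipster-site example
        System.out.println(boost(0.0));            // site not found: neutral boost
    }
}
```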
See my blog post on boosting for more details about boosting in Solr.
In the future, I'd like to expand this module to output rankings from other sources than Alexa, including custom config files.
Download the code and do:
mvn install