serveral library offers diffbot API wrapper for Python (e.g. https://github.com/diffbot/diffbot-python-client or https://github.com/andreypopp/diffbot ), but these library has several limits. The library diffbotpy supports various kinds of API.
pip install -e git+ssh://[email protected]/wacul/diffbotpy.git@master#egg=diffbotpy-0.1
diffbotpy supports almost all diffbot APIs below (except for account API):
-
Analyze API, Article API, Discussion API, Image API, Product API, Video API
-
Crawlbot API, Bulk API, Search API
import diffbot
token = "aaa"
fetcher = diffbot.SingleFetcher(token=token)
target_url = "https://aaa.co.jp/"
# simplely fetch
#extractors = fetcher.fetch_analyze_extractors(
# target_url = target_url,
#)
extractors = fetcher.fetch_article_extractors(
target_url = "http://google.co.jp",
args=diffbot.SingleFetcher.generate_article_args(
fields = "meta",
)
)
for extractor in extractors:
print(extractor)
print(extractor.get_raw_data())
data = extractor.get_raw_data()
description = data["meta"]["description"]
keyword_list = data["meta"]["keywords"].split(",")
print(extractor.get_title())
print(description)
print(keyword_list)
import diffbot
job_operator = diffbot.BulkJobOperator(
token=token,
job_name = "topPageUrl_12751"
)
# send request to bulk/data and get response data
data = job_operator.fetch_raw_data()
## get list of extractor object.
# extractors = job_operator.fetch_extractors()
print(data[0])
for ext in extractors:
print(ext.get_title())
searcher = diffbot.Searcher(
token,
job_name,
)
#raw_data = searcher.fetch_raw_data(query = "type:article")
#print(raw_data)
extractors = searcher.fetch_search_extractors(query = "type:article")
for ext in extractors:
# print(ext.get_raw_data())
print(ext.get_title())
print(ext.get_page_url())
import diffbot
job_name = "topPageUrl_12751"
job_operator = diffbot.BulkJobOperator(
token=token
)
# for new bot name
job_operator.start_job(
target_url_list=new_url_lst,
apiurl = job_operator.generate_apiurl("article"), # https://api.diffbot.com/v3/article
)
# fetch object using search API
searcher = job_operator.fetch_completed_searcher()
Some Errors raise in paticular cases:
-
DiffbotTokenError => You may put the invalid username. Confirm your settings. See also
Settings
Section. -
DiffbotResponseError => The response error occurs. see also https://www.diffbot.com/dev/docs/error/.
-
DiffbotJobStatusError => Some jobs are not completed in Bulk or Crawl API. Wait until the jobs completed, see also status code list, https://www.diffbot.com/dev/docs/bulk/api.jsp.
-
DiffbotUnexpectedBodyError => diffbot's API response is not JSON, sometimes, so.