Prismatic Interest Graph API

## What does this do?

### Topic Tagging

This service automatically analyzes the content of a document or piece of text and reports the interests present in the article. An interest is a non-hierarchical, single-phrase summary of the thematic content of a piece of text; examples include Functional Programming, Celebrity Gossip, or Flowers. At Prismatic, we’ve been using interests to automatically analyze the content of text in order to help connect people with the content they find interesting. Our interest graph can automatically analyze a piece of text and determine which interests it is about.

### Topic Similarity

The service provides an endpoint for returning the set of topics that are similar to a given query topic.

### Aspect Tagging

This service automatically analyzes the content of a webpage, analyzes the DOM, and reports the aspects, which describe the structure or function of the webpage.

### Feeds API

The service provides recent, high-quality documents from all over the web for a given query (which can include both topics and aspects), including extracted metadata for each URL. The Interest Graph Explorer includes an interactive demo for the Feeds API.

### What's new?

We are working hard to continually extend and improve the functionality of the Interest Graph API. Stay up-to-date by reading the change log.

## How do I use the service?

Step 1: Acquire an access token

Head over to http://interest-graph.getprismatic.com and enter your email address along with some additional information about how you plan to use the service. We will then email you an API access token for our free tier.

Our free tier offers a limited number of calls to each API. For more details about the free tier and our paid plans, please visit our Developer page.

Step 2: Make a query

Once you have your access token, you can try tagging a URL or piece of text via our web interface. Click the link in the email you received with your token to find an interface where you can explore the API and make queries.

You can also make requests programmatically. For example, if we want to run the tagging service on the Wikipedia article about Machine Learning, we can curl the service:

```sh
curl -H "X-API-TOKEN: <API-TOKEN>" 'http://interest-graph.getprismatic.com/url/topic' --data 'url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FMachine_learning'
```

where <API-TOKEN> is a stand-in for your access token string.
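The same request can be assembled in Python using only the standard library. This is a minimal sketch rather than official client code: the helper name `build_topic_request` is hypothetical, while the endpoint URL, the `X-API-TOKEN` header, and the `url` form field are taken from the curl example above. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl example above.
TOPIC_URL = "http://interest-graph.getprismatic.com/url/topic"


def build_topic_request(api_token, page_url):
    """Build (but do not send) a POST request asking for the topics of page_url."""
    # urlencode percent-encodes the target URL, matching the hand-encoded --data value.
    body = urlencode({"url": page_url}).encode("ascii")
    return Request(TOPIC_URL, data=body, headers={"X-API-TOKEN": api_token})


request = build_topic_request("<API-TOKEN>", "http://en.wikipedia.org/wiki/Machine_learning")
print(request.get_method())          # POST, because the request carries a body
print(request.data.decode("ascii"))  # the form-encoded body
```

To actually send it, pass the request to `urllib.request.urlopen` after replacing the placeholder with a real token.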

Step 3: Interpret the response

The response is a JSON map whose topics key holds a list of topic tags. Each topic tag has a numeric id identifying the topic in the system, a human-readable topic name under topic, and a score. The score is a real value between 0 and 1 that represents the degree to which a significant part of the article is about the corresponding topic.

As a Schema:

```clojure
{:topics [{:id long
           :topic String
           :score Num}]}
```
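As an illustration, a response with this shape could be read in Python as follows. This is a sketch under assumptions: the function name `parse_topics` and the `min_score` cutoff are hypothetical, the JSON keys are assumed to be the plain strings `topics`, `id`, `topic`, and `score`, and the sample body is made up for demonstration rather than real output from the service.

```python
import json


def parse_topics(response_text, min_score=0.0):
    """Return (topic, score) pairs from a response body, highest score first."""
    payload = json.loads(response_text)
    tags = []
    for tag in payload["topics"]:
        score = float(tag["score"])
        # The documentation states that scores lie between 0 and 1.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for topic id {tag['id']}: {score}")
        if score >= min_score:
            tags.append((tag["topic"], score))
    return sorted(tags, key=lambda pair: pair[1], reverse=True)


# Made-up sample body for illustration only; not real output from the service.
sample = (
    '{"topics": [{"id": 1, "topic": "Flowers", "score": 0.25},'
    ' {"id": 2, "topic": "Functional Programming", "score": 0.75}]}'
)
print(parse_topics(sample, min_score=0.5))
```

Filtering on a minimum score is one simple way to keep only the topics that the article is substantially about; the right cutoff depends on your application.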

## How stable is the service?

We are committed to offering a stable, robust, and reliable API for our customers. The change log documents important changes to the API, and will include a clear migration path when breaking changes are required.

Free tier rate limits are subject to change; please see our Developer page for the latest information.

## What are all of the supported API endpoints?

The Interest Graph API swagger documentation lists all of the supported endpoints, their descriptions, and input/output specifications.

We have a number of endpoints that can analyze a piece of text or URL and return the aspects or topics. We also have a feeds endpoint that returns a feed of recent documents about a given aspect and/or topics.

You will need an access token in order to programmatically access the API. Passing the token is done in the X-API-TOKEN header. If for some reason you have trouble passing headers, you can alternatively pass it in a query parameter ?api-token=<API-TOKEN>. Omitting the token from both the query parameter and header will result in a 401 status code from the server.
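If headers are unavailable, the query-parameter fallback can be built as in the sketch below. The helper name `add_token_to_query` and the token value `abc123` are hypothetical; only the parameter name `api-token` comes from the text above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_token_to_query(url, api_token):
    """Return url with an api-token query parameter appended, keeping existing parameters."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    params.append(("api-token", api_token))
    return urlunsplit(parts._replace(query=urlencode(params)))


# "abc123" is a placeholder, not a real token.
print(add_token_to_query("http://interest-graph.getprismatic.com/url/topic", "abc123"))
```

Keep in mind that a token placed in a URL can end up in server logs and browser history, which is one reason to prefer the header when you can.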

Requests are rate limited based on your service package. Please see our Developer page for the latest information, or contact [email protected] with questions about rate limits.

## I think the system made a mistake, where can I report it?

Our approach to topic modeling is inherently data-driven, and as with all data-driven models, it is subject to some noise. It is impossible to have 100% precision and recall on all queries. There are some articles that might be mis-tagged with incorrect interests, and some articles whose content reflects a particular topic that our models fail to detect. On the whole, these models do a good job, but errors are inevitable. We will record all reported errors in order to feed them back into our training pipeline to ensure it improves over time. To report an error, visit our Topic Classification Error Reporting Page.

## Do you have the topic I care about?

We have over 5k modeled interests, and while we try to model the most popular interests that are applicable over a wide range of applications, we do not currently model everything. To check whether your topic is currently modeled, visit our Topic Search Page. Although we strongly encourage exploring the set of available topics via search -- it will return results even if there is no substring match -- the full list of topics is also available.

## What aspects do you currently model?

The Aspect Hierarchy organizes the web into a taxonomy of classes. It is structured from general to specific, where each class (e.g. Article) can be further refined into subclasses based on more specific attributes (e.g. News vs. Interview).

[Figure: aspect hierarchy]

Each oval represents a class of webpages, and each diamond is an attribute that further partitions the webpages of its parent into mutually exclusive subclasses. For example, every webpage has exactly one type (e.g. Image, Article, Commerce, or Other), and every Article is further classified into a single content type. Therefore, a webpage can’t be both an Event and a Review, because its type can’t be both Commerce and Article, but it can be both an Event and Risque.

Currently, there are two top-level classifications: type and flag_nsfw:

The type attribute partitions webpages into mutually exclusive sets of content types according to the primary focus of the webpage.

| Content Type | Primary Focus of Webpage | Example URL |
| --- | --- | --- |
| Image | image | example |
| Article | textual content | example |
| Audio | audio file such as a song or podcast | example |
| Video | video | example |
| Commerce | offer a product or other entity | example |

The Article class is further refined according to the primary focus of the content of the text.

| Type of Content | Primary Focus of Content | Example URL |
| --- | --- | --- |
| Review | review of a product, piece of media, or app | example |
| News | story about a recent or significant event | example |
| Recipe | instructions for preparing a dish | example |
| Deal | timely savings on product or service, but not a direct page where the product can be purchased | example |
| Interview | content presented in a question and answer format | example |
| Listicle | content presented in a numbered or bulleted list | example |

Each webpage in the Commerce class is partitioned based on the product that is offered.

| Entity Offered | Description of Entity | Example URL |
| --- | --- | --- |
| Product | a tangible item for purchase | example |
| Job | a paid position of employment | example |
| Event | tickets for purchase to a show, concert, or other event | example |

Each of the preceding subdivisions also contains the subclass Other, which is applied to all webpages that do not fall into one of the listed sets.

The top-level flag_nsfw attribute partitions webpages into those that are safe for work and those that are not. Those that are not safe for work are divided into Porn, Softcore, and Risque. Porn applies to content that contains nudity published by the sex industry. Softcore pertains to articles that are not Porn but whose primary focus is imagery that objectifies people in sexual ways. Risque is for content that is sexually suggestive, but not covered by the first two categories. Note: at the moment, content classification for NSFW aspects is determined solely based on the text and metadata of the web page -- not the imagery.
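The mutual-exclusion rules of the hierarchy can be made concrete with a small consistency check. This is an illustrative sketch only: the dictionary layout and the function `is_valid_labeling` are hypothetical and do not reflect the format the API actually returns; the class names are copied from the tables and text above. The subclass is treated as optional here for simplicity.

```python
# Subclasses per top-level type, copied from the tables above.
# Each subdivision also includes "Other".
TYPES = {"Image", "Article", "Audio", "Video", "Commerce", "Other"}
SUBCLASSES = {
    "Article": {"Review", "News", "Recipe", "Deal", "Interview", "Listicle", "Other"},
    "Commerce": {"Product", "Job", "Event", "Other"},
}
NSFW_FLAGS = {"Porn", "Softcore", "Risque"}


def is_valid_labeling(page_type, subclass=None, nsfw=None):
    """Check one type, an optional subclass of that type, and an optional NSFW flag."""
    if page_type not in TYPES:
        return False
    # A subclass is only valid under the type that owns it.
    if subclass is not None and subclass not in SUBCLASSES.get(page_type, set()):
        return False
    # flag_nsfw is independent of type, so any type may carry a flag.
    return nsfw is None or nsfw in NSFW_FLAGS


print(is_valid_labeling("Commerce", "Event", nsfw="Risque"))  # True
print(is_valid_labeling("Commerce", "Review"))                # False
```

The two calls mirror the example in the text: an Event can also be Risque, but a Commerce page cannot be a Review because Review belongs to Article.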

You can also use the /aspects endpoint to programmatically list the set of all currently supported aspects, in the format expected by the /doc/search endpoint.

## You don’t currently model my interest. Where can I submit a request for you to model a new interest?

Currently, the set of interests is fixed. Given our resources, we are limited in how many interests we can reliably model. While we do plan to expand the set of modeled interests, we will prioritize which interests we add based on aggregate demand. If you would like to submit a request to model a new topic, please visit our Interest Submission Page.

## My question is not listed here.

If there is a question or issue that you don't see addressed here, please email us at [email protected].
