Class-based harvester creation

Making a class-based harvester


If you're creating a harvester for a service that supports OAI-PMH, you can build it from scrapi classes that will harvest the data and send normalized records through the SHARE pipeline.

You can find the base class definition for the OAI-PMH harvester in the scrapi code, available here.

To create a class-based harvester, follow these steps:

  1. Create a file named __init__.py in which you will instantiate the harvester class.

    Your __init__.py will have three main parts:

    • The imports section at the top, where you'll import the base OAI harvester class
    • Your instance of the harvester class, with some key areas defined:
      • the name of your provider (as it will show up in the source field)
      • the base URL where you will make your OAI requests; it should include everything before the ? in the request URL
      • the property list: a list of every field that does not map to an item in the main outer schema
      • a list of "approved sets": if your provider tags records with particular "setSpec" entries and only some of those sets should make their way into the Notification Service, list the approved "setSpec" values here. Only records whose setSpec appears in the approved list will be normalized and sent to the Notification Service. (See the ListSets sketch after the example below for a quick way to discover a provider's sets.)
    • The consume and normalize function definitions. These are very simple: they just alias your harvester instance's harvest and normalize methods.

    Here's an example of what your __init__.py file might look like:

    from __future__ import unicode_literals
    
    from scrapi.base import OAIHarvester
    
    
    calpoly = OAIHarvester(
        name='calpoly',
        base_url='http://digitalcommons.calpoly.edu/do/oai/',
        property_list=['type', 'source', 'publisher', 'format', 'date'],
        approved_sets=[
            'csusymp2009',
            'acct_fac',
            'aerosp',
            'aero_fac',
        ]
    )
    
    consume = calpoly.harvest
    normalize = calpoly.normalize
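
    Before settling on your approved_sets, it helps to see which sets the provider actually exposes. Here's a quick standalone check using the standard OAI-PMH ListSets verb (illustrative only, not part of the harvester itself; the endpoint is the Cal Poly URL from the example above):

    import requests

    # Ask the provider which sets it offers; each <setSpec> element in the
    # response is a candidate value for approved_sets.
    response = requests.get(
        'http://digitalcommons.calpoly.edu/do/oai/',
        params={'verb': 'ListSets'},
    )
    print(response.text[:1000])  # the raw OAI-PMH XML listing the sets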
  2. Create your requirements.txt and dev-requirements.txt files. They should be the same as those of other class-based harvesters.

    Here are examples of what the files should look like:

    • requirements.txt
    lxml==3.4.0
    requests==2.4.1
    nameparser==0.3.3
    python-dateutil==2.2
    
    • dev-requirements.txt
    git+http://github.com/chrisseto/scrapi.git
    -r requirements.txt
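
    Since dev-requirements.txt pulls in requirements.txt through its -r line, a single pip install -r dev-requirements.txt will install both sets of dependencies locally.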
    
  3. Create a GitHub repo for your harvester and add all of your files.

  4. Create the configuration file for your harvester.

    It will be a JSON file with the following parameters:

    • days: the days of the week on which your harvester should run
    • hour: the hour of the day at which to run the harvester
    • minute: the minute of the hour at which to run the harvester
    • shortName: the name that will display as "source" in search results
    • longName: the more formal name of your source
    • url: the location of your harvester's GitHub repo
    • fileFormat: the data format of your raw files; all OAI-PMH records are in XML format
    • collisionCategory: for internal use, just enter 1
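
    Here's an example of what the configuration file might look like:
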
    {
        "days": "mon-sun",
        "hour": "23",
        "minute": "59",
        "shortName": "calpoly",
        "longName": "Digital Commons at Cal Poly",
        "url": "https://github.com/erinspace/CalPoly.git",
        "fileFormat": "xml",
        "collisionCategory": 1
    }
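
    Before opening your pull request, you can sanity-check that the file parses and has all of the expected keys. This is a hypothetical snippet, not part of scrapi, and it assumes the file is saved as calpoly.json:

    import json

    # The keys documented above; collisionCategory is an integer, the rest are strings.
    REQUIRED_KEYS = set([
        'days', 'hour', 'minute', 'shortName',
        'longName', 'url', 'fileFormat', 'collisionCategory',
    ])

    with open('calpoly.json') as f:
        config = json.load(f)

    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError('config is missing keys: {0}'.format(sorted(missing)))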
  5. Create a pull request to add your configuration file to the ScrapiConsumers repo.
