Class-based harvester creation

Making a class-based harvester


If you're creating a harvester for a service that supports OAI-PMH, you can build it from scrapi classes that will harvest the data and send normalized records through the SHARE pipeline.

You can find the base class definition for the OAI-PMH harvester in the scrapi code, available here.

To create a class-based harvester, follow these steps:

  1. Create a file named __init__.py in which you will instantiate the harvester class.

    Your __init__.py will have three main parts:

    • The imports section at the top, where you'll import the base OAI harvester class
    • Your instance of the harvester class, with some key areas defined:
      • the name of your provider (as it will show up in the source field)
      • the base URL where you will make your OAI requests; it should include everything before the ? in the request URL
      • the property list: a list of every field that does not map to an item in the main outer schema
      • a list of "approved sets": if your provider tags records with particular "setSpec" entries and only some of those sets should make their way into the Notification Service, list the approved "setSpec" values here. Only records whose setSpec appears in the approved list will be normalized and sent to the Notification Service. (See the ListSets sketch after the example below for a quick way to discover a provider's sets.)
    • The consume and normalize function definitions. These are very simple: they just alias your harvester instance's harvest and normalize methods.

    Here's an example of what your __init__.py file might look like:

    from __future__ import unicode_literals
    
    from scrapi.base import OAIHarvester
    
    
    calpoly = OAIHarvester(
        name='calpoly',
        base_url='http://digitalcommons.calpoly.edu/do/oai/',
        property_list=['type', 'source', 'publisher', 'format', 'date'],
        approved_sets=[
            'csusymp2009',
            'acct_fac',
            'aerosp',
            'aero_fac',
        ]
    )
    
    consume = calpoly.harvest
    normalize = calpoly.normalize
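
    Before settling on your approved_sets, it helps to see which sets the provider actually exposes. Here's a quick standalone check using the standard OAI-PMH ListSets verb (illustrative only, not part of the harvester itself; the endpoint is the Cal Poly URL from the example above):

    import requests

    # Ask the provider which sets it offers; each <setSpec> element in the
    # response is a candidate value for approved_sets.
    response = requests.get(
        'http://digitalcommons.calpoly.edu/do/oai/',
        params={'verb': 'ListSets'},
    )
    print(response.text[:1000])  # the raw OAI-PMH XML listing the sets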
  2. Create your requirements.txt and dev-requirements.txt files. They should be the same as those of other class-based harvesters.

    Here are examples of what the files should look like:

    • requirements.txt
    lxml==3.4.0
    requests==2.4.1
    nameparser==0.3.3
    python-dateutil==2.2
    
    • dev-requirements.txt
    git+http://github.com/chrisseto/scrapi.git
    -r requirements.txt
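
    Since dev-requirements.txt pulls in requirements.txt through its -r line, a single pip install -r dev-requirements.txt will install both sets of dependencies locally.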
    
  3. Create a GitHub repo for your harvester and add all of your files.

  4. Create the configuration file for your harvester.

    It will be a JSON file with the following parameters:

    • days: the days of the week on which your harvester should run
    • hour: the hour of the day at which to run the harvester
    • minute: the minute of the hour at which to run the harvester
    • shortName: the name that will display as "source" in search results
    • longName: the more formal name of your source
    • url: the location of your harvester's GitHub repo
    • fileFormat: the data format of your raw files; all OAI-PMH records are in XML format
    • collisionCategory: for internal use, just enter 1
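
    Here's an example of what the configuration file might look like:
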
    {
        "days": "mon-sun",
        "hour": "23",
        "minute": "59",
        "shortName": "calpoly",
        "longName": "Digital Commons at Cal Poly",
        "url": "https://github.com/erinspace/CalPoly.git",
        "fileFormat": "xml",
        "collisionCategory": 1
    }
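
    Before opening your pull request, you can sanity-check that the file parses and has all of the expected keys. This is a hypothetical snippet, not part of scrapi, and it assumes the file is saved as calpoly.json:

    import json

    # The keys documented above; collisionCategory is an integer, the rest are strings.
    REQUIRED_KEYS = set([
        'days', 'hour', 'minute', 'shortName',
        'longName', 'url', 'fileFormat', 'collisionCategory',
    ])

    with open('calpoly.json') as f:
        config = json.load(f)

    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError('config is missing keys: {0}'.format(sorted(missing)))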
  5. Create a pull request to add your configuration file to the ScrapiConsumers repo.
