class based harvester creation

Erin Braswell edited this page Apr 12, 2015 · 11 revisions

Making a class based harvester


If the service you are harvesting from supports OAI-PMH, you can build your harvester from the scrapi base classes. The resulting harvester collects records from the service and sends the normalized data through the SHARE pipeline.

Your harvester will live in the scrapi harvesters directory along with the other harvesters.

This class-based harvester makes calls to the specified OAI-PMH service using the ListRecords verb and the oai_dc namespace, with a date range of one day in the past.
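As a rough illustration of the kind of request described above, the sketch below builds a ListRecords URL for the previous day. It is not the scrapi implementation: the helper name build_list_records_url is hypothetical, the use of both the `from` and `until` arguments is an assumption based on the OAI-PMH protocol, and the code is written for Python 3.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def build_list_records_url(base_url, today, days_back=1):
    """Hypothetical helper: build an OAI-PMH ListRecords URL.

    The date range runs from `days_back` days before `today` until `today`.
    """
    start = (today - timedelta(days=days_back)).isoformat()
    end = today.isoformat()
    # A list of pairs keeps the query parameters in a predictable order.
    params = [
        ('verb', 'ListRecords'),
        ('metadataPrefix', 'oai_dc'),
        ('from', start),
        ('until', end),  # assumption: the range is closed with `until`
    ]
    return base_url + '?' + urlencode(params)


url = build_list_records_url(
    'http://digitalcommons.calpoly.edu/do/oai/',
    date(2015, 4, 12),
)
print(url)
```

The base URL here is the one used in the example harvester on this page; any OAI-PMH endpoint would be passed the same way.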

The base OAI-PMH harvester class, OAIHarvester, is defined in the scrapi code under scrapi.base.

To create a class-based harvester, follow the steps below. More detailed instructions are available on the SHARE OSF Wiki.

  1. Fork the scrapi repo, and create a folder named after your harvester under the scrapi/harvesters directory.

  2. Within your new harvester folder, create a file named __init__.py where you will create an instance of the harvester class.

    Your __init__.py will have 3 main parts:

    • The imports section at the top, where you'll import the base OAI harvester class
    • The schema transformer, which defines each main element and where in the source API that item can be found.
    • Your instance of the harvester class, with some key areas defined:
      • the name of your provider (as it will show up in the source field)
      • the base URL where you will make your OAI requests. It should include everything before the ? in the request URL.
      • a list of "approved sets" - if only items with a particular "setSpec" entry should make their way into the Notification Service, list the approved "setSpec" values here. Only entries whose setSpec is in the approved list will be normalized and sent to the Notification Service.
      • timeout - time in seconds to wait between subsequent requests to gather resources.
      • timezone_granularity - how much time detail to include in the OAI request. Setting timezone_granularity to True will add 'T00:00:00Z' to the dates in the request.

    Here's an example of what your __init__.py file might look like:

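    from __future__ import unicode_literals

    # Base class for OAI-PMH harvesters
    from scrapi.base import OAIHarvester


    calpoly = OAIHarvester(
        # Provider name, as it will show up in the source field
        name='calpoly',
        # Everything before the "?" in the OAI request URL
        base_url='http://digitalcommons.calpoly.edu/do/oai/',
        property_list=['type', 'source', 'publisher', 'format', 'date'],
        # Add 'T00:00:00Z' to the dates in the OAI request
        timezone_granularity=True,
        # Seconds to wait between subsequent requests
        timeout=5,
        # Only records with one of these setSpec values are normalized
        approved_sets=[
            'csusymp2009',
            'acct_fac',
            'aerosp',
            'aero_fac',
        ]
    )

  3. Add your provider's favicon to the favicon folder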

  4. From the root directory, run invoke provider_map

  5. Test your harvester locally by running invoke harvester harvester_name_here

  6. Create a pull request to add your new harvester to the scrapi repo
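To make two of the options from step 2 more concrete, the sketch below shows one way the timezone_granularity and approved_sets settings could be interpreted. Both helpers (format_oai_date and is_approved) are hypothetical names invented for illustration, and the matching rule (a record passes if any of its setSpec values is approved) is an assumption rather than a description of the scrapi code.

```python
from datetime import date


def format_oai_date(day, timezone_granularity=False):
    """Hypothetical helper: format a date for an OAI request.

    With timezone_granularity enabled, 'T00:00:00Z' is appended
    to the plain YYYY-MM-DD date.
    """
    text = day.isoformat()
    if timezone_granularity:
        text += 'T00:00:00Z'
    return text


def is_approved(set_specs, approved_sets):
    """Hypothetical helper: check a record against the approved sets.

    Assumption: a record passes if at least one of its setSpec
    values appears in the approved list.
    """
    return any(spec in approved_sets for spec in set_specs)


approved = ['csusymp2009', 'acct_fac', 'aerosp', 'aero_fac']

print(format_oai_date(date(2015, 4, 11)))
print(format_oai_date(date(2015, 4, 11), timezone_granularity=True))
print(is_approved(['aerosp'], approved))
print(is_approved(['some_other_set'], approved))
```

If your provider rejects requests that use the longer timestamp form, leaving timezone_granularity off keeps the dates in plain YYYY-MM-DD format.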

SHARE is a project of the ARL, AAU, and APLU. Development of the SHARE Notification Service is being carried out in partnership with the Center for Open Science and is supported by generous funding from The Institute of Museum and Library Services (IMLS) and the Alfred P. Sloan Foundation.