Class-based harvester creation
If you're creating a harvester for a service that uses OAI-PMH, you can use the scrapi classes to build a harvester that will gather data and send normalized documents through the SHARE pipeline.
You can find the base class definition for the OAI-PMH harvester in the scrapi codebase, in the scrapi.base module.
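The exact implementation lives in scrapi, but as a rough sketch, the interface inferred from the usage example in step 1 below looks something like this (signatures are simplified and the method bodies omitted):

```python
# Rough sketch of the OAIHarvester interface, inferred from the usage
# example in step 1 below -- the real class in scrapi.base does the actual work.
class OAIHarvester(object):
    def __init__(self, name, base_url, property_list=None, approved_sets=None):
        self.name = name                    # shows up in the "source" field
        self.base_url = base_url            # everything before the "?" in request URLs
        self.property_list = property_list  # fields outside the main outer schema
        self.approved_sets = approved_sets  # setSpec values allowed through

    def harvest(self, *args, **kwargs):
        """Fetch raw records from the provider's OAI-PMH endpoint."""

    def normalize(self, *args, **kwargs):
        """Map a raw record onto the normalized SHARE schema."""
```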
To create a class-based harvester, follow these steps:
- Create a file named `__init__.py` where you will create an instance of the harvester class. Your `__init__.py` will have three main parts:
  - The imports section at the top, where you'll import the base OAI harvester class.
  - Your instance of the harvester class, with some key areas defined:
    - the name of your provider (as it will show up in the source field)
    - the base URL where you will make your OAI requests; it should include everything before the ? in the request URL (see the sanity check after the example below)
    - the property list: a list of every field that does not match an item in the main outer schema
    - a list of "approved sets": if your provider marks the items that should make their way into the notification service with particular "setSpec" entries, list the approved "setSpec" values here (the sanity check below shows how to discover them). Only entries whose setSpec is on the approved list will be normalized and sent to the Notification Service.
  - The consume and normalize definitions. These are very simple: they just point to your harvester instance's harvest and normalize methods.
Here's an example of what your `__init__.py` file might look like:

```python
from __future__ import unicode_literals

from scrapi.base import OAIHarvester

calpoly = OAIHarvester(
    name='calpoly',
    base_url='http://digitalcommons.calpoly.edu/do/oai/',
    property_list=['type', 'source', 'publisher', 'format', 'date'],
    approved_sets=[
        'csusymp2009',
        'acct_fac',
        'aerosp',
        'aero_fac',
    ]
)

consume = calpoly.harvest
normalize = calpoly.normalize
```
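Before settling on your configuration, it can help to hit the endpoint by hand and discover the available setSpec values. Here is a small standalone check (not part of scrapi; it only assumes the requests library already listed in requirements.txt):

```python
# Standalone sanity check for an OAI-PMH endpoint -- not part of scrapi.
# Note how every request is just base_url plus query parameters after the "?".
import requests

BASE_URL = 'http://digitalcommons.calpoly.edu/do/oai/'

# "Identify" is the simplest OAI-PMH verb; a valid endpoint answers
# with XML describing the repository.
identify = requests.get(BASE_URL, params={'verb': 'Identify'})
print(identify.text[:500])

# "ListSets" returns the setSpec values the provider exposes -- the same
# values you would whitelist in approved_sets.
list_sets = requests.get(BASE_URL, params={'verb': 'ListSets'})
print(list_sets.text[:2000])
```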
- Create your requirements.txt and dev-requirements.txt files. They should be the same as those of other class-based harvesters; once they're in place, running `pip install -r dev-requirements.txt` will install everything, including scrapi itself. Here are examples of what the files should look like:
  - requirements.txt

    ```
    lxml==3.4.0
    requests==2.4.1
    nameparser==0.3.3
    python-dateutil==2.2
    ```

  - dev-requirements.txt

    ```
    git+http://github.com/chrisseto/scrapi.git
    -r requirements.txt
    ```
- Create a GitHub repo for your harvester, and add all of your files. For reference, here's the GitHub repo for the CalPoly harvester: https://github.com/erinspace/CalPoly
- Create the configuration file for your harvester. It's a JSON file with a number of parameters:
  - days: the range of weekdays on which your harvester should run
  - hour: the hour of the day at which to run the harvester
  - minute: the minute of that hour at which to run the harvester
  - shortName: the name that will display as "source" in search results
  - longName: the more formal name of your source
  - url: the location of your harvester's GitHub repo
  - fileFormat: the data format of your raw files; all OAI-PMH records are in XML
  - collisionCategory: for internal use, just enter 1
{ "days": "mon-sun", "hour": "23", "minute": "59", "shortName": "calpoly", "longName": "Digital Commons at Cal Poly", "url": "https://github.com/erinspace/CalPoly.git", "fileFormat": "xml", "collisionCategory": 1 }
- Create a pull request to add your configuration file to the ScrapiConsumers repo.