x Harvest History
The harvesting history is no longer tracked, per se. Instead, we track harvest and indexing information in the CartoDB tables https://vertnet.cartodb.com/tables/resource_staging and https://vertnet.cartodb.com/tables/index_history.
AWS starting charges: $17.50
Created new instance in Zone us-east-1b. Created extra 50GB Standard Volume and attached it.
Before: df .
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdf       51606140   184268  48800432   1% /mnt/beast
Sync'd 134 resources from resource_staging.
Created ~/resource_list.txt with the following:
http://iptsnomnh.cloudapp.net:8080/ipt/resource.do?r=eggs
http://iptsnomnh.cloudapp.net:8080/ipt/resource.do?r=nest
http://gbif.rom.on.ca:8180/ipt/resource.do?r=vposteology
http://gbif.rom.on.ca:8180/ipt/resource.do?r=herps
http://gbif.rom.on.ca:8180/ipt/resource.do?r=fishes
http://gbif.rom.on.ca:8180/ipt/resource.do?r=mamm
http://gbif.rom.on.ca:8180/ipt/resource.do?r=birdsnonpass
http://gbif.rom.on.ca:8180/ipt/resource.do?r=birdspass
http://ipt.vertnet.org:8080/ipt/resource.do?r=cumv_rept
http://ipt.vertnet.org:8080/ipt/resource.do?r=cumv_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=cumv_fish
http://ipt.vertnet.org:8080/ipt/resource.do?r=cumv_amph
http://ipt.vertnet.org:8080/ipt/resource.do?r=cumv_mamm
http://ipt.vertnet.org:8080/ipt/resource.do?r=dmns_mamm
http://ipt.vertnet.org:8080/ipt/resource.do?r=dmns_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=mlz_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=mlz_mamm
http://ipt.vertnet.org:8080/ipt/resource.do?r=msb_mamm
http://ipt.vertnet.org:8080/ipt/resource.do?r=msbobs_mamm
http://ipt.vertnet.org:8080/ipt/resource.do?r=msb_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=mvz_mammal
http://ipt.vertnet.org:8080/ipt/resource.do?r=mvzobs_mammal
http://ipt.vertnet.org:8080/ipt/resource.do?r=mvzobs_herp
http://ipt.vertnet.org:8080/ipt/resource.do?r=mvzobs_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=mvz_herp
http://ipt.vertnet.org:8080/ipt/resource.do?r=mvz_egg
http://ipt.vertnet.org:8080/ipt/resource.do?r=mvz_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=mvz_hild
http://ipt.vertnet.org:8080/ipt/resource.do?r=uam_herp
http://ipt.vertnet.org:8080/ipt/resource.do?r=uam_fish
http://ipt.vertnet.org:8080/ipt/resource.do?r=uamobs_mamm
http://ipt.vertnet.org:8080/ipt/resource.do?r=uam_es
http://ipt.vertnet.org:8080/ipt/resource.do?r=uam_mamm
http://ipt.vertnet.org:8080/ipt/resource.do?r=uam_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=uwbm_herp
http://ipt.vertnet.org:8080/ipt/resource.do?r=uwymv_mamm
http://ipt.vertnet.org:8080/ipt/resource.do?r=uwymv_herp
http://ipt.vertnet.org:8080/ipt/resource.do?r=uwymv_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=wnmu_fish
http://ipt.vertnet.org:8080/ipt/resource.do?r=wnmu_bird
http://ipt.vertnet.org:8080/ipt/resource.do?r=wnmu_mamm
http://ipt.nhm.ku.edu/ipt/resource.do?r=kubi_mammals
http://ipt.nhm.ku.edu/ipt/resource.do?r=kubi_ichthyology
http://ipt.nhm.ku.edu/ipt/resource.do?r=kubi_ichthyology_tissue
http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-mammals
http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-amphibians
http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-fish
http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-reptiles
http://ipt.vertnet.org:8080/ipt/resource.do?r=dmnh_birds
http://ipt.vertnet.org:8080/iptstrays/resource.do?r=nmnh_vert_paleo
"Harvesting 50 resources"
Harvested through KU Fish, then...
"Done harvesting kubi_ichthyology_tissue"
"Copying /mnt/beast/2014-03-27/kubi_ichthyology_tissue-49c54fcf-d672-49cb-86ed-6e1e157f9ba4 to gs://vertnet-harvesting/data/2014-03-27/"
"Downloading records from http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-mammals"
"Error harvesting" "http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-mammals" "The archive given is a folder with more or less than 1 data files having a txt or csv suffix"
"ERROR: Resource http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-mammals (The archive given is a folder with more or less than 1 data files having a txt or csv suffix)"
UnsupportedArchiveException The archive given is a folder with more or less than 1 data files having a txt or csv suffix
org.gbif.dwc.text.ArchiveFactory.openArchive (ArchiveFactory.java:318)
Removed all harvested resources from resource_list.txt and removed OSUM Mammals as well. Then ran harvester again:
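The pruning step above can be sketched in Python. This is only an illustration: the log does not say how resource_list.txt was actually edited, and the helper names `resource_name` and `prune_resource_list` are hypothetical. It assumes each entry is an IPT URL whose `r` query parameter is the resource short name.

```python
from urllib.parse import urlparse, parse_qs


def resource_name(url):
    """Return the value of the IPT `r` query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("r")
    return values[0] if values else None


def prune_resource_list(urls, done, skip=()):
    """Keep only URLs whose resource name is neither harvested nor skipped."""
    excluded = set(done) | set(skip)
    return [u for u in urls if resource_name(u) not in excluded]


# A few entries taken from the list above, for demonstration only.
urls = [
    "http://ipt.nhm.ku.edu/ipt/resource.do?r=kubi_ichthyology_tissue",
    "http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-mammals",
    "http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-amphibians",
    "http://ipt.vertnet.org:8080/ipt/resource.do?r=dmnh_birds",
]
remaining = prune_resource_list(
    urls, done=["kubi_ichthyology_tissue"], skip=["osum-mammals"]
)
print("\n".join(remaining))
```

Matching on the `r` parameter rather than the whole URL keeps the check independent of which IPT host serves the resource.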
"Harvesting 5 resources"
Harvested DMNH and NMNH Vert Paleo, then failed on OSUM Amphibians. Abandoning all OSUM resources until they can be fixed.
Then added NMNH Herps and NMNH Mammals to resource_list.txt and ran the harvester again to see if it will process these two resources.
"Harvesting 2 resources"
Failed to harvest either one. Timing out and logging off server.
After harvest, without the above resources, df . shows:
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdf       51606140 10910508  38074192  23% /mnt/beast
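As a rough check on volume sizing, the two df readings for this harvest can be compared directly. The sketch below is an illustration only: `parse_df_line` is a hypothetical helper, and it assumes the default df layout with 1K blocks as shown in this log.

```python
def parse_df_line(line):
    """Parse one data row of `df` output (1K blocks) into a dict."""
    device, blocks, used, available, use_pct, mount = line.split()
    return {
        "device": device,
        "blocks": int(blocks),
        "used": int(used),
        "available": int(available),
        "use_pct": int(use_pct.rstrip("%")),
        "mount": mount,
    }


before = parse_df_line("/dev/xvdf 51606140 184268 48800432 1% /mnt/beast")
after = parse_df_line("/dev/xvdf 51606140 10910508 38074192 23% /mnt/beast")

# Space consumed by the harvest, in 1K blocks and in GiB.
growth_kb = after["used"] - before["used"]
growth_gib = growth_kb / 1024 ** 2
print("harvest wrote %d KB (about %.1f GiB)" % (growth_kb, growth_gib))
```

By this estimate the harvest, excluding the failed resources, wrote a little over 10 GiB to /mnt/beast.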
AWS charges after harvest:
AWS starting charges: $0.16
Created new instance in Zone us-east-1c. Created extra 100GB Standard Volume and attached it.
Before: df .
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdf      103212320   192248  97777192   1% /mnt/beast
Sync'd 133 resources from resource_staging. sdnhm_herps had a syntax error with an unclosed \div. Added sdnhm_herps to the resource table manually.
"Harvesting 134 resources"
SNOMNH Tissues and Mammals have the same multi-file problem as the Danish resources. Set IPT to false in resource_staging.
NMNH Mammals and Herps stall during the Gulo harvest; not sure why yet.
Put remaining resource URLs in /home/ubuntu/resource_list.txt, then ran:
gulo.harvest=> (harvest-all "/mnt/beast" :path-file "/home/ubuntu/resource_list.txt")
"Harvesting 23 resources"
"Downloading records from http://ipt.vertnet.org:8080/iptstrays/resource.do?r=nmnh_mammals"
"581549 records found"
...
Without the 2 DanBIF resources, 2 SNOMNH resources, and 2 NMNH resources, the 100GB extra volume on EC2 is at 37% capacity when harvesting is done.
126 resources, 10M? records, shard_count=32 processing_rate=100, 2d 5h?
http://tuco.vertnet-portal.appspot.com/mr/indexall?shard_count=32&path=vertnet-harvesting/data/2014-02-06*
SUCCESSFUL (omits http://danbif.au.dk/ipt/resource.do?r=aves_tanza and http://danbif.au.dk/ipt/resource.do?r=amphibians)
Index Log: http://tuco.vertnet-portal.appspot.com/mapreduce/detail?mapreduce_id=157881771083698067408
123 resources, 9265308 records, shard_count=8 processing_rate=100, 2d 5h
SUCCESSFUL (omits http://danbif.au.dk/ipt/resource.do?r=aves_tanza)
Index Log: http://index.vertnet-portal.appspot.com/mapreduce/detail?mapreduce_id=157897405661548EC449C
This job didn't finalize. No index was created.
From the logs: http://goo.gl/Fik6gi
2014-02-05 05:34:27.270 /mr/finalize 500 360ms 0kb AppEngine-Google; (+http://code.google.com/appengine) module=default version=index
0.1.0.2 - - [05/Feb/2014:05:34:27 -0800] "POST /mr/finalize HTTP/1.1" 500 329 "http://index.vertnet-portal.appspot.com/mapreduce/controller_callback/157897405661548EC449C" "AppEngine-Google; (+http://code.google.com/appengine)" "index.vertnet-portal.appspot.com" ms=361 cpu_ms=42 cpm_usd=0.000037 queue_name=default task_name=13969085809212968902 app_engine_release=1.8.9 instance=00c61b117c3a877f86c9fbfd083084ca68e57440
I 2014-02-05 05:34:27.196 Finalizing index job for resource vertnet-harvesting/data/2014-01-19*
E 2014-02-05 05:34:27.205 ApplicationError: 105
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1511, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
    return handler.dispatch()
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 547, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~vertnet-portal/index.372963992289942453/admin.py", line 51, in mapreduce_finalize
    files.finalize(job.write_path)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/files/file.py", line 567, in finalize
    f = open(filename, 'a', exclusive_lock=True, content_type=content_type)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/files/file.py", line 520, in open
    exclusive_lock=exclusive_lock)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/files/file.py", line 276, in __init__
    self._open()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/files/file.py", line 423, in _open
    self._make_rpc_call_with_retry('Open', request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/files/file.py", line 427, in _make_rpc_call_with_retry
    _make_call(method, request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/files/file.py", line 252, in _make_call
    _raise_app_error(e)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/files/file.py", line 177, in _raise_app_error
    raise ExistenceError(e)
ExistenceError: ApplicationError: 105
2014-01-11
124 resources, 10379768 records, 8518105 mapper-calls
FAILED
Index Log: http://index.vertnet-portal.appspot.com/mapreduce/detail?mapreduce_id=157904409240358332B7B
This harvest did not complete successfully because of the following error:
"Error harvesting" "http://danbif.au.dk/ipt/resource.do?r=aves_tanza" "The archive given is a folder with more or less than 1 data files having a txt or csv suffix"
"ERROR: Resource http://danbif.au.dk/ipt/resource.do?r=aves_tanza (The archive given is a folder with more or less than 1 data files having a txt or csv suffix)"
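The error message suggests a simple pre-check that could be run on an unpacked archive before handing it to the harvester. The sketch below follows only the wording of the message (exactly one file with a txt or csv suffix); it is not the logic inside org.gbif.dwc.text.ArchiveFactory, and the helper names and the file names in the demonstration are hypothetical.

```python
import os
import tempfile


def data_files(folder):
    """List files in `folder` whose names end in .txt or .csv (any case)."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith((".txt", ".csv"))
        and os.path.isfile(os.path.join(folder, name))
    )


def has_single_data_file(folder):
    """True when the folder holds exactly one txt/csv data file."""
    return len(data_files(folder)) == 1


# Demonstration with a throwaway folder holding two data files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("occurrence.txt", "multimedia.txt", "eml.xml"):
        open(os.path.join(tmp, name), "w").close()
    found = data_files(tmp)
    ok = has_single_data_file(tmp)

print(found, ok)
```

A resource that fails this check is a candidate for the multi-file problem and could be left out of resource_list.txt until its publisher fixes it.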
Full harvest of 9522748 records from 108 resources took 4h20m on 14 Nov 2013. The number of records in the original Darwin Core archives can be determined after harvest by running the following in the SQL panel of the CartoDB resource table:
SELECT sum(count) from resource
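The same query could in principle be issued outside the SQL panel through the CartoDB SQL API. The sketch below only builds the request URL and sends nothing. It assumes the account name `vertnet` (taken from the table URLs at the top of this page) and the `/api/v2/sql` endpoint with a `q` parameter; a private table would also need an API key, which is not shown. `sql_api_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def sql_api_url(account, query):
    """Build a CartoDB SQL API URL for `query` (assumed endpoint layout)."""
    base = "https://{}.cartodb.com/api/v2/sql".format(account)
    return base + "?" + urlencode({"q": query})


url = sql_api_url("vertnet", "SELECT sum(count) FROM resource")
print(url)
```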