
TODO List of databases that could be accessed with c14bazAAR #2

Closed
23 of 27 tasks
nevrome opened this issue Oct 19, 2017 · 33 comments

@nevrome (Member) commented Oct 19, 2017

Databases/collections behind login or paywalls, or not accessible without web scraping, will not be included. 😿

@joeroe (Contributor) commented Apr 2, 2018

@nevrome I have a copy of the PPND database transcribed into a CSV. I'm not sure if it's still being updated; is it worth including it just as a static file?

@nevrome (Member, Author) commented Apr 2, 2018

Sounds interesting, Joe! Is it OK with the creators of the database if you share it? In the end it's their decision.

@joeroe (Contributor) commented Apr 5, 2018

I haven't been in touch with them. I just scraped it from their website for another project. I will leave it up to you!
exoriente_ppnd.csv.txt

@nevrome (Member, Author) commented Apr 5, 2018

We're currently establishing a cooperation with @mcolobran (Miquel Colobran), who wrote some web scrapers for GoGet. He'll provide us with a stable download link for some databases that can only be accessed by crawling.

I'm still not 100% sure this is the way to go, though. The people who share their data only as HTML presumably do so deliberately. I think I should at least inform the authors. They should know that this is possible and that we are even trying to make it easier.

@dirkseidensticker (Contributor) commented Apr 5, 2018

I had a look at their website https://www.exoriente.org/associated_projects/ppnd.php

@joeroe I guess you compiled the list yourself, as I wasn't able to find a simple 'download all' button or anything similar, only links to OxCal-formatted snippets. Right?

For me, open access, or at least knowing what will be done with the data, is essential. If you don't mind, I could write to Marion Benz and ask her whether the database is still being updated and whether they would be willing to contribute something to c14bazAAR. I wouldn't want to include anything without the consent of the original creators, unless you, @joeroe, compiled your CSV manually based on the information on the web. In that case, in my view, you are the creator of your list, as (at least under German law) data are not copyrightable. If you set up a repo for the table, we'll be able to write a parser for it.

What are your thoughts?

@joeroe (Contributor) commented Apr 5, 2018

@dirkseidensticker So all I did was go through the website and copy the individual site tables into a single CSV. Copyright aside, it's not my data – I just thought that if you did want to incorporate PPND, I could save you doing that step. I agree that the polite thing to do is to ask the original compilers for permission to redistribute it first. Maybe they'll be able to simplify things.

Interesting ethical questions in general, though. Is there a qualitative difference between downloading tabular data and using c14bazAAR to reformat it, versus downloading hypertext data and using a scraper to reformat it? (Or a text editor, as I did?) If you publish scientific data in any form, aren't you implicitly agreeing to let it be reused without explicit permission, in return for acknowledgement?

I guess this is why they tell you to include a LICENSE file in repos!

@nevrome (Member, Author) commented Apr 5, 2018

There's even a legal difference between downloading predefined tables and web scraping: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
I was not aware of this. Now I would say: no web scraping of data for c14bazAAR. We can and should write to database providers to ask for a legally safe download option, as I did for RADON and CalPal. We should also discuss this with Miquel.

@mcolobran commented

Hello @dirkseidensticker and @nevrome.

In response to your concerns, which I also share:

The 'protocol' (proposal) to follow for adding a new database to GoGet and/or c14bazAAR could be:

  1. Check whether the database can be included, even if it requires web crawling, without effort by the owners.

  2. If so, ask the owners for permission (so yes, they have the final decision).

  3. Agree on how often they want the data to be downloaded by us.

  4. Inform them by mail when the data is ready to be used by GoGet and/or c14bazAAR.

What we can offer the owners (referring to point 2):

- They have to do nothing (or almost nothing) on their websites (no changes are required at all).
- A CSV file stored in GoGet (or any other place), so that their database is not overloaded. If they prefer, we can send it to them so that c14bazAAR loads the database from their site.
- The CSV file can be used by c14bazAAR directly (again, they have to do nothing).
- We guarantee their ownership (take a look at the fields in GoGet); in this sense we do not make the data "ours".
- We guarantee that crawling is done only on the data they make public on the web, and at a very low speed so as not to overload their database. Finally, we inform them of the IP address doing the task to avoid IDS and networking problems.


Benefits for the owners:

- They gain visibility (and thus citations and visits).
- We can automatically check for some errors in their data, improving its quality.
- Their website will get more visits, because GoGet (and c14bazAAR) can redirect researchers to their site if more detail or clarification is needed.
- GoGet publishes the database name (http://www.ibercrono.org/goget/databases.php). Our intention is not to "appropriate" the data but to spread their work.
- They are following the open data movement.

I'll answer some of the concerns you expressed inline.

> I think I should at least inform the authors. They should know that this is possible and that we are even trying to make it more easy.

The idea behind GoGet (and therefore c14bazAAR) is that the owners have to do nothing and make no changes to the data they publish on the Internet. If they have to make changes, it makes things more difficult.

> Maybe they'll be able to simplify things

In my humble opinion, if we want to spread their data, then the less work and fewer changes the owners have, the better the chances for all of us.

> Interesting ethical questions in general, though. Is there a qualitative difference between downloading tabular data and using c14bazAAR to reformat it, versus downloading hypertext data and using a scraper to reformat it? (Or a text editor, as I did?) If you publish scientific data in any form, aren't you implicitly agreeing to let it be reused without explicit permission, in return for acknowledgement?

About this: first of all, no data or image on the Internet is free, in the sense that it always has an author and can never be considered yours.

If you download tabular data (via a button on the website) and the owner gets an acknowledgement in return, it is nice to inform the owner, but you can do anything with it (except appropriating it, of course).

If one crawls the data, one has to be more careful. Maybe the author didn't have this purpose in mind. That's why they have to be informed before the data is spread (point 2 and following, to offer what they receive in exchange).

> I guess this is why they tell you to include a LICENSE file in repos!

Agreed about the LICENSE file.

> They also claimed that screen-scraping constitutes what is legally known as "Misappropriation and Unjust Enrichment", as well as being a breach of the web site's user agreement.

The legal point there is "enrichment". If you obtain profit, you are in trouble; it's considered a crime under the law of almost any country. If you spread something (a photocopied book, for example), you are only infringing intellectual property. As there is no profit from this data, the question comes down to credit and diffusion, which is not a big deal (although we have to be aware of it).

> these are early scraping decisions, and the theories of liability are not uniform, it is difficult to ignore a pattern emerging that the courts are prepared to protect proprietary content on commercial sites from uses which are undesirable to the owners of such sites.

Again, take a look at the text. They talk about commercial sites, and that implies money and profits... not our scenario.

> We can and should write to database providers for a legally safe download opportunity as I did for RADON and CalPal. We should also discuss this with Miquel

100% agree, with one small detail: the way the data is obtained has to be agreed upon. If we offer to spread their data with no effort required from the providers... we have a very good point on our side.

Sorry for the long mail.

                  M.

@nevrome nevrome changed the title databases to be queried TODO List of databases that could be accessed with c14bazAAR Dec 30, 2019
@yesdavid (Collaborator) commented Jan 22, 2020

I'd like to add two Palaeolithic databases that I need:

@SimonGoring (Member) commented

Hi, would you be interested in adding the 14C data from Neotoma (http://neotomadb)? We have a lot of radiocarbon dates, but not all are associated with archaeological records; many are age controls for stratigraphic sequences. We could probably build an API endpoint to serve specific archaeological records.

@nevrome (Member, Author) commented Apr 20, 2020

@SimonGoring Absolutely, yes!

So far we only used .csv or .xlsx-file based interfaces. Either the database authors put a final release on their website or zenodo, or the database website provides a daily release somewhere (e.g. https://radon.ufg.uni-kiel.de/radondaily.txt).

I'm open to other interfaces, but they should be reasonably fast. Our parser functions always download the whole database, so in your case that would be the archaeology subset. If you provide us with some sort of interface, we can do the rest. You can also submit a PR with a parser function yourself, following the documentation here.
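For orientation, here is a minimal sketch of such a file-based getter in R. The URL is made up, the column names loosely follow the RADON field list quoted below, and real getters in c14bazAAR do considerably more cleaning, so treat this only as an illustration of the pattern:

```r
# Sketch of a file-based getter, NOT the actual c14bazAAR
# implementation. `db_url` is a hypothetical daily-release CSV.
get_example_db <- function(db_url = "https://example.org/c14_daily.csv") {
  # download the whole database in one go
  db_raw <- read.csv(db_url, stringsAsFactors = FALSE)

  # select and rename the variables of interest;
  # "c14age" and "c14std" are the technical minimum
  data.frame(
    labnr  = db_raw$LABNR,
    c14age = as.integer(db_raw$C14AGE),
    c14std = as.integer(db_raw$C14STD),
    lat    = as.numeric(db_raw$LATITUDE),
    lon    = as.numeric(db_raw$LONGITUDE)
  )
}
```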

@SimonGoring (Member) commented

What are the most important fields for you? Our chronology structure is a bit complicated, but I can push through it to get something laid out for you. The example you linked shows:

ID, LABNR, C14AGE, C14STD, C13, MATERIAL, SPECIES, COUNTRY, SITE, CULTURE, PERIOD, FEATURETYPE, FEATURE, LATITUDE, LONGITUDE, REFERENCE

For the most part we don't have the Period and Culture references (as far as I know), but I think I can do most of the rest. The simplest would be to give you an endpoint that returns the CSV table; I'd prefer to give it to you in JSON, though :)

@SimonGoring (Member) commented

As a note, you can see a bit about how our geochronological controls are structured here:
http://open.neotomadb.org/dbschema/tables/geochronology.html.

@dirkseidensticker (Contributor) commented Apr 21, 2020

@SimonGoring we keep track of the variables provided within each data source using this table: https://github.com/ropensci/c14bazAAR/blob/master/data-raw/variable_reference.csv

The fields listed under each database's source column are converted over from the sources and might give you a good overview of what we compile into the c14_date_list. The entries in each database's column below line 31 are dropped, though.

@nevrome do we have documentation somewhere listing the minimum requirements for a c14_date_list?

@nevrome (Member, Author) commented Apr 22, 2020

The technical minimum requirement is just the uncalibrated age and its standard deviation:

necessary_vars <- c("c14age","c14std")

That's not really useful, unfortunately. I think @SimonGoring will find what he needs in the variable reference table already linked.

JSON would be OK for me as well. We'd probably have to add another dependency to simplify the parsing. I don't know what the cool kids use nowadays. rjson?
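If the JSON route is taken, the parsing step could look roughly like this. jsonlite is one commonly used package besides rjson, and the endpoint URL and record layout here are made up for illustration:

```r
# Sketch only: parse a hypothetical JSON endpoint that returns an
# array of records, e.g. [{"c14age": 4520, "c14std": 30}, ...].
library(jsonlite)

get_example_json <- function(url = "https://example.org/c14.json") {
  # fromJSON simplifies an array of objects into a data.frame
  db_raw <- jsonlite::fromJSON(url)
  data.frame(
    c14age = as.integer(db_raw$c14age),
    c14std = as.integer(db_raw$c14std)
  )
}
```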

@nevrome (Member, Author) commented May 8, 2020

@SimonGoring Did you come up with a prototype? Or did this turn out to be more complicated than it appeared to be? Let us know if we can do something to help out.

@SimonGoring (Member) commented

@nevrome It's just a question of timing :) I think we can do that; I just need to find the time to do it.

@zoometh (Contributor) commented Mar 4, 2021

@nevrome With colleagues, we are completing a radiocarbon dataset for the western and central Mediterranean covering the Meso/Neo transition. We have already defined the dataset layout and added a sample of the dataset on Zenodo, file 'c14dataset.csv'. The complete dataset will be ready in the coming days (or couple of weeks). I could write the getter function for c14bazAAR at the same time. Is that fine?

@zoometh (Contributor) commented Mar 4, 2021

Today the Zenodo repo name is 'Rdev', but it will be renamed 'NeoNet' soon. I will update it then.

@nevrome (Member, Author) commented Mar 4, 2021

@zoometh: Excellent! I'm looking forward to your pull request. You have seen the ToDo list, right? Let us know if anything is unclear.

@zoometh (Contributor) commented Mar 5, 2021

Yes, I've seen the TODO list. I'll prepare the function and let you know.

@zoometh (Contributor) commented Mar 5, 2021

Related to a recent PlosONE paper, a new c14 database has been deposited here: https://nakala.fr/10.34847/nkl.dde9fnm8

@mcolobran commented Mar 16, 2021 via email

@zoometh (Contributor) commented Mar 16, 2021

Actually, c14dataset.csv is a trial version posted here. It only covers the south of France during the Meso/Neo transition and still needs checking. I first made this file to have an example dataset for preparing a new getter function. The full, corrected dataset will cover the central and western Mediterranean for the same period. It will be ready in mid-April. @mcolobran I'll let you know.

@zoometh (Contributor) commented Mar 16, 2021

As soon as possible, I plan to create a getter function for this published dataset:

> Related to a recent PlosONE paper, a new c14 database has been deposited here: https://nakala.fr/10.34847/nkl.dde9fnm8

Best

@mcolobran commented Mar 17, 2021 via email

@zoometh (Contributor) commented Mar 22, 2021

@nevrome To integrate the bda C14 database, I have:

  • created a bda branch (local only?)
  • added values to / completed the files:
    • url_references.csv
    • variable_reference.csv
    • material_thesaurus.csv
    • country_thesaurus.csv
  • added bda to get_c14data()

What is the next step to get these changes into the c14bazAAR repo? This is the first time I have used branches...

@nevrome (Member, Author) commented Mar 22, 2021

Excellent! You can open a Pull Request with these changes now. This tutorial seems to describe the process: https://opensource.com/article/19/7/create-pull-request-github

Due to the architecture of c14bazAAR, the CRAN checks for your PR will unfortunately fail, so don't worry about them. I will take over and come back to you if something is missing.
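For orientation, the branch-and-PR steps from the linked tutorial boil down to something like the following. The fork URL and file paths are placeholders, and the final step (opening the Pull Request) happens in the GitHub web interface:

```shell
# Sketch of the fork-and-branch workflow; <your-user> is a placeholder.
git clone https://github.com/<your-user>/c14bazAAR.git
cd c14bazAAR
git checkout -b bda                  # create the feature branch
git add data-raw/ R/                 # stage the edited files
git commit -m "add bda database getter"
git push -u origin bda               # publish the branch, then open
                                     # the Pull Request on GitHub
```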

@zoometh (Contributor) commented Mar 22, 2021

I'm struggling with branch creation, but I guess it will work. Best

@zoometh (Contributor) commented Mar 22, 2021

I need more time

@zoometh (Contributor) commented Mar 23, 2021

I've just created a Pull Request. @nevrome, thanks for the tutorial!

@nevrome nevrome pinned this issue Jan 27, 2022
@zoometh (Contributor) commented Mar 7, 2022

I will create a getter function for the NeoNet dataset http://doi.org/10.5334/joad.87 in the coming weeks.

@nevrome (Member, Author) commented Apr 10, 2022

As the discussion in this issue is thoroughly outdated, I decided to open a new TODO list in #157.

@nevrome nevrome closed this as completed Apr 10, 2022
@nevrome nevrome unpinned this issue Apr 10, 2022