
TODO List of databases that could be accessed with c14bazAAR #2

Closed
23 of 27 tasks
nevrome opened this issue Oct 19, 2017 · 33 comments

@nevrome (Member) commented Oct 19, 2017

Databases/collections behind login or paywalls, or not accessible without web scraping, will not be included. 😿

@joeroe (Contributor) commented Apr 2, 2018

@nevrome I have a copy of the PPND database transcribed into a CSV. I'm not sure if it's still being updated; is it worth including it just as a static file?

@nevrome (Member, Author) commented Apr 2, 2018

Sounds interesting, Joe! Is it OK with the creators of the database if you share it? In the end it's their decision.

@joeroe (Contributor) commented Apr 5, 2018

I haven't been in touch with them. I just scraped it from their website for another project. I will leave it up to you!
exoriente_ppnd.csv.txt

@nevrome (Member, Author) commented Apr 5, 2018

We're currently establishing a cooperation with @mcolobran (Miquel Colobran), who wrote some web scrapers for GoGet. He'll provide us with a stable download link for some databases that can only be accessed by crawling.

I'm still not 100% sure this is the way to go, though. The people who share their data only as HTML presumably do so deliberately. I think I should at least inform the authors. They should know that this is possible and that we are even trying to make it easier.

@dirkseidensticker (Contributor) commented Apr 5, 2018

I had a look at their website https://www.exoriente.org/associated_projects/ppnd.php

@joeroe I guess you compiled the list yourself, as I wasn't able to find a simple 'download all' button or anything similar, only links to OxCal-formatted snippets. Right?

For me, open access, or at least knowing what will be done with the data, is essential. If you don't mind, I could write to Marion Benz and ask her whether the database is still being updated and whether they would be willing to contribute something to c14bazAAR. I wouldn't want to include anything without the consent of the original creators, unless you, @joeroe, compiled your CSV manually based on the information on the web. In that case, in my view, you are the creator of your list, as (at least under German law) data are not copyrightable. If you set up a repo for the table, we'll be able to write a parser for it.

What are your thoughts?

@joeroe (Contributor) commented Apr 5, 2018

@dirkseidensticker So all I did was go through the website and copy the individual site tables into a single CSV. Copyright aside, it's not my data – I just thought that if you did want to incorporate PPND, I could save you doing that step. I agree that the polite thing to do is to ask the original compilers for permission to redistribute it first. Maybe they'll be able to simplify things.

Interesting ethical questions in general, though. Is there a qualitative difference between downloading tabular data and using c14bazAAR to reformat it, versus downloading hypertext data and using a scraper to reformat it? (Or a text editor, as I did?) If you publish scientific data in any form, aren't you implicitly agreeing to let it be reused without explicit permission, in return for acknowledgement?

I guess this is why they tell you to include a LICENSE file in repos!

@nevrome (Member, Author) commented Apr 5, 2018

There's even a legal difference between downloading predefined tables and web scraping: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
I was not aware of this. Now I would say: no web scraping of data for c14bazAAR. We can and should write to database providers to ask for a legally safe download option, as I did for RADON and CalPal. We should also discuss this with Miquel.

@mcolobran commented

Hello @dirkseidensticker and @nevrome.

In response to your concerns, which I also share:

The 'protocol' (proposal) to follow for adding a new database to GoGet and/or c14bazAAR could be:

  1. Check whether the database can be included, even if it requires web crawling, without effort by the owners.

  2. If so, ask the owners for permission (so yes, they have the final decision).

  3. Agree on how often they want the data to be downloaded by us.

  4. Inform them by mail when the data is ready to be used by GoGet and/or c14bazAAR.

What we can offer the owners (referring to point 2):

- They have to do nothing (or almost nothing) on their websites (no changes are required at all).
- A CSV file stored in GoGet (or any other place), so that their database is not overloaded. If they prefer, we can send it to them so that c14bazAAR loads the database from their site.
- The CSV file can be used by c14bazAAR directly (again, they have to do nothing).
- We guarantee their ownership (take a look at the fields in GoGet); in this sense we do not make the data "ours".
- We guarantee that crawling is done only on the data they make public on the web, and at a very low speed so as not to overload their database. Finally, we inform them of the IP address doing the task to avoid IDS and networking problems.


Benefits for the owners:

- They gain visibility (and thus citations and visits).
- We can automatically check for some errors in their data, improving its quality.
- Their website will get more visits, because GoGet (and c14bazAAR) can redirect researchers to their site if more detail or clarification is needed.
- GoGet publishes the database name (http://www.ibercrono.org/goget/databases.php). Our intention is not to "appropriate" the data but to spread their work.
- They are following the open data movement.

I'll answer some of the concerns you expressed inline.

> I think I should at least inform the authors. They should know that this is possible and that we are even trying to make it more easy.

The idea behind GoGet (and therefore c14bazAAR) is that the owners have to do nothing and make no changes to the data they publish on the Internet. If they have to make changes, it makes things more difficult.

> Maybe they'll be able to simplify things

In my humble opinion, if we want to spread their data, then the less work and fewer changes the owners have, the better the chances for all of us.

> Interesting ethical questions in general, though. Is there a qualitative difference between downloading tabular data and using c14bazAAR to reformat it, versus downloading hypertext data and using a scraper to reformat it? (Or a text editor, as I did?) If you publish scientific data in any form, aren't you implicitly agreeing to let it be reused without explicit permission, in return for acknowledgement?

About this: first of all, no data or image on the Internet is free, in the sense that it always has an author and can never be considered yours.

If you download tabular data (via a button on the website) and the owner gets an acknowledgement in return, it is nice to inform the owner, but you can do anything with it (except appropriating it, of course).

If one crawls the data, one has to be more careful. Maybe the author didn't have this purpose in mind. That's why they have to be informed before the data is spread (point 2 and following, to offer what they receive in exchange).

> I guess this is why they tell you to include a LICENSE file in repos!

Agreed about the LICENSE file.

> They also claimed that screen-scraping constitutes what is legally known as "Misappropriation and Unjust Enrichment", as well as being a breach of the web site's user agreement.

The legal point there is "enrichment". If you obtain profit, you are in trouble; it's considered a crime under the law of almost any country. If you spread something (a photocopied book, for example), you are only infringing intellectual property. As there is no profit from this data, the question comes down to credit and diffusion, which is not a big deal (although we have to be aware of it).

> these are early scraping decisions, and the theories of liability are not uniform, it is difficult to ignore a pattern emerging that the courts are prepared to protect proprietary content on commercial sites from uses which are undesirable to the owners of such sites.

Again, take a look at the text. They talk about commercial sites, and that implies money and profits... not our scenario.

> We can and should write to database providers for a legally safe download opportunity as I did for RADON and CalPal. We should also discuss this with Miquel

100% agree, with one small detail: the way the data is obtained has to be agreed upon. If we offer to spread their data with no effort required from the providers... we have a very good point on our side.

Sorry for the long mail.

                  M.

@nevrome nevrome changed the title databases to be queried TODO List of databases that could be accessed with c14bazAAR Dec 30, 2019
@yesdavid (Collaborator) commented Jan 22, 2020

I'd like to add two Palaeolithic databases that I need:

@SimonGoring (Member) commented

Hi, would you be interested in adding the 14C data from Neotoma (http://neotomadb)? We have a lot of radiocarbon dates, but not all are associated with archaeological records; many are age controls for stratigraphic sequences. We could probably build an API endpoint to serve specific archaeological records.

@nevrome (Member, Author) commented Apr 20, 2020

@SimonGoring Absolutely, yes!

So far we only used .csv or .xlsx-file based interfaces. Either the database authors put a final release on their website or zenodo, or the database website provides a daily release somewhere (e.g. https://radon.ufg.uni-kiel.de/radondaily.txt).

I'm open to other interfaces, but they should be reasonably fast. Our parser functions always download the whole database, so in your case that would be the archaeology subset. If you provide us with some sort of interface, we can do the rest. You can also submit a PR with a parser function yourself, following the documentation here.
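For orientation, here is a minimal sketch of such a file-based getter in R. The URL is made up, the column names loosely follow the RADON field list quoted below, and real getters in c14bazAAR do considerably more cleaning, so treat this only as an illustration of the pattern:

```r
# Sketch of a file-based getter, NOT the actual c14bazAAR
# implementation. `db_url` is a hypothetical daily-release CSV.
get_example_db <- function(db_url = "https://example.org/c14_daily.csv") {
  # download the whole database in one go
  db_raw <- read.csv(db_url, stringsAsFactors = FALSE)

  # select and rename the variables of interest;
  # "c14age" and "c14std" are the technical minimum
  data.frame(
    labnr  = db_raw$LABNR,
    c14age = as.integer(db_raw$C14AGE),
    c14std = as.integer(db_raw$C14STD),
    lat    = as.numeric(db_raw$LATITUDE),
    lon    = as.numeric(db_raw$LONGITUDE)
  )
}
```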

@SimonGoring (Member) commented

What are the most important fields for you? Our chronology structure is a bit complicated, but I can push through it to get something laid out for you. The example you linked shows:

ID, LABNR, C14AGE, C14STD, C13, MATERIAL, SPECIES, COUNTRY, SITE, CULTURE, PERIOD, FEATURETYPE, FEATURE, LATITUDE, LONGITUDE, REFERENCE

For the most part we don't have the Period and Culture references (as far as I know), but I think I can do most of the rest. The simplest would be to give you an endpoint that returns the CSV table; I'd prefer to give it to you in JSON, though :)

@SimonGoring (Member) commented

As a note, you can see a bit about how our geochronological controls are structured here:
http://open.neotomadb.org/dbschema/tables/geochronology.html.

@dirkseidensticker (Contributor) commented Apr 21, 2020

@SimonGoring we keep track of the variables provided within each data source using this table: https://github.com/ropensci/c14bazAAR/blob/master/data-raw/variable_reference.csv

The fields listed under each database's source column are converted over from the sources and might give you a good overview of what we compile into the c14_date_list. The entries in each database's column below line 31 are dropped, though.

@nevrome do we have documentation somewhere listing the minimum requirements for a c14_date_list?

@nevrome (Member, Author) commented Apr 22, 2020

The technical minimum requirement is just the uncalibrated age and its standard deviation:

necessary_vars <- c("c14age","c14std")

That's not really useful, unfortunately. I think @SimonGoring will find what he needs in the variable reference table already linked.

JSON would be OK for me as well. We'd probably have to add another dependency to simplify the parsing. I don't know what the cool kids use nowadays. rjson?
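If the JSON route is taken, the parsing step could look roughly like this. jsonlite is one commonly used package besides rjson, and the endpoint URL and record layout here are made up for illustration:

```r
# Sketch only: parse a hypothetical JSON endpoint that returns an
# array of records, e.g. [{"c14age": 4520, "c14std": 30}, ...].
library(jsonlite)

get_example_json <- function(url = "https://example.org/c14.json") {
  # fromJSON simplifies an array of objects into a data.frame
  db_raw <- jsonlite::fromJSON(url)
  data.frame(
    c14age = as.integer(db_raw$c14age),
    c14std = as.integer(db_raw$c14std)
  )
}
```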

@nevrome (Member, Author) commented May 8, 2020

@SimonGoring Did you come up with a prototype? Or did this turn out to be more complicated than it appeared to be? Let us know if we can do something to help out.

@SimonGoring (Member) commented

@nevrome It's just a question of timing :) I think we can do that; I just need to find the time to do it.

@zoometh (Contributor) commented Mar 4, 2021

@nevrome With colleagues, we are completing a radiocarbon dataset for the western and central Mediterranean covering the Meso/Neo transition. We have already defined the dataset layout and added a sample of the dataset on Zenodo, file 'c14dataset.csv'. The complete dataset will be ready in the coming days (or couple of weeks). I could write the getter function for c14bazAAR at the same time. Is that fine?

@zoometh (Contributor) commented Mar 4, 2021

Today the Zenodo repo name is 'Rdev', but it will be renamed 'NeoNet' soon. I will update it then.

@nevrome (Member, Author) commented Mar 4, 2021

@zoometh: Excellent! I'm looking forward to your pull request. You have seen the ToDo list, right? Let us know if anything is unclear.

@zoometh (Contributor) commented Mar 5, 2021

Yes, I've seen the TODO list. I'll prepare the function and let you know.

@zoometh (Contributor) commented Mar 5, 2021

Related to a recent PlosONE paper, a new c14 database has been deposited here: https://nakala.fr/10.34847/nkl.dde9fnm8

@mcolobran commented Mar 16, 2021 via email

@zoometh (Contributor) commented Mar 16, 2021

Actually, c14dataset.csv is a trial version posted here. It only covers the south of France during the Meso/Neo transition and still needs checking. I first made this file to have an example dataset for preparing a new getter function. The full, corrected dataset will cover the central and western Mediterranean for the same period. It will be ready in mid-April. @mcolobran I'll let you know.

@zoometh (Contributor) commented Mar 16, 2021

As soon as possible, I plan to create a getter function for this published dataset:

> Related to a recent PlosONE paper, a new c14 database has been deposited here: https://nakala.fr/10.34847/nkl.dde9fnm8

Best

@mcolobran commented Mar 17, 2021 via email

@zoometh (Contributor) commented Mar 22, 2021

@nevrome To integrate the bda C14 database, I have:

  • created a bda branch (local only?)
  • added values to / completed the files:
    • url_references.csv
    • variable_reference.csv
    • material_thesaurus.csv
    • country_thesaurus.csv
  • added bda to get_c14data()

What is the next step to get these changes into the c14bazAAR repo? This is the first time I have used branches...

@nevrome (Member, Author) commented Mar 22, 2021

Excellent! You can open a Pull Request with these changes now. This tutorial seems to describe the process: https://opensource.com/article/19/7/create-pull-request-github

Due to the architecture of c14bazAAR, the CRAN checks for your PR will unfortunately fail, so don't worry about them. I will take over and come back to you if something is missing.
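For orientation, the branch-and-PR steps from the linked tutorial boil down to something like the following. The fork URL and file paths are placeholders, and the final step (opening the Pull Request) happens in the GitHub web interface:

```shell
# Sketch of the fork-and-branch workflow; <your-user> is a placeholder.
git clone https://github.com/<your-user>/c14bazAAR.git
cd c14bazAAR
git checkout -b bda                  # create the feature branch
git add data-raw/ R/                 # stage the edited files
git commit -m "add bda database getter"
git push -u origin bda               # publish the branch, then open
                                     # the Pull Request on GitHub
```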

@zoometh (Contributor) commented Mar 22, 2021

I'm struggling with branch creation, but I guess it will work. Best

@zoometh (Contributor) commented Mar 22, 2021

I need more time

@zoometh (Contributor) commented Mar 23, 2021

I've just created a Pull Request. @nevrome, thanks for the tutorial!

@nevrome nevrome pinned this issue Jan 27, 2022
@zoometh (Contributor) commented Mar 7, 2022

I will create a getter function for the NeoNet dataset http://doi.org/10.5334/joad.87 in the coming weeks.

@nevrome (Member, Author) commented Apr 10, 2022

As the discussion in this issue is thoroughly outdated, I decided to open a new TODO list in #157.

@nevrome nevrome closed this as completed Apr 10, 2022
@nevrome nevrome unpinned this issue Apr 10, 2022