TODO List of databases that could be accessed with c14bazAAR #2
@nevrome I have a copy of the PPND database transcribed into a CSV. I'm not sure if it's still being updated; is it worth including it just as a static file?
Sounds interesting, Joe! Is it okay with the creators of the database if you share it? In the end it's their decision.
I haven't been in touch with them. I just scraped it from their website for another project. I'll leave it up to you!
We're currently establishing a cooperation with @mcolobran (Miquel Colobran), who wrote some web scrapers for GoGet. He'll provide us with a stable download link for some databases that can only be accessed by crawling. I'm still not 100% sure this is the way to go, though. The people who share their data only as HTML presumably do so deliberately. I think I should at least inform the authors: they should know that this is possible and that we are even trying to make it easier.
I had a look at their website https://www.exoriente.org/associated_projects/ppnd.php @joeroe I guess you compiled the list yourself, as I wasn't able to find a simple 'download all' button, only links to OxCal-formatted snippets. Right? For me, Open Access, or at least knowledge of what will be done with the data, is essential. If you don't mind, I could write to Marion Benz and ask whether the database is still being updated and whether they would be willing to lend something to c14bazAAR. I wouldn't want to include anything without the consent of the original creators. Unless you, @joeroe, compiled your CSV manually based on the information on the web. Then, in my view, you are the creator of your list, as (at least under German law) data are not copyrightable. In that case, if you set up a repo for the table, we'll be able to write a parser for it. What are your thoughts?
@dirkseidensticker So all I did was go through the website and copy the individual site tables into a single CSV. Copyright aside, it's not my data; I just thought that if you did want to incorporate PPND, I could save you that step. I agree that the polite thing to do is to ask the original compilers for permission to redistribute it first. Maybe they'll be able to simplify things. Interesting ethical questions in general, though. Is there a qualitative difference between downloading tabular data and using c14bazAAR to reformat it, versus downloading hypertext data and using a scraper to reformat it? (Or a text editor, as I did?) If you publish scientific data in any form, aren't you implicitly agreeing to let it be reused without explicit permission, in return for acknowledgement? I guess this is why they tell you to include a LICENSE file in repos!
There's even a legal difference between downloading predefined tables and web scraping: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
Hello @dirkseidensticker and @nevrome. In response to your concerns, which I also had and still have, the 'protocol' (proposal) to follow for adding a new database to GoGet and/or c14bazAAR could be:
- Agreement on how often they want the data to be downloaded by us.
- What we can offer to the owners (referring to point 2).
Below I answer some of the concerns you expressed:
The idea behind GoGet (and therefore c14bazAAR) is that the owners have to do nothing and make no changes to the data they already publish on the internet. If they have to make changes, it makes things more difficult.
In my humble opinion, if we want to spread their data, the less work and the fewer changes the owners have to make, the better chances we all have.
About this: first of all, no data or image on the internet is free, in the sense that it always has an author and can never be considered yours. If you download tabular data (via a button on the website) and the owner gets an acknowledgement in return, it is nice to inform the owner, but you can do anything with it (except appropriate it, of course). If one crawls the data, one has to be more careful; maybe the author didn't have this purpose in mind. That's why they have to be informed before the data is spread (point 2 and following, to offer what they receive in exchange).
Agreed about the LICENSE file.
The legal point here is "enrichment". If you make a profit, you are in trouble; it's considered a crime under almost any country's law. If you spread something (a photocopied book, for example) you are only infringing intellectual property. As there is no profit from this data, the question comes down to credit, diffusion and so on, which is not a big deal (although we have to be aware of it).
Again, take a look at the text: it talks about commercial sites, and that implies money and profits, which is not our scenario.
100% agree, with one small detail: the way the data is obtained has to be agreed on. If we offer to spread their data with no effort required from the providers, we have a very good point on our side. Sorry for the long mail.
I'd like to add two Palaeolithic databases that I need:
Hi, would you be interested in adding the 14C data from Neotoma (http://neotomadb)? We have a lot of radiocarbon dates, but not all are associated with archaeological records; many are age controls for stratigraphic sequences. We could probably build an API endpoint to serve specific archaeological records.
@SimonGoring Absolutely, yes! So far we have only used .csv or .xlsx file-based interfaces: either the database authors put a final release on their website or Zenodo, or the database website provides a daily release somewhere (e.g. https://radon.ufg.uni-kiel.de/radondaily.txt). I'm open to other interfaces, but they should be reasonably fast. Our parser functions always download the whole database, so in your case an archaeology subset. If you provide us with some sort of interface, we can do the rest. You can also submit a PR with a parser function yourself, following the documentation here.
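A minimal sketch of what such a file-based getter function could look like, assuming a hypothetical release URL and placeholder source column names; the target variable names (labnr, c14age, c14std, ...) follow the conventions documented in the package's variable_reference.csv:

```r
# Sketch only: the URL and the source column names (lab_number, bp_age, ...)
# are placeholders and would have to be mapped to the real database headers.
library(readr)
library(dplyr)

get_exampledb <- function(db_url = "https://example.org/exampledb_daily.csv") {
  # download the complete database dump provided by the maintainers
  raw <- read_csv(db_url, col_types = cols(.default = col_character()))

  # map the source headers onto the variable names used across c14bazAAR
  dates <- transmute(
    raw,
    labnr  = lab_number,           # laboratory number (placeholder column)
    c14age = as.integer(bp_age),   # uncalibrated age BP
    c14std = as.integer(bp_error), # standard deviation
    site   = site_name,
    lat    = as.numeric(latitude),
    lon    = as.numeric(longitude)
  )

  # convert the data.frame into the package's c14_date_list class
  c14bazAAR::as.c14_date_list(dates)
}
```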
What are the most important fields for you? Our chronology structure is a bit complicated, but I can push through it to get something laid out for you. The example you linked shows:
For the most part we don't have the Period and Culture references (as far as I know), but I think I can do most of the rest. My preference would be to give you an endpoint that simply returns the CSV table, although I'd prefer to give it to you as JSON :)
As a note, you can see a bit about how our geochronological controls are structured here:
@SimonGoring we keep track of the variables provided within each data source using this table: https://github.com/ropensci/c14bazAAR/blob/master/data-raw/variable_reference.csv @nevrome do we have documentation listing the minimum requirements for a c14_date_list?
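As an aside, a quick way to inspect that table from R; the raw-file URL below is derived from the repository link above and assumes the default branch layout:

```r
# Read the variable reference table straight from the c14bazAAR repository
# (raw.githubusercontent.com URL derived from the GitHub link above).
ref_url <- "https://raw.githubusercontent.com/ropensci/c14bazAAR/master/data-raw/variable_reference.csv"
variable_reference <- read.csv(ref_url, stringsAsFactors = FALSE)

# list which variables c14bazAAR tracks and which databases provide them
head(variable_reference)
```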
The technical minimum requirement is just the uncalibrated age and its standard deviation (see c14bazAAR/R/c14_date_list_basic.R, line 36 at commit 69ff900).
That's not really useful on its own, unfortunately. I think @SimonGoring will find what he needs in the variable reference table already linked. JSON would be OK for me as well; we would probably have to add another dependency to simplify the parsing. I don't know what the cool kids use nowadays.
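To illustrate both points, here is a rough sketch that parses a JSON response and builds a date list from the technical minimum named above; the endpoint URL and the field names in the response are hypothetical, and jsonlite is only one candidate for the extra dependency:

```r
# Sketch only: the endpoint and the response field names are hypothetical.
library(jsonlite)

# fetch the JSON response and flatten nested records into a data.frame
resp <- fromJSON("https://example.org/neotoma/archaeological_c14", flatten = TRUE)

# the technical minimum for a c14_date_list:
# uncalibrated age plus its standard deviation
dates <- data.frame(
  c14age = as.integer(resp$age),       # placeholder field name
  c14std = as.integer(resp$age_error)  # placeholder field name
)

c14bazAAR::as.c14_date_list(dates)
```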
@SimonGoring Did you come up with a prototype? Or did this turn out to be more complicated than it appeared to be? Let us know if we can do anything to help out.
@nevrome It's just a question of timing :) I think we can do that; I just need to find the time to do it.
@nevrome With colleagues, we are completing a radiocarbon dataset for the Western and Central Mediterranean covering the Meso/Neo transition. We have already defined the dataset layout and added a sample of the dataset on Zenodo, in the file 'c14dataset.csv'. The complete dataset will be ready in the coming days (or a couple of weeks). I could write the getter function for c14bazAAR at the same time. Is that fine?
Today, the Zenodo repo name is 'Rdev' but it will be renamed 'NeoNet' soon. I will update it then.
Yes, I've seen the TODO list. I'll prepare the function and inform you.
Related to a recent PLOS ONE paper, a new 14C database has been deposited here: https://nakala.fr/10.34847/nkl.dde9fnm8
Hello Thomas. The GoGet project is in progress again. GoGet is a global indexer of radiocarbon databases all around the world: http://www.ibercrono.org/goget/ It would be a pleasure to include your database in GoGet. According to your mails, you have a new version of c14dataset.csv. Please let me know where I can find it. As the "behind the scenes" parts are changing, a new database like yours would be a good test to check it all. Thanks in advance. M.
Actually, c14dataset.csv is a trial version posted here: https://zenodo.org/record/4582714#.YEFAWpvjK70. It only concerns the south of France during the Meso/Neo transition and needs checks. I first made this file to have a dataset example to prepare a new getter function (https://github.com/ropensci/c14bazAAR#adding-database-getter-functions). The full, corrected dataset will cover the Central and Western Mediterranean for the same period; it will be ready in mid-April. @mcolobran I'll let you know.
As soon as possible, I plan to create a getter function for this published dataset:
Best
Great! I'll wait for your news. M.
@nevrome To integrate the
What is the next step to have these changes in the c14bazAAR repo? This is the first time I've used branches...
Excellent! You can open a Pull Request with these changes now. This tutorial seems to describe the process: https://opensource.com/article/19/7/create-pull-request-github Due to the architecture of c14bazAAR, the CRAN checks for your PR will unfortunately fail, so don't worry about them. I will take over and come back to you if something is missing.
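For anyone more comfortable staying in R than on the command line, the usethis package offers an alternative to the shell-based workflow in that tutorial; a rough sketch, where the branch name is only an example and a configured GitHub token is assumed:

```r
# Fork-and-branch workflow from within R using usethis
# (an alternative to the command-line steps in the linked tutorial;
#  "neonet-getter" is just an example branch name).
library(usethis)

# fork ropensci/c14bazAAR on GitHub and clone the fork locally
create_from_github("ropensci/c14bazAAR", fork = TRUE)

# start a feature branch for the new getter function
pr_init(branch = "neonet-getter")

# ... edit files and commit as usual ...

# push the branch and open the Pull Request page in the browser
pr_push()
```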
I'm struggling with branch creation but, I guess, it will work. Best
I need more time.
I've just created a Pull Request. @nevrome, thanks for the tutorial!
I will create a getter function for the NeoNet dataset (http://doi.org/10.5334/joad.87) in the coming weeks.
As the discussion in this issue is incredibly outdated I decided to open a new TODO list in #157. |
Databases/collections behind logins or paywalls, or not accessible without web scraping, will not be included. 😿