exporting scraped PDF files #1

Open
dsanson opened this issue Mar 6, 2012 · 14 comments

Comments

@dsanson

dsanson commented Mar 6, 2012

I'd love to see support for exporting scraped PDF files. I poked around in the code a bit but couldn't make sense of what might need to be done, but I really don't know anything about the inner workings of Zotero.

Is it a matter of passing the exportFileData option when calling the bibtex.js translator? Or is it a matter of setting an increased delay, so that the export to bibtex occurs after zotero has scraped the PDF?

@jawj
Owner

jawj commented Mar 7, 2012

Thanks David. Yes, this has been requested a few times before, and would undoubtedly be handy.

I think you're on the right track -- indeed, the simplest thing might be to wait a while for a PDF to be scraped, and only then export the data, all at once. On the other hand, I think the nicer option would be to export the bibliographic data straight away, and then try to associate a scraped PDF with that data when available.

In either case, I've steered clear of adding this functionality until now just because it adds a fair bit of extra complexity. However, I'll keep this issue open, and perhaps I (or someone else) will find some time to address it in future.

@vancleve

vancleve commented Oct 6, 2012

I've just started using Zot2Bib and really like it and would also like the PDF scraping ability. I also dug into the Zotero and Zot2Bib code and was left a little at a loss since I'm not really familiar with the Zotero code and the documentation isn't super detailed. I would like to help get this feature working since it would simplify my workflow and would be a boon to Bibdesk users.

I'll keep poking around in the Zotero code, but George, if you have a preliminary sense for how this might be accomplished, any general outline would be super helpful. I can get a sense for how Zotero saves PDFs and other data, but how to tie this to a function that is called on modification of an item in the library is still unclear to me. From what I understand so far, there is a single notifier that gets called when an item is modified, and this might be when it's added for the first time or when a PDF is attached, and it's not clear how one could tie those two events together.

Thanks for the very helpful Zot2Bib!

@jawj
Owner

jawj commented Oct 8, 2012

I think you're on the right track. The key thing is that the listener function would have to keep a note of added publications, recognise when a PDF was added to one of them, and then be able to identify that publication to BibDesk too. The listener function currently isn't stateful at all, so this is quite a big step up in complexity. Good luck with it — afraid this is still a long way down my TODO list.
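
A minimal sketch of what a stateful listener along these lines might look like. Everything here is an assumption made for illustration: the `ItemEvent` shape, the `BibDeskSink` interface, and the `StatefulListener` class are hypothetical and do not correspond to Zotero's notifier API or to Zot2Bib's existing code.

```typescript
// Hypothetical event shape; Zotero's real notifier passes different arguments.
type ItemEvent =
  | { kind: "added"; itemId: number; citeKey: string }
  | { kind: "pdfAttached"; parentItemId: number; pdfPath: string };

// Hypothetical stand-in for whatever code talks to BibDesk.
interface BibDeskSink {
  addEntry(citeKey: string): void;
  attachPdf(citeKey: string, pdfPath: string): void;
}

class StatefulListener {
  // The state the current listener lacks: item id -> cite key already exported.
  private exported = new Map<number, string>();
  private sink: BibDeskSink;

  constructor(sink: BibDeskSink) {
    this.sink = sink;
  }

  handle(event: ItemEvent): void {
    if (event.kind === "added") {
      // Export the bibliographic data straight away and remember the item.
      this.sink.addEntry(event.citeKey);
      this.exported.set(event.itemId, event.citeKey);
      return;
    }
    // A PDF arrived: only act if it belongs to an item we exported earlier.
    const citeKey = this.exported.get(event.parentItemId);
    if (citeKey === undefined) return;
    this.sink.attachPdf(citeKey, event.pdfPath);
    this.exported.delete(event.parentItemId); // nothing more to wait for
  }
}
```

The map is the extra complexity mentioned above: it has to survive between callbacks, and the cite key stored in it is what would identify the publication to BibDesk.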

@vancleve

Ok, so I have a working callback that simply looks for modified items with PDF attachments and copies those attachments to another directory and opens them. Thus, all one has to do is drag the PDF onto the entry in Bibdesk and autofile does its magic. This is much better than going back to the website and finding the PDF download link. Is this of interest and if so, how should I go about contributing it? I haven't modified the preferences so that you can enter the directory into Zot2Bib through Firefox and turn the function on/off at will, but I can add that if the functionality is useful to others.

Also, I think I see a little better how to get the whole process working where Zot2Bib can add the PDF itself. In the zoteroCallback function, you check for attachments of an item that don't have an existing file (it must be downloading then). Add a field to that item containing the item.id of that attachment and don't add the entry to Bibdesk yet (I know this might be suboptimal, but it's much harder to add the entry now and associate the PDF later). Have a separate callback that runs when items are modified. This callback checks to see if the item has this new special field and if the attachment specified by the field has an existing file. If so, the file is done downloading and the entry can now be added to Bibdesk and the PDF auto filed.
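
The two-callback scheme just described could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Item` and `Attachment` shapes, the `pendingAttachmentId` field name, and the `Exporter` function are hypothetical, and a `null` file path stands in for "no existing file yet".

```typescript
// Hypothetical, simplified model; real Zotero items and attachments look different.
interface Attachment {
  id: number;
  filePath: string | null; // null while the file is still downloading (assumption)
}

interface Item {
  citeKey: string;
  attachments: Attachment[];
  pendingAttachmentId?: number; // the "special field" from the description above
}

// Stand-in for the code that hands an entry (and optional PDF) to BibDesk.
type Exporter = (citeKey: string, pdfPath: string | null) => void;

// Called when an item is first added.
function onItemAdded(item: Item, exportToBibDesk: Exporter): void {
  const downloading = item.attachments.filter((a) => a.filePath === null)[0];
  if (downloading) {
    item.pendingAttachmentId = downloading.id; // hold the export back for now
    return;
  }
  const ready = item.attachments[0];
  exportToBibDesk(item.citeKey, ready ? ready.filePath : null);
}

// Called whenever an item is modified.
function onItemModified(item: Item, exportToBibDesk: Exporter): void {
  if (item.pendingAttachmentId === undefined) return; // not waiting on anything
  const pendingId = item.pendingAttachmentId;
  const attachment = item.attachments.filter((a) => a.id === pendingId)[0];
  if (!attachment || attachment.filePath === null) return; // still downloading
  exportToBibDesk(item.citeKey, attachment.filePath);
  delete item.pendingAttachmentId; // make sure the entry is exported only once
}
```

Because the marker lives on the item itself, the modification callback needs no memory of its own; the cost is that nothing reaches Bibdesk until the download finishes.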

The problem I see with adding the bibtex entry first and trying to add the PDF later is that the user can intervene too easily with the entry while the PDF is downloading. For example, when the PDF is ready, you need some identifier in the bibtex entry that will allow Bibdesk to locate the entry to attach the PDF to. The user could accidentally modify that field though while the PDF is downloading. Attaching the PDF to the newest entry in the bibtex is also problematic since a user could add another entry before the PDF is done.

Anyway, any feedback is welcome!

@jawj
Owner

jawj commented Oct 18, 2012

Hi again @vancleve. I think the best thing would be for you to fork the repo, and make the changes you describe in your fork so I can have a look.

@vancleve

vancleve commented Jul 8, 2013

Hi @jawj. I've forked your repo (here) and added my PDF scraping code. There is an additional preference for the folder to save the PDF in (default is the Firefox download folder) and a preference for whether to open the PDF too. Right now, the whole thing has to wait while the PDF is downloading, which is a bit annoying but possibly unavoidable if BibDesk is not to be confused about which reference the PDF belongs to.

Anyway, take a look and let me know what you think! I know it's already been very useful to me.

@vancleve

Just a quick bump on this issue since I've updated my fork again. It now continues to export the bibtex even when the PDF download fails.
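
As a rough illustration of that fallback (not the code in the fork), the idea is to treat a failed download as "no PDF" and carry on. The names `exportEntry`, `fetchPdf`, and `Exporter` are hypothetical, and `fetchPdf` is assumed to return the saved file's path or throw on failure.

```typescript
// Stand-in for the code that hands an entry (and optional PDF) to BibDesk.
type Exporter = (citeKey: string, pdfPath: string | null) => void;

// Exports the entry whether or not the PDF could be fetched.
// Returns true if a PDF was included, false if the export went out without one.
function exportEntry(
  citeKey: string,
  fetchPdf: () => string, // hypothetical: returns the PDF path or throws
  exportToBibDesk: Exporter
): boolean {
  let pdfPath: string | null = null;
  try {
    pdfPath = fetchPdf();
  } catch {
    pdfPath = null; // download failed; carry on without a PDF
  }
  exportToBibDesk(citeKey, pdfPath);
  return pdfPath !== null;
}
```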

@foice

foice commented Sep 16, 2021

Well, I am considering using Zotero in place of BibDesk, and I have to say that having the export of the bibliographic info, as it currently works, plus the location of the PDF file in the BibDesk info would be awesome.

As far as I can see, zot2bib is triggered on newly imported items and cannot be triggered to repeat the export, e.g. after the PDF file has been fetched. Am I missing something?

I think that in general it is useful to be able to repeat (and update or overwrite) the addition to bibdesk, for instance for any zotero item that I have updated.
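
To make the "repeat and update or overwrite" idea concrete, here is a small sketch of an export that can safely be run more than once for the same item. It assumes, purely for illustration, that entries are matched by cite key; `Entry` and `BibStore` are hypothetical names and model neither BibDesk nor zot2bib.

```typescript
// Hypothetical in-memory model of a bibliography keyed by cite key.
interface Entry {
  citeKey: string;
  title: string;
  pdfPath: string | null;
}

class BibStore {
  private entries = new Map<string, Entry>();

  // Insert the entry, or overwrite the existing one with the same cite key.
  upsert(entry: Entry): "added" | "updated" {
    const existed = this.entries.has(entry.citeKey);
    this.entries.set(entry.citeKey, entry);
    return existed ? "updated" : "added";
  }

  get(citeKey: string): Entry | undefined {
    return this.entries.get(citeKey);
  }
}
```

With this shape, a first export without a PDF followed by a second one after the PDF has been fetched leaves a single entry that carries the file location.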

@nathan-artist

@foice: If you plan to use Zotero in place of BibDesk (as opposed to with BibDesk), why not use Better BibTeX for Zotero? If you really plan to replace BibDesk with Zotero, you wouldn't need BibDesk at all.

@foice

foice commented Sep 19, 2021

I have https://github.com/retorquere/zotero-better-bibtex currently "on trial". The most likely outcome seems to be that I have to use it with BibDesk, because so far I can handle addition via command line only with BibDesk, plus there are a number of "in the field" tests I have not yet made on Zotero. Still, Zotero probably gives better keyword features (for my use).

Anyhow, this is all about "me". Beyond that, I think zot2bib being triggerable at will is an issue with its own standing, regardless of my use case. So I restate the case for having zot2bib to be called on already existing items: why should it not be possible?

At any rate, exporting the location of the PDF file to BibDesk's own field also seems a core feature.

@nathan-artist

@foice: Yes, that is a major limitation of Better BibTeX for Zotero: changes to the BibTeX are one-way from Zotero. One has to close a BibDesk database before editing the BibTeX outside of BibDesk, but at least it is possible to edit it under that condition.

You may want to open a separate issue for "having zot2bib to be called on already existing items", which seems out of scope for this issue.

@vancleve

Just FYI, I moved fully over to Zotero in 2019 because Bibdesk was just too slow with my big library. Zotero is quite a bit harder to add functionality to, compared to the script hooks in Bibdesk, but it's much more actively developed than Bibdesk.

@nathan-artist

nathan-artist commented Sep 20, 2021

@vancleve: I don't know what slowed down your BibDesk library, but in my case I figured out quickly when I started using BibDesk in 2008 that the linked file fields were slowing it down, and that using them was not going to scale to the many thousands of references that I foresaw having in the near future. Instead, I made a "Downloaded" checkbox field and an AppleScript that opens the downloaded file without any use of the linked file field. Now I have well over 30k references in one BibDesk database and it is lightning fast, much faster than Zotero.

Having said that, there are some amazing plugins for Zotero that shouldn't be ignored, and by using Zotero in place of BibDesk you have easy access to those plugins.

@vancleve

@nathan-artist, ah yes, that was probably it. I had over 10k refs with linked PDFs and it was terribly slow.

Still, there is some elegance to cutting out the zot2bib middle man and mostly just using one reference manager instead of two, even if Zotero is a bit slow.
