Add Google Scholar meta data #2191

Merged
merged 3 commits into from
May 6, 2021

Conversation

tunetheweb
Member

@tunetheweb tunetheweb commented May 5, 2021

Makes progress on #1325

I've added the meta data to the individual chapters, rather than the eBook, as most citations are for a chapter with a URL reference. This also avoids the issue of making the ebook HTML indexable (it's currently set to noindex as we prefer the PDF to be indexed, and it's VERY big to download regularly).

Example of meta data:

    <meta name="citation_title" content="The 2020 Web Almanac: HTTP/2 chapter">
    <meta name="citation_author" content="Andrew Galloni">
    <meta name="citation_author" content="Robin Marx">
    <meta name="citation_author" content="Mike Bishop">
    <meta name="citation_publication_date" content="2020/12/09">
    <meta name="citation_journal_title" content="The 2020 Web Almanac">
    <meta name="citation_volume" content="2020-English">
    <meta name="citation_technical_report_institution" content="HTTP Archive">

** Edit: Updated as discussed below, see #2191 (comment) for the latest version **

Any thoughts on citation_volume? Should we include the language? Or should we only add this meta data to the English-language versions, since they are the primary ones and Google Scholar doesn't seem to offer many options for translations, such as a citation_language tag or similar?

@tunetheweb tunetheweb requested a review from rviscomi May 5, 2021 11:53
@tunetheweb tunetheweb added the development Building the Almanac tech stack label May 5, 2021
@tunetheweb tunetheweb added this to the 2020 Backlog milestone May 5, 2021
@ibnesayeed
Contributor

Should we not use a sequence number for citation_volume instead of the year itself?

@ibnesayeed
Contributor

Also, should we include the citation_pdf_url metadata as well?

@tunetheweb
Member Author

Should we not use a sequence number for citation_volume instead of the year itself?

There doesn't seem to be any guidance on whether this needs to be a number or not (unless you have a reference for this?), and even if it does, the fact that the year is a number works for me and is easier to track than starting some other unique sequence number. Bigger question is what to do about the language?

Also, should we include the citation_pdf_url metadata as well?

I deliberately didn't because we don't have a unique PDF URL per chapter and they say:

Place each article and each abstract in a separate HTML or PDF file. At this time, we're unable to effectively index multiple abstracts on the same webpage or multiple papers in the same PDF file. Likewise, we're unable to index different sections of the same paper in different files. Each paper must have its own unique URL in order for it to be included in Google Scholar.

@ibnesayeed
Contributor

There doesn't seem to be any guidance on whether this needs to be a number or not (unless you have a reference for this?), and even if it does, the fact that the year is a number works for me and is easier to track than starting some other unique sequence number.

There might not be Google Scholar guidance on it, but this is established terminology for periodical literature:

Volume typically refers to the number of years the publication has been circulated, and issue refers to how many times that periodical has been published during that year. For example, the April 2011 publication of a monthly magazine first published in 2002 would be listed as, "volume 10, issue 4". -- https://en.wikipedia.org/wiki/Periodical_literature

Bigger question is what to do about the language?

I would omit the language aspect from volume. That metadata can be provided using some other means, if necessary.

@ibnesayeed
Contributor

I deliberately didn't because we don't have a unique PDF URL per chapter

Should we consider permalinking these? Because Chapter PDFs seem like an important outcome which should be discoverable from a stable URI.

@tunetheweb
Member Author

tunetheweb commented May 5, 2021

Perhaps we should just omit these two fields if we're not sure? They are not mandatory.

The concern with using a volume without the language is that many languages have the publication title just as {{ year }} Web Almanac, so without a volume that includes the language, many of them will look like the same publication when they are not really: they are translations of the same volume. Alternatively, we could include the language in the title? Or, as per my question in the initial comment, maybe we should just include this meta data in the English chapters for now?

Some PDF viewers allow deep linking (e.g. this is a deep link that opens the JavaScript chapter in Chrome's built-in PDF viewer) but many don't, and I'm not sure how Google Scholar handles these. I don't think it's right to include the same PDF across all the chapters, and this is optional, so definitely don't think we should include this link.

@nrllh you seem to be familiar with these, any thoughts?

@ibnesayeed
Contributor

ibnesayeed commented May 5, 2021

I would not pollute the title with language information. There should be other means to provide that info. The Internet Archive Scholar has support for searching literature in a specific language.

Perhaps we should just omit these two fields if we're not sure? They are not mandatory.

I am thinking, can we use volume for the yearly sequence number, and the issue field as the language specifier (where we can use the 2-letter short code for the language, like volume 3, issue en or something like that)? /cc @bnewbold for his thoughts.

I don't think it's right to include the same PDF across all the chapters, and this is optional, so definitely don't think we should include this link.

I was thinking along the lines of publishing the whole book as well as each chapter separately, which is in line with how academic publications are distributed. Moreover, page numbers can also be specified in the bibliography, which is also a standard practice.

@tunetheweb
Member Author

I'm not seeing the benefit of having the chapter as individual PDFs to be honest.

To me the Web Almanac works best as a website. It has links (including to raw data, queries and comments), interactive figures, an interactive ToC, is responsive, and font sizes can be increased or decreased as needed. Additionally, figure descriptions can be shown/hidden and there's navigation to jump around the site.

The Ebook PDF is basically the offline version, but it is a lesser experience for that. But still it's good because some (like me!) like the fact you can print as a book so that makes sense.

A PDF of a single chapter serves no real purpose, to be honest, and is the worst of both worlds, and I'm not sure I want to manage another version.

I do see the value of getting these listed in Google Scholar, as I do consider them well-researched scholarly documents that have been referenced in the past (and hopefully will continue to be). But Google Scholar is happy with HTML, so why do you think we need PDFs? Readers can always print to PDF if they want to.

Can you give me the compelling use case for chapter PDFs?

@ibnesayeed
Contributor

I am more than okay to not have chapter-specific PDFs. Distill is a good example of an interactive HTML-only publication. :)

@tunetheweb
Member Author

The more I think about it, the more I think we should only list the English version. That's the official document to quote for references. At least until there are better-supported citation_lang or citation_translation tags.

So in 98f0f6b I restricted the meta data to English, changed the volume to be the year minus 2018 (so 2019 is volume 1), and set the issue to the chapter number.

It now looks like this:

    <meta name="citation_title" content="The 2020 Web Almanac: HTTP/2 chapter">
    <meta name="citation_author" content="Andrew Galloni">
    <meta name="citation_author" content="Robin Marx">
    <meta name="citation_author" content="Mike Bishop">
    <meta name="citation_publication_date" content="2020/12/09">
    <meta name="citation_journal_title" content="The 2020 Web Almanac">
    <meta name="citation_volume" content="2">
    <meta name="citation_issue" content="22">
    <meta name="citation_technical_report_institution" content="HTTP Archive">
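
As an illustration only, the rules described in this comment (English only, volume counted from the 2019 edition, issue set to the chapter number) can be sketched in Python. The function name `scholar_meta_tags` and its parameters are hypothetical and are not taken from the PR's templates; the tag names and values come from the example above.

```python
from datetime import date
from html import escape

# Assumption taken from the comment above: the 2019 edition is volume 1.
FIRST_EDITION_YEAR = 2019


def scholar_meta_tags(lang, year, chapter_number, chapter_title, authors, published):
    """Return Google Scholar <meta> tags for a chapter, or [] for non-English pages."""
    if lang != "en":
        # Only the English chapters carry the meta data.
        return []
    fields = [("citation_title", f"The {year} Web Almanac: {chapter_title} chapter")]
    # One citation_author tag per author, in order.
    fields += [("citation_author", author) for author in authors]
    fields += [
        ("citation_publication_date", published.strftime("%Y/%m/%d")),
        ("citation_journal_title", f"The {year} Web Almanac"),
        ("citation_volume", str(year - FIRST_EDITION_YEAR + 1)),
        ("citation_issue", str(chapter_number)),
        ("citation_technical_report_institution", "HTTP Archive"),
    ]
    # Escape attribute values so quotes or ampersands cannot break the markup.
    return [f'<meta name="{name}" content="{escape(value)}">' for name, value in fields]


tags = scholar_meta_tags(
    "en", 2020, 22, "HTTP/2",
    ["Andrew Galloni", "Robin Marx", "Mike Bishop"],
    date(2020, 12, 9),
)
print("\n".join(tags))
```

Called with the HTTP/2 chapter's details, this prints the same nine tags as the example above; any other language code yields an empty list.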

@bnewbold

bnewbold commented May 5, 2021

Loosely held opinions, summoned by Sawood:

This work seems more like a "technical report", not "journal article". Unless you intend "to be a journal" in the long run, and register an ISSN, I would skip the volume, issue, and journal name fields, and just use citation_technical_report_institution and citation_technical_report_number. The format for the number can be whatever you want, though I would, eg, avoid whitespace. It is helpful to have the number in the document itself, or maybe in the URL/filename path, for clarity to readers. It does not need to be unique across institutions. HA-2020 or something like that?

The citation_language is the one I have seen for indicating language. Keep in mind Google Scholar is not actually the authority on these "Highwire Press" HTML tags, they are just one place you are trying to get indexed into. Tools like Zotero also use these tags and I think support many more. Maybe relevant here is citation_publisher (in addition to citation_technical_report_institution).

Regarding HTML, PDF, and chapters, seems up to you. Folks can cite individual chapters on their own. It is common in academia to catalog both a full book/encyclopedia/manuscript and the individual chapters, but that is usually because such books frequently have distinct authors for the individual chapters, and to be honest the books are really just collections of papers or essays. It also makes it easier to distribute chapters for course assignments, or to (sigh) sell access to individual chapters instead of the whole book. I would at least have a table of contents with anchor URLs for individual chapters. I wouldn't bother having separate PDFs for each chapter unless you also make separate HTML documents / landing pages, but sort of subjective.

I would recommend making a PDF of the whole work available, with page numbers as a lot of folks will still print out subsets or the whole thing for reading and marking up. For better or worse, a lot of folks who read a lot simply have technical or personal workflows that assume a PDF and/or ePub.

In your date field, I would use "ISO" style date, with dashes not slashes (eg, 2020-12-09). Maybe Google Scholar prefers otherwise.
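
For illustration, the technical-report alternative suggested in this comment could be sketched as follows. The function `technical_report_tags` is hypothetical; the tag names and the `HA-2020` number are the ones mentioned above, and the whitespace check simply encodes the "avoid whitespace" advice rather than any documented requirement.

```python
import re
from html import escape


def technical_report_tags(institution, report_number, language=None, publisher=None):
    """Build technical-report style <meta> tags instead of journal/volume/issue ones."""
    if re.search(r"\s", report_number):
        # Follows the advice above to avoid whitespace in the report number.
        raise ValueError("report number should not contain whitespace")
    fields = [
        ("citation_technical_report_institution", institution),
        ("citation_technical_report_number", report_number),
    ]
    # Optional tags mentioned in the comment; emitted only when a value is given.
    if language:
        fields.append(("citation_language", language))
    if publisher:
        fields.append(("citation_publisher", publisher))
    return [f'<meta name="{name}" content="{escape(value)}">' for name, value in fields]


print("\n".join(technical_report_tags("HTTP Archive", "HA-2020", language="en")))
```

Whether Google Scholar or other tools treat these tags differently from the journal-style ones is not established in this thread.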

@tunetheweb
Member Author

tunetheweb commented May 5, 2021

Thanks for the feedback @bnewbold! Comments inline:

Loosely held opinions, summoned by Sawood:

This work seems more like a "technical report", not "journal article". Unless you intend "to be a journal" in the long run, and register an ISSN, I would skip the volume, issue, and journal name fields, and just use citation_technical_report_institution and citation_technical_report_number. The format for the number can be whatever you want, though I would, eg, avoid whitespace. It is helpful to have the number in the document itself, or maybe in the URL/filename path, for clarity to readers. It does not need to be unique across institutions. HA-2020 or something like that?

For your background, this is an annual report on web practices produced by a collection of over 100 people and comprising 20+ chapters. We have so far published a 2019 and 2020 edition and have started work on the 2021 edition. So not just a once-off report. Each chapter looks at a facet of web development (CSS, JavaScript, Accessibility, Security...etc.) and is a well-researched and data-backed analysis of the state of that facet, by mining the HTTP Archive (crawls of over 7.5 million website home pages) and other resources. Each chapter is produced by separate authors (often well-known experts in each field) and data analysts. It was launched in late 2019 as a collection of 20 web pages, one for each chapter. Shortly afterwards we published a full PDF of the entire 2019 edition as an ebook download.

Chapters of the report have been cited in multiple other scholarly articles, and mostly they cite the chapter in question. For example, a web performance research article may cite the 2019 HTTP/2 chapter.

Citation screenshot

Not sure if that influences your opinion on whether we should use Volume and Issue or not?

The citation_language is the one I have seen for indicating language. Keep in mind Google Scholar is not actually the authority on these "Highwire Press" HTML tags, they are just one place you are trying to get indexed into. Tools like Zotero also use these tags and I think support many more. Maybe relevant here is citation_publisher (in addition to citation_technical_report_institution).

Interesting! Yes, it seems from a quick search that there's no reference document for Highwire Press or other similar tag formats 😞 So I have been concentrating on Google Scholar for now.

Still in two minds on adding meta tags for our translations though. As I say, the report is produced in English and then translated to 10+ languages, often only in part as we depend on volunteers for the translation. Only the 2019 Japanese edition is a complete translation and has therefore also been launched as a PDF ebook download, though a couple of languages (Japanese and Dutch) are close to finishing the 2020 translation, at which point they will also be made available as ebook PDF downloads. So to me the main references are the original English chapters.

Regarding HTML, PDF, and chapters, seems up to you. Folks can cite individual chapters on their own. It is common in academia to catalog both a full book/encyclopedia/manuscript and the individual chapters, but that is usually because such books frequently have distinct authors for the individual chapters, and to be honest the books are really just collections of papers or essays. It also makes it easier to distribute chapters for course assignments, or to (sigh) sell access to individual chapters instead of the whole book. I would at least have a table of contents with anchor URLs for individual chapters. I wouldn't bother having separate PDFs for each chapter unless you also make separate HTML documents / landing pages, but sort of subjective.

As stated above, current citations have mostly been at chapter level. The website has full navigation options, including a table of contents page for each language and year, and a separate chapter table of contents for each chapter. The HTML pages are printable thanks to a print stylesheet hiding unnecessary content like headers and footers, but only the whole edition is made available as a downloadable PDF.

I would recommend making a PDF of the whole work available, with page numbers as a lot of folks will still print out subsets or the whole thing for reading and marking up. For better or worse, a lot of folks who read a lot simply have technical or personal workflows that assume a PDF and/or ePub.

Yup already done, for 2019 (English), and 2020 (English), and also for 2019 (Japanese).

In your date field, I would use "ISO" style date, with dashes not slashes (eg, 2020-12-09). Maybe Google Scholar prefers otherwise.

Google Scholar specifically states that this format must be used:

Provide full dates in the "2010/5/12" format if available; or a year alone otherwise. This tag is required for inclusion in Google Scholar.
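
As a small sketch of that requirement, an ISO-style date can be converted to the slash-separated form before it is written into citation_publication_date. The helper name `scholar_date` is hypothetical. Its output is zero-padded, matching the PR's own `2020/12/09` example; the quoted guidance shows an unpadded example, and this sketch does not establish which of the two forms Google Scholar prefers.

```python
from datetime import date


def scholar_date(iso_date):
    """Convert an ISO 'YYYY-MM-DD' string to the slash-separated 'YYYY/MM/DD' form."""
    # date.fromisoformat raises ValueError on malformed input, which is useful here.
    return date.fromisoformat(iso_date).strftime("%Y/%m/%d")


print(scholar_date("2020-12-09"))
```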

src/templates/en/base_chapter.html (outdated review thread, resolved)
src/templates/es/base_chapter.html (outdated review thread, resolved)
@tunetheweb tunetheweb merged commit dec026b into main May 6, 2021
@tunetheweb tunetheweb deleted the google-scholar-support branch May 6, 2021 17:01