CTrees without (reserved) child files #14

@petermr

[See also https://github.com//issues/10 ]

Until recently, CTrees were generated either locally or through getpapers or quickscrape. The automatically generated files include at least one reserved file, such as fulltext.pdf, and CMine software used this to determine which directories in a CProject are actually CTrees. This was always recognised to be a heuristic. Recently, with bulk download of metadata from Crossref, we see many potential CTrees without reserved files, or even without any files. Here's a simple example:

├── PMC4678086
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   └── fulltext.xml
├── http_dx.doi.org_10.1001_jama.2016.7992
│   └── results.json
└── http_dx.doi.org_10.1007_s13201-016-0429-9

The first directory is retrieved by quickscrape from EPMC, and the heuristics identify it as a potential CTree. The other two come from getpapers on Crossref followed by quickscrape, which creates only metadata; they are currently not flagged as CTrees. The empty directory is created (I think) by quickscrape, which then fails to retrieve anything.

The original motivation for the heuristics is that we may introduce new reserved directories into a CProject, and users might also introduce non-CTree directories. There was also the idea of having a reserved file (e.g. metadata.json or log.xml) in every CTree directory. At present I favour this, and we should discuss what goes in it.

Currently I have added a switch

        cProject.setTreatAllChildDirectoriesAsCTrees(true);

which allows users to toggle this behaviour. I will also add results.json to the reserved files that flag "CTree-ness".
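To make the discussion concrete, here is a minimal sketch of the reserved-file heuristic combined with the switch. It is not the CMine code: the class `CTreeHeuristic`, its methods `looksLikeCTree` and `findCTrees`, and the exact contents of `RESERVED_FILES` are hypothetical names chosen for illustration. Only fulltext.pdf and results.json are taken from the text above; fulltext.xml is an assumption based on the example listing.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; hypothetical class, not part of CMine. */
public class CTreeHeuristic {

    // Assumed set of reserved file names. fulltext.pdf is named in this issue,
    // results.json is the proposed addition, fulltext.xml is a guess from the example.
    public static final Set<String> RESERVED_FILES =
            Set.of("fulltext.pdf", "fulltext.xml", "results.json");

    /**
     * Returns true if dir should be treated as a CTree.
     * With treatAllChildDirectoriesAsCTrees set, any directory qualifies;
     * otherwise the directory must contain at least one reserved file.
     */
    public static boolean looksLikeCTree(Path dir, boolean treatAllChildDirectoriesAsCTrees) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        if (treatAllChildDirectoriesAsCTrees) {
            return true;
        }
        for (String name : RESERVED_FILES) {
            if (Files.isRegularFile(dir.resolve(name))) {
                return true;
            }
        }
        return false;
    }

    /** Lists the immediate children of projectDir that pass the check above. */
    public static List<Path> findCTrees(Path projectDir, boolean treatAllChildDirectoriesAsCTrees)
            throws IOException {
        List<Path> cTrees = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(projectDir)) {
            for (Path child : children) {
                if (looksLikeCTree(child, treatAllChildDirectoriesAsCTrees)) {
                    cTrees.add(child);
                }
            }
        }
        return cTrees;
    }
}
```

Under these assumptions, the example project above would yield PMC4678086 and http_dx.doi.org_10.1001_jama.2016.7992 with the switch off (the latter only because results.json is in the reserved set), and all three directories with the switch on.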
