Add support for >150 new comics, 4 new comic hosts, and more #140

Merged on Apr 2, 2020 (2 commits)


Techwolfy
Contributor

This PR adds support for more than 150 new comics, plus all comics on 4 new comic-hosting sites. It also fixes various issues with more than 50 existing comics.

Sorry about the size of the pull request; let me know if there are any changes you'd like me to make.

New hosts:

New comics:

  • A&HClub
  • AbbysAgency
  • ADoemainOfOurOwn
  • AdventuresOfFifne
  • AlienDice and AlienDice/Legacy
  • AmbersNoBrainers
  • Anaria
  • AntiheroForHire
  • ApartmentForTwo
  • ATaleOfTails
  • BallerinaMafia
  • Bethellium
  • BeyondTheVeil
  • BittersweetCandyBowl
  • Bloodline
  • ButImACatPerson
  • CarryOn
  • CarryOn/AliceBlueAndTheGardensOfQ
  • CarryOn/LegendOfAnneBunny
  • CatenaManor
  • CavesAndCritters
  • Centralia2050
  • ClanOfTheCats
  • ClanOfTheCats/Reunion
  • Cloudscratcher
  • CollegeCatastrophe
  • ComicFury/DNA
  • ComicFury/Eros
  • ComicFury/ProfessorAmazingAndTheIncredibleGoldenFox
  • ComicFury/RedSpot
  • ComicFury/ResNulliusCS
  • ComicFury/Saluna
  • ComicFury/Snowfall
  • ComicFury/Swashbuckled
  • ComicFury/Threan
  • CommanderKitty
  • CrimsonFlag
  • CritterCoven
  • CrossTimeCafe
  • CutLoose
  • CynWolf
  • DarkWhite
  • DeerMe
  • DelaTheHooda
  • Delve
  • Dissonance
  • DMFA side stories
  • DocRat
  • Dreamkeepers
  • Erfworld
  • ErmaFelnaEDF
  • Everblue
  • Evon
  • FarToTheNorth
  • FoxDad
  • FoxTails
  • FreighterTails
  • FurPiled
  • FurthiaHigh
  • Ginpu
  • Guardia
  • HavocInc
  • HeyFox
  • HeyKitty
  • Housepets
  • HowToBeAWerewolf
  • IndustrialRevelations
  • InOurShadow
  • IslaAukate and IslaAukateColor
  • Kaerwyn and BlackTapestries
  • Kaspall
  • Katbox/FalseStart
  • Katbox/Knighthood
  • Katbox/KnuckleUp
  • Katbox/OurWorld
  • Katbox/ProjectZero
  • Katbox/RascalsGoyoku
  • Katbox/VampireHunterBoyfriends
  • KemonoCafe/TinaOfTheSouth
  • Kitfox
  • LasLindas bonus comics
  • LastResort
  • LifeAsRendered
  • LilithsWord
  • LittleTales
  • Moonlace
  • MyLifeWithFel
  • MynarskiForest
  • NamirDeiter
  • NamirDeiter/ApartmentForTwo
  • NamirDeiter/NicoleAndDerek
  • NamirDeiter/OneHundredPercentCat
  • NamirDeiter/SpareParts
  • NamirDeiter/TheNDU
  • NamirDeiter/UnlikeMinerva
  • NamirDeiter/WonderKittens
  • NamirDeiter/YouSayItFirst
  • NeoCTC
  • Newshounds
  • Newshounds/ProjectionEdge
  • NicoleAndDerek
  • Nightshift
  • NineToNine
  • NonPlayerCharacter
  • NotAVillain
  • OffWhite
  • OrderOfTheBlackDog
  • OriginalLife
  • OutOfPlacers
  • OzyAndMillie
  • PeanutBerrySundae
  • PeterIsTheWolf
  • PlushAndBlood
  • PoppyOPossum
  • PowerNap
  • ProjectFuture/AWalkInTheWoods
  • ProjectFuture/BenjaminBuranAndTheArkOfUr
  • ProjectFuture/BookOfTenets
  • ProjectFuture/CriticalMass
  • ProjectFuture/DarkLordRising
  • ProjectFuture/FishingTrip
  • ProjectFuture/HeadsYouLose
  • ProjectFuture/NiallsStory
  • ProjectFuture/ProjectFuture
  • ProjectFuture/RedValentine
  • ProjectFuture/ShortStories
  • ProjectFuture/StrangeBedfellows
  • ProjectFuture/TheAxemanCometh
  • ProjectFuture/ToCatchADemon
  • ProjectFuture/TheDarkAngel
  • ProjectFuture/TheEpsilonProject
  • ProjectFuture/TheHarvest
  • ProjectFuture/TheSierraChronicles
  • ProjectFuture/TheTuppenyMan
  • ProjectFuture/TurningANewPage
  • QuantumVibe
  • QuentynQuinnSpaceRanger
  • Replay
  • RestoredGeneration
  • RHJunior/GoblinHollow
  • RHJunior/NipAndTuck
  • RHJunior/TalesOfTheQuestor
  • RHJunior/TheProbabilityBomb
  • RHJunior/TheJournalOfEnniasLongscript
  • RHJunior/QuentynQuinnSpaceRanger
  • Ryugou
  • Savestate
  • SecondComing
  • SeelPeel
  • ShadesOfGray
  • Shivae side comics
  • SixPackOfOtters
  • SmackJeeves/FurryExperience
  • SmackJeeves/GrowingTroubles
  • SmackJeeves/LatchkeyKingdom
  • SmackJeeves/OffCentaured
  • SpaceFurries
  • SSDD
  • StarfireAgency
  • StolenGeneration
  • SuburbanJungle
  • SuburbanJungleRoughHousing
  • Supercell
  • SwordsAndSausages
  • TailsAndTactics
  • TalesOfTheQuestor
  • Tamberlane
  • TheClassMenagerie
  • TheCyantianChronicles comics
  • TheGentleWolf
  • TheJunkHyenasDiner
  • TheMeek
  • TheMonsterUnderTheBed
  • TheOldVictorian
  • TheProbabilityBomb
  • TinyDickAdventures
  • TracesOfThePast (SFW/NSFW)
  • TwinDragons
  • UnlikeMinerva
  • Vexxarr
  • VirmirWorld
  • WebToons/InternetExplorer
  • WebToons/SpaceVixen
  • WereIWolf
  • WhatWeRememberMost
  • WhiteNoiseLee
  • WildeLife
  • Wrongside
  • YouSayItFirst

Fixed comics:

  • AddictiveScience
  • AGirlAndHerFed
  • AlphaLuna
  • Altermeta
  • ArtificialIncident
  • BetterDays
  • CatenaCafe
  • Concession
  • Curtailed
  • Debunkers
  • DesertFox
  • DMFA
  • DoghouseDiaries
  • DominicDeegan
  • Flipside
  • Freefall
  • Galaxion
  • GrrlPower
  • GunnerkriggCourt
  • KeenSpot/GeneCatlow
  • KiwiBlitz
  • Lackadaisy
  • LookingForGroup
  • MareInternum
  • Misfile
  • Namesake
  • Oglaf
  • PeterAndCompany
  • PeterAndWhitney
  • PS238
  • PvPOnline
  • QuestionableContent
  • RealLife
  • SabrinaOnline
  • Shivae
  • SkinDeep
  • SleeplessDomain
  • SlightlyDamned
  • SluggyFreelance
  • StandStillStaySilent
  • TheGamerCat
  • TheWhiteboard
  • Twokinds
  • Unsounded
  • UrgentTransformationCrisis
  • VerloreGeleentheid
  • Weregeek
  • WhiteNoise
  • XKCD
  • YoshSaga

Additional changes:

  • Update maintenance date in README
  • Mark endOfLife comics as completed when checking for updates
  • Merge thecomicseries.com comics into ComicFury list
  • Clean up global vars used by ComicFury site engine
  • Add common handler for mgsisk's Wordpress Webcomic plugin
  • Refactor WP_LATEST_SEARCH into class variable of WordPressScraper

Techwolfy referenced this pull request in Efreak/dosage Oct 28, 2019
I tried using `scripts/tapastic.py` provided by @Techwolfy but unfortunately Tapastic has changed their format, and it no longer works...
Techwolfy force-pushed the upstream branch 2 times, most recently from f2a23b3 to fb8c667, on October 28, 2019 08:30
@codecov

codecov bot commented Oct 28, 2019

Codecov Report

Merging #140 into master will not change coverage.
The diff coverage is 0%.


@@           Coverage Diff           @@
##           master     #140   +/-   ##
=======================================
  Coverage   82.61%   82.61%           
=======================================
  Files          72       72           
  Lines        6282     6282           
  Branches      423      423           
=======================================
  Hits         5190     5190           
  Misses        977      977           
  Partials      115      115

Impacted Files                    Coverage Δ
dosagelib/plugins/comicfury.py    85.18% <ø> (ø) ⬆️
dosagelib/plugins/webtoons.py     58.06% <0%> (ø) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d1a0d9...cac6007.

@TobiX
Member

TobiX commented Oct 28, 2019

Wow, thanks a lot! Actually, I would like to take those changes in smaller batches, but I might just cherry-pick some changes, so you have an easier time getting this merged...

@Ruthalas
Contributor

Ruthalas commented Oct 30, 2019

I'd like to use these updates. Should I anticipate these being merged in the next week or so, or should I pull from Techwolfy's upstream? @TobiX
Many thanks for both of your work, by the way. I appreciate it.

@TobiX
Member

TobiX commented Nov 3, 2019

> I'd like to use these updates. Should I anticipate these being merged in the next week or so, or should I pull from Techwolfy's upstream? @TobiX

Not soon. I currently don't have the time to review this as one big chunk. Breaking this into smaller parts would really help speed up the review process...

@TobiX
Member

TobiX commented Nov 3, 2019

I've started cherry-picking commits, starting from the oldest; I'm currently up to "Add Evon". It would be really useful if you could rebase the branch (and probably also remove everything you reverted later). I didn't pick the following commits:

  • "Mark KeenSpot/GeneCatlow completed :(" - Ignores the "generated" comment in that file, that change must be made in the scripts/keenspot.py helper first
  • "Fix DMFA" - Doesn't fix anything, breaks fetching of first image?
  • Modified "Add Kaerwyn and BlackTapestries" to use the name generated by scripts/smackjeeves.py
  • "Add site engine for WebToons" - please create a pull request for each new "big" module
  • "Fix Twokinds" - Not sure what this fixes?!?
  • "Add ShadesOfGray" - Site is gone?

@Techwolfy
Contributor Author

Rebased, and dropped the changes I reverted (those were for a few comics that block scraping).

  • KeenSpot/GeneCatlow: Fixed. I think that commit was from before I figured out how the autoupdate scripts worked.
  • DMFA: I updated this parser in a later commit. The previous comic page was filler, which is rare; it'll be moved to the bonus archive later.
  • TheRealmOfKaerwyn, BlackTapestries: Yep, using the full comic name is better. Not sure why I didn't do that in the first place.
  • WebToons: I'll look into splitting these modules out into separate PRs when I have time. Some of them are modified by later commits that affect multiple files, so it might be tricky.
  • Twokinds: Temporary news updates with header images broke the old xpath query, which this fixed. Unfortunately it now breaks pages without a sketch link, so I've reverted it.
  • ShadesOfGray: The site is indeed gone.

For ShadesOfGray, the author couldn't keep updating the comic, but fans convinced them to finish the story by posting the rest of the script and paid the hosting bills for another year. It's actually why I started working on this project; I wanted to make a copy for myself before it went offline, and decided to grab every other comic I've ever read while I was at it. ^.^

I'd originally planned on writing my own scraper, but dosage was modular enough that it was easier to extend it. Unfortunately, some comics were only partially recoverable, and a few seem to be gone forever. It looks like the remaining Katbox comics might be next, as it's shutting down soon.

@TobiX
Member

TobiX commented Nov 5, 2019

First: Thanks for rebasing :D

> DMFA: I updated this parser in a later commit. The previous comic page was filler, which is rare; it'll be moved to the bonus archive later.

Ah, I see. I looked at that commit and it does something which might negatively influence one of dosage's features: Using @ as a comic name to mean "all comics I have locally" (see https://github.com/webcomics/dosage/blob/master/dosagelib/director.py#L195-L200 and https://github.com/webcomics/dosage/blob/master/dosagelib/director.py#L228-L233) - But I must admit the implementation of that feature is rather... stupid :D - So if someone only downloads the DMFA side stories, using @ afterwards will also download the DMFA main comic...

As I think more about this, it's probably not worth sacrificing "good" directory names like DMFA/Guest for this corner case. Consider this just "loud thinking" then :P
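
The corner case can be illustrated with a small sketch. This is not the code in dosagelib/director.py; it assumes that `@` is resolved by checking, for every known comic, whether its download directory exists. The function name `select_local` and the sample names are hypothetical.

```python
import os
import tempfile


def select_local(scraper_names, basepath):
    """Return the comic names whose download directory exists under basepath."""
    selected = []
    for name in scraper_names:
        # "DMFA/Guest" maps to <basepath>/DMFA/Guest (assumed layout).
        dirname = os.path.join(basepath, *name.split("/"))
        if os.path.isdir(dirname):
            selected.append(name)
    return selected


def demo():
    """Download only the sub-comic, then ask for 'everything I have locally'."""
    with tempfile.TemporaryDirectory() as base:
        # Creating DMFA/Guest implicitly creates the parent directory DMFA.
        os.makedirs(os.path.join(base, "DMFA", "Guest"))
        return select_local(["DMFA", "DMFA/Guest", "XKCD"], base)


result = demo()
```

Under that assumption, `result` contains both `DMFA` and `DMFA/Guest`: the parent directory exists only because the sub-comic lives inside it, yet the main comic is selected as well.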

(Currently with limited internet, so no cherry-picks today)

@MaddTheSane

Currently it breaks SequentialArt and Drowtales.

$ ~/Library/Python/2.7/bin/dosage -a drowtales sequentialArt
Drowtales> Retrieving all strips
Drowtales> ERROR: 'module' object has no attribute 'unescape'
SequentialArt> Retrieving all strips
SequentialArt> ERROR: 'module' object has no attribute 'unescape'

@MaddTheSane

MaddTheSane commented Nov 26, 2019

Correction: It's broken for SequentialArt and Drowtales on Python 2.7.
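
For context, a common way to get this kind of `unescape` error is calling a function that only exists in Python 3's standard library. Whether that is the actual cause in dosage is not confirmed in this thread; the following is only a sketch of a typical compatibility shim.

```python
try:
    # Python 3.4+: unescape lives in the standard html module.
    from html import unescape
except ImportError:
    # Python 2.7 fallback: HTMLParser instances provide an unescape method.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

decoded = unescape("Tom &amp; Jerry &lt;3")
print(decoded)
```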

Although Misfile and WapsiSquare are having issues even under Python 3.7:

Misfile> Retrieving all strips
Misfile> ERROR: XPath //div[@class="comic"]//img not found at URL http://www.misfile.com/.
Misfile> WARN: XPath //a[contains(@title, "Previous")] not found at URL http://www.misfile.com/. Assuming no previous comic strips exist.
WapsiSquare> Retrieving all strips
WapsiSquare> Saved Comics/WapsiSquare/2019-11-25-not-likely.jpg (244.98KB).
WapsiSquare> WARN: XPath //a[contains(concat(" ", @class, " "), " comic-nav-previous ")] not found at URL http://wapsisquare.com/. Assuming no previous comic strips exist.
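
The `concat(" ", @class, " ")` idiom in the last query matches a whole class token rather than a substring. As a rough illustration, the same whitespace-padded check can be written in plain Python; `has_class` is a hypothetical helper, not part of dosage or lxml.

```python
def has_class(class_attr, token):
    """Mimic contains(concat(" ", @class, " "), " token ") from XPath 1.0."""
    return (" %s " % token) in (" %s " % class_attr)


# Matches when the token is one of several space-separated classes.
match = has_class("comic-nav-base comic-nav-previous", "comic-nav-previous")

# Does not match a longer class that merely starts with the token,
# whereas a plain substring test would.
no_match = has_class("comic-nav-previous-disabled", "comic-nav-previous")
naive = "comic-nav-previous" in "comic-nav-previous-disabled"
```

When a redesign renames or removes the class, the query finds nothing and the scraper reports the warning shown above.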

Techwolfy force-pushed the upstream branch 2 times, most recently from afd0d33 to 4e5805a, on November 26, 2019 06:34
@Techwolfy
Contributor Author

Techwolfy commented Nov 26, 2019

@MaddTheSane None of those issues are related to this change; the Python 2.7 bugs were there before, and the Misfile and WapsiSquare scrapers were both broken by website redesigns. I fixed the scraper for Misfile and added one for its new sub-comic.

@TobiX I hadn't noticed the issue with the @ parser since I didn't download any sub-comics without their main comic. Since it only affects updating all comics and doesn't break downloading an individual sub-comic, I agree that it's not worth changing, though a better solution would be nice. I considered moving the base comic to a subdirectory (e.g. DMFA/DMFA) when I first added the sub-comics but decided against moving preexisting files.

@TobiX
Member

TobiX commented Dec 17, 2019

Some more notes while picking more to master:

  • Merged both DMFA commits into one
  • TheMonsterUnderTheBed already exists as MonsterUnderTheBed
  • PeterIsTheWolf already exists in WLPComics
  • I looked a bit at the KatBox scraper: If you remove/move something (which has existed in 2.15 or before, like LasLindas), could you add a "redirect" to old.py?
  • Unfortunately, SmackJeeves just totally changed their design (see SmackJeeves redesign #144), so I skipped all of those I came across...
  • Also skipped all DeviantArt I came across, for similar reasons...
  • Merged both misfile fix commits
  • Why is Shivae in cyantian.py? ✔️
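
The "redirect" idea for moved comics can be pictured as a table from old names to new ones. The sketch below does not reproduce the contents or API of old.py; the table entries and the `resolve` helper are hypothetical and only show the general technique.

```python
# Hypothetical old-name -> new-name table; the entries are illustrative only.
RENAMED = {
    "OldComic": "SomeHost/OldComic",
    "SomeHost/OldComic": "OtherHost/OldComic",
}


def resolve(name, renamed=RENAMED):
    """Follow rename entries until a current name is reached."""
    seen = set()
    while name in renamed:
        if name in seen:
            raise ValueError("rename cycle involving %r" % name)
        seen.add(name)
        name = renamed[name]
    return name


current = resolve("OldComic")
unchanged = resolve("XKCD")
```

A user who still asks for the old name is then pointed at the comic's current location instead of getting an "unknown comic" error.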

@TobiX
Member

TobiX commented Dec 17, 2019

I'll try to continue picking "easy" commits at a steady pace in the coming days. Hopefully, we have only the complex cases left by the end of the year 🐺

@MaddTheSane

My guess why Shivae is in cyantian.py is because their site layout is very similar (and might be hosted on the same site). They are done by the same artist.

@Techwolfy
Contributor Author

Rebased! I'll be off work for the holidays soon, so hopefully I'll have a bit more time to work on this as well.

  • Katbox scraper: Fixed. LasLindas and CaribbeanBlue were the only two.
  • SmackJeeves: I know, they broke some of my old bookmarks too. The comic-specific hacks won't be necessary now since the custom styles are gone, but I preferred the old designs.
  • DeviantArt: That parser is pretty hacky, but it was easier than manually downloading the comics. I had to fight the A/B testing of their redesign while I wrote it; sounds like it didn't survive.
  • Shivae: MaddTheSane is right. Same artist, same site layout, same host, and the stories are somewhat linked.

@TobiX
Member

TobiX commented Dec 27, 2019

  • Restored SmackJeeves functionality in master, you may adapt your additions to the new layout...
  • You seem to have left some duplicates & conflicts after rebasing, could you check those?
  • ADoemainOfOurOwn is a duplicate of DoemainOfOurOwn - why do you use the Internet Archive? The site seems to download just fine
  • Newshounds is already available as a KeenSpot module
  • TheMeek already exists as Meek
  • MyLifeWithFel - I'd suggest using requests directly instead of moving the JSON through LXML first. See SmackJeeves.starter for an example
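
A minimal sketch of that last suggestion: decode the JSON body straight from the HTTP response rather than feeding it to an HTML/XML parser. It does not reproduce SmackJeeves.starter; the URL, the payload fields, and the stand-in classes are hypothetical, and the stand-ins exist only so the example runs without network access.

```python
import json


def fetch_json(session, url):
    """Fetch url and decode the body as JSON, with no HTML/XML parser involved."""
    response = session.get(url)
    response.raise_for_status()  # surface HTTP errors early
    return response.json()


class _FakeResponse(object):
    """Stand-in for requests.Response so the sketch runs offline."""

    def __init__(self, body):
        self._body = body

    def raise_for_status(self):
        pass

    def json(self):
        return json.loads(self._body)


class _FakeSession(object):
    """Stand-in for requests.Session returning a canned, hypothetical payload."""

    def get(self, url):
        return _FakeResponse('{"pages": [{"id": 1, "image": "page-001.png"}]}')


data = fetch_json(_FakeSession(), "https://example.invalid/api/pages")
first_image = data["pages"][0]["image"]
```

With a real `requests.Session()` the same function should work unchanged, since `Response` objects provide both `raise_for_status()` and `json()`.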

I cherry-picked all "simple" commits to master. Only bigger modules (or modules which I mentioned above) should be left after a rebase...

@Techwolfy
Contributor Author

Rebased. I cleaned up the leftover duplicates and merge artifacts, but I think there are still a few recently-dead comics in there that I need to remove.

  • SmackJeeves: Updated. A few comics migrated to other sites, fixed those as well.
  • DoemainOfOurOwn: Fixed. The site was probably down when I wrote the scraper for it, seems fine now.
  • Newshounds: Updated. Most of the archive is inaccessible via Keenspot; the remainder is a sequel comic, ProjectionEdge.
  • TheMeek: Fixed.
  • MyLifeWithFel: Fixed.

@TobiX
Member

TobiX commented Jan 9, 2020

  • There are still some duplicate/broken commits in your history - would you mind taking another look? (At least TwinDragons, DocRat, AdventuresOfFifne, CarryOn/AliceBlueAndTheGardensOfQ, CarryOn/LegendOfAnneBunny, ClanOfTheCats/Reunion, MonsterUnderTheBed)
  • LifeAsRendered has changed site design and doesn't work anymore
  • Could you add rename entries for the NamirDeiter comics in old.py?
  • Same with RHJunior
  • For VerloreGeleentheid, we should probably just add a language override to smackjeeves...
  • I merged MyLifeWithFel and created Add scraper for comics with API #148 to maybe make it nice in the future
  • I currently see no reason to add KatBox, since it seems everyone is moving off it - If you could restructure your commits to not add that, that would be super-nice
  • I implemented a fix for Djandora in the existing module instead of creating a new one
  • I merged some commits where one commit added a comic and another one "fixed" or moved it - this might give you some headaches rebasing :/

@Techwolfy
Contributor Author

Rebased!

  • Duplicates: Fixed, most of those were auto-merges that went wrong without prompting me. I didn't see any issues with DocRat; what did you want me to change?
  • LifeAsRendered: Fixed. The navigation is equally broken after the redesign, but in different places...
  • NamirDeiter and RHJunior renames: Fixed.
  • VerloreGeleentheid: The translations for this comic are in the first comment by the author on each page, I don't think that will generalize well.
  • Katbox: There are still a few comics that haven't migrated away yet, and at this point likely never will. That said, I'm not sure how long the Katbox itself will last (and it's currently down), so I've removed it.
  • Djandora: I migrated the remaining changes to the existing module (some pages have duplicate names, and are skipped without the custom namer).
  • Rebase headaches: No worries, I expected those. Merging all the new fixes back into my own branch (which has a few non-upstreamable patches) will probably be a mess too.

@TobiX
Member

TobiX commented Jan 12, 2020

  • There might still be some auto-merge issues for DocRat (1 commit adds, the next removes), but it doesn't matter as long as I'm cherry-picking from your branch
  • LifeAsRendered is 🙄 - But I might have seen worse page layouts 🤣 (hand-rolled HTML with 8.3 filenames anyone?)
  • VerloreGeleentheid and Djandora - merged. I'm not happy with the extra condition inside the modules' functions, but I currently don't have any better idea myself (maybe add to a new subclass?)

The only big chunk left now is webtoons. I'll probably get around to it next week.

PS: Feel free to add yourself to doc/authors.txt - I'm still searching for a better way to credit contributors (that isn't just dumped from the Git history), but will migrate that list once I've found something better...

@Techwolfy
Contributor Author

Rebased. Almost done! Let me know if you want me to change or remove any of the remaining minor fixes in the PR so you can merge it normally.

  • DocRat: Found the problem; I had an auto-merge duplicate that vanished in a later unrelated commit, so I didn't see it until I dug through the commit history for the file.

  • VerloreGeleentheid and Djandora: I tried using a subclass first, but ran into conflicts with the parent class's @classmethod. If there's a way to make that work, it would be a better solution.

  • Credits: I added myself to the authors file, and to the copyright block for each file I created or modified. If you're looking for ideas, perhaps a new command-line option listing contributors?

@TobiX
Member

TobiX commented Jan 27, 2020

Arrg. Webtoons hates me and serves me some GDPR-crap instead of comic pages :/

There is dosage --version - I added your name to it, hope you don't mind.

@Techwolfy
Contributor Author

@TobiX Ping; it's been a few weeks, anything else you want me to change so you can merge this?

@TobiX
Member

TobiX commented Mar 26, 2020

Sorry, not much bandwidth for Dosage in the last few weeks. And adding more onto this won't make it any easier for me to merge. Please split your contributions into smaller pull requests in the future...

  • DelaTheHooda has been broken for months
  • "Clean up global vars used by ComicFury site engine" - I like my global vars, thank you very much ;)

Multiple commits with no apparent functional change:

  • "Fix Twokinds"
  • "Fix XKCD"
  • "Fix Unsounded"
  • "Fix Oglaf"

Cherry-picked all the easy ones, will take a look at the rest when I'm more awake.

@Techwolfy
Contributor Author

Rebased. Sorry again about the massive PR! I know it's a pain to code-review, thanks for pushing through it.

  • DelaTheHooda: Removed. Looks like VCL is down, I probably have one of the last surviving copies of the strip.
  • ComicFury: I dropped the global variable cleanup.
  • Minor changes: Moved to Minor fixes to several strips #158. There are a few functional changes in there, but nothing significant, so I'd rather it not block this PR.

All that's left here is one last ComicFury entry and an extra padding digit for the WebToons image filenames (some comics use >100 images per strip).
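
To show why the padding width matters once a strip has more than 100 images, here is a small sketch. The naming scheme is an assumption; the actual filename format used by the WebToons module is not shown in this thread, and `image_name` is a hypothetical helper.

```python
def image_name(strip, index, digits=3):
    """Build a zero-padded image name such as 'ep-007' (hypothetical format)."""
    return "{}-{:0{}d}".format(strip, index, digits)


# With two digits, image 100 sorts before image 99 in a plain name sort.
two_digits = sorted(image_name("ep", i, digits=2) for i in (99, 100))

# With three digits, name order matches reading order again.
three_digits = sorted(image_name("ep", i, digits=3) for i in (99, 100))
```

Here `two_digits` comes out as `['ep-100', 'ep-99']`, while `three_digits` is `['ep-099', 'ep-100']`.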

@TobiX TobiX merged commit d9988bc into webcomics:master Apr 2, 2020
@Techwolfy Techwolfy deleted the upstream branch April 3, 2020 04:25
@Techwolfy
Contributor Author

Thanks for merging this!
