Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

big ball of mud addressing several issues #22

Open
wants to merge 23 commits into
base: master
Choose a base branch
from
Open

big ball of mud addressing several issues #22

wants to merge 23 commits into from

Conversation

schmamps
Copy link

Yeah, there's a whole lot happening here. This started as just wanting the document properties, but the process ended up addressing issues and wrapping everything in a class to maintain backwards compatibility.

Issues/PRs Addressed

Bonus Features

Documentation

The README is thorough.

Performance

Additional XML parsing overhead is more than offset by reducing the complexity of xml2text().

Correctness

Errors are output to stderr instead of print()ed.

text += '\t'
elif child.tag in (qn('w:br'), qn('w:cr')):
text += '\n'
elif child.tag == qn("w:p"):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qn() can be called as many as four times per node!


zipf = zipfile.ZipFile(path)
paths = {}
for fname in ['_rels/.rels', 'word/_rels/document.xml.rels']:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least in my office365 docx, the second file is named document2.xml.rels so it might not be a good idea to hardcode this, but instead deduce its name from the content of the _rels/.rels file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of my files have this structure. I can't even fake it because there is no XML reference to word/_rels/document.xml.rels.

I could guess where that file is, but would prefer to know. A sample file would be very helpful because I don't have a 365 subscription.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it helps, here you have a simple Word Online document.

SimpleDocument.zip

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was extremely helpful. Looks like 365 does this for no particular reason.

@schmamps
Copy link
Author

OK, I get it now. That works but it's wrong, and so help us if Microsoft ever goes to document3.xml, slartibartfast.xml, or whatever. To simplify, The Right Way to find that second file is more like:

from os.path import basename, dirname

src_path = '_rels/.rels'
src_selector = '/Relationships/Relationship[Type=http://schemas…officeDocument]'
src_attr = 'Target'

doc_path = some_xml_function(src_path, src_selector, src_attr)
doc_rels = '{}/_rels/{}.rels'.format(dirname(doc_path), basename(doc_path))

I'll push that fix after accounting for another strange possibility: multiple nodes of Type officeDocument.

correctly locate document relationships
nomenclature fixes
refactor utility functions
limit imports
protect document properties
remove header and footer from document text (it's different from page to page)
nomenclature
@schmamps
Copy link
Author

After a shallow dive into Open Packaging Conventions, this looks pretty faithful to the standard.

@schmamps schmamps changed the title V080/pull request big ball of mud addressing several issues Feb 15, 2019
@JMBurley
Copy link

This is pretty great & a significant functionality upgrade. @ankushshah89 Any chance of merging this into master?

Failing that, I'm willing to go and repackage the updated version and get it on PyPI. But I'd prefer to see original authors (incl. @schmamps ) get direct credit.

@SamMorrowDrums
Copy link

SamMorrowDrums commented Aug 15, 2019

Yes I've had to use my fork (and open PR) version for years because office365 suport is not optional for me. I see there was one change recently by @ankushshah89 - with a version bump for a compatibility bug, but I'd love to know why various attempts to fix the support issue have not been merged.

If the maintainer(s) would like somebody else to get involved please do let us know because the compatibility issue is now so frequent that this repo is effectively obsolete, unless the issue is addressed.

It's almost 2 years since I attempted to fix the issue - and it's sad that there have been a few commits since, but the issue remains ignored.

We want to help and make this work, it's great having things like this and nobody want's to be rude or impatient, but also nobody wants the upstream version of a module to be dysfunctional for years without some kind of action being taken.

If we don't hear back from this then I certainly would back forking and publishing - out of desperation, rather than choice.

@ShayHill
Copy link

ShayHill commented Aug 16, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants