V03 issue703 (#706)
* code changes to change integrity to identity #703

* s/ntegrity/dentity/g for #703 to update documentation in both languages.

* Documenting breaking change of Integrity.

* patching up #703 branch after several merges.

* fix for crash from #699 when contentType missing on sent files.

* correct integrity when read from file attributes #703

* adjusting documentation, part of #703

* updating version pointer link for reid's recent work.

* Redis driver for Nodupe (#705)

* Added many more NoDupe tests

* Add final tests to nodupe

* Add nodupe sub-class tests

* Created nodupe_redis

Stripped a lot of the methods out, because they just aren't needed with Redis.

* First try at Redis Nodupe

* Unit test for nodupe_redis

* Merged nodupe derivekey tests into 1

Previously had a whole bunch of them, but since we're re-instantiating the message for each derivation, they can all go in one.

* Unit test for nodupe_redis

Had to make some changes to the module itself because some functionality wasn't working

* Create generic nodupe unit test

Trying to compare the functionality of the disk-, and redis-based nodupe classes is going to be hard. Most (all?) of the methods don't actually return anything, and if we're assuming to abstract the inner workings of each one, there's nothing to validate they're doing the same thing.

* Delete individual derivekey nodupe unit tests

* Add test docs on using VSCode

* Allow picking which driver to use for NoDupe

* Add missing config options for redis nodupe

* Fix cache size counting

I'm not sure how much any of this counting matters, or if it's just for logging purposes.
Either way.. it's better now.

* Fix broken nodupe.redis tests

* Change field delimiter in nodupe redis key

* Update redis key field delimiter in nodupe test

* Fixed mis-named config option

Nodupe Redis was using retryqueue server URL.
now it's not.

* Standardized some testing aspects

- Made the add_option method for options work properly
- Added PrettyPrinter code for debugging tests

* Clean-up test debugging code

* Make Redis Nodupe count better

It wasn't loading on_start, or clearing on_stop, so it was never accurate.
I still don't really think it's all that accurate but since it's just for logging purposes, it might not need to be better than it is.

* Disabled unneeded redis Retry test

* Update retry unit test dependencies

Used to have dependencies per method, but now the comparative tests depend on both driver tests.
This ensures that nothing gets missed.

* Comparative NoDupe unit test

Compares state/output of both Disk and Redis nodupe drivers against an expected value.

* Change integrity to identity, per #703 and #706

This might cause merge conflicts, but it's probably still worth changing here.

* Resolve config.py conflict

* Resolve second config.py conflict

---------

Co-authored-by: Peter Silva <[email protected]>

---------

Co-authored-by: Greg <[email protected]>
petersilva and gcglinton authored Jun 23, 2023
1 parent 5bc9748 commit 6223cf8
Showing 59 changed files with 321 additions and 293 deletions.
22 changes: 11 additions & 11 deletions docs/source/Contribution/v03.rst
@@ -137,12 +137,12 @@ sr3 code::
16 ./flow/winnow.py
793 ./__init__.py
226 ./instance.py
36 ./integrity/arbitrary.py
93 ./integrity/__init__.py
33 ./integrity/md5name.py
24 ./integrity/md5.py
17 ./integrity/random.py
24 ./integrity/sha512.py
36 ./identity/arbitrary.py
93 ./identity/__init__.py
33 ./identity/md5name.py
24 ./identity/md5.py
17 ./identity/random.py
24 ./identity/sha512.py
17 ./moth/amq1.py
585 ./moth/amqp.py
313 ./moth/__init__.py
@@ -274,7 +274,7 @@ the two versions, is clear:
| | |
| sr_message.py | |
+--------------------------+---------------------------+
| sr_checksum.py | integrity/ |
| sr_checksum.py | identity/ |
| | __init__.py |
| sum/* | * |
+--------------------------+---------------------------+
@@ -395,7 +395,7 @@ Known Problems (Solved in sr3)
sarra.tmpc.*, sr.py ) using normal imports. likely need to
refactor how checksum plugin mechanism works then try again.

totally refactored now. Integrity class is normal, and separate from flowcb.
totally refactored now. Identity class is normal, and separate from flowcb.


Concrete Plan (Done)
@@ -588,7 +588,7 @@ Items from the TODO list that have been addressed.
so Plugin becomes a class instantiated in sarra/__init__.py... puts
plugins and built-in code on a more even level... for example how
do plugin transfer protocols work? thinking... This is sort of done
now: plugin became flowcb. Integrity is removed from the hierarchy.
now: plugin became flowcb. Identity is removed from the hierarchy.
Class extension is now a separate kind of plugin (via import)

* change default topic_prefix to v03.post done 2021/02
@@ -874,7 +874,7 @@ Features

* The extension API is now vanilla python with no magic settings. just standard classes, using standard import mechanism.
debugging should be much simpler now as the interpreter will provide much better error messages on startup.
The v2 style plugins are now called *flow callbacks*, and there are a number of classes (integrity, moth,
The v2 style plugins are now called *flow callbacks*, and there are a number of classes (identity, moth,
transfer, perhaps flow) that permit extension by straightforward sub-classing. This should make it much
easier to add additional protocols for transport and messages, as well checksum algorithms for new data types.

@@ -897,7 +897,7 @@ Features
* FlowCB plugin entry_points are now based on groups of notification messages, rather than individual ones, allowing people
to organize concurrent work.

* integrity (checksums) are now plugins.
* identity (checksums) are now plugins.

* gather (inlet? sources of notification messages) are now plugins.

14 changes: 7 additions & 7 deletions docs/source/Explanation/CommandLineGuide.rst
@@ -314,8 +314,8 @@ View all configuration settings (the result of all parsing... what the flow comp
'inlineEncoding': 'guess',
'inlineOnly': False,
'instances': 1,
'integrity_arbitrary_value': None,
'integrity_method': 'sha512',
'identity_arbitrary_value': None,
'identity_method': 'sha512',
'logEvents': {'after_work', 'after_accept', 'on_housekeeping'},
'logFormat': '%(asctime)s [%(levelname)s] %(name)s %(funcName)s %(message)s',
'logLevel': 'info',
@@ -931,8 +931,8 @@ Polling is doing the same job as a post, except for files on a remote server.
In the case of a poll, the post will have its url built from the *pollUrl*
option, with the product's path (*directory*/"matched file"). There is one
post per file. The file's size is taken from the directory "ls"... but its
checksum cannot be determined, so the default integrity method is "cod", asking
clients to calculate the integrity Checksum On Download.
checksum cannot be determined, so the default identity method is "cod", asking
clients to calculate the identity Checksum On Download.

By default, sr_poll sends its post notification message to the broker with default exchange
(the prefix *xs_* followed by the broker username). The *post_broker* is mandatory.
@@ -1563,7 +1563,7 @@ WINNOW

the **winnow** component subscribes to file notification messages and reposts them, suppressing redundant
ones. How to decide which ones are redundant varies by use case. In the most straight-forward case,
the messages have an **Integrity** header that stores a file's fingerprint as described in the `sr_post(7) <../Reference/sr_post.7.html>`_ man page,
the messages have an **Identity** header that stores a file's fingerprint as described in the `sr_post(7) <../Reference/sr_post.7.html>`_ man page,
and that header is used exclusively. There are many other use cases, though, discussed in the following section
on `Duplicate Suppression <DuplicateSuppresion.html>`_

@@ -2351,11 +2351,11 @@ the directory will be checked for new files. Here is part of the Script callbac
return []
Integrity
Identity
---------

One can use the *import* directive to add new checksum algorithms by sub-classing
sarracenia.integrity.Integrity.
sarracenia.identity.Identity.
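
A minimal sketch of what such a subclass might look like. The ``Identity`` base class below is a simplified, hypothetical stand-in written for this example so that it runs without Sarracenia installed; the method names (``set_path``, ``update``, ``value``) and the ``Sha256`` class are assumptions, and the real ``sarracenia.identity.Identity`` interface should be checked before relying on them.

```python
import base64
import hashlib


class Identity:
    """Hypothetical, simplified stand-in for sarracenia.identity.Identity."""

    def __init__(self):
        # Begin a fresh calculation; subclasses choose the hash object.
        self.set_path(None)

    def set_path(self, path):
        """Start a new checksum calculation for the given file path."""
        raise NotImplementedError

    def update(self, chunk):
        """Feed the next chunk of file content into the checksum."""
        if isinstance(chunk, str):
            chunk = chunk.encode("utf-8")
        self.filehash.update(chunk)

    @property
    def value(self):
        """Return the base64-encoded digest of everything fed so far."""
        return base64.b64encode(self.filehash.digest()).decode("utf-8")


class Sha256(Identity):
    """Example of a new checksum method added by sub-classing (assumed name)."""

    def set_path(self, path):
        self.filehash = hashlib.sha256()


checker = Sha256()
checker.update(b"hello ")
checker.update(b"world")
print(checker.value)
```

The subclass only decides which hash object to create; chunked updates and base64 encoding stay in the base class, which is why adding a new algorithm can be this short.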

Transfer
--------
2 changes: 1 addition & 1 deletion docs/source/Explanation/Concepts.rst
@@ -63,7 +63,7 @@ In more detail:

The main components of the python implementation of Sarracenia all implement the same algorithm described above.
The algorithm has various points where custom processing can be inserted (using flowCallbacks),
or deriving classes from flow, integrity, or transfer classes.
or deriving classes from flow, identity, or transfer classes.

The components just have different default settings:

8 changes: 4 additions & 4 deletions docs/source/Explanation/DuplicateSuppression.rst
@@ -23,8 +23,8 @@ Duplicate suppression must:

Duplicates are dropped to avoid further processing.

A notification message key is preferably derived from the *Integrity* field of the notification
message. If the producer does not provide an integrity checksum, algorithms may fall
A notification message key is preferably derived from the *Identity* field of the notification
message. If the producer does not provide an identity checksum, algorithms may fall
back on other metadata: *mtime*, *size*, *pubTime.* Since pubTime is a mandatory
field, a key can always be derived, but it's effectiveness in a particular use case
is not assured. (the sarracenia.flowcb.nodupe.NoDupe.deriveKey(self,msg) helper routine
@@ -47,7 +47,7 @@ Standard (path and data oriented)
**method**: when products have the same key and path, they are duplicates.

Two routes can receive the same product, with the same relative path. In normal processing,
the products should be identical, and *Integrity* checksums for it should be the same,
the products should be identical, and *Identity* checksums for it should be the same,



@@ -94,7 +94,7 @@ or::
Override the standard duplicate suppression key generation to use only the file name.

When multiple sources produce a product, but the result is not binary identical, and no
appropriate Integrity method is available, then one needs a different approach.
appropriate Identity method is available, then one needs a different approach.
Since the two sources are not, generally, synchronized,

URP
2 changes: 1 addition & 1 deletion docs/source/Explanation/History/mesh_gts.rst
@@ -291,7 +291,7 @@ The requirements for a store and forward system:
- TCP/IP connectivity,
- real-time data transmission,
- per destination queueing to allow asynchrony (clients that operate at different speeds or have transient issues),
- application level integrity guarantees.
- application level identity guarantees.

In addition, the ability to tune subscriptions, according to the client's
interest will further optimize traffic.
2 changes: 1 addition & 1 deletion docs/source/Explanation/History/messages_v03.rst
@@ -34,7 +34,7 @@ pairs.
* v02 fixed fields are now "pubTime", "baseURL", and "relPath" keys
in the JSON object that is the message body.

* v02 *sum* header with hex encoded value, is replaced by v03 *integrity* header with base64 encoding.
* v02 *sum* header with hex encoded value, is replaced by v03 *identity* header with base64 encoding.

* v03 *content* header allows file content embedding.
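
The change from a hex-encoded *sum* to a base64-encoded *identity* can be sketched as below. This is an illustration, not the project's conversion code: the function name ``sum_to_identity`` is hypothetical, and the flag-to-method table is an assumption covering only two common cases.

```python
import base64
import binascii

# Assumed mapping of v02 single-letter sum flags to v03 method names.
V02_FLAG_TO_METHOD = {"d": "md5", "s": "sha512"}


def sum_to_identity(v02_sum):
    """Convert a v02 'flag,hexdigest' string into a v03-style identity dict."""
    flag, hex_value = v02_sum.split(",", 1)
    raw_digest = binascii.unhexlify(hex_value)  # hex text -> raw bytes
    return {
        "method": V02_FLAG_TO_METHOD[flag],
        "value": base64.b64encode(raw_digest).decode("ascii"),
    }


# The hex string is the MD5 digest of b"hello".
example = sum_to_identity("d,5d41402abc4b2a76b9719d911017c592")
print(example)
```

Both encodings carry the same digest bytes; base64 is simply shorter on the wire than hex.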

10 changes: 5 additions & 5 deletions docs/source/Explanation/SarraPluginDev.rst
@@ -33,7 +33,7 @@ build new ones in a copy/paste manner, with many samples being available to read
There are other ways to extend Sarracenia v3 by subclassing of:

* Sarracenia.transfer.Transfer to add more data transfer protocols
* Sarracenia.integrity.Integrity to add more checksumming methods.
* Sarracenia.identity.Identity to add more checksumming methods.
* Sarracenia.moth.Moth to add support for more messaging protocols.
* Sarracenia.flow.Flow to create new flows.
* Sarracenia.flowcb.FlowCB to add custom callback routines to flows.
@@ -341,7 +341,7 @@ One can add additional functionality to Sarracenia by creating subclassing.

* sarra.moth - Messages Organized into Topic Hierarchies. (existing ones: rabbitmq-amqp)

* sarra.integrity - checksum algorithms ( existing ones: md5, sha512, arbitrary, random )
* sarra.identity - checksum algorithms ( existing ones: md5, sha512, arbitrary, random )

* sarra.transfer - additional transport protocols (https, ftp, sftp )

@@ -457,7 +457,7 @@ self is the notification message being processed. variables variables most used:
for non data download file operations, such as creation of symbolic links, file renames and removals.
content described in `sr_post(7) <../Reference/sr_post.7.html>`_

*msg['integrity']*
*msg['identity']*
The checksum structure, a python dictionary with 'method' and 'value' fields.

*msg['subtopic'], msg['new_subtopic']*
@@ -475,7 +475,7 @@ self is the notification message being processed. variables variables most used:
For example, all of the *new_* fields are in the *_deleteOnPost* by default.

*msg['onfly_checksum'], msg['data_checksum']*
the value of an *Integrity* checksum field calculated as data is downloaded.
the value of an *Identity* checksum field calculated as data is downloaded.
In the case where data is modified while downloading, the *onfly_checksum*
is to verify that the upstream data was correctly received, while the
*data_checksum* is calculated for downstream consumers.
@@ -932,7 +932,7 @@ Examples of things that would be fun to do with plugins:

- add additional message protocols (sub-classing Moth)

- additional checksums, subclassing Integrity. For example, to get GOES DCP
- additional checksums, subclassing Identity. For example, to get GOES DCP
data from sources such as USGS Sioux Falls, the reports have a trailer
that shows some antenna statistics from the reception site. So if one
receives GOES DCP from Wallops, for example, the trailer will be different
2 changes: 1 addition & 1 deletion docs/source/How2Guides/FlowCallbacks.rst
@@ -424,7 +424,7 @@ It's a good idea to look at the sarracenia source code itself. For example:
reception of notification messages from message queue protocol flows.

* *sarracenia.flowcb.nodupe.NoDupe* This modules removes duplicates from message
flows based on Integrity checksums.
flows based on Identity checksums.

* *sarracenia.flowcb.post.message.Message* is a class that implements posting
notification messages to Message queue protocol flows
13 changes: 10 additions & 3 deletions docs/source/How2Guides/UPGRADING.rst
@@ -39,6 +39,13 @@ Installation Instructions
git
---

*CHANGE*: v03 postformat field renamed: "integrity" is now "identity"

* current version will read messages with *integrity* and map them to *identity*.
* current version will post with *identity*, so older versions will miss them.
* https://github.com/MetPX/sarracenia/issues/703
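
The backward-compatible read described above could look roughly like the following. This is a sketch under stated assumptions, not the code from the commit: ``normalize_identity`` is a hypothetical helper, and the sample message contains only a few illustrative fields.

```python
import json


def normalize_identity(body):
    """Decode a v03 JSON message body, renaming a legacy 'integrity' key."""
    msg = json.loads(body)
    # Older posters still send 'integrity'; newer code expects 'identity'.
    if "integrity" in msg and "identity" not in msg:
        msg["identity"] = msg.pop("integrity")
    return msg


# A message as an older version might have posted it (illustrative fields only).
old_body = json.dumps({
    "relPath": "data/file.txt",
    "integrity": {"method": "sha512", "value": "c2FtcGxl"},
})
msg = normalize_identity(old_body)
print(sorted(msg.keys()))
```

The mapping only works in one direction: a reader that knows only *integrity* finds no such key in newly posted messages, which is why the change is flagged as breaking.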


3.0.40
------

@@ -125,13 +132,13 @@ git
*CHANGE*: The "Vendor" string is now "MetPX" instead of "science.gc.ca".
This affects some file placement particularly on Windows.

*CHANGE*: v03 notification message encoding changed: *Integrity* checksum is now optional.
*CHANGE*: v03 notification message encoding changed: *Identity* checksum is now optional.
(details: https://github.com/MetPX/sarracenia/issues/547 )
*md5sum* is no longer defined, replaced with *none* in sr3.

*CHANGE*: v03 notification message encoding changed for symbolic links, and file renames
and removals. There is now a 'fileOp' field for these dataless file operations.
The *Integrity* sum is now used exclusively for checksums.
The *Identity* sum is now used exclusively for checksums.


3.0.15
@@ -336,7 +343,7 @@ V2 to Sr3
queue_name queueName
report_back report
source_from_exchange sourceFromExchange
sum integrity
sum identity
suppress_duplicates nodupe_ttl
suppress_duplicates_basis nodupe_basis
topic_prefix topicPrefix
4 changes: 2 additions & 2 deletions docs/source/How2Guides/v2ToSr3.rst
@@ -331,12 +331,12 @@ In general, v3 plugins:
msg.exchange msg['exchange'] the channel on which the message was received.
msg.logger logger pythonic logging setup describe above.
msg.parts msg['size'] just omit, use sarracenia.Message constructor.
msg.sumflg msg['integrity'] just omit, use sarracenia.Message constructor.
msg.sumflg msg['identity'] just omit, use sarracenia.Message constructor.
msg.sumstr v2wrapper.sumstrFromMessage(msg) the literal string for a v2 checksum field.
parent.msg worklist.incoming v2 is 1 message at a time, sr3 has lists or messages.
================ ================================== ==========================================================

* the pubTime, baseUrl, relPath, retrievePath, size, integrity, are all standard message fields
* the pubTime, baseUrl, relPath, retrievePath, size, identity, are all standard message fields
better described in `sr_post(7) <../Reference/sr_post.7.html>`_

* if one needs to store per message state, then one can declare temporary fields in the message,
20 changes: 10 additions & 10 deletions docs/source/Reference/code.rst
@@ -159,46 +159,46 @@ sarracenia.instance
:private-members:
:special-members:

sarracenia.integrity
sarracenia.identity
--------------------

.. automodule:: sarracenia.integrity
.. automodule:: sarracenia.identity
:show-inheritance:
:members:
:private-members:
:special-members:

sarracenia.integrity.arbitrary
sarracenia.identity.arbitrary
------------------------------

.. automodule:: sarracenia.integrity.arbitrary
.. automodule:: sarracenia.identity.arbitrary
:show-inheritance:
:members:
:private-members:
:special-members:

sarracenia.integrity.sha512
sarracenia.identity.sha512
---------------------------

.. automodule:: sarracenia.integrity.sha512
.. automodule:: sarracenia.identity.sha512
:show-inheritance:
:members:
:private-members:
:special-members:

sarracenia.integrity.md5
sarracenia.identity.md5
------------------------

.. automodule:: sarracenia.integrity.md5
.. automodule:: sarracenia.identity.md5
:show-inheritance:
:members:
:private-members:
:special-members:

sarracenia.integrity.random
sarracenia.identity.random
---------------------------

.. automodule:: sarracenia.integrity.random
.. automodule:: sarracenia.identity.random
:show-inheritance:
:members:
:private-members:
4 changes: 2 additions & 2 deletions docs/source/Reference/sr3_options.7.rst
@@ -980,13 +980,13 @@ In directory ~/.cache/sarra/log::
loss of notifications. A queue which is not accessed for a long (implementation dependent)
period will be destroyed.

integrity <string>
identity <string>
------------------

All file notification messages include a checksum. It is placed in the amqp message header will have as an
entry *sum* with default value 'd,md5_checksum_on_data'.
The *sum* option tell the program how to calculate the checksum.
In v3, they are called Integrity methods::
In v3, they are called Identity methods::

cod,x - Calculate On Download applying x
sha512 - do SHA512 on file content (default)
4 changes: 2 additions & 2 deletions docs/source/Reference/sr3_post.1.rst
@@ -290,11 +290,11 @@ nodupe_ttl on|off|999
used ( set to a value other than 0 ) as otherwise blocksize will vary as files grow,
and much duplicate data transfer will result.

integrity <method>[,<value>]
identity <method>[,<value>]
----------------------------

All file notification messages include a checksum. The *sum* option specifies how to calculate it.
It is a comma separated string. Valid Integrity methods are ::
It is a comma separated string. Valid Identity methods are ::

cod,x - Calculate On Download applying x
sha512 - do SHA512 on file content (default)