Spark zipnumcluster job (draft) #8
base: main
Conversation
Also note: @sebastian-nagel - I do have a "more perfect sorting" version which uses reservoir sampling in my local history - if we end up needing it, it's already done.
After thinking longer about it: some types of queries will still work, but others won't. The problem starts when the software reading the CDX index assumes that it is totally sorted. This especially applies to any kind of range query. For example, a range query that reports the number of matching pages basically does so by counting the number of lines in the secondary index (cluster.idx) that fall inside the queried range.
Because there might also be results in the zipnum block before the first matching one, 1 is added to the number of lines. If the zipnum blocks are non-contiguous, we'd need to add 1 for every contiguous range of blocks, and the result naturally becomes less precise. In addition, there is more work to do for larger range queries. Compare that with the statement "Generally, this overhead [of the zipnum index] is negligible when looking up large indexes, and non-existent when doing a range query across many CDX lines." (https://pywb.readthedocs.io/en/latest/manual/indexing.html#zipnum). On the other hand, queries for single URLs might work the same and with the same performance, independent of the partitioning scheme.
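That counting could be sketched roughly as follows (not pywb's actual code; cluster_idx_keys is assumed to hold the sorted first SURT keys of all zipnum blocks, range_start/range_end the bounds of the query):

import bisect

def estimate_block_count(cluster_idx_keys, range_start, range_end):
    # number of secondary-index lines whose first key falls inside the range
    lo = bisect.bisect_left(cluster_idx_keys, range_start)
    hi = bisect.bisect_left(cluster_idx_keys, range_end)
    # add 1 for the preceding block, which may also contain matching records
    return (hi - lo) + 1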
What does this mean? Either we implement total order sorting, or all kinds of range queries also need to be tested. Of course, even then we'd need to document the new CDX index sorting for our users and spread this information. We do not know which assumptions are made in any third-party software and whether they rely on total order sorting. This alone might make it less work to just implement total order sorting.
As I mentioned in the next sentence fragment - I used the same technique as the Hadoop version: reservoir sampling to produce the ranges, then another pass using those ranges to do the shards. I will find that version in my local history and check it when I work on this next.
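For reference, the sampling pass could be sketched like this (plain reservoir sampling; keys, sample_size and num_partitions are placeholder names, not taken from this PR):

import random

def reservoir_sample(keys, sample_size):
    # keep a uniformly random sample of sample_size keys from the stream
    sample = []
    for i, key in enumerate(keys):
        if i < sample_size:
            sample.append(key)
        else:
            j = random.randint(0, i)
            if j < sample_size:
                sample[j] = key
    return sample

# partition boundaries can then be taken as num_partitions - 1 evenly
# spaced keys of the sorted sample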
Maybe it's not necessary to do the sampling step - Spark has a sortBy (or sortByKey) method which does a total order sort into N partitions. We use it to sort the vertices of the host-level webgraph before enumerating them. Same as with reservoir sampling: the partitions are not perfectly balanced, but the balance is acceptable. Note: Spark also has methods that only sort the data within the partitions; they are usually named with "WithinPartitions", see for example repartitionAndSortWithinPartitions.
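A hedged sketch of both options (the app name, num_shards and the sample records are placeholders, not taken from this repository); the RDD is assumed to hold (SURT key, CDX line) pairs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-sketch").getOrCreate()
num_shards = 2
cdx_rdd = spark.sparkContext.parallelize([
    ("org,example)/ 20240101000000", "cdx line ..."),
    ("com,example)/ 20240101000000", "cdx line ..."),
])

# total order sort: Spark samples the keys to choose range boundaries,
# so the key ranges of the num_shards output partitions do not overlap
totally_sorted = cdx_rdd.sortByKey(numPartitions=num_shards)

# sort only within each partition: the (default hash) partitioner decides
# where a key lands, so the key ranges of the shards may overlap
locally_sorted = cdx_rdd.repartitionAndSortWithinPartitions(
    numPartitions=num_shards)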
Indeed - I was aware of these, but not all of them, and only some of the nuance. I've done some deep reading, and, by my best assessment, my informal definition of "perfect sort" is that the last record of one partition will be less than the first record of the next partition (so, if I go through the partitions in order, I will never get records out of order).
I'm leaning towards repartitionAndSortWithinPartitions, using a hash of the URL - but I may change my mind after running a few jobs and seeing how uneven they are. IMHO, 5-10% variance seems fine; if it's much more than that, it doesn't feel as good (though that's why I want to read the zipnum code as I state below - it may not be a practical issue, so it could be fine). Since I'm waiting on/monitoring other jobs anyway, I'm going to take another block of time tomorrow to do a similar deep read of the zipnum code, just so I have a much better understanding of that as well (specifically, I'm going to read the index server's code which USES zipnum, as that's the part that is still murky to me). Thanks again for the input @sebastian-nagel, much appreciated.
Everyone's expectation is that the SURT value ranges of the CDX index shards and Parquet shards should not overlap. For CDX, that's expected by pywb. For Parquet, it's important for optimization. We do have a few Parquet indexes for which that isn't true, and it's a problem we will fix someday.
Got it - I think with the hash and reservoir-sampled approaches I outlined, they should not overlap at the shard level (as the former would not allow it, and the latter would match what we already do today pretty exactly). There MAY be gzip chunks that overlap (very small amounts with reservoir sampling, and potentially rather larger amounts with hashing) - but, as long as the secondary index reflects those properly, I don't think it'll be an issue, based on my read of the index server side of things.
I will get back to this task Monday, so there's plenty of time to discuss if I'm wrong there. I will bring it up on the eng. call, and if we need to talk, we can do it then.
What Greg means is that there must be zero overlap between the ranges defined by the first and the last SURT of each zipnum block. It's important because the secondary index (cluster.idx) only stores the first SURT, not the last one. But with strict sorting, the last SURT of a block must be lower than (sort before) the first one of the next zipnum block. For Parquet, zero overlap is an optimization but not a requirement: every Parquet file and row group carries the min and max values in the statistics in its footer.
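To make the property concrete, a small sanity check could look like this (a sketch, not project code; block_ranges is an assumed list of (first_surt, last_surt) pairs in index order):

def blocks_non_overlapping(block_ranges):
    # the last SURT of each zipnum block must not sort after the first
    # SURT of the following block
    for (_first_a, last_a), (first_b, _last_b) in zip(block_ranges, block_ranges[1:]):
        if last_a > first_b:
            return False
    return True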
I have updates to this task in another branch - for now, I'm going to preserve the existing approach (reservoir sampling) and get this task finished. I have it all working locally, will be testing tonight/tomorrow in S3 on a full crawl, and will then re-do the PR to reflect that (I'll probably merge it into this branch, to keep things simple and preserve the above history).
…res (need to refactor ccpyspark back into it now)
There are some hiccups when trying to create a zipnum index locally. Finally, I was able to create a zipnum index by setting:
- --output_base_url=file:/absolute/path/to/index/ (but this creates a directory tree file:/absolute/path/to/index/ in the current directory)
- --partition_boundries_file=/absolute/path/to/index/cluster.idx (please also fix the typo "boundries" - it should be "boundaries")

The created index files, both cdx-*.gz and cluster.idx, look good:
- all records from the input are in the zipnum index
- no sorting issues found
- offsets in cluster.idx point to valid zipnum blocks (a rough version of this check is sketched below)
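A rough version of that offset check might look like the following sketch; it assumes the usual zipnum layout where each cluster.idx line is tab-separated into key, shard file name, offset and length, and that the shard files sit next to cluster.idx (adjust if the actual layout differs):

import gzip
import os

def check_cluster_idx(index_dir):
    # verify that every (offset, length) slice is a complete gzip member
    with open(os.path.join(index_dir, 'cluster.idx')) as idx:
        for line in idx:
            fields = line.rstrip('\n').split('\t')
            shard, offset, length = fields[1], int(fields[2]), int(fields[3])
            with open(os.path.join(index_dir, shard), 'rb') as shard_file:
                shard_file.seek(offset)
                block = shard_file.read(length)
            gzip.decompress(block)  # raises if the block is truncated or invalid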
zipnumcluster-ccpyspark.py
Outdated
parser.add_argument("--output_base_url", required=False, | ||
default='my_cdx_bucket', | ||
help="destination for output") | ||
parser.add_argument("--partition_boundries_file", required=False, |
If --partition_boundries_file is not specified, the job fails:
Traceback (most recent call last):
File ".../webarchive-indexing/zipnumcluster-ccpyspark.py", line 263, in <module>
job.run()
File ".../webarchive-indexing/sparkcc.py", line 187, in run
self.run_job(session)
File ".../webarchive-indexing/zipnumcluster-ccpyspark.py", line 238, in run_job
self.write_output_file(boundries_file_uri, f)
File ".../webarchive-indexing/sparkcc.py", line 840, in write_output_file
uri_match = self.data_url_pattern.match(uri)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or bytes-like object, got 'NoneType'
Is this a required argument?
Yes - it's either where the file is written to, or read from (if it already exists)
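Given that, one option would be to simply make the argument required so the job fails with a clear message instead of the TypeError above (a sketch, assuming the argparse setup from the excerpt; the suggested spelling fix is included):

parser.add_argument("--partition_boundaries_file", required=True,
                    help="file the partition boundaries are written to, "
                         "or read from if it already exists")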
# loop over the output files and concatenate them into a single final file
with open('cluster.idx', 'wb') as f:
    for idx_file,_ in rdd:
        with self.fetch_file(output_base_url + idx_file) as idx_fd:
The default of --output_base_url is my_cdx_bucket (no trailing slash). So, when testing locally, I get the error:
4/12/20 14:03:37 INFO ZipNumClusterCdx: Reading local file my_cdx_bucketidx-00000.idx
Traceback (most recent call last):
File ".../webarchive-indexing/zipnumcluster-ccpyspark.py", line 263, in <module>
job.run()
File ".../webarchive-indexing/sparkcc.py", line 187, in run
self.run_job(session)
File ".../webarchive-indexing/zipnumcluster-ccpyspark.py", line 252, in run_job
with self.fetch_file(output_base_url + idx_file) as idx_fd:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../webarchive-indexing/sparkcc.py", line 727, in fetch_file
warctemp = open(uri, 'rb')
^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.../webarchive-indexing/my_cdx_bucketidx-00000.idx'
Looks like, for testing locally, --output_base_url needs to be of the form file:/absolute/path/.
yes, that's correct (and is shown in the integration tests/scripts in the cc-redact project)
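A possible guard against the missing separator (a sketch, reusing the names from the excerpt above rather than the PR's actual fix):

output_base_url = self.args.output_base_url
# make sure the base URL ends with a separator before per-partition
# index file names are appended to it
if not output_base_url.endswith('/'):
    output_base_url += '/'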
help="number of partitions/shards") | ||
|
||
def run_job(self, session): | ||
os.makedirs(self.args.output_base_url, exist_ok=True) |
See below: if --output_base_url is file:/absolute/path/, on Linux a relative folder file: is created in the current directory.
I will double-check that - it's not happening for me, but I may have omitted file:, so I may be testing it differently.
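If the file: form is kept, one way to avoid the literal file: directory is to strip the URL scheme before calling os.makedirs - a sketch, assuming only plain paths and file: URLs need to be handled at this point:

import os
from urllib.parse import urlparse

def ensure_local_output_dir(output_base_url):
    parsed = urlparse(output_base_url)
    if parsed.scheme in ('', 'file'):
        # plain path or file: URL: create the directory locally
        os.makedirs(parsed.path, exist_ok=True)
    # remote locations (e.g. s3://...) are left to the output writer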
zipnumcluster-ccpyspark.py
Outdated
rdd = rdd.repartitionAndSortWithinPartitions(
    numPartitions=num_partitions,
    partitionFunc=lambda k: get_partition_id(k,boundaries),
    keyfunc=lambda x: x[0]) \
This will sort on the first character of the SURT key only. Should use the identity function (just leave out the param keyfunc).
Hmm - OK, I had put that in there because I thought x was a tuple at that point and [0] was the key - but based on what you're saying, does keyfunc already get passed the first element of the tuple? I will double-check the docs, but it sounds reasonable.
Yup, right you are - this is from the Spark source:
key=lambda k_v: keyfunc(k_v[0])
That's a subtle difference between sortByKey and sortBy. And of course, from the documentation it's even less clear how repartitionAndSortWithinPartitions() behaves. Ok, in Scala or Java this would just result in an exception ("Expected list or tuple, got string"); with Python this is tricky.
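For reference, a sketch of the corrected call (get_partition_id, boundaries and num_partitions are modeled on the names in the diff above, not taken from the final code); with keyfunc left at its default identity function, the full SURT key is compared, not just its first character:

import bisect

def get_partition_id(key, boundaries):
    # boundaries: sorted list of num_partitions - 1 split keys
    return bisect.bisect_left(boundaries, key)

rdd = rdd.repartitionAndSortWithinPartitions(
    numPartitions=num_partitions,
    partitionFunc=lambda k: get_partition_id(k, boundaries))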
This is a cc-pyspark version of the zipnum clustering job (without the use of the mrjob framework).