Feature/one splits file to rule them all #1409

Merged
ivakegg merged 32 commits into integration on Jul 31, 2024

Collaborator

@hlgp hlgp commented Feb 1, 2022

Merge NonShardedSplitsFile and ShardedTableMapFile. Use TableSplitsCache instead of reaching out to Accumulo in every reducer.

Splits file generation now takes the needs of each table's configured partitioner into account:

  • If the table's partitioner doesn't need splits, e.g. HashPartitioner, don't write them
  • If the table's partitioner only needs splits, only write the splits
  • If the table's partitioner needs splits and locations, write them
  • If the table's partitioner needs splits in a specific order, write them in that order, so we do not need to sort them at the beginning of every mapper.
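
The four cases above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `SplitsFilePlanSketch`, the `Need` enum, the `Split` type, and the tab-separated line layout are all hypothetical names and formats chosen for the example, and the "specific order" is modelled simply as a `Comparator` supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names are taken from the DataWave code base.
final class SplitsFilePlanSketch {

    // Assumed categories for what a table's partitioner needs from the splits file.
    enum Need { NONE, SPLITS, SPLITS_AND_LOCATIONS }

    // One split point plus the tablet location assumed to host it.
    static final class Split {
        final String row;
        final String location;

        Split(String row, String location) {
            this.row = row;
            this.location = location;
        }
    }

    // Returns the lines that would be written to the splits file for one table.
    static List<String> linesFor(Need need, Comparator<String> order, List<Split> splits) {
        if (need == Need.NONE) {
            // e.g. a hash-style partitioner: nothing is written for this table.
            return Collections.emptyList();
        }
        List<Split> ordered = new ArrayList<>(splits);
        // Sort once at write time so readers do not have to sort again.
        ordered.sort(Comparator.comparing((Split s) -> s.row, order));
        List<String> lines = new ArrayList<>();
        for (Split s : ordered) {
            // Locations are only included when the partitioner asks for them.
            lines.add(need == Need.SPLITS ? s.row : s.row + "\t" + s.location);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
                new Split("m", "tserver2"),
                new Split("c", "tserver1"));
        System.out.println(linesFor(Need.NONE, Comparator.naturalOrder(), splits));
        System.out.println(linesFor(Need.SPLITS, Comparator.naturalOrder(), splits));
        System.out.println(linesFor(Need.SPLITS_AND_LOCATIONS, Comparator.reverseOrder(), splits));
    }
}
```

The point of sorting at write time is that the cost is paid once per splits file rather than once per task that reads it.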

Conflicts:
	warehouse/ingest-core/src/main/java/datawave/ingest/config/BaseHdfsFileCacheUtil.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/job/ShardedTableMapFile.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/job/TableSplitsCache.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/partition/MultiTableRangePartitioner.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/job/ShardedTableMapFileTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/MultiTableRRRangePartitionerTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/MultiTableRangePartitionerTest.java
@hlgp hlgp added the linked label Apr 19, 2023
hlgp and others added 6 commits August 8, 2023 16:37
Conflicts:
	warehouse/ingest-core/src/main/java/datawave/ingest/config/BaseHdfsFileCacheUtil.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/job/IngestJob.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/job/MultiRFileOutputFormatter.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/job/NonShardedSplitsFile.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/job/ShardedTableMapFile.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/job/TableSplitsCache.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/partition/BalancedShardPartitioner.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/partition/MultiTableRRRangePartitioner.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/partition/MultiTableRangePartitioner.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/partition/SplitBasedHashPartitioner.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/partition/TabletLocationHashPartitioner.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/partition/TabletLocationNamePartitioner.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/job/MultiRFileOutputFormatterTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/job/ShardedTableMapFileTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/job/TableSplitsCacheTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/BalancedShardPartitionerTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/MultiTableRRRangePartitionerTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/MultiTableRangePartitionerTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/SplitBasedHashPartitionerTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/TabletLocationHashPartitionerTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/TabletLocationNamePartitionerTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/partition/TestShardGenerator.java
Conflicts:
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/job/ShardedTableMapFile.java
Add splits file to distributed cache, as NSPF used to.
ivakegg previously approved these changes Jul 29, 2024
@ivakegg ivakegg changed the title WIP: Feature/one splits file to rule them all Feature/one splits file to rule them all Jul 29, 2024
Collaborator

@austin007008 austin007008 Jul 30, 2024

While you're in here, you might want to fix the test data in this and the corresponding unit tests. Those "splits" are actually memory locations from something (I can't remember what), and they should be some kind of Base64. The only reason they don't fail on the encode/decode is that Apache Commons Base64 will happily let you input and output bad data. See https://github.com/NationalSecurityAgency/datawave/pull/2480/files for an example of how they should look.
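
One way to catch this kind of test data is a strict round-trip check. The sketch below uses the JDK's `java.util.Base64`, whose basic decoder throws `IllegalArgumentException` on characters outside the Base64 alphabet, rather than the Apache Commons codec used in the project. The class and method names are hypothetical, and the sample string `[B@1b2c3d` is an invented stand-in for a "memory location" value, not a value copied from the tests.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for validating split test data; not part of DataWave.
final class SplitEncodingCheckSketch {

    // Encodes a split row the way well-formed test data is assumed to look.
    static String encodeSplit(String row) {
        return Base64.getEncoder().encodeToString(row.getBytes(StandardCharsets.UTF_8));
    }

    // True only if the value decodes strictly and re-encodes to the same text.
    static boolean isStrictBase64(String value) {
        try {
            byte[] decoded = Base64.getDecoder().decode(value);
            return Base64.getEncoder().encodeToString(decoded).equals(value);
        } catch (IllegalArgumentException e) {
            // The JDK decoder rejects characters outside the Base64 alphabet.
            return false;
        }
    }

    public static void main(String[] args) {
        String good = encodeSplit("20240101_1");
        String bad = "[B@1b2c3d"; // invented example of a non-Base64 value
        System.out.println(good + " -> " + isStrictBase64(good));
        System.out.println(bad + " -> " + isStrictBase64(bad));
    }
}
```

A check like this could run in a unit test over the fixture values, so that malformed splits fail loudly instead of being silently accepted.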

@ivakegg ivakegg merged commit 142df3a into integration Jul 31, 2024
4 checks passed
5 participants