Class TokenPartitioner
- java.lang.Object
  - org.apache.spark.Partitioner
    - org.apache.cassandra.spark.bulkwriter.TokenPartitioner
- All Implemented Interfaces:
java.io.Serializable
public class TokenPartitioner extends org.apache.spark.Partitioner
Spark Partitioner for distributing data across Cassandra token ranges.
Serialization Architecture: This class supports TWO distinct serialization mechanisms, each serving a different purpose:
1. Direct Java Serialization (via writeObject/readObject): Used when Spark serializes this Partitioner for shuffle operations like repartitionAndSortWithinPartitions(). During shuffle, Spark sends the Partitioner to executors to determine which partition each record belongs to. The custom serialization methods at the end of this class handle saving/restoring the partition mappings.
2. Broadcast Variable Pattern (via BroadcastableTokenPartitioner): Used when broadcasting job configuration to executors. The driver extracts partition mappings into BroadcastableTokenPartitioner (a pure data wrapper with no transient fields), which is broadcast via BulkWriterConfig. Executors reconstruct TokenPartitioner from the broadcast data using the constructor TokenPartitioner(BroadcastableTokenPartitioner).
Both mechanisms are necessary because:
- Shuffle operations (repartitionAndSortWithinPartitions) serialize the Partitioner directly
- Broadcast variables use the broadcastable wrapper pattern to avoid Logger serialization issues
The transient fields (partitionMap, reversePartitionMap, nrPartitions) are marked transient to avoid serializing large/complex objects when not needed, but are properly handled by custom serialization when direct serialization is required.
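The direct Java serialization path can be illustrated with a minimal sketch. It is not the actual TokenPartitioner source: the class SketchPartitioner, its TreeMap field, and the roundTrip helper are hypothetical names chosen for illustration. It only shows the general technique of restoring transient fields through private writeObject/readObject methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.math.BigInteger;
import java.util.TreeMap;

// Hypothetical stand-in for a partitioner with transient mappings; not the real class.
class SketchPartitioner implements Serializable {
    private static final long serialVersionUID = 1L;

    // Transient fields are skipped by default serialization,
    // so writeObject/readObject below must save and restore them explicitly.
    private transient TreeMap<BigInteger, Integer> upperBoundToPartition;
    private transient int nrPartitions;

    SketchPartitioner(TreeMap<BigInteger, Integer> upperBoundToPartition) {
        this.upperBoundToPartition = new TreeMap<>(upperBoundToPartition);
        this.nrPartitions = upperBoundToPartition.size();
    }

    int numPartitions() {
        return nrPartitions;
    }

    Integer partitionForUpperBound(BigInteger upperBound) {
        return upperBoundToPartition.get(upperBound);
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();               // writes non-transient state (none here)
        out.writeInt(nrPartitions);             // then the transient fields, by hand
        out.writeObject(upperBoundToPartition);
    }

    @SuppressWarnings("unchecked")
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        nrPartitions = in.readInt();            // read back in the same order as written
        upperBoundToPartition = (TreeMap<BigInteger, Integer>) in.readObject();
    }

    // Serializes and deserializes in memory, roughly what happens when a
    // partitioner is shipped from the driver to an executor.
    static SketchPartitioner roundTrip(SketchPartitioner original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchPartitioner) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }
}
```

Without the two private methods, the copy returned by roundTrip would have a null map and a partition count of 0, which is the failure mode the custom serialization guards against.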
- See Also:
- Serialized Form
-
-
Constructor Summary
Constructors
- TokenPartitioner(BroadcastableTokenPartitioner broadcastable): Reconstruct TokenPartitioner from BroadcastableTokenPartitioner on executor.
- TokenPartitioner(TokenRangeMapping<RingInstance> tokenRangeMapping, java.lang.Integer userSpecifiedNumberSplits, int defaultParallelism, java.lang.Integer cores)
- TokenPartitioner(TokenRangeMapping<RingInstance> tokenRangeMapping, java.lang.Integer userSpecifiedNumberSplits, int defaultParallelism, java.lang.Integer cores, boolean randomize)
-
Method Summary
Modifier and Type, Method
- int calculateSplits(TokenRangeMapping<RingInstance> tokenRangeMapping, java.lang.Integer numberSplits, int defaultParallelism, java.lang.Integer cores)
- int getPartition(java.lang.Object key)
- com.google.common.collect.Range<java.math.BigInteger> getTokenRange(int partitionId)
- int numPartitions()
- int numSplits()
-
-
-
Constructor Detail
-
TokenPartitioner
public TokenPartitioner(TokenRangeMapping<RingInstance> tokenRangeMapping, java.lang.Integer userSpecifiedNumberSplits, int defaultParallelism, java.lang.Integer cores)
-
TokenPartitioner
public TokenPartitioner(TokenRangeMapping<RingInstance> tokenRangeMapping, java.lang.Integer userSpecifiedNumberSplits, int defaultParallelism, java.lang.Integer cores, boolean randomize)
-
TokenPartitioner
public TokenPartitioner(BroadcastableTokenPartitioner broadcastable)
Reconstruct TokenPartitioner from BroadcastableTokenPartitioner on executor.
This constructor is part of the broadcast variable serialization mechanism. When BulkWriterConfig is broadcast to executors, it contains BroadcastableTokenPartitioner (a pure data wrapper). Executors use this constructor to rebuild the TokenPartitioner with all necessary partition mappings.
This reconstruction path is separate from the direct Java serialization (writeObject/readObject) used for Spark shuffle operations. The broadcast pattern is preferred for configuration data because it avoids Logger serialization issues and minimizes broadcast size.
- Parameters:
broadcastable - the broadcastable token partitioner from broadcast variable
- See Also:
BroadcastableTokenPartitioner, BulkWriterConfig
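A rough sketch of the broadcast wrapper pattern follows. Both classes are hypothetical (BroadcastableSketch and RebuiltPartitionerSketch are not part of this API), and the choice of a TreeMap keyed by range upper bound is an assumption made only to keep the example self-contained. The point is the shape of the pattern: a pure data holder travels to the executor, and a constructor rebuilds the derived mappings there.

```java
import java.io.Serializable;
import java.math.BigInteger;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical pure-data wrapper: only plain serializable fields,
// no logger and nothing transient, so default serialization is enough.
final class BroadcastableSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final TreeMap<BigInteger, Integer> upperBoundToPartition;
    private final int numberSplits;

    BroadcastableSketch(Map<BigInteger, Integer> upperBoundToPartition, int numberSplits) {
        this.upperBoundToPartition = new TreeMap<>(upperBoundToPartition);
        this.numberSplits = numberSplits;
    }

    Map<BigInteger, Integer> upperBoundToPartition() {
        return Collections.unmodifiableMap(upperBoundToPartition);
    }

    int numberSplits() {
        return numberSplits;
    }
}

// Hypothetical executor-side object rebuilt from the wrapper.
final class RebuiltPartitionerSketch {
    private final TreeMap<BigInteger, Integer> partitionMap;
    private final TreeMap<Integer, BigInteger> reversePartitionMap = new TreeMap<>();
    private final int nrPartitions;
    private final int numberSplits;

    // Mirrors the idea of TokenPartitioner(BroadcastableTokenPartitioner):
    // copy the data, then derive the reverse mapping and the partition count.
    RebuiltPartitionerSketch(BroadcastableSketch broadcastable) {
        this.partitionMap = new TreeMap<>(broadcastable.upperBoundToPartition());
        this.partitionMap.forEach((upperBound, id) -> reversePartitionMap.put(id, upperBound));
        this.nrPartitions = partitionMap.size();
        this.numberSplits = broadcastable.numberSplits();
    }

    int numPartitions() {
        return nrPartitions;
    }

    int numSplits() {
        return numberSplits;
    }

    BigInteger upperBoundOf(int partitionId) {
        return reversePartitionMap.get(partitionId);
    }
}
```

Keeping the wrapper free of behavior means the broadcast payload contains only what the executor needs to reconstruct the partitioner.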
-
-
Method Detail
-
numPartitions
public int numPartitions()
- Specified by:
numPartitions in class org.apache.spark.Partitioner
-
getPartition
public int getPartition(java.lang.Object key)
- Specified by:
getPartition in class org.apache.spark.Partitioner
- Parameters:
key - the decorated key
- Returns:
the partition (non-negative) for the given key; if key is not present in the partition map, 0 is returned
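The fallback behavior described above (return 0 when the key is not in the partition map) can be sketched as below. This is an illustration under stated assumptions, not the real implementation: PartitionLookupSketch is a hypothetical class, it takes a raw BigInteger token rather than a decorated key, and it models each range by its inclusive upper bound in a TreeMap.

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lookup helper; the real TokenPartitioner may store its ranges differently.
final class PartitionLookupSketch {
    // Assumption: each entry maps the inclusive upper bound of a token range to its partition id.
    private final TreeMap<BigInteger, Integer> upperBoundToPartition;
    // Assumption: tokens at or below this exclusive lower bound fall outside every range.
    private final BigInteger exclusiveLowerBound;

    PartitionLookupSketch(BigInteger exclusiveLowerBound,
                          Map<BigInteger, Integer> upperBoundToPartition) {
        this.exclusiveLowerBound = exclusiveLowerBound;
        this.upperBoundToPartition = new TreeMap<>(upperBoundToPartition);
    }

    // Returns the partition whose range contains the token,
    // or 0 when no range contains it (the documented fallback).
    int getPartition(BigInteger token) {
        if (token.compareTo(exclusiveLowerBound) <= 0) {
            return 0;
        }
        // Smallest upper bound that is >= token identifies the enclosing range.
        Map.Entry<BigInteger, Integer> entry = upperBoundToPartition.ceilingEntry(token);
        return entry == null ? 0 : entry.getValue();
    }
}
```

Note that with this fallback a result of 0 is ambiguous: it can mean either "partition 0" or "no matching range", so callers cannot distinguish the two from the return value alone.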
-
numSplits
public int numSplits()
-
getTokenRange
public com.google.common.collect.Range<java.math.BigInteger> getTokenRange(int partitionId)
-
calculateSplits
public int calculateSplits(TokenRangeMapping<RingInstance> tokenRangeMapping, java.lang.Integer numberSplits, int defaultParallelism, java.lang.Integer cores)
-
-