pangpang pangpang - 2 years ago 172
Python Question

TypeError: unhashable type: 'TopicAndPartition' when createDirectStream

I want to consume kafka message from any arbitrary offset by


My source code:

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

def functionToCreateContext():
sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
ssc = StreamingContext(sc, 2)
kvs = KafkaUtils.createDirectStream(
{"": "localhost:9092"},
{TopicAndPartition("test123", 0): 100, TopicAndPartition("test123", 1): 100}
#kvs = kvs.checkpoint(10)
lines = x: x[1])
counts = lines.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a+b)
return ssc

if __name__ == "__main__":
ssc = StreamingContext.getOrCreate("./checkpoint", functionToCreateContext())


but get the error as below:

Traceback (most recent call last):
File "/usr/local/spark-1.6.0-bin-hadoop2.6/examples/src/main/python/streaming/", line 56, in <module>
ssc = StreamingContext.getOrCreate("./checkpoint", functionToCreateContext())
File "/usr/local/spark-1.6.0-bin-hadoop2.6/examples/src/main/python/streaming/", line 45, in functionToCreateContext
{TopicAndPartition("test123", 0): 100, TopicAndPartition("test123", 1): 100}
TypeError: unhashable type: 'TopicAndPartition'

pyspark source code:

def createDirectStream(ssc, topics, kafkaParams, fromOffsets=None,
keyDecoder=utf8_decoder, valueDecoder=utf8_decoder,

class TopicAndPartition(object):
Represents a specific top and partition for Kafka.

def __init__(self, topic, partition):
Create a Python TopicAndPartition to map to the Java related object
:param topic: Kafka topic name.
:param partition: Kafka partition id.
self._topic = topic
self._partition = partition

def _jTopicAndPartition(self, helper):
return helper.createTopicAndPartition(self._topic, self._partition)

jfromOffsets = dict([(k._jTopicAndPartition(helper),
v) for (k, v) in fromOffsets.items()])

fromOffsets should be a dict, the key of dict should be a

Any idea for this?

Answer Source

pyspark has a bug for python3, the TopicAndPartition class is missing a hash method, so you should change python3 to python2, the error is disappeared.

then should cast the offset from int to long:

{TopicAndPartition("test123", 0): long(100), TopicAndPartition("test123", 1): long(100)}
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download