This example is the classic word count given in the official PySpark documentation; the original can be found among the Spark examples.
# Step 1: read the source text file from HDFS.
text_file = sc.textFile("hdfs://...")

# Step 2: the actual computation. flatMap splits each line into words,
# map pairs every word with a count of 1, and reduceByKey sums the counts
# per word. flatMap, map, and reduceByKey are all Spark RDD operations.
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

# Step 3: save the result back to HDFS.
counts.saveAsTextFile("hdfs://...")
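To try the same pipeline without a cluster, it can also run in Spark's local mode. The sketch below is a self-contained variant under a few assumptions not in the original example: the input is a local file named words.txt (a hypothetical path), and instead of writing back to HDFS it prints the ten most frequent words.

from pyspark import SparkContext

# Local mode: "local[*]" uses all available cores on this machine.
sc = SparkContext("local[*]", "WordCount")

# Read a local text file (hypothetical path) instead of HDFS.
text_file = sc.textFile("words.txt")

# The same flatMap / map / reduceByKey pipeline as above.
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

# Pull back the ten most frequent words and print them, rather than
# saving the full result to storage.
for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print(word, count)

sc.stop()

Note that takeOrdered collects only the requested number of elements to the driver, which keeps this safe even for large inputs, whereas collect() would pull the entire result set into driver memory.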
If you reproduce this article, please credit the source: Getting started with pyspark – Sample Word Count in Pyspark - CodeDay