The task of the “Reply Graph with PageRank” exercise is to define a graph using Flink’s Gelly API and then to analyze the structure of the Flink community by running a pageRank algorithm over it. The input data set for graph generation is the output of the Reply Graph exercise.
This exercise uses the output of the Reply Graph exercise as input (
The input data can be read as
DataSet<Tuple3<String, String, Integer>>. The first field is the sender email address of the reply mail, the second field is the sender email address of the mail that was replied to, and the third field is the number of reply connections between these two email addresses. When printed, the data set should look like this:
(email@example.com,firstname.lastname@example.org,72) (email@example.com,firstname.lastname@example.org,40) (email@example.com,firstname.lastname@example.org,22) (email@example.com,firstname.lastname@example.org,22)
The result of the exercise should be a
DataSet<Vertex<String,Double>>. The first field indicates the vertex id, i.e., email address, the second field is its associated rank.
When printed, the data set should look like this:
DataSet<Tuple3<String, String, Double>>of the replyGraph exercise as an Edge. The data generated from the reply-graph exercise can be read as a data set of edges with emails being the source and target edge ids, while the number of reply connections can be stored as edge weights.
fromDataSetmethod. It is recommended to create a class which takes as input arguments: path of input file, output file and number of iterations for the PageRank algorithm. Note, Add the
gelly dependencyin the pom file.
graph.run(new PageRankAlgorithm<..>(input parameters)). In Gelly’s PageRank implementation, a vertex distributes its rank to the target vertices according to the weight of its outgoing edges e.g. if vertex A has two outgoing links (
<A,C,1>), vertex B will get 2/3 of the total share of rank distributed by A, while C will get only 1/3. A suggestion would be to update edge values once a graph has been obtained, by taking into consideration the sum of all the outgoing edges from a given vertex,(
Reference solution is available at GitHub: