The task of the “Reply Graph” exercise is to extract reply connections between mails in the archives of Apache Flink’s developer mailing list. These connections can be used to define a reply graph and to analyze the structure of the Flink community.
A reply connection is defined as a pair of two email addresses (Tuple2<String, String>) where the first email address replied to an email of the second email address. The task of this exercise is to compute all reply connections between emails in the Mail Data Set and count the number of reply connections for each unique pair of email addresses.
This exercise uses the Mail Data Set, which was extracted from the Apache Flink development mailing list archive. The Mail Data Set instructions show how to read the data set in a Flink program.
The task requires three fields: MessageId, Sender, and Reply-To. The input data can be read as a DataSet<Tuple3<String, String, String>>. When printed, the data set should look similar to this:
(<CAAdrtT0-sfxxUK-BrPC03ia7t1WR_ogA5uA6J5CSRvuON+snTg@mail.gmail.com>,Fabian Hueske <firstname.lastname@example.org>,<C869A196-EB43-4109-B81C-23FE9F726AC6@apache.org>)
(<CANMXwW0HOvk7n=h_rTv3RbK0E4ti1D7OdsY_3r8joib6rAAt2g@mail.gmail.com>,Aljoscha Krettek <email@example.com>,<CANC1h_vn8E8TLXD=8szDN+0HO6JrU4AsCWgrXh8ojkA=FiPxNw@mail.gmail.com>)
(<0E10813D-5ED0-421F-9880-17C958A41724@fu-berlin.de>,Ufuk Celebi <firstname.lastname@example.org>,null)
The Reply-To field might have the value "null", indicating that this mail was not written in response to another mail.
The result for the exercise should be a DataSet<Tuple3<String, String, Integer>>. The first field is the sender email address of the reply mail, the second field is the sender email address of the mail that was replied to, and the third field is the number of reply connections between these two email addresses. When printed, the data set should look like this:
(email@example.com,firstname.lastname@example.org,75)
(email@example.com,firstname.lastname@example.org,45)
(email@example.com,firstname.lastname@example.org,22)
(email@example.com,firstname.lastname@example.org,22)
The first result line indicates that email@example.com replied 75 times to an email sent by firstname.lastname@example.org.
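This exercise can be solved in three steps. First, extract the email address from the sender field. Next, compute the reply connections by evaluating the MessageId and Reply-To fields. And finally, count the number of reply connections for each unique pair of email addresses.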
The email address can be extracted from the sender field with a MapFunction which replaces the sender field with the extracted email address. This is the same operation that needs to be done for the Mail Count exercise.
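The extraction logic itself can be sketched in plain Java, independent of Flink. This is a minimal sketch, assuming the sender field has the form "Name <address>" as in the sample input above. The class and method names are hypothetical and not taken from the reference solution; in a Flink program the same logic would sit inside the map function.

```java
// Hypothetical helper for illustration only; not part of the reference solution.
class SenderEmailExtractor {

    // Returns the text between the last '<' and the next '>' after it.
    // If the sender has no angle brackets, the trimmed input is returned unchanged.
    static String extractEmail(String sender) {
        int start = sender.lastIndexOf('<');
        int end = sender.indexOf('>', start + 1);
        if (start < 0 || end < 0) {
            return sender.trim();
        }
        return sender.substring(start + 1, end);
    }

    public static void main(String[] args) {
        // Sender value copied from the sample input of this exercise.
        System.out.println(extractEmail("Fabian Hueske <firstname.lastname@example.org>"));
    }
}
```

Falling back to the trimmed input is a design choice of this sketch: records whose sender field carries only a bare address are kept rather than dropped.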
A reply connection exists between two mail records when the Reply-To field of the first mail record is equal to the MessageId field of the second mail record. Such pairs can be found by joining the mail record data set with itself on the Reply-To and MessageId fields.
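The self-join and the final count can be illustrated without Flink on an in-memory list. The following is a minimal sketch, assuming each mail is a String array of {MessageId, sender email address, Reply-To} with the sender already reduced to the plain address. The class name, the method name, and the "replier,repliedTo" key format are hypothetical. It uses a hash lookup in place of a distributed join, so it shows the logic rather than the Flink API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; not the Flink reference solution.
class ReplyGraphSketch {

    // Each mail is {messageId, senderEmail, replyTo}; replyTo is "null" for non-replies.
    // Returns a map from "replier,repliedTo" to the number of reply connections.
    static Map<String, Integer> countReplyConnections(List<String[]> mails) {
        // Build side of the join: MessageId -> sender email address.
        Map<String, String> senderById = new HashMap<>();
        for (String[] mail : mails) {
            senderById.put(mail[0], mail[1]);
        }
        // Probe side: match each mail's Reply-To against the MessageId index.
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] mail : mails) {
            String repliedTo = senderById.get(mail[2]);
            if (repliedTo != null) {
                // Count one reply connection for this pair of addresses.
                counts.merge(mail[1] + "," + repliedTo, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> mails = new ArrayList<>();
        mails.add(new String[] {"<1>", "a@example.org", "null"});
        mails.add(new String[] {"<2>", "b@example.org", "<1>"});
        mails.add(new String[] {"<3>", "b@example.org", "<1>"});
        System.out.println(countReplyConnections(mails));
    }
}
```

Mails whose Reply-To is "null" or points to a message that is not in the data set find no join partner and therefore contribute no reply connection, which matches the behavior of an inner join.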
Reference solutions are available at GitHub: