The objective is to process the raw XML-based DBLP dataset and perform various parallel, distributed computations on it via the Hadoop MapReduce framework.
To use this framework, be sure to have sbt and Hadoop installed (both are required by the commands below).
To set up and run the project:
- Download cs441_Fall2020_HW01:
  `git clone https://[email protected]/jsanch75/jacob_sanchez_hw2.git`
- Navigate to the project root folder:
  `cd CS441/`
- Build the assembly jar from the command line:
  `sbt clean compile assembly`
- Run the Hadoop job on the Sandbox (see the driver sketch after this list):
  `hadoop jar target/scala-2.13/CS441-assembly-0.1.jar src/main/resources/dblp.xml src/main/resources/output`
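For orientation, a minimal driver in the spirit of the command above might look like the sketch below. The object name and job wiring are assumptions rather than the project's actual code; only the use of `args(0)` and `args(1)` as the input XML path and the output directory mirrors the command line.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Hypothetical driver: wires one job to the two command-line arguments
// (input XML path, output directory) passed to `hadoop jar`.
object Driver {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "dblp analysis")
    job.setJarByClass(Driver.getClass)
    // job.setMapperClass(...) / job.setReducerClass(...) would point at the
    // project's actual mapper and reducer classes here.
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))    // e.g. src/main/resources/dblp.xml
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // e.g. src/main/resources/output
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```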
The MapReduce jobs perform the following computations:
- Compute a spreadsheet or a CSV file that shows the top ten most published authors at each venue (a mapper sketch for this task follows the list).
- Compute the list of authors who published without interruption for N consecutive years, where N >= 10.
- For each venue, produce the list of publications that have only one author.
- For each venue, produce the list of publications with the highest number of authors.
- Produce the list of the top 100 authors, in descending order, who publish with the most co-authors, and the list of 100 authors who publish without any co-authors.
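As a hedged sketch of how the first task might be implemented, the mapper below emits one ("venue\tauthor", 1) pair per authorship so that a reducer can sum the counts and keep the ten largest totals per venue. The class name, the tab-separated composite key, and the assumption that each input value carries exactly one publication element are illustrative, not the project's actual code.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import scala.xml.XML

// Illustrative mapper for the "top ten authors per venue" task.
class VenueAuthorCountMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one = new IntWritable(1)

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    // Assumption: each input value holds one complete publication element.
    val elem = XML.loadString(value.toString)
    val venue =
      if ((elem \ "booktitle").nonEmpty) (elem \ "booktitle").text // conference papers
      else (elem \ "journal").text                                 // journal articles
    val authors = ((elem \ "author") ++ (elem \ "editor")).map(_.text)
    if (venue.nonEmpty)
      authors.foreach(a => context.write(new Text(s"$venue\t$a"), one))
  }
}
```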
The various mappers use Scala's scala-xml module to extract the publication tags, together with their nested <author> or <editor> elements, from the dblp.xml shards.
Depending on the reducer, the key maps to an IntWritable, MapWritable, or Text value in the output:
- Reducer maps (Text, Text) => list of publications per venue.
- Reducer maps (Text, IntWritable) => list of authors whose value is >= 10 (see the reducer sketch after this list).
- Reducer maps (Text, Text) => list of publication titles.
- Reducer maps (Text, Text) => (venue, list of publications).
- Reducer maps (Text, MapWritable) => list of authors selected by the MapWritable key.
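As a concrete example of the (Text, IntWritable) case, the reducer below keeps only authors whose value reaches 10. It assumes the incoming IntWritable is a per-author streak length (uninterrupted years of publishing) computed upstream; the class name and that assumption are illustrative, not the project's actual code.

```scala
import org.apache.hadoop.io.{IntWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import scala.jdk.CollectionConverters._

// Illustrative reducer: emit an author only if their longest streak is >= 10 years.
class LongStreakReducer extends Reducer[Text, IntWritable, Text, NullWritable] {
  override def reduce(author: Text, streaks: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, NullWritable]#Context): Unit = {
    val longest = streaks.asScala.map(_.get).max // longest uninterrupted run of years
    if (longest >= 10)
      context.write(author, NullWritable.get)
  }
}
```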
The mappers and reducers can easily be extended with multi-tag support to differentiate between publication types such as <phdthesis>, <mastersthesis>, etc.
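One possible shape for that extension is sketched below: dispatch on the element label so the same extraction logic can treat articles, conference papers, and theses differently. Treating the <school> element as a thesis's "venue" is an assumption made for illustration.

```scala
import scala.xml.Elem

// Sketch of multi-tag handling: pick the venue-like field per publication type.
def venueOf(pub: Elem): String = pub.label match {
  case "article"                       => (pub \ "journal").text
  case "inproceedings" | "proceedings" => (pub \ "booktitle").text
  case "phdthesis" | "mastersthesis"   => (pub \ "school").text // assumed venue-like field
  case _                               => ""
}
```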