The objective is to process the raw XML-based DBLP dataset and perform various parallel, distributed computations on it via the Hadoop MapReduce framework.
To use this framework, be sure to have sbt and Hadoop installed (both are required by the commands below).
To set up and run the project:
- Download cs441_Fall2020_HW01:
  `git clone https://[email protected]/jsanch75/jacob_sanchez_hw2.git`
- Navigate to the project root folder:
  `cd CS441/`
- Build the assembly jar from the command line:
  `sbt clean compile assembly`
- Run the Hadoop job on the Sandbox (see the driver sketch after this list):
  `hadoop jar target/scala-2.13/CS441-assembly-0.1.jar src/main/resources/dblp.xml src/main/resources/output`
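For orientation, a minimal driver in the spirit of the command above might look like the sketch below. The object name and job wiring are assumptions rather than the project's actual code; only the use of `args(0)` and `args(1)` as the input XML path and the output directory mirrors the command line.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Hypothetical driver: wires one job to the two command-line arguments
// (input XML path, output directory) passed to `hadoop jar`.
object Driver {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "dblp analysis")
    job.setJarByClass(Driver.getClass)
    // job.setMapperClass(...) / job.setReducerClass(...) would point at the
    // project's actual mapper and reducer classes here.
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))    // e.g. src/main/resources/dblp.xml
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // e.g. src/main/resources/output
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```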
The MapReduce jobs perform the following computations:
- Compute a spreadsheet or a CSV file that shows the top ten most published authors at each venue (a mapper sketch for this task follows the list).
- Compute the list of authors who published without interruption for N consecutive years, where N >= 10.
- For each venue, produce the list of publications that have only one author.
- For each venue, produce the list of publications with the highest number of authors.
- Produce the list of the top 100 authors, in descending order, who publish with the most co-authors, and the list of 100 authors who publish without any co-authors.
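As a hedged sketch of how the first task might be implemented, the mapper below emits one ("venue\tauthor", 1) pair per authorship so that a reducer can sum the counts and keep the ten largest totals per venue. The class name, the tab-separated composite key, and the assumption that each input value carries exactly one publication element are illustrative, not the project's actual code.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import scala.xml.XML

// Illustrative mapper for the "top ten authors per venue" task.
class VenueAuthorCountMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one = new IntWritable(1)

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    // Assumption: each input value holds one complete publication element.
    val elem = XML.loadString(value.toString)
    val venue =
      if ((elem \ "booktitle").nonEmpty) (elem \ "booktitle").text // conference papers
      else (elem \ "journal").text                                 // journal articles
    val authors = ((elem \ "author") ++ (elem \ "editor")).map(_.text)
    if (venue.nonEmpty)
      authors.foreach(a => context.write(new Text(s"$venue\t$a"), one))
  }
}
```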
The various mappers use Scala's scala-xml module to extract the publication tags, together with their nested <author> or <editor> elements, from the dblp.xml shards.
Depending on the reducer, the key maps to an IntWritable, MapWritable, or Text value in the output:
- Reducer maps (Text, Text) => list of publications per venue.
- Reducer maps (Text, IntWritable) => list of authors whose value is >= 10 (see the reducer sketch after this list).
- Reducer maps (Text, Text) => list of publication titles.
- Reducer maps (Text, Text) => (venue, list of publications).
- Reducer maps (Text, MapWritable) => list of authors selected by the MapWritable key.
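As a concrete example of the (Text, IntWritable) case, the reducer below keeps only authors whose value reaches 10. It assumes the incoming IntWritable is a per-author streak length (uninterrupted years of publishing) computed upstream; the class name and that assumption are illustrative, not the project's actual code.

```scala
import org.apache.hadoop.io.{IntWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import scala.jdk.CollectionConverters._

// Illustrative reducer: emit an author only if their longest streak is >= 10 years.
class LongStreakReducer extends Reducer[Text, IntWritable, Text, NullWritable] {
  override def reduce(author: Text, streaks: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, NullWritable]#Context): Unit = {
    val longest = streaks.asScala.map(_.get).max // longest uninterrupted run of years
    if (longest >= 10)
      context.write(author, NullWritable.get)
  }
}
```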
The mappers and reducers can easily be extended with multi-tag support to differentiate between publication types such as <phdthesis>, <mastersthesis>, etc.
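One possible shape for that extension is sketched below: dispatch on the element label so the same extraction logic can treat articles, conference papers, and theses differently. Treating the <school> element as a thesis's "venue" is an assumption made for illustration.

```scala
import scala.xml.Elem

// Sketch of multi-tag handling: pick the venue-like field per publication type.
def venueOf(pub: Elem): String = pub.label match {
  case "article"                       => (pub \ "journal").text
  case "inproceedings" | "proceedings" => (pub \ "booktitle").text
  case "phdthesis" | "mastersthesis"   => (pub \ "school").text // assumed venue-like field
  case _                               => ""
}
```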