Lecture 8. Data mining
Data mining is the process of automating the extraction of data from different sources. A prominent example is the systematic extraction of data from the web or from publicly available repositories. Web sources can be mined using the following methods:
- Web formats: transfer formats such as HTML, XML, and JSON are downloaded and parsed (loaded) from the target web source and then processed for data extraction.
- Application programming interfaces (APIs): sets of routines created by the developers of the target repositories or servers that you can call from a programming language (for example, Java, Python, or R).
- Databases: although rare, servers sometimes give open access to their databases, which are usually implemented in MySQL or MongoDB.
There are also combinations of the above-mentioned methods.
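As an illustration of the first method, an HTML page can be downloaded and parsed into R. The sketch below assumes the `xml2` package is installed; the URL and the XPath expression are illustrative choices, not part of the lecture:

```r
library(xml2)   # package for parsing XML/HTML documents into R

# Download and parse an HTML page (illustrative URL)
page <- read_html("https://www.ncbi.nlm.nih.gov/")

# Extract the text of all links in the parsed document
links <- xml_text(xml_find_all(page, "//a"))
head(links)
```

The same `read_html` / `xml_find_all` pattern works for any node of interest once you know the page's structure.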
NCBI is one of the most important resources for diverse kinds of data, for example genetic variations, nucleotide and protein sequences, and literature. Here we will see a very simple method of data extraction that combines API connection via the Entrez programming utilities (e-utilities) with parsing of web formats, in the form of HTML files, into R.
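Concretely, an e-utilities query is just a URL built from the documented `esearch.fcgi` endpoint, a database name, and a search term. A minimal sketch in R, where the search term and `retmax` value are illustrative assumptions:

```r
# Base URL of the NCBI esearch e-utility (documented endpoint)
esearch <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Build a query for PubMed records about autism genetics;
# spaces in the search term are encoded as '+'
query <- paste0(esearch, "?db=pubmed&term=autism+genetics&retmax=20")

# Download the response: an XML document listing matching PubMed IDs
response <- readLines(query, warn = FALSE)
head(response)
```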
Our goal is to systematically extract publications from PubMed related to autism genetics.
The following steps comprise our data mining algorithm:
1. Construct a query using the e-utilities API.
2. Perform the query and download the response from PubMed in HTML format.
3. Parse the response into R and extract the data.
4. Iterate steps 1-3.
This algorithm is sketched in this script.
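The four steps above can be sketched in R as follows. The search term, the batch size, the number of iterations, and the use of `readLines` with regular expressions for parsing are illustrative assumptions, not the exact contents of the linked script:

```r
# Step 1: construct an esearch query (documented e-utilities endpoint)
base  <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
term  <- "autism+genetics"   # illustrative search term
batch <- 20                  # records per request (assumption)

# Step 4: iterate steps 1-3 over successive batches using 'retstart'
for (start in seq(0, 40, by = batch)) {
  query <- paste0(base, "esearch.fcgi?db=pubmed",
                  "&term=", term,
                  "&retmax=", batch,
                  "&retstart=", start)

  # Step 2: perform the query and download the response
  response <- readLines(query, warn = FALSE)

  # Step 3: parse the response and extract the PubMed IDs
  ids <- regmatches(response, gregexpr("<Id>[0-9]+</Id>", response))
  ids <- gsub("<Id>|</Id>", "", unlist(ids))
  print(ids)
}
```

The extracted IDs can then be passed to the `efetch.fcgi` e-utility to retrieve the full records for each publication.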
- Coursera notes: the Getting Data section provides methods for parsing XML and HTML formats into R.
- e-utilities: official NCBI documentation of the e-utilities, describing how to construct queries to PubMed.
- NCBI download: NCBI's full documentation on how to download all the data stored in their repositories.