Lecture 8. Data mining
Data mining is the process of automating the extraction of data from different sources. A prominent example is the systematic extraction of data from the web or from publicly available repositories. Web sources can be mined using the following methods:
- Web formats: transfer formats such as HTML, XML, and JSON are downloaded and parsed (loaded) from the target web source and then processed for data extraction.
- Application programming interfaces (APIs): sets of routines created by the developers of the target repositories or servers that you can call from a programming language (for example, Java, Python, or R).
- Databases: although rare, servers sometimes give open access to their databases, which are usually implemented in MySQL or MongoDB.
There are also combinations of the above-mentioned methods.
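As an illustration of the first method, an HTML page can be downloaded and parsed into R. The sketch below assumes the `xml2` package is installed; the URL and the XPath expression are illustrative choices, not part of the lecture:

```r
library(xml2)   # package for parsing XML/HTML documents into R

# Download and parse an HTML page (illustrative URL)
page <- read_html("https://www.ncbi.nlm.nih.gov/")

# Extract the text of all links in the parsed document
links <- xml_text(xml_find_all(page, "//a"))
head(links)
```

The same `read_html` / `xml_find_all` pattern works for any node of interest once you know the page's structure.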
NCBI is one of the most important resources for diverse kinds of data, for example genetic variations, nucleotide and protein sequences, and literature. Here we will see a very simple method of data extraction that combines API connection via the Entrez programming utilities (e-utilities) with parsing of web formats, in the form of HTML files, into R.
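Concretely, an e-utilities query is just a URL built from the documented `esearch.fcgi` endpoint, a database name, and a search term. A minimal sketch in R, where the search term and `retmax` value are illustrative assumptions:

```r
# Base URL of the NCBI esearch e-utility (documented endpoint)
esearch <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Build a query for PubMed records about autism genetics;
# spaces in the search term are encoded as '+'
query <- paste0(esearch, "?db=pubmed&term=autism+genetics&retmax=20")

# Download the response: an XML document listing matching PubMed IDs
response <- readLines(query, warn = FALSE)
head(response)
```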
Our goal is to systematically extract publications from PubMed related to autism genetics.
The following steps comprise our data mining algorithm:
1. Construct a query using the e-utilities API.
2. Perform the query and download the response from PubMed in HTML format.
3. Parse the response into R and extract the data.
4. Iterate steps 1-3.
This algorithm is sketched in this script.
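The four steps above can be sketched in R as follows. The search term, the batch size, the number of iterations, and the use of `readLines` with regular expressions for parsing are illustrative assumptions, not the exact contents of the linked script:

```r
# Step 1: construct an esearch query (documented e-utilities endpoint)
base  <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
term  <- "autism+genetics"   # illustrative search term
batch <- 20                  # records per request (assumption)

# Step 4: iterate steps 1-3 over successive batches using 'retstart'
for (start in seq(0, 40, by = batch)) {
  query <- paste0(base, "esearch.fcgi?db=pubmed",
                  "&term=", term,
                  "&retmax=", batch,
                  "&retstart=", start)

  # Step 2: perform the query and download the response
  response <- readLines(query, warn = FALSE)

  # Step 3: parse the response and extract the PubMed IDs
  ids <- regmatches(response, gregexpr("<Id>[0-9]+</Id>", response))
  ids <- gsub("<Id>|</Id>", "", unlist(ids))
  print(ids)
}
```

The extracted IDs can then be passed to the `efetch.fcgi` e-utility to retrieve the full records for each publication.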
- Coursera notes: the Getting Data section provides methods for parsing XML and HTML formats into R.
- e-utilities: official NCBI documentation of the e-utilities, describing how to construct queries to PubMed.
- NCBI download: NCBI's full documentation on how to download all the data stored in their repositories.