
Commit 6c9743c

committed
add class 11
1 parent 52b0e41 commit 6c9743c

File tree

11 files changed

+740
-0
lines changed

11-2020-02-14/data/full_text.p

37.8 KB
Binary file not shown.
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Web Scraping"
8+
]
9+
},
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"[Web scraping](https://en.wikipedia.org/wiki/Web_scraping) refers to extracting data from websites. Scraping a web page involves two steps: fetching the page and then extracting data from it. The fetched content may be parsed, searched, reformatted, copied into a spreadsheet, and so on. Web scrapers typically take something out of a page in order to use it for another purpose somewhere else.\n",
15+
"\n",
16+
"__Web scraping is not difficult; the art lies in choosing which data to select for your analysis.__"
17+
]
18+
},
19+
{
20+
"cell_type": "markdown",
21+
"metadata": {},
22+
"source": [
23+
"Web scraping has to be performed responsibly so that it does not harm the sites being scraped. Machines can retrieve data much faster and in greater depth than humans, so bad scraping practices can degrade a site's performance.\n",
24+
"\n",
25+
"Most websites may not have anti-scraping mechanisms, since these would affect the user experience, but some sites do block scraping.\n",
26+
"\n",
27+
"> __\"With great power there must also come -- great responsibility!\"__\n",
28+
"\n",
29+
"\n",
30+
"\n",
31+
"To check which types of interaction are permitted by the website hosting the data, consult its `robots.txt` file.\n",
32+
"\n",
33+
"For instance, Amazon ([https://www.amazon.de/robots.txt](https://www.amazon.de/robots.txt)) makes it clear that it does not want to be scraped. In contrast, other sites, such as the website of [Freie Universität Berlin](https://www.fu-berlin.de/), impose fewer restrictions (see [https://www.fu-berlin.de/robots.txt](https://www.fu-berlin.de/robots.txt)).\n",
34+
"\n",
35+
"\n",
36+
"\n"
37+
]
38+
},
39+
{
40+
"cell_type": "markdown",
41+
"metadata": {},
42+
"source": [
43+
"> ## Task: Scrape the content of ABV training courses from the [Vorlesungsverzeichnis](https://www.fu-berlin.de/vv/de/modul?id=478016&sm=498562) (course catalog) and analyze it by generating a word cloud.\n",
44+
"__1. Understand the structure of the website__ \n",
45+
"__2. Get the data__ \n",
46+
"__3. Analyze/visualize the data__"
47+
]
48+
},
49+
{
50+
"cell_type": "markdown",
51+
"metadata": {},
52+
"source": [
53+
"***"
54+
]
55+
}
56+
],
57+
"metadata": {
58+
"kernelspec": {
59+
"display_name": "Python 3",
60+
"language": "python",
61+
"name": "python3"
62+
},
63+
"language_info": {
64+
"codemirror_mode": {
65+
"name": "ipython",
66+
"version": 3
67+
},
68+
"file_extension": ".py",
69+
"mimetype": "text/x-python",
70+
"name": "python",
71+
"nbconvert_exporter": "python",
72+
"pygments_lexer": "ipython3",
73+
"version": "3.7.3"
74+
}
75+
},
76+
"nbformat": 4,
77+
"nbformat_minor": 4
78+
}
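The `robots.txt` check described in the notebook above can be sketched with Python's standard library. This is a minimal illustration, not the course's own method: the rules in `EXAMPLE_ROBOTS_TXT` and the `example.org` URLs are made up for the example, and `build_parser` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# Ask whether a generic crawler ("*") may fetch each URL.
print(parser.can_fetch("*", "https://example.org/courses/"))      # True
print(parser.can_fetch("*", "https://example.org/private/data"))  # False
```

To check a live site instead, one would call `parser.set_url(...)` with the site's `robots.txt` address followed by `parser.read()`, which performs a network request.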
Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Introduction\n",
8+
"\n",
9+
"Modern websites usually consist of three components:\n",
10+
"* HTML (Hypertext Markup Language)\n",
11+
"* CSS (Cascading Style Sheets) and\n",
12+
"* JavaScript (JS)\n",
13+
"\n",
14+
"[__Hypertext Markup Language (HTML)__](https://en.wikipedia.org/wiki/Hypertext_Markup_Language) is a text-based, machine-readable _markup language_ for structuring web content such as text, lists, tables, hyperlinks, and images.\n",
15+
"\n",
16+
"[__Cascading Style Sheets (CSS)__](https://en.wikipedia.org/wiki/Cascading_Style_Sheets) is a formal language used to define the appearance of HTML documents. It is a so-called \"living standard\" that is continuously developed by the [World Wide Web Consortium (W3C)](https://en.wikipedia.org/wiki/World_Wide_Web_Consortium).\n",
17+
"With CSS, individual components of a website can be styled to your own needs (color, font, font size, spacing, etc.).\n",
18+
"\n",
19+
"[__JavaScript (JS)__](https://en.wikipedia.org/wiki/JavaScript) is a programming language that allows you to create interactive web content. Content can thus be changed, loaded, or generated in response to user interactions (input fields, animations, games, etc.).\n",
20+
"\n",
21+
"***"
22+
]
23+
},
24+
{
25+
"cell_type": "markdown",
26+
"metadata": {},
27+
"source": [
28+
"## HTML Basics\n",
29+
"\n",
30+
"[Hypertext Markup Language (HTML)](https://en.wikipedia.org/wiki/Hypertext_Markup_Language) is not a programming language in the strict sense. It is rather a markup language that describes the structure of a web page. The basic building block of HTML is the so-called _element_. It allows content to be structured and provided with attributes. \n",
31+
"\n",
32+
"### Elements\n",
33+
"\n",
34+
"An element can contain text, data, images, etc. Typically, an element starts with an opening tag `<...>`, which may carry attributes, encloses some content, and ends with a closing tag `</...>`.\n",
35+
"\n",
36+
"Here is an example of a `p` (_paragraph_) element: \n",
37+
"\n",
38+
"`<p class=\"abcd\">Hello world!</p>`\n",
39+
"\n",
40+
"- `<p ...>` the opening _tag_,\n",
41+
"- `class=\"abcd\"` an attribute and its value,\n",
42+
"- `Hello world!` the text content, and\n",
43+
"- `</p>` the closing _tag_\n",
44+
"\n",
45+
"There are also elements that have no content (_empty elements_):\n",
46+
"\n",
47+
"`<img src=\"mypath/image.png\">`\n",
48+
"\n",
49+
"This element contains an attribute but no closing tag (`</img>`) and no content.\n",
50+
"\n",
51+
"#### Texts\n",
52+
"\n",
53+
"##### Headings\n",
54+
"Heading elements mark individual text passages as headings of different levels. HTML defines six heading levels (`<h1>`–`<h6>`).\n",
55+
"\n",
56+
"```\n",
57+
"<h1>Heading 1st order</h1>\n",
58+
"<h2>Heading 2nd order</h2>\n",
59+
"<h3>Heading 3rd order</h3>\n",
60+
"<h4>Heading 4th order</h4>\n",
61+
"<h5>Heading 5th order</h5>\n",
62+
"<h6>Heading 6th order</h6>\n",
63+
"```\n",
64+
"\n",
65+
"##### Paragraphs\n",
66+
"The `<p>` element identifies a paragraph.\n",
67+
"\n",
68+
"```\n",
69+
"<p>I'm a paragraph</p>\n",
70+
"```\n",
71+
"\n",
72+
"#### Images\n",
73+
"\n",
74+
"The `<img>` element inserts an image file into the document. The `src` (_source_) attribute gives the path to the image file (a local file or a URL).\n",
75+
"\n",
76+
"`<img src=\"images/my_image.png\">`\n",
77+
"\n",
78+
"\n",
79+
"### The anatomy of an HTML document\n",
80+
"\n",
81+
"```\n",
82+
"<!DOCTYPE html>\n",
83+
"<html>\n",
84+
" <head>\n",
85+
" <meta charset=\"utf-8\">\n",
86+
" <title>Coding Workshop</title>\n",
87+
" </head>\n",
88+
" <body>\n",
89+
" <img src=\"image/Beiersdorf.png\">\n",
90+
" </body>\n",
91+
"</html>\n",
92+
"```\n",
93+
"\n",
94+
"* `<!DOCTYPE html>` The document type declaration. It is largely a historical artifact from the early days of HTML; today its main role is to make browsers render the page in standards mode.\n",
95+
"* `<html></html>` The `<html>` element. The element includes the entire content (_root element_).\n",
96+
"* `<head></head>` The `<head>` element. It is a container for everything relevant to the page that is not part of the content displayed on it.\n",
97+
"* `<meta charset=\"utf-8\">` The element describes the character encoding used.\n",
98+
"* `<title></title>` The `<title>` element. It sets the title of the web page, which the browser displays in the tab and uses as the page's name when it is bookmarked.\n",
99+
"* `<body></body>` The `<body>` element. This element contains all the contents of the website that are displayed to the user (text, images, videos, games, etc).\n",
100+
"\n",
101+
"***\n"
102+
]
103+
}
104+
],
105+
"metadata": {
106+
"kernelspec": {
107+
"display_name": "Python 3",
108+
"language": "python",
109+
"name": "python3"
110+
},
111+
"language_info": {
112+
"codemirror_mode": {
113+
"name": "ipython",
114+
"version": 3
115+
},
116+
"file_extension": ".py",
117+
"mimetype": "text/x-python",
118+
"name": "python",
119+
"nbconvert_exporter": "python",
120+
"pygments_lexer": "ipython3",
121+
"version": "3.7.3"
122+
}
123+
},
124+
"nbformat": 4,
125+
"nbformat_minor": 4
126+
}
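The element structure described in the notebook above (tags, attributes, text, and empty elements) can be explored with a small parser. The sketch below uses the standard library's `html.parser` as a stand-in for whatever scraping library the course uses later. `SAMPLE_HTML` is assembled from the notebook's own snippets, while `ElementCollector` and `VOID_ELEMENTS` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Sample document assembled from the snippets in the notebook.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Coding Workshop</title>
  </head>
  <body>
    <h1>Heading 1st order</h1>
    <p class="abcd">Hello world!</p>
    <img src="images/my_image.png">
  </body>
</html>
"""

# Empty elements have no closing tag and no content (partial list).
VOID_ELEMENTS = {"img", "meta", "br", "hr", "input", "link"}


class ElementCollector(HTMLParser):
    """Record the tag, attributes, and direct text of every element."""

    def __init__(self):
        super().__init__()
        self.elements = []  # all elements in document order
        self._stack = []    # elements still waiting for a closing tag

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(element)
        if tag not in VOID_ELEMENTS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1]["tag"] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Attach non-whitespace text to the innermost open element.
        if self._stack and data.strip():
            self._stack[-1]["text"] += data.strip()


collector = ElementCollector()
collector.feed(SAMPLE_HTML)
by_tag = {element["tag"]: element for element in collector.elements}

print(by_tag["title"]["text"])                    # Coding Workshop
print(by_tag["p"]["attrs"], by_tag["p"]["text"])  # {'class': 'abcd'} Hello world!
print(by_tag["img"]["attrs"]["src"])              # images/my_image.png
```

The `<img>` and `<meta>` elements show up with attributes but empty text, which matches the description of empty elements above.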
