# webcrawler

A simple, recursive web crawler written in Java. Starting from a given website or domain, it collects internal and external links (excluding Facebook and Twitter) as well as images, and writes a simple XML file listing the pages it found together with the HTTP status code each one returned. It does not crawl the same page more than once, and it does not try to crawl downloadable files.
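The filtering rules described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `LinkFilterSketch`, the `Kind` enum, the extension lists, and the decision to follow only internal pages are all assumptions made for the example.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the link filtering described in this README.
class LinkFilterSketch {
    enum Kind { INTERNAL, EXTERNAL, IMAGE, DOWNLOAD, IGNORED }

    // Assumed lists; the real crawler may use different ones.
    private static final Set<String> EXCLUDED_HOSTS =
            new HashSet<>(Arrays.asList("facebook.com", "twitter.com"));
    private static final Set<String> IMAGE_EXTENSIONS =
            new HashSet<>(Arrays.asList("png", "jpg", "jpeg", "gif"));
    private static final Set<String> DOWNLOAD_EXTENSIONS =
            new HashSet<>(Arrays.asList("pdf", "zip", "exe"));

    private final String startHost;
    private final Set<String> visited = new HashSet<>();

    LinkFilterSketch(String startUrl) {
        this.startHost = normalizeHost(URI.create(startUrl).getHost());
    }

    // Lower-cases the host and drops a leading "www." so both forms compare equal.
    static String normalizeHost(String host) {
        if (host == null) return "";
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    // Returns the file extension of the last path segment, or "" if there is none.
    static String extension(String path) {
        if (path == null) return "";
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        return dot > slash ? path.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
    }

    Kind classify(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return Kind.IGNORED; // not a parseable URI
        }
        String host = normalizeHost(uri.getHost());
        if (host.isEmpty()) return Kind.IGNORED; // e.g. mailto: links
        for (String excluded : EXCLUDED_HOSTS) {
            if (host.equals(excluded) || host.endsWith("." + excluded)) return Kind.IGNORED;
        }
        String ext = extension(uri.getPath());
        if (IMAGE_EXTENSIONS.contains(ext)) return Kind.IMAGE;
        if (DOWNLOAD_EXTENSIONS.contains(ext)) return Kind.DOWNLOAD;
        return host.equals(startHost) ? Kind.INTERNAL : Kind.EXTERNAL;
    }

    // True only the first time an internal page is seen, so no page is crawled twice.
    // Assumption: only internal pages are followed recursively.
    boolean shouldCrawl(String url) {
        return classify(url) == Kind.INTERNAL && visited.add(url);
    }

    public static void main(String[] args) {
        LinkFilterSketch filter = new LinkFilterSketch("http://wiprodigital.com");
        String[] samples = {
            "http://wiprodigital.com/who-we-are",
            "https://designit.com/happening/news/create-the-future-together",
            "https://www.facebook.com/example"
        };
        for (String url : samples) {
            System.out.println(filter.classify(url) + " " + url);
        }
    }
}
```

A recursive crawl would call `shouldCrawl` on every link found on a page and descend only when it returns `true`; keeping the visited set inside the filter is one simple way to guarantee that each page is fetched at most once.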

## Download

GitHub Releases

## Run

mvn clean dependency:copy-dependencies package

### GUI

Double-click the downloaded file or use the console:

java -jar WebCrawler-1.0.jar

### Console

java -jar WebCrawler-1.0.jar http://wiprodigital.com

## Example Output

### GUI

![GUI]

### Console

INTERNAL LINKS:
[1] [200] http://wiprodigital.com
[2] [200] http://wiprodigital.com/who-we-are
[3] [200] http://wiprodigital.com/what-we-do
[4] [200] http://wiprodigital.com/what-we-think
[5] [XXX] ...

EXTERNAL LINKS:
[200] https://designit.com/happening/news/create-the-future-together
[200] http://www.un.org/sustainabledevelopment/sustainable-development-goals/

INTERNAL / EXTERNAL IMAGES:
[200] http://17776-presscdn-0-6.pagely.netdna-cdn.com/wp-content/themes/wiprodigital/images/wdlogo.png
[200] http://17776-presscdn-0-6.pagely.netdna-cdn.com/wp-content/themes/wiprodigital/images/designit_logo.png
[200] http://17776-presscdn-0-6.pagely.netdna-cdn.com/wp-content/uploads/2016/05/designit-logo.jpeg

### XML-File
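As a rough illustration of how a report containing the found pages and their status codes could be written, here is a minimal sketch using the JDK's StAX `XMLStreamWriter`. The element names `pages` and `page`, the `status` attribute, and the class name `CrawlReportSketch` are hypothetical; the file the crawler actually produces may be structured differently.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: serialises a URL -> status-code map as XML.
class CrawlReportSketch {

    static String toXml(Map<String, Integer> pages) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("pages");          // assumed root element name
            for (Map.Entry<String, Integer> entry : pages.entrySet()) {
                xml.writeStartElement("page");       // assumed element name
                xml.writeAttribute("status", String.valueOf(entry.getValue()));
                xml.writeCharacters(entry.getKey()); // special characters are escaped by the writer
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML report", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the pages in the order they were found.
        Map<String, Integer> pages = new LinkedHashMap<>();
        pages.put("http://wiprodigital.com", 200);
        pages.put("http://wiprodigital.com/who-we-are", 200);
        System.out.println(toXml(pages));
    }
}
```

A streaming writer is used here because the report is a flat list; it avoids building a DOM tree in memory when a crawl finds many pages.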


## License

Copyright (C) 2016 Suresh Inuguru

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the [GNU General Public License](https://github.com/sureshinuguru/webcrawler/blob/master/LICENSE) for more details.
