How to Extract Text from Web Pages in Java

This example shows a simple program that extracts the text from a web page in Java. It uses the jsoup library, and the program's output is shown at the end of this post.

Example Program (ExtractText.java)

package com.dineshkrish;

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/**
 * 
 * @author Dinesh Krishnan
 *
 */

public class ExtractText {

	// extracts the text of the page at the given URL; returns null if the page cannot be fetched
	public String getText(final String link) {

		String text = null;

		try {

			// creating the URL object
			URL url = new URL(link);

			// fetching and parsing the HTML document (5000 ms timeout)
			Document document = Jsoup.parse(url, 5000);

			// extracting the text from the given URL
			text = document.text();

		} catch (MalformedURLException e) {

			System.out.println(e.getMessage());
			e.printStackTrace();
		} catch (IOException e) {

			System.out.println(e.getMessage());
			e.printStackTrace();
		}

		return text;
	}

	public static void main(String[] args) {

		// input URL; change it as needed
		String link = "http://www.google.com";

		ExtractText text = new ExtractText();

		// printing the extracted text
		System.out.println(text.getText(link));
	}
}
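
Because the input link can be changed freely, it may help to check it before handing it to getText. The following is only a minimal sketch using the standard java.net.URI class; the class name LinkValidator and the method isFetchableLink are hypothetical names chosen for this illustration and are not part of jsoup or of the program above. It assumes that only absolute http and https links should be fetched.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper (not part of jsoup): checks a link before it is fetched.
public class LinkValidator {

	// returns true only for absolute http/https links that name a host
	public static boolean isFetchableLink(String link) {

		if (link == null || link.trim().isEmpty()) {
			return false;
		}

		try {

			URI uri = new URI(link.trim());
			String scheme = uri.getScheme();

			// a link such as "www.google.com" has no scheme and no host
			if (scheme == null || uri.getHost() == null) {
				return false;
			}

			String lower = scheme.toLowerCase(Locale.ROOT);
			return lower.equals("http") || lower.equals("https");

		} catch (URISyntaxException e) {

			// the link is not a syntactically valid URI
			return false;
		}
	}

	public static void main(String[] args) {

		System.out.println(isFetchableLink("http://www.google.com")); // true
		System.out.println(isFetchableLink("www.google.com")); // false
		System.out.println(isFetchableLink("ftp://example.com/file.txt")); // false
	}
}
```

With a check like this, main could call getText only when isFetchableLink(link) returns true and print a short message otherwise, instead of relying on the MalformedURLException handler.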

Maven Dependency (pom.xml)

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.dineshkrish</groupId>
	<artifactId>JsoupExample</artifactId>
	<version>0.0.1-SNAPSHOT</version>

	<dependencies>

		<dependency>
			<groupId>org.jsoup</groupId>
			<artifactId>jsoup</artifactId>
			<version>1.9.2</version>
		</dependency>

	</dependencies>

</project>

Output

Google Search Images Maps Play YouTube News Gmail Drive More….
