Use DOM methods to navigate a document
Problem
You have an HTML document that you want to extract data from, and you know its general structure.
Solution
Use the DOM-like methods available after parsing HTML into a Document.
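import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Parse the file as UTF-8; the base URI is used to resolve relative links
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
// getElementById returns null if no element has this id
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}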
Description
Elements provide a range of DOM-like methods to find elements, and to extract and manipulate their data. The DOM getters are contextual: called on a parent Document they find matching elements under the document; called on a child element they find elements under that child. In this way you can narrow in on the data you want.
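The contextual behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original recipe: the class name, helper methods, and sample markup are hypothetical, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContextualGetters {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
        "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
      + "<div id=\"content\"><a href=\"/one\">One</a><a href=\"/two\">Two</a></div>";

    // Called on the Document: counts <a> elements anywhere in the document.
    static int linksInDocument(String html) {
        Document doc = Jsoup.parse(html);
        return doc.getElementsByTag("a").size();
    }

    // Called on a child element: counts only the <a> elements under the
    // element with the given id. Returns 0 if no element has that id.
    static int linksUnder(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element scope = doc.getElementById(id);
        return scope == null ? 0 : scope.getElementsByTag("a").size();
    }

    public static void main(String[] args) {
        System.out.println(linksInDocument(HTML));        // 3
        System.out.println(linksUnder(HTML, "content"));  // 2
    }
}
```

The same getter gives a different result depending on the element it is called on, which is what lets you narrow the search step by step.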
Finding elements
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
- Graph: parent(), children(), child(int index)
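As a rough sketch of how the finder, sibling, and graph methods listed above fit together, the following example locates one list item and then walks to its neighbours and parent. The class name, helper methods, and sample markup are hypothetical and only serve the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FindingElements {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
        "<ul id=\"menu\">"
      + "<li class=\"item\">First</li>"
      + "<li class=\"item current\" data-pos=\"2\">Second</li>"
      + "<li class=\"item\">Third</li>"
      + "</ul>";

    // Finds the "current" item, then navigates from it. Values are joined
    // with '|' so the result is easy to inspect.
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        Element menu = doc.getElementById("menu");
        Element current = menu.getElementsByClass("current").first();
        return String.join("|",
            current.text(),                                    // the item itself
            current.previousElementSibling().text(),           // sibling before it
            current.nextElementSibling().text(),               // sibling after it
            current.parent().id(),                             // back up to the <ul>
            String.valueOf(current.siblingElements().size()),  // siblings, excluding itself
            menu.child(0).text());                             // first child by index
    }

    // Counts elements that carry the given attribute, whatever its value.
    static int countWithAttribute(String html, String key) {
        return Jsoup.parse(html).getElementsByAttribute(key).size();
    }

    public static void main(String[] args) {
        System.out.println(summarize(HTML));  // Second|First|Third|menu|2|First
        System.out.println(countWithAttribute(HTML, "data-pos"));  // 1
    }
}
```

Note that nextElementSibling() and previousElementSibling() return null at the ends of the sibling list, so real code should check for that before calling text().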
Element data
- attr(String key) to get and attr(String key, String value) to set attributes
- attributes() to get all attributes
- id(), className() and classNames()
- text() to get and text(String value) to set the text content
- html() to get and html(String value) to set the inner HTML content
- outerHtml() to get the outer HTML value
- data() to get data content (e.g. of script and style tags)
- tag() and tagName()
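The accessors listed above might be exercised like this. It is a sketch under stated assumptions: the class name, helper methods, and sample markup are hypothetical, and the exact whitespace produced by html() and outerHtml() can depend on jsoup's output settings, so the example does not rely on it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
        "<div id=\"content\">"
      + "<a id=\"docs\" class=\"ext primary\" href=\"/docs\">Read the <b>docs</b></a>"
      + "<script>var x = 1;</script>"
      + "</div>";

    // Parses the markup and returns the link with id "docs".
    static Element link(String html) {
        return Jsoup.parse(html).getElementById("docs");
    }

    // Reads attribute, id, class attribute, tag name and text, joined with '|'.
    static String readLink(String html) {
        Element a = link(html);
        return String.join("|", a.attr("href"), a.id(), a.className(), a.tagName(), a.text());
    }

    // Inner HTML keeps the child markup that text() strips out.
    static String innerHtml(String html) {
        return link(html).html();
    }

    // data() returns the content of script and style elements, which text() does not.
    static String scriptData(String html) {
        return Jsoup.parse(html).getElementsByTag("script").first().data();
    }

    // Uses the setter overloads, then reads the values back.
    static String rewriteLink(String html) {
        Element a = link(html);
        a.attr("href", "/manual");  // set an attribute
        a.text("Manual");           // replace the children with plain text
        return a.attr("href") + "|" + a.text() + "|" + a.classNames().size();
    }

    public static void main(String[] args) {
        System.out.println(readLink(HTML));     // /docs|docs|ext primary|a|Read the docs
        System.out.println(innerHtml(HTML));
        System.out.println(link(HTML).outerHtml());
        System.out.println(scriptData(HTML));   // var x = 1;
        System.out.println(rewriteLink(HTML));  // /manual|Manual|2
    }
}
```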
Manipulating HTML and text