Getting Started With HTML Scraping in Java
https://www.czetsuyatech.com/2012/07/java-html-scraping.html
One rainy Sunday afternoon, since I couldn't go out anywhere, I decided to create an organized Excel file (for now) listing the birds commonly found in the Philippines. I found a good site to start with, http://www.birding2asia.com/tours/reports/PhilFeb2010_list.html, but I didn't want to copy each detail into an Excel document by hand because that would take too long. So I searched the internet for HTML scraping tools. I've used Html Agility Pack for .NET before, and I'd still use it if I were working with .NET again, but today I wanted to do it in Java.
Here's a list of the most used HTML scrapers for different programming languages: http://stackoverflow.com/questions/2861/options-for-html-scraping
I chose jsoup for Java since it's the simplest to implement and has minimal dependencies compared to HtmlUnit and the rest.
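To give a feel for the jsoup API before the full program, here is a minimal sketch that parses an HTML string and reads elements through CSS selectors. The markup is made up for illustration and only imitates the span classes (comname, sciname, endemic) used by the implementation further down; the class JsoupSelectSketch and the helper firstText are hypothetical names, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupSelectSketch {

    // Returns the text of the first element matching the CSS query,
    // or an empty string when nothing matches.
    static String firstText(Document doc, String cssQuery) {
        Element match = doc.select(cssQuery).first(); // first() is null for an empty result
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        // Made-up sample markup, not copied from the real page
        String html = "<p class=\"MsoNormal\">"
                + "<span class=\"comname\">Philippine Eagle</span> "
                + "<span class=\"sciname\">Pithecophaga jefferyi</span></p>";

        Document doc = Jsoup.parse(html);

        System.out.println(firstText(doc, "span.comname"));
        System.out.println(firstText(doc, "span.sciname"));
        // No span with class "endemic" in the sample, so this is empty
        System.out.println(firstText(doc, "span.endemic").isEmpty());
    }
}
```

The selector span.comname is shorthand for span[class=comname] when the element carries a single class, so either form should work for markup like this.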
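Here's how I wrote my implementation. It reads a local copy of the page saved as input.html and writes one comma-separated line per bird to out.txt: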
package org.ipiel.ipielHtmlParser;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Reads a saved copy of the bird list and writes one comma-separated line per bird.
 *
 * @author Edward P. Legaspi
 * @since Jul 29, 2012
 **/
public class JsoupParserImpl {

    public static void main(String[] args) {
        try {
            new JsoupParserImpl();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public JsoupParserImpl() throws IOException {
        // input.html is a local copy of the page in the working directory
        File input = new File("input.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // Each bird spans three consecutive paragraphs: names, location, spacer
        Elements paragraphs = doc.select("p[class=MsoNormal]");
        Iterator<Element> ite = paragraphs.iterator();

        // try-with-resources closes the writer even if parsing fails midway
        try (PrintWriter pw = new PrintWriter("out.txt", "UTF-8")) {
            while (ite.hasNext()) {
                Element bird = ite.next(); // common name + scientific name
                Element birdName = bird.select("span[class=comname]").first();
                Element sciName = bird.select("span[class=sciname]").first();
                Element endemic = bird.select("span[class=endemic]").first(); // null if not endemic

                if (birdName == null || sciName == null || !ite.hasNext()) {
                    continue; // not a bird entry, or the list ended early
                }

                Element location = ite.next(); // where found
                String out = birdName.text().trim() + "," + sciName.text() + ","
                        + ((endemic != null) ? endemic.text() : "") + "," + location.text();
                System.out.println(out);
                pw.write(out);
                pw.write("\n");

                if (ite.hasNext()) {
                    ite.next(); // spacer
                }
            }
        }
    }
}
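One caveat if out.txt is meant to be opened in Excel: the line above is built by joining fields with bare commas, so a location that itself contains a comma would spill into extra columns. A possible workaround, sketched here under the assumption that the usual CSV quoting convention is acceptable (wrap the field in double quotes and double any embedded quotes), is to escape each field before joining. The class CsvLineSketch and its methods are hypothetical helpers, and the sample values are made up rather than taken from the page.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineSketch {

    // Quotes a field when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled. A null field becomes an empty string.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Escapes every field and joins them with commas into one CSV line.
    static String toCsvLine(String... fields) {
        List<String> escaped = new ArrayList<>();
        for (String field : fields) {
            escaped.add(escape(field));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Sample values for illustration only; the empty string stands for "not endemic"
        System.out.println(toCsvLine(
                "Philippine Eagle", "Pithecophaga jefferyi", "", "Mindanao, Luzon"));
    }
}
```

In the parser, the string concatenation could then be swapped for a call like toCsvLine(birdName.text().trim(), sciName.text(), endemic != null ? endemic.text() : "", location.text()).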



