Getting Started With HTML Scraping in Java
https://www.czetsuyatech.com/2012/07/java-html-scraping.html
One rainy Sunday afternoon, since I couldn't go out anywhere, I decided to create an organized Excel file (for now) listing the birds commonly found in the Philippines. I found a good site to start with, http://www.birding2asia.com/tours/reports/PhilFeb2010_list.html, but I didn't want to copy each detail into an Excel document by hand because that would take too long. So I searched the internet for HTML scraping tools. I've used Html Agility Pack for .NET before, and I'd use it again if I were working with .NET, but I wanted to do it in Java today.
Here's a list of the most used HTML scrapers for different programming languages: http://stackoverflow.com/questions/2861/options-for-html-scraping
I chose jsoup for Java since it's the simplest to implement, with minimal dependencies compared to HtmlUnit and the rest.
Here's how I've written my implementation:
package org.ipiel.ipielHtmlParser;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * @author Edward P. Legaspi
 * @since Jul 29, 2012
 **/
public class JsoupParserImpl {

    public static void main(String[] args) {
        try {
            new JsoupParserImpl();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public JsoupParserImpl() throws IOException {
        // Parse the HTML file saved locally as input.html
        File input = new File("input.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // Each bird spans three consecutive paragraphs: names, location, spacer
        Elements birdNames = doc.select("p[class=MsoNormal]");
        Iterator<Element> ite = birdNames.iterator();

        // try-with-resources closes the writer even if an exception is thrown
        try (PrintWriter pw = new PrintWriter(new FileOutputStream("out.txt"))) {
            while (ite.hasNext()) {
                // common name + scientific name
                // (assumes every name paragraph contains both spans)
                Element bird = ite.next();
                Element birdName = bird.select("span[class=comname]").first();
                Element sciName = bird.select("span[class=sciname]").first();
                // first() returns null when the bird is not marked as endemic
                Element endemic = bird.select("span[class=endemic]").first();

                if (!ite.hasNext()) {
                    break; // incomplete record at the end of the page
                }
                Element location = ite.next(); // where found

                String out = birdName.text().trim() + "," + sciName.text() + ","
                        + ((endemic != null) ? endemic.text() : "") + "," + location.text();
                System.out.println(out);
                pw.write(out);
                pw.write("\n");

                if (ite.hasNext()) {
                    ite.next(); // spacer
                }
            }
        }
    }
}
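One thing to watch for with the comma-separated output above: if a field such as the location text itself contains a comma, Excel will split it into extra columns. Below is a minimal sketch of one way to handle that, assuming the usual CSV convention of wrapping such fields in double quotes and doubling any embedded quotes. The CsvLine class and its escape and join methods are hypothetical helpers I made up for illustration, not part of jsoup or of the implementation above, and the sample values are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of jsoup or the original code. */
class CsvLine {

    /** Quotes a field only if it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Double any embedded quotes, then wrap the whole field in quotes
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Escapes each field and joins them with commas into one CSV line. */
    static String join(List<String> fields) {
        return fields.stream().map(CsvLine::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Made-up sample values: common name, scientific name, endemic flag, location
        System.out.println(join(Arrays.asList(
                "Example Bird", "Avis exemplum", "", "Site A, Site B")));
        // prints: Example Bird,Avis exemplum,,"Site A, Site B"
    }
}
```

In the scraper, the string concatenation that builds out could then be swapped for something like CsvLine.join(Arrays.asList(birdName.text().trim(), sciName.text(), ...)), so a location with commas stays in a single column.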