Java - Searching For Data within a Website

Developer https://www.devze.com 2023-01-13 15:39 Source: Internet

I'm new to Java and having some problems.

The main idea is to connect to a website, collect information from it, and store it in an array.

What I want the program to do is search the website, find a key word, and store what comes after the key word.

On the front page of daniweb, along the bottom, there is a section called "Tag Cloud", which is filled with tags (short words):

Tag Cloud: "I want to store what is written here"

My idea is to first read in the HTML of the website, then search that file for the key word followed by the text using Scanner and StringTokenizer, and then store the result as an array.
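The search step described above can be sketched on an in-memory string, without the intermediate file. This is only a minimal illustration: the class name `KeywordExtract`, the helper `wordsAfter`, and the sample text are all made up for the example, and it assumes the words sit on the same line as the key word with no markup in between.

```java
import java.util.Arrays;

class KeywordExtract {

    // Returns the whitespace-separated words that follow the first occurrence
    // of keyword on the same line, or an empty array if keyword is absent.
    static String[] wordsAfter(String text, String keyword) {
        int start = text.indexOf(keyword);
        if (start < 0) {
            return new String[0];
        }
        int from = start + keyword.length();
        int lineEnd = text.indexOf('\n', from);
        String rest = (lineEnd < 0 ? text.substring(from)
                                   : text.substring(from, lineEnd)).trim();
        return rest.isEmpty() ? new String[0] : rest.split("\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical, tag-free text standing in for the downloaded page.
        String page = "Forums\nTag Cloud: java php mysql\nFooter";
        // prints [java, php, mysql]
        System.out.println(Arrays.toString(wordsAfter(page, "Tag Cloud:")));
    }
}
```

On a real page the words are wrapped in HTML tags and attributes, so a plain text search like this would also pick up markup; that is the main weakness of the approach.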

Is there a better or easier way?

Where do you suggest I look for some examples?

Here is what I have so far:

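import java.io.*;
import java.net.*;
import java.util.Scanner;
import java.util.StringTokenizer;

public class URLReader {

    public static void main(String[] args) throws Exception {

        URL dweb = new URL("http://www.daniweb.com/");
        URLConnection dw = dweb.openConnection();
        // read from the connection opened above (the original referenced an undefined "hc")
        BufferedReader in = new BufferedReader(new InputStreamReader(dw.getInputStream()));
        System.out.println("connected to daniweb");
        String inputLine;

        try {
            // copy the page's HTML to OutFile.txt line by line
            PrintStream out = new PrintStream(new FileOutputStream("OutFile.txt"));
            while ((inputLine = in.readLine()) != null) {
                out.println(inputLine);
            }
            in.close();
            out.close();
            System.out.println("printed text to outfile");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        try {
            Scanner scan = new Scanner(new File("OutFile.txt"));
            // txtSearch and SearchWin are not defined in this snippet, so the
            // key word is read from the command line here instead
            String search = args.length > 0 ? args[0] : "Cloud";
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                //still working
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();
                    if (word.equals(search)) { // equals() compares contents; == compares references
                        // TODO: store the tokens that follow the key word
                    } else {
                        // not the key word; keep scanning
                    }
                }
            }
            scan.close();
        } catch (IOException iox) {
            iox.printStackTrace();
        }
    }
}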

Any help at all would be very much appreciated!


I recommend jsoup. It will retrieve and parse the page for you.

On daniweb, each tag cloud link has the CSS class tagcloudlink. So you just need to tell jsoup to extract all text in tags that have the class tagcloudlink.

This is off the top of my head plus some help from the jsoup site; I haven't tested it but it should get you started:

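// Imports needed: java.util.ArrayList, java.util.List, org.jsoup.Jsoup,
// org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
List<String> tags = new ArrayList<String>();
// Fetch and parse the page (get() throws IOException on failure)
Document doc = Jsoup.connect("http://daniweb.com/").get();
// Select every <a> element that has the CSS class tagcloudlink
Elements taglinks = doc.select("a.tagcloudlink");
for (Element link : taglinks) {
    tags.add(link.text());
}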
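
For comparison, here is a rough dependency-free sketch of the same selection using only java.util.regex. It is not the jsoup approach above and is far more brittle: it assumes the class attribute is exactly `class="tagcloudlink"` in double quotes and that the link text contains no nested tags. The class name `TagCloudRegex`, the helper `extractTags`, and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCloudRegex {

    // Matches <a ... class="tagcloudlink" ...>text</a> and captures the text.
    private static final Pattern TAG_LINK = Pattern.compile(
            "<a\\b[^>]*\\bclass=\"tagcloudlink\"[^>]*>([^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    // Collects the text of every matching link, in document order.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG_LINK.matcher(html);
        while (m.find()) {
            tags.add(m.group(1).trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical markup; the real page's structure may differ.
        String html = "<div><a class=\"tagcloudlink\" href=\"/tags/java\">java</a> "
                + "<a href=\"/about\">about</a> "
                + "<a href=\"/tags/php\" class=\"tagcloudlink\">php</a></div>";
        // prints [java, php]
        System.out.println(extractTags(html));
    }
}
```

A real HTML parser copes with single-quoted attributes, multiple classes, and nested elements, which is why it is usually the safer choice for anything beyond a quick experiment.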


You could use HTML Parser for this. Another one I've used a lot and like is Jericho HTML Parser.
