Little bit of a beginner here, working on a personal project to scrape my school's course offerings into an easy-to-read tabular format, but I am having trouble with the initial step of scraping the data from the site.
I just added the jsoup library to my project in Eclipse, and am now having trouble initializing the connection while following the jsoup documentation.
In the end, my goal is to grab each class name / time / description, but for now I want to just grab the name. The HTML of the source website appears like this:
<td class='CourseNum'><img src='images/minus.gif' class='ICS3330 SW' onclick="toggledetails('CS3330')
My first guess was to call getElementsByTag("td"), and then query these elements for the value of the onclick attribute or of the 'class' attribute, cleaning the latter up by removing the initial "I" and the suffix " SW", leaving behind the name "CS3330".
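The string cleanup on its own could be sketched roughly like this. It is plain Java with no jsoup involved, and the class name CourseIdCleaner and the method clean are made up for illustration. It assumes the class value always has the form seen in the snippet above ("I" + course id + " SW").

```java
public class CourseIdCleaner {

    // Hypothetical helper: strips a leading "I" and a trailing " SW"
    // from the img element's class value, if they are present.
    static String clean(String imgClass) {
        String s = imgClass.trim();
        if (s.startsWith("I")) {
            s = s.substring(1);
        }
        if (s.endsWith(" SW")) {
            s = s.substring(0, s.length() - " SW".length());
        }
        return s;
    }

    public static void main(String[] args) {
        // Value taken from the HTML snippet in the question
        System.out.println(clean("ICS3330 SW"));
    }
}
```

If the prefix or suffix is missing, the value is returned unchanged apart from trimming, which seemed the safest default for a sketch.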
Now onto the actual implementation:
Document doc = Jsoup.parse("UTF-8", "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci").get();
Elements td = doc.getElementsByTag("td");
At this point, I am already running into problems (even though I am not straying far from the examples provided in the documentation) and would appreciate some guidance on getting my code to function!
edit: GOT IT! Thank you all!
According to the documentation, you should be doing:
Document doc = Jsoup.connect(url).get();
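The parse() method is for content you already have, such as a File or a String of HTML; it does not fetch a page from a URL given as a string.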
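A minimal sketch of the difference, assuming jsoup is on the classpath. The class name FetchExample, the two helper methods, and the inline HTML are made up for illustration; only Jsoup.connect(url).get() and Jsoup.parse(html) are jsoup calls.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchExample {

    // Downloads the page at the given URL and parses it.
    // Throws IOException if the request fails.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    // Parses HTML that is already in memory; no network access.
    static Document parseHtml(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) throws IOException {
        // With a URL argument, fetch it; otherwise parse a made-up snippet.
        Document doc = (args.length > 0)
                ? fetch(args[0])
                : parseHtml("<html><head><title>Demo</title></head><body></body></html>");
        System.out.println(doc.title());
    }
}
```

Passing the course page URL from the question as the first argument would exercise the connect(...) path.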
I just downloaded JSoup and tried it out on your school's website and got this output:
Unit: Computer Science
CS 1010: Introduction to Information Technology
CS 1110: Introduction to Programming
CS 1111: Introduction to Programming
CS 1112: Introduction to Programming
CS 1120: From Ada and Euclid to Quantum Computing and the World Wide Web
CS 2102: Discrete Mathematics I
CS 2110: Software Development Methods
CS 2150: Program and Data Representation
CS 2220: Engineering Software
CS 2330: Digital Logic Design
CS 2501: Special Topics in Computer Science
CS 3102: Theory of Computation
CS 3330: Computer Architecture
CS 4102: Algorithms
CS 4240: Principles of Software Design
CS 4414: Operating Systems
CS 4444: Introduction to Parallel Computing
CS 4457: Computer Networks
CS 4501: Special Topics in Computer Science
CS 4753: Electronic Commerce Technologies
CS 4810: Introduction to Computer Graphics
CS 4993: Independent Study
CS 4998: Distinguished BA Majors Research
CS 6161: Design and Analysis of Algorithms
CS 6190: Computer Science Perspectives
CS 6354: Computer Architecture
CS 6444: Introduction to Parallel Computing
CS 6501: Special Topics in Computer Science
CS 6610: Programming Languages
CS 7457: Computer Networks
CS 7993: Independent Study
CS 7995: Supervised Project Research
CS 8501: Special Topics in Computer Science
CS 8524: Topics in Software Engineering
CS 8897: Graduate Teaching Instruction
CS 8999: Thesis
CS 9999: Dissertation
Too flippin' cool! Vlad is right though; use the connect(...) method. +1 to Vlad.
Other suggestions and hints:
These are the constants that I used in my little program:
private static final String URL = "http://rabi.phys.virginia.edu/mySIS/CS2/" +
"page.php?Semester=1118&Type=Group&Group=CompSci";
private static final String TD_TAG = "td";
private static final String CLASS_ATTRIB = "class";
private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";
And these are the variables I used inside the scraping method:
String unitName = "";
List<String> courseNumbNameList = new ArrayList<String>();
String courseNumbName = "";
Edit 1
Based on your recent comments, I think that you're over-thinking things a bit. What worked well for me is this simple algorithm:
- Create the 3 variables I have listed above
- Get my document as Vlad recommends.
- Create a td Elements variable and assign to it all elements that have a td tag.
- Use a for loop with int i going from 0 while i < td.size(), and get each Element, element, using td.get(i).
- Inside the loop check the element's class attribute.
- If the attribute String equals the CLASS_ATTRIB_UNIT_NAME String (see above), get the element's text and use it to set the unitName variable.
- If the attribute String equals CLASS_ATTRIB_COURSE_NUM set the courseNumbName to the element's text.
- If the attribute String equals CLASS_ATTRIB_COURSE_NAME, append the element's text to the courseNumbName String, add the String to the array list, and reset courseNumbName to "".
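The steps above could be sketched roughly as follows. This is not the exact program that produced the output earlier, just one way to put the list together. The class name CourseScraper, the scrapeHtml helper, the ": " separator, and the sample HTML in main are assumptions for illustration. The sample is wrapped in a table because an HTML parser drops td tags that appear outside one.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CourseScraper {
    private static final String TD_TAG = "td";
    private static final String CLASS_ATTRIB = "class";
    private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
    private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
    private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

    // Walks every td and collects "number: name" strings, unit line first.
    static List<String> scrape(Document doc) {
        String unitName = "";
        List<String> courseNumbNameList = new ArrayList<String>();
        String courseNumbName = "";

        Elements td = doc.getElementsByTag(TD_TAG);
        for (int i = 0; i < td.size(); i++) {
            Element element = td.get(i);
            String classAttrib = element.attr(CLASS_ATTRIB);
            if (classAttrib.equals(CLASS_ATTRIB_UNIT_NAME)) {
                unitName = element.text();
            } else if (classAttrib.equals(CLASS_ATTRIB_COURSE_NUM)) {
                courseNumbName = element.text();
            } else if (classAttrib.equals(CLASS_ATTRIB_COURSE_NAME)) {
                // ": " is an assumed separator, chosen to match the output above
                courseNumbName += ": " + element.text();
                courseNumbNameList.add(courseNumbName);
                courseNumbName = "";
            }
        }

        List<String> lines = new ArrayList<String>();
        lines.add("Unit: " + unitName);
        lines.addAll(courseNumbNameList);
        return lines;
    }

    // Convenience wrapper for HTML already in memory (hypothetical helper).
    static List<String> scrapeHtml(String html) {
        return scrape(Jsoup.parse(html));
    }

    public static void main(String[] args) {
        // Made-up sample; for the real page use Jsoup.connect(url).get()
        String html = "<table>"
                + "<tr><td class='UnitName'>Computer Science</td></tr>"
                + "<tr><td class='CourseNum'>CS 3330</td>"
                + "<td class='CourseName'>Computer Architecture</td></tr>"
                + "</table>";
        for (String line : scrapeHtml(html)) {
            System.out.println(line);
        }
    }
}
```

To run it against the live site, pass the Document returned by Jsoup.connect(URL).get() to scrape(...) instead of using the sample string.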