Need help parsing a File in Java

Developer https://www.devze.com 2023-02-22 17:53 Source: Internet
I am currently working on a small data structures project. I am trying to get data on universities across the country and then do some data manipulation with it. I found the data here: http://archive.ics.uci.edu/ml/machine-learning-databases/university/university.data

However, the problem with this data is (and I quote from the website): "It is a LISP readable file with a few relevant functions at the end of the data file." I plan on taking this data and saving it as a .txt file.

The file looks a bit like:

(def-instance Adelphi
      (state newyork)
      (control private)
      (no-of-students thous:5-10)
      (male:female ratio:30:70)
      (student:faculty ratio:15:1)
      (sat verbal 500)
      (sat math 475)
      (expenses thous$:7-10)
      (percent-financial-aid 60)
      (no-applicants thous:4-7)
      (percent-admittance 70)
      (percent-enrolled 40)
      (academics scale:1-5 2)
      (social scale:1-5 2)
      (quality-of-life scale:1-5 2)
      (academic-emphasis business-administration)
      (academic-emphasis biology))
(def-instance Arizona-State
      (state arizona)
      (control state)
      (no-of-students thous:20+)
      (male:female ratio:50:50)
      (student:faculty ratio:20:1)
      (sat verbal 450)
      (sat math 500)
      (expenses thous$:4-7)
      (percent-financial-aid 50)
      (no-applicants thous:17+)
      (percent-admittance 80)
      (percent-enrolled 60)
      (academics scale:1-5 3)
      (social scale:1-5 4)
      (quality-of-life scale:1-5 5)
      (academic-emphasis business-education)
      (academic-emphasis engineering)
      (academic-emphasis accounting)
      (academic-emphasis fine-arts))

      ......

The end of the file:

(dfx def-instance (l)
  (tlet (instance (car l) f-list (cdr l))
    (cond ((or (null instance) (consp instance))
           (msg t instance " is not a valid instance name (must be an atom)"))
          (t (make:event instance)
             (push instance !instances)
             (:= (get instance 'features)
                 (tfor (f in f-list)
                   (when (cond ((or (atom f) (null (cdr f)))
                                (msg t f " is not a valid feature "
                                       "(must be a 2 or 3 item list)") nil)
                               ((consp (car f))
                                (msg t (car f) " is not a valid feature "
                                     "name (must be an atom)") nil)
                               ((and (cddr f) (consp (cadr f)))
                                (msg t (cadr f) " is not a valid feature "
                                     "role (must be an atom)") nil)
                               (t t)))
                   (save (cond ((equal (length f) 3)
                                (make:feature (car f) (cadr f) (caddr f)))
                               (t (make:feature (car f) 'value (cadr f)))))))
             instance))))

(set-if !instances nil)



(dex run-uniq-colleges (l n)
  (tfor (sc in l)
    (when (cond ((ge (length *events-added*) n))
                ((not (get sc 'duplicate))
                 (run-instance sc)
~                 (remprop sc 'features)
                 nil)
                (t (remprop sc 'features) nil)))
    (stop)))

The data I am mostly interested in are the school name, the number of students, and the academic emphases.

Any help is greatly appreciated.


You can write or use a Lisp file parser, or you can ignore the language the file is written in and focus on the data. You mentioned you need:

  • School name
  • Number of students
  • Academic emphases

You can grep the relevant keywords (def-instance, no-of-students, academic-emphasis), which would leave you with (based on your example):

(def-instance Adelphi
      (no-of-students thous:5-10)
      (academic-emphasis business-administration)
      (academic-emphasis biology))
(def-instance Arizona-State
      (no-of-students thous:20+)
      (academic-emphasis business-education)
      (academic-emphasis engineering)
      (academic-emphasis accounting)
      (academic-emphasis fine-arts))

This simplifies writing a specific parser: def-instance is followed by the name, and every academic-emphasis and no-of-students line before the next def-instance refers to that name.
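The keyword-filtering idea can be sketched in Java roughly as follows. This is only a minimal sketch: the class name UniversityLineFilter and the School holder are made up for illustration, and it assumes every feature sits on its own line, as in the excerpt from the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class UniversityLineFilter {

    /** Hypothetical holder for the three fields of interest. */
    static final class School {
        final String name;
        String students;                                 // raw value, e.g. "thous:5-10"
        final List<String> emphases = new ArrayList<>();

        School(String name) {
            this.name = name;
        }
    }

    /** Scans the lines once; each def-instance line opens a new School. */
    static List<School> parse(List<String> lines) {
        List<School> schools = new ArrayList<>();
        School current = null;
        for (String raw : lines) {
            // Drop parentheses so only whitespace-separated tokens remain.
            String line = raw.replace('(', ' ').replace(')', ' ').trim();
            String[] tokens = line.split("\\s+");
            if (tokens.length < 2) {
                continue;                                // blank or one-word line
            }
            switch (tokens[0]) {
                case "def-instance":
                    current = new School(tokens[1]);
                    schools.add(current);
                    break;
                case "no-of-students":
                    if (current != null) {
                        current.students = tokens[1];
                    }
                    break;
                case "academic-emphasis":
                    if (current != null) {
                        current.emphases.add(tokens[1]);
                    }
                    break;
                default:
                    break;                               // every other line is ignored
            }
        }
        return schools;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java UniversityLineFilter <path-to-data-file>");
            return;
        }
        for (School s : parse(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(s.name + " | " + s.students + " | " + s.emphases);
        }
    }
}
```

Because only lines that start with one of the three keywords are acted on, the Lisp functions at the end of the file are skipped without any special handling. The students value is kept as the raw string (for example thous:5-10); turning it into a numeric range is left out here.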


Have you thought about running that Lisp file in a Lisp interpreter for the Java VM?

As an example, Armed Bear Common Lisp, which is compatible with JSR-223, would happily parse your file.

And using JSR-223, you'll be able to access script-defined variables (like Adelphi and the other ones), as the examples show.

EDIT: Following a request in the comments, here is some additional explanation (although it seems quite straightforward to me).

So, suppose you have Armed Bear Common Lisp on your classpath, and file is the absolute file name of your script (this example is heavily inspired by/borrowed from the JSR-223 example).

First, register the script engine:

ScriptEngineManager scriptManager = new ScriptEngineManager();
scriptManager.registerEngineExtension("lisp", new AbclScriptEngineFactory());
ScriptEngine lispEngine = scriptManager.getEngineByExtension("lisp");

Then, load your script into the script engine:

Object eval = lispEngine.eval(new FileReader(file));

Now, armed with a debugger, go see what's in there (I'm not courageous enough to install the whole environment to do the job for you).


If you're going to parse lisp, you need to be aware of 'the stack'.

When you encounter a (, you push onto the stack. You're now in a new scope, one level above where you were before.

Similarly, when you encounter a ), you pop off the stack: you finish that layer and go down a level.

So in this case, you're in the empty state to start. The first thing you encounter is the (, so now you're in the "define" state. (I just made that up. Call it whatever you want.) You encounter the def-instance token, and then the name of the university. You keep reading and you encounter another (. (Ignore whitespace, just parse tokens.) This puts you in the "properties" state. (I made that up too.) Since you're jumping from define to properties, it's okay to make your object now. Something like UnivData data = new UnivData(parsedToken) (where parsedToken evaluates to "Adelphi").

Okay, back to properties: you've read that first (, then you read "state" and "newyork", and then a ). So, you can assign the state variable of the current UnivData to newyork.

You repeat this behavior for all the properties, but then you encounter an additional ) after academic-emphasis. This is your cue to close the current object and start looking for another one.
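The push/pop behaviour described above could look roughly like this in Java. It is a sketch under assumptions, not a full Lisp reader: the class and method names (SExprStackParser, tokenize, parse, instanceNames) are invented for illustration, atoms are kept as plain strings, and a double-quoted string is treated as a single atom so that parentheses inside the messages at the end of the file do not disturb the nesting.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SExprStackParser {

    /** Splits text into "(", ")" and atom tokens; a double-quoted string is one atom. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(' || c == ')') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '"') {
                int end = text.indexOf('"', i + 1);
                if (end < 0) {
                    end = text.length() - 1;             // unterminated string: take the rest
                }
                tokens.add(text.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))
                        && text.charAt(i) != '(' && text.charAt(i) != ')') {
                    i++;
                }
                tokens.add(text.substring(start, i));
            }
        }
        return tokens;
    }

    /** Returns each top-level form as a nested list of String atoms and List children. */
    static List<List<Object>> parse(String text) {
        List<List<Object>> topLevel = new ArrayList<>();
        Deque<List<Object>> stack = new ArrayDeque<>();
        for (String token : tokenize(text)) {
            if (token.equals("(")) {
                stack.push(new ArrayList<>());           // enter a new scope
            } else if (token.equals(")")) {
                if (stack.isEmpty()) {
                    throw new IllegalArgumentException("Unbalanced ')'");
                }
                List<Object> finished = stack.pop();     // leave the scope
                if (stack.isEmpty()) {
                    topLevel.add(finished);              // a complete top-level form
                } else {
                    stack.peek().add(finished);          // attach to the enclosing list
                }
            } else if (!stack.isEmpty()) {
                stack.peek().add(token);                 // atom inside the current scope
            }
            // Atoms outside any parentheses are skipped.
        }
        return topLevel;
    }

    /** Lists the names of all (def-instance NAME ...) forms. */
    static List<String> instanceNames(String text) {
        List<String> names = new ArrayList<>();
        for (List<Object> form : parse(text)) {
            if (form.size() >= 2 && "def-instance".equals(form.get(0))
                    && form.get(1) instanceof String) {
                names.add((String) form.get(1));
            }
        }
        return names;
    }
}
```

Each top-level list whose first atom is def-instance is one university; its remaining elements are the property lists such as [state, newyork]. The function definitions at the end of the file also parse as top-level lists, but they start with dfx or dex, so they are easy to skip.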

At first, I was tempted to say use a Map<String, String>. The fact that there are multiple academic-emphasis tokens indicates you should use a better data structure, perhaps a Map<String, List<String>>. It may even be better to roll your own Property class that holds a String and, if it acquires multiple values, switches to a list of strings.
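One way the multi-valued map could be wrapped, as a rough sketch: UnivData is the class name used earlier in this answer, while its methods (add, get, getFirst) are invented here for illustration. Every property is stored as a list, so a repeated key such as academic-emphasis simply accumulates values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UnivData {
    private final String name;
    // Insertion-ordered, so properties come back in the order they were read.
    private final Map<String, List<String>> properties = new LinkedHashMap<>();

    public UnivData(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    /** Appends a value; a repeated key keeps every value it was given. */
    public void add(String key, String value) {
        properties.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** All values recorded for the key, or an empty list if there are none. */
    public List<String> get(String key) {
        return properties.getOrDefault(key, Collections.emptyList());
    }

    /** First value for the key, or null; convenient for single-valued properties. */
    public String getFirst(String key) {
        List<String> values = get(key);
        return values.isEmpty() ? null : values.get(0);
    }
}
```

With this shape, data.add("academic-emphasis", "biology") can be called once per token, and data.getFirst("no-of-students") reads a single-valued property without the caller dealing with lists.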
