Processing (too) many XML files (with TagSoup)

Developer https://www.devze.com 2023-03-05 03:40 Source: Internet

I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href).

To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a TagSoup-based parser, and then formatting and outputting the resulting list.

This works for a subset of the files, but eventually fails with an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn't so surprising: I'm using mapM parseMetaDataFile files, which opens all the handles straight away.

What I can't figure out is how to work around the problem. I've tried reading a bit about Iteratee; can I hook that up with TagSoup easily? Strict IO, the way I used it anyway (heh), froze my computer even though the files aren't very big (28 KB on average).

Any pointers would be greatly appreciated. I realize the approach of creating a big list might fail as well, but 4.5k elements isn't that long... Also, there should probably be less String and more ByteString everywhere.

Here's some code. I apologize for the naivety:

import System.FilePath
import Text.HTML.TagSoup

data MetaData = MetaData String String deriving (Show, Eq)

-- | Given HTML input, produces a MetaData structure of its essentials.
-- Should obviously account for errors, but simplified here.
readMetaData :: String -> MetaData
readMetaData input = MetaData title base
 where
  title =
    innerText $
    (takeWhile (~/= TagClose "title") . dropWhile (~/= TagOpen "title" []))
    tags
  base = fromAttrib "href" $ head $ dropWhile (~/= TagOpen "base" []) tags
  tags = parseTags input

-- | Parses MetaData from a file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = fmap readMetaData $ readFile path

-- | From a given root, gets the FilePaths of the files we are interested in.
-- Not implemented here.
getHtmlFilePaths :: FilePath -> IO [FilePath]
getHtmlFilePaths root = undefined

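main :: IO ()
main = do
  -- Placeholder root directory; substitute the real one.
  let root = "."
  -- Will call openFile for every file, which gives too many open files.
  metas <- mapM parseMetaDataFile =<< getHtmlFilePaths root

  -- Do stuff with metas (here: print them), which will cause files to actually be read.
  mapM_ print metas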

The quick and dirty solution:

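-- Needs: import System.IO and import qualified Control.Exception
parseMetaDataFile path = withFile path ReadMode $ \h -> do
    res@(MetaData x y) <- fmap readMetaData $ hGetContents h
    -- Force both fields while the handle is still open.
    _ <- Control.Exception.evaluate (length (x ++ y))
    return res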

A slightly nicer solution is to write a proper NFData instance for MetaData, instead of just using evaluate.
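
A minimal sketch of what the NFData variant could look like, assuming the deepseq package is available. The helper parseFileWith and the stand-in parser firstTwoLines are hypothetical names used only for illustration; they are not part of the question's code or of TagSoup.

```haskell
import Control.DeepSeq (NFData (..), force)
import Control.Exception (evaluate)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

data MetaData = MetaData String String deriving (Show, Eq)

-- Fully evaluating a MetaData means fully evaluating both of its fields.
instance NFData MetaData where
  rnf (MetaData title base) = rnf title `seq` rnf base

-- | Hypothetical helper: run a pure parser over a file's contents and
-- deep-evaluate the result before withFile closes the handle.
parseFileWith :: NFData a => (String -> a) -> FilePath -> IO a
parseFileWith parse path = withFile path ReadMode $ \h -> do
  contents <- hGetContents h
  evaluate (force (parse contents))

-- | Stand-in parser for demonstration only (not the TagSoup one):
-- the first line becomes the title, the second the base.
firstTwoLines :: String -> MetaData
firstTwoLines s = case lines s of
  (t : b : _) -> MetaData t b
  [t]         -> MetaData t ""
  []          -> MetaData "" ""

main :: IO ()
main = do
  paths <- getArgs
  -- Each handle is opened and closed before the next file is touched.
  metas <- mapM (parseFileWith firstTwoLines) paths
  mapM_ print metas
```

With the question's definitions in scope, parseMetaDataFile could then presumably be written as parseFileWith readMetaData, so that at most one file is open at a time.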

If you want to keep the current design, you must make sure parseMetaDataFile has consumed the entire string from readFile before returning. Once the lazily read string reaches end-of-file, the file descriptor is closed.
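
One way to consume the entire string before returning is sketched below, using only base. The name readFileStrict is a hypothetical helper, not a library function.

```haskell
import Control.Exception (evaluate)
import System.Environment (getArgs)

-- | Hypothetical helper: like readFile, but walks the whole string
-- before returning, so the lazy read hits end-of-file and the
-- handle is closed.
readFileStrict :: FilePath -> IO String
readFileStrict path = do
  contents <- readFile path
  -- Computing the length forces every character to be read.
  _ <- evaluate (length contents)
  return contents

main :: IO ()
main = do
  paths <- getArgs
  -- No handle is left open once a readFileStrict call has returned.
  contents <- mapM readFileStrict paths
  mapM_ (print . length) contents
```

With this in place, parseMetaDataFile path = fmap readMetaData (readFileStrict path) would keep the original shape. Note that each file's full contents then stay in memory as a String until the corresponding MetaData is evaluated, which may get expensive across thousands of files.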