I've been reading HTML files into MATLAB with fileread, with the aim of using regexp to extract data from them. The function returns the contents of the file as a string that preserves the 'structure' of the HTML file, including newlines. For example, reading a file with the contents below returns a string with the same structure.
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>
A Small Hello
</TITLE>
</HEAD>
</HTML>
I'm looking for a function that will return a continuous string like ...
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN"><HTML><HEAD><TITLE>A Small Hello</TITLE></HEAD><BODY><H1>Hi</H1><P>This is very minimal "hello world" HTML document.</P></BODY></HTML>
This format will assist in my regexp endeavours.
Many thanks, Bob M
A quick way to jam these things together might be to import the data then concatenate them using strcat.
The code
imported_string = importdata(filename)
imported_string_together = strcat(imported_string{:})
produces the following output
imported_string =
'<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">'
'<HTML>'
' <HEAD>'
' <TITLE>'
' A Small Hello'
' </TITLE>'
' </HEAD>'
'</HTML>'
imported_string_together =
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN"><HTML> <HEAD> <TITLE> A Small Hello </TITLE> </HEAD></HTML>
but this isn't really efficient.
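If the leading spaces on each line are unwanted, one possible variation is to trim every line before joining. This is only a sketch: the cell array htmlLines is a hand-written stand-in for what importdata returns, and strjoin is assumed to be available (R2013a or later).

```matlab
% Hypothetical stand-in for the cell array returned by importdata.
htmlLines = {'<HTML>', ' <HEAD>', ' <TITLE>', ' A Small Hello', ...
             ' </TITLE>', ' </HEAD>', '</HTML>'};

% strtrim accepts a cell array and strips leading/trailing whitespace
% from each element; strjoin then glues the pieces with no delimiter.
joined = strjoin(strtrim(htmlLines), '');
```

Here joined holds the tags back to back, with only the space inside 'A Small Hello' left.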
I find that it is sometimes useful to go back to fopen/fread/fscanf type functions to quickly load things in a predictable manner. For example, you can use the following code to create what you want without so much copying and other nonsense:
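filename = 'test.html';
maxReadSize = 2^10;                     % read at most 1024 characters
fid = fopen(filename);
mystr = fscanf(fid, '%c', maxReadSize)  % no semicolon, so the result is displayed
fclose(fid);                            % release the file handle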
to produce the following output:
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN"><HTML> <HEAD> <TITLE> A Small Hello </TITLE> </HEAD></HTML>
</HTML>
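For files that may be longer than a fixed read size, a variation on the same low-level idea is to read everything with fread and then drop the line-break characters explicitly. This is a rough sketch, not a tuned solution: the temporary file and its contents are made up purely so the snippet runs on its own.

```matlab
% Hypothetical temporary file, created only for this illustration.
filename = [tempname '.html'];
fid = fopen(filename, 'w');
fprintf(fid, '<HTML>\n  <HEAD>\n  </HEAD>\n</HTML>\n');
fclose(fid);

% Read the whole file as characters (no fixed size limit).
fid = fopen(filename, 'r');
if fid == -1
    error('Could not open %s', filename);
end
raw = fread(fid, '*char')';   % fread returns a column; transpose to a row
fclose(fid);
delete(filename);             % clean up the illustration file

% Keep every character except line feeds (10) and carriage returns (13).
oneLine = raw(raw ~= char(10) & raw ~= char(13));
```

Note that this keeps the indentation spaces; only the line breaks are removed.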
Regular expressions can do that:
str = fileread('file.html');
str = regexprep(str,'\s+',' '); %# replace each run of whitespace with a single space
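To see what the whitespace replacement does without needing a file on disk, here is a small sketch on an in-memory string. The sample HTML is built with sprintf for illustration, and the second pattern ('>\s+<') is an extra assumption on my part for removing whitespace that sits between adjacent tags.

```matlab
% Sample multi-line HTML, built in memory so the snippet is self-contained.
str = sprintf(['<HTML>\n  <HEAD>\n    <TITLE>\n      A Small Hello\n' ...
               '    </TITLE>\n  </HEAD>\n</HTML>\n']);

% Collapse every run of whitespace (spaces, tabs, newlines) to one space,
% then strip the single leading/trailing space that may remain.
collapsed = strtrim(regexprep(str, '\s+', ' '));

% Optional: also remove whitespace that sits directly between two tags.
tight = regexprep(collapsed, '>\s+<', '><');
```

After this, collapsed is a single line with one space between tokens, and tight additionally has no gap between neighbouring tags.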