I am new to programming so bear with me. I have many XML documents that look like this:
File name: PRIDE_Exp_Complete_Ac_10094.xml.gz
<ExperimentCollection version="2.1">
<Experiment>
<ExperimentAccession>1015</ExperimentAccession>
<Title>Protein complexes in Saccharomyces cerevisiae (GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<Protocol>
<ProtocolName>None</ProtocolName>
</Protocol>
<mzData version="1.05" accessionNumber="1015">
<cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
<cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
<description>
<admin>
<sampleName>GPM06600002310</sampleName>
<sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
<cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
开发者_C百科 </sampleDescription>
</admin>
</description>
<spectrumList count="0" />
</mzData>
</Experiment>
I want to take out the text in between "Title", "ProtocolName", and "SampleName" and save into a text file that has the same name as the .xml.gz. I have the following code so far (based on posts I saw on this site), but it seems not to work:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(File.open("PRIDE_Exp_Complete_Ac_10094.xml.gz"))
@ExperimentCollection = doc.css("ExperimentCollection Title").map {|node| node.children.text }
Can someone help me?
Thanks
IF you are happy with REXML, AND there's only one <Experiment>
per file, then something like the following should help ... (by the way, above text is invalid XML since no closing <ExperimentCollection>
tag)
require "rexml/document"
include REXML
xml=<<EOD
<Experiment>
<ExperimentAccession>1015</ExperimentAccession>
<Title>Protein complexes in Saccharomyces cerevisiae (GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<Protocol>
<ProtocolName>None</ProtocolName>
</Protocol>
<mzData version="1.05" accessionNumber="1015">
<cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
<cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
<description>
<admin>
<sampleName>GPM06600002310</sampleName>
<sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
<cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
</sampleDescription>
</admin>
</description>
<spectrumList count="0" />
</mzData>
</Experiment>
EOD
doc = Document.new xml
doc.elements["Experiment/Title"].text
doc.elements["Experiment/Protocol/ProtocolName"].text
doc.elements["Experiment/mzData/description/admin/sampleName"].text
精彩评论