Extracting Program Information from XML with XPath

Author:

@小白创作中心

Source

CSDN

https://blog.csdn.net/D0126_/article/details/137037676

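This article shows how to use Python's lxml library and XPath to extract specific information from a complex XML file. Working through a practical case, it extracts the program ID, title, and genre from an XML structure that contains many "ProgramInformation" branches and writes the results to an external file.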

Sample XML structure

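I need to extract some entries from an XML file. I first tried to do this in a fairly manual way, and was then advised to use an XML parser. I have tried lxml but could not get the hang of it (I have only just started learning Python).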

Here is a sample of the XML structure (the real file will contain many "ProgramInformation" branches):

<TVAMain xml:lang="NL" publisher="" publicationTime="2013-09-12T01:43:09+00:00" version="217" xmlns="urn:tva:metadata:2010" xmlns:mpeg7="urn:tva:mpeg7:2008" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:tva:metadata:2010">
  <ProgramDescription>
    <ProgramInformationTable>
        <ProgramInformation programId="crid://bds.tv/95291775">
            <BasicDescription>
                <Title xml:lang="EN" type="main">Rip Off Britain</Title>
                <Synopsis xml:lang="EN" length="short">Consumer series. The team investigates why some viewers have been hit with bills they did not expect for hundreds or even thousands of pounds, and offers some advice.</Synopsis>
                <Keyword xml:lang="EN" type="main">bills</Keyword>
                <Genre href="urn:tva:metadata:cs:UPCEventGenreCS:2009:82">
                     <Name xml:lang="EN">Economics</Name>
                </Genre>
                <Language>EN</Language>
            </BasicDescription>
        </ProgramInformation>
    </ProgramInformationTable>
  </ProgramDescription>
</TVAMain>

Initial attempt

The following code retrieves the "crid" and the "title":

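import xml.etree.ElementTree as etree

tree = etree.parse('UPC_Medium.xml')
root = tree.getroot()
# First ProgramInformation element: its attributes include programId (the crid).
print(root[0][0][0].attrib)
# First child of BasicDescription, which is the Title in the sample above.
print(root[0][0][0][0][0].text)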

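These two items always sit at the same position under each branch, so in theory they can be retrieved in this rather crude way. I also need the genre, but its position can change because the number of "Keyword" entries varies. I think XPath is the key, but I do not know how to implement it (I have already looked at many examples).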
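To illustrate why selecting by element name is more robust than positional indexing, here is a minimal sketch. It uses the standard library's ElementTree, whose find() and findall() accept a limited XPath subset, rather than lxml. The `SAMPLE` string is a trimmed copy of the structure above with a second Keyword added for illustration, and the `tva` prefix is an arbitrary alias chosen for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed inline copy of the sample; the second Keyword is added for illustration.
SAMPLE = """<TVAMain xmlns="urn:tva:metadata:2010">
  <ProgramDescription>
    <ProgramInformationTable>
      <ProgramInformation programId="crid://bds.tv/95291775">
        <BasicDescription>
          <Title type="main">Rip Off Britain</Title>
          <Keyword type="main">bills</Keyword>
          <Keyword type="main">consumer</Keyword>
          <Genre href="urn:tva:metadata:cs:UPCEventGenreCS:2009:82">
            <Name>Economics</Name>
          </Genre>
        </BasicDescription>
      </ProgramInformation>
    </ProgramInformationTable>
  </ProgramDescription>
</TVAMain>"""

# "tva" is an arbitrary local alias for the document's default namespace.
NS = {"tva": "urn:tva:metadata:2010"}

root = ET.fromstring(SAMPLE)
info = root.find(".//tva:ProgramInformation", NS)

# Selecting by name works however many Keyword siblings precede Genre.
genre_by_path = info.find("tva:BasicDescription/tva:Genre/tva:Name", NS).text

# Positional indexing depends on the sibling count: in this trimmed sample,
# index 2 under BasicDescription is the second Keyword, not the Genre.
basic = info[0]
tag_at_index_2 = basic[2].tag

print(genre_by_path)
print(tag_at_index_2)
```

The path-based lookup keeps returning the genre name when Keyword entries are added or removed, while the index-based lookup silently lands on a different element.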

Solution

Here is a complete solution using XPath:

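from lxml import etree  # xpath() is provided by lxml, not by xml.etree.ElementTree

# Bind a prefix to the document's default namespace; the prefix name is arbitrary.
nsmap = {'tva': 'urn:tva:metadata:2010'}

# Passing the file name lets lxml detect the encoding itself.
tree = etree.parse('UPC_Medium.xml')
root = tree.getroot()

# Open the output file once ('w' starts a fresh file on each run)
# instead of reopening it for every program.
with open('output.txt', 'w', encoding='utf-8') as out:
    for info in root.xpath('//tva:ProgramInformation', namespaces=nsmap):
        crid = info.get('programId')
        # findtext() returns the default instead of failing when an element is missing.
        title = info.findtext('.//tva:Title', default='', namespaces=nsmap)
        genre = info.findtext('.//tva:Genre/tva:Name', default='', namespaces=nsmap)
        out.write(f'{crid}, {title}, {genre}\n')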

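The code above relies on lxml to parse the XML file and extract the required fields, since xpath() is an lxml method and is not available in the standard library's xml.etree.ElementTree. The xpath() method evaluates an XPath expression against the document. Because the file declares a default namespace (urn:tva:metadata:2010), every element name in the expression must be qualified with a prefix that is mapped to that namespace. The get() method returns an attribute value, and the text attribute holds an element's text content.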
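The namespace requirement is easy to miss, so here is a small sketch of it. It swaps in the standard library's ElementTree and its limited XPath subset instead of lxml's xpath(), so it runs without third-party packages. The simplified `SAMPLE` document, the `crid://example/...` identifiers, and the `tva` prefix are all made up for illustration. It also shows findtext() with a default value as one way to handle a program that has no Genre.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical document: the second program has no Genre.
SAMPLE = """<TVAMain xmlns="urn:tva:metadata:2010">
  <ProgramInformation programId="crid://example/1">
    <BasicDescription>
      <Title>With genre</Title>
      <Genre><Name>Economics</Name></Genre>
    </BasicDescription>
  </ProgramInformation>
  <ProgramInformation programId="crid://example/2">
    <BasicDescription>
      <Title>Without genre</Title>
    </BasicDescription>
  </ProgramInformation>
</TVAMain>"""

# "tva" is an arbitrary alias for the default namespace declared on the root.
NS = {"tva": "urn:tva:metadata:2010"}
root = ET.fromstring(SAMPLE)

# An unqualified name matches nothing, because every element
# belongs to the urn:tva:metadata:2010 namespace.
unqualified = root.findall(".//ProgramInformation")

rows = []
for info in root.findall(".//tva:ProgramInformation", NS):
    crid = info.get("programId")
    # findtext() returns the default when the path matches no element.
    title = info.findtext(".//tva:Title", default="", namespaces=NS)
    genre = info.findtext(".//tva:Genre/tva:Name", default="", namespaces=NS)
    rows.append((crid, title, genre))

print(unqualified)
print(rows)
```

Calling find(...).text directly would raise an AttributeError for the second program, because find() returns None when nothing matches.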

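Finally, a with statement is used to write the information to an external file. The with statement ensures that the file is closed properly even if an error occurs.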
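Program titles can themselves contain commas, which makes a plain comma-joined line ambiguous to read back. As one possible alternative, the sketch below writes the rows with the standard library's csv module, which quotes such fields. The rows, the second crid, and the temporary output path are hypothetical values chosen for this example.

```python
import csv
import os
import tempfile

# Hypothetical rows in the (programId, title, genre) shape the article extracts.
rows = [
    ("crid://bds.tv/95291775", "Rip Off Britain", "Economics"),
    ("crid://example/2", "News, Weather and Sport", "News"),
]

# A temporary path is used only so the sketch does not overwrite real files.
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# newline="" lets the csv module control line endings itself.
with open(out_path, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["programId", "title", "genre"])
    writer.writerows(rows)

# The file is closed once the with block exits.
file_closed = out.closed

# Reading the file back recovers the title with its comma intact.
with open(out_path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(file_closed)
print(read_back[2])
```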
