Extracting Text from HTML

Posted on 2018-06-08 | Edited on 2018-06-08 | In tools, markdown

Using Python to extract the text from an HTML file.

Extracting the text from HTML

Using NLTK, with code adapted from Shatu:

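import nltk
from urllib import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()  # download the raw HTML of the page
# nltk.clean_html is only implemented in NLTK 2.x. In NLTK 3 the call is no
# longer supported; its error message points to BeautifulSoup's get_text().
raw = nltk.clean_html(html)
print(raw)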
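
Since the snippet above depends on Python 2 and an old NLTK release, here is a minimal sketch of the same idea for Python 3 using only the standard library's html.parser. It is not the NLTK implementation; the names TextExtractor and extract_text and the sample string are made up for illustration, and joining text chunks with a single space is a simplifying assumption (it would split a word that is broken up by inline tags).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside <script>/<style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces, then collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample document, used instead of a network request.
sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>Some <b>bold</b> text &amp; more.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

To run this against the page above, you would pass in the decoded result of urllib.request.urlopen(url).read(); which encoding to decode with depends on the page and is not covered here.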

Converting an HTML file to Markdown

See aaronsw/html2text/html2text.py.
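
To give a feel for what such a converter does, here is a rough standard-library sketch that maps a small subset of HTML (headings, paragraphs, bold, italics, links, list items) to Markdown. It is not html2text's code and covers far less; the class MiniMarkdown, the function to_markdown, and the sample input are hypothetical. With the real package installed, the usual entry point is html2text.html2text(html).

```python
import re
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Convert a small subset of HTML to Markdown (illustrative only)."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    INLINE = {"b": "**", "strong": "**", "i": "*", "em": "*"}
    BLOCKS = {"p", "ul", "ol"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._out.append("\n\n" + self.HEADINGS[tag])
        elif tag in self.BLOCKS:
            self._out.append("\n\n")
        elif tag == "li":
            self._out.append("\n- ")
        elif tag in self.INLINE:
            self._out.append(self.INLINE[tag])
        elif tag == "a":
            # Remember the target so it can be emitted when the link closes.
            self._href = dict(attrs).get("href")
            self._out.append("[")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self._out.append(self.INLINE[tag])
        elif tag == "a":
            self._out.append("](%s)" % (self._href or ""))
            self._href = None

    def handle_data(self, data):
        # Collapse whitespace runs but keep single spaces at the edges.
        self._out.append(re.sub(r"\s+", " ", data))

    def markdown(self):
        lines = [line.strip() for line in "".join(self._out).split("\n")]
        # Never leave more than one blank line between blocks.
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def to_markdown(html):
    """Return a Markdown rendering of the supported subset of `html`."""
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return parser.markdown()


# Hypothetical sample input.
sample_html = (
    "<h1>Title</h1><p>Some <b>bold</b> and <em>emphasised</em> text with a "
    '<a href="https://example.com">link</a>.</p>'
    "<ul><li>one</li><li>two</li></ul>"
)
print(to_markdown(sample_html))
```

Tags outside the handled subset are dropped and only their text is kept, so tables, images, code blocks, and nested lists would need the real library.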

References

Author: Wei LI
Link to this post: https://VVingerfly.github.io/2018/06-08-Py-ExtractTextFromHTML/
Copyright: Unless stated otherwise, all posts on this blog are licensed under BY-NC-SA. Please credit the source when reposting!