China DOS Union (中国DOS联盟) forum: www.cn-dos.net/forum

Title: [Original] CMD and Curl Join Forces: Automatically Merging Multi-Page Threads

ikari
Junior user

Points: 58
Posts: 6
Registered: 2006-8-1
Status: offline

[Original post]: [Original] CMD and Curl Join Forces: Automatically Merging Multi-Page Threads

Practical Needs

When encountering long and classic discussion threads;
When seeing software tutorials divided into multiple pages;
When discovering serialized novels that are hard to put down.

Saving content that is spread across so many pages quickly turns into tedious, mechanical work.
Whether you copy by hand or rely on software to save the pages, a great deal of human intervention is required, which we, as intelligent beings, should not have to tolerate.
Computers exist to take over repetitive work from people, so why not hand them as much of it as possible?
Unfortunately, I (Tofu) have not yet found any software that meets my requirements. Since nothing ready-made is available, I decided to build it myself.

Analysis of the Approach

Every solution needs a well-defined setting, since no single solution fits every problem. Here we assume the task is to merge the multi-page threads commonly found on forums.
To merge a multi-page thread, we first need the content of every page of the thread. Repetitive work like this is ideal for a machine.
Second, we need to identify where the user-posted content starts and where it ends. A person has to work this out once; after that the machine can take over.
Finally, we need to extract that content and reassemble it into the final result, which a machine can also do perfectly well.
Once these three steps are covered, we are freed from the repetitive work and can do something else.

Solution

High-level languages require dedicated study and supporting software, which raises the barrier to use, so in the end I chose the CMD command line for this job.
CMD has no built-in command for fetching web pages, so we also need the powerful command-line tool Curl.
As an example we will merge the thread "MPlayer 2006-03-03 K&K (update at post 992)" from the CCF Elite Technology Forum, following the steps above one at a time until we reach the goal.

Web Page Crawling

With Curl we can easily fetch any web page from the command line:

curl -o tmp1.txt "http://bbs.et8.net/bbs/showthread.php?t=634659&page=1&pp=15"

This saves the first page of the thread to tmp1.txt. The URL is quoted because CMD would otherwise treat each & as a command separator.
For websites that check browser information, we can add

-A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

to pose as Internet Explorer.
For websites that require cookies, we can use

-D cookie1.txt

to save the response headers, including any cookies the server sets, and

-b cookie1.txt

to send those cookies back on a later request.
For websites with anti-hotlinking protection, we can use

-e "http://bbs.et8.net/"

to make the request look as if it came from a related link. Combined with CMD's powerful FOR command and variables, plus a little human ingenuity, these options are enough to build a script that fetches the whole thread automatically.
The thread's URL shows that page= holds the page number, which is what makes automation possible. We also know that the thread has 73 pages. The final fetching script is as follows:

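@echo off
rem grab.cmd: download all 73 pages of the thread and merge them into temp.txt
setlocal ENABLEDELAYEDEXPANSION
set last=1
if exist temp.txt del temp.txt
for /l %%i in (1,1,73) do (
    echo %%i
    rem Inside double quotes the ampersand needs no caret escape.
    curl -b cookie!last!.txt -D cookie%%i.txt -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" -e "http://bbs.et8.net/bbs/showthread.php?t=634659&page=!last!&pp=15" -o tmp%%i.txt "http://bbs.et8.net/bbs/showthread.php?t=634659&page=%%i&pp=15"
    rem Append in page order; a wildcard copy would put tmp10 before tmp2.
    type tmp%%i.txt >>temp.txt
    rem Remember this page so the next request sends its cookies and Referer.
    set last=%%i
)
del cookie*.txt
del tmp*.txt
endlocal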

Save the script above as grab.cmd and run it. The result is temp.txt, which contains all 73 pages of the thread.
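For readers who want to check the request plan outside CMD, here is a minimal Python sketch (Python is not what this post uses; it only illustrates the logic). It builds, without sending anything, one request per page with the same User-Agent and a Referer that points at the previous page, which is what the -A and -e options are meant to achieve. The names `page_url` and `build_requests` are made up for this illustration, cookies are not modelled, and the URL and page count are the ones quoted above.

```python
from urllib.request import Request

BASE = "http://bbs.et8.net/bbs/showthread.php?t=634659"
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"


def page_url(page, per_page=15):
    """Return the URL of one page of the thread."""
    return f"{BASE}&page={page}&pp={per_page}"


def build_requests(last_page):
    """Build (but do not send) one request per page.

    Each request names the previous page as its Referer; page 1 has no
    previous page, so it refers to itself.
    """
    requests = []
    for page in range(1, last_page + 1):
        previous = max(page - 1, 1)
        requests.append(Request(
            page_url(page),
            headers={"User-Agent": USER_AGENT, "Referer": page_url(previous)},
        ))
    return requests


if __name__ == "__main__":
    for req in build_requests(3):
        print(req.full_url, "<-", req.get_header("Referer"))
```

Listing the requests first makes it easy to confirm that every page from 1 to 73 is requested exactly once and that each Referer really is the preceding page.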

Content Analysis

Because of the way CMD handles characters, we first re-save temp.txt in ANSI encoding.
After examining a single page, I found that the forum software emits <div id="posts">, which occurs exactly once per page, just before the user content starts,
and an equally unique HTML comment where it ends (judging by the words start, content and table that the script tests for, the comment <!-- start content table -->). These are exactly the markers we were hoping to find.

Text Processing

A FOR command applies one set of rules to one line at a time, so I use nested FOR loops to process the whole file.
First,

for /f "delims=" %%i in (temp.txt) do ( echo %%i >tmp.txt )

writes temp.txt into tmp.txt one line at a time, each line overwriting the previous one.
A second, nested FOR then processes that single line in tmp.txt.

Flag Setting

We can use FOR's delims= and tokens= options to split a line and capture the pieces.
With

for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt)

a line is split at "<", "-", ">", "=" and " ",
and the first three pieces are stored in the variables %%j, %%k and %%l. IF statements then test whether these three variables match the conditions for setting the flag:

if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1

if "%%j"=="start" if "%%k"=="content" if "%%l"=="table" set flag=0

flag=1 means the user content has started, and flag=0 means it has ended. In the first test %%l is deliberately left unquoted: the quotation mark is not a delimiter, so for <div id="posts"> the third token is "posts" with its quotation marks included.
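To see what the three variables actually receive, the tokenizing can be imitated in a few lines of Python. This is only an approximation of FOR /F written for illustration (the function name `for_f_tokens` is invented here, and details such as the default eol character are ignored). The second sample line is an assumption: it is the comment line as it would look once CMD has dropped the "!", reconstructed from the three words the second IF tests for.

```python
def for_f_tokens(line, delims, count):
    """Approximate `for /f "tokens=1-N delims=..."`.

    Split on any delimiter character, treat runs of delimiters as one,
    and keep the first `count` tokens.
    """
    tokens, current = [], ""
    for ch in line:
        if ch in delims:
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens[:count]


DELIMS = "<->= "
print(for_f_tokens('<div id="posts">', DELIMS, 3))
# ['div', 'id', '"posts"']
print(for_f_tokens('<-- start content table -->', DELIMS, 3))
# ['start', 'content', 'table']
```

The first result shows why the start test compares %%l without adding quotation marks of its own: the token already carries them.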

Content Trimming

Because the script runs with delayed expansion enabled, CMD drops the "!" when it expands %%i, so the HTML comment opener "<!--" turns into "<--" and comments that should stay hidden end up displayed.
We can add one more FOR to deal with this:

for /f "tokens=1-8 delims=< " %%m in (tmp.txt) do (

if not "%%t"=="-->" if not "%%s"=="-->" if not "%%r"=="-->" if not "%%q"=="-->" if not "%%p"=="-->" if not "%%o"=="-->" if not "%%m"=="ECHO" if !flag!==1 echo %%i >>new.htm)

Any line whose third to eighth token is "-->" is skipped, and the same command appends every remaining line after the start marker to new.htm.
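The filter can be checked in isolation with a small Python imitation (again just a sketch, not the CMD code; `tokens` and `keep_line` are names invented for it). It mirrors the IF chain literally, including the fact that the second token is never tested. The reading of the ECHO test as a guard against the message that echo prints for a blank line is my assumption.

```python
import re

SPLIT = re.compile(r"[< ]+")  # same delimiters as delims=< plus the space


def tokens(line, count=8):
    """First `count` tokens of `line`, split on runs of "<" and " "."""
    return [t for t in SPLIT.split(line) if t][:count]


def keep_line(line):
    """Return True if the IF chain would let this line through.

    A line is dropped when its first token is ECHO or when any of its
    third to eighth tokens is "-->".
    """
    t = tokens(line)
    if t and t[0] == "ECHO":
        return False
    return "-->" not in t[2:8]


print(keep_line("<-- post 12 -->"))         # False: fourth token is "-->"
print(keep_line("<td>Hello there</td>"))    # True: ordinary content
```

A comment whose closing "-->" falls on the second token, or beyond the eighth, still gets through, which is a limit of the token-based test rather than of this sketch.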

Final Script

@echo off
rem merge.cmd: copy the post content of every page from temp.txt into new.htm
setlocal ENABLEDELAYEDEXPANSION
rem flag is 1 while we are inside the user content
set flag=0
for /f "delims=" %%i in (temp.txt) do (
    rem put the current line into tmp.txt so the inner loops can tokenize it
    echo %%i >tmp.txt
    for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt) do (
        if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1
        if "%%j"=="start" if "%%k"=="content" if "%%l"=="table" set flag=0
        for /f "tokens=1-8 delims=< " %%m in (tmp.txt) do (
            if not "%%t"=="-->" if not "%%s"=="-->" if not "%%r"=="-->" if not "%%q"=="-->" if not "%%p"=="-->" if not "%%o"=="-->" if not "%%m"=="ECHO" if !flag!==1 echo %%i >>new.htm
        )
    )
)
del tmp.txt
endlocal

Save the script as merge.cmd and run it. The resulting new.htm contains all 1083 posts of the thread.
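The control flow of merge.cmd is easier to follow when written as a plain loop. The Python sketch below imitates it under stated assumptions: removing every "!" is only a rough model of what delayed expansion does to a line, the end marker is taken to be the comment whose first three tokens are start, content and table, and the function names and sample page are invented for the illustration.

```python
import re

FLAG_SPLIT = re.compile(r"[<\->= ]+")  # delims=<->= plus the space
TRIM_SPLIT = re.compile(r"[< ]+")      # delims=< plus the space


def first_tokens(splitter, line, count):
    """Split on runs of delimiters and keep the first `count` tokens."""
    return [t for t in splitter.split(line) if t][:count]


def merge(lines):
    """Return the lines between the start and end markers of every page."""
    kept = []
    inside = False                     # plays the role of `flag`
    for raw in lines:
        line = raw.replace("!", "")    # rough model of CMD dropping "!"
        head = first_tokens(FLAG_SPLIT, line, 3)
        if not head:
            continue                   # FOR /F skips lines with no tokens
        if head == ["div", "id", '"posts"']:
            inside = True
        elif head == ["start", "content", "table"]:
            inside = False
        trim = first_tokens(TRIM_SPLIT, line, 8)
        if inside and "-->" not in trim[2:8] and trim[0] != "ECHO":
            kept.append(line)
    return kept


if __name__ == "__main__":
    page = [
        "<html>",
        '<div id="posts">',
        "<!-- post #1 -->",
        "<td>Hello</td>",
        "<!-- start content table -->",
        "<p>footer</p>",
    ]
    print(merge(page))
```

Because the flag is only reset by the end marker, feeding the function several pages in a row keeps the post content of each page and discards everything between one page's end marker and the next page's start marker.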

Optimization and Improvement

This script only extracts the text content. We could also detect IMG elements to find pictures
and expand the path in each src attribute into a full URL, so that the pictures in the posts display correctly.
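As a rough idea of what that improvement involves, here is a short Python sketch (an illustration only, not part of the CMD scripts). It rewrites double-quoted src attributes of IMG elements against the thread URL by using the standard `urllib.parse.urljoin`. The function name `absolutize_images` and the sample image paths are hypothetical, and a regular expression like this is a simplification that will not cover every way an IMG tag can be written.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "http://bbs.et8.net/bbs/showthread.php?t=634659"
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)


def absolutize_images(html, page_url=PAGE_URL):
    """Rewrite every double-quoted IMG src so that it is an absolute URL."""
    return IMG_SRC.sub(
        lambda m: m.group(1) + urljoin(page_url, m.group(2)) + m.group(3),
        html,
    )


if __name__ == "__main__":
    # Hypothetical relative path, resolved against the directory of PAGE_URL.
    print(absolutize_images('<img src="images/smilies/smile.gif" alt="">'))
```

Paths that are already absolute pass through unchanged, so the function can be applied to every line of the merged file.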

Postscript and Summary

Together, CMD and Curl can handle many complex batch jobs. The first run takes some extra time to set up, but after that the script is convenient to reuse.
This script fetches and merges any thread on the CCF Elite Technology Forum, as well as threads on some other vBulletin-based forums; for other forums it has to be adapted case by case.

This article is an original work by chenke_ikari, first published on 豆腐的简陋小屋 (Tofu's Simple Hut).
This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 China License.

Tofu's Simple Hut

2006-8-1 23:07

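IceCrack
Intermediate user

Friend of DOS

Points: 332
Posts: 168
Registered: 2005-10-6
From: Tianya
Status: offline

[Post #2]:

Why can't I get onto Tofu's Simple Hut?

Test environment: Windows XP Pro SP2. This is how experts are made: C:\WINDOWS\Help\ntcmds.chm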

2006-8-2 21:15


无奈何
Honorary moderator

Points: 1338
Posts: 356
Registered: 2005-7-15
Status: offline

[Post #3]:

RE ikari
As soon as I logged in to the forum I saw your three articles marked as featured posts. First of all, welcome to the forum; I hope you will take part in more of our discussions.
About your use of curl: why not download the pages with something like curl "http://bbs.et8.net/bbs/showthread.php?t=634659&page=" -o tmp#1.htm? (#1 in the output name stands for the current value of a numeric range such as [1-73] in the URL, which curl expands on its own; the range seems to have dropped out of the command as posted.)
In fact I am also writing a batch script that works like a collector, though I have only just started. Choosing between curl and wget was hard: curl is very good at imitating a browser but cannot download links recursively, while wget is strong at recursive downloads but lacks curl's convenient handling of sequential, regularly numbered links. In the end I chose wget, because I care more about recursive downloading and about converting relative links to absolute ones, which makes later processing of the pages easier. To make up for wget's weakness I wrote the following script for downloading multi-page links; it is part of the batch script I mentioned.

downhtm.cmd

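@echo off
for /f "eol=# tokens=1,2 delims=	" %%i in (url.txt) do (
    call :setpage "%%i" %%j
)
goto :EOF

rem setpage: first argument is the URL template, second is the page range
:setpage
set flag=0
set _url="%~1"
set pages=%2
set endpage=%pages:*-=%
call set startpage=%%pages:-%endpage%=%%
rem A leading letter marks zero-padded numbers: prefix both ends with 1 so
rem that FOR /L keeps the leading zeros, and strip that 1 again in download
if "%pages:~0,1%" GTR "9" (
    set pages=%pages:~1%
    set startpage=1%startpage:~1%
    set endpage=1%endpage%
    set flag=1
)
for /l %%i in (%startpage%,1,%endpage%) do (
    call :download %%i
)
goto :EOF

rem download: the argument is the page number, still carrying the extra
rem leading 1 when flag is 1
:download
set num=%1
if "%flag%" == "1" (
    set num=%num:~1%
)
call set url=%%_url:(*)=%num%%%
wget -k %url%
goto :EOF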

Posted by 无奈何 on 2006-08-02 22:38

url.txt
The format looks like this:

#This is a comment line; comment lines start with "#"

#A page range that starts with a letter means the page numbers are zero-padded to a fixed width.

http://www.cn-dos.net/forum/forumdisplay.php?fid=9&page=(*)	1-5

http://www1.mydeskcity.com/xpbz(*).htm	A01-05

http://www.cn-dos.net/forum/forumdisplay.php?fid=23

Note that in the second line of downhtm.cmd (the FOR line), delims= is followed by a tab character, which may be displayed as several spaces.
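The page-range handling in downhtm.cmd can be paraphrased in Python as follows. This is a sketch of what the script appears to do, written for illustration: the function name `expand` is invented, the wget call is left out, and a range is assumed to be zero-padded exactly when it starts with a letter.

```python
def expand(url_template, page_spec):
    """Expand "(*)" in `url_template` for every page in `page_spec`.

    "1-5" gives 1, 2, ... 5; a spec with a leading letter such as "A01-05"
    keeps the zero padding and gives 01, 02, ... 05.
    """
    padded = page_spec[:1].isalpha()
    if padded:
        page_spec = page_spec[1:]
    start, end = page_spec.split("-", 1)
    width = len(start) if padded else 0
    return [
        url_template.replace("(*)", str(n).zfill(width))
        for n in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for url in expand("http://www1.mydeskcity.com/xpbz(*).htm", "A01-05"):
        print(url)
```

The batch version reaches the same result by temporarily putting a 1 in front of both ends of the range, because FOR /L would otherwise discard the leading zeros.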
