|
plp626
Silver Member
Diamond Member
Points: 2278
Posts: 1020
Registered: 2007-11-19
Status: Offline
|
『Original post』:
[Challenge] Fast replacement across 300,000 files
I have more than 300,000 .htm files (plus a small number of .html files), and a considerable share of them have been infected with injected malware. The infected .html and .htm source files all contain the URL mm.aa88567.cn in their last line.
The task: find the infected pages (those whose last line contains the string "mm.aa88567.cn") and delete that last line.
Allowed tools: common tools that ship with Windows, or external tools such as perl, sed, ... (anything no larger than 1 MB is acceptable).
What I am looking for is the fastest tool and the solution (code) that goes with it.
sed (40 KB, version 4.0.7) download: http://upload.cn-dos.net/img/1813.rar
perl (369 KB, version 5.005_03) download: http://upload.cn-dos.net/img/1814.rar
The following bat code generates the test files:
@echo off&setlocal enabledelayedexpansion
md test&cd test
echo Generating test files...
rem 2000 infected .htm files: the last line contains the malicious URL
for /l %%a in (1 1 2000)do (
    (echo ^<html^>
    echo ...!random!...
    echo ^<... http://mm.aa88567.cn/ ... ^>
    )>!random!!time:~-2!!random!.htm
)
rem 3000 clean .html files
for /l %%a in (1 1 3000)do (
    (echo ^<html^>
    echo ...!random!...
    echo dfddfd
    )>!random!!time:~-2!!random!!time:~-2!.html
)
rem Repeat both sets inside a randomly named subdirectory
set ff=!random!
md %ff%
for /l %%a in (1 1 2000)do (
    (echo ^<html^>
    echo ...!random!...
    echo ^<... http://mm.aa88567.cn/ ... ^>
    )>%ff%\!random!!time:~-2!!random!.htm
)
for /l %%a in (1 1 3000)do (
    (echo ^<html^>
    echo ...!random!...
    echo dfddfd
    )>%ff%\!random!!time:~-2!!random!!time:~-2!.html
)
[Last edited by plp626 on 2010-3-4 at 23:00]
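Since only the last line matters, one possible approach is to inspect just the tail of each file and truncate it in place instead of rewriting the whole file. The sketch below shows that idea in Python, which is not one of the tools named above and is well over the 1 MB limit, so treat it as an illustration of the technique only. The function name `strip_infected_last_line` and the 4096-byte window are assumptions made for the example.

```python
import os

MARKER = b"mm.aa88567.cn"  # byte string taken from the post


def strip_infected_last_line(path, marker=MARKER, window=4096):
    """Remove the last non-blank line of `path` if it contains `marker`.

    Only the final `window` bytes are read unless the last line is longer
    than that. Returns True when the file was truncated.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        start = max(0, size - window)
        f.seek(start)
        # Ignore trailing line terminators (and therefore trailing blank lines).
        body = f.read().rstrip(b"\r\n")
        newline = body.rfind(b"\n")
        if newline == -1 and start > 0:
            # The last line does not fit in the window: fall back to the whole file.
            start = 0
            f.seek(0)
            body = f.read().rstrip(b"\r\n")
            newline = body.rfind(b"\n")
        last_line = body[newline + 1:]
        if marker not in last_line:
            return False
        # Keep everything up to and including the newline before the last line.
        f.truncate(start + newline + 1)
        return True
```

Called once per path from a directory walk (for example `os.walk`), this reads only the final block of each file. Whether it actually beats sed on a given disk has not been measured here.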
|

There is always a higher mountain and always someone more capable; keep a low profile and work hard. |
|
2010-3-2 23:59 |
|
|
radem
Advanced User
CMD Infected
Points: 691
Posts: 383
Registered: 2008-5-23
Status: Offline
|
|
2010-3-3 01:46 |
|
|
Pierre
Junior User
Points: 30
Posts: 19
Registered: 2009-4-4
Status: Offline
|
『Post 3』:
Try:
for /f "delims=" %%i in ('dir /s /b *.html') do sed -i "${/mm\.aa88567\.cn/d}" "%%i"
[Last edited by Pierre on 2010-3-3 at 02:23]
|
|
2010-3-3 02:19 |
|
|
HAT
Moderator
Points: 9023
Posts: 5017
Registered: 2007-5-31
Status: Offline
|
『Post 4』:
```
sed -i "/mm\.aa88567\.cn/d" *.html
```
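The two sed scripts posted so far differ in which lines they touch: `/mm\.aa88567\.cn/d` deletes every line that matches, while `${/mm\.aa88567\.cn/d}` restricts the deletion to the last line. A small Python model of the two behaviours makes the difference concrete. The function names are hypothetical and plain substring matching stands in for the regular expression; this illustrates sed's addressing and is not a replacement for either command.

```python
MARKER = "mm.aa88567.cn"


def delete_matching_lines(lines):
    """Models sed '/re/d': drop every line that contains the marker."""
    return [line for line in lines if MARKER not in line]


def delete_last_line_if_matching(lines):
    """Models sed '${/re/d}': drop only the last line, and only if it matches."""
    if lines and MARKER in lines[-1]:
        return lines[:-1]
    return list(lines)
```

For files like the generated test set, where the URL appears only on the last line, both give the same result; they diverge as soon as a legitimate earlier line mentions the same string.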
|

 |
|
2010-3-3 04:36 |
|
|
qq330878338
Newcomer
Points: 13
Posts: 7
Registered: 2008-7-18
Status: Offline
|
『Post 5』:
I took a look and this is not simple. All I can do is bump the thread.
|
|
2010-3-3 14:34 |
|
|
plp626
Silver Member
Diamond Member
Points: 2278
Posts: 1020
Registered: 2007-11-19
Status: Offline
|
『Post 6』:
The code in post 4 should be the faster one: unlike post 3, it does not launch sed over and over.
However:
I generated the test files in the test directory with the bat code from post 1, opened cmd in the parent directory of test, and typed sed -i "/mm\.aa88567\.cn/d" test\*.html
Within a few seconds sed crashed with a "memory could not be read" error.
It looks as if sed can batch-process at most 512 files, because the error dialog popped up once 511 temporary files had been created in the directory.
I don't know why; still looking into it...
[Last edited by plp626 on 2010-3-3 at 15:11]
|

|
2010-3-3 15:09 |
|
|
tachyon
Junior User
Points: 33
Posts: 32
Registered: 2006-2-21
Status: Offline
|
『Post 7』:
Calling sed directly is fastest; don't go through for, which is the least efficient option. Repeatedly starting and terminating sed processes also consumes a lot of system resources, and I would guess the CPU or memory gets exhausted before long.
Also, processing 300,000 files one by one is never going to be very fast: it is bound by disk I/O, unlike a computation done in memory.
|
|
2010-3-3 16:06 |
|
|
tachyon
Junior User
Points: 33
Posts: 32
Registered: 2006-2-21
Status: Offline
|
『Post 8』:
Originally posted by plp626 at 2010-3-3 15:09:
The code in post 4 should be the faster one: unlike post 3, it does not launch sed over and over.
However:
I generated the test files in the test directory with the bat code from post 1, opened cmd in the parent directory of test, and typed sed -i " ...
That depends on what the sed documentation says. If there is a limit on how many files can be edited at once, you could first use a batch script to split the files into several directories by count, then run sed on each directory in turn.
|
|
2010-3-3 16:10 |
|
|
sady2009
Junior User
Points: 58
Posts: 60
Registered: 2009-2-18
Status: Offline
|
『Post 9』:
Machines have plenty of memory these days.
Install a ramdisk, move the files onto it, and then run the batch script there. Its small-file access speed is remarkable, practically the same as working in memory.
|
|
2010-3-3 16:41 |
|
|
Pierre
Junior User
Points: 30
Posts: 19
Registered: 2009-4-4
Status: Offline
|
『Post 10』:
With a bare *.html, every file matched by the wildcard is passed to sed as a command-line argument, so the list cannot exceed a certain size.
If you use sed this way in a shell, you get an error like:
-bash: /bin/sed: Argument list too long
Of course this is not specific to sed; awk, grep and the like behave the same way.
For better efficiency, you could split the list and hand it to sed 500 files at a time.
[Last edited by Pierre on 2010-3-3 at 17:20]
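The batching idea can be sketched as follows, in Python purely for illustration. `batches` and `sed_commands` are hypothetical helper names, the batch size of 500 comes from the post, and the sed script is the one Pierre posted earlier in the thread. The sketch only builds the argument lists. It assumes that sed, when given several files with -i, applies `$` to the last line of each file separately, as GNU sed documents for in-place editing; this has not been checked against the 4.0.7 build linked in the thread.

```python
from itertools import islice

SED_SCRIPT = "${/mm\\.aa88567\\.cn/d}"  # delete the last line if it matches


def batches(paths, size=500):
    """Yield successive lists of at most `size` paths."""
    it = iter(paths)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def sed_commands(paths, size=500):
    """Yield one sed argument list per batch instead of one per file."""
    for batch in batches(paths, size):
        yield ["sed", "-i", SED_SCRIPT] + batch
```

Each list could then be handed to `subprocess.run`, so sed starts once per 500 files rather than once per file. Whether 500 stays below the limit plp626 ran into is something to confirm by experiment.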
|
|
2010-3-3 17:19 |
|
|
Pierre
Junior User
Points: 30
Posts: 19
Registered: 2009-4-4
Status: Offline
|
『Post 11』:
Alternatively, do it directly in perl and let perl loop over the files.
Unfortunately I am not fluent in it; I will try to write a solution later...
|
|
2010-3-3 17:27 |
|
|
plp626
Silver Member
Diamond Member
Points: 2278
Posts: 1020
Registered: 2007-11-19
Status: Offline
|
『Post 12』:
I don't know perl well either. I really hope some expert will try it with the perl build linked in the original post.
Looking forward to it...
|

|
2010-3-3 20:42 |
|
|
freeants001
Intermediate User
Points: 330
Posts: 244
Registered: 2006-4-14, from Hubei
Status: Offline
|
『Post 13』:
Here is the JS file-replacement code I use myself, adjusted to the original poster's requirements.
// Target folder: first command-line argument, or a folder picker when none is given
if(WScript.arguments.length==0)PathSpec=get_path();
else PathSpec=WScript.arguments(0);
fso=new ActiveXObject("Scripting.FileSystemObject");
WshShell=WScript.CreateObject("WScript.Shell");
// Cleaned copies are written under #ReplacedFiles# next to this script
WshShell.CurrentDirectory=fso.GetParentFolderName(WScript.ScriptFullName);
if(!fso.FolderExists("#ReplacedFiles#"))fso.CreateFolder("#ReplacedFiles#");
WshShell.CurrentDirectory="#ReplacedFiles#"
Main(PathSpec);
WScript.quit();
// Process one folder recursively, mirroring its name under the current directory
function Main(FileSpec){
    var fld,fs,fds,f,d,curdir;
    curdir=fso.GetBaseName(FileSpec);
    if(!fso.FolderExists(curdir))fso.CreateFolder(curdir);
    curdir=fso.GetAbsolutePathName(curdir);
    WshShell.CurrentDirectory=curdir;
    fld = fso.getfolder(FileSpec);
    fds = new Enumerator(fld.subfolders);
    fs = new Enumerator(fld.files)
    for(;!fs.atEnd();fs.moveNext()){
        f=fs.item();if(f.size==0)continue;
        // No g flag here: a global regex keeps lastIndex between test() calls
        if(/^html?$/i.test(fso.getextensionname(f.name).toLowerCase())){
            try{
                var fl=fso.opentextfile(f.path,1);
                var sss=fl.readall();
                fl.close();
            }catch(err){
                WScript.quit();
            }
            // The copy is written as <base name>.txt in the current directory
            var fl=fso.opentextfile(fso.GetBaseName(f.path)+".txt",2,true);
            // Drop the last line if it contains the malicious URL
            sss=sss.replace(/.*mm\.aa88567\.cn.*\s*$/,"");
            fl.write(sss);
            fl.close();
        }
    }
    for(;!fds.atEnd();fds.moveNext()){
        d=fds.item();
        Main(d.path)
        WshShell.CurrentDirectory=curdir
    }
}
// Keep asking until a folder below a drive root (e.g. C:\site) is chosen
function get_path(){
    var objShell = new ActiveXObject("Shell.Application")
    do{
        var objFolder = objShell.BrowseForFolder(0, "Please select a folder:",0x301,0x11)
        if(objFolder == null)WScript.quit()
        var objPath = objFolder.Self.Path;
        if(/^[a-z]:\\.+$/i.test(objPath))break;
    }while(true)
    return objPath;
}
|
|
2010-3-4 16:19 |
|
|
freeants001
Intermediate User
Points: 330
Posts: 244
Registered: 2006-4-14, from Hubei
Status: Offline
|
『Post 14』:
And here is an AHK script:
FileSelectFolder, sPath ,,3,Please select the folder you want to process:
if not %errorlevel%
{
    IfNotExist,%sPath%_BAK
    {
        MsgBox,4,,No backup found. Create one?
        IfMsgBox Yes
        {
            TrayTip,,Creating backup...,5
            FileCopyDir %sPath% , %sPath%_BAK
        }
    }
    Loop, %sPath%\*.html, , 1 ; Recurse into subfolders.
    {
        FileRead, sText, %A_LoopFileFullPath%
        sss := RegExReplace( sText ,".*mm\.aa88567\.cn.*\s$", "")
        FileDelete %A_LoopFileFullPath%
        FileAppend %sss%, %A_LoopFileFullPath%
    }
}
loop 2
{
    soundBeep 2500,500
    sleep 500
}
|
|
2010-3-4 20:18 |
|
|
plp626
Silver Member
Diamond Member
Points: 2278
Posts: 1020
Registered: 2007-11-19
Status: Offline
|
『Post 15』:
I tried the code from post 13. The speed is beyond reproach, but every file's extension gets changed to txt. Among these web pages, the index files use the .html extension and the non-index files use .htm.
I don't know JS; how should your code be changed so that the extensions stay the same?
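In freeants001's script the output name is built as `fso.GetBaseName(f.path)+".txt"`, which is where the txt extension comes from; FileSystemObject also offers `GetFileName`, which returns the name together with its extension. The same distinction is sketched below in Python, using `ntpath` so that Windows-style paths behave identically on any platform. The function names and the sample paths are made up for the example, and the suggestion has not been run against the original script.

```python
import ntpath


def name_with_txt(path):
    """Mimics GetBaseName(path) + ".txt": the original extension is lost."""
    base, _ext = ntpath.splitext(ntpath.basename(path))
    return base + ".txt"


def name_preserved(path):
    """Mimics GetFileName(path): the original extension is kept."""
    return ntpath.basename(path)
```

For `C:\site\index.html` the first function returns `index.txt` and the second `index.html`, so swapping the name expression in the `opentextfile` call for `fso.GetFileName(f.path)` would be one plausible change.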
|

|
2010-3-4 21:23 |
|