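guixi
Junior User
Points 76
Posts 29
Registered 2007-10-2
Status Offline
『Original post』:
Help wanted: annotating English words with their Chinese definitions. Thanks in advance.
Test text, test.txt: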
----------------------------
retrieval is the process of extracting data from a file and generating
a report.
The key to all of these operations is that the data has some kind of
structure. Let us illustrate this with the analogy of a bureau. A bureau
consists of multiple drawers, and each drawer has a certain set of contents:
socks in one drawer, underwear in another, and sweaters in a third drawer.
homework drawers have compartments allowing different kinds of things
to be homework together. These are all structures that determine where
things go - when you are sorting the laundry - and where things can be
found - when you are getting dressed. Awk allows you to use the structure
of a text file in writing the procedures for putting things in and taking
things out.
Thus, the benefits of awk are best realized when the data has some kind of structure. A text file can be loosely or tightly structured. A chapter
containing major and minor sections has some structure. We'll look at a
script that extracts section headings and numbers them to produce an
outline. A table consisting of tab-separated items in columns might be
considered very structured. You could use an awk script to reorder columns
of data, or even change columns into rows and rows into columns.
Like sed scripts, awk scripts are typically invoked by means of a shell
wrapper. This is a shell script that usually contains the command line
that invokes awk as well as the script that awk interprets. Simple one-line
awk scripts can be entered from the command line.
Some of the things awk allows you to do are:
View a text file as a textual database made up of records and fields.
Use variables to manipulate the database.
Use arithmetic and string operators.
Use common programming constructs such as loops and conditionals.
Generate formatted reports.
Define functions.
Execute UNIX commands from a script.
Process the result of UNIX commands.
Process command-line arguments more gracefully.
Work more easily with multiple input streams.
Because of these features, awk has the power and range that users might
rely upon to do the kinds of tasks performed by shell scripts. In this
-----------------------------------------
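The text above is excerpted from the book "Sed&awk". Here is my idea:
my English is fairly poor and I don't recognize longer words, so I want to annotate the longer words with their Chinese definitions.
Yesterday I put together a word list in plain-text form (a.txt). I want to find every word longer than 7 characters in the text above
and look it up in a.txt. If the word list has an entry for the word, that entry should be added on the line
right after the line containing the word. The expected result looks like this: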
-----------------------------------------------------------------
retrieval is the process of extracting data from a file and generating
retrieval n.取回, 恢复, 修补,重获,挽救,拯救
extracting 萃取, 提取, 提炼
generating 发生, 产生
a report.
The key to all of these operations is that the data has some kind of
operations n.运转, 操作, 实施, 作用, 业务, 工作, 手术, 军事行动
structure. Let us illustrate this with the analogy of a bureau. A bureau
structure 数据类型)结构,结构体[STRUC]
illustrate vt.举例说明, 图解, 加插图于, 阐明 vi.举例
----------------------------------------------------
Part of the word list, a.txt:
------------------------------------
retrieval n.取回, 恢复, 修补,重获,挽救,拯救
extracting 萃取, 提取, 提炼
generating 发生, 产生
operations n.运转, 操作, 实施, 作用, 业务, 工作, 手术, 军事行动
structure 数据类型)结构,结构体[STRUC]
illustrate vt.举例说明, 图解, 加插图于, 阐明 vi.举例
conditionals adj.有条件的, 引起条件反应的
multiple adj.多样的, 多重的 n.倍数, 若干 v.成倍增加
conditionals adj.有条件的, 引起条件反应的
manipulate vt.(熟练地)操作, 使用(机器等), 操纵(人或市价、市场), 利用, 应付
------------------------------------------------
Thanks in advance, everyone!
[ Last edited by guixi on 2007-10-26 at 11:58 PM ]
2007-10-25 10:41

abcd
Silver Member
Points 1436
Posts 739
Registered 2007-10-11
Status Offline
『Post #2』:
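@echo off
rem Print each line of test.txt, then look up its words in a.txt.
for /f "delims=" %%i in (test.txt) do (
echo %%i
for %%a in (%%i) do (
rem Only words containing at least 8 consecutive letters are looked up.
echo %%a|findstr /r "[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]">nul 2>nul&&findstr /r "^%%a" a.txt
)
)
pause
It's just not very efficient.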
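2007-10-25 12:24

lxmxn
Moderator
Points 11386
Posts 4938
Registered 2006-7-23
Status Offline
『Post #3』:
It looks like the original poster is also learning awk and sed, so here is an awk script, which should be faster.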
awk-file.awk
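BEGIN{
# Split fields on spaces, tabs, commas, periods and hyphens.
FS="[ \t,.-]";
}
{
# gawk-only setting; it does not affect array subscripts,
# so the lookups below remain case-sensitive.
IGNORECASE=1;
if(FNR==NR){
# First file (a.txt): store each entry under its first field.
danci[$1]=$0;
}else{
# Second file (test.txt): print the line, then the entry of each known word.
print;
for(i=1;i<=NF;i++){
if($i in danci){
print danci[$i];
}
}
}
}
Command-Line:
gawk -f awk-file.awk a.txt test.txt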
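For readers less familiar with awk, the two ideas the script relies on can be sketched in Python. The field separator [ \t,.-] splits every line on spaces, tabs, commas, periods and hyphens, and the first field of each a.txt line becomes the lookup key. This is only an illustrative sketch, not part of the posted script; the helper names split_fields and build_index are made up for the example.

```python
import re

# Same character class as FS="[ \t,.-]" in the awk script.
FIELD_SEP = re.compile(r"[ \t,.-]")


def split_fields(line):
    """Split a line the way awk does with a regex field separator."""
    return FIELD_SEP.split(line)


def build_index(dictionary_lines):
    """Map the first field of each word-list line to the whole line,
    like danci[$1]=$0 in the awk script."""
    index = {}
    for line in dictionary_lines:
        fields = split_fields(line)
        index[fields[0]] = line
    return index


index = build_index([
    "retrieval n.取回, 恢复, 修补,重获,挽救,拯救",
    "so-called a. 所谓的,号称的",
])

# A hyphen is a separator, so "so-called" is stored under the key "so".
print(sorted(index))
# A leading period yields an empty first field, and "so" becomes a field.
print(split_fields(".so /work/macros"))
```

This keying is a plausible explanation for why a line such as ".so /work/macros/current/indexmacs" picks up the "so-called" entry in the output posted later in the thread.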
2007-10-25 17:54

guixi
『Post #4』:
Many thanks to abcd and moderator lxmxn for the enthusiastic help. I'm now testing it on the whole book.
2007-10-25 22:34

guixi
『Post #5』:
lxmxn, I tested it on a 0.8 MB text file and it runs quickly. However, I only want to annotate words longer than 7 characters, and it seems that words of 7 characters or fewer are annotated as well.
Here is part of the output:
-----------------------------------------------------------------------------
C.3.2 Background Details
Tim O'Reilly recommends The Joy of Cooking (JofC) index as an ideal index.
I examined the JofC index quite thoroughly and set out to write a new
indexing program that duplicated its features. I did not wholly duplicate
wholly ad. 完全地,全部,一概
duplicate n. 复制品,副本
the JofC format, but this could be done fairly easily if desired. Please
format n. 设计,安排,样式 v. 使格式化
look at the JofC index yourself to examine its features.
I also tried to do a few other things to improve on the previous index
previous a. 先,前,以前的;(to)在……之前
program and provide more support for the person coding the index.
C.3.3 Coding Index Entries
This section describes the coding of index entries in the document file.
file n. 锉刀;文件,档案 v. 锉
We use the .XX macro for placing index entries in a file. The simplest
file n. 锉刀;文件,档案 v. 锉
case is:
.XX "entry"
If the entry consists of primary and secondary sort keys, then we can code
primary a. 最初的,初级的;首要的,主要的,基本的
secondary a. 次要的,二级的;中级的,第二的
it as:
.XX "primary, secondary"
A comma delimits the two keys. We also have a .XN macro for generating
"See" references without a page number. It is specified as:
.XN "entry (See anotherEntry)"
第 497 页 共 502 页
While these coding forms continue to work as they have, masterindex
provides greater flexibility by allowing three levels of keys: primary,
primary a. 最初的,初级的;首要的,主要的,基本的
secondary, and tertiary. You'd specify the entry like so:
secondary a. 次要的,二级的;中级的,第二的
specify v. 指定,详细说明
.XX "primary: secondary; tertiary"
Note that the comma is not used as a delimiter. A colon delimits the primary
primary a. 最初的,初级的;首要的,主要的,基本的
and secondary entry; the semicolon delimits the secondary and tertiary
secondary a. 次要的,二级的;中级的,第二的
secondary a. 次要的,二级的;中级的,第二的
entry. This means that commas can be a part of a key using this syntax.
means n. 方法,手段
Don't worry, though, you can continue to use a comma to delimit the primary
primary a. 最初的,初级的;首要的,主要的,基本的
and secondary keys. (Be aware that the first comma in a line is converted
secondary a. 次要的,二级的;中级的,第二的
aware a.(of)知道的,意识到的
to a colon, if no colon delimiter is found.) I'd recommend that new books
recommend v. 推荐,介绍;劝告,建议
be coded using the above syntax, even if you are only specifying a primary
primary a. 最初的,初级的;首要的,主要的,基本的
and secondary key.
secondary a. 次要的,二级的;中级的,第二的
Another feature is automatic rotation of primary and secondary keys if
feature n. 特征,特色;特写
automatic n. 自动机械 a. 自动的,无意识的,机械的
primary a. 最初的,初级的;首要的,主要的,基本的
secondary a. 次要的,二级的;中级的,第二的
a tilde (~) is used as the delimiter. So the following entry:
.XX "cat~command"
is equivalent to the following two entries:
equivalent a.(to)相等的等值的 n. 相等物,等值物
.XX "cat command"
.XX "command: cat"
You can think of the secondary key as a classification (command, attribute,
secondary a. 次要的,二级的;中级的,第二的
classification n. 分类,分级
attribute n. 属性,品质,特征 v.(to)把……归于
function, etc.) of the primary entry. Be careful not to reverse the two,
etc. 等等
primary a. 最初的,初级的;首要的,主要的,基本的
reverse n. 相反,反转,颠倒;北面,后面
as "command cat" does not make much sense. To use a tilde in an entry,
enter "~~".
I added a new macro, .XB, that is the same as .XX except that the page
number for this index entry will be output in bold to indicate that it
bold a. 大胆的,勇敢的;冒失的;黑体的,粗体的
indicate v. 指出,指示;表明,暗示
is the most significant page number in a range. Here is an example:
significant a. 有意义的;重大的,重要的
.XB "cat command"
When troff processes the index entries, it outputs the page number
followed by an asterisk. This is how it appears when output is seen in
screen format. When coded for troff formatting, the page number is
format n. 设计,安排,样式 v. 使格式化
surrounded by the bold font change escape sequences. (By the way, in the
bold a. 大胆的,勇敢的;冒失的;黑体的,粗体的
JofC index, I noticed that they allowed having the same page number in
roman and in bold.) Also, this page number will not be combined in a range
bold a. 大胆的,勇敢的;冒失的;黑体的,粗体的
of consecutive numbers.
第 498 页 共 502
One other feature of the JofC index is that the very first secondary key
feature n. 特征,特色;特写
secondary a. 次要的,二级的;中级的,第二的
appears on the same line with the primary key. The old index program placed
primary a. 最初的,初级的;首要的,主要的,基本的
any secondary key on the next line. The one advantage of doing it the JofC
secondary a. 次要的,二级的;中级的,第二的
way is that entries containing only one secondary key will be output on
secondary a. 次要的,二级的;中级的,第二的
the same line and look much better. Thus, you'd have "line justification,
definition of" rather than having "definition of" indented on the next
definition n. 定义,解释
line. The next secondary key would be indented. Note that if the primary
secondary a. 次要的,二级的;中级的,第二的
primary a. 最初的,初级的;首要的,主要的,基本的
key exists as a separate entry (it has page numbers associated with it),
the page references for the primary key will be output on the same line
primary a. 最初的,初级的;首要的,主要的,基本的
and the first secondary entry will be output on the next line.
secondary a. 次要的,二级的;中级的,第二的
To reiterate, while the syntax of the three-level entries is different,
this index entry is perfectly valid:
.XX "line justification, definition of"
definition n. 定义,解释
It also produces the same result as:
.XX "line justification: definition of"
definition n. 定义,解释
(The colon disappears in the output.) Similarly, you could write an entry,
such as
.XX "justification, lines, defined"
or
.XX "justification: lines, defined"
where the comma between "lines" and "defined" does not serve as a delimiter
but is part of the secondary key.
secondary a. 次要的,二级的;中级的,第二的
The previous example could be written as an entry with three levels:
previous a. 先,前,以前的;(to)在……之前
.XX "justification: lines; defined"
where the semicolon delimits the tertiary key. The semicolon is output
with the key, and multiple tertiary keys may follow immediately after the
multiple a. 多样的,多重的 n. 倍数
secondary key.
secondary a. 次要的,二级的;中级的,第二的
The main thing, though, is that page numbers are collected for all primary,
primary a. 最初的,初级的;首要的,主要的,基本的
secondary, and tertiary keys. Thus, you could have output such as:
secondary a. 次要的,二级的;中级的,第二的
第 499 页 共 502
justification 4-9
lines 4,6; defined, 5
C.3.4 Output Format
One thing I wanted to do that our previous program did not do is generate
previous a. 先,前,以前的;(to)在……之前
an index without the troff codes. masterindex has three output modes:
troff, screen, and page.
The default output is intended for processing by troff (via fmt). It
contains macros that are defined in /work/macros/current/indexmacs.
These macros should produce the same index format as before, which was
format n. 设计,安排,样式 v. 使格式化
largely done directly through troff requests. Here are a few lines off
largely ad. 主要地,基本上;大量地,大规模地
the top:
$ masterindex ch01
.so /work/macros/current/indexmacs
so-called a. 所谓的,号称的
.Se "" "Index"
.XC
.XF A "A"
.XF 1 "applications, structure of 2; program 1"
.XF 1 "attribute, WIN_CONSUME_KBD_EVENTS 13"
.XF 2 "WIN_CONSUME_PICK_EVENTS 13"
.XF 2 "WIN_NOTIFY_EVENT_PROC 13"
.XF 2 "XV_ERROR_PROC 14"
.XF 2 "XV_INIT_ARGC_PTR_ARGV 5,6"
The top two lines should be obvious. The .XC macro produces multicolumn
output. (It will print out two columns for smaller books. It's not smart
enough to take arguments specifying the width of columns, but that should
be done.) The .XF macro has three possible values for its first argument.
An "A" indicates that the second argument is a letter of the alphabet that
alphabet n. 字母表
should be output as a divider. A "1" indicates that the second argument
contains a primary entry. A "2" indicates that the entry begins with a
primary a. 最初的,初级的;首要的,主要的,基本的
secondary entry, which is indented.
secondary a. 次要的,二级的;中级的,第二的
When invoked with the -s argument, the program prepares the index for
viewing on the screen (or printing as an ASCII file). Again, here are a
few lines:
$ masterindex -s ch01
第 500 页 共 502 页
A
applications, structure of 2; program 1
attribute, WIN_CONSUME_KBD_EVENTS 13
attribute n. 属性,品质,特征 v.(to)把……归于
WIN_CONSUME_PICK_EVENTS 13
WIN_NOTIFY_EVENT_PROC 13
XV_ERROR_PROC 14
XV_INIT_ARGC_PTR_ARGV 5,6
XV_INIT_ARGS 6
XV_USAGE_PROC 6
Obviously, this is useful for quickly proofing the index. The third type
of format is also used for proofing the index. Invoked using -p, it
format n. 设计,安排,样式 v. 使格式化
provides a page-by-page listing of the index entries.
$ masterindex -p ch01
Page 1
structure of XView applications
applications, structure of; program
XView applications
XView applications, structure of
XView interface
interface n.[地质]分界面,接触面,[物、化]界面
compiling XView programs
XView, compiling programs
Page 2
XView libraries
C.3.5 Compiling a Master Index
A multivolume master index is invoked by specifying the -m option. Each
master n. 名家;[M-]硕士 v. 精通,掌握 a. 主要的
option n.选项,选择权,[经]买卖的特权
set of index entries for a particular volume must be placed in a separate
volume n. 容积,体积;卷,册;音量,响度
file.
file n. 锉刀;文件,档案 v. 锉
$ masterindex -m -s book1 book2 book3
xv_init() procedure II: 4; III: 5
procedure n. 程序,手续,步骤
XV_INIT_ARGC_PTR_ARGV attribute II: 5,6
attribute n. 属性,品质,特征 v.(to)把……归于
XV_INIT_ARGS attribute I: 6
attribute n. 属性,品质,特征 v.(to)把……归于
Files must be specified in consecutive order. If the first file is not
file n. 锉刀;文件,档案 v. 锉
Volume 1, you can specify the number as an argument.
specify v. 指定,详细说明
$ masterindex -m 4 -s book4 book5
第 501 页 共 502 页
第 502 页 共 502 页
--------------------------------------------------------------------
Could you please take another look? Thanks!
2007-10-25 23:20

abcd
『Post #6』:
A check with the length function should take care of this.
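2007-10-26 10:18

guixi
『Post #7』:
Since I don't know gawk, I could work around this by filtering the word list by length, for example
sed "/^[a-zA-Z]\{8\}/!d" a.txt
which reduces the word list to the entries of 8 or more letters.
But I would still like to learn from the moderator's gawk script.
Thanks for the pointer, abcd. I tested your script yesterday; its speed is not encouraging, haha, but it does run and the script never crashed. Yesterday it did not occur to me to use the for %%a in (string) do command construct;
I kept trying to use a for /f statement. Haha, how silly of me.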
[ Last edited by guixi on 2007-10-26 at 11:32 AM ]
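The sed filter above keeps only the word-list lines that start with at least eight letters. As a hedged illustration of the same idea, the following Python sketch does the equivalent on a list of lines; the function name keep_long_headwords is invented for the example.

```python
import re

# Equivalent of sed "/^[a-zA-Z]\{8\}/!d": a line survives only if it
# begins with eight ASCII letters (longer headwords match as well).
EIGHT_LETTERS = re.compile(r"[a-zA-Z]{8}")


def keep_long_headwords(lines):
    """Drop word-list lines that do not start with 8 or more letters."""
    return [line for line in lines if EIGHT_LETTERS.match(line)]


word_list = [
    "retrieval n.取回, 恢复, 修补,重获,挽救,拯救",
    "format n. 设计,安排,样式 v. 使格式化",
    "multiple adj.多样的, 多重的 n.倍数, 若干 v.成倍增加",
]

for line in keep_long_headwords(word_list):
    print(line)
```

Like the sed command, this filters the word list rather than the text, so a short word in the text simply finds no entry to print.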
2007-10-26 11:30

abcd
『Post #8』:
As I already said in post #6:
use gawk's built-in length function.
2007-10-26 12:05

lxmxn
『Post #9』:
abcd is right: adding a check with the length function should be all it takes.
BEGIN{
FS="[ \t,.-]";
}
{
IGNORECASE=1;
if(FNR==NR){
danci[$1]=$0;
}else{
print;
for(i=1;i<=NF;i++){
# Only look up words longer than 7 characters.
if(length($i)>7 && ($i in danci)){
print danci[$i];
}
}
}
}
[ Last edited by lxmxn on 2007-10-26 at 12:15 PM ]
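For comparison, the same logic including the length check can be sketched in Python. This is a rough equivalent rather than a translation of the moderator's script: the function name annotate and the min_len parameter are assumptions made for the example, and the lookup is case-sensitive.

```python
import re

FIELD_SEP = re.compile(r"[ \t,.-]")  # same separators as FS in the awk script


def annotate(text_lines, dictionary_lines, min_len=8):
    """Return the text lines, each followed by the word-list entries of
    its words that are at least min_len characters long."""
    # First pass: index every word-list line by its first field.
    entries = {}
    for entry in dictionary_lines:
        entries[FIELD_SEP.split(entry)[0]] = entry

    # Second pass: echo each text line, then the entries of its long words.
    output = []
    for line in text_lines:
        output.append(line)
        for word in FIELD_SEP.split(line):
            if len(word) >= min_len and word in entries:
                output.append(entries[word])
    return output


dictionary = [
    "retrieval n.取回, 恢复, 修补,重获,挽救,拯救",
    "extracting 萃取, 提取, 提炼",
    "generating 发生, 产生",
    "file n. 锉刀;文件,档案 v. 锉",
]
text = ["retrieval is the process of extracting data from a file and generating"]

for out_line in annotate(text, dictionary):
    print(out_line)
```

With the default min_len of 8 (the same as length($i)>7), the short word "file" is skipped even though it has an entry in the word list.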
2007-10-26 12:14

guixi
『Post #10』:
Haha, nice. Thanks again to you both!
2007-10-26 23:18