Testbed for Information Extraction from Deep Web
1. Introduction
Search results generated by searchable databases are served
dynamically, and their volume is far larger than that of the static
documents on the Web. These results pages have been referred to as the
Deep Web. We propose a testbed for information extraction from search
results. We chose 100 databases randomly from 114540 pages with forms,
so the databases cover a good variety. We selected 51 databases that
include URLs in their results pages and manually identified the target
information to be extracted. We also suggest evaluation measures for
comparing extraction methods, and methods for extending the target data.
2. Download
3. Files
- Each directory is named after the ID of a database.
- The file "top.txt" contains the URL of the query page.
- The file "top.html" is the query page.
- The file "num.txt" contains the number of search results in each results file.
- The file "keyword.txt" contains the keywords used to obtain the results pages.
- 5 files (1.html, 2.html, 3.html, 4.html and 5.html) are the results pages.
- 5 files (1.url.txt, 2.url.txt, 3.url.txt, 4.url.txt and 5.url.txt) contain
the URLs extracted from each result.
- 5 files (1.record1.txt, 2.record1.txt, 3.record1.txt, 4.record1.txt and 5.record1.txt)
contain the first search result.
- 5 files (1.field.txt, 2.field.txt, 3.field.txt, 4.field.txt and
5.field.txt) contain more detailed information about the first search result.
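The layout above can be read into memory with a short script. The following is a minimal sketch, not part of the testbed itself: the function names `read_text` and `load_database` are hypothetical, the file encoding is not stated on this page (undecodable bytes are simply replaced), and the sketch assumes that each `*.url.txt` file lists one URL per line. The contents of "num.txt", "keyword.txt", the record files and the field files are kept as raw text because their internal format is not described here.

```python
from pathlib import Path


def read_text(path):
    # Assumption: the encoding of the testbed files is unknown, so
    # undecodable bytes are replaced instead of raising an error.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def load_database(db_dir):
    """Collect the files of one database directory into a dictionary."""
    db_dir = Path(db_dir)
    database = {
        "id": db_dir.name,  # the directory name is the database ID
        "query_url": read_text(db_dir / "top.txt").strip(),
        "query_page": read_text(db_dir / "top.html"),
        "num": read_text(db_dir / "num.txt"),          # raw text
        "keywords": read_text(db_dir / "keyword.txt"),  # raw text
        "results": [],
    }
    for i in range(1, 6):  # results pages 1.html to 5.html
        # Assumption: one URL per non-empty line in i.url.txt.
        url_lines = read_text(db_dir / f"{i}.url.txt").splitlines()
        database["results"].append({
            "page": read_text(db_dir / f"{i}.html"),
            "urls": [u.strip() for u in url_lines if u.strip()],
            "record1": read_text(db_dir / f"{i}.record1.txt"),
            "field": read_text(db_dir / f"{i}.field.txt"),
        })
    return database
```

Calling `load_database` on one database directory returns the query URL, the query page, and a list of five entries, one per results page. A missing file raises `FileNotFoundError`, which makes incomplete directories easy to spot.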
4. Evaluation measure
- Evaluation measures are an important part of the testbed. For URL
extraction, let N be the number of URLs identified in the testbed, R be
the number of result URLs listed by a wrapper, and n be
the number of URLs correctly identified by the wrapper.
The precision of extraction is n/R and the recall is n/N. These
measures can be averaged over all databases.
- For identifying the result extent, we measure the "success rate" out of
the 51 databases. A database's result extent counts as successfully
identified only if exactly the right characters are identified.
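The measures can be computed with a few lines of code. The sketch below is an illustration rather than the scoring tool used by the authors. It takes precision as the fraction of wrapper-extracted URLs that are correct (n/R) and recall as the fraction of testbed URLs that the wrapper found (n/N). The function names are hypothetical, and several details are assumptions: URLs are compared as exact strings, duplicate URLs are collapsed, an empty denominator yields 0.0, and result extents are represented as strings keyed by database ID.

```python
def precision_recall(testbed_urls, extracted_urls):
    """Return (precision, recall) for one database."""
    testbed = set(testbed_urls)      # N = len(testbed)
    extracted = set(extracted_urls)  # R = len(extracted)
    n = len(testbed & extracted)     # URLs the wrapper got right
    precision = n / len(extracted) if extracted else 0.0
    recall = n / len(testbed) if testbed else 0.0
    return precision, recall


def average_scores(per_database):
    """Average a list of (precision, recall) pairs over all databases."""
    if not per_database:
        return 0.0, 0.0
    count = len(per_database)
    mean_precision = sum(p for p, _ in per_database) / count
    mean_recall = sum(r for _, r in per_database) / count
    return mean_precision, mean_recall


def success_rate(testbed_extents, extracted_extents):
    """Fraction of databases whose extent matches character for character."""
    if not testbed_extents:
        return 0.0
    hits = sum(
        1
        for db_id, extent in testbed_extents.items()
        if extracted_extents.get(db_id) == extent
    )
    return hits / len(testbed_extents)
```

For example, if the testbed lists four URLs for a database and a wrapper returns three URLs of which two are correct, `precision_recall` gives a precision of 2/3 and a recall of 2/4. A database missing from `extracted_extents` counts as a failure in `success_rate`.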
- Tetsuya Nakatoh, Yasuhiro Yamada and Sachio Hirokawa: Automatic
Generation of Deep Web Wrappers based on Discovery of Repetition,
Proceedings of the First Asia Information Retrieval Symposium (AIRS
2004), pp.269-272, Beijing, China, 2004.
- Yasuhiro Yamada, Nick Craswell, Tetsuya Nakatoh and Sachio Hirokawa:
Testbed for Information Extraction from Deep Web,
Proc. of the 13th International World Wide Web Conference,
Alternate Track Papers and Posters, pp.346-347,
New York, USA, May 17-22, 2004.
(pdf, poster)
- Tetsuya Nakatoh, Keisuke Ohmori, Yasuhiro Yamada and Sachio Hirokawa:
Complex Query and Metadata, International Symposium on
Information Science and Electrical Engineering 2003, Japan,
November 13-15, 2003. (pdf)
- Tetsuya Nakatoh, Yasunori Koga, A. Uhl and Sachio Hirokawa:
Automatic Estimation of Query Syntax for Search Sites,
Proc. PYIWIT2002 (Pan-Yellow-Sea International Workshop on
Information Technologies for Network Era), pp.329-332, March
2002. (pdf)
- Tetsuya Nakatoh, Miyuki Sakai, Yasunori Koga and Sachio Hirokawa:
Generation of Query URL for Search Sites, Proc. SSGRR2002,
CD-ROM, January 2002. (pdf)
- Sachio Hirokawa, Seiichirou Watanabe, Yasunori Koga, Tsuyoshi
Taguchi: Automatic Feature Extraction of Search Sites,
Proc. SSGRR 2001, CD-ROM, 2001. (pdf)
- Tsuyoshi Taguchi, Yasunori Koga and Sachio Hirokawa: Integration
of Search Sites of the World Wide Web, Proc. of the
International Forum cum Conference on Information Technology and
Communication, Vol.2, pp. 25-32, 2000. (pdf)
ChangeLog
- 2004/6/9 testbed Ver 1.02
(changes to 1.record1.txt .... 5.record1.txt)
- 2004/3/26 testbed Ver 1.01 (added more detailed information about
the first search result)
- 2004/2/20 testbed Ver 1.00
Hirokawa lab.
Mail: daisen(at)matu.cc.kyushu-u.ac.jp
Kyushu University
Hakozaki 6-10-1, Higasi-ku, Fukuoka 812-8581, Japan
Tel: +81-92-642-2296
Fax: +81-92-642-2294