Datasets for semi-structured data record detection
The first dataset, TWEB_TB2, contains 200 static Web pages collected from various online shopping and university Web sites. The second dataset, TWEB_TB3, contains 100 pages that mainly feature complicated flat data records and intertwined data records.
Both datasets were prepared in conjunction with the following paper: Lidong Bing, Wai Lam, and Tak-Lam Wong. “Robust Detection of Semi-structured Web Records Using DOM Structure Knowledge Driven Model.” ACM Transactions on the Web (TWEB). More details about the datasets can be found in the paper.
Download: