Task Description

In this challenge, about 10,000,000 URLs are provided, of which only 100,000 come with label information. The task is to classify the unlabeled URLs.

1) Data Description

The labeled samples are provided in the labeled_data.tsv (http://14.18.146.162:7180/apweb/labeled_data.tsv.gz), and unlabeled samples are provided in the unlabel_data.tsv (http://14.18.146.162:7180/apweb/unlabel_data.tsv.gz).

Each line of labeled_data.tsv contains two columns, the URL and the label, separated by a tab, as follows.

URL Classification
http://sports.sina.com.cn/j/2015-04-20/21287581972.shtml 9001004000000

Each line of unlabel_data.tsv contains a single URL, as follows.

URL
http://sports.sina.com.cn/j/2015-04-20/18157581914.shtml
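The two-column format described above can be read with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, the file is assumed to be gzip-compressed UTF-8 text, and labels are kept as strings so that the 13-digit codes are not altered.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def parse_labeled_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (url, label) pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the label is always the final column.
        url, sep, label = line.rpartition("\t")
        if not sep:
            continue  # no tab found, so this is not a labeled row
        yield url, label.strip()


def read_labeled_file(path: str) -> Iterator[Tuple[str, str]]:
    """Read a gzip-compressed labeled file (encoding assumed to be UTF-8)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_labeled_lines(handle)
```

Lines without a tab are skipped rather than treated as errors, which is a design choice of this sketch and not a rule of the challenge.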

2) The descriptions of the labels are as follows:

9001000000000 Information
9001004000000 Information-Sport
9001006000000 Information-Travel
9001007000000 Information-Finance
9001013000000 Information-Health
9002000000000 Entertainment
9002001000000 Entertainment-Music
9002002000000 Entertainment-Video
9002003000000 Entertainment-Reading
9002004000000 Entertainment-Cartoon
9002005000000 Entertainment-Picture
9002006000000 Entertainment-Game
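
The label table above can be kept as a simple lookup in code. The sketch below is illustrative only: the names LABEL_NAMES and parent_label are hypothetical, and the rule that the first four digits identify the top-level category is an assumption read off the listed codes, not something the challenge states.

```python
# Label codes and names copied from the table above.
LABEL_NAMES = {
    "9001000000000": "Information",
    "9001004000000": "Information-Sport",
    "9001006000000": "Information-Travel",
    "9001007000000": "Information-Finance",
    "9001013000000": "Information-Health",
    "9002000000000": "Entertainment",
    "9002001000000": "Entertainment-Music",
    "9002002000000": "Entertainment-Video",
    "9002003000000": "Entertainment-Reading",
    "9002004000000": "Entertainment-Cartoon",
    "9002005000000": "Entertainment-Picture",
    "9002006000000": "Entertainment-Game",
}


def parent_label(code: str) -> str:
    """Return the top-level code for a label.

    Assumption: the first four digits identify the top-level category
    and the remaining digits are zero for a top-level code.
    """
    return code[:4] + "0" * (len(code) - 4)
```

For example, parent_label("9001004000000") gives "9001000000000", which the table lists as Information.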

Evaluation

Randomly selected unlabeled samples are used as the evaluation set, and the normalized accuracy is used as the evaluation metric. If t_i samples are correctly classified among the c_i URLs belonging to category i, the accuracy for category i is t_i / c_i. Assuming there are K categories, the normalized accuracy R is computed from the per-category accuracies of all K categories. The final leaderboard is ranked by R.
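
The exact formula for R is not reproduced in this text. The sketch below therefore assumes one common reading of "normalized accuracy": the unweighted mean of the per-category accuracies t_i / c_i over the K categories that occur in the evaluation set. The function name and the dictionary-based inputs are hypothetical, and the official scorer may differ.

```python
from collections import Counter
from typing import Dict


def normalized_accuracy(truth: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Mean per-category accuracy (assumed definition of R).

    truth maps each evaluated URL to its true label;
    predicted maps URLs to predicted labels (missing URLs count as wrong).
    """
    total = Counter()    # c_i: number of URLs whose true category is i
    correct = Counter()  # t_i: number of those URLs classified correctly
    for url, label in truth.items():
        total[label] += 1
        if predicted.get(url) == label:
            correct[label] += 1
    if not total:
        return 0.0
    # Average t_i / c_i over the K categories present in truth.
    return sum(correct[label] / count for label, count in total.items()) / len(total)
```

Under this assumption every category contributes equally to R, so a small category such as Information-Health weighs as much as a large one.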


Submission Guideline

Each submission includes: 1) the source code, 2) the executable code, 3) the associated documentation with environment settings and an algorithm description. The above materials should be sent to apweb@189.com before the deadline, and the email title should be 'teamname+datachallenge'.

The output file of the submitted code should be named "unlabel_data.tsv", and its format must be the same as that of the provided labeled_data.tsv, i.e. each line contains two columns, the URL and the label, separated by a tab.
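
A minimal sketch of producing output in this format is given below. The helper names are hypothetical, and the predictions are assumed to be available as (URL, label) pairs; how they are computed is up to each team.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_predictions(rows: Iterable[Tuple[str, str]], out: TextIO) -> int:
    """Write one 'URL<TAB>label' line per prediction and return the line count."""
    count = 0
    for url, label in rows:
        out.write(f"{url}\t{label}\n")
        count += 1
    return count


def predictions_to_text(rows: Iterable[Tuple[str, str]]) -> str:
    """Return the same output as a string, useful for checking the format."""
    buffer = io.StringIO()
    write_predictions(rows, buffer)
    return buffer.getvalue()
```

To create the actual file, the same function can be called with a handle opened as open("unlabel_data.tsv", "w", encoding="utf-8", newline=""); the UTF-8 encoding is an assumption of this sketch.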

If you have any questions, please contact xiangy@gsta.com .


Important Date

Submission Deadline: Aug 5, 2015


Awards

  • First Prize: 2,000 RMB
  • Second Prize: 1,000 RMB
  • Third Prize: 500 RMB

The top 3 teams may be invited to present their solutions in the meeting.


Links