Abstract
This article examines how inconsistency in training data affects the performance of text classifiers. Our experiments show that inconsistency, even at a level as high as 34%, hardly affects the effectiveness of the classifiers. Better classifiers perform better regardless of duplicates and label inconsistency. The implication is that past experiments (especially those on the Reuters-21578 collection) remain valid. In the course of the experiments, we propose a duplicate detection technique that is far more effective than previous ones. A new Chinese test collection for text categorization is also introduced and made available for general free download.