M. Hanumanthappa, Deepa T. Nagalavi, Manish Kumar
Information retrieval, in the context of this survey, is the task of retrieving relevant and useful information from e-newspapers. Electronic newspapers (e-newspapers) are electronic replicas of traditional newspapers, and they are becoming increasingly popular because of the ease and convenience of accessing them. Newspapers are a source of timely information: they are documents comprising news items and several independent informative articles. Many newspapers also present news on the same subject from different perspectives. In this fast-moving era, it is impractical for a reader to read multiple newspapers. It is therefore essential to quickly summarize articles collected from different newspapers and present them to the reader in a compact and concise manner without compromising the structure and format of the news. A system that achieves this task should first parse the e-newspapers available in PDF format and convert them to text format. Second, data mining techniques are applied to identify and summarize the articles from the various newspapers. This survey focuses on article identification methods and on popular extraction tools used for extracting the contents of e-newspapers for conversion from PDF to text format. A comparative study of extraction tools based on source type, programming language, and working characteristics is also presented.