A Proposed Model for Focused Crawling and Automatic Text Classification of Online Crime Web Pages
DOI:
https://doi.org/10.59167/tujnas.v6i6.1329
Keywords:
Crime Data Mining, Web Mining, Focused Crawling, Classification
Abstract
With the exponential growth of textual information available on the Internet, there is an emerging need to find relevant, timely, and in-depth knowledge about crime. The sheer volume of such data makes retrieving, analyzing, and using the valuable information in these texts manually a very difficult task. In this paper, we address a challenging task: crawling and classifying crime-specific knowledge on the Web. To that end, we introduce a model for online crime text crawling and classification. First, a crime-specific web crawler is designed to collect crime-related web pages from news websites. In this crawler, a binary Naive Bayes classifier filters crime web pages from others. Second, a multi-class classification model categorizes the crime pages into their appropriate crime types. In both steps, several feature selection methods are applied to select the most important features. Finally, the model is evaluated on a manually labeled corpus and on real-world online data. The experimental results on the manually labeled corpus indicate that Naive Bayes with mutual information and odds ratio feature selection can accurately distinguish crime web pages from others, with an F1 measure of 0.99. The results also show that the Naive Bayes classification models can accurately classify crime documents into their appropriate crime types, with a Macro-F1 measure of 0.87. Our results on real-world online data further show that the focused crawler with two-level classification is very effective for gathering high-quality collections of crime Web documents and for classifying them.
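The two-level classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `NaiveBayes` class, the `classify` helper, and the toy training sentences are all hypothetical, the classifier is a plain multinomial Naive Bayes with add-one smoothing, and the feature selection step (mutual information, odds ratio) is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        # Words never seen in training are ignored.
        tokens = [t for t in doc.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus: (text, level-1 label, level-2 crime type).
TRAIN = [
    ("police arrested a suspect after the armed robbery", "crime", "robbery"),
    ("man charged with murder of his neighbour", "crime", "murder"),
    ("hackers stole bank data in online fraud", "crime", "fraud"),
    ("court sentences thief for bank robbery", "crime", "robbery"),
    ("team wins the football championship final", "other", None),
    ("new phone released with better camera", "other", None),
    ("stock market rises as economy grows", "other", None),
    ("festival celebrates local music and food", "other", None),
]

# Level 1: binary filter (crime vs. other), as a crawler would use it.
crime_filter = NaiveBayes().fit([t for t, _, _ in TRAIN], [l for _, l, _ in TRAIN])

# Level 2: multi-class crime-type classifier, trained on crime pages only.
crime_docs = [(t, k) for t, l, k in TRAIN if l == "crime"]
type_classifier = NaiveBayes().fit([t for t, _ in crime_docs], [k for _, k in crime_docs])


def classify(page_text):
    """Return the crime type of a page, or None if the filter rejects it."""
    if crime_filter.predict(page_text) != "crime":
        return None
    return type_classifier.predict(page_text)


print(classify("police arrested suspect for robbery of a bank"))
print(classify("football team celebrates championship"))
```

In a focused crawler, the level-1 decision would determine whether a fetched page is kept; the level-2 classifier runs only on the pages that pass the filter.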
License
Copyright (c) 2023 Muneer A. S. Hazaa, Fadl M. Ba-Alwi, Mohammed Albared, Helmi Al-Salehi

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
From July 2025 onward, all TUJNAS publications are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This open license allows anyone to share (copy and redistribute) and adapt (remix, transform, and build upon) the work in any medium or format, for any purpose (even commercially), as long as appropriate credit is given to the original author(s) and source. This permissive framework encourages scholarly innovation, translation, and integration into wider academic outputs by removing unnecessary legal barriers. Users of TUJNAS content must provide proper attribution and indicate if any changes were made to the original work. By enabling unrestricted reuse, the CC BY 4.0 license maximizes the reach and impact of research findings while ensuring that authors receive full recognition for their work. (For complete legal details of the CC BY 4.0 license, please refer to the official Creative Commons website.)
Submissions (from July 2025 onward): By submitting a manuscript to TUJNAS for publication (Volume 10, Issue 2, 2025 and thereafter), authors confirm the following:
- Originality: The submission is original, has not been published elsewhere, and is not under consideration by another journal.
- Copyright Retention: The author(s) retain copyright of the work, but grant TUJNAS a non-exclusive right to publish, reproduce, distribute, and archive the article.
- Open Access License: Upon acceptance, the article will be published open access under the CC BY 4.0 license.
- Repository Deposit: The author(s) agree that the full text and metadata of the article may be deposited in digital archives or repositories, to facilitate indexing and reuse under the CC BY 4.0 license.
- Indexing and Sharing: The author(s) acknowledge that TUJNAS may make the article available to third-party indexing, abstracting, and discovery services under the CC BY 4.0 license, without the need for additional permission.
These submission terms ensure that authors understand and consent to the open-access, licensed nature of TUJNAS publications from the outset.