Research Projects

My research has focused on adaptive information retrieval, query performance prediction, query expansion, embedding-based data representation, search intent mining and result diversification, bipartite graph-based ranking, and deep learning for IR. I am always interested in extending my research to related areas. I am also conducting research on natural language processing and social media analysis. My current and past research projects are briefly described below:

Query Performance Prediction

In information retrieval (IR), query performance prediction (QPP) aims at predicting the effectiveness of a system for a given query without resorting to relevance judgments. QPP is useful to inform an IR system whether a given query is easy or difficult, allowing the system to process it differently. For example, for a query predicted to be ineffective, the system could either apply a specific automatic query reformulation or engage in an interactive session with the user (i.e., conversational IR) to understand the query and provide a better answer. Accurate QPP predictors are therefore valuable, yet building them remains challenging. QPP usually relies on query features that are extracted either before running the query through the system (pre-retrieval) or from the initially retrieved documents (post-retrieval). In this project, the objectives are to develop QPP features and learn a machine learning (ML) model that combines pre- and post-retrieval features to predict the effectiveness of a query. We have introduced summarized LETOR features as query features and combined them using an ML model to predict query performance. We have also developed several effective post-retrieval QPP predictors. The experiments on standard TREC benchmarks show that our proposed method outperforms the known related QPP methods. We are currently focusing on selecting QPP features and learning an ML model to enhance query performance prediction. In the future, we will experiment with and evaluate QPP on question answering collections.
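
As a minimal sketch of the summarized LETOR features mentioned above: the per-document feature values of the top retrieved documents are collapsed into one fixed-length query-level vector that an ML model can then consume. The function name `summarize_letor`, the choice of summary statistics (minimum, maximum, mean, standard deviation), and the toy feature values are assumptions made for illustration; they are not taken from the publications of this project.

```python
from statistics import mean, pstdev


def summarize_letor(doc_features):
    """Collapse per-document feature rows into one query-level vector.

    doc_features: one row per retrieved document, one column per
    LETOR-style feature (hypothetical values in this sketch).
    """
    summary = []
    for column in zip(*doc_features):  # iterate over features, not documents
        summary.extend([min(column), max(column), mean(column), pstdev(column)])
    return summary


# Three top-ranked documents, two features each (illustrative numbers only).
top_docs = [
    [12.0, 0.5],
    [10.0, 0.25],
    [8.0, 0.75],
]
query_vector = summarize_letor(top_docs)
print(query_vector)
```

Each query yields a vector of the same length regardless of how many documents were retrieved, which is what allows a single model to be trained across queries.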

  1. Md Zia Ullah et al., Query Performance Prediction Focused on Summarized Letor Features, The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2018.
  2. Md Zia Ullah et al., Query Performance Prediction and Effectiveness Evaluation Without Relevance Judgments: Two Sides of the Same Coin, The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2018.
  3. Md Zia Ullah et al., Forward and backward feature selection for query performance prediction, The 35th ACM/SIGAPP Symposium On Applied Computing (SAC), 2020.

Adaptive Information Retrieval

Modern information retrieval (IR) systems have become more and more complex, involving a large number of components and hyper-parameters. For example, a system may choose from a set of possible retrieval models (e.g., BM25, LM) or various query expansion parameters, whose values greatly influence the overall retrieval effectiveness. Traditionally, these parameters are set once, globally at the system level, based on training queries. However, a global configuration may not suit all future queries, which may have diverse characteristics. In this project, the objective is to treat queries individually and improve retrieval effectiveness. To handle each query individually, we have proposed an adaptive IR approach that predicts the best system configuration per query. We cast this problem as a ranking of the possible system configurations and applied a learning-to-rank technique. The experiments on standard ad hoc TREC benchmarks show that this approach can significantly outperform the traditional method of tuning the system as well as known related methods. In a further study, we improved our adaptive IR approach by selecting a pool of configurations using risk-sensitive criteria.
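
The idea of ranking system configurations per query can be sketched as follows. This is only an illustration under stated assumptions: the linear scorer over query-configuration interaction features, the fixed `weights` (a stand-in for what a learning-to-rank model would learn), and the two candidate configurations are all hypothetical.

```python
def interaction_features(query_features, config_features):
    """Pair every query feature with every configuration feature."""
    return [q * c for q in query_features for c in config_features]


def rank_configurations(query_features, configurations, weights):
    """Order candidate configurations from best to worst predicted score."""
    def score(config):
        joint = interaction_features(query_features, config["features"])
        return sum(w * x for w, x in zip(weights, joint))

    return sorted(configurations, key=score, reverse=True)


# Hypothetical candidates, described by one-hot indicator features.
configurations = [
    {"name": "BM25, no expansion", "features": [1.0, 0.0]},
    {"name": "BM25, with expansion", "features": [0.0, 1.0]},
]
# Stand-in for the weights a learning-to-rank model would learn.
weights = [0.2, 0.6, 0.7, 0.1]

short_query = [1.0, 0.0]  # hypothetical query features
long_query = [0.0, 1.0]

best_for_short = rank_configurations(short_query, configurations, weights)[0]["name"]
best_for_long = rank_configurations(long_query, configurations, weights)[0]["name"]
print(best_for_short, "|", best_for_long)
```

The interaction features matter here: with a purely additive scorer, the query features would shift every configuration's score by the same amount, and all queries would receive the same configuration.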

  1. Md Zia Ullah et al., Learning to Adaptively Rank Document Retrieval System Configurations, ACM Transactions on Information Systems (ACM TOIS), 2019.
  2. Josiane Mothe and Md Zia Ullah, Apparatus and method for information retrieval using a set of pre-selected search configurations using efficiency and risk functions, 19305984.7, European patent, 2019
  3. Josiane Mothe and Md Zia Ullah, Defining an Optimal Configuration Set for Selective Search Strategy – A Risk-Sensitive Approach, The 30th ACM International Conference on Information and Knowledge Management (CIKM '21), 2021.

Statistical Analysis of Information Retrieval System Components

Search engines involve various components and hyper-parameters whose values greatly influence the overall retrieval effectiveness when treating individual queries. When handling a new collection, deciding which system component to tune first is a challenge, since many factors influence system effectiveness, such as the system components, internal parameters, and the document collection. In this project, the objective is to analyze which modules and parameters influence system effectiveness the most. We have performed this statistical analysis at various levels using several data analysis methods, each appropriate for revealing a different aspect of the problem. We use analysis of variance (ANOVA) to identify the components that have a statistically significant influence on effectiveness, classification and regression trees (CART) to model the impact of the different component modalities, and data visualization. We are currently focusing on analyzing these IR components on a large scale for various collections.
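
To illustrate the ANOVA step, the sketch below computes a one-way F statistic for a single component, grouping effectiveness scores by the modality of that component. The function name, the grouping by a hypothetical "retrieval model" component, and the score values are assumptions for illustration only; the actual study may use a different ANOVA design.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of score groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation explained by the component (between groups).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation (within groups).
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical effectiveness scores, grouped by retrieval model.
scores_by_model = {
    "model_a": [0.2, 0.3, 0.4],
    "model_b": [0.5, 0.6, 0.7],
}
f_stat = one_way_anova_f(list(scores_by_model.values()))
print(round(f_stat, 3))
```

A large F value indicates that the choice of modality explains much of the variance in effectiveness. Deciding statistical significance additionally requires comparing F against the F distribution with (df_between, df_within) degrees of freedom, which this sketch does not do.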

  1. Md Zia Ullah et al., Studying the Variability of System Setting Effectiveness by Data Analytics and Visualization, Conference and Labs of the Evaluation Forum (CLEF), 2019.

Check-worthy Claim Prediction

Social media eases the spreading of information, makes its diffusion quicker, and reaches potentially more people than traditional media, in many cases regardless of the information quality. Automatic fact-checking could be a solution to warn social media users and readers, or even to stop the spreading of fake news. In this project, the objective is to identify and verify check-worthy claims. In our approach, we represent claims using features based on information nutritional labels, word embeddings, and linguistics. To predict check-worthy claims, we learn a machine learning model. The experimental results on standard benchmark datasets show that our approach improves performance compared to related methods. We are currently focusing on integrating context-aware features and performing feature selection based on a deeper analysis.

  1. Md Zia Ullah et al., Information Nutritional Label and Word Embedding to Estimate Information Check-Worthiness, The 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2019.

Histopathology Slides based Cancer Detection

Germinal center (GC) and non-germinal center (NGC) are two different types of lymphoma (cancer), and classifying these types is important in cancer diagnosis. Several algorithms have been developed based on gene expression and some biological assumptions, which are too costly to obtain. In this project, the objective is to classify GC vs NGC types from a whole-slide histopathology image. In this regard, we propose to classify the whole-slide image (WSI) by leveraging the region of interest (ROI) based on whole-slide annotated images. First, we consider the image patch from the bounding box of the annotated slide at a specific resolution level. Then, we map the extracted annotated image patch from the annotation slide to the original slide at a higher resolution level and extract regular patches from the original whole-slide image. For GC vs NGC classification, we consider multiple convolutional neural networks (CNNs), including GoogLeNet, AlexNet, VGGNet, and ResNet, to enable diverse learning from the patches. We also ensemble the prediction probabilities of the multiple deep models to obtain a more accurate prediction. We evaluate our approach using a recent WSI dataset from Oncopole Toulouse. Currently, we are focusing on combining the predictions of classifiers learned from higher and lower resolution levels of the images.
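
The ensembling of prediction probabilities can be sketched as simple averaging (soft voting). The averaging rule is an assumption, since the exact combination scheme is not specified here, and the probability values and names below are hypothetical.

```python
CLASSES = ["GC", "NGC"]


def ensemble_probabilities(model_outputs):
    """Average class probabilities over several models (soft voting)."""
    n_models = len(model_outputs)
    n_classes = len(model_outputs[0])
    return [
        sum(output[c] for output in model_outputs) / n_models
        for c in range(n_classes)
    ]


# Hypothetical [P(GC), P(NGC)] outputs of three CNNs for the same slide.
slide_predictions = [
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
]
averaged = ensemble_probabilities(slide_predictions)
predicted = CLASSES[averaged.index(max(averaged))]
print(predicted, averaged)
```

Averaging lets two confident models outweigh one that disagrees, as in this toy case, while still taking every model's confidence into account.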

Jargon Detection and its Application to Cyber-Security

Jargon detection aims at extracting meaningful keyphrases from text documents, where the keyphrases provide a compact representation of the document content. These extracted keyphrases can serve as seeds in various potential applications, such as filtering documents, clustering text documents, identifying possibly related documents, or scoring text documents. A keyphrase is a sequence of one or more words (or n-grams). The extraction pipeline comprises several steps, such as pre-processing of the text (e.g., parsing, POS tagging, stopword removal, stemming or lemmatizing), candidate keyphrase extraction, and scoring and ranking of the candidates. In this project, our objective is to develop tools that help extract keyphrases and create or update specific, topic-oriented lexicons by selecting keyphrases. The contextualized lexicon is then used to score risk-sensitive documents based on the occurrences of keyphrases.
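
A minimal sketch of such a pipeline is given below, assuming a tiny hand-written stopword list, candidates formed from runs of non-stopwords, and raw frequency as the score. All names, the stopword list, and the example text are hypothetical; POS tagging and stemming are omitted to keep the sketch self-contained.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "on"}


def candidate_keyphrases(text, max_len=2):
    """Extract n-grams (n <= max_len) that do not span a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates, run = [], []
    for token in tokens + [None]:  # None flushes the final run
        if token is None or token in STOPWORDS:
            for n in range(1, max_len + 1):
                for i in range(len(run) - n + 1):
                    candidates.append(" ".join(run[i:i + n]))
            run = []
        else:
            run.append(token)
    return candidates


def score_document(text, lexicon):
    """Score a document by counting occurrences of lexicon keyphrases."""
    counts = Counter(candidate_keyphrases(text))
    return sum(counts[phrase] for phrase in lexicon)


text = ("The phishing attack used a phishing email and "
        "the phishing attack spread malware.")
counts = Counter(candidate_keyphrases(text))
top_phrase = counts.most_common(1)[0][0]
risk_score = score_document(text, {"phishing attack", "malware"})
print(top_phrase, risk_score)
```

In this toy example the lexicon is given by hand; in practice it would be built by selecting from the ranked candidates.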

Search Intent Mining and Result Diversification

With an information need in mind, the user usually formulates a query and issues it to the search engine. The search engine responds with a ranked list of documents/answers to fulfil the information need. Web search queries are usually vague, ambiguous, or tend to have multiple intents. Users may have different search intents while issuing the same search query (e.g., the word “address” may mean the place where someone lives or a formal speech delivered to an audience). If the issued search query conveys a variety of interpretations, the ad hoc search result may be far from “what the user wants to have.” In this project, our objectives are to understand the search query and present the ranked list of documents/answers accordingly. We have performed query understanding by extracting the subtopics that cover the dynamic search intents underlying the search query. We have also developed a method for ranking the subtopics of the search query by exploiting locally trained word-embedding-based features and a bipartite graph-based ranking, and by estimating the novelty of subtopics through a combination of contextual and categorical similarities. We have experimented with and evaluated our proposed methods on two benchmark datasets and a large-scale Web corpus; the results show better performance than known related works.
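
To make the trade-off between relevance and novelty concrete, here is a generic greedy diversification sketch in the style of maximal marginal relevance (MMR). It is not the bipartite graph-based method of this project: word-overlap (Jaccard) similarity stands in for the contextual and categorical similarities, and the subtopics and relevance scores are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two subtopic strings."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)


def diversify(candidates, trade_off=0.5):
    """Greedily order subtopics by relevance minus redundancy.

    candidates: dict mapping subtopic text to a relevance score.
    """
    remaining = dict(candidates)
    selected = []
    while remaining:
        def gain(text):
            # Redundancy is the similarity to the closest selected subtopic.
            redundancy = max((jaccard(text, s) for s in selected), default=0.0)
            return trade_off * remaining[text] - (1 - trade_off) * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical subtopics of the ambiguous query "address".
subtopics = {
    "address change form": 0.9,
    "change of address form": 0.85,
    "gettysburg address speech": 0.6,
}
ranking = diversify(subtopics)
print(ranking)
```

Although "change of address form" is more relevant than "gettysburg address speech", it is ranked last because it is nearly redundant with the first pick, so the top of the list covers both intents.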

  1. Md Zia Ullah and Masaki Aono, A Bipartite Graph-based Ranking Approach to Query Subtopics Diversification Focused on Word Embedding Features, IEICE Transactions on Information and Systems (IEICE TOIS), 2016.
  2. Md Zia Ullah et al., Query Subtopic Mining Exploiting Word Embedding for Search Result Diversification, Asia Information Retrieval Societies Conference (AIRS), 2016.
  3. Md Zia Ullah and Masaki Aono, Query Subtopic Mining for Search Result Diversification, 1st International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA), 2014.

Information Retrieval in the Microblog Sphere

Social network platforms are not only places for maintaining social relationships but also act as valuable information sources. Users often turn to social network platforms to share their personal views, experiences, and important news while also getting some information. Among several social network sites, Twitter is now the most popular, where users post short-text tweets or interact through replies/mentions whenever a notable event occurs. That is why information retrieval in the Twitter-sphere has attracted much attention, despite its many complexities. In this project, our objective is to retrieve and re-rank the tweets for a search query while taking the temporal dimension into account. Vocabulary mismatch and temporal burst are two challenging issues of short text in microblog retrieval. To re-rank the short-text tweets, we have proposed an efficient and effective learning-to-rank approach that addresses the vocabulary mismatch and temporal burst issues based on content and context-aware features. We have experimented with and evaluated our proposed approach on two benchmark datasets, and it shows better performance than the known related methods.

  1. Md Zia Ullah et al., Query Expansion for Microblog Retrieval Focusing on an Ensemble of Features, Journal of Information Processing (JIP), 2018.
  2. Md Zia Ullah et al., Microblog Retrieval Using Ensemble of Feature Sets Through Supervised Feature Selection, IEICE Transactions on Information and Systems (IEICE TOIS), 2017.

Image Annotation using Multi-modal Information

With the spread of various social network services, including Facebook, Twitter, Instagram, and Flickr, there has been an explosive growth of images on the Internet. Such collections of images could be leveraged in various potential applications, such as recommendation of restaurant menus, tourist guides, and entertainment facilities, traffic congestion and local weather information, crime investigation, and so on. Developing such applications requires semantic links or annotations among these images. However, there is no such annotation for most of the image data on the Internet. Some description of the image is available on the Web page where the image appears; however, the relationship between the surrounding text and the image varies greatly, and the text is often redundant or irrelevant. Despite the vast applicability of such multi-modal information in machine learning, it provides only weak supervision, which makes the problem challenging. In this project, our objective is to annotate images with semantically relevant keywords or concepts. We have proposed a new method to annotate images based on an ontology, a graph structure derived from multi-modal information, and machine learning. First, we have developed an ontology-based approach to harvest training examples from the noisy-labelled images collected from the Web. Second, we have introduced a graph structure to model the context around the image and defined a new kernel function to propagate ontology-based text and image features across multi-domain concepts. Third, we have ensembled voting strategies and probability estimates from multiple binary classifiers to tackle multi-class multi-label problems. We have experimented with and evaluated the proposed method using two benchmark datasets (ImageCLEF 2013 and 2014), and shown that our approach outperforms the known related methods.

  1. Md Zia Ullah et al., Ontology-based Classification for Multi-label Image Annotation, 1st International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA), 2014.
  2. Md Zia Ullah et al., KDEVIR at ImageCLEF 2014 Scalable Concept Image Annotation Task: Ontology-based Automatic Image Annotation, Conference and Labs of the Evaluation Forum (CLEF), 2014.

Bipartite Graph and its Application to Health Informatics

With a vast amount of medical knowledge data available on the Internet, it is becoming increasingly practical to help doctors in clinical diagnostics by suggesting plausible diseases predicted by applying data and text mining technologies. Since genetic diseases are difficult to diagnose because of their diverse symptoms, patients are often misdiagnosed or experience long diagnostic delays. In this project, our objective is to retrieve and rank possible genetic diseases linked through causative genes, given a set of clinical phenotypes. First, we have analyzed the human disease network (HDN) and the protein-protein interaction (PPI) network to predict causal genes and explored the pathways from phenotypes to genetic diseases through their causal genes. Second, we have associated two sets of bipartite graphs and introduced a weighting scheme to approximate the edge weights. We have experimented with and evaluated our proposed method on publicly available datasets, and the results show that our proposed approach outperforms the known related methods.
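
The association of the two bipartite graphs can be sketched as follows: each disease is scored by summing, over all phenotype-gene-disease paths, the product of the two edge weights. The path-product scoring, the placeholder identifiers, and the edge weights are assumptions for illustration; the weighting scheme actually used in this project is not reproduced here.

```python
from collections import defaultdict


def rank_diseases(phenotypes, phenotype_gene, gene_disease):
    """Rank diseases reachable from the given phenotypes through genes."""
    scores = defaultdict(float)
    for phenotype in phenotypes:
        for gene, w_pg in phenotype_gene.get(phenotype, {}).items():
            for disease, w_gd in gene_disease.get(gene, {}).items():
                # One phenotype -> gene -> disease path contributes w_pg * w_gd.
                scores[disease] += w_pg * w_gd
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weighted bipartite graphs with placeholder identifiers.
phenotype_gene = {
    "P1": {"G1": 0.5, "G2": 0.5},
    "P2": {"G2": 1.0},
}
gene_disease = {
    "G1": {"D1": 1.0},
    "G2": {"D1": 0.25, "D2": 0.75},
}
ranking = rank_diseases(["P1", "P2"], phenotype_gene, gene_disease)
print(ranking)
```

Here D2 is ranked above D1 because both observed phenotypes point to gene G2, which is most strongly linked to D2.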

  1. Md Zia Ullah et al., Estimating a Ranked List of Human Genetic Diseases by Associating Phenotype-Gene with Gene-Disease Bipartite Graphs, ACM Transactions on Intelligent Systems and Technology (ACM TIST), 2015.
  2. Md Zia Ullah et al., Estimating a Ranked List of Human Hereditary Diseases for Clinical Phenotypes by Using Weighted Bipartite Networks, IEEE Engineering in Medicine and Biology Society (EMBS), 2013.