Vibou.fr

Publications

Dec. 2015
Use of Web Popularity on Entity Centric Document Filtering
Vincent Bouvier
Patrice Bellot
Web Intelligence Conference 2015
Filtering pages about an entity (person, company, music band...) so that only interesting pages are kept is a real challenge. Interest can be qualified using criteria such as recency and novelty. In the last decade, classification systems have been trained to detect the interest of a document with regard to an entity. For scalability reasons, manually annotating a training set for each tracked entity is not feasible. Some approaches therefore strive to build entity-independent systems. These approaches obtain state-of-the-art performance, but we show that they can be improved. Time features differ from one entity to another, so the classifier cannot estimate relevant statistics from these observations. Instead of having one model per entity or one model for all entities, we propose an approach that uses one model per cluster of entities, based on the entity's web popularity. We also introduce different strategies for automatic classification model selection. We test our approach on the Knowledge Base Acceleration (KBA) framework from TREC and show that it brings significant improvements over a non-cluster-based method.
Bouvier, V., Bellot, P. (2015). Use of Web Popularity on Entity Centric Document Filtering. In Web Intelligence Conference 2015.
@inproceedings{Bouvier2015-3, author={Bouvier, V. and Bellot, P.}, year={2015}, title={Use of Web Popularity on Entity Centric Document Filtering}, booktitle={Web Intelligence Conference 2015}}
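The idea of one model per cluster of entities, grouped by web popularity, can be illustrated with a minimal sketch. It is not the authors' implementation: the log-scale binning, the `max_log` bound, the function names and the popularity counts are all assumptions made for the example.

```python
from collections import defaultdict
import math


def popularity_cluster(popularity, n_clusters=3, max_log=6.0):
    """Map a raw popularity count to a cluster index using log-scale bins.

    The log-scale binning and the max_log bound are illustrative
    assumptions, not taken from the paper.
    """
    score = math.log10(popularity + 1)
    width = max_log / n_clusters
    # Entities above max_log fall into the last cluster.
    return min(int(score // width), n_clusters - 1)


def group_by_cluster(entity_popularity, n_clusters=3):
    """Return {cluster index: sorted list of entity names}."""
    clusters = defaultdict(list)
    for entity, popularity in entity_popularity.items():
        clusters[popularity_cluster(popularity, n_clusters)].append(entity)
    return {index: sorted(names) for index, names in clusters.items()}


# Hypothetical popularity counts, invented for the example.
entity_popularity = {
    "local_band": 40,
    "regional_company": 2_500,
    "famous_politician": 900_000,
}
clusters = group_by_cluster(entity_popularity)
```

Under these assumptions, one classifier would be trained on the annotated documents of each cluster, and a new entity would be routed to the model of the cluster returned by `popularity_cluster`.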
May 2015
Modèles de langue adaptatifs et métacritères pour le filtrage de documents et le suivi temporel d’entités
Vincent Bouvier
Patrice Bellot
Document Numérique
This article addresses an issue in the entity-driven filtering task. While detecting and disambiguating entities within documents, our approach strives to select documents of interest according to their centrality to given named entities. We focus on selecting documents that bring novelty or relate an important event about an entity. We enhance entity profiles so that temporal aspects can be considered by means of new time-aware language models. We design meta-criteria that help disambiguate an entity within a document and detect novelty and interestingness. Using meta-criteria makes our approach entity-independent. We test our approach on the Knowledge Base Acceleration framework provided for the Text REtrieval Conference (TREC). Our strategies outperform the best systems presented on this framework.
Bouvier, V., Bellot, P. (2015). Modèles de langue adaptatifs et métacritères pour le filtrage de documents et le suivi temporel d’entités. In Document Numérique. Volume 18/1. (p. 75-96).
@article{Bouvier2015-2, author={Bouvier, V. and Bellot, P.}, pages={75--96}, year={2015}, title={Mod{\`e}les de langue adaptatifs et m{\'e}tacrit{\`e}res pour le filtrage de documents et le suivi temporel d’entit{\'e}s}, journal={Document Num{\'e}rique}, volume={18/1}}
Mar. 2015
Regroupement par popularité pour la RI semi-supervisée centrée sur les entités
Vincent Bouvier
Patrice Bellot
Actes de la conférence CORIA 2015
Filtering pages about an entity (person, company, ...) so that only documents of interest are kept is a real challenge. Interest can be qualified using criteria such as recency and novelty. In the last decade, classification systems have been trained to detect the interest of a document with regard to an entity. For scalability reasons, having a manually annotated training set for each entity is not feasible. That is why some approaches strive to build entity-independent classification systems. Those approaches perform well, but we show that they can be improved. Entities may differ in certain aspects that we think can be captured using clustering. Thus, instead of having one model per entity or one model for all entities, we propose an approach that uses one model per cluster of entities. This article aims to show how valuable entity clustering can be in an entity-driven filtering system, even with simple clustering techniques. We also introduce different strategies for automatic classification model selection. We detail the different aspects of our approach and test it on the Knowledge Base Acceleration framework from the Text REtrieval Conference. Finally, we show that our approach brings significant improvements over a non-cluster-based method.
Bouvier, V., Bellot, P. (2015). Regroupement par popularité pour la RI semi-supervisée centrée sur les entités. In Actes de la conférence CORIA 2015. (p. 503-512).
@inproceedings{Bouvier2015-1, author={Bouvier, V. and Bellot, P.}, pages={503--512}, year={2015}, title={Regroupement par popularit{\'e} pour la RI semi-supervis{\'e}e centr{\'e}e sur les entit{\'e}s}, booktitle={Actes de la conf{\'e}rence CORIA 2015}}
Mar. 2015
Évolution des profils d’entités à l’aide d’un modèle de langue sensible au temps
Vincent Bouvier
Patrice Bellot
Actes de la conférence CORIA 2015
Finding important information about a particular named entity in real time is a real challenge. It requires detecting the entity within a document and assessing how important the document is with regard to the entity. In this article, we formalize a new time-aware language model that we use as part of entity profiles. We design meta-criteria to make full use of this new profile design. Using meta-criteria keeps the system entity-independent and scalable. We evaluate our approach on data from the TREC KBA 2013 track; we obtain satisfactory results and draw interesting conclusions.
Bouvier, V., Bellot, P. (2015). Évolution des profils d’entités à l’aide d’un modèle de langue sensible au temps. In Actes de la conférence CORIA 2015. (p. 421-436).
@inproceedings{Bouvier2015, author={Bouvier, V. and Bellot, P.}, pages={421--436}, year={2015}, title={{\'E}volution des profils d’entit{\'e}s {\`a} l’aide d’un mod{\`e}le de langue sensible au temps}, booktitle={Actes de la conf{\'e}rence CORIA 2015}}
Nov. 2014
Use of Time-Aware Language Model in Entity Driven Filtering System
Vincent Bouvier
Patrice Bellot
The Twenty-Third Text REtrieval Conference (TREC 2014) Proceedings
Tracking entities so that new or important information about them is caught is a real challenge and has many applications (e.g., information monitoring, marketing). We are interested in how to represent an entity profile to fulfill two purposes: (1) entity detection and disambiguation, and (2) novelty and importance quantification. We propose an entity profile that uses two language models: first, the Reference Language Model (RLM), which is mainly used for disambiguation; second, a Time-Aware Language Model, which we formalize and use for novelty detection. To rank documents, we propose a semi-supervised classification approach that uses meta-features computed on documents using entity profiles and time series.
Bouvier, V., Bellot, P. (2014). Use of Time-Aware Language Model in Entity Driven Filtering System. In The Twenty-Third Text REtrieval Conference (TREC 2014) Proceedings. Volume SP 500-308. NIST.
@inproceedings{Bouvier2014-1, author={Bouvier, V. and Bellot, P.}, year={2014}, title={Use of Time-Aware Language Model in Entity Driven Filtering System}, booktitle={The Twenty-Third Text REtrieval Conference (TREC 2014) Proceedings}, volume={SP 500-308}, organization={NIST}}
May 2014
Approches de classification pour le filtrage de documents importants au sujet d’une entité nommée
Ludovic Bonnefoy
Vincent Bouvier
Patrice Bellot
Document Numérique
Our aim is to filter a stream of Web documents according to whether or not they refer to an entity, while estimating the importance of the information they contain about this entity. Our approach relies on classifiers that take into account features such as the frequency of the entity over time and in the documents, the positions of its mentions, and the presence of known related entities. Our approach was evaluated during the "Knowledge Base Acceleration" tracks of TREC 2012 and 2013 and was ranked among the best.
Bonnefoy, L., Bouvier, V., Bellot, P. (2014). Approches de classification pour le filtrage de documents importants au sujet d’une entité nommée. In Document Numérique. Volume 17/1. (p. 9-36).
@article{Bonnefoy2014, author={Bonnefoy, L. and Bouvier, V. and Bellot, P.}, pages={9--36}, year={2014}, title={Approches de classification pour le filtrage de documents importants au sujet d’une entit{\'e} nomm{\'e}e}, journal={Document Num{\'e}rique}, volume={17/1}}
May 2014
Critères numériques et temporels pour la détection de documents vitaux dans un flux
Vincent Bouvier
Patrice Bellot
Acte du XXXIIème congrès INFORSID 2014
This paper addresses a classification challenge in a filtering task. We use different kinds of features to characterize vital documents and extract them from the stream. A vital document has to be relevant to a particular entity and has to relate a new story about it. We introduce different features that use time as well as the entity profile to perform classification. We evaluate our method on the framework from the TREC KBA 2013 (Knowledge Base Acceleration) track.
Bouvier, V., Bellot, P. (2014). Critères numériques et temporels pour la détection de documents vitaux dans un flux. In Acte du XXXIIème congrès INFORSID 2014. (p. 311-325).
@inproceedings{Bouvier2014, author={Bouvier, V. and Bellot, P.}, pages={311--325}, year={2014}, title={Crit{\`e}res num{\'e}riques et temporels pour la d{\'e}tection de documents vitaux dans un flux}, booktitle={Acte du XXXII{\`e}me congr{\`e}s INFORSID 2014}}
Feb. 2014
Large Scale Text Mining Approaches for Information Retrieval and Extraction
Patrice Bellot
Ludovic Bonnefoy
Vincent Bouvier
Frédéric Duvert
Young-Min Kim
Innovations in Intelligent Machines-4
Natural Language Processing and Information Retrieval have long been studied, but the recent availability of very large resources (Web pages, digital documents…) and the development of statistical machine learning methods that exploit annotated texts (with crowdsourced manual annotation as a major new source) have transformed these fields. As a result, these approaches are no longer limited to highly specialized domains, and the cost of implementing them is reduced. In this chapter, our aim is to present some popular statistical text-mining approaches for information retrieval and information extraction, and to discuss the practical limits of current systems, which raise challenges for the future.
Bellot, P., Bonnefoy, L., Bouvier, V., Duvert, F., Kim, Y. M. (2014). Large Scale Text Mining Approaches for Information Retrieval and Extraction. Innovations in Intelligent Machines-4. (p. 3-45).
@incollection{Bellot2014, author={Bellot, P. and Bonnefoy, L. and Bouvier, V. and Duvert, F. and Kim, Y. M.}, pages={3--45}, year={2014}, title={Large Scale Text Mining Approaches for Information Retrieval and Extraction}, booktitle={Innovations in Intelligent Machines-4}}
Nov. 2013
Filtering Entity Centric Documents using Numerics and Temporals features within RF Classifier
Vincent Bouvier
Patrice Bellot
The Twenty-Second Text REtrieval Conference (TREC 2013) Proceedings
This paper addresses a classification challenge in a filtering task. We use different kinds of features to characterize vital documents and extract them from the stream. A vital document has to be relevant to a particular entity and has to relate a new story about it. We introduce different features that use time as well as the entity profile to perform classification. We evaluate our method on the framework from the TREC KBA 2013 (Knowledge Base Acceleration) track.
Bouvier, V., Bellot, P. (2013). Filtering Entity Centric Documents using Numerics and Temporals features within RF Classifier. In The Twenty-Second Text REtrieval Conference (TREC 2013) Proceedings. Volume SP 500-302. NIST.
@inproceedings{Bouvier2013-1, author={Bouvier, V. and Bellot, P.}, year={2013}, title={Filtering Entity Centric Documents using Numerics and Temporals features within RF Classifier}, booktitle={The Twenty-Second Text REtrieval Conference (TREC 2013) Proceedings}, volume={SP 500-302}, organization={NIST}}
Sep. 2013
LIA @RepLab 2013
Jean-Valère Cossu
Benjamin Bigot
Ludovic Bonnefoy
Mohamed Morchid
Xavier Bost
Grégory Senay
Richard Dufour
Vincent Bouvier
Juan-Manuel Torres-Moreno
Marc El-Bèze
Working Notes for CLEF 2013 Conference
In this paper, we present the participation of the Computer Science Laboratory of Avignon (LIA) in the 2013 edition of RepLab. RepLab is an evaluation campaign for Online Reputation Management Systems. LIA produced a large number of experiments for every task of the campaign: filtering, topic priority detection, polarity for reputation, and topic detection. Our approaches rely on a large variety of machine learning methods. We chose to exploit mainly tweet content. In several of our experiments we also added selected metadata. A smaller number of our proposals integrated external information by using the provided links to Wikipedia and to users' homepages.
Cossu, J. V., Bigot, B., Bonnefoy, L., Morchid, M., Bost, X., Senay, G., Dufour, R., et al. (2013). LIA @RepLab 2013. In Working Notes for CLEF 2013 Conference.
@inproceedings{Cossu2013, author={Cossu, J. V. and Bigot, B. and Bonnefoy, L. and Morchid, M. and Bost, X. and Senay, G. and Dufour, R. and others}, year={2013}, title={LIA @RepLab 2013}, booktitle={Working Notes for CLEF 2013 Conference}}
Aug. 2013
A weakly-supervised detection of entity central documents in a stream
Ludovic Bonnefoy
Vincent Bouvier
Patrice Bellot
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Named entity disambiguation is the task of linking an ambiguous name found in a document to a real-world entity, represented by a unique node in a knowledge base (KB). In this work, we go one step further: filtering a time-ordered corpus for documents that are highly relevant to an entity (represented by an entry in a KB). One application is reducing the delay between the moment a piece of information is first observed and the moment a knowledge base is updated. We present one of the first and most effective works in this direction. Our weakly-supervised approach relies on three types of features: document-centric features, entity-profile-related features and time features. Evaluated within the framework of the "Knowledge Base Acceleration" track at TREC 2012, our approach achieved good results (3rd of 11 systems) and has the advantage of working well without requiring new training data when processing a new entity.
Bonnefoy, L., Bouvier, V., Bellot, P. (2013). A weakly-supervised detection of entity central documents in a stream. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. (p. 769-772). ACM.
@inproceedings{Bonnefoy2013-1, author={Bonnefoy, L. and Bouvier, V. and Bellot, P.}, pages={769--772}, year={2013}, title={A weakly-supervised detection of entity central documents in a stream}, booktitle={Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval}, organization={ACM}}
Apr. 2013
Amélioration d’un corpus de requêtes à l’aide d’une méthode non-supervisée
Vincent Bouvier
Patrice Bellot
Actes de CORIA 2013 et des RJCRI 2013
This article introduces a method to build a set of clusters of similarly spelled words. Based on a modified edit distance and distribution statistics, this approach is completely knowledge-free. The method was developed for a real business problem: the company concerned receives product descriptions made up of keywords, some of which are mistyped or misspelled. The aim of the algorithm is to find the most understandable spelling (for humans as well as computers) of each keyword.
Bouvier, V., Bellot, P. (2013). Amélioration d’un corpus de requêtes à l’aide d’une méthode non-supervisée. In Actes de CORIA 2013 et des RJCRI 2013. (p. 373-382).
@inproceedings{Bouvier2013, author={Bouvier, V. and Bellot, P.}, pages={373--382}, year={2013}, title={Am{\'e}lioration d’un corpus de requ{\^e}tes {\`a} l’aide d’une m{\'e}thode non-supervis{\'e}e}, booktitle={Actes de CORIA 2013 et des RJCRI 2013}}
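The clustering of similarly spelled keywords described in the abstract above can be sketched roughly as follows. This is only an illustration: it uses the plain Levenshtein distance rather than the paper's modified edit distance, raw frequency counts stand in for the distribution statistics, and the greedy strategy, the `max_distance` threshold and the sample keywords are assumptions.

```python
from collections import Counter


def levenshtein(a, b):
    """Plain Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_spellings(keywords, max_distance=1):
    """Greedy clustering: the most frequent spellings become cluster heads.

    Treating the most frequent variant as the canonical spelling is an
    assumption made for this sketch.
    """
    counts = Counter(keywords)
    clusters = {}  # canonical spelling -> list of variants (head included)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for word, _ in ordered:
        for head in clusters:
            if levenshtein(word, head) <= max_distance:
                clusters[head].append(word)
                break
        else:
            # No existing head is close enough: start a new cluster.
            clusters[word] = [word]
    return clusters


# Hypothetical keyword list, invented for the example.
keywords = ["keyboard", "keyboard", "keybord", "keyboard",
            "mouse", "mouze", "mouse"]
clusters = cluster_spellings(keywords)
```

Each key of `clusters` is then the spelling proposed for all the variants listed under it.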
Apr. 2013
Vers une détection en temps réel de documents Web centrés sur une entité donnée
Ludovic Bonnefoy
Vincent Bouvier
Romain Deveaud
Patrice Bellot
Actes de CORIA 2013 et des RJCRI 2013
Named entity disambiguation is the task of linking an ambiguous name found in a document to the unique real-world entity it represents in a knowledge base (KB). In this work we take the opposite problem and add a time constraint: we want to monitor a data stream to detect, in real time, documents about an entity from a KB and determine to what extent the information in those documents matters. This could be used to reduce the time lag between the moment important new information about an entity appears and the moment it is added to the knowledge base. We propose to use Random Forests combined with time-related features (e.g., count of mentions over time) as well as document-centric and related-entity features to tackle this problem. The effectiveness and impact of the features used were evaluated through our participation in the "Knowledge Base Acceleration" task at TREC 2012, where our team ranked 3rd out of 11.
Bonnefoy, L., Bouvier, V., Deveaud, R., Bellot, P. (2013). Vers une détection en temps réel de documents Web centrés sur une entité donnée. In Actes de CORIA 2013 et des RJCRI 2013. (p. 21-36).
@inproceedings{Bonnefoy2013, author={Bonnefoy, L. and Bouvier, V. and Deveaud, R. and Bellot, P.}, pages={21--36}, year={2013}, title={Vers une d{\'e}tection en temps r{\'e}el de documents Web centr{\'e}s sur une entit{\'e} donn{\'e}e}, booktitle={Actes de CORIA 2013 et des RJCRI 2013}}
Nov. 2012
LSIS/LIA at TREC 2012 knowledge base acceleration
Ludovic Bonnefoy
Vincent Bouvier
Patrice Bellot
The Twenty-First Text REtrieval Conference (TREC 2012) Proceedings
This paper describes our joint participation in the TREC 2012 KBA task. The system works as follows: first, name variations of the entity topics are searched for; then, documents containing at least one of them are retrieved. Finally, documents go through two classifiers that categorize them as garbage, neutral, relevant or central. This system obtained good results (above the median).
Bonnefoy, L., Bouvier, V., Bellot, P. (2012). LSIS/LIA at TREC 2012 knowledge base acceleration. In The Twenty-First Text REtrieval Conference (TREC 2012) Proceedings. Volume SP 500-298. NIST.
@inproceedings{Bonnefoy2012, author={Bonnefoy, L. and Bouvier, V. and Bellot, P.}, year={2012}, title={LSIS/LIA at TREC 2012 knowledge base acceleration}, booktitle={The Twenty-First Text REtrieval Conference (TREC 2012) Proceedings}, volume={SP 500-298}, organization={NIST}}
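The three steps in the abstract above (name variations, retrieval of documents mentioning one of them, then two classifiers) could look roughly like the sketch below. The abstract does not say how the four labels are divided between the two classifiers, so the cascade shown here, which merges garbage and neutral into one outcome, is an assumption. The keyword rules standing in for trained classifiers, the function names and the sample documents are hypothetical.

```python
def mentions_entity(document, name_variants):
    """Keep a document only if it contains at least one name variant."""
    text = document.lower()
    return any(variant.lower() in text for variant in name_variants)


def filter_stream(documents, name_variants, is_useful, is_central):
    """Run retrieved documents through a cascade of two binary classifiers.

    Assumed cascade: the first classifier separates garbage/neutral from
    useful documents, the second separates central from relevant ones.
    """
    results = []
    for document in documents:
        if not mentions_entity(document, name_variants):
            continue  # not retrieved, so never classified
        if not is_useful(document):
            label = "garbage/neutral"
        elif is_central(document):
            label = "central"
        else:
            label = "relevant"
        results.append((document, label))
    return results


# Toy stand-ins for trained classifiers (hypothetical keyword rules).
def is_useful(document):
    return "road" not in document.lower()


def is_central(document):
    return "breaking" in document.lower()


name_variants = ["Ada Lovelace", "Lovelace"]
documents = [
    "Weather report for Tuesday.",
    "A street named Lovelace Road was repaved.",
    "Lovelace wrote notes on the Analytical Engine.",
    "Breaking: newly found letter by Ada Lovelace published today.",
]
results = filter_stream(documents, name_variants, is_useful, is_central)
labels = [label for _, label in results]
```

In a real system the two rule functions would be replaced by trained models; the sketch only shows the order in which documents are retrieved and classified.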