Dec. 2015
Approches numériques pour le filtrage de documents centré sur une entité
Thesis
Our main contributions are: 1. We propose an entity-centric classification system that helps find documents related to an entity based on its profile and a set of meta-criteria. We propose to use the classification result to filter out unrelated documents. This approach is entity-independent and uses transfer learning principles: we trained the classification system on a set of annotated documents concerning one set of entities and categorized documents concerning other entities. 2. We introduce a diachronic language model, which extends our definition of an entity profile with the capability of updating the profile. Tracking an entity implies distinguishing a known piece of information from a new one. This new language model enables automatic updating of an entity profile while minimizing noise. 3. We develop a method to detect entity popularity in order to enhance the coherence of a classification model with respect to temporal aspects. To detect the importance of a document regarding an entity, we propose to use temporal sensors, which may vary from one entity to another. We cluster entities sharing the same level of popularity on the Web at each time t to enhance the coherence of the classification model and thus improve classifier performance.
Bouvier, V. (2015). Approches numériques pour le filtrage de documents centré sur une entité. In Thesis.
@misc{
Bouvier2015-4,
author={Bouvier, V.},
year={2015},
title={Approches num{\'e}riques pour le filtrage de documents centr{\'e} sur une entit{\'e}},
howpublished={Thesis},
}
Dec. 2015
Use of Web Popularity on Entity Centric Document Filtering
Web Intelligence Conference 2015
Bouvier, V., Bellot, P. (2015). Use of Web Popularity on Entity Centric Document Filtering. In Web Intelligence Conference 2015.
@inproceedings{
Bouvier2015-3,
author={Bouvier, V. and Bellot, P.},
year={2015},
title={Use of Web Popularity on Entity Centric Document Filtering},
booktitle={Web Intelligence Conference 2015},
}
May 2015
Modèles de langue adaptatifs et métacritères pour le filtrage de documents et le suivi temporel d’entités
Document Numérique
This article addresses an issue in the entity-driven filtering task. While detecting and disambiguating entities within documents, our approach strives to select documents of interest according to their centrality to some given named entities. We focus on selecting documents that bring novelty or report an important event about an entity. We enhance entity profiles so that temporal aspects can be considered by means of new time-aware language models. We design meta-criteria that help disambiguate an entity within a document and detect novelty or interestingness. Using meta-criteria makes our approach entity-independent. We test our approach on the Knowledge Base Acceleration framework provided for the Text REtrieval Conference (TREC). Our strategies outperform the best systems presented on this framework.
Bouvier, V., Bellot, P. (2015). Modèles de langue adaptatifs et métacritères pour le filtrage de documents et le suivi temporel d’entités. In Document Numérique. Volume 18/1. (p. 75-96).
@article{
Bouvier2015-2,
author={Bouvier, V. and Bellot, P.},
pages={75--96},
year={2015},
title={Mod{\`e}les de langue adaptatifs et m{\'e}tacrit{\`e}res pour le filtrage de documents et le suivi temporel d'entit{\'e}s},
journal={Document Num{\'e}rique},
volume={18/1},
}
Mar. 2015
Regroupement par popularité pour la RI semi-supervisée centrée sur les entités
Actes de la conférence CORIA 2015
Filtering pages about an entity (person, company, ...) so that only documents of interest are kept is a real challenge. Interest can be qualified using criteria such as recency and novelty. In the last decade, we have seen classification systems trained to detect the interest of a document with regard to an entity. For scalability reasons, it is not possible to have a manually annotated training set for each entity. That is why some approaches strive to build entity-independent classification systems. Those approaches perform well, but we show that they can be improved. Entities may differ in certain aspects that we think can be captured using clustering. Thus, instead of having one model per entity or one model for all entities, we propose an approach that uses one model per cluster of entities. This article aims to show how valuable entity clustering can be for an entity-driven filtering system, even with simple clustering techniques. We also introduce different strategies for automatic classification model selection. In this article, we detail the different aspects of our approach and test it on the Knowledge Base Acceleration framework from the Text REtrieval Conference. Finally, we show that our approach brings significant improvements over a non-cluster-based method.
Bouvier, V., Bellot, P. (2015). Regroupement par popularité pour la RI semi-supervisée centrée sur les entités. In Actes de la conférence CORIA 2015. (p. 503-512).
@inproceedings{
Bouvier2015-1,
author={Bouvier, V. and Bellot, P.},
pages={503--512},
year={2015},
title={Regroupement par popularit{\'e} pour la RI semi-supervis{\'e}e centr{\'e}e sur les entit{\'e}s},
booktitle={Actes de la conf{\'e}rence CORIA 2015},
}
Mar. 2015
Évolution des profils d’entités à l’aide d’un modèle de langue sensible au temps
Actes de la conférence CORIA 2015
Finding important information in real time on a particular named entity is a real challenge. It requires detecting the entity within a document and assessing how important the document is with regard to the entity. In this article, we formalize a new time-aware language model that we use as part of entity profiles. We design meta-criteria to make full use of this new profile design. Using meta-criteria keeps the system entity-independent and scalable. We evaluate our approach on data from the TREC KBA 2013 track and obtain satisfying results and interesting conclusions.
Bouvier, V., Bellot, P. (2015). Évolution des profils d’entités à l’aide d’un modèle de langue sensible au temps. In Actes de la conférence CORIA 2015. (p. 421-436).
@inproceedings{
Bouvier2015,
author={Bouvier, V. and Bellot, P.},
pages={421--436},
year={2015},
title={{\'E}volution des profils d'entit{\'e}s {\`a} l'aide d'un mod{\`e}le de langue sensible au temps},
booktitle={Actes de la conf{\'e}rence CORIA 2015},
}
Nov. 2014
Use of Time-Aware Language Model in Entity Driven Filtering System
The Twenty-Third Text REtrieval Conference (TREC 2014) Proceedings
Bouvier, V., Bellot, P. (2014). Use of Time-Aware Language Model in Entity Driven Filtering System. In The Twenty-Third Text REtrieval Conference (TREC 2014) Proceedings. Volume SP 500-308. NIST.
@inproceedings{
Bouvier2014-1,
author={Bouvier, V. and Bellot, P.},
year={2014},
title={Use of Time-Aware Language Model in Entity Driven Filtering System},
booktitle={The Twenty-Third Text REtrieval Conference (TREC 2014) Proceedings},
volume={SP 500-308},
organization={NIST},
}
May 2014
Critères numériques et temporels pour la détection de documents vitaux dans un flux
Acte du XXXIIème congrès INFORSID 2014
This paper addresses a classification challenge in a filtering task. We use different kinds of features to characterize vital documents and extract them from the stream. A vital document has to be relevant to a particular entity and has to relate a new story about it. We introduce different features that use time as well as the entity profile to perform classification. We evaluate our method on the TREC KBA 2013 (Knowledge Base Acceleration) framework.
Bouvier, V., Bellot, P. (2014). Critères numériques et temporels pour la détection de documents vitaux dans un flux. In Acte du XXXIIème congrès INFORSID 2014. (p. 311-325).
@inproceedings{
Bouvier2014,
author={Bouvier, V. and Bellot, P.},
pages={311--325},
year={2014},
title={Crit{\`e}res num{\'e}riques et temporels pour la d{\'e}tection de documents vitaux dans un flux},
booktitle={Acte du XXXII{\`e}me congr{\`e}s INFORSID 2014},
}
May 2014
Approches de classification pour le filtrage de documents importants au sujet d’une entité nommée
Document Numérique
Our aim is to filter a stream of Web documents according to whether or not they refer to an entity, while estimating the importance of the information they contain about this entity. Our approach relies on classifiers that take into account features such as the frequency of the entity over time and in the documents, its positions and the presence of known related entities. Our approach was evaluated during the "Knowledge Base Acceleration" tracks of TREC 2012 and 2013 and was ranked among the best ones.
Bonnefoy, L., Bouvier, V., Bellot, P. (2014). Approches de classification pour le filtrage de documents importants au sujet d’une entité nommée. In Document Numérique. Volume 17/1. (p. 9-36).
@article{
Bonnefoy2014,
author={Bonnefoy, L. and Bouvier, V. and Bellot, P.},
pages={9--36},
year={2014},
title={Approches de classification pour le filtrage de documents importants au sujet d'une entit{\'e} nomm{\'e}e},
journal={Document Num{\'e}rique},
volume={17/1},
}
Feb. 2014
Large Scale Text Mining Approaches for Information Retrieval and Extraction
Innovations in Intelligent Machines-4
The issues of Natural Language Processing and Information Retrieval have been studied for a long time, but the recent availability of very large resources (Web pages, digital documents…) and the development of statistical machine learning methods exploiting annotated texts (manual encoding by crowdsourcing is a major new way of obtaining them) have transformed these fields. This makes it possible to apply these approaches beyond highly specialized domains and to reduce the cost of their implementation. In this chapter, our aim is to present some popular statistical text-mining approaches for information retrieval and information extraction and to discuss the practical limits of current systems, which raise challenges for the future.
Bellot, P., Bonnefoy, L., Bouvier, V., Duvert, F., Kim, Y. M. (2014). Large Scale Text Mining Approaches for Information Retrieval and Extraction. In Innovations in Intelligent Machines-4. (p. 3-45).
@incollection{
Bellot2014,
author={Bellot, P. and Bonnefoy, L. and Bouvier, V. and Duvert, F. and Kim, Y. M.},
pages={3--45},
year={2014},
title={Large Scale Text Mining Approaches for Information Retrieval and Extraction},
booktitle={Innovations in Intelligent Machines-4},
}
Nov. 2013
Filtering Entity Centric Documents using Numerics and Temporals features within RF Classifier
The Twenty-Second Text REtrieval Conference (TREC 2013) Proceedings
This paper addresses a classification challenge in a filtering task. We use different kinds of features to characterize vital documents and extract them from the stream. A vital document has to be relevant to a particular entity and has to relate a new story about it. We introduce different features that use time as well as the entity profile to perform classification. We evaluate our method on the TREC KBA 2013 (Knowledge Base Acceleration) framework.
Bouvier, V., Bellot, P. (2013). Filtering Entity Centric Documents using Numerics and Temporals features within RF Classifier. In The Twenty-Second Text REtrieval Conference (TREC 2013) Proceedings. Volume SP 500-302. NIST.
@inproceedings{
Bouvier2013-1,
author={Bouvier, V. and Bellot, P.},
year={2013},
title={Filtering Entity Centric Documents using Numerics and Temporals features within RF Classifier},
booktitle={The Twenty-Second Text REtrieval Conference (TREC 2013) Proceedings},
volume={SP 500-302},
organization={NIST},
}
Sep. 2013
LIA @RepLab 2013
Working Notes for CLEF 2013 Conference
In this paper, we present the participation of the Computer Science Laboratory of Avignon (LIA) in the 2013 edition of RepLab. RepLab is an evaluation campaign for Online Reputation Management Systems. LIA has produced a large number of experiments for every task of the campaign: filtering, topic priority detection, polarity for reputation and topic detection. Our approaches rely on a large variety of machine learning methods. We have chosen to mainly exploit tweet contents. In several of our experiments we have also added selected metadata. A smaller number of our proposals have integrated external information by using the provided links to Wikipedia and users' homepages.
Cossu, J. V., Bigot, B., Bonnefoy, L., Morchid, M., Bost, X., Senay, G., Dufour, R.... (2013). LIA @RepLab 2013. In Working Notes for CLEF 2013 Conference.
@inproceedings{
Cossu2013,
author={Cossu, J. V. and Bigot, B. and Bonnefoy, L. and Morchid, M. and Bost, X. and Senay, G. and Dufour, R. and others},
year={2013},
title={LIA @RepLab 2013},
booktitle={Working Notes for CLEF 2013 Conference},
}
Aug. 2013
A weakly-supervised detection of entity central documents in a stream
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Bonnefoy, L., Bouvier, V., Bellot, P. (2013). A weakly-supervised detection of entity central documents in a stream. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. (p. 769-772). ACM.
@inproceedings{
Bonnefoy2013-1,
author={Bonnefoy, L. and Bouvier, V. and Bellot, P.},
pages={769--772},
year={2013},
title={A weakly-supervised detection of entity central documents in a stream},
booktitle={Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval},
organization={ACM},
}
Apr. 2013
Amélioration d’un corpus de requêtes à l’aide d’une méthode non-supervisée
Actes de CORIA 2013 et des RJCRI 2013
This article introduces a method to build a set of clusters that contain similarly spelled words. Based on a modified edit distance and distribution statistics, this approach is completely knowledge-free. The method has been developed for a real business issue. The company concerned obtains product descriptions made up of keywords, some of which are mistyped or misspelled. The aim of the algorithm is to find the most understandable (i.e., to humans as well as computers) spelling for each keyword.
Bouvier, V., Bellot, P. (2013). Amélioration d’un corpus de requêtes à l’aide d’une méthode non-supervisée. In Actes de CORIA 2013 et des RJCRI 2013. (p. 373-382).
@inproceedings{
Bouvier2013,
author={Bouvier, V. and Bellot, P.},
pages={373--382},
year={2013},
title={Am{\'e}lioration d'un corpus de requ{\^e}tes {\`a} l'aide d'une m{\'e}thode non-supervis{\'e}e},
booktitle={Actes de CORIA 2013 et des RJCRI 2013},
}
Apr. 2013
Vers une détection en temps réel de documents Web centrés sur une entité donnée
Actes de CORIA 2013 et des RJCRI 2013
Bonnefoy, L., Bouvier, V., Deveaud, R., Bellot, P. (2013). Vers une détection en temps réel de documents Web centrés sur une entité donnée. In Actes de CORIA 2013 et des RJCRI 2013. (p. 21-36).
@inproceedings{
Bonnefoy2013,
author={Bonnefoy, L. and Bouvier, V. and Deveaud, R. and Bellot, P.},
pages={21--36},
year={2013},
title={Vers une d{\'e}tection en temps r{\'e}el de documents Web centr{\'e}s sur une entit{\'e} donn{\'e}e},
booktitle={Actes de CORIA 2013 et des RJCRI 2013},
}
Nov. 2012
LSIS/LIA at TREC 2012 knowledge base acceleration
The Twenty-First Text REtrieval Conference (TREC 2012) Proceedings
This paper describes our joint participation in the TREC 2012 KBA task.
The system is broken down as follows: first, name variations of the entity topics are searched for; then, documents containing at least one of them are retrieved. Finally, documents go through two classifiers that categorize them as garbage, neutral, relevant or central.
This system got good results (above median).
Bonnefoy, L., Bouvier, V., Bellot, P. (2012). LSIS/LIA at TREC 2012 knowledge base acceleration. In The Twenty-First Text REtrieval Conference (TREC 2012) Proceedings. Volume SP 500-298. NIST.
@inproceedings{
Bonnefoy2012,
author={Bonnefoy, L. and Bouvier, V. and Bellot, P.},
year={2012},
title={LSIS/LIA at TREC 2012 knowledge base acceleration},
booktitle={The Twenty-First Text REtrieval Conference (TREC 2012) Proceedings},
volume={SP 500-298},
organization={NIST},
}