en:site:recherche:projets:epique:start
Differences
This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
en:site:recherche:projets:epique:start [20/10/2016 12:09] – created amann | en:site:recherche:projets:epique:start [03/12/2019 11:23] (current) – amann | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== EPIQUE | + | ====== EPIQUE |
+ | |||
+ | ===== Reconstructing the long-term evolution of science through large-scale analysis of science productions ===== | ||
**AAP ANR La Révolution numérique : rapports aux savoirs et à la culture** | **AAP ANR La Révolution numérique : rapports aux savoirs et à la culture** | ||
Line 10: | Line 12: | ||
* Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA Rennes) | * Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA Rennes) | ||
- | ===== Abstract ===== | + | **[[https:// |
=== Towards quantitative epistemology - Reconstructing the long-term evolution of science through large-scale analysis of science productions === | === Towards quantitative epistemology - Reconstructing the long-term evolution of science through large-scale analysis of science productions === | ||
Line 18: | Line 20: | ||
**Keywords :** quantitative epistemology, | **Keywords :** quantitative epistemology, | ||
+ | [[en: | ||
- | ===== Challenges and ambition ===== | ||
- | |||
- | ==== Challenge 1: Understanding Science evolution ==== | ||
- | |||
- | Our first challenge is to build global semantic maps of the evolution of science over large scientific domains by applying appropriate scientometric models to large databases like the Web of Science (about 30% of the worldwide scientific production, interdisciplinary but biased toward hard science), MedLine (the main biomedical archive), Repec (the main archive for economics) or open archives like arxiv.org (the main pre-print archive for physics, maths and computer science). The ISC-PIF partner already has access to these datasets, including the WoS raw format until 2015, with the appropriate licences to share them within a research project. These datasets will be a general driving force for the project and for the experimental validation of the solutions. As a first goal, the project intends to contribute to our overall understanding of the evolution of science, and will therefore confront some of its results with competing extant philosophical theories about the building of science and scientific change. The challenge is to pursue this task in the light of the main existing accounts of scientific evolution: traditional cumulative accounts of science, Darwinian accounts such as Hull (1988) or memetics, Popperian accounts (Popper 1963), Kuhnian accounts in terms of revolutions and paradigms (Kuhn 1970), or Lakatosian accounts in terms of labile research programs (Lakatos 1978), as well as more recent views of Bayesian theory change. Most of these conceptions of scientific evolution have been elaborated and tested with a small number of papers and books; we intend to assess their validity by confronting them with the patterns that we will uncover in databases covering a long time range and a huge number of scientific publications. The project will also consider the recent claim that current science is more “hypothesis driven” than “theory driven”, and has as a background the idea (Nowotny et al. 2001) that the novel regime of science focuses on problems rather than disciplines: | ||
- | The project partners have been pioneers in reconstructing science dynamics by mining corpora at large scale [Chavalarias & Cointet 2013][Chavalarias et al. 2011]. They have shown that the different phases of the evolution of scientific fields can be characterized quantitatively, and that “phylomemetic” topic lattices representing this evolution (by analogy with the genealogical trees of natural species) can be built automatically. The reconstruction of phylomemetic lattices from scientific production is of utmost importance for a wide range of actors: | ||
- | philosophers and historians of science, who need to test their theories with data, in particular about the ways fields cross-fertilize and novelty emerges, | ||
- | scientists who want to position themselves in their field, understand its ins and outs and find domains with high discovery potential, | ||
- | policy makers who want to spot emergent fields, foster innovation and get key indicators to assist them in decision-making processes, | ||
- | industry, which has to find its way through the scientific production and evaluate the potential for innovation and technological transfer, | ||
- | librarians who need to propose | ||
- | |||
- | This reconstruction is now within reach since science has been one of the first domains of human activity to have digitized archives. | ||
- | |||
- | Science and human activity in general are largely built on the exchange and registration of knowledge in textual form. Well before the advent of the Web, the increasing amount of scholarly literature made it difficult for scientists to keep up to date in their field of interest, and a main task of scientific editors was and still is to collect and organize articles in different domains. Today, for example, collections like the Web of Science (WoS) or Scopus provide researchers, | ||
- | EPIQUE will involve philosophers of science, who see phylomemetic lattice structures as a tool for testing general accounts of progress and change in science; they will also play a role in validating the lattices produced in the context of the project. This input will allow a better calibration of the protocol for reconstructing lattices, and will therefore provide feedback on the methodology itself. As users, the historians and philosophers of science involved in the project will design case studies (esp. in biology, ecology, economics) of the evolution of a target concept across subfields of a general field (e.g. the concept of “diversity” or of “function” in ecology); those case studies will in turn contribute to fine-tuning the algorithmic tools. Finally, within EPIQUE, philosophers of science will integrate the research in philosophy of science and the computer science work on big data [NPS14][NPS11]. | ||
- | |||
- | ==== Challenge 2: Large-scale text topic detection and alignment ==== | ||
- | |||
- | The goal of building a global map of the evolution of science is also challenging from a computer science point of view. It is the ambitious goal of making sense of unstructured text through generic data processing tasks (graph clustering, similarity matching, indexing), which become complex when dealing with very large amounts of digitized text. The size of the digital archives and science repositories (Medline, Arxiv.org, WoS, etc.) calls for the development of new solutions exploiting recent parallel data processing frameworks like Hadoop, Spark, and Pregel. | ||
- | |||
- | ==== Challenge 3: Dynamicity, Interactivity and Customization ==== | ||
- | |||
- | The third and probably most ambitious challenge is to make the whole mining workflow more flexible and interactive. Consider for example a chronologically ordered list of digitized publications from which the user wants to extract a phylomemetic network of the research domains. Building the result involves a number of complex processing steps which make it difficult to handle dynamic information and customization. | ||
- | |||
- | ===== General project statement ===== | ||
- | |||
- | EPIQUE is the first project where science evolution will be studied at such a large scale (over entire datasets like the WoS or MedLine). From the viewpoint of philosophy of science, it allows testing theories on the evolution and nature of science, which have been formulated only by considering a few canonical texts (the “great scientists of the past”, which introduces numerous biases), on a corpus that can reliably be seen as a plausible testimony of scientific activity. Preliminary results on a small part of the corpus already demonstrate that phylomemetic lattices reveal novel semantic insights about science evolution [CC13]. We are confident that taking the whole corpus into account will not only extend these results to other scientific fields but also, more fundamentally, yield a deeper understanding of interdisciplinary evolution. Facing an ever-growing corpus, we do not consider scientometric workflows as sequences of independent tasks on a given dataset; we strive for a more integrated framework which allows end users to interact with and control the whole process through high-level languages and interfaces (e.g. for specifying the scientific field and time range of interest, or any criteria about the corpus such as the country of the authors). The architecture of the project strongly relies on feedback loops between the production of lattices and users such as philosophers of science, historians having reconstructed some small semantic networks, and other experts. This allows for the controlled production of phylomemies by assessing results against expert knowledge in the field and against the small semantic networks reconstructed by expert historians. We do not only focus on the workflow itself; we aim to come up with a system for producing, maintaining and adjusting phylomemetic lattices on demand. This brings the double opportunity to manage complex data more efficiently and to optimize the text mining workflow; e.g., compute only the required phylomemetic lattices, share (and save) computation among users, and reuse workflow refinement strategies among users. The system, serving several users, will leverage users’ experience to provide both unprecedented efficiency and new incentives to enrich collaboration between users. | ||
- | |||
- | |||
- | ===== Expected results ===== | ||
- | |||
- | A first direct outcome of EPIQUE will be the enrichment of the open source ISC-PIF software catalog with innovative tools for reconstructing and exploring multi-scale dynamics in complete real-world scientific corpora and for obtaining new insights into the evolution of complex human-generated knowledge and information. In particular, advances in phylomemy reconstruction will be implemented in the Gargantext platform, which is used, among others, for teaching controversy analysis to students in several higher-education schools and universities. | ||
- | The second, more generic result, will be a uniform framework for specifying, implementing and integrating | ||
- | A third result will be the opportunity to revisit classical hypotheses concerning the evolution of scientific fields and contents, and to test and improve these hypotheses in the light of the reconstructed phylomemies and of general patterns detectable within them; this outcome will be exploited in various academic publications coauthored by computer scientists and philosophers or historians of science participating in the project. | ||
- | |||
- | |||
- | |||
- | ===== Scientific program ===== | ||
- | |||
- | The main goal of the EPIQUE project is to define, implement and compose a set of tools for extracting customizable maps of the evolution of science from large and representative scientific corpora like Web-of-Science, | ||
- | |||
- | - The term extraction and proximity graph construction step transforms a collection of text documents into a set of weighted term graphs / matrices for different user-defined time slices. A node in this graph is a set of semantically equivalent n-grams. We consider different ways to compute the relation between nodes, and in particular those computed from co-occurrence data like mutual information, | ||
- | - The topic detection and alignment step consists first in the detection of topics (sets of strongly semantically related terms) within term graphs (each of these graphs | ||
- | - The phylomemetic tree analysis and customization step allows experts to interact with the workflow by generating and visualizing phylomemetic trees and by interactively customizing the workflow, changing data (for example removing or adding a term in a topic) and parameters (for example the time interval for dividing the document collection). | ||
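As a rough illustration of the first step above (term graphs per time slice, weighted from co-occurrence data), here is a minimal Python sketch. It is not the project's implementation: the function names `pmi_term_graph` and `graphs_by_period`, the use of document-level co-occurrence, the pointwise mutual information weighting and the rule of keeping only positive weights are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_term_graph(docs):
    """Build a weighted term graph from one time slice.

    docs: list of sets of terms, one set per document (hypothetical input format).
    Returns {(term_a, term_b): pmi}, keeping only pairs with positive PMI.
    """
    n_docs = len(docs)
    term_count = Counter()
    pair_count = Counter()
    for terms in docs:
        term_count.update(terms)
        # Sort so that each unordered pair has a single canonical key.
        pair_count.update(combinations(sorted(terms), 2))

    graph = {}
    for (a, b), n_ab in pair_count.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) ), with document frequencies.
        pmi = math.log((n_ab * n_docs) / (term_count[a] * term_count[b]))
        if pmi > 0:
            graph[(a, b)] = pmi
    return graph


def graphs_by_period(corpus, period_length):
    """Group (year, terms) records into time slices and build one graph per slice."""
    slices = {}
    for year, terms in corpus:
        start = year - year % period_length  # first year of the slice
        slices.setdefault(start, []).append(set(terms))
    return {start: pmi_term_graph(docs) for start, docs in sorted(slices.items())}
```

The `period_length` argument plays the role of the user-defined time interval mentioned in the third step: changing it regroups the documents and produces a different series of term graphs.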
- | |||
- | |||
- | ===== Related work and scientific contributions ===== | ||
- | |||
- | ==== Epistemology and maps of science ==== | ||
- | | ||
- | A major issue in philosophy of science is the uncovering of time to reflect on the conceptual structure of scientific fields and their dynamics. Several theories have been formulated in the field of science evolution [Popper 1963][Kuhn 1970][Lakatos 1980][Bonaccorsi 2008] and a lot of (often conflicting) descriptions and explanations of scientific change and revision have been proposed. These theories diverge on the continuous/ | ||
- | Science and technology studies (STS) have advocated that “science in action” is more than published results, and includes for instance “tacit knowledge” and often unpublished controversies. EPIQUE does not intend to account for all these dimensions of the dynamics of science. Like many classical works in history and philosophy of science, it is positioned at the level of published science, and considers science evolution at the level of scientific archives and databases. This distinguishes the approach of EPIQUE both from STS and from more classical history of science. We assume that phylomemetic trees can reveal the signature of specific dynamics of science, for instance regarding the impact of natural selection, the rather gradual or discontinuous pattern of evolution, etc. EPIQUE will therefore provide a perspective on science likely to complement what we can learn through the study of the local social dynamics underlying the social construction of science. | ||
- | |||
- | ==== Large-scale phylogenetic problems and methods in biology ==== | ||
- | |||
- | In biology, the goal of phylogenetic analysis is to reconstruct phylogenetic trees [Baum & Smith 2013] representing ancestor relationships between species, genes, etc. from molecular data like DNA sequences. There exists a huge number of computational phylogenetic methods using different machine learning techniques (maximum likelihood, Markov Chain Monte Carlo [Li 1996], Bayesian inference) depending on a formal description of the observed species characters. There are some similarities to the problems we study in EPIQUE (usage of large-scale parallel data processing [Tyson et al. 2014], scientific workflows). However, most forms of molecular phylogenetics make extensive use of sequence alignment in constructing and refining phylogenetic trees (for detecting similarities), | ||
- | |||
- | ==== Narration and Topic Tracking ==== | ||
- | The idea of reconstructing the evolution of knowledge by bridging topic detection algorithms with time tracking is a recent and active topic in different communities: knowledge discovery [Shahaf 2013], scientometrics [Chavalarias et al. 2011][Chavalarias and Cointet 2013], social networks analysis [Shahaf 2012], visual analytics [Liu 2013], news tracking | ||
- | ==== Patent mining and visualisation ==== | ||
- | Intellectual property has become a major economic factor for many industrial companies, but also for scientific and technical organisations developing new technology. The number of patents increases every year: in 2014 the U.S. Patent and Trademark Office (USPTO) registered 615 234 patent applications and granted about half of them (326 033). Patent documents are rich technical documents which are difficult for non-experts to analyse, and a number of tools have been developed to assist patent engineers and decision makers. Most existing patent analysis systems such as Thomson Reuters’ Aureka, Google Patent or WikiPatent mainly focus on searching (“prior art search” [Oh et al. 2013]) and ranking [Po et al. 2012]; others like Patents and PatentLens provide more advanced analysis capabilities. These tools are mainly based on text mining techniques [Tseng et al. 2007]. More recent work [Tang et al. 2012] proposes to combine more advanced mining techniques like topic-driven modeling, heterogeneous network co-ranking and competitive analysis for building topic maps over patent collections. Whereas there are some similarities in the goal of building “topic maps” over patent collections, | ||
en/site/recherche/projets/epique/start.1476958172.txt.gz · Last modified: by amann