Tim Berners-Lee
Date: 2007-10-23, last change: $Date: 2021/11/01 10:16:02 $
Status: personal view only. Editing status: draft. Written in response to another round of circular discussions of web architecture.
Für viele Forscher und Statistiker gilt Jupyter Notebook als De-facto-Standard, wenn es um schnelles Prototyping und die explorative Datenanalyse geht. Außer auf Notebook wirft der Artikel aber auch einen Blick auf Jupyter Lab, das als nächste Generation von Notebooks in den Startlöchern steht.
Abstract. In order to support web applications to understand the content of HTML pages an increasing number of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, Microformats. The annotations are used by Google, Yahoo!, Yandex, Bing and Facebook to enrich search results and to display entity descriptions within their applications. In this paper, we present a series of publicly accessible Microdata, RDFa, Microformats datasets that we have extracted from three large web corpora dating from 2010, 2012 and 2013.
2012. Metadata Statistics for a Large Web Corpus
ABSTRACT
We provide an analysis of the adoption of metadata standards on the Web based a large crawl of the Web. In particular, we look at what forms of syntax and vocabularies publishers are using to mark up data inside HTML pages. We also describe the process that we have followed and the difficulties involved in web data extraction.
HTML microdata [MICRODATA] is an extension to HTML used to embed machine-readable data into HTML documents. Whereas the microdata specification describes a means of markup, the output format is JSON. This specification describes processing rules that may be used to extract RDF [RDF11-CONCEPTS] from an HTML document containing microdata.
The Web is designed to support flexible exploration of information by human users and by automated agents. For such exploration to be productive, information published by many different sources and for a variety of purposes must be comprehensible to a wide range of Web client software, and to users of that software.
HTTP and other Web technologies can be used to deploy resource representations that are self-describing: information about the encodings used for each representation is provided explicitly within the representation. Starting with a URI, there is a standard algorithm that a user agent can apply to retrieve and interpret such representations. Furthermore, representations can be what we refer to as grounded in the Web, by ensuring that specifications required to interpret them are determined unambiguously based on the URI, and that explicit references connect the pertinent specifications to each other. Web-grounding ensures that the specifications needed to interpret information on the Web can be identified unambiguously. When such self-describing, Web-grounded resources are linked together, the Web as a whole can support reliable, ad hoc discovery of information.
This finding describes how document formats, markup conventions, attribute values, and other data formats can be designed to facilitate the deployment of self-describing, Web-grounded Web content.
RDFa is an extension to HTML5 that helps you markup things like People, Places, Events, Recipes and Reviews. Search Engines and Web Services use this markup to generate better search listings and give you better visibility on the Web, so that people can find your website more easily.
This document describes how a Dublin Core metadata description set can be encoded in HTML/XHTML <meta> and <link> elements. It is an HTML meta data profile, as defined by the HTML specification.
2018. Welche Teile des Webs sollen für zukünftige Generationen archiviert werden? Das erkundet derzeit die Deutsche Nationalbibliothek und befragt Internetnutzer. Im Interview spricht Vizedirektorin Ute Schwens über den Stand der Dinge bei der Webarchivierung und die Auswirkungen des neuen Urheberrechts.
$Date: 2013-03-01 15:54:47 $
The content of the vocabulary prefixes, to be included in the RDFa 1.1 Default Profile, is defined based on the general usage of those vocabularies on the Semantic Web. This general usage is established using search crawl data, courtesy of Sindice and of Yahoo!. This page describes the methodology used during crawls as well as the possible post-processing steps.
This is the public wiki for the Heritrix archival crawler project. Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/ heretix/heratix) is an archaic word for heiress (woman who inherits).
S. Staab, J. Lehmann, and R. Verborgh. Companion Proceedings of the The Web Conference 2018, page 885--886. Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee, (2018)