Essay

Website Proposal for Historians to Create Databases on the Web

This essay proposes a website on which historians can create databases on the web, using industry-standard methodologies and crowd sourcing for data input. The essay is divided into three main sections, followed by a conclusion. The first section describes the proposed site and its metadata standards, and refers to a dummy site set up for demonstration purposes. The second section discusses some advantages and disadvantages of crowd sourcing and possible methods for attracting database managers and contributors. The third section discusses further problems with crowd sourcing, possible solutions, and ways to fund such a site. The final section reviews the preceding ones and draws a conclusion.
The design for the site is modelled on Wikipedia; however, it is a dedicated database construction site rather than a free encyclopaedia. The reasoning behind this is that many small groups of historians do not necessarily have the technological expertise to set up a website for a reasonably large set of data. A template that allowed them, and others with an interest in the subject, to build a searchable database through crowd sourcing would reduce costs and create a valuable resource. For example, the 1851 census for a single town could run to many pages. If the site could cut it into small packages and allow people to input it in a controlled and searchable way, it would break the task down immensely. This is the basic idea, but it extends beyond alphanumeric information: to use the benefits of the web, images, video, and links to books or other websites could be included as pages in the database. Over time the site could become a federated search for a subject (for an example of a federated search site, please see www.nines.org), searching a number of different databases that share a similar interest. The controlling element would be a form that a user sets up for their project, which describes each entry and defines metadata that is searchable and measurable within the database. For example, a project on World War I could have an option button for images; once images is selected, a drop-down list would offer tanks, soldiers, airplanes and so on. This would create searchable data linked to that page that can be quickly managed, or indexed by a search engine. An option to add further searchable terms should also be provided to aid this process, with permissions set by the project manager.
Forms can have error checking built into them, requiring fields to be completed before an entry can be uploaded. Items that repeat with small differences could be carried over from one page to the next, with only the differences changed, rather than filling out a completely new form. Large-scale data could be entered via tables, if this proved easier, and added to a data set in batches.
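The form behaviour described above could be sketched roughly as follows. This is only an illustration in Python: the class names `FormField` and `EntryForm`, the `carry_over` helper, the field names and the sample entries are all assumptions made for the example, not part of the demonstration site.

```python
from dataclasses import dataclass, field


@dataclass
class FormField:
    """One field on a project's entry form (hypothetical structure)."""
    name: str
    required: bool = False
    choices: tuple = ()  # an empty tuple means free text is allowed


@dataclass
class EntryForm:
    """A project form: the list of fields the project manager has defined."""
    fields: list = field(default_factory=list)

    def validate(self, entry: dict) -> list:
        """Return error messages; an empty list means the entry may be uploaded."""
        errors = []
        for f in self.fields:
            value = entry.get(f.name, "")
            if not value:
                if f.required:
                    errors.append(f"'{f.name}' is required")
                continue
            # Drop-down fields only accept one of the preset terms.
            if f.choices and value not in f.choices:
                errors.append(f"'{f.name}' must be one of {', '.join(f.choices)}")
        return errors


def carry_over(previous: dict, **changes) -> dict:
    """Copy a previous entry and change only the fields that differ."""
    return {**previous, **changes}


# Illustrative form for a World War I project (assumed field names).
ww1_form = EntryForm([
    FormField("title", required=True),
    FormField("media_type", required=True, choices=("image", "video", "text")),
    FormField("subject", choices=("tanks", "soldiers", "airplanes")),
])

first = {"title": "Tank photograph 1", "media_type": "image", "subject": "tanks"}
second = carry_over(first, title="Tank photograph 2")
```

Here `ww1_form.validate(first)` returns an empty list, while an entry with no title, or with a subject outside the drop-down list, returns a message for each problem. The preset `choices` are what would keep the data searchable and measurable.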
A very basic site showing the potential layout of a project has been set up at Weebly.com, with a description of the site on the homepage and a mock-up login. A single project page and an individual item page have also been added for demonstration purposes, along with a form page with tick boxes and drop-down lists. Depending on the size of a particular database, and on whether wider public access is allowed for data input, it would probably be advisable to have a history page recording previous amendments and discussions for each item. Please note that none of the underlying code required for database creation has been added; these pages only give an example for the purposes of this essay. In the initial research, it was envisaged that the MARC 21 electronic library cataloguing system would be linked to the form page, limiting individual form inputs via predictive text. The example used on the website is a book on cats, where the form would suggest only the term 'CATS' rather than 'FELINE' or 'CAT'. However, further research brought up the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and the use of Dublin Core. Whether OAI-PMH could be used to build cross-indexing across many databases remains to be seen; however, OAI-PMH requires Dublin Core as its minimum metadata format, so adopting Dublin Core as at least a basis for the site would keep this open as a future possibility. Dublin Core was thought to be a better fit than MARC 21, as it is a simple metadata element set, commonly expressed in XML, that allows many kinds of resources on the web and beyond to be described. It was also thought that most of this formatting should be done by the application rather than the user, to keep the system as easy to use as possible.
For example, if the name of the page is input by the user on the form page, the application, and not the user, would write it into the metadata as the relevant element (in Dublin Core, 'title') with the appropriate markup. This way the page would be Dublin Core compliant and the user would hardly be aware that it was happening. As for the kind of database management system required, I would expect an existing system to be used and modified to the site's needs. Perhaps MySQL or a similar system would be suitable; however, a commercial licence may be required if an open source program is not used. This in turn would add to costs and possible fees, which are discussed further on in this essay.
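As a rough sketch of how the application might write form input into Dublin Core on the user's behalf, the Python below uses only the standard library. The form field names in `FIELD_TO_DC`, the wrapping `record` element and the sample values are assumptions for illustration. The element names (`title`, `contributor`, `subject`, `date`) and the namespace URI are those of the Dublin Core element set; note that Dublin Core's element for the name of a resource is `title`.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from this site's form fields to Dublin Core elements.
FIELD_TO_DC = {
    "page_name": "title",
    "entered_by": "contributor",
    "topic": "subject",
    "entered_on": "date",
}


def to_dublin_core(form_data: dict) -> str:
    """Build a simple Dublin Core XML record from submitted form values."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for field_name, dc_element in FIELD_TO_DC.items():
        value = form_data.get(field_name)
        if value:  # skip fields the user left blank
            child = ET.SubElement(record, f"{{{DC_NS}}}{dc_element}")
            child.text = value
    return ET.tostring(record, encoding="unicode")


xml_record = to_dublin_core({
    "page_name": "1851 census, sample street",
    "entered_by": "A. Volunteer",
    "topic": "census",
})
```

The user only ever sees the form; the mapping table is the single place where the site decides which Dublin Core element each field becomes, so it could be adjusted per project without changing the forms themselves.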
Although there are cost advantages to crowd sourcing, there are disadvantages in that the input can contain errors that are difficult to find. The ability to edit items other than one's own entries could be limited by the project instigator, almost in the vein of sub-editors checking content before publication (an idea similar to Wikipedia's). However, the degree of control would depend entirely on the particular database manager, and unless editing were open to everyone, having to contact someone to correct an error could be off-putting. This is where the history or discussion page could be helpful, as it would show previous versions of a particular entry or discussions about it. Whether a lock-out function, such as Wikipedia uses, would be necessary is debatable; this is discussed further on in this essay. A positive aspect of crowd sourcing is the community it creates, in which users may find people with similar interests locally, nationally or even internationally. The crowd features of the site would, it is hoped, make the dissemination of information easier. The amateur historian who is active locally could have recourse to national expertise via discussion boards or via contact details given on other databases (Twitter or email accounts). Similarly, national historians could have access to current international trends and possibly ongoing work. If historians are working on a database for themselves, an automated interface that could translate their work and put it online for all to see would be a significant additional application.
The main difficulty would be to encourage not only database managers to set up a project (when there are already blog sites and online community sites such as Twitter or Facebook) but also casual users who might want to help input information. A reward system could be put in place whereby free web space on the system is offered in return for data entry on other databases. For example: input ten individual pages and get one point; collect ten points and get one gigabyte of web space or some other reward. The key to attracting database managers would be the level of service offered: useful features and backup facilities, indexing and system caching, and gadgets and applications that make the project page look attractive and display current progress. Whether the size of an individual database would need to be limited would depend on whether the site was provided charitably (i.e. via an educational grant of some kind) or required funds to purchase space from a web space provider. A fee of some kind may be required (perhaps refundable if related to an educational establishment or project); please see the next section for a further discussion of this topic. A subscription would make the site just like many other database hosting providers, so perhaps donations could be asked for instead.
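The arithmetic of the example reward scheme can be written out as a short Python sketch. The function name `reward_for` and the constants are assumptions for illustration; the rates themselves (ten pages per point, ten points per gigabyte) are the ones suggested above.

```python
PAGES_PER_POINT = 10      # ten pages entered earn one point
POINTS_PER_GIGABYTE = 10  # ten points earn one gigabyte of web space


def reward_for(pages_entered: int) -> tuple:
    """Return (points, gigabytes) earned for a number of pages entered."""
    points = pages_entered // PAGES_PER_POINT
    gigabytes = points // POINTS_PER_GIGABYTE
    return points, gigabytes
```

Under these rates a contributor would need to enter one hundred pages to earn the first gigabyte, so `reward_for(100)` gives `(10, 1)` and `reward_for(250)` gives `(25, 2)`. Whether that is a realistic incentive would depend on how long a typical page takes to input.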
Further possible downsides to crowd sourcing, apart from the inaccurate data previously discussed, are fake sites, or sites deliberately set up to falsify information. Wikipedia suffered from this in its early days; whether the specialised purpose of the website would put such individuals off is a key question. A database would have to be set up on a specific historical topic (if one did not already exist), and whether this, together with the work of setting up the form page, would deter such people is debatable. The site would have to rely on the community to report fake projects, and the management would then have to remove them as quickly as possible. One possibility is to require verification by text message or email, so that people on a blacklist would be refused an account. However, there are ways around this (a new SIM card or email address), and once again the community would have to be relied on to report such actions. A further negative possibility is the deliberate input of false or inaccurate information into existing project databases. Like unintentional inaccuracy, this is more difficult to detect unless the project manager or another authorised user checks individual pages on a regular basis. As with the previous problem, the community would have to check for such activity and report any deliberate vandalism. If an individual page within a database is subject to repeated deliberate attacks, a lock-out function may be necessary (as used by Wikipedia) to dissuade such attempts.
One way around this would be to borrow a method used in peer-to-peer sharing of information: make the site like a club, in which new members are referred by existing members. If a member has to be excluded from the site, the member who referred them could also be excluded or downgraded in some way. A ratings or statistics system of some kind (like eBay's), covering accuracy, number of pages input and so on, could also be implemented as an incentive. However, given that historians disagree on many aspects of a single subject, whether this would work could be a moot point. It could also be argued that a historian with the technical knowledge to compile a database will probably already know how to set one up on an existing hosting website. There are also wiki sites available (e.g. http://www.wikispaces.com/) which could be set up for an individual project, although they can be limited by price or size. The difference is that the proposed site would probably be open source, with an existing user interface and the added advantage of crowd-sourced data. This does raise the question of funding: in the current economic climate it is doubtful whether a provider would give free resources to a site of this kind, which would require large amounts of space and processing power. Setting up one's own server could be costly; however, a one-off charge could be made, or a not-for-profit company set up, to cover such costs. Many internet service providers like to be seen as charitable, so perhaps some sort of arrangement could be reached.
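The referral rule could be modelled along the following lines. This Python sketch is illustrative only: the `Member` class, the three-level `standing` scale and the choice to downgrade the referrer by one level (rather than exclude them outright) are all assumptions, since the essay leaves the exact penalty open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    """A site member and the existing member who referred them."""
    name: str
    referred_by: Optional["Member"] = None
    standing: int = 2  # assumed scale: 2 = full member, 1 = downgraded, 0 = excluded


def exclude(member: Member) -> None:
    """Exclude a member and downgrade whoever referred them by one level."""
    member.standing = 0
    referrer = member.referred_by
    if referrer is not None and referrer.standing > 0:
        referrer.standing -= 1  # a second bad referral would exclude the referrer too


founder = Member("founder")
vandal_one = Member("vandal_one", referred_by=founder)
vandal_two = Member("vandal_two", referred_by=founder)

exclude(vandal_one)
standing_after_first = founder.standing
exclude(vandal_two)
standing_after_second = founder.standing
```

In this version the penalty stops at the direct referrer and does not travel further up the chain, which keeps one bad referral from removing a whole branch of otherwise reliable members.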
This website project was inspired by Wikipedia, although restructured towards a more academic and searchable application of history. The hypothesis is that not many historians are familiar with the technicalities of creating an online database, and that an easy-to-use application would make this process much simpler. The site aims for an IMDb-style interface, with an easy way to create searchable tables of data from many different sources on one particular topic or locality. Linking a form page with preset identifiers to each individual page seemed to make sense, and would make compiling a database from those pages that much more straightforward. The use of Dublin Core in the metadata from the outset would make any future development of the site that much easier. Crowd sourcing was an obvious solution; as previously discussed in this essay, the downsides of relying on it are outweighed by the benefit of a not-for-profit labour force. If the site and each individual project were set up correctly, I would envisage that the negative aspects could be kept to a minimum. The major reservation that has recurred throughout this essay is funding. Given the current economic climate, I do not necessarily see a site of this kind being easily set up. The programming skill needed to build the user interface and database servers requires funding. Perhaps a collaboration between a number of different sources could make this a completely open source project, without the need to charge a fee for its use. It would certainly make the task of inputting a historian's work that much easier, displaying it and enabling it to reach as wide an audience as possible.
Total Word Count 2268  
Bibliography
A Wiki-page set up site;
Consulted 3/4/12 
Picture for website in Weebly (Newport rising)
Weebly example site
What historians don't know about database design...
Consulted 2/4/12
MARC 21 Standards and information
http://www.loc.gov/marc/ Consulted 6/4/12
Dublin Core
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
Consulted 7/4/12

MySQL FAQ
IMDb - Internet Movie Database
http://www.imdb.com/ Consulted 13/4/12
Database website

